Section: New Software and Platforms
Keywords: Deep learning - Policy Learning
Scientific Description: We created a software architecture that combines a state-of-the-art computer vision system with a policy learning framework. The system perceives a visual scene, given as a still image, extracts facts ("predicates"), and proposes an optimal action to achieve a given goal. The two subsystems are chained into a pipeline that is trained by presenting images together with a demonstrated optimal action; from this information, both the predicate recognition model and the policy learning model are updated.
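The chaining of the two subsystems can be sketched as follows. This is a minimal illustration only; the function names, predicate syntax, and action syntax are assumptions, not the actual API of the software.

```python
from typing import List

def perceive(image_path: str) -> List[str]:
    """Vision subsystem stub: map a still image to grounded predicates.
    (Placeholder output; a real system runs a trained network.)"""
    return ["on(red, blue)", "clear(red)", "on_table(blue)"]

def propose_action(predicates: List[str], goal: str) -> str:
    """Policy subsystem stub: choose an action for the given goal.
    (Placeholder output; a real policy is learned from demonstrations.)"""
    return "move(red, table)"

def pipeline(image_path: str, goal: str) -> str:
    """Chain the two subsystems into the pipeline described above."""
    return propose_action(perceive(image_path), goal)

print(pipeline("scene.png", "tower"))  # -> move(red, table)
```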
Our architecture is based on the recent work of Lerer, Gross, and Fergus (2016), "Learning Physical Intuition of Block Towers by Example". They created a large network able to identify physical properties of stacked blocks. Analogously, our vision system uses the same network layout (without the image-prediction auxiliary output), with an added output layer for predicates, sized according to the expected number and arity of the predicates. The vision subsystem is not trained with a conventional cross-entropy or MSE loss function; instead, it receives its loss from the policy learning subsystem, which computes the loss as the optimal combination of predicates for the given expert action.
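The loss routing can be sketched roughly as follows: the policy module derives a target predicate assignment from the expert action, and that target, rather than a fixed label, supplies the vision network's training signal. All names and the squared-error choice below are illustrative assumptions, not the software's actual formulation.

```python
from typing import List

def policy_target(expert_action: str, grounded: List[str]) -> List[float]:
    """Stub: mark the predicates the policy considers optimal for the
    demonstrated action (a real system would search over combinations)."""
    consistent = {"clear(red)", "on(red, blue)"}  # placeholder choice
    return [1.0 if p in consistent else 0.0 for p in grounded]

def vision_loss(predicted: List[float], target: List[float]) -> float:
    """Error between vision outputs and the policy-derived target; this
    signal, not a ground-truth label, trains the vision subsystem."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

target = policy_target("move(red, table)", ["clear(red)", "on(blue, red)"])
print(vision_loss([0.9, 0.2], target))
```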
By combining these systems, the architecture as a whole requires significantly fewer data samples than comparable systems that rely exclusively on neural networks. This makes the approach more feasible for real-life applications with actual live demonstrations.
Functional Description: The neural network consists of ResNet-50 (a state-of-the-art computer vision architecture with 50 layers), two layers that convert the ResNet output into predicates, and a varying number of output neurons corresponding to the estimated number of n-ary predicates. The network was pretrained on the ImageNet dataset. The policy learning module incorporates the ACE tree learning tool and a Prolog wrapper.
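One plausible way to size the predicate output layer, sketched below under the assumption of one neuron per grounded predicate: a predicate of arity n over k objects yields k**n groundings. The exact grounding convention used by the software may differ (e.g. excluding reflexive pairs).

```python
from typing import List

def num_output_neurons(num_objects: int, arities: List[int]) -> int:
    """Total output neurons: one per grounded predicate, where a
    predicate of arity n over k objects has k**n groundings."""
    return sum(num_objects ** n for n in arities)

# 4 blocks with predicates clear/1 and on/2 -> 4 + 16 = 20 output neurons
print(num_output_neurons(4, [1, 2]))  # -> 20
```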
Our example domain consists of 2-4 cubes, colored red, blue, green, and yellow, randomly stacked on top of each other in a virtual 3D environment. The dataset used for training and testing contains a total of 30,000 elements, each comprising an image of the scene, the correct predicates, a list of the blocks present, and the corresponding expert action that would lead to stacking the blocks into a tower.
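A dataset element of this kind could be laid out as below. Field names, predicate syntax, and the action syntax are illustrative assumptions; only the four components (image, predicates, block list, expert action) come from the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    image_path: str        # rendered still image of the stacked cubes
    predicates: List[str]  # ground-truth predicates for the scene
    blocks: List[str]      # blocks present in the scene
    expert_action: str     # demonstrated action toward building the tower

sample = Sample(
    image_path="scene_00001.png",  # hypothetical file name
    predicates=["on(red, blue)", "clear(red)",
                "on_table(blue)", "on_table(green)", "clear(green)"],
    blocks=["red", "blue", "green"],
    expert_action="move(green, red)",  # hypothetical action syntax
)
print(sample.expert_action)
```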