

Section: New Results

Task-based world modeling and understanding

Hidden robot

Participants: John Thomas (Master student), Philippe Martinet, Paolo Salaris, Sébastien Briot (LS2N-ARMEN)

When robots execute a task, they need an adequate representation of the environment in which they evolve. In model-based approaches, the environment is classically described by a metric map, and both perception (localization) and control (path or trajectory tracking) refer to a Cartesian state. In sensor-based control, the "teaching by showing" methodology has been developed over the last 30 years. The concept of sensory memory was then introduced in order to represent the task to be executed directly in the sensor space, for a particular set of sensors. In summary, building the representation of the task (or of the environment) amounts to building the sensory memory, defining a particular motion (or trajectory) amounts to defining a particular sequence of sensor features, and the task is executed when a controller is designed so that the robot perceives the same features as those stored in the sensory memory. This approach has shown great robustness. However, it remains difficult to analyze the singularities and to prove the stability of such approaches, mainly when six degrees of freedom must be controlled. In 2013, Sébastien Briot and Philippe Martinet studied the visual servoing scheme of a Gough-Stewart platform [18] and showed that there exists a hidden robot in the controller which can be used to study its behaviour and properties. The hidden robot allows the analysis of the controller to be recast as the analysis of a parallel robot. Recently, this concept has been applied to study the singularities of the visual servoing of points and of lines [19]. This work continues in the framework of the ANR project SESAME.
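
As a point of comparison, the classical sensor-based control law underlying this family of approaches drives the current feature vector toward the one stored in the sensory memory through (an approximation of) the interaction matrix. The minimal sketch below only illustrates this principle; the interaction matrix, gain and feature values are hypothetical placeholders.

    import numpy as np

    # Minimal sketch of a classical sensor-based (visual servoing style) control law:
    # the robot velocity is chosen so that the current features s converge toward
    # the features s_star stored in the sensory memory.
    # L is the interaction matrix relating feature velocities to robot velocities;
    # the constant matrix below is hypothetical and used only for illustration.

    lam = 0.5                                # control gain
    L = np.array([[1.0, 0.0, -0.2],          # hypothetical 4x3 interaction matrix
                  [0.0, 1.0,  0.1],
                  [0.5, 0.0,  1.0],
                  [0.0, 0.5, -1.0]])

    s      = np.array([0.10, -0.05, 0.30, 0.02])   # current sensor features
    s_star = np.array([0.00,  0.00, 0.25, 0.00])   # features stored in the sensory memory

    error = s - s_star
    v = -lam * np.linalg.pinv(L) @ error     # velocity command sent to the robot
    print(v)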

The idea of the new work initiated in 2019 is to find a methodology for designing a task using the hidden robot concept. Navigation of a mobile robot has been considered first. The followed methodology considers a topological navigation framework in which successive interaction situations are modelled by hidden robots: in other words, navigation is performed by a set of successive hidden parallel robots holding the robot while it moves. At least two main questions have been identified: What is the structure of the virtual robot that fits the task to be done? And where to fix (or how to select) the anchors of this virtual robot?

For the first question, the idea is, considering different kinds of features, to define a virtual parallel robot based on virtual legs. These virtual legs are directly linked to the considered features. We have studied two cases, distance and angle, considering that existing sensors allow us to obtain the corresponding extracted features. After modeling the sensor features, different control laws have been investigated to produce the motion of the mobile platform. The corresponding hidden robots and their properties have been studied.
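
To make the notion of virtual legs more concrete, the sketch below shows how distance and bearing-angle features to fixed anchor points could be computed for a planar mobile platform and compared to memorized reference values; the anchor coordinates and reference values are purely illustrative, not those used in the work.

    import numpy as np

    # Illustrative sketch (not the exact formulation of the work): each fixed anchor
    # point seen by the sensor defines a "virtual leg" through a distance feature and
    # a bearing-angle feature measured from the mobile platform.

    def leg_features(robot_pose, anchor):
        """Distance and bearing angle from a planar pose (x, y, theta) to an anchor point."""
        x, y, theta = robot_pose
        dx, dy = anchor[0] - x, anchor[1] - y
        distance = np.hypot(dx, dy)
        bearing = np.arctan2(dy, dx) - theta
        return np.array([distance, bearing])

    anchors = [np.array([2.0, 1.0]), np.array([2.0, -1.0])]   # hypothetical anchor points
    pose = np.array([0.0, 0.0, 0.1])

    s = np.concatenate([leg_features(pose, a) for a in anchors])   # current features
    s_star = np.array([1.8, 0.4, 1.8, -0.4])                       # memorized reference features

    # The feature error is what the control law regulates; in practice it is mapped to
    # platform velocities through the Jacobian of the virtual parallel mechanism.
    error = s - s_star
    print(error)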

For the second question, two methods have been investigated, using either a selection matrix of features or weighted features. The main criterion used is the transmissibility index, which quantifies how well the virtual parallel mechanism transmits motion.
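
Both strategies can be pictured as pre-multiplying the feature error by a selection or weighting matrix before it enters the control law. In the sketch below the weights are arbitrary placeholders standing in for values that would, in practice, be derived from the transmissibility-based criterion.

    import numpy as np

    # Hypothetical illustration of the two strategies: a binary selection matrix keeps a
    # subset of features, while a weighting matrix grades all of them continuously.

    error = np.array([0.10, -0.05, 0.30, 0.02])      # feature error from the previous step

    S = np.diag([1.0, 0.0, 1.0, 0.0])   # selection matrix: keep features 1 and 3 only
    W = np.diag([0.9, 0.1, 0.7, 0.3])   # weighting matrix: placeholder transmissibility weights

    print(S @ error)   # selected feature error
    print(W @ error)   # weighted feature error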

This work [52] is preliminary and ongoing. We have already obtained preliminary results in simulation, allowing a mobile platform to evolve in a dedicated environment. This work was done by John Thomas under the supervision of Philippe Martinet and Paolo Salaris.

End-to-end navigation

Participants: Renato Martins (Post-doc), Patrick Rives

This research deals with the problem of end-to-end learning for navigation in dynamic and crowded scenes, using visual information only. We investigate the problem of navigating an unknown space to reach a target of interest, for instance "doors", exploring the possibilities offered by data-driven models in the context of the ANR MOBI-Deep project on the guidance of visually impaired people. A successful agent navigation policy requires learning general relationships between the agent actions, safety rules and the surrounding environment. We started by studying a simple guidance model (turn left, turn right or stop), whose task is to keep the agent inside a specific region of the scene in order to avoid collisions. This is equivalent to taking the actions that keep the agent in the center of a corridor (indoor scene) or of a road (outdoor scenario). We first evaluated a relatively small supervised network composed of sixteen ResNet convolutional layers. This model was trained with real images from the Udacity autonomous driving challenge, but showed limited generalization when tested in unstructured scenes or in scenes with humans. In order to overcome these limitations, we plan to train an A3C agent (Asynchronous Advantage Actor-Critic) to learn the action policies in a reinforcement learning scheme, using data acquired in virtual environments with crowds. We also plan to evaluate the use of inputs from different levels, such as scene semantic segmentation, depth inference from monocular images, and human and object detection, in the learning scheme.
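
To give an idea of the kind of model involved, the sketch below defines a small convolutional policy with the three discrete guidance actions (turn left, turn right, stop). The architecture is a toy stand-in and does not reproduce the sixteen-layer ResNet actually evaluated, nor the A3C training scheme.

    import torch
    import torch.nn as nn

    # Illustrative three-action guidance policy (left / right / stop).
    # The layer sizes below are placeholders; the study used a ~16-layer ResNet
    # trained on the Udacity autonomous driving images.

    class GuidancePolicy(nn.Module):
        def __init__(self, n_actions=3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(32, n_actions)

        def forward(self, image):
            z = self.features(image).flatten(1)
            return self.head(z)                    # action logits: left, right, stop

    policy = GuidancePolicy()
    logits = policy(torch.randn(1, 3, 120, 160))   # dummy RGB frame
    action = logits.argmax(dim=1)                  # greedy action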

Scene semantization

Participants: Mohammed Boussaha (PhD, IGN), E. Fernandez-Moral, R. Martins, Patrick Rives

The work carried out in the ANR PlaTINUM project concerns the semantic labeling of images [17] acquired by agents (autonomous vehicles or pedestrians) moving in an urban-like environment, as well as their accurate localization and guidance. A semantic labeling approach based on machine learning (CNN) was developed. The same methodology is used to semantize virtual images built from a textured 3D mesh representation of the environment and images from the cameras carried by the agents. Several strategies have been studied to exploit complementary information, such as color and depth, to improve the accuracy of semantization. Our results show that exploiting this complementarity requires perfectly aligning the different sources of information. We proposed a new approach to the problem of calibrating heterogeneous multi-sensor systems [41], [44]. We also investigated a new metric to quantify the accuracy of the semantization provided by the CNN, taking the boundaries of the semantized objects into account during the learning step. We show that weighting the boundary pixels in the images allows the navigable areas used by different agents, such as pedestrians (sidewalks) and cars (road), to be segmented more clearly. The results of this research were published in [29], [30]. The CNN used for labeling the images acquired by the different image sensors (perspective and spherical) was pre-trained on public datasets of perspective images of urban-like environments (simulated or real). In the context of the PlaTINUM ANR project, fine-tuning was done with spherical images acquired in Rouen by the IGN Stereopolis vehicle and then hand-labelled. A Docker version of the software has been made available on the project server in order to be used by the other partners.
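
The boundary weighting mentioned above can be understood as a per-pixel weight map applied inside the segmentation loss. A minimal sketch follows, assuming a boundary mask precomputed from the ground-truth labels; the weight value and tensor shapes are illustrative, not those of the published experiments.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of a boundary-weighted semantic segmentation loss: pixels lying on
    # (or near) object boundaries receive a larger weight in the cross-entropy term.
    # `boundary_mask` is assumed to be precomputed from the ground-truth labels.

    def boundary_weighted_ce(logits, labels, boundary_mask, boundary_weight=5.0):
        # logits: (B, C, H, W), labels: (B, H, W), boundary_mask: (B, H, W) in {0, 1}
        per_pixel = F.cross_entropy(logits, labels, reduction="none")
        weights = 1.0 + (boundary_weight - 1.0) * boundary_mask
        return (weights * per_pixel).mean()

    logits = torch.randn(2, 8, 64, 64)                    # 8 semantic classes
    labels = torch.randint(0, 8, (2, 64, 64))
    boundary = (torch.rand(2, 64, 64) > 0.9).float()      # placeholder boundary mask
    loss = boundary_weighted_ce(logits, labels, boundary)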

A localization method has also been implemented to exploit color, depth and semantic information (when this information is available). An estimate of the agent pose (6 DOF, rotation and translation) is computed by a dense method that minimizes the geometric, photometric and semantic differences between a spherical view provided by a GIS (Geographic Information System) database hosted on a cloud server and the current view of the agent.
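
The dense alignment can be summarized as minimizing, over the candidate pose, a weighted sum of photometric, geometric and semantic residuals between the warped reference sphere and the current view. The sketch below only evaluates such a combined cost for one candidate pose; the warping step is omitted and the residual images and weights are placeholders.

    import numpy as np

    # Sketch of the combined cost for one candidate 6-DOF pose: the reference spherical
    # view (color, depth, semantics) is warped into the current view and per-pixel
    # residuals are combined. The three residual images below are placeholders standing
    # for the already-warped differences.

    def combined_cost(photo_res, geo_res, sem_res, w_photo=1.0, w_geo=1.0, w_sem=0.5):
        photometric = np.mean(photo_res ** 2)     # intensity differences
        geometric   = np.mean(geo_res ** 2)       # depth differences
        semantic    = np.mean(sem_res)            # label disagreement (0/1 per pixel)
        return w_photo * photometric + w_geo * geometric + w_sem * semantic

    h, w = 64, 128
    cost = combined_cost(np.random.randn(h, w) * 0.1,
                         np.random.randn(h, w) * 0.05,
                         (np.random.rand(h, w) > 0.95).astype(float))
    print(cost)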

During the last year of the project, the methods developed in PlaTINUM were consolidated and validated on the data acquired in Rouen. As originally planned in the project, Inria enlisted the help of iXblue-division Robopec to integrate the various functions developed during the project. The resulting software, called Perception360, will from now on be the software platform for all perception developments in the Inria CHORALE team.

Optical Flow Estimation Using Deep Learning In Spherical Images

Participants: Haozhou Zhang (Master), Cédric Demonceaux (Vibot), Guillaume Allibert

In a complex environment such as a forest, autonomous navigation is a challenging problem because of many constraints, such as the loss of GPS signals, since dense and unstructured environments (branches, foliage, ...) reduce visibility. Without GPS signals, a vision system able to capture everything happening around the robot is more valuable than ever, and crucial for navigating in such an environment. Spherical images offer great benefits over classical cameras wherever a wide field of view is essential.

The equirectangular projection is a popular representation of images taken by spherical cameras. In this projection, the longitude and latitude of the spherical image are mapped to the horizontal and vertical coordinates of a 2D plane. However, the equirectangular projection suffers from distortions, especially in the polar regions. The density of features is therefore no longer regular across the different latitudes of the image. As a result, traditional image processing methods developed for perspective images do not perform well when applied to equirectangular images.
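
For reference, the mapping from spherical directions to equirectangular pixel coordinates, and the latitude-dependent stretching responsible for the distortions, can be written in a few lines; the image size below is arbitrary.

    import numpy as np

    # Equirectangular projection: longitude maps linearly to the horizontal pixel
    # coordinate and latitude to the vertical one. The horizontal stretching of a
    # feature grows as 1/cos(latitude), which is what distorts the polar regions.

    def equirectangular_pixel(lon, lat, width=2048, height=1024):
        """Map longitude in [-pi, pi) and latitude in [-pi/2, pi/2] to pixel (u, v)."""
        u = (lon + np.pi) / (2.0 * np.pi) * width
        v = (np.pi / 2.0 - lat) / np.pi * height
        return u, v

    print(equirectangular_pixel(0.0, 0.0))                 # image center
    print(equirectangular_pixel(0.0, np.radians(80)))      # pixel near the pole
    print(1.0 / np.cos(np.radians(80)))                    # horizontal stretch factor at that latitude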

Optical flow estimation is a basic problem of computer vision [50]. It is generally used as an input to autonomous navigation algorithms. Given two successive images, it estimates, for each pixel, a 2D motion vector (in the x and y directions) between the two input images. Optical flow is usually considered a good approximation of the true physical motion projected onto the image plane, and provides a concise description of the direction and velocity of the motion. In [24] and [36], CNNs capable of solving the optical flow estimation problem as a supervised learning task were proposed and have become the standard for optical flow estimation. However, the datasets used to train [24], [36] are only based on perspective images. Even if these networks can be applied directly to spherical images, the strong distortions coming from the equirectangular projection drastically reduce their performance. One possible way to solve this issue would be to train the networks proposed in [24], [36] with spherical images. Unfortunately, such databases do not exist and generating them would be a long and costly process.
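
Since optical flow is treated as a dense per-pixel 2D motion field, a common sanity check is to warp the second image backward with the flow and compare the result to the first image. The sketch below uses a synthetic constant flow only to illustrate this convention; it is not tied to any particular estimation network.

    import numpy as np
    from scipy.ndimage import map_coordinates

    # Optical flow as a dense 2D motion field: a pixel at (x, y) in the first image moves
    # by (flow_x, flow_y) in the second. Backward-warping the second image with the flow
    # should then reconstruct the first image. The flow here is a synthetic translation.

    h, w = 64, 64
    img1 = np.random.rand(h, w)
    img2 = np.roll(img1, shift=(-1, 2), axis=(0, 1))       # img1 shifted by dx=+2, dy=-1

    flow_x = np.full((h, w), 2.0)
    flow_y = np.full((h, w), -1.0)

    ys, xs = np.mgrid[0:h, 0:w]
    img1_rec = map_coordinates(img2, [ys + flow_y, xs + flow_x], order=1, mode="nearest")
    print(np.abs(img1_rec - img1)[2:-2, 2:-2].max())       # ~0 away from the image borders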

In Haozhou's Master's thesis [55], we proposed a solution to overcome this issue by adapting the FlowNet networks to deal with the distortions of the equirectangular projection of spherical images. The proposed approach relies on a distortion-aware convolution, used as the convolution layers of the network, to cope with the distortions of equirectangular images. This allows the models to be trained on perspective images and applied to spherical images, using an adapted convolution that is consistent with the spherical geometry. This solution avoids training on a large set of spherical images, which is not available and would be costly to generate.
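
One way to picture a distortion-aware convolution is that the regular 3x3 sampling grid of a standard convolution is replaced, at each row of the equirectangular image, by a grid whose horizontal spacing is stretched by 1/cos(latitude), so that the kernel keeps covering a roughly constant area on the sphere. The sketch below only builds such latitude-dependent sampling offsets as an illustration of the principle; it does not reproduce the exact layer used in [55].

    import numpy as np

    # Rough sketch of the latitude-dependent sampling grid behind a distortion-aware
    # convolution on an equirectangular image: the horizontal spacing of the 3x3 kernel
    # grows as 1/cos(latitude) so the kernel covers a roughly constant solid angle.
    # Illustration of the principle only, not the exact formulation of [55].

    def sampling_offsets(row, height):
        latitude = (0.5 - (row + 0.5) / height) * np.pi     # +pi/2 at the top row, -pi/2 at the bottom
        stretch = 1.0 / max(np.cos(latitude), 1e-3)         # horizontal stretch factor
        dy, dx = np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij")
        return np.stack([dy, dx * stretch], axis=-1)        # 3x3 grid of (dy, dx) sampling offsets

    print(sampling_offsets(row=256, height=512))   # near the equator: almost a regular grid
    print(sampling_offsets(row=10, height=512))    # near the pole: strongly stretched in x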