Inria's research teams produce an annual Activity Report presenting their activities and results for the year. These reports include the team members, the scientific program, the software developed by the team, and the new results of the year. The report also describes the grants, contracts, and dissemination and teaching activities. Finally, the report lists the publications of the year.

## Section: New Results

### Lifelong Autonomy

#### Foundations of Reinforcement Learning

##### $\rho$-POMDPs have Lipschitz-Continuous $\epsilon$-Optimal Value Functions

Participant : Vincent Thomas.

Collaboration with Jilles Dibangoye (INSA Lyon).

Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a “fully observable” problem (a belief MDP) and exploiting the piecewise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex $\Delta$). This approach has been extended to solving $\rho$-POMDPs, i.e., for information-oriented criteria, when the reward $\rho$ is convex in $\Delta$. General $\rho$-POMDPs can also be turned into “fully observable” problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and $\rho$-POMDPs with a $\lambda_\rho$-Lipschitz reward function, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Value function approximators are then proposed for both upper- and lower-bounding the optimal value function, and are shown to provide uniformly improvable bounds. This allows us to propose two algorithms derived from HSVI, which are empirically evaluated on various benchmark problems.

Publication: [14]
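
To illustrate why Lipschitz continuity is enough to bound a value function from both sides, here is a minimal sketch of the generic Lipschitz-envelope construction. It is not the approximators of the paper: the L1 norm, the toy function `toy_value`, the constant `LAM` and the sample set `SAMPLES` are all assumptions made for the example.

```python
def l1_distance(b1, b2):
    """L1 distance between two beliefs (probability vectors)."""
    return sum(abs(x - y) for x, y in zip(b1, b2))


def upper_bound(belief, samples, lam):
    """Lowest value allowed by 'cones' pointing up from samples (b_i, u_i)."""
    return min(u + lam * l1_distance(belief, b) for b, u in samples)


def lower_bound(belief, samples, lam):
    """Highest value allowed by 'cones' pointing down from samples (b_i, v_i)."""
    return max(v - lam * l1_distance(belief, b) for b, v in samples)


def toy_value(belief):
    """Hypothetical stand-in for an optimal value function on a 2-state simplex."""
    return abs(belief[0] - 0.5)


# toy_value is 0.5-Lipschitz in the L1 norm, so any constant >= 0.5 is valid.
LAM = 1.0
SAMPLES = [([p, 1.0 - p], toy_value([p, 1.0 - p])) for p in (0.0, 0.3, 0.7, 1.0)]

if __name__ == "__main__":
    b = [0.5, 0.5]
    print(lower_bound(b, SAMPLES, LAM), toy_value(b), upper_bound(b, SAMPLES, LAM))
```

Adding a sample can only tighten either envelope, which is the property a point-based scheme such as HSVI relies on when it refines its bounds.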

##### Addressing Active Sensing Problem through MCTS

Participants : Vincent Thomas, Geremy Hutin.

The problem of active sensing is of paramount interest for building self-awareness in robotic systems. It consists in having a system make decisions so as to gather information (measured through the entropy of the probability distribution over unknown variables) in an optimal way.

In the past, we proposed an original formalism, $\rho$-POMDP, and new point-based algorithms for representing and solving active sensing problems [33]. This year, new approaches based on Monte-Carlo Tree Search (MCTS) and Partially Observable Monte-Carlo Planning (POMCP) [45] have been proposed to build the policies of an agent whose aim is to gather information.
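
The information-gathering criterion mentioned above can be sketched as a belief-dependent reward. The snippet below is only an illustration under assumptions: the function names are hypothetical, the reward is taken to be the negative Shannon entropy of a discrete belief, and the belief update is a plain Bayes rule with a given observation likelihood.

```python
import math


def belief_entropy(belief):
    """Shannon entropy (in nats) of a discrete belief."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def information_reward(belief):
    """Belief-dependent reward: higher when the belief is less uncertain."""
    return -belief_entropy(belief)


def bayes_update(belief, likelihoods):
    """Posterior belief given P(observation | state) for each state."""
    unnormalized = [p * l for p, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    posterior = bayes_update(prior, [0.9, 0.1, 0.1, 0.1])
    print(information_reward(prior), information_reward(posterior))
```

A planner such as MCTS would then score simulated trajectories by the reward of the beliefs they reach, rather than by a reward attached to hidden states.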

#### Robot Learning

Our main objective is to design data-efficient trial-and-error learning algorithms (reinforcement learning) that can work with continuous states and continuous actions. The main use-case is robot damage recovery: a robot has to discover new behaviors by trial-and-error without a diagnosis of the damage.

##### Adaptive and Resilient Soft Tensegrity Robots

Participant : Jean-Baptiste Mouret.

Collaboration with John Rieffel (Union College, USA).

Living organisms intertwine soft (e.g., muscle) and hard (e.g., bones) materials, giving them an intrinsic flexibility and resiliency often lacking in conventional rigid robots. The emerging field of soft robotics seeks to harness these same properties to create resilient machines. The nature of soft materials, however, presents considerable challenges to aspects of design, construction, and control—and up until now, the vast majority of gaits for soft robots have been hand-designed through empirical trial-and-error. In this contribution, we introduced an easy-to-assemble tensegrity-based soft robot capable of highly dynamic locomotive gaits and demonstrating structural and behavioral resilience in the face of physical damage. Enabling this is the use of a machine learning algorithm able to discover effective gaits with a minimal number of physical trials. These results lend further credence to soft-robotic approaches that seek to harness the interaction of complex material dynamics to generate a wealth of dynamical behaviors.

Publication: [10]

##### Bayesian Optimization with Automatic Prior Selection for Data-Efficient Direct Policy Search

Participants : Konstantinos Chatzilygeroudis, Jean-Baptiste Mouret.

One of the most interesting features of Bayesian optimization for direct policy search is that it can leverage priors (e.g., from simulation or from previous tasks) to accelerate learning on a robot. In this contribution, we are interested in situations for which several priors exist but we do not know in advance which one best fits the current situation. We tackle this problem by introducing a novel acquisition function, called Most Likely Expected Improvement (MLEI), that combines the likelihood of the priors and the expected improvement. We evaluate this new acquisition function on a transfer learning task for a 5-DOF planar arm and on a possibly damaged, 6-legged robot that has to learn to walk on flat ground and on stairs, with priors corresponding to different stairs and different kinds of damage. Our results show that MLEI effectively identifies and exploits the priors, even when there is no obvious match between the current situation and the priors.

Publication: [23]
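
As a rough illustration of how a prior's likelihood and the expected improvement can be combined, the sketch below scores a candidate point by the product of the two and keeps the best prior. This is one plausible reading, not the definition from the publication: the product form, the function names, and the assumption that each prior yields a Gaussian prediction `(mu, sigma)` and a data likelihood are all hypothetical.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI (maximization) for a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


def likelihood_weighted_ei(predictions, likelihoods, best):
    """Return (score, prior_index) maximizing EI * likelihood over the priors.

    predictions[i] is the (mu, sigma) predicted at the candidate point by the
    model built on prior i; likelihoods[i] is the likelihood of the data
    observed so far under that model.
    """
    scores = [expected_improvement(mu, sigma, best) * lik
              for (mu, sigma), lik in zip(predictions, likelihoods)]
    index = max(range(len(scores)), key=scores.__getitem__)
    return scores[index], index


if __name__ == "__main__":
    # An optimistic but unlikely prior versus a modest but likely one.
    print(likelihood_weighted_ei([(2.0, 0.5), (1.0, 0.5)], [0.01, 0.9], best=0.5))
```

Weighting by the likelihood keeps an optimistic prior that contradicts the observations from dominating the choice of the next trial.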

##### Multi-objective Model-based Policy Search for Data-efficient Learning with Sparse Rewards

Participants : Rituraj Kaushik, Konstantinos Chatzilygeroudis, Jean-Baptiste Mouret.

The most data-efficient algorithms for reinforcement learning in robotics are model-based policy search algorithms, which alternate between learning a dynamical model of the robot and optimizing a policy to maximize the expected return given the model and its uncertainties. However, the current algorithms lack an effective exploration strategy to deal with sparse or misleading reward scenarios: if they do not experience any state with a positive reward during the initial random exploration, they are very unlikely to solve the problem. To address this challenge, we proposed a novel model-based policy search algorithm, Multi-DEX, that leverages a learned dynamical model to efficiently explore the task space and solve tasks with sparse rewards in a few episodes. To achieve this, we frame the policy search problem as a multi-objective, model-based policy optimization problem with three objectives: (1) generate maximally novel state trajectories, (2) maximize the cumulative reward and (3) keep the system in state-space regions for which the model is as accurate as possible. We then optimize these objectives using a Pareto-based multi-objective optimization algorithm. The experiments show that Multi-DEX is able to solve sparse reward scenarios (with a simulated robotic arm) in much lower interaction time than VIME, TRPO, GEP-PG, CMA-ES and Black-DROPS.

Publication: [18]
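
The Pareto-based selection used with several objectives can be illustrated with a minimal non-dominated filter. This is a generic sketch, not the optimizer used by Multi-DEX: the tuple layout `(novelty, reward, model_accuracy)` and the example scores are assumptions, and all objectives are taken to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Candidates that no other candidate dominates (all objectives maximized)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


if __name__ == "__main__":
    # Hypothetical policy scores: (novelty, cumulative reward, model accuracy).
    scores = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
    print(pareto_front(scores))
```

Keeping the whole front, rather than a weighted sum, means a policy that explores novel states can survive even when it has not yet collected any reward.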

##### Using Parameterized Black-Box Priors to Scale Up Model-Based Policy Search for Robotics

Participants : Konstantinos Chatzilygeroudis, Jean-Baptiste Mouret.

Among the few model-based policy search algorithms, the recently introduced Black-DROPS algorithm exploits a black-box optimization algorithm to achieve both high data-efficiency and good computation times when several cores are used; nevertheless, like all model-based policy search approaches, Black-DROPS does not scale to high dimensional state/action spaces. In this paper, we introduce a new model learning procedure in Black-DROPS that leverages parameterized black-box priors to (1) scale up to high-dimensional systems, and (2) be robust to large inaccuracies of the prior information. We demonstrate the effectiveness of our approach with the “pendubot” swing-up task in simulation and with a physical hexapod robot (48D state space, 18D action space) that has to walk forward as fast as possible. The results show that our new algorithm is more data-efficient than previous model-based policy search algorithms (with and without priors) and that it can allow a physical 6-legged robot to learn new gaits in only 16 to 30 seconds of interaction time.

Publication: [12]

##### Data-efficient Neuroevolution with Kernel-Based Surrogate Models

Participants : Adam Gaier, Jean-Baptiste Mouret.

Collaboration with Alexander Asteroth (Hochschule Bonn-Rhein-Sieg, Germany).

Surrogate-assistance approaches have long been used in computationally expensive domains to improve the data-efficiency of optimization algorithms. Neuroevolution, however, has so far resisted the application of these techniques because it requires the surrogate model to make fitness predictions based on variable topologies, instead of a vector of parameters. Our main insight is that we can sidestep this problem by using kernel-based surrogate models, which require only the definition of a distance measure between individuals. Our second insight is that the well-established Neuroevolution of Augmenting Topologies (NEAT) algorithm provides a computationally efficient distance measure between dissimilar networks in the form of “compatibility distance”, initially designed to maintain topological diversity. Combining these two ideas, we introduce a surrogate-assisted neuroevolution algorithm that combines NEAT and a surrogate model built using a compatibility distance kernel. We demonstrate the data-efficiency of this new algorithm on the low-dimensional cart-pole swing-up problem, as well as the higher-dimensional half-cheetah running task. In both tasks, the surrogate-assisted variant achieves the same or better results with several times fewer function evaluations than the original NEAT.

Publication: [17] (best paper, GECCO 2018, Complex Systems track)
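
To make the idea of a kernel built on a genome distance concrete, here is a small sketch. It assumes a genome is a dictionary mapping innovation numbers to connection weights, uses the usual NEAT form of the compatibility distance (excess, disjoint and weight-difference terms), and plugs it into a squared-exponential kernel. The coefficient values, the length scale and the kernel form are illustrative assumptions, not the settings of the publication.

```python
import math


def compatibility_distance(g1, g2, c1=1.0, c2=1.0, c3=0.4):
    """NEAT-style distance between genomes given as {innovation: weight}."""
    keys1, keys2 = set(g1), set(g2)
    matching = keys1 & keys2
    non_matching = keys1 ^ keys2
    # Genes beyond the other genome's highest innovation number are "excess".
    cutoff = min(max(keys1, default=0), max(keys2, default=0))
    excess = sum(1 for k in non_matching if k > cutoff)
    disjoint = len(non_matching) - excess
    n = max(len(g1), len(g2), 1)
    mean_weight_diff = (sum(abs(g1[k] - g2[k]) for k in matching) / len(matching)
                        if matching else 0.0)
    return c1 * excess / n + c2 * disjoint / n + c3 * mean_weight_diff


def compatibility_kernel(g1, g2, length_scale=1.0):
    """Squared-exponential kernel on the compatibility distance (assumed form)."""
    d = compatibility_distance(g1, g2)
    return math.exp(-0.5 * (d / length_scale) ** 2)


if __name__ == "__main__":
    a = {1: 0.5, 2: -1.0, 4: 0.3}
    b = {1: 0.7, 3: 0.2, 4: 0.3, 5: 1.0, 6: -0.4}
    print(compatibility_distance(a, b), compatibility_kernel(a, b))
```

Because the kernel only needs pairwise distances, networks with different topologies can be compared without being flattened to a fixed-length parameter vector.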

##### Alternating Optimization and Quadrature for Robust Control

Participants : Konstantinos Chatzilygeroudis, Jean-Baptiste Mouret.

Collaboration with Shimon Whiteson (Oxford, UK).

Bayesian optimization has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are randomly determined by the environment in a physical setting but are controllable in a simulator. In this work, we consider the problem of finding an optimal policy while taking into account the impact of environment variables. We present alternating optimization and quadrature (ALOQ), which uses Bayesian optimization and Bayesian quadrature to address such settings. ALOQ is robust to the presence of significant rare events, which may not be observable under random sampling, but have a considerable impact on determining the optimal policy. The experimental results demonstrate that our approach learns more efficiently than existing methods.

Publication: [22]

##### Learning Robust Task Priorities of QP-Based Whole-Body Torque-Controllers

Participants : Marie Charbonneau, Serena Ivaldi, Valerio Modugno, Jean-Baptiste Mouret.

Generating complex whole-body movements for humanoid robots is now most often achieved with multi-task whole-body controllers based on quadratic programming. To perform on the real robot, such controllers often require a human expert to tune or optimize the many parameters of the controller related to the tasks and to the specific robot, which is generally reported as a tedious and time-consuming procedure. This problem can be tackled by automatically optimizing some parameters, such as task priorities or task trajectories, in simulation while ensuring constraint satisfaction. However, this does not guarantee that parameters optimized in simulation will also be optimal for the real robot. As a solution, the present paper focuses on optimizing task priorities in a robust way, by looking for solutions which achieve desired tasks under a variety of conditions and perturbations. This approach, which can be referred to as domain randomization, can greatly facilitate the transfer of optimized solutions from simulation to a real robot. The proposed method is demonstrated using a simulation of the humanoid robot iCub for a whole-body stepping task.

Publication: [11]
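
The robust evaluation described above (scoring parameters under a variety of conditions rather than a single nominal simulation) can be sketched as follows. Everything here is hypothetical: `simulate` and `sample_condition` are placeholders for a simulator and a perturbation sampler, and the toy cost stands in for a real whole-body task.

```python
import random


def robust_cost(params, simulate, sample_condition, n_conditions=20, seed=0):
    """Mean and worst-case cost of `params` over randomly sampled conditions."""
    rng = random.Random(seed)  # fixed seed: every candidate sees the same draws
    costs = [simulate(params, sample_condition(rng)) for _ in range(n_conditions)]
    return sum(costs) / len(costs), max(costs)


def toy_simulate(params, gain):
    """Toy cost: tracking error when the actuator gain differs from nominal."""
    return (params[0] * gain - 1.0) ** 2


def toy_condition(rng):
    """Toy perturbation: actuator gain drawn within 20% of nominal."""
    return rng.uniform(0.8, 1.2)


if __name__ == "__main__":
    for candidate in ([1.0], [2.0]):
        print(candidate, robust_cost(candidate, toy_simulate, toy_condition))
```

An outer optimizer would minimize the mean or the worst-case value returned here, so that the selected parameters do not depend on one particular simulated condition.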

#### Quality Diversity Algorithms

Quality diversity algorithms are a new kind of evolutionary algorithm that focuses on finding a large set of high-performing solutions (instead of the global optimum). We use them for design and as a step for data-efficient robot learning.

##### Data-Efficient Design Exploration through Surrogate-Assisted Illumination

Participants : Adam Gaier, Jean-Baptiste Mouret.

Collaboration with Alexander Asteroth (Hochschule Bonn-Rhein-Sieg, Germany).

Design optimization techniques are often used at the beginning of the design process to explore the space of possible designs. In these domains, illumination algorithms, such as MAP-Elites, are promising alternatives to classic optimization algorithms because they produce diverse, high-quality solutions in a single run, instead of only a single near-optimal solution. Unfortunately, these algorithms currently require a large number of function evaluations, limiting their applicability. In this work, we introduce a new illumination algorithm, Surrogate-Assisted Illumination (SAIL), that leverages surrogate modeling techniques to create a map of the design space according to user-defined features while minimizing the number of fitness evaluations. On a 2-dimensional airfoil optimization problem, SAIL produces hundreds of diverse but high-performing designs with several orders of magnitude fewer evaluations than MAP-Elites or CMA-ES. We demonstrate that SAIL is also capable of producing maps of high-performing designs in realistic 3-dimensional aerodynamic tasks with an accurate flow simulation. Data-efficient design exploration with SAIL can help designers understand what is possible, beyond what is optimal, by considering more than pure objective-based optimization.

Publication: [7]
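
For readers unfamiliar with illumination algorithms, the following is a minimal sketch of a plain MAP-Elites loop, the baseline that SAIL is compared against (it contains no surrogate model). The function names, the unit-cube search space, the Gaussian mutation and all parameter values are assumptions chosen to keep the example short.

```python
import random


def map_elites(fitness, features, dim, cells=10, iterations=2000,
               n_random=100, sigma=0.1, seed=0):
    """Fill a grid over feature space with the best solution found per cell."""
    rng = random.Random(seed)
    archive = {}  # cell index (tuple) -> (fitness, solution)
    for it in range(iterations):
        if it < n_random or not archive:
            # Bootstrap the archive with random solutions in [0, 1]^dim.
            x = [rng.random() for _ in range(dim)]
        else:
            # Mutate a randomly chosen elite, clipping to the search bounds.
            parent = rng.choice(list(archive.values()))[1]
            x = [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent]
        cell = tuple(min(cells - 1, int(f * cells)) for f in features(x))
        score = fitness(x)
        # Keep the newcomer only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, x)
    return archive


def toy_fitness(x):
    return -sum((g - 0.5) ** 2 for g in x)


def toy_features(x):
    return [x[0]]  # one feature in [0, 1]


if __name__ == "__main__":
    result = map_elites(toy_fitness, toy_features, dim=3)
    print(len(result), "cells filled")
```

Every iteration costs one fitness evaluation, which is the budget that a surrogate-assisted approach aims to reduce.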

##### Discovering the Elite Hypervolume by Leveraging Interspecies Correlation

Participants : Vassilis Vassiliades, Jean-Baptiste Mouret.

Evolution has produced an astonishing diversity of species, each filling a different niche. Algorithms like MAP-Elites mimic this divergent evolutionary process to find a set of behaviorally diverse but high-performing solutions, called the elites. Our key insight is that species in nature often share a surprisingly large part of their genome, in spite of occupying very different niches; similarly, the elites are likely to be concentrated in a specific “elite hypervolume” whose shape is defined by their common features. In this paper, we first introduce the elite hypervolume concept and propose two metrics to characterize it: the genotypic spread and the genotypic similarity. We then introduce a new variation operator, called “directional variation”, that exploits interspecies (or inter-elites) correlations to accelerate the MAP-Elites algorithm. We demonstrate the effectiveness of this operator in three problems (a toy function, a redundant robotic arm, and a hexapod robot).

Publication: [25]
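
One way to exploit correlations between elites is to perturb a parent along the direction of another elite, in addition to a small isotropic perturbation. The sketch below follows that idea; the exact form, the function name and the two standard deviations are assumptions for illustration and may differ from the operator defined in the publication.

```python
import random


def directional_variation(parent, other, sigma_iso=0.01, sigma_line=0.2,
                          rng=random):
    """Offspring = parent + isotropic noise + random step towards `other`.

    `parent` and `other` are two elites (lists of floats) taken from the
    archive; `rng` is any object providing a `gauss(mu, sigma)` method.
    """
    step = rng.gauss(0.0, sigma_line)  # one scalar step shared by all genes
    return [p + rng.gauss(0.0, sigma_iso) + step * (o - p)
            for p, o in zip(parent, other)]


if __name__ == "__main__":
    rng = random.Random(1)
    print(directional_variation([0.0, 0.0], [1.0, 2.0], rng=rng))
```

With `sigma_iso` set to zero the offspring lies exactly on the line through the two elites, so the isotropic term is what lets the search leave that line.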

##### Maintaining Diversity in Robot Swarms with Distributed Embodied Evolution

Participants : Amine Boumaza, François Charpillet.

We investigated how behavioral diversity can be maintained in evolving robot swarms by using distributed Embodied Evolution. In these approaches, each robot in the swarm runs a separate evolutionary algorithm, and populations on each robot are built through local communication when robots meet; therefore, genome survival results not only from fitness-based selection but also from spatial spread. To better understand how diversity is maintained in distributed embodied evolution, we propose a post-analysis diversity measure, computed globally (over the swarm) and locally (on each robot), and apply it to two swarm robotic tasks (navigation and item collection) with different intensities of selection pressure, comparing the results of distributed embodied evolution to a centralized case. We conclude that distributed evolution intrinsically maintains a larger behavioral diversity than centralized evolution, which allows the search algorithm to reach higher performance, especially in the more challenging collection task.

Publication: [16]