

Section: New Results

Decision Making

Searching for Information with MDPs

Participants : Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet.

In the context of Mauricio Araya's PhD and PostDoc, we are working on how MDPs – or related models – can search for information. This has led to various research directions, such as extending POMDPs so as to optimize information-based rewards, or actively learning MDP models. This year began with the defense of Mauricio's PhD thesis in February. Since then, we have continued extending his work and are preparing journal submissions.

While we have made some progress in this field, there are no concrete outcomes to present yet concerning optimistic approaches for model-based Bayesian Reinforcement Learning. Concerning POMDPs with information-based rewards, Mauricio's PhD thesis presents strong theoretical results that allow – in principle – deriving efficient algorithms from state-of-the-art “point-based” POMDP solvers. This year we have put this idea into practice, implementing variants of PBVI, PERSEUS and HSVI.
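For illustration, the sketch below shows a standard point-based (PBVI-style) backup over a fixed set of belief points, with classical state-based rewards; it is a minimal example rather than the implementation used in this work, and the ρPOMDP setting studied in the thesis would additionally replace the immediate reward vectors with a piecewise-linear approximation of the information-based, belief-dependent reward.

```python
import numpy as np

def pbvi_backup(beliefs, alphas, T, O, R, gamma):
    """One point-based backup (as in PBVI) over a fixed set of belief points.

    beliefs : list of belief vectors, each of shape (S,)
    alphas  : current value function, a list of alpha vectors of shape (S,)
    T[a]    : transition matrix, T[a][s, s'] = P(s' | s, a)
    O[a]    : observation matrix, O[a][s', z] = P(z | s', a)
    R[a]    : immediate reward vector of shape (S,)  (state-based rewards;
              a rho-POMDP would use a PWLC set of reward vectors instead)
    """
    A, Z = len(T), O[0].shape[1]
    new_alphas = []
    for b in beliefs:
        best_val, best_alpha = -np.inf, None
        for a in range(A):
            # for each observation z, back-project every alpha vector and keep
            # the one that is best at belief b
            g_az = []
            for z in range(Z):
                projections = [gamma * T[a] @ (O[a][:, z] * alpha)
                               for alpha in alphas]
                g_az.append(max(projections, key=lambda v: b @ v))
            alpha_ab = R[a] + sum(g_az)
            if b @ alpha_ab > best_val:
                best_val, best_alpha = b @ alpha_ab, alpha_ab
        new_alphas.append(best_alpha)
    return new_alphas
```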

Preliminary results have been published (in French) in JFPDA'13 [32] . A journal paper with complete theoretical and empirical results is under preparation.

Adaptive Management with POMDPs

Participant : Olivier Buffet.

Samuel Nicol, Iadine Chadès (CSIRO), and Takuya Iwamura (Stanford University) are external collaborators.

In the field of conservation biology, adaptive management is about managing a system, e.g., performing actions so as to protect some endangered species, while learning how it behaves. This is a typical reinforcement learning task that could for example be addressed through Bayesian Reinforcement Learning.

This year, we have worked in the context of bird migratory pathways, in particular the East Asian-Australasian (EAA) flyway, which is modeled as a network whose nodes are land areas where birds need to stay for some time. An issue is that these land areas are threatened by sea level rise. The adaptive management problem at hand is to decide in which land areas' protection to invest money so as to preserve the migratory pathways as efficiently as possible.

The outcome of this work is a data challenge paper published at IJCAI'13 [27] , which presents the problem at hand, describes its POMDP model, gives empirical results obtained with state-of-the-art solvers, and challenges POMDP practitioners to find better solution techniques.

Solving decentralized stochastic control problems as continuous-state MDPs

Participants : Jilles Dibangoye, Olivier Buffet, François Charpillet.

External collaborators: Christopher Amato (MIT), Arnaud Doniec (EMD), Charles Bessonnet (Telecom Nancy), Joni Pajarinen (Aalto University).

Decentralized partially observable Markov decision processes (DEC-POMDPs) are rich models for cooperative decision-making under uncertainty, but they are often intractable to solve optimally (NEXP-complete), even with efficient heuristic search algorithms. In this work, we present an efficient methodology for solving decentralized stochastic control problems formalized as DEC-POMDPs or their subclasses. This methodology is three-fold: (1) it converts the original decentralized problem into a centralized problem, from the perspective of a solution method that can take advantage of all the information about the original problem that is available during the online execution phase; (2) it shows that the original and transformed problems are equivalent; (3) it solves the transformed problem using a centralized method and transfers the solution back to the original problem. We have applied this methodology to various decentralized stochastic control problems.

Our results include the application of this methodology to DEC-POMDPs [20] , [33] . We recast them as deterministic continuous-state MDPs, whose states, called occupancy states, are probability distributions over the states and joint action-observation histories of the original DEC-POMDP. We also demonstrate that the occupancy state is a sufficient statistic for optimally solving DEC-POMDPs, and we further show that the optimal value function is a piecewise-linear and convex function of the occupancy state. With these results as a background, we prove for the first time that POMDP (and, more generally, continuous-state MDP) solution methods can, at least in principle, be applied to DEC-POMDPs. This work has been presented at IJCAI'2013 [20] and (in French) at JFPDA'2013 [33] , and an in-depth journal article is under preparation. We have also extended the results obtained for general DEC-POMDPs to the case of transition- and observation-independent DEC-MDPs. Of particular interest, we demonstrated that the occupancy state can then be further compressed into a probability distribution over the states, the first sufficient statistic in decentralized stochastic control problems that is invariant with time. This work has been presented at AAMAS'2013 [21] , and an in-depth journal article is also under preparation.
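To illustrate the objects involved, the following minimal sketch (not the code of [20]) propagates an occupancy state, i.e., a distribution over pairs of hidden state and joint action-observation history, under a fixed joint decision rule; given the decision rule, the update is deterministic, which is what makes the transformed problem a deterministic continuous-state MDP.

```python
from collections import defaultdict

def next_occupancy(eta, decision_rule, P, Obs):
    """Propagate an occupancy state under a fixed joint decision rule.

    eta           : dict mapping (state, joint_history) -> probability
    decision_rule : joint_history -> joint action
    P[s][a]       : dict mapping next_state -> probability
    Obs[a][s2]    : dict mapping joint_observation -> probability
    """
    new_eta = defaultdict(float)
    for (s, history), p in eta.items():
        a = decision_rule(history)
        for s2, p_trans in P[s][a].items():
            for z, p_obs in Obs[a][s2].items():
                # histories grow by one (joint action, joint observation) pair
                new_eta[(s2, history + ((a, z),))] += p * p_trans * p_obs
    return dict(new_eta)
```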

We believe our methodology lays the foundation for further work on optimal as well as approximate solution methods for decentralized stochastic control problems in particular, and stochastic control problems in general.

Abstraction Pathologies in Markov Decision Processes

Participants : Manel Tagorti, Bruno Scherrer, Olivier Buffet.

Jörg Hoffmann, former member of MAIA, is an external collaborator (from Saarland University).

Abstraction is a common method for computing lower bounds in classical planning: it imposes an equivalence relation on the state space and derives the lower bound from the quotient system. It is a trivial and well-known fact that refined abstractions can only improve the lower bound. Thus, when we embarked on applying the same technique in the probabilistic setting, we firmly believed we would find the same behavior there. We were wrong. Indeed, there are cases where every direct refinement step (splitting one equivalence class into two) yields strictly worse bounds. We give a comprehensive account of the issues involved for two widespread methods of defining and using abstract MDPs.
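To make the setting concrete, the following minimal sketch builds a quotient MDP from a state partition by uniformly averaging transitions and rewards within each equivalence class; this is only one simple aggregation scheme chosen for illustration, not necessarily one of the two methods analyzed in [29], but it suffices to experiment with refinements of a partition and observe how the induced abstract value evolves.

```python
import numpy as np

def abstract_mdp(P, R, partition):
    """Build a quotient MDP from a partition of the state space.

    P : array of shape (A, S, S), P[a, s, s'] = transition probability
    R : array of shape (A, S), expected rewards
    partition : list of lists of concrete states (the equivalence classes)

    Transitions and rewards are averaged uniformly over each class
    (one of several possible aggregation schemes).
    """
    A, S, _ = P.shape
    K = len(partition)
    block_of = np.empty(S, dtype=int)
    for k, block in enumerate(partition):
        block_of[block] = k
    P_abs = np.zeros((A, K, K))
    R_abs = np.zeros((A, K))
    for k, block in enumerate(partition):
        w = 1.0 / len(block)                  # uniform weights inside the block
        for a in range(A):
            R_abs[a, k] = w * R[a, block].sum()
            for s in block:
                for s2 in range(S):
                    P_abs[a, k, block_of[s2]] += w * P[a, s, s2]
    return P_abs, R_abs

def optimal_value(P, R, gamma, iters=1000):
    """Plain value iteration, used to compare concrete and abstract values."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        V = np.max(R + gamma * P @ V, axis=0)
    return V
```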

This work has been presented and published in the ICAPS-13 workshop on Heuristics and Search for Domain-Independent Planning (HSDIP) [29] and (in French) in JFPDA-13 [37] .

Evolutionary Programming for Policy Space Exploration

Participants : Amine Boumaza, Vincent Thomas.

Evolutionary Programming, proposed by Fogel and initially introduced in 1966, is an approach for building an automaton that optimizes a fitness function. As in other evolutionary algorithms, an initial population of automata is given, and the evolutionary programming algorithm makes this population evolve by progressively modifying automata (mutations) and keeping the most efficient ones in the next generation.
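As an illustration, the skeleton below shows the kind of evolutionary programming loop considered here, evolving automata encoded as transition and output tables through mutation and truncation selection; the encoding and parameters are illustrative choices, not those of our ongoing implementation.

```python
import random

def random_automaton(n_states, n_inputs, n_outputs):
    """A finite automaton encoded as a transition table and an output table."""
    return {
        "delta": [[random.randrange(n_states) for _ in range(n_inputs)]
                  for _ in range(n_states)],
        "out": [random.randrange(n_outputs) for _ in range(n_states)],
    }

def mutate(auto, n_states, n_inputs, n_outputs):
    """Randomly modify one transition or one output symbol of a copy."""
    child = {"delta": [row[:] for row in auto["delta"]], "out": auto["out"][:]}
    if random.random() < 0.5:
        s, i = random.randrange(n_states), random.randrange(n_inputs)
        child["delta"][s][i] = random.randrange(n_states)
    else:
        child["out"][random.randrange(n_states)] = random.randrange(n_outputs)
    return child

def evolutionary_programming(fitness, n_states, n_inputs, n_outputs,
                             pop_size=50, generations=200):
    """Classical EP: each parent produces one mutated child, the best half survives."""
    pop = [random_automaton(n_states, n_inputs, n_outputs) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(a, n_states, n_inputs, n_outputs) for a in pop]
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]
```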

This evolutionary process is close to the progressive construction performed by a policy iteration algorithm in a POMDP, and we are currently investigating the links between these two approaches.

This work began this year with an internship (Benjamin Bibler), and preliminary developments have been carried out to solve the Santa Fe trail problem proposed by Koza (1992), which has become a benchmark for comparing genetic and evolutionary programming approaches.

Evolutionary Learning of Tetris Policies

Participant : Amine Boumaza.

Learning Tetris controllers is an interesting and challenging problem: the size of the search space is such that traditional machine learning methods do not work and approximate methods are necessary (see 6.1.10 ). In this work we study the performance of a direct policy search algorithm, namely the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). We also propose different techniques to reduce the learning time, one of which is racing. This approach concentrates the computational effort on promising policies and quickly disregards bad ones in order to reduce the computation time. It allowed us to obtain policies with the same performance as those obtained without racing, but at a fifth of the computational cost. The learned strategies are among the best performing players at this time, scoring several million lines on average.
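The racing idea can be summarized by the following simplified sketch (a Hoeffding race written for illustration, not the exact procedure used in our experiments): each surviving candidate policy is evaluated one game at a time, and a candidate is dropped as soon as its upper confidence bound falls below the best lower confidence bound.

```python
import math

def hoeffding_race(candidates, play_game, max_games, score_range, delta=0.05):
    """Race a set of candidate policies using Hoeffding confidence bounds.

    candidates  : list of policies
    play_game   : policy -> score of one (stochastic) Tetris game
    score_range : assumed bound on the score spread, required by Hoeffding's inequality
    """
    alive = list(range(len(candidates)))
    sums = [0.0] * len(candidates)
    counts = [0] * len(candidates)
    for _ in range(max_games):
        for i in alive:
            sums[i] += play_game(candidates[i])
            counts[i] += 1
        means = {i: sums[i] / counts[i] for i in alive}
        radius = {i: score_range * math.sqrt(math.log(2.0 / delta) / (2 * counts[i]))
                  for i in alive}
        # drop any candidate whose optimistic estimate is below the best pessimistic one
        best_lower = max(means[i] - radius[i] for i in alive)
        alive = [i for i in alive if means[i] + radius[i] >= best_lower]
        if len(alive) <= 1:
            break
    return max(alive, key=lambda i: sums[i] / counts[i])
```

Within a CMA-ES run, such a race would be applied inside each generation, so that most game simulations are spent only on the promising policies.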

Evolutionary behavior learning

Participants : Amine Boumaza, François Charpillet, Iñaki Fernandèz.

Evolutionary Robotics (ER) deals with the design of agent behaviors using artificial evolution. Within this framework, the problem of learning optimal policies (or controllers) is treated as a policy search problem in the parameterized space of candidate policies. The search for optimal policies in this context is driven by a fitness function that assigns a value to each candidate policy by measuring its performance on the given task.

The work presented here describes the results of the master's thesis of Iñaki Fernandèz, which will be extended during a Ph.D. thesis starting in October 2014.

  • Incremental policy learning with shaping. Several methods have been proposed to accelerate the search for optimal policies in evolutionary robotics. In this work, we investigated the use of incremental learning and, more precisely, shaping, a well-known technique in behavioral psychology. The main idea is to learn to solve simple tasks first, and then exploit the learned behaviors to tackle increasingly harder tasks.

    Our preliminary results show that the best performances are obtained either in the setups with shaping or in the control experiment where the task difficulty is maximal. Nevertheless, a closer look at the results indicates that the best controllers for the shaping setups are not obtained at the end of the evolution, but rather at an earlier stage. This means that, for these shaping techniques, the best controllers have learned to solve the task when its difficulty was at an easy level and their performance is maintained later when the task difficulty increases. Although this was unforeseen, the results seem promising and deserve further investigation.

  • Online evolutionary learning. As opposed to traditional evolutionary robotics, which treats the learning problem as an off-line, centralized process, online onboard distributed evolutionary algorithms  [67] , [55] consider the learning process as being executed at the agent level, in a decentralized way. In this setting, each agent has its own controller or genome, which is locally broadcast from agent to agent, and the best performing ones survive and spread (a minimal sketch is given after this list). This gene-centered view of evolution is inspired by the theory introduced by Richard Dawkins in The Selfish Gene.

    The online aspect of these algorithms means that the agents learn while performing the task at hand. Another resulting property is that the agents learn continuously, which allows them to adapt to dynamically changing conditions and tasks. This is in opposition to the traditional (offline) view of evolutionary robotics, where the outcome of evolution is tailored to a single task. Many challenging problems arise in this framework, and this thesis will address the problem of defining fitness functions that drive a swarm of agents to learn to solve a task. Another question is to study the dynamics of these algorithms, both experimentally and theoretically, using tools from distributed systems. Some promising work in this direction has been proposed  [54] .
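As announced above, the following minimal sketch illustrates such an online, onboard, distributed scheme; it is a schematic variant written for illustration, not the specific algorithms of  [67] , [55] : each agent alternates between acting with its current genome, accumulating a local fitness, broadcasting its genome to nearby agents, and replacing its genome with a selected and mutated one from the genomes it has received.

```python
import random

class OnboardEvolvingAgent:
    """One agent in a decentralized, online, onboard evolutionary algorithm."""

    def __init__(self, genome_size):
        self.genome = [random.gauss(0.0, 1.0) for _ in range(genome_size)]
        self.fitness = 0.0
        self.reservoir = []        # (genome, fitness) pairs received from neighbors

    def act(self, observation):
        """Placeholder controller: a linear policy over the observation vector."""
        return sum(w * x for w, x in zip(self.genome, observation))

    def broadcast(self, neighbors):
        """Locally spread the current genome together with its measured fitness."""
        for other in neighbors:
            other.reservoir.append((self.genome[:], self.fitness))

    def new_generation(self, mutation_std=0.1):
        """Select the best genome heard so far (including one's own) and mutate it."""
        pool = self.reservoir + [(self.genome, self.fitness)]
        parent, _ = max(pool, key=lambda pair: pair[1])
        self.genome = [w + random.gauss(0.0, mutation_std) for w in parent]
        self.fitness = 0.0
        self.reservoir = []
```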

Learning Bad Actions

Participant : Olivier Buffet.

Jörg Hoffmann, former member of MAIA, and Michal Krajňanský are external collaborators from Saarland University.

In classical planning, a key problem is to exploit heuristic knowledge to efficiently guide the search for a sequence of actions leading to a goal state.

In some settings, one may have the opportunity to solve multiple small instances of a problem before solving larger ones, e.g., handling a logistics problem with small numbers of trucks, depots and items before moving to (much) larger numbers. The small instances may then allow extracting knowledge that can be reused when facing larger instances. Previous work shows that it is difficult to directly learn rules specifying which action to pick in a given situation. Instead, we look for rules telling which actions should not be considered, so as to reduce the search space. This approach raises several questions: What are examples of bad (or non-bad) actions? How can they be obtained? Which learning algorithm should be used?
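A schematic pipeline illustrating this idea (not the actual rule learner used in this work; the instance accessors and the featurization are hypothetical) could look as follows: solved small instances provide examples of applicable actions labeled as bad or non-bad, a standard classifier is trained on their features, and the resulting predictor is used to prune applicable actions during search on larger instances.

```python
from sklearn.tree import DecisionTreeClassifier

def collect_examples(solved_instances, featurize):
    """Label as 'bad' every applicable action not used by a known solution."""
    X, y = [], []
    for instance in solved_instances:
        for state in instance.states_on_solutions():          # hypothetical accessor
            good = set(instance.solution_actions(state))      # hypothetical accessor
            for action in instance.applicable_actions(state): # hypothetical accessor
                X.append(featurize(state, action))
                y.append(0 if action in good else 1)          # 1 means "bad"
    return X, y

def train_pruner(solved_instances, featurize):
    """Return a predicate telling whether an applicable action should be pruned."""
    X, y = collect_examples(solved_instances, featurize)
    clf = DecisionTreeClassifier(max_depth=6)
    clf.fit(X, y)
    return lambda state, action: clf.predict([featurize(state, action)])[0] == 1
```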

This research work is conducted as part of Michal Krajňanský's master of science (to be defended in early 2014). Early experiments show encouraging results, and we consider participating in the learning track of the international planning competition in 2014.

Complexity of the Policy Iteration algorithm

Participant : Bruno Scherrer.

This year, we have improved the state-of-the-art upper bounds on the complexity of a standard algorithm for solving Markov Decision Processes: Policy Iteration.

Given a Markov Decision Process with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O((nm/(1−γ)) log(1/(1−γ))) iterations, improving by a factor O(log n) a result by Hansen et al. (2013), while Simplex-PI terminates after at most O((n²m/(1−γ)) log(1/(1−γ))) iterations, improving by a factor O(log n) a result by Ye (2011). Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and of the maximal time τ_r needed to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most Õ(n³m²τ_tτ_r) iterations. This generalizes a recent result for deterministic MDPs by Post & Ye (2012), in which τ_t ≤ n and τ_r ≤ n. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are respectively transient and recurrent for all policies, we show that Simplex-PI and Howard's PI terminate after at most Õ(nm(τ_t+τ_r)) iterations.
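For readers unfamiliar with the two variants, the following minimal tabular implementation (for illustration only) highlights the single place where they differ: Howard's PI switches to a greedy action in every state with a positive advantage, whereas Simplex-PI updates only the state-action pair with maximal advantage.

```python
import numpy as np

def evaluate(policy, P, R, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    S = P.shape[1]
    P_pi = P[policy, np.arange(S)]
    r_pi = R[policy, np.arange(S)]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma, variant="howard"):
    """P : (A, S, S) transition kernels, R : (A, S) expected rewards."""
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        v = evaluate(policy, P, R, gamma)
        advantage = R + gamma * P @ v - v       # shape (A, S); zero for current actions
        if advantage.max() <= 1e-12:
            return policy, v
        if variant == "howard":
            # switch every state with a positive advantage to one of its greedy actions
            greedy = advantage.argmax(axis=0)
            improving = advantage.max(axis=0) > 1e-12
            policy[improving] = greedy[improving]
        else:
            # Simplex-PI: change only the action in the state with maximal advantage
            a, s = np.unravel_index(advantage.argmax(), advantage.shape)
            policy[s] = a
```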

These results were presented at the JFPDA national workshop [36] and at the NIPS 2013 international conference [28] .

Approximate Dynamic Programming and Application to the Game of Tetris

Participant : Bruno Scherrer.

Victor Gabillon and Mohammad Ghavamzadeh are external collaborators (from the Inria Sequel EPI). Matthieu Geist is an external collaborator (from Supélec Metz).

We present here three results: the first is a unified review of algorithms used to estimate a linear approximation of the value of some policy in a Markov Decision Process; the second concerns the analysis of a class of approximate dynamic programming algorithms for large-scale Markov Decision Processes; the last is the successful application of similar dynamic programming algorithms to the Tetris domain.

In the framework of Markov Decision Processes, we have considered linear off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory, possibly generated by some other policy. We have reviewed the on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. We have highlighted a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms and suggests new extensions. This work has recently been accepted at JMLR and should be published at the beginning of 2014 [6] .
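As one concrete representative of this family, here is a minimal sketch of off-policy TD(λ) with per-decision importance sampling applied to the eligibility traces; it is written for illustration, and the review in [6] covers several gradient-based and least-squares-based variants with different trace updates.

```python
import numpy as np

def off_policy_td_lambda(trajectory, phi, pi, mu, alpha, gamma, lam, dim):
    """Estimate theta such that theta @ phi(s) approximates the value of the
    target policy pi, from a trajectory generated by the behavior policy mu.

    trajectory : list of (s, a, r, s_next) transitions
    phi        : state -> feature vector of length dim
    pi, mu     : (action, state) -> probability under the target / behavior policy
    """
    theta = np.zeros(dim)
    trace = np.zeros(dim)
    for s, a, r, s_next in trajectory:
        rho = pi(a, s) / mu(a, s)                     # importance sampling ratio
        trace = rho * (gamma * lam * trace + phi(s))  # weighted eligibility trace
        delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)
        theta = theta + alpha * delta * trace
    return theta
```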

We have revisited the work of Bertsekas and Ioffe (1996), which introduced λ-policy iteration, a family of algorithms parametrized by λ that generalizes the standard value and policy iteration algorithms and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds that generalize standard bounds for value iteration, and we develop the theory of this algorithm when it is used in an approximate form. This work was published in JMLR [7] .
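In the exact, tabular case, one iteration of λ-policy iteration can be written in closed form, which makes the interpolation between value iteration (λ=0) and policy iteration (λ=1) apparent; the sketch below is a direct transcription of that update, in our notation and for illustration only.

```python
import numpy as np

def lambda_policy_iteration(P, R, gamma, lam, n_iters=100):
    """Tabular lambda-policy iteration.

    P : (A, S, S) transition kernels, R : (A, S) expected rewards.
    lam = 0 recovers value iteration, lam = 1 recovers policy iteration.
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(n_iters):
        q = R + gamma * P @ v              # shape (A, S)
        pi = q.argmax(axis=0)              # greedy policy w.r.t. the current v
        P_pi = P[pi, np.arange(S)]
        T_pi_v = q[pi, np.arange(S)]
        # v <- v + (I - lam * gamma * P_pi)^{-1} (T_pi v - v)
        v = v + np.linalg.solve(np.eye(S) - lam * gamma * P_pi, T_pi_v - v)
    return v, pi
```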

Tetris is a video game that has been widely used as a benchmark for various optimization techniques, including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that ADP algorithms that are (almost) entirely based on approximating the value function have performed poorly in Tetris, while methods that search directly in the space of policies by learning the policy parameters with a black-box optimizer, such as the cross-entropy (CE) method, have achieved the best reported results. We have applied an algorithm we proposed in the past, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that, for the first time, an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris on both the small 10×10 and the large 10×20 boards. Although CBMPI's results are similar to those of the CE method on the large board, CBMPI uses considerably fewer samples (calls to the generative model), almost 1/6 as many as CE. This work was presented at NIPS 2013 [26] .
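At a high level, one iteration of a classification-based modified policy iteration scheme proceeds as sketched below: truncated rollouts estimate action values at sampled states, a regressor fits the value function, and a classifier fits the greedy policy. This is a schematic outline with deliberate simplifications (a single rollout per action, generic scikit-learn learners) relative to the CBMPI algorithm evaluated in [26].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def cbmpi_iteration(states, actions, simulate, policy, value, features,
                    m=10, gamma=0.99):
    """One schematic iteration of classification-based modified policy iteration.

    simulate(s, a) -> (reward, next_state) : generative model of the game
    policy(s) -> action, value(s) -> float : current policy and value estimates
    features(s) -> feature vector shared by the classifier and the regressor
    m : rollout length before bootstrapping with the current value function
    """
    X, action_labels, value_targets = [], [], []
    for s in states:
        q = []
        for a in actions:
            # truncated rollout: first step with action a, then m-1 steps with the policy
            r, s_cur = simulate(s, a)
            ret = r
            for t in range(1, m):
                r, s_cur = simulate(s_cur, policy(s_cur))
                ret += (gamma ** t) * r
            q.append(ret + (gamma ** m) * value(s_cur))
        X.append(features(s))
        action_labels.append(int(np.argmax(q)))   # greedy action index
        value_targets.append(max(q))              # simplified value regression target
    new_policy = DecisionTreeClassifier().fit(X, action_labels)
    new_value = DecisionTreeRegressor().fit(X, value_targets)
    return new_policy, new_value
```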