Section: New Results
Decision Under Uncertainty
Participants: Lucian Busoniu, Alexandra Carpentier, Rémi Coulom, Victor Gabillon, Mohammad Ghavamzadeh, Sertan Girgin, Jean-François Hren, Alessandro Lazaric, Manuel Loth, Odalric-Ambrym Maillard, Rémi Munos, Olivier Nicol, Philippe Preux, Daniil Ryabko.
Reinforcement learning and approximate dynamic programming
In the domain of reinforcement learning and approximate dynamic programming, we identify two main lines of research.
Links between Approximate Dynamic Programming and Statistical Learning Theory
The main objective here is to use tools from statistical learning theory to derive finite-sample performance bounds for RL and ADP algorithms. The goal is to derive bounds on the performance of the policies induced by these algorithms in terms of the number of samples and the capacity and approximation power of the considered function and policy spaces. The results of this study give us a better understanding of how these algorithms behave and help us design them more efficiently. The main contributions to this research line in 2011 are:

Classification-based Policy Iteration with a Critic [25], [51]. In collaboration with Bruno Scherrer (INRIA Nancy - Grand Est, Team MAIA), we extended last year's work on classification-based policy iteration by adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after truncating the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. The introduction of a critic can therefore improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We presented a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provided its finite-sample analysis when the critic is based on the LSTD method. We also empirically evaluated the performance of DPI-Critic and compared it with DPI and LSPI on two benchmark reinforcement learning problems.

Finite-Sample Analysis of Least-Squares Policy Iteration [10], [45]. We extended last year's work on the finite-sample analysis of least-squares temporal-difference (LSTD) learning to the least-squares policy iteration (LSPI) algorithm. In particular, we analyzed how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derived a performance bound for the LSPI algorithm.

Speedy Q-Learning [16], [48]. We introduced a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with $n$ state-action pairs and discount factor $\gamma$, only $T=O\left(\log(n)/\left(\epsilon^{2}(1-\gamma)^{4}\right)\right)$ steps are required for the SQL algorithm to converge to an $\epsilon$-optimal action-value function with high probability. This bound has a better dependency on $1/\epsilon$ and $1/(1-\gamma)$, and is thus tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered more efficient than incremental methods like Q-learning.
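As a rough illustration of the idea, the following sketch runs a synchronous, speedy-Q-learning-style iteration on a made-up two-state, two-action MDP; the dynamics, rewards, and iteration count are illustrative assumptions, not taken from [16], [48]:

```python
import random

# Toy 2-state, 2-action MDP (made up for illustration).
random.seed(0)
gamma = 0.5
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}   # next-state distributions
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 0.2}

Q_prev = {(s, a): 0.0 for s in S for a in A}   # Q_{k-1}
Q = dict(Q_prev)                               # Q_k
for k in range(5000):
    alpha = 1.0 / (k + 1)
    Q_next = {}
    for s in S:
        for a in A:
            # one sampled next state, shared by both empirical Bellman operators
            s2 = random.choices(S, weights=P[(s, a)])[0]
            t_prev = R[(s, a)] + gamma * max(Q_prev[(s2, b)] for b in A)
            t_curr = R[(s, a)] + gamma * max(Q[(s2, b)] for b in A)
            # speedy update: small step toward T Q_{k-1}, aggressive step
            # on the difference of the two empirical operators
            Q_next[(s, a)] = (Q[(s, a)]
                              + alpha * (t_prev - Q[(s, a)])
                              + (1 - alpha) * (t_curr - t_prev))
    Q_prev, Q = Q, Q_next
```

The aggressive $(1-\alpha)$ step on the operator difference is what distinguishes this update from standard Q-learning, which would use only the $\alpha$ step.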

Selecting the State-Representation in Reinforcement Learning [34]. We consider the problem of selecting the right state-representation in a reinforcement learning problem. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or the probabilistic characteristics of the resulting MDP, the goal is to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order $T^{2/3}$, where $T$ is the time horizon.

Transfer from Multiple MDPs [32]. Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and introduce novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results on a continuous chain problem.
RL in High-dimensional Spaces
The main objective here is to devise, analyze, implement, and experiment with RL algorithms whose sample and computational complexities do not grow rapidly with the dimension of the state space. We have tackled this problem from two different angles:

Exploiting the Regularities of the Problem [57], [8], [27]. In order to solve RL problems in high dimensions, we should exploit all the regularities of the problem at hand. Smoothness is the most common regularity. We continued our collaboration with Amir-massoud Farahmand and Csaba Szepesvári at the University of Alberta, Canada, and Shie Mannor at the Technion, Israel, on using regularization methods for automatic model selection for value function approximation in RL. We devised and analyzed the first $\ell_2$-regularized RL algorithms by adding $\ell_2$-regularization to three well-known ADP algorithms: fitted Q-iteration, modified Bellman residual minimization, and least-squares temporal-difference learning [57], [8]. The designed algorithms work in both linear and reproducing kernel Hilbert spaces. Sparsity is another form of regularity that clearly plays a central role in the emerging theory of learning in high dimensions. We have worked on using $\ell_1$-regularization in approximate dynamic programming and RL, which may also serve as a method for feature selection in value function approximation. We derived finite-sample performance bounds for an algorithm resulting from adding an $\ell_1$-penalty to the widely-used least-squares temporal-difference (LSTD) learning algorithm [27].
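To fix ideas, here is a minimal sketch of the $\ell_2$-regularized (ridge) flavor of LSTD on a toy three-state chain with a fixed policy; the chain, the tabular features, and the regularization constant are illustrative assumptions:

```python
import numpy as np

# Toy deterministic cycle 0 -> 1 -> 2 -> 0 with reward 1 on the 2 -> 0 step.
rng = np.random.default_rng(0)
gamma, lam, n = 0.9, 0.1, 2000
states = rng.integers(0, 3, size=n)
next_states = (states + 1) % 3
rewards = (next_states == 0).astype(float)
phi = np.eye(3)                     # tabular features: one indicator per state

Phi = phi[states]                   # n x d current-state features
PhiNext = phi[next_states]          # n x d next-state features
# ridge-regularized LSTD system: (Phi^T (Phi - gamma Phi') + lam I) theta = Phi^T r
A = Phi.T @ (Phi - gamma * PhiNext) + lam * np.eye(3)
b = Phi.T @ rewards
theta = np.linalg.solve(A, b)
V = phi @ theta                     # estimated value of each state
```

With tabular features the regularized solution is close to the exact fixed point $V(2) = 1/(1-\gamma^3)$; in a genuinely high-dimensional feature space the $\lambda$ term is what keeps the system well-conditioned.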

Random Projections [28], [52]. We have looked into recent directions, popularized in compressive sensing, concerning the preservation of properties, such as norms or inner products, of high-dimensional objects when projected onto possibly much lower-dimensional random subspaces. We studied the popular LSTD algorithm when a low-dimensional space is generated by a random projection from the high-dimensional space, and derived performance bounds for the resulting algorithm [28], [52].
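A minimal sketch of this construction, under illustrative assumptions (a toy three-state cycle, 100 ambient dimensions, and a 10-dimensional Gaussian projection):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, D, d = 0.9, 100, 10
base = rng.normal(size=(3, D))                 # high-dimensional state features
proj = rng.normal(size=(D, d)) / np.sqrt(d)    # Gaussian random projection
low = base @ proj                              # projected features, 3 x d

# Samples from the deterministic cycle 0 -> 1 -> 2 -> 0, reward on 2 -> 0.
states = np.arange(3).repeat(200)
next_states = (states + 1) % 3
rewards = (next_states == 0).astype(float)

# Ordinary LSTD, but run in the low-dimensional projected space.
Phi, PhiNext = low[states], low[next_states]
A = Phi.T @ (Phi - gamma * PhiNext)
b = Phi.T @ rewards
theta = np.linalg.lstsq(A, b, rcond=None)[0]   # min-norm solution (A is rank 3)
V = low @ theta
```

Because the three projected feature vectors stay linearly independent, the projected LSTD solution here recovers the exact values; the interest of the analysis in [28], [52] is quantifying what is lost when the projection genuinely compresses the representation.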
Planning and exploration vs. exploitation tradeoff
In the domain of planning and exploration-exploitation algorithms, we identify two main lines of research.
Multi-armed Bandits, Online Learning and Optimization

Active Learning in Multi-Armed Bandit Problems [18], [49], [24], [50]. This can be seen as an online allocation problem with several options, and is closely related to the problem of optimal experimental design in statistics. The objective is to allocate a fixed budget to a finite (or possibly infinite) number of options (arms) in order to achieve the best accuracy in estimating the quality of each option. In addition to having applications in a number of fields such as online advertisement and personalized treatment, this problem is of specific importance in RL, in which generating training data is usually expensive. In this framework, we have studied the following two problems: 1) estimating the mean values of all the arms uniformly well in a multi-armed bandit setting [18], [49], and 2) identifying the best arm in each of the bandits in a multi-bandit multi-armed setting [24], [50]. For each problem, we have developed algorithms with theoretical guarantees.
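For the first problem, a natural allocation rule (sketched here under illustrative assumptions: three Gaussian arms and a variance floor we add to keep exploring) is to pull the arm whose estimated variance-to-samples ratio is largest, so that all means end up estimated uniformly well:

```python
import random

random.seed(0)
means = [0.0, 1.0, 2.0]
sigmas = [1.0, 0.5, 0.1]     # true standard deviations (unknown to the learner)

def emp_var(x):
    # unbiased empirical variance
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

# two initial pulls per arm, then adaptive allocation
samples = [[random.gauss(means[i], sigmas[i]) for _ in range(2)] for i in range(3)]
for _ in range(2000):
    # estimated estimation error of each arm's mean: variance / sample count
    scores = [max(emp_var(s), 1e-3) / len(s) for s in samples]
    i = scores.index(max(scores))          # pull the worst-estimated arm
    samples[i].append(random.gauss(means[i], sigmas[i]))

counts = [len(s) for s in samples]
```

In the long run this allocates samples roughly in proportion to the arms' variances, which is the oracle allocation for uniformly accurate mean estimates.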

Finite-Time Analysis of Stratified Sampling for Monte Carlo [20]. We consider the problem of stratified sampling for Monte Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata (intervals in the input domain), and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that knows the standard deviations of the strata. We provide two regret analyses: a distribution-dependent bound $O\left(n^{-3/2}\right)$ that depends on a measure of the disparity of the strata, and a distribution-free bound $O\left(n^{-4/3}\right)$ that does not.
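A sketch in the spirit of this strategy, under illustrative assumptions (integrating $f(x)=x^2$ on $[0,1]$ with four equal strata and a simple exploration bonus): each round, sample the stratum whose optimistic weighted-deviation score is largest, then combine the per-stratum means.

```python
import random, math

random.seed(1)
f = lambda x: x * x                  # true integral on [0, 1] is 1/3
K = 4
edges = [(k / K, (k + 1) / K) for k in range(K)]
w = [1.0 / K] * K                    # stratum weights (equal-width strata)

def emp_std(x):
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))

# two initial samples per stratum, then optimistic allocation
obs = [[f(random.uniform(a, b)) for _ in range(2)] for a, b in edges]
for _ in range(1000):
    # optimistic score: weight times (empirical std + exploration bonus),
    # discounted by the number of samples already drawn in the stratum
    scores = [w[k] * (emp_std(obs[k]) + 1.0 / math.sqrt(len(obs[k]))) / len(obs[k])
              for k in range(K)]
    k = scores.index(max(scores))
    a, b = edges[k]
    obs[k].append(f(random.uniform(a, b)))

estimate = sum(w[k] * sum(obs[k]) / len(obs[k]) for k in range(K))
```

More samples end up in the high-variance strata (here, near $x=1$), which is what the oracle allocation would also do.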

Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness [36]. We consider the global optimization of a deterministic function $f$ in a semi-metric space, given a finite budget of $n$ evaluations. The function $f$ is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of the semi-metric. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric under which $f$ is smooth, and whose performance is almost as good as that of DOO optimally fitted.
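The SOO idea can be sketched on a one-dimensional toy problem (the target function, budget, and ternary splitting are illustrative assumptions): grow a partition tree and, sweeping the depths, expand the best cell at each depth only if its value beats everything expanded at shallower depths.

```python
import math

f = lambda x: -(x - 0.7) ** 2           # toy objective, global maximum at 0.7
# each leaf cell: (depth, left, right, f evaluated at the cell center)
leaves = [(0, 0.0, 1.0, f(0.5))]
evals = 1
while evals < 300:                      # evaluation budget
    vmax = -math.inf
    max_depth = max(d for d, _, _, _ in leaves)
    for h in range(max_depth + 1):      # sweep depths from coarse to fine
        at_h = [l for l in leaves if l[0] == h]
        if not at_h:
            continue
        best = max(at_h, key=lambda l: l[3])
        if best[3] > vmax:              # only expand if it beats shallower cells
            vmax = best[3]
            leaves.remove(best)
            d, a, b, _ = best
            third = (b - a) / 3
            for i in range(3):          # split the cell into three children
                lo, hi = a + i * third, a + (i + 1) * third
                leaves.append((d + 1, lo, hi, f((lo + hi) / 2)))
                evals += 1

x_best = max(leaves, key=lambda l: l[3])
x_hat = (x_best[1] + x_best[2]) / 2     # center of the best evaluated cell
```

No semi-metric appears anywhere in the procedure: the depth sweep plays the role that the known smoothness plays in DOO.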

Finite-Time Analysis of Multi-armed Bandit Problems with Kullback-Leibler Divergences [35]. We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we obtain bounds whose main terms are smaller than those of previously known algorithms with finite-time analyses (such as UCB-type algorithms).
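A sketch of a KL-based index policy in the simplest (Bernoulli) case, with made-up arm parameters and the plain $\log t$ exploration budget: the index of an arm is the largest mean whose KL divergence from the empirical mean still fits the budget.

```python
import random, math

random.seed(2)
p = [0.3, 0.5, 0.7]                    # true Bernoulli parameters (unknown)

def kl(a, b):
    # Bernoulli KL divergence, clipped away from 0 and 1 for stability
    eps = 1e-12
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def index(mean, n, t):
    # bisection for the largest q >= mean with n * kl(mean, q) <= log(t)
    lo, hi = mean, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2
        if n * kl(mean, mid) <= math.log(t):
            lo = mid
        else:
            hi = mid
    return lo

counts = [1, 1, 1]                               # one initial pull per arm
sums = [float(random.random() < pi) for pi in p]
for t in range(4, 3000):
    idx = [index(sums[i] / counts[i], counts[i], t) for i in range(3)]
    i = idx.index(max(idx))
    sums[i] += float(random.random() < p[i])
    counts[i] += 1
```

The KL index is tighter than a Hoeffding-style UCB near the boundaries of $[0,1]$, which is the source of the improved constants in the regret bound.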

Adaptive Bandits: Towards the Best History-dependent Strategy [33]. We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) the opponent is constrained, i.e., he provides rewards that are stochastic functions of equivalence classes defined by some model; the regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e., the best history-class-based strategy) for the best model. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations. When only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $O\left(\sqrt{TAC}\right)$, where $C$ is the number of classes. When many models are available, all known algorithms achieving a nice regret of $O\left(\sqrt{T}\right)$ are unfortunately not tractable and scale poorly with the number of models. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3}C^{1/3}\log(|\Theta|)^{1/2}$.

Pure Exploration in Finitely-Armed and Continuous-Armed Bandits [5]. We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an online exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case of cumulative regret, where exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations where the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
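The distinction between the two regret notions can be made concrete with a small sketch (the three Bernoulli arms and the round-robin forecaster are illustrative assumptions): a pure-exploration forecaster pulls arms uniformly and only its final recommendation is scored.

```python
import random

random.seed(3)
p = [0.4, 0.5, 0.6]                  # true Bernoulli means (unknown)
n_rounds = 900
sums = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
for t in range(n_rounds):
    i = t % 3                        # uniform (round-robin) exploration
    sums[i] += float(random.random() < p[i])
    counts[i] += 1

# simple regret: gap of the single arm recommended at the end
recommended = max(range(3), key=lambda i: sums[i] / counts[i])
simple_regret = max(p) - p[recommended]
# cumulative regret: reward given up over all 900 pulls
cumulative_regret = n_rounds * max(p) - sum(counts[i] * p[i] for i in range(3))
```

The round-robin forecaster deliberately accepts a cumulative regret that grows linearly in the horizon in exchange for a simple regret that decays quickly; the lower bound mentioned above says this trade-off is unavoidable.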

X-Armed Bandits [6]. We consider a generalization of stochastic bandits where the set of arms, $X$, is allowed to be a generic measurable space and the mean-payoff function is locally Lipschitz with respect to a dissimilarity function that is known to the decision maker. Under this condition, we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if $X$ is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by $\sqrt{n}$; that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.

Learning with Stochastic Inputs and Adversarial Outputs [11]. Most of the research in online learning is focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and analyze its learnability. In particular, we show that in binary classification, given a hypothesis space $H$ with finite VC-dimension, it is possible to design an algorithm that incrementally builds a suitable finite set of hypotheses from $H$, used as input for an exponentially weighted forecaster, and achieves a cumulative regret of order $\sqrt{n\,VC(H)\log n}$ with overwhelming probability. This result shows that whenever inputs are i.i.d., it is possible to solve any binary classification problem using a finite VC-dimension hypothesis space with a sublinear regret, independently of the way labels are generated (either stochastic or adversarial). We also discuss extensions to multi-label classification, regression, learning from experts and bandit settings with stochastic side information, and applications to games.
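The exponentially weighted forecaster at the core of this construction can be sketched on a toy stream (the finite set of threshold hypotheses, the label rule, and the learning rate are illustrative assumptions, standing in for the hypothesis set the algorithm would build from $H$):

```python
import random, math

random.seed(4)
thresholds = [i / 10 for i in range(11)]   # finite hypothesis set on [0, 1]
weights = [1.0] * len(thresholds)
eta = 0.5                                   # learning rate
mistakes = 0
best_mistakes = [0] * len(thresholds)       # per-hypothesis mistake counts

for t in range(2000):
    x = random.random()                     # stochastic i.i.d. input
    y = int(x > 0.35)                       # labels (here from a fixed rule)
    preds = [int(x > th) for th in thresholds]
    # weighted-majority prediction
    vote = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
    y_hat = int(vote > 0.5)
    mistakes += int(y_hat != y)
    # exponential weight update: decay each hypothesis by its 0/1 loss
    for i, pred in enumerate(preds):
        loss = int(pred != y)
        best_mistakes[i] += loss
        weights[i] *= math.exp(-eta * loss)

regret = mistakes - min(best_mistakes)
```

The weights concentrate on the hypotheses straddling the true threshold, and the forecaster's mistake count stays within an additive $O(\sqrt{n \log N})$-style margin of the best hypothesis in the set, whatever process generates the labels.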

ICML Exploration-Exploitation Challenge [65], [63]. Olivier Nicol and Jérémie Mary won the ICML Exploration and Exploitation Challenge 2, organized by the University of Cambridge on a dataset provided by Adobe. The winning approach is based on ideas close to Bayesian networks and Thompson sampling, as in Microsoft's AdPredictor. This kind of success emphasizes the need for better theoretical analysis of these frameworks. The challenge was also a good occasion to think about the best way to evaluate online policies (a topic that also attracts interest from Orange Labs). A publication has been submitted to JMLR.
Planning

Optimistic Planning for Sparsely Stochastic Systems [17]. We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.

Optimistic Planning in Markov Decision Processes [46]. We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with the largest upper bounds on the value (or upper confidence bounds, in the stochastic case). The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
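The deterministic case can be sketched in a few lines (the two-state system, rewards, and expansion budget are illustrative assumptions): grow a tree over action sequences, always expanding the leaf with the largest optimistic upper bound on its return.

```python
import math

gamma = 0.8

def step(s, a):
    # deterministic toy dynamics: action 1 from state 0 pays off, else nothing
    if s == 0 and a == 1:
        return 1, 1.0
    if s == 1:
        return 1, 0.5
    return 0, 0.0

# each leaf: (state, discounted return so far, depth, first action of sequence)
leaves = [(0, 0.0, 0, None)]
best = (-math.inf, None)                   # (best return seen, its first action)
# optimistic bound: return so far + gamma^depth / (1 - gamma) for the future
b_value = lambda l: l[1] + gamma ** l[2] / (1 - gamma)

for _ in range(100):                       # expansion budget
    leaf = max(leaves, key=b_value)        # expand the most promising leaf
    leaves.remove(leaf)
    s, ret, d, first = leaf
    for a in (0, 1):
        s2, r = step(s, a)
        child = (s2, ret + gamma ** d * r, d + 1, a if first is None else first)
        leaves.append(child)
        if child[1] > best[0]:
            best = (child[1], child[3])

chosen_action = best[1]                    # first action of the best sequence
```

In a receding-horizon loop, `chosen_action` would be applied to the real system and the whole tree rebuilt from the next state.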
Applications
Management of ad campaigns on the web
More work has been dedicated to optimizing ad campaigns on the web under real-time constraints, in a dynamic environment [9].