SEQUEL - 2011 - Annual activity report

Section: New Results

Decision Under Uncertainty

Participants : Lucian Busoniu, Alexandra Carpentier, Rémi Coulom, Victor Gabillon, Mohammad Ghavamzadeh, Sertan Girgin, Jean-François Hren, Alessandro Lazaric, Manuel Loth, Odalric-Ambrym Maillard, Rémi Munos, Olivier Nicol, Philippe Preux, Daniil Ryabko.

Reinforcement learning and approximate dynamic programming

In the domain of reinforcement learning and approximate dynamic programming, we identify two main lines of research.

Links between Approximate Dynamic Programming and Statistical Learning Theory

The main objective here is to use tools from statistical learning theory to derive finite-sample performance bounds for RL and ADP algorithms. The goal is to bound the performance of the policies induced by these algorithms in terms of the number of simulated samples and the capacity and approximation power of the considered function and policy spaces. These results give us a better understanding of how these algorithms behave and help us design them more efficiently. The main contributions to this research line in 2011 are:

Classification-based Policy Iteration with a Critic [25] , [51] . In collaboration with Bruno Scherrer (INRIA Nancy - Grand Est, Team MAIA), we extended last year's work on classification-based policy iteration by adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We presented a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provided its finite-sample analysis when the critic is based on the LSTD method. We also empirically evaluated the performance of DPI-Critic and compared it with DPI and LSPI on two benchmark reinforcement learning problems.
Finite-Sample Analysis of Least-Squares Policy Iteration [10] , [45] . We extended last year's work on the finite-sample analysis of least-squares temporal-difference (LSTD) learning to the least-squares policy iteration (LSPI) algorithm. In particular, we analyzed how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derived a performance bound for the LSPI algorithm.
Speedy Q-Learning [16] , [48] . We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with $n$ state-action pairs and discount factor $γ$, only $T = O(\log(n) / (ϵ^{2} (1 - γ)^{4}))$ steps are required for the SQL algorithm to converge to an $ϵ$-optimal action-value function with high probability. This bound has a better dependency on $1 / ϵ$ and $1 / (1 - γ)$, and is thus tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration, which are considered to be more efficient than incremental methods like Q-learning.
Selecting the State-Representation in Reinforcement Learning [34] . The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing either which of the models is the correct one or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order $T^{2 / 3}$, where $T$ is the time horizon.
Transfer from Multiple MDPs [32] . Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and introduce novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
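
To make the speedy Q-learning item above more concrete, here is a minimal Python sketch of a synchronous SQL iteration on a small tabular MDP. The update rule and the step size 1/(k+1) are recalled from the published description of the algorithm rather than stated in this report, so they should be read as an assumption. The function name `speedy_q_learning`, the data layout `P[s][a]` / `R[s][a]`, and the two-state toy MDP are hypothetical and chosen only for illustration.

```python
import random


def speedy_q_learning(P, R, gamma, n_iters, seed=0):
    """Synchronous speedy Q-learning sketch on a tabular MDP (assumed form).

    P[s][a] is a list of next-state probabilities and R[s][a] a
    deterministic reward; both layouts are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    states = range(n_states)
    q_prev = [[0.0] * n_actions for _ in states]  # Q_{k-1}
    q = [[0.0] * n_actions for _ in states]       # Q_k
    for k in range(n_iters):
        alpha = 1.0 / (k + 1)
        q_next = [[0.0] * n_actions for _ in states]
        for s in states:
            for a in range(n_actions):
                # One sampled next state, shared by both empirical backups.
                s2 = rng.choices(states, weights=P[s][a])[0]
                t_prev = R[s][a] + gamma * max(q_prev[s2])
                t_cur = R[s][a] + gamma * max(q[s2])
                # Conservative step toward the old backup, plus an
                # aggressive correction by the difference of backups.
                q_next[s][a] = (q[s][a]
                                + alpha * (t_prev - q[s][a])
                                + (1.0 - alpha) * (t_cur - t_prev))
        q_prev, q = q, q_next
    return q


# Hypothetical two-state, two-action MDP with deterministic transitions.
toy_P = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
toy_R = [[0.0, 0.0],
         [1.0, 0.0]]
print(speedy_q_learning(toy_P, toy_R, gamma=0.5, n_iters=2000))
```

On this toy problem the optimal action values can be computed by hand (0.5 and 1 in the first state, 2 and 0.5 in the second), which gives a simple check of the iteration.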

RL in High-dimensional Spaces

The main objective here is to devise, analyze, implement, and experiment with RL algorithms whose sample and computational complexities do not grow rapidly with the dimension of the state space. We have tackled this problem from two different angles:

Exploiting the Regularities of the Problem [57] , [8] , [27] . In order to solve RL in high dimensions, we should exploit all the regularities of the problem at hand. Smoothness is the most common regularity. We continued our collaboration with Amir massoud Farahmand and Csaba Szepesvári at the University of Alberta, Canada, and Shie Mannor at Technion, Israel, on using regularization methods for automatic model selection for value function approximation in RL. We have devised and analyzed the first $ℓ_{2}$-regularized RL algorithms by adding $ℓ_{2}$-regularization to three well-known ADP algorithms: fitted Q-iteration, modified Bellman residual minimization, and least-squares temporal-difference learning [57] , [8] . The designed algorithms work in both linear and reproducing kernel Hilbert spaces. Sparsity is another form of regularity that clearly plays a central role in the emerging theory of learning in high dimensions. We have worked on using $ℓ_{1}$-regularization in approximate dynamic programming and RL, which may also serve as a method for feature selection in value function approximation. We have derived finite-sample performance bounds for an algorithm obtained by adding an $ℓ_{1}$-penalty to the widely used least-squares temporal-difference learning (LSTD) algorithm [27] .
Random Projections [28] , [52] . We have looked into recent directions popularized in compressive sensing concerning the preservation of properties, such as norms or inner products, of high-dimensional objects when projected onto possibly much lower-dimensional random subspaces. We have studied the popular LSTD algorithm when a low-dimensional space is generated by a random projection from the high-dimensional space, and derived performance bounds for the resulting algorithm [28] , [52] .
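
The norm-preservation property mentioned in the random projections item can be illustrated with a short, self-contained Python sketch. It draws a Gaussian random matrix with entries of variance 1/d (a standard construction, assumed here rather than taken from the report) and compares the squared norm of a vector before and after projection. The helper names `random_projection`, `project` and `sq_norm`, and the dimensions used, are hypothetical.

```python
import math
import random


def random_projection(dim_high, dim_low, seed=0):
    """Return a dim_low x dim_high Gaussian matrix with N(0, 1/dim_low) entries."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_low)
    return [[rng.gauss(0.0, scale) for _ in range(dim_high)]
            for _ in range(dim_low)]


def project(matrix, x):
    """Apply the projection matrix to the vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


# Illustrative dimensions: 400 features projected down to 200.
A = random_projection(dim_high=400, dim_low=200)
x = [math.sin(i) for i in range(400)]
ratio = sq_norm(project(A, x)) / sq_norm(x)
print("squared-norm ratio after projection:", ratio)
```

With this scaling the expected squared norm is unchanged by the projection, so the printed ratio should be close to 1, and it concentrates further as the low dimension grows.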

Planning and exploration vs. exploitation trade-off

In the domain of planning and exploration-exploitation algorithms, we identify two main lines of research.

Multi-arm Bandit, Online Learning and Optimization

Active Learning in Multi-Armed Bandit Problems [18] , [49] , [24] , [50] . This can be seen as an online allocation problem with several options and is closely related to the problem of optimal experimental design in statistics. The objective here is to allocate a fixed budget to a finite (or possibly infinite) number of options (arms) in order to achieve the best accuracy in estimating the quality of each option. In addition to having applications in a number of different fields, such as online advertisement and personalized treatment, this problem is of specific importance in RL, in which generating training data is usually expensive. In this framework, we have studied the following two problems: 1) estimating the mean values of all the arms uniformly well in a multi-armed bandit setting [18] , [49] , and 2) identifying the best arm in each of the bandits in a multi-bandit multi-armed setting [24] , [50] . For each problem, we have developed algorithms with theoretical guarantees.
Finite Time Analysis of Stratified Sampling for Monte Carlo [20] . We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata (intervals of the input domain), and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution-dependent bound $O (n^{- 3 / 2})$ that depends on a measure of the disparity of the strata, and a distribution-free bound $O (n^{- 4 / 3})$ that does not.
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness [36] . We consider the global optimization of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of the semi-metric. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric under which f is smooth, and whose performance is almost as good as that of an optimally fitted DOO.
Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences [35] . We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).
Adaptive bandits: Towards the best history-dependent strategy [33] . We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $Θ$ of constraints based on equivalence classes on the common history (information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e., it provides rewards that are stochastic functions of equivalence classes defined by some model. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e., the best history-class-based strategy) for the best model. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $O (\sqrt{T A C})$, where $C$ is the number of classes. When many models are available, all known algorithms achieving a regret of $O (\sqrt{T})$ are unfortunately not tractable and scale poorly with the number of models. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2 / 3} C^{1 / 3} \log(| Θ |)^{1 / 2}$.
Pure Exploration in Finitely-Armed and Continuous-Armed Bandits [5] . We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.
X-Armed Bandits [6] . We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is locally Lipschitz with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by $\sqrt{n}$ , that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.
Learning with Stochastic Inputs and Adversarial Outputs [11] . Most of the research in online learning is focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and we analyze its learnability. In particular, we show that in binary classification, given a hypothesis space $H$ with finite VC-dimension, it is possible to design an algorithm which incrementally builds a suitable finite set of hypotheses from $H$ used as input for an exponentially weighted forecaster and achieves a cumulative regret of order $\sqrt{n \, \mathrm{VC}(H) \log n}$ with overwhelming probability. This result shows that whenever inputs are i.i.d., it is possible to solve any binary classification problem using a finite VC-dimension hypothesis space with a sub-linear regret, independently of the way labels are generated (either stochastic or adversarial). We also discuss extensions to multi-label classification, regression, learning from experts and bandit settings with stochastic side information, and application to games.
ICML Exploration-Exploitation Challenge [65] , [63] . Olivier Nicol and Jérémie Mary won the ICML challenge on Exploration and Exploitation 2, organized by Cambridge on a dataset provided by Adobe. The winning approach is based on ideas close to Bayesian networks and Thompson sampling, in the spirit of Ad Predictor from Microsoft. This kind of success emphasizes the need for a better theoretical analysis of these frameworks. The challenge was also a good occasion to think about the best way to evaluate online policies (this part also attracts interest from Orange Labs). A paper has been submitted to JMLR.
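
As an illustration of the optimistic optimization item in the list above, the following Python sketch implements a DOO-style search on the interval [0, 1]. It assumes the semi-metric is known and of the form L|x - y|, splits each cell into three equal parts, and always expands the cell with the largest upper bound (the value at the center plus L times the half-width). The function name `doo`, the trisection scheme and the test function are assumptions made for this example and are not taken from the report.

```python
import heapq


def doo(f, lipschitz, budget, lo=0.0, hi=1.0):
    """DOO-style optimistic maximization of f on [lo, hi] (illustrative sketch).

    Assumes |f(x) - f(y)| <= lipschitz * |x - y| near the maximum.
    Returns the best evaluated point and its value.
    """
    center = (lo + hi) / 2.0
    value = f(center)
    evaluations = 1
    best_x, best_value = center, value
    # Max-heap on the upper bound, stored as negated keys for heapq.
    leaves = [(-(value + lipschitz * (hi - lo) / 2.0), lo, hi)]
    while evaluations + 3 <= budget:
        _, a, b = heapq.heappop(leaves)   # most optimistic cell
        third = (b - a) / 3.0
        for i in range(3):
            c_lo, c_hi = a + i * third, a + (i + 1) * third
            x = (c_lo + c_hi) / 2.0
            y = f(x)
            evaluations += 1
            if y > best_value:
                best_x, best_value = x, y
            upper = y + lipschitz * third / 2.0
            heapq.heappush(leaves, (-upper, c_lo, c_hi))
    return best_x, best_value


# Hypothetical test function with a single maximum at x = 0.3.
x_star, f_star = doo(lambda x: 1.0 - abs(x - 0.3), lipschitz=1.0, budget=60)
print(x_star, f_star)
```

SOO, as described in the same item, removes the need for the `lipschitz` argument. That variant is not shown here.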

Planning

Optimistic Planning for Sparsely Stochastic Systems [17] . We propose an online planning algorithm for finite action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
Optimistic Planning in Markov decision processes [46] . We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value or upper confidence bounds, in the stochastic case. The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
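
The receding-horizon scheme described above can be sketched in Python for the deterministic case. The sketch assumes rewards in [0, 1], so that a sequence of depth d with discounted return u has upper bound u + γ^d / (1 - γ). It repeatedly expands the leaf with the largest upper bound and returns the first action of the best sequence found. The names `optimistic_planning` and `chain_step`, and the five-state chain, are hypothetical and serve only as an example.

```python
import heapq
from itertools import count


def optimistic_planning(state, step, actions, gamma, n_expansions):
    """Optimistic planning sketch for a deterministic system.

    step(state, action) must return (next_state, reward) with reward in [0, 1].
    Returns the first action of the best action sequence found.
    """
    tie = count()  # tie-breaker so that states are never compared
    # Leaf entries: (-upper_bound, tie, return_so_far, depth, state, first_action)
    leaves = [(-1.0 / (1.0 - gamma), next(tie), 0.0, 0, state, None)]
    best_return, best_first = -1.0, None
    for _ in range(n_expansions):
        _, _, u, depth, s, first = heapq.heappop(leaves)
        for a in actions:
            s2, r = step(s, a)
            u2 = u + gamma ** depth * r
            first2 = a if first is None else first
            if u2 > best_return:
                best_return, best_first = u2, first2
            upper = u2 + gamma ** (depth + 1) / (1.0 - gamma)
            heapq.heappush(leaves, (-upper, next(tie), u2, depth + 1, s2, first2))
    return best_first


def chain_step(position, action):
    """Hypothetical chain on positions 0..4; reward 1 when landing on 4."""
    nxt = min(4, max(0, position + action))
    return nxt, (1.0 if nxt == 4 else 0.0)


print(optimistic_planning(2, chain_step, actions=(-1, 1), gamma=0.9, n_expansions=50))
```

In closed loop, the returned action would be applied and planning restarted from the next state, which is what makes the overall scheme a form of model-predictive control.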

Applications

Management of ad campaigns on the web

Further work has been dedicated to optimizing ad campaigns on the web under real-time constraints, in a dynamic environment [9] .
