Section: Research Program
Decision-making Under Uncertainty
The phrase “decision under uncertainty” refers to the problem of making decisions when we have full knowledge neither of the situation nor of the consequences of our decisions, or when those consequences are non-deterministic.
We introduce two specific sub-domains: Markov decision processes, which model sequential decision problems, and bandit problems.
Reinforcement Learning
Sequential decision processes occupy the heart of the SequeL project; a detailed presentation of this problem may be found in Puterman's book [41].
A Markov Decision Process (MDP) is defined as the tuple $(\mathcal{X}, \mathcal{A}, P, r)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the probabilistic transition kernel and $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function; for simplicity we assume here that the state and action spaces are finite.
In the MDP $(\mathcal{X}, \mathcal{A}, P, r)$, at each time step $t$ the agent observes the current state $x_t \in \mathcal{X}$ and chooses an action $a_t \in \mathcal{A}$; the process then moves to a next state $x_{t+1}$ drawn from $P(\cdot \mid x_t, a_t)$ and the agent receives the reward $r(x_t, a_t)$.
The history of the process up to time $t$ is the sequence $(x_0, a_0, x_1, a_1, \ldots, x_t)$; a policy $\pi$ maps histories to (possibly random) actions, and a policy is called stationary when the chosen action depends only on the current state.
We move from an MD process to an MD problem by formulating the goal of the agent, that is, what the sought policy $\pi$ has to optimize. A standard criterion is the expected discounted sum of rewards, which defines the value function of a policy $\pi$:
$$V^{\pi}(x) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(x_t, a_t) \,\Big|\, x_0 = x,\ a_t = \pi(x_t)\Big],$$
where $\gamma \in [0, 1)$ is a discount factor weighting immediate against future rewards.
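To make this criterion concrete, the following short Python sketch (our own illustration; the toy transition kernel, reward table, fixed policy and discount factor are all arbitrary assumptions) estimates $V^{\pi}(x_0)$ by Monte Carlo simulation of the discounted return:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (made-up numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[x, a, x'] = transition probability
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],                 # r[x, a] = immediate reward
              [0.0, 2.0]])
policy = np.array([0, 1])                 # a fixed deterministic stationary policy
gamma = 0.95

def monte_carlo_value(x0, n_episodes=2000, horizon=200):
    """Estimate V^pi(x0) = E[ sum_t gamma^t r(x_t, a_t) ] by simulating trajectories."""
    total = 0.0
    for _ in range(n_episodes):
        x, ret, discount = x0, 0.0, 1.0
        for _ in range(horizon):          # gamma^horizon is negligible here
            a = policy[x]
            ret += discount * r[x, a]
            x = rng.choice(2, p=P[x, a])
            discount *= gamma
        total += ret
    return total / n_episodes

print(monte_carlo_value(x0=0))
```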
In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP) [35], which introduces the optimal value function $V^*$, defined in each state $x$ as the best expected discounted sum of rewards achievable when starting from $x$. Two notions follow:
- We say that a policy $\pi$ is optimal if it attains the optimal values in every state $x$, i.e., if $V^{\pi}(x) = V^*(x)$ for all $x \in \mathcal{X}$. Under mild conditions, deterministic stationary optimal policies exist [36]. Such an optimal policy is written $\pi^*$.
- We say that a (deterministic stationary) policy $\pi$ is greedy with respect to (w.r.t.) some function $V$ (defined on $\mathcal{X}$) if, for all $x \in \mathcal{X}$,
$$\pi(x) \in \arg\max_{a \in \mathcal{A}} \Big[ r(x, a) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x, a)\, V(x') \Big],$$
where $\arg\max_{a \in \mathcal{A}} f(a)$ is the set of $a \in \mathcal{A}$ that maximize $f(a)$. For any function $V$, such a greedy policy always exists because $\mathcal{A}$ is finite.
The goal of Reinforcement Learning (RL), as well as that of dynamic programming, is to design an optimal policy (or a good approximation of it).
The well-known Dynamic Programming equation (also called the Bellman equation) provides a relation between the optimal value function at a state $x$ and the optimal value function at the successor states:
$$V^*(x) = \max_{a \in \mathcal{A}} \Big[ r(x, a) + \gamma \sum_{x' \in \mathcal{X}} P(x' \mid x, a)\, V^*(x') \Big].$$
The benefit of introducing this concept of optimal value function relies on the property that, from the optimal value function $V^*$, it is easy to derive an optimal behaviour: any policy greedy w.r.t. $V^*$ is an optimal policy.
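As an illustration of how this equation is used, here is a minimal value-iteration sketch in Python (our own illustration, not code from the project): it repeatedly applies the Bellman backup to a small made-up MDP, given as a transition array P and a reward table r, and then returns a policy greedy w.r.t. the resulting value function. The toy numbers, the discount factor and the tolerance are arbitrary assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Iterate the Bellman backup V <- max_a [ r(x,a) + gamma * sum_x' P(x'|x,a) V(x') ].

    P: array of shape (X, A, X), P[x, a, x'] = transition probability.
    r: array of shape (X, A), immediate rewards.
    Returns the (approximately) optimal value function and a greedy policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * (P @ V)          # Q[x, a] = r(x,a) + gamma * E[V(x') | x, a]
        V_new = Q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)       # value function and a policy greedy w.r.t. it

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, r)
print(V_star, pi_star)
```

Since a policy greedy w.r.t. $V^*$ is optimal, the returned policy is near-optimal once the backup has converged.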
In short, most of the reinforcement learning methods developed so far are built on one (or both) of the following two approaches [47]:
- Bellman's dynamic programming approach, based on the introduction of the value function. It consists in learning a “good” approximation of the optimal value function, and then using it to derive a greedy policy w.r.t. this approximation. The hope (well justified in several cases) is that the performance $V^{\pi}$ of the policy $\pi$ greedy w.r.t. an approximation $V$ of $V^*$ will be close to optimality. This approximation issue of the optimal value function is one of the major challenges inherent to the reinforcement learning problem. Approximate dynamic programming addresses the problem of estimating performance bounds (e.g. the loss in performance resulting from using a policy greedy w.r.t. some approximation $V$, instead of an optimal policy) in terms of the error made when approximating the optimal value function $V^*$ by $V$. Approximation theory and statistical learning theory provide us with bounds in terms of the number of sample data used to represent the functions, and the capacity and approximation power of the considered function spaces.
- Pontryagin's maximum principle approach, based on a sensitivity analysis of the performance measure w.r.t. some control parameters. This approach, also called direct policy search in the reinforcement learning community, aims at directly finding a good feedback control law in a parameterized policy space without trying to approximate the value function. The method consists in estimating the so-called policy gradient, i.e. the sensitivity of the performance measure (the value function) w.r.t. some parameters of the current policy: the optimal control problem is replaced by a parametric optimization problem in the space of parameterized policies, and a policy gradient estimate drives a stochastic gradient method that searches for a locally optimal parametric policy.
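To give a concrete flavour of this second approach, the sketch below (our own illustration; the env_step interface, the toy environment, the horizon and the step size are hypothetical) performs one REINFORCE-style update: it rolls out the current softmax policy, forms the likelihood-ratio estimate of the policy gradient, and takes a stochastic gradient ascent step on the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, x):
    """Action probabilities in state x for a tabular softmax policy with parameters theta."""
    prefs = theta[x] - theta[x].max()
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_step(theta, env_step, x0, gamma=0.99, horizon=50, lr=0.05):
    """One stochastic-gradient (REINFORCE) update of the policy parameters.

    env_step(x, a) -> (reward, next_state) is a placeholder for the unknown MDP.
    The gradient estimate is sum_t grad log pi(a_t | x_t) * (discounted return from t).
    """
    states, actions, rewards = [], [], []
    x = x0
    for _ in range(horizon):                      # roll out the current policy
        p = softmax_policy(theta, x)
        a = int(rng.choice(len(p), p=p))
        reward, x_next = env_step(x, a)
        states.append(x)
        actions.append(a)
        rewards.append(reward)
        x = x_next
    # Accumulate grad log pi(a_t | x_t) weighted by the discounted return-to-go G_t.
    G = 0.0
    grad = np.zeros_like(theta)
    for t in reversed(range(horizon)):
        G = rewards[t] + gamma * G
        p = softmax_policy(theta, states[t])
        glog = -p                                 # d log pi(a|x) / d theta[x, .] for softmax
        glog[actions[t]] += 1.0
        grad[states[t]] += glog * G
    return theta + lr * grad                      # stochastic gradient ascent on the value

# Toy 2-state environment used only to exercise the sketch (reward 1 when a == x).
def toy_env(x, a):
    return (1.0 if a == x else 0.0), (x + 1) % 2

theta = np.zeros((2, 2))                          # 2 states, 2 actions
for _ in range(200):
    theta = reinforce_step(theta, toy_env, x0=0)
print(softmax_policy(theta, 0), softmax_policy(theta, 1))
```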
Finally, many extensions of Markov decision processes exist, among which Partially Observable MDPs (POMDPs) cover the case where the current observation does not contain all the information required to decide for sure which action is best.
Multi-armed Bandit Theory
Bandit problems illustrate the fundamental difficulty of decision making in the face of uncertainty: a decision maker must choose between sticking with what currently seems to be the best option (“exploit”) and testing (“explore”) some alternative, hoping to discover an option that beats the current best one.
The classical example of a bandit problem is deciding which treatment to give each patient in a clinical trial when the effectiveness of the treatments is initially unknown and the patients arrive sequentially. Bandit problems became popular with the seminal paper [42], after which they found applications in diverse fields such as control, economics, statistics, and learning theory.
Formally, a $K$-armed bandit problem ($K \ge 2$) is specified by $K$ real-valued reward distributions, one per arm. At each round, the decision maker selects an arm and receives a reward drawn, independently of the past, from the distribution of the chosen arm. The distributions are initially unknown, and the goal is to maximize the sum of rewards received or, equivalently, to minimize the regret, i.e. the loss incurred by not always pulling the best arm.
The name “bandit” comes from imagining a gambler playing with K slot machines. The gambler can pull the arm of any of the machines, which produces a random payoff as a result: when arm k is pulled, the random payoff is drawn from the distribution associated with arm k. Since the payoff distributions are initially unknown, the gambler must use exploratory actions to learn the utility of the individual arms. However, exploration has to be carefully controlled since excessive exploration may lead to unnecessary losses. Hence, to play well, the gambler must carefully balance exploration and exploitation. Auer et al. [34] introduced the UCB (Upper Confidence Bounds) algorithm, which follows what is now called the “optimism in the face of uncertainty” principle. Their algorithm works by computing upper confidence bounds for all the arms and then choosing the arm with the highest such bound. They proved that the expected regret of their algorithm increases at most at a logarithmic rate with the number of trials, and that the algorithm achieves the smallest possible regret up to some sub-logarithmic factor (for the considered family of distributions).
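To illustrate the optimism-in-the-face-of-uncertainty principle in code, here is a minimal UCB1-style sketch in Python (our own illustration, not the authors' implementation); the Bernoulli arms, their means and the number of rounds are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb1(arm_means, n_rounds=10_000):
    """Play a Bernoulli K-armed bandit with the UCB1 index mean_k + sqrt(2 log t / n_k)."""
    K = len(arm_means)
    counts = np.zeros(K)                 # number of pulls of each arm
    sums = np.zeros(K)                   # cumulative reward of each arm
    best_mean = max(arm_means)
    regret = 0.0
    for t in range(1, n_rounds + 1):
        if t <= K:
            k = t - 1                    # initialization: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            k = int(np.argmax(ucb))      # optimism: choose the highest upper bound
        reward = float(rng.random() < arm_means[k])   # Bernoulli payoff
        counts[k] += 1
        sums[k] += reward
        regret += best_mean - arm_means[k]            # expected (pseudo-)regret
    return regret

print(ucb1([0.3, 0.5, 0.55]))   # the cumulative regret grows only logarithmically in n_rounds
```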