Section: Scientific Foundations
Multi-armed bandit problems, prediction with limited feedback
We are interested in settings in which the feedback obtained on the predictions is limited, in the sense that it does not fully reveal what actually happened.
This is again a sequential problem in which some notion of regret is to be minimized.
However, this problem is stochastic in nature: a large number of arms, possibly indexed by a continuous set such as an interval of the real line, is available, and each arm is associated with a fixed but unknown payoff distribution. At each round, the player chooses an arm, a payoff is drawn at random according to the distribution associated with that arm, and the only feedback the player gets is the value of this payoff. The key quantity in the study of this problem is the mean-payoff function, which indicates for each arm the expected payoff of the distribution associated with it. The target is to minimize the regret, i.e., to ensure that the difference between the cumulative payoff of the best arm and the cumulative payoff obtained by the player remains small.
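As a concrete illustration of this stochastic setting with finitely many arms, here is a minimal sketch of the classical UCB1 strategy (an optimism-in-the-face-of-uncertainty index policy); the Bernoulli arm probabilities and the horizon used below are illustrative assumptions, not taken from the text.

```python
import math
import random

def ucb1(arm_means, horizon, rng=random.Random(0)):
    """Minimal UCB1 for Bernoulli arms: `arm_means` holds the unknown
    success probabilities; the learner only observes sampled payoffs."""
    n = len(arm_means)
    counts = [0] * n          # number of pulls of each arm
    sums = [0.0] * n          # cumulative payoff of each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n:
            a = t - 1         # pull each arm once to initialize
        else:
            # optimism: empirical mean plus an exploration bonus
            a = max(range(n),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        payoff = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += payoff
        total += payoff
    # regret: shortfall with respect to always playing the best arm
    regret = horizon * max(arm_means) - total
    return counts, regret
```

Run on three arms, the index concentrates pulls on the arm with the highest mean, so the regret grows only logarithmically with the horizon rather than linearly.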
A generalization of regret: the approachability of sets
Approachability is the ability to control random walks. At each round, the first player obtains a vector payoff that depends on both his action and the action of the opponent player. The aim is to ensure that the average of the vector payoffs converges to some given convex set. Necessary and sufficient conditions for such strategies to exist were obtained by Blackwell and others, both in the full-information and in the bandit cases.
Some of these results can be extended to games with signals (games with partial monitoring), where at each round the only feedback obtained by the first player is a random signal drawn according to a distribution that depends on the action profile of the two players, while the opponent still enjoys full monitoring.
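To make the partial-monitoring feedback structure concrete, here is a toy game in the spirit of the textbook "apple tasting" example; the loss and signal matrices below are illustrative assumptions, not taken from the text, and the signals are deterministic for simplicity (in general each signal is drawn at random from a distribution indexed by the action pair).

```python
# Player actions: 0 = inspect the item, 1 = accept it blindly.
# Opponent (nature) actions: 0 = good item, 1 = bad item.
LOSS = [[1.0, 1.0],        # inspecting always costs 1
        [0.0, 2.0]]        # blind acceptance costs 0 if good, 2 if bad
# Signal matrix: quality is revealed only when inspecting.
SIGNAL = [["good", "bad"],
          ["none", "none"]]

def feedback(a, b):
    """The first player's only feedback: the signal, never the loss."""
    return SIGNAL[a][b]
```

The point of the example is that the action with the lowest loss (blind acceptance) is also the least informative one, which is the core tension of partial monitoring.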