SequeL is a joint project with the LIFL (UMR 8022 of the CNRS, University of Lille 1, and University of Lille 3) and the LAGIS (a joint lab of the École Centrale de Lille and the University of Lille 1).

SequeL means “Sequential Learning”. As such, SequeL focuses on the task of learning in artificial systems (either hardware or software) that gather information over time. Such systems are called *(learning) agents* (or learning machines) in the following. These data may be used to estimate some parameters of a model, which in turn may be used to select actions in order to perform some long-term optimization task.

For the purpose of model building, the agent needs to represent information collected so far in some compact form and use it to process newly available data.

The acquired data may result from an observation process of an agent in interaction with its environment (the data thus represent a perception). This is the case when the agent makes decisions (in order to attain a certain objective) that impact the environment, and thus the observation process itself.

Hence, in SequeL, the term **sequential** refers to two aspects:

the **sequential acquisition of data**, from which a model is learned (supervised and unsupervised learning),

the **sequential decision-making task**, based on the learned model (reinforcement learning).

Examples of sequential learning problems include:

*Supervised learning* tasks deal with the prediction of some response given a certain set of observations of input variables and responses. New sample points keep on being observed.

*Clustering* tasks deal with grouping a flow of objects into clusters. The (unknown) number of clusters typically evolves over time, as new objects are observed.

*Control* tasks deal with learning a controller (a policy) for some system that has to be optimized (see ). We do not assume the availability of a model of the system to be controlled.

In all these cases, we mostly assume that the process can be considered stationary for at least a certain amount of time, and slowly evolving.

We wish to have anytime algorithms: at any moment, a prediction may be required or an action may be selected, making full use, and hopefully the best use, of the experience already gathered by the learning agent.

The perception of the environment by the learning agent (using its sensors) is generally neither the best one to make a prediction, nor the best one to take a decision (we deal with Partially Observable Markov Decision Problems). The perception thus has to be mapped in some way to a better, relevant state (or input) space.

Finally, an important issue of prediction regards its evaluation: how wrong may we be when we perform a prediction? For real systems to be controlled, this issue cannot simply be left unanswered.

To sum up, in SequeL, the main issues are:

the learning of a model: we focus on models that map some input space

the observation to state mapping,

the choice of the action to perform (in the case of sequential decision problems),

the performance guarantees,

the implementation of usable algorithms,

all that being understood in a *sequential* framework.

Renowned for its work on the exploration/exploitation trade-off, SequeL has also successfully applied its research in the “Exploration and Exploitation Challenge” organized at the International Conference on Machine Learning (ICML'11): SequeL's PhD students Olivier Nicol and Christophe Salperwyck ranked first and second, respectively.

SequeL is primarily grounded in two domains:

the problem of decision under uncertainty,

statistical analysis and statistical learning, which provide the general concepts and tools to solve this problem.

To help the reader who is unfamiliar with these questions, we briefly present key ideas below.

The phrase “decision under uncertainty” refers to the problem of taking decisions when we have full knowledge neither of the situation nor of the consequences of the decisions, as well as when the consequences of a decision are non-deterministic.

We introduce two specific sub-domains, namely Markov decision processes, which model sequential decision problems, and bandit problems.

Sequential decision processes occupy the heart of the SequeL project; a detailed presentation of this problem may be found in Puterman's book .

A Markov Decision Process (MDP) is defined as the tuple

In the MDP (

The history of the process up to time

We move from an MD process to an MD problem by formulating the goal of the agent, that is what the sought policy

where

In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP)
, which introduces the optimal value function

We say that a policy
*i.e.*, if

We say that a (deterministic stationary) policy

where

The goal of Reinforcement Learning (RL), as well as that of dynamic programming, is to design an optimal policy (or a good approximation of it).

The well-known Dynamic Programming equation (also called the Bellman equation) provides a relation between the optimal value function at a state

The benefit of introducing this concept of optimal value function lies in the property that, from the optimal value function

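The Bellman backup and the greedy policy extraction described above can be sketched with value iteration on a toy MDP; all numbers (transitions, rewards, discount) below are invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (toy numbers for illustration).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')]
    Q = R + gamma * P @ V          # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Once V* is (approximately) known, an optimal policy is greedy w.r.t. it.
policy = Q.argmax(axis=1)
```

The greedy step in the last line is exactly the property mentioned in the text: an optimal policy can be read off the optimal value function.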
In short, we would like to mention that most of the reinforcement learning methods developed so far are built on one (or both) of the two following approaches ( ):

Bellman's dynamic programming approach, based on the introduction of the value function. It consists in learning a “good” approximation of the optimal value function,
and then using it to derive a greedy policy w.r.t. this approximation. The hope (well justified in several cases) is that the performance
**Approximate dynamic programming** addresses the problem of estimating performance bounds (*e.g.* the loss in performance

Pontryagin's maximum principle approach, based on a sensitivity analysis of the performance measure w.r.t. some control parameters. This approach, also called **direct policy search** in the reinforcement learning community, aims at directly finding a good feedback control law in a parameterized policy space without trying to approximate the value function. The method consists in estimating the so-called **policy gradient**, *i.e.* the sensitivity of the performance measure (the value function) w.r.t. some parameters of the current policy. The idea is that an optimal control problem is replaced by a parametric optimization problem in the space of parameterized policies. As such, deriving a policy gradient estimate leads to performing a stochastic gradient method in order to search for a locally optimal parametric policy.
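As a toy illustration of this policy-gradient idea, the following sketch runs a REINFORCE-style stochastic gradient ascent on a two-action problem with a softmax-parameterized policy; the reward means, step sizes, and baseline are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)               # one parameter per action (softmax policy)
means = np.array([0.2, 0.8])      # hypothetical true mean rewards
alpha = 0.1
baseline = 0.0                    # running reward average (variance reduction)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = rng.normal(means[a], 0.1)
    # REINFORCE: grad log pi(a) = e_a - p; ascend the expected reward.
    grad_log = -p
    grad_log[a] += 1.0
    theta += alpha * (r - baseline) * grad_log
    baseline += 0.05 * (r - baseline)
```

After training, the softmax policy concentrates its probability mass on the higher-reward action, i.e. the parametric optimization has found a (locally) optimal policy.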

Finally, many extensions of Markov decision processes exist, among which Partially Observable MDPs (POMDPs) cover the case where the current state does not contain all the information required to decide with certainty on the best action.

Bandit problems illustrate the fundamental difficulty of decision making in the face of uncertainty: a decision maker must choose between what currently seems to be the best choice (“exploit”) and testing (“explore”) some alternative, hoping to discover a choice that beats the current best one.

The classical example of a bandit problem is deciding which treatment to give each patient in a clinical trial when the effectiveness of the treatments is initially unknown and the patients arrive sequentially. These bandit problems became popular with the seminal paper , after which they found applications in diverse fields such as control, economics, statistics, and learning theory.

Formally, a K-armed bandit problem (
*i.e.*, when the arm giving the highest expected reward is pulled all the time.

The name “bandit” comes from imagining a gambler playing with K slot machines. The gambler can pull the arm of any of the machines, which produces a random payoff as a result: when arm k is pulled, the random payoff is drawn from the distribution associated with k. Since the payoff distributions are initially unknown, the gambler must use exploratory actions to learn the utility of the individual arms. However, exploration has to be carefully controlled, since excessive exploration may lead to unnecessary losses. Hence, to play well, the gambler must carefully balance exploration and exploitation. Auer *et al.* introduced the UCB (Upper Confidence Bounds) algorithm, which follows
what is now called the “optimism in the face of uncertainty principle”. Their algorithm works by computing upper confidence bounds for all the arms and then choosing the arm with the highest
such bound. They proved that the expected regret of their algorithm increases at most at a logarithmic rate with the number of trials, and that the algorithm achieves the smallest possible
regret up to some sub-logarithmic factor (for the considered family of distributions).
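The UCB index just described can be sketched in a few lines; the Bernoulli arm means below are hypothetical.

```python
import math
import random

random.seed(0)
means = [0.3, 0.5, 0.7]          # hypothetical Bernoulli arm means
K = len(means)
counts = [0] * K                 # number of pulls per arm
sums = [0.0] * K                 # cumulative payoff per arm

def pull(k):
    """Draw a Bernoulli payoff from arm k."""
    return 1.0 if random.random() < means[k] else 0.0

T = 10000
for t in range(1, T + 1):
    if t <= K:
        k = t - 1                # pull each arm once to initialize
    else:
        # UCB1 index: empirical mean + sqrt(2 ln t / n_k);
        # optimism in the face of uncertainty.
        k = max(range(K), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))
    r = pull(k)
    counts[k] += 1
    sums[k] += r
```

Under this rule, suboptimal arms are pulled only O(log T) times, which is the logarithmic regret rate mentioned above.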

Many of the problems of machine learning can be seen as extensions of classical problems of mathematical statistics to their (extremely) non-parametric and model-free cases. Other machine learning problems are founded on such statistical problems. Statistical problems of sequential learning are mainly those that are concerned with the analysis of time series. These problems are as follows.

Given a series of observations

Alternatively, rather than making some assumptions on the data, one can change the goal: the predicted probabilities should be asymptotically as good as those given by the best reference predictor from a certain pre-defined set.

Given a series of observations of

The problem of hypothesis testing can also be studied in its general formulation: given two (abstract) hypotheses

The problem of clustering, while being a classical problem of mathematical statistics, belongs to the realm of unsupervised learning. For time series, this problem can be formulated as
follows: given several samples

Before detailing some issues of statistical learning, let us recall the definitions of a few terms.

*Machine learning* refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness. (source:
http://

*Statistical machine learning* is an approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms, or to using heuristics/“common sense” to design an algorithm.

Generally speaking, a kernel function is a function that maps a pair of points to a real value. Typically, this value is a measure of similarity between the two points. Assuming a few properties on it, the kernel function implicitly defines a dot product in some function space. This formal property, along with several others, has ensured a strong appeal for these methods over the last ten years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce a representation of the data in a certain “suitable” space, where the problem to solve (classification, regression, ...) is expected to be simpler (non-linearity turns into linearity).

The fundamental tools used in SequeL come from the field of statistical learning . We briefly present the most important ones for us to date, namely kernel-based non-parametric function approximation and non-parametric Bayesian models.

In statistics in general, and in applied mathematics, the approximation of a multi-dimensional real function given some samples is a well-known problem (known as regression, interpolation, or function approximation, ...). Regressing a function from data is a key ingredient of our research, or at the least, a basic component of most of our algorithms. In the context of sequential learning, we have to regress a function while data samples are obtained one at a time, under the constraint of being able to predict points at any step along the acquisition process. In sequential decision problems, we typically have to learn a value function, or a policy.

Many methods have been proposed for this purpose. We are looking for suitable ones to cope with the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large; these are areas where the approximation is difficult, while these are also the areas where the accuracy of the approximation should be maximal to obtain a good policy (and where, otherwise, a bad choice of action may imply catastrophic consequences).

We particularly favor non-parametric methods, since they make few assumptions about the function to learn. In particular, we have strong interests in

Numerous problems in signal processing may be solved efficiently by way of a Bayesian approach. The use of Monte Carlo methods allows us to handle non-linear, as well as non-Gaussian, problems. In their standard form, they require the formulation of probability densities in a parametric form. For instance, it is common to use a Gaussian likelihood because it is convenient. However, in some applications such as Bayesian filtering or blind deconvolution, the choice of a parametric form for the density of the noise is often arbitrary. If this choice is wrong, it may also have dramatic consequences on the estimation quality. To overcome this shortcoming, one possible approach is to consider that this density must also be estimated from data. A general Bayesian approach then consists in defining a probabilistic space associated with the possible outcomes of the *object* to be estimated. Applied to density estimation, this means that we need to define a probability measure on the probability density of the noise: such a measure is called a *random measure*. The classical Bayesian inference procedures can then be used. This approach being non-parametric by nature, the associated framework is called *non-parametric Bayesian*.

In particular, mixtures of Dirichlet processes provide a very powerful formalism. Dirichlet processes are one possible random measure, and mixtures of Dirichlet processes are an extension of the well-known finite mixture models. Given a mixture density

where

Given a set of observations, the estimation of the parameters of a mixture of Dirichlet processes is performed by way of a Markov chain Monte Carlo (MCMC) algorithm. Dirichlet process mixtures are also widely used in clustering problems: once the parameters of a mixture are estimated, they can be interpreted as the parameters of a specific cluster, defining a class as well. Dirichlet processes are well known within the machine learning community, and their potential in statistical signal processing still needs to be developed.
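To give a concrete flavor of the Dirichlet process as a random measure over partitions, the following sketch samples a clustering from its predictive rule, the Chinese restaurant process; this is only the prior-sampling step, not the MCMC posterior inference mentioned above, and the concentration parameter is arbitrary.

```python
import random

random.seed(0)

def crp_partition(n, alpha=1.0):
    """Sample a partition of n items from the Chinese restaurant process:
    item i joins an existing cluster with probability proportional to the
    cluster size, or opens a new cluster with probability proportional to
    alpha. The number of clusters is not fixed in advance."""
    clusters = []                 # current cluster sizes
    labels = []                   # cluster label of each item
    for _ in range(n):
        weights = clusters + [alpha]
        r = random.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(clusters):
            clusters.append(1)    # open a new cluster
        else:
            clusters[k] += 1
        labels.append(k)
    return labels

labels = crp_partition(100, alpha=2.0)
```

The number of clusters grows with the sample size (roughly as alpha·log n), which is why these priors suit clustering problems where the number of clusters is unknown and evolving.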

In the general multi-sensor multi-target Bayesian framework, an unknown (and possibly varying) number of targets whose states
*sets* and not vectors.

The random finite set theory provides a powerful framework to deal with these issues. Mahler's work on finite-set statistics (FISST) provides a mathematical framework to build multi-object densities and derive the Bayesian rules for state prediction and state estimation. Randomness in the number of objects and in their states is encapsulated into random finite sets (RFS), namely multi-target (state) sets

where:

SequeL aims at solving problems of prediction, as well as problems of optimal and adaptive control. As such, the application domains are very numerous.

The application domains have been organized as follows:

adaptive control,

signal analysis and processing,

functional prediction,

neuroscience.

Adaptive control is an important application of the research being done in SequeL. Reinforcement learning (RL) precisely aims at controlling the behavior of systems and may be used in situations with more or less information available. Of course, the more information, the better, in which case methods of (approximate) dynamic programming may be used . But reinforcement learning can also handle situations where the dynamics of the system are unknown, situations where the system is partially observable, and non-stationary situations. Indeed, in these cases the behavior is learned by interacting with the environment and thus naturally adapts to changes in the environment. Furthermore, the adaptive system may also take advantage of expert knowledge when available.

Clearly, the spectrum of potential applications is very wide: whenever an agent (a human, a robot, a virtual agent) has to take a decision, in particular when it lacks some of the information needed to take that decision, the problem enters the scope of our activities. To exemplify the potential applications, let us cite:

game software: in the 1990s, RL was the basis of a very successful Backgammon program, TD-Gammon, which learned to play at an expert level basically by playing a very large number of games against itself. Today, various games are studied with RL techniques.

many optimization problems that are closely related to operations research, but take the uncertainty and the stochasticity of the environment into account: see the job-shop scheduling or cellular phone frequency allocation problems, and resource allocation in general

we can also foresee that some progress may be made by using RL to design adaptive conversational agents, or system-level as well as application-level operating systems that adapt to their users habits.

More generally, these ideas fall into what adaptive control may bring to human beings, in making their life simpler, by being embedded in an environment that is made to help them, an idea phrased as “ambient intelligence”.

The sensor management problem consists in determining the best way to task several sensors when each sensor has many modes and search patterns. In the detection/tracking applications, the tasks assigned to a sensor management system are for instance:

detect targets,

track the targets in the case of a moving target and/or a smart target (a smart target can change its behavior when it detects that it is under analysis),

combine all the detections in order to track each moving target,

dynamically allocate the sensors in order to achieve the previous three tasks in an optimal way. The allocation of sensors, and their modes, thus defines the action space of the underlying Markov decision problem.

In the more general situation, some sensors may be localized at the same place while others are dispatched over a given volume. Tasking a sensor may include, at each moment, such choices as where to point and/or what mode to use. Tasking a group of sensors includes the tasking of each individual sensor but also the choice of collaborating sensors subgroups. Of course, the sensor management problem is related to an objective. In general, sensors must balance complex trade-offs between achieving mission goals such as detecting new targets, tracking existing targets, and identifying existing targets. The word “target” is used here in its most general meaning, and the potential applications are not restricted to military applications. Whatever the underlying application, the sensor management problem consists in choosing at each time an action within the set of available actions.

Sequential decision processes are also very well known in economics. They may be used as decision-aid tools, to help in the design of social benefits, or in the setting-up of plants (see , for such applications).

Applications of sequential learning in the field of signal processing are also very numerous. A signal is naturally sequential as it flows. It usually comes from the recording of the output of sensors, but the recording of any sequence of numbers may be considered a signal, such as the evolution of stock-exchange rates with respect to time and/or place, the number of customers at a mall entrance, or the number of connections to a web site. Signal processing has several objectives: predict, estimate, remove noise, characterize, or classify. The signal is often considered sequential: we want to predict, estimate, or classify a value (or a feature) at time

Signals may be processed in several ways. One of the best known is time-frequency analysis, in which the frequencies of a signal are analyzed with respect to time. This concept has been generalized to time-scale analysis, obtained by a wavelet transform. Both analyses are based on the projection of the original signal onto a well-chosen function basis. Signal processing is also closely related to probability, as the uncertainty inherent in many signals leads to considering them as stochastic processes: the Bayesian framework is actually one of the main frameworks within which signals are processed for many purposes. However, there exist alternatives, such as belief functions. Belief functions were introduced by Dempster a few decades ago and have been used successfully in recent years in fields where, for many years, probability had no alternative, such as classification. Belief functions can be viewed as a generalization of probabilities that can capture both imprecision and uncertainty. Belief functions are also closely related to data fusion, where once more they can be considered a serious alternative to probabilities.

One of the current trends in machine learning aims at dealing with data that are functions, rather than points or vectors. Generally speaking, functions represent a behavior (of a person, of an apparatus, or of an algorithm, or a response of a system, ...).

One application of functional prediction which is particularly emphasized these days, is the understanding of client behavior, either in material shops, or in virtual shops on the web. This understanding may then be used for different ends, such as the management of stocks according to sales, the proposition of products according to those already bought, the “instantaneous” management of some resource in the shop (advisors, cashiers, instant promotions, personalized advertisement, ...).

Machine learning methods may be used in at least two ways in neuroscience:

since machine learning methods rely heavily on statistics, they may be used, as in any other (experimental) scientific domain, to analyze experimental data,

since machine learning deals with inductive learning, that is, the ability to generalize from facts, an ability considered to be one of the basic components of “intelligence”, it may be considered a model of learning in living beings. In particular, the temporal-difference methods for reinforcement learning have strong ties with various concepts of psychology (Thorndike's law of effect and the Rescorla-Wagner law, to name the two most well known).
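The tie between temporal-difference learning and the Rescorla-Wagner law rests on a shared prediction-error update, which can be sketched as TD(0) on a toy three-state chain (the states, reward, and learning rate are invented):

```python
# Hypothetical 3-state chain: s0 -> s1 -> s2 (terminal), reward 1 on
# reaching s2; V(s) estimates the expected return from state s.
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0]

for _ in range(200):              # repeat the episode s0 -> s1 -> s2
    for s in (0, 1):
        s_next = s + 1
        r = 1.0 if s_next == 2 else 0.0
        # TD error delta = r + gamma V(s') - V(s): the prediction-error
        # signal analogous to the Rescorla-Wagner association update.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
```

Both updates move an estimate by a fraction of the discrepancy between what was predicted and what was observed; here the value estimates of the two non-terminal states converge to 1.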

In 2011, SequeL continued the development of software for computer games (notably Go) and also developed two novel libraries, for functional regression and for data mining.

We developed three main software programs for computer games:

**Crazy Stone** *is a top-level Go-playing program that has been developed by Rémi Coulom since 2005. Crazy Stone has won several major international Go tournaments in the past. In 2011, its strength improved to 5 dan on the KGS Go Server. It is distributed as a commercial product by Unbalance Corporation (Japan). Five months of work in 2011. URL: http://*

**Crazy Hanafuda** *is a new program to play the Japanese game of Hanafuda. Three weeks of work in 2011. Discussions are in progress for licensing it.*

**CLOP** *is a tool for automatic parameter optimization of game-playing programs. Distributed as freeware (GPL). One month of work in 2011. Available at: http://remi.coulom.free.fr/CLOP/*

A software package in C++ of algorithms for nonlinear functional data analysis, using our operator-valued kernel framework (see sec. ), is under development. A beta version of the software can be downloaded at: https://

The aim of this library is to grow and be shared in our scientific community, and also to be a software resource for our group.

A fully stand-alone library for data mining has been developed, including many classical algorithms for supervised and unsupervised learning. This library is available as an internal resource for the group.

The new results are organized in the following sections:

decision under uncertainty,

foundations of machine learning,

supervised learning,

signal processing (sensor networks),

other results.

In the domain of reinforcement learning and approximate dynamic programming, we identify two main lines of research.

The main objective here is to use tools from
*statistical learning theory*to derive finite-sample performance bounds for RL and ADP algorithms. The goal is to derive bounds on the performance of the policies induced by these
algorithms in terms of the number of simulation data and the capacity and approximation power of the considered function and policy spaces. The results of this study allow us to have a
better understanding of the functionality of these algorithms and help us to design them more efficiently. The main contributions to this research line in 2011 are:

**Classification-based Policy Iteration with a Critic.** In collaboration with Bruno Scherrer (INRIA Nancy - Grand Est, Team MAIA), we extended last year's work on classification-based policy iteration by adding a value function approximation component (a critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We presented a new RCPI algorithm, called *direct policy iteration with critic* (DPI-Critic), and provided its finite-sample analysis when the critic is based on the LSTD method. We also empirically evaluated the performance of DPI-Critic and compared it with DPI and LSPI on two benchmark reinforcement learning problems.

**Finite-Sample Analysis of Least-Squares Policy Iteration.** We extended last year's work on the finite-sample analysis of least-squares temporal-difference (LSTD) learning to the least-squares policy iteration (LSPI) algorithm. In particular, we analyzed how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derived a performance bound for the LSPI algorithm.

**Speedy Q-Learning.** We introduced a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We proved a PAC bound on the performance of SQL, which shows that for an MDP with

**Selecting the State-Representation in Reinforcement Learning.** We considered the problem of selecting the right state representation in a reinforcement learning problem. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or the probabilistic characteristics of the resulting MDP, one is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We proposed an algorithm that achieves this, with a regret of order

**Transfer from Multiple MDPs.** Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from the source tasks and include them in the training set used to solve a target task. In this work, we investigated the theoretical properties of this transfer method and introduced novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we reported illustrative experimental results on a continuous chain problem.

The main objective here is to devise, analyze, implement, and experiment with RL algorithms whose sample and computational complexities do not grow rapidly with the dimension of the state space. We have tackled this problem from two different angles:

**Exploiting the Regularities of the Problem.** In order to solve RL in high dimensions, we should exploit all the regularities of the problem at hand. *Smoothness* is the most common regularity. We continued our collaboration with Amir massoud Farahmand and Csaba Szepesvári at the University of Alberta, Canada, and Shie Mannor at Technion, Israel, on using regularization methods for automatic model selection for value function approximation in RL. We have devised and analyzed the first

*Sparsity* is another form of regularity that clearly plays a central role in the emerging theory of learning in high dimensions. We have worked on exploiting it in the *least-squares temporal-difference learning* (LSTD) algorithm .

**Random Projections.** We have looked into recent directions, popularized in compressive sensing, concerning the preservation of properties such as norms or inner products of high-dimensional objects when projected onto possibly much lower-dimensional random subspaces. We have studied the popular LSTD algorithm when a low-dimensional space is generated by a random projection from the high-dimensional space, and derived performance bounds for the resulting algorithm.
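The property exploited there, preservation of norms and inner products under random projections (the Johnson-Lindenstrauss phenomenon), can be checked numerically; the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10000, 400                 # ambient and projected dimensions (illustrative)
x = rng.standard_normal(D)
y = rng.standard_normal(D)

# Random projection with N(0, 1/d) entries: norms and inner products are
# preserved up to O(1/sqrt(d)) distortion with high probability.
A = rng.standard_normal((d, D)) / np.sqrt(d)
xp, yp = A @ x, A @ y

# Distortion of the inner product, relative to the vectors' scale.
rel_err = abs(xp @ yp - x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

The distortion shrinks as the projected dimension d grows, which is what makes working in the d-dimensional random subspace a sound substitute for the D-dimensional feature space.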

In the domain of planning and exploration-exploitation algorithms, we identify two main lines of research.

**Active Learning in Multi-Armed Bandit Problems.** This can be seen as an online allocation problem with several options and is closely related to the problem of *optimal experimental design* in statistics. The objective here is to allocate a fixed budget to a finite (or possibly infinite) number of options (arms) in order to achieve the best accuracy in estimating the quality of each option. In addition to having applications in a number of different fields, such as *online advertisement* and *personalized treatment*, this problem is of specific importance in RL, in which generating training data is usually expensive. In this framework, we have studied the following two problems: **1)** estimating the mean values of all the arms uniformly well in a multi-armed bandit setting, and **2)** identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. For each problem, we have developed algorithms with theoretical guarantees.

**Finite-Time Analysis of Stratified Sampling for Monte Carlo.** We consider the problem of stratified sampling for Monte Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata (intervals of the input domain) and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations, and compare its estimation quality to that of an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution-dependent bound
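The ideal allocation used as a comparator above can be sketched directly: sampling each stratum in proportion to w_k·σ_k minimizes the variance of the stratified estimate. All strata parameters below are invented, and the online algorithm of the paper would track the unknown σ_k via upper bounds instead of assuming them known.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, 0.3, 0.2])     # strata weights (hypothetical)
mu = np.array([0.0, 1.0, 2.0])    # strata means (unknown in practice)
sigma = np.array([0.1, 1.0, 3.0]) # strata standard deviations

n = 3000
# Oracle allocation: n_k proportional to w_k * sigma_k (at least 2 samples
# per stratum) minimizes the variance of the stratified estimator.
alloc = np.maximum((n * w * sigma / (w * sigma).sum()).astype(int), 2)

# Stratified estimate of the weighted mean: sum_k w_k * (sample mean of stratum k).
est = sum(w[k] * rng.normal(mu[k], sigma[k], alloc[k]).mean()
          for k in range(len(w)))
```

Note how the high-variance stratum receives most of the budget; a uniform allocation over the same budget would yield a noticeably larger estimation variance.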
**Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
.**We consider a global optimization problem of a
deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to
a semi-metric. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO,
that requires the knowledge of the semi-metric. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which
does not require the knowledge of the semi-metric under which f is smooth, and whose performance is almost as good as DOO optimally-fitted.
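A minimal sketch of the DOO idea on the unit interval, taking the semi-metric to be |x − y| with a known constant `lip` (both are assumptions of this sketch, which expands the cell with the best optimistic bound):

```python
def doo_maximize(f, budget, lip=1.0):
    """DOO sketch on [0, 1]: repeatedly split the cell whose optimistic
    bound f(center) + lip * radius is largest (smoothness assumed known)."""
    cells = [(0.0, 1.0, 0.5, f(0.5))]   # (lo, hi, center, f(center))
    evals = 1
    while evals + 2 <= budget:
        best = max(cells, key=lambda c: c[3] + lip * (c[1] - c[0]) / 2.0)
        cells.remove(best)
        lo, hi, m, _ = best
        for a, b in ((lo, m), (m, hi)):   # split the cell in two
            c = (a + b) / 2.0
            cells.append((a, b, c, f(c)))
            evals += 1
    value, point = max((c[3], c[2]) for c in cells)
    return point, value
```

On a function such as f(x) = 1 − |x − 0.3|, the expansions quickly concentrate around the maximum at x = 0.3.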

**Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences
.**We consider a Kullback-Leibler-based algorithm for the
stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas
and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms
with finite-time analyses (like UCB-type algorithms).
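The index used by such KL-based algorithms can be illustrated for Bernoulli rewards: the index of an arm is the largest mean q whose KL divergence from the empirical mean, scaled by the number of pulls, stays below a log t threshold, and it can be computed by bisection. This is a textbook sketch (the simple log t threshold is a simplification of the thresholds used in the analysis):

```python
import math
import random

def bern_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(p_hat, n, t):
    """Largest q >= p_hat with n * kl(p_hat, q) <= log(t), by bisection."""
    lo, hi = p_hat, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2.0
        if n * bern_kl(p_hat, mid) <= math.log(t):
            lo = mid
        else:
            hi = mid
    return lo

def kl_ucb(means, horizon, rng):
    """Play the arm with the largest KL index; rewards are Bernoulli."""
    k = len(means)
    counts = [1] * k
    sums = [float(rng.random() < means[i]) for i in range(k)]
    for t in range(k + 1, horizon + 1):
        i = max(range(k),
                key=lambda j: kl_ucb_index(sums[j] / counts[j], counts[j], t))
        sums[i] += float(rng.random() < means[i])
        counts[i] += 1
    return counts
```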

**Adaptive bandits: Towards the best history-dependent strategy
.**We consider multi-armed bandit games with possibly
adaptive opponents. We introduce models

**Pure Exploration in Finitely-Armed and Continuous-Armed Bandits
.**We consider the framework of stochastic multi-armed
bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple
regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when
the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of
pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite
number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in
mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can
be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the
separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff
functions.
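The simple-regret criterion can be illustrated with the most basic forecaster: explore the arms uniformly, then recommend the empirically best one. The Gaussian rewards below are an illustrative choice, not part of the analyzed setting.

```python
import random

def uniform_explore_recommend(means, rounds, rng):
    """Pure-exploration baseline: pull arms round-robin for `rounds`
    steps, then recommend the empirically best arm. The simple regret
    is the gap between the best mean and the recommended arm's mean."""
    k = len(means)
    sums = [0.0] * k
    counts = [0] * k
    for t in range(rounds):
        i = t % k
        sums[i] += rng.gauss(means[i], 1.0)
        counts[i] += 1
    rec = max(range(k), key=lambda i: sums[i] / counts[i])
    return rec, max(means) - means[rec]
```

Unlike the cumulative regret, the cost of the exploration rounds themselves does not appear in this criterion; only the quality of the final recommendation matters.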

**X-Armed Bandits
.**We consider a generalization of stochastic bandits where
the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is locally Lipschitz with respect to a dissimilarity function that is known to the decision
maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large
class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which
the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n.

**Learning with Stochastic Inputs and Adversarial Outputs
.**Most of the research in online learning is focused
either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which
samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs
may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic
inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and we analyze its learnability. In particular, we show that in binary
classification, given a hypothesis space

**ICML Exploration-Exploitation Challenge
,
.**Olivier Nicol and Jérémie Mary won the ICML Exploration and Exploitation Challenge 2, organized by Cambridge on a dataset provided by Adobe. The winning approach is based on ideas close to Bayesian networks and Thompson sampling, such as Microsoft's Ad
Predictor. This kind of success emphasizes the need for a better theoretical analysis of these frameworks. The challenge was also a good occasion to think about the best way
to evaluate online policies (this part also attracts interest from Orange Labs). A publication has been submitted to JMLR.
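The Thompson-sampling idea behind this family of approaches can be sketched for Bernoulli click-through rates with Beta priors (a textbook sketch, not the actual challenge code):

```python
import random

def thompson(ctrs, horizon, rng):
    """Beta-Bernoulli Thompson sampling: draw a click-through rate from
    each ad's Beta posterior, show the ad with the largest draw, and
    update that ad's posterior with the observed click."""
    k = len(ctrs)
    a = [1] * k  # Beta(1, 1) priors
    b = [1] * k
    clicks = 0
    for _ in range(horizon):
        i = max(range(k), key=lambda j: rng.betavariate(a[j], b[j]))
        click = 1 if rng.random() < ctrs[i] else 0
        clicks += click
        a[i] += click
        b[i] += 1 - click
    return clicks, a, b
```

Posterior sampling naturally trades off exploration and exploitation: an ad with few impressions still has a wide posterior and keeps being shown occasionally.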

**Optimistic Planning for Sparsely Stochastic Systems
.**We propose an online planning algorithm for finite
action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning
tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about
action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that
it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as
model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug
effectiveness.

**Optimistic Planning in Markov decision processes
.**We review a class of online planning algorithms for
deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning
policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive
control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value or upper confidence bounds, in the stochastic case. The
resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive
algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior
in a numerical example.
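The optimism principle in these planners can be illustrated on a deterministic system. This is a simplified sketch in the spirit of optimistic planning for deterministic systems, not the sparsely-stochastic algorithm above; the `step` interface, the reward range [0, 1], and the budget are assumptions of the sketch.

```python
def optimistic_plan(actions, step, state0, budget, gamma=0.9):
    """Grow a tree of action sequences, always expanding the leaf with
    the largest upper bound u + gamma^d / (1 - gamma), then return the
    first action of the best sequence found.
    `step(state, a) -> (next_state, reward)`, rewards in [0, 1]."""
    # leaves: (discounted return so far, depth, state, first action)
    leaves = [(0.0, 0, state0, None)]
    expansions = 0
    while expansions < budget:
        best = max(leaves, key=lambda l: l[0] + gamma ** l[1] / (1 - gamma))
        leaves.remove(best)
        u, d, s, first = best
        for a in actions:
            s2, r = step(s, a)
            leaves.append((u + (gamma ** d) * r, d + 1, s2,
                           a if first is None else first))
        expansions += 1
    return max(leaves, key=lambda l: l[0])[3]
```

On a simple chain where moving right yields reward, the planner returns the rightward action after a handful of expansions.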

More work has been dedicated to this topic, aiming at optimizing ad campaigns on the web under real-time constraints, in a dynamic environment.

The problem of sequence prediction consists in forecasting, at each time step

The realizable case of the sequence prediction problem is when the measure

We continue to obtain new results using the theoretical framework developed recently for studying stationary ergodic time series. This year, the new results obtained include a topological characterization of the composite hypotheses for which consistent tests exist, as well as new results on clustering.

An asymptotically consistent algorithm has been proposed for the problem of online clustering of time series. There is a growing body of time-series samples, each of which grows with
time. At each time step, it is required to group these time series according to the
*unknown* stationary ergodic distributions that generated them. An algorithm is proposed that, for each fixed portion of samples, eventually (with probability 1) puts into the same group those and only
those samples that were generated by the same distribution. The empirical performance of the algorithm is evaluated on synthetic and real data.
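The flavor of such clustering can be sketched with a truncated empirical distributional distance between samples and a farthest-point assignment. This is a simplification for illustration: the actual algorithm and its consistency analysis are in the cited work, and the pattern-length cut-off and weights here are illustrative choices.

```python
from collections import Counter

def pattern_freqs(seq, l):
    """Empirical frequencies of length-l patterns in a sequence."""
    grams = [tuple(seq[i:i + l]) for i in range(len(seq) - l + 1)]
    counts = Counter(grams)
    n = len(grams)
    return {g: c / n for g, c in counts.items()}

def seq_dist(s1, s2, max_l=2):
    """Truncated distributional distance: weighted sum, over pattern
    lengths, of total-variation distances between empirical frequencies."""
    d = 0.0
    for l in range(1, max_l + 1):
        p, q = pattern_freqs(s1, l), pattern_freqs(s2, l)
        keys = set(p) | set(q)
        d += 2.0 ** (-l) * 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0))
                                     for g in keys)
    return d

def cluster(seqs, k):
    """Pick k centers by farthest-point traversal, then assign each
    sample to its nearest center (number of clusters assumed known)."""
    centers = [0]
    while len(centers) < k:
        centers.append(max(range(len(seqs)),
                           key=lambda i: min(seq_dist(seqs[i], seqs[c])
                                             for c in centers)))
    return [min(centers, key=lambda c: seq_dist(s, seqs[c])) for s in seqs]
```

Samples generated by the same process have near-zero empirical distance, so they end up attached to the same center.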

**Sparse Recovery with Brownian Sensing
.**

We consider the problem of recovering the parameter

**Operator-valued Kernels for Nonlinear FDA
,
,
,
**Following the extension of RKHS to the functional setting
, we further developed this work
for functional supervised classification.

We introduced a set of rigorously defined operator-valued kernels that can be valuably applied to nonparametric operator learning when input and output data are continuous smooth
functions, and we have shown their use for solving the problem of minimizing a

The framework developed can also be applied when the input data are both discrete and continuous.

Our fully functional approach has been successfully applied to the problems of speech inversion and sound recognition, showing that the proposed framework is particularly relevant for audio signal processing applications, where attributes are truly functions and depend on each other.

This work is done in collaboration with Francis Bach (INRIA, Sierra), Alain Rakotomamonjy and Stéphane Canu (LITIS, Rouen).

**Datum-wise representation
,
.**We consider supervised classification. We introduce the
concept of datum-wise representation for supervised classification
. While traditional approaches yield a “best” representation at
the data space level, that is, the same representation is used for all the data, we proposed the idea, as well as an algorithm, that yields the “best” representation for each datum. Among
other appealing properties, this leads to a sparse representation of each datum, and, on average, a sparser representation over the data space. Along with a classifier, the learning
algorithm produces a “representer”, that is, a function that yields a representation given a datum.

We further improved this approach to encompass various settings that are traditionally treated separately (cost-sensitive classification and different forms of structured sparsity).

**Iso-regularization descent
.**Manuel Loth has defended his PhD dissertation
where he has provided a detailed presentation and analysis of
his algorithm to solve the LASSO. This algorithm is very efficient: it is an active-set algorithm that solves the LASSO by considering it as a convex problem with linear constraints.
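For reference, the LASSO objective that this active-set method solves can also be handled by standard coordinate descent with soft-thresholding. The sketch below implements that textbook method, not Loth's algorithm:

```python
def soft_threshold(z, g):
    """Soft-thresholding operator, the proximal map of the l1 norm."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, iters=200):
    """Coordinate descent for min_w 0.5 * ||y - X w||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # r = y - X w, with w = 0
    for _ in range(iters):
        for j in range(d):
            col = [X[i][j] for i in range(n)]
            norm2 = sum(c * c for c in col)
            # correlation of column j with the partial residual (w_j added back)
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n))
            w_new = soft_threshold(rho, lam) / norm2
            delta = w_new - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                w[j] = w_new
    return w
```

On an orthogonal design, one pass already yields the familiar shrunken solution, with irrelevant coefficients set exactly to zero.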

**Learning with few examples
,
.**Christophe Salperwyck has studied the performance of
various classifiers when few examples are available. This is an important point in incremental learning, and few studies have been devoted to this particular setting. The performance we are
accustomed to when examples are numerous degrades severely in this setting. For more details, please see
,
.

**Incremental discretization
.**In incremental learning, discretization should be adaptive
in order to cope with the values of the attributes that are observed. This issue is currently under study by Christophe Salperwyck
.
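One simple way to make discretization adaptive is to let equal-width bins rescale as new extreme values arrive, redistributing the old counts approximately. This is an illustrative sketch, not Salperwyck's method:

```python
class OnlineDiscretizer:
    """Sketch: equal-width bins whose range adapts to observed values;
    counts are redistributed (by bin center) when the range grows."""

    def __init__(self, n_bins=10):
        self.n = n_bins
        self.lo = self.hi = None
        self.counts = [0] * n_bins

    def _bin(self, x):
        if self.hi == self.lo:
            return 0
        i = int((x - self.lo) / (self.hi - self.lo) * self.n)
        return min(max(i, 0), self.n - 1)

    def update(self, x):
        if self.lo is None:
            self.lo = self.hi = x
        elif x < self.lo or x > self.hi:
            # range grew: rebuild counts on the new range, spreading each
            # old bin's count at its center (an approximation)
            old = [(self.lo + (i + 0.5) * (self.hi - self.lo) / self.n, c)
                   for i, c in enumerate(self.counts) if c]
            self.lo, self.hi = min(self.lo, x), max(self.hi, x)
            self.counts = [0] * self.n
            for center, c in old:
                self.counts[self._bin(center)] += c
        self.counts[self._bin(x)] += 1
        return self._bin(x)
```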

The aim of this work is to manage a set of sensors to track vehicles or groups of people in land applications. Our work focuses on sensor management in the framework of random finite sets, where the Probability Hypothesis Density (PHD) is a well-known method for single-sensor multi-target tracking problems in a Bayesian framework, but the extension to the multi-sensor case remains a challenge. We have proposed an extension of Mahler's work to the multi-sensor case by providing an expression of the true PHD multi-sensor data-update equation. Then, based on the configuration of the sensors' fields of view (FOVs), a joint partitioning of both the sensors and the state space provides an equivalent yet more practical expression of the data-update equation, allowing a more effective implementation in specific FOV configurations. This work is done in collaboration with Thales Communications. The multi-sensor / multi-target filtering problem using PHD filtering methods is developed in the PhD thesis of Emmanuel Delande, entitled "Multi-sensor PHD filtering with application to sensor management", which will be defended in December 2011. In addition, a new approach to sensor management using operational objectives, related to the type of application, has been proposed.

We obtained a PICS (International Project for Scientific Cooperation) from the CNRS in 2008 for 3 years to work in cooperation with the Department of Civil and Environmental Engineering of the University of Waterloo (Canada). During this cooperation we developed a method based on belief functions to track building materials on a construction site. Building on this cooperation, a new joint research project with the same department of the University of Waterloo was set up in 2011, and is currently submitted for funding. The topic of this project is the use of nonparametric Bayesian models in the area of Non-destructive Testing.

Today, Global Navigation Satellite Systems (GNSS) have penetrated the transport field through applications such as the monitoring of containers. These applications do not necessarily require high availability, integrity and accuracy of the positioning system. For safety applications (such as the complete guidance of autonomous vehicles), the performance requirements are more stringent. Indeed, sensors may deliver very erroneous measurements under harsh external conditions, which significantly reduce the possibility of receiving direct signals. The consequences of environmental obstructions are unavailability of the service and reception of reflected signals, which degrades in particular the accuracy of the positioning. NLOS (Non Line Of Sight) signals, i.e. signals received after reflections on the surrounding obstacles, frequently occur in dense environments and degrade localization accuracy because of the delays observed in the propagation-time measurement, which create additional error in the pseudorange estimation. In previous years we proposed new algorithms to improve the localization precision. These algorithms are based on two principles: a jump multi-model approach and a joint state / noise-density estimation. This year we focused on an approach using Dirichlet Process Mixtures to track the noise density in urban canyons while estimating the position of the vehicle. The algorithms have been validated on real data collected in a French town: Belfort. Nicolas Viandier defended his PhD on this subject in June 2011. These results will be presented at the Non-Parametric Bayes workshop of the NIPS conference in December 2011 and at the ICASSP 2011 conference.

The term "Internet of Things" has come to describe a number of technologies and research disciplines that enable the Internet to reach out into the real world of physical objects.
Technologies like RFID, short-range wireless communications, real-time localization and sensor networks are now becoming increasingly common, bringing the Internet of Things into commercial
use. In such applications the data sent by a
*thing* to another may generate an impulse noise in the reception channel of objects in the neighbourhood. The noise appearing in such applications can be considered as

Pierre Chainais arrived in SequeL in September 2010 with the purpose of a thematic evolution toward nonparametric Bayesian approaches. This represents an important investment in very new directions on an emerging topic at the interface between machine learning and signal/image processing. Discussions have begun with Emmanuel Duflos and Philippe Vanheeghe on the use of nonparametric Bayesian approaches for the blind deconvolution of noisy natural images. The main objective is to jointly exploit the typical structure and sparsity of space-scale representations of images.

Pierre Chainais has continued working on several older projects. One of them deals with the segmentation of nanotubes in microscopic imaging. B. Lebental at IFSTTAR works on the conception of new nano-sensors based on the use of carbon nanotubes to build a nano-membrane. P. Chainais has developed an image processing pipeline to analyse images of these nano-membranes so as to characterize their properties in a precise and objective manner. Among other properties, the histogram of orientations of the nanotubes is provided. This tool will be very useful since such nano-sensors are becoming more and more common.

In solar astronomy, we have proposed a tool for the virtual super-resolution of scale-invariant textured images. The aim of this project was to provide astronomers with plausible high-resolution images to calibrate next-generation space telescopes. In particular, our images can be used to optimize the compression algorithm to be embedded in a space telescope. In collaboration with M. Chevaldonné and J-M. Favreau (Université Clermont-Ferrand I), we are working on software for texture synthesis on 3D surfaces based on multifractal processes. A first version of the software is currently under development. More marginally, we work on the use of stochastic processes for the simulation of turbulent pressure fields in collaboration with M. Pachebat (Laboratoire de Mécanique et d'Acoustique de Marseille) and Nicolas Totaro (LVA, INSA Lyon).

Addressing Business develops software to help its clients (companies) find new clients. Currently, this software is an information system that helps the human decision maker.

The goal of this contract was to realize a first exploratory step towards using data mining techniques to handle, and possibly improve, this process. Confidentiality issues restrict the communication on this contract. However, this study has been very successful, and we look forward to further collaboration with this company on this topic.

There have been various activities between SequeL and Orange Labs.

First, the collaboration around the PhD of Christophe Salperwyck has continued. Second, a CRE was signed in 2011 to continue our work on web advertising and, more generally, collaborative filtering. On this topic, Sami Naamane was hired as a PhD student in Fall 2011.

We worked on the next steps of the optimization of the thermal control of buildings with Effigenie. We presented a joint project in June, and over the summer this start-up won the OSEO innovation prize (as well as the LMI prize).

Boris Baldassari has been hired by Squoring Technology (Toulouse) as a PhD student in May 2011. He works on the use of machine learning to improve the quality of the software development process.

2011 is the last year of the “Ubiquitous Virtual Seller” project of the “Pôle de compétitivité Industries du Commerce” (PICOM). We have completed our contribution related to recommendation systems. This work was done mostly in collaboration with Becquet and Oxylane.

In collaboration with Emmanuel Duflos, Pierre Chainais has been involved in supervising a Master 2 student within a collaboration with the company Qualisteo on pattern recognition in electric power consumption signals. A PhD grant is under consideration. The purpose is to learn how to identify the origin of the electric consumption of a house from the power consumption alone. This problem combines signal processing as well as machine learning questions. This project is still under discussion.

see .

The work on sensor management went on this year, focusing on the extension of the PHD filter to the multi-sensor case. This work is realized in the frame of the thesis of Emmanuel Delande (DGA/CNRS grant) in collaboration with Thales Communication. The defense of this PhD thesis will be held in December 2011.

Title: Learning Algorithms, Models and sPArse representations for structured DAta

Type: National Research Agency (ANR-09-EMER-007)

Coordinator: INRIA Lille - Nord Europe (Mostrare)

Other partners: Laboratoire d'Informatique Fondamentale de Marseille, Laboratoire Hubert Curien (Saint-Étienne), Laboratoire d'Informatique de Paris 6.

See also:
http://

Activity Report: Philippe Preux has continued his collaboration with Ludovic Denoyer (assistant professor, Université de Paris 6), Gabriel Dulac-Arnold (PhD student), and Patrick Gallinari (professor, Université de Paris 6). This led to the work on datum-wise representation.

Title: EXPLOration - EXPLOitation for efficient Resource Allocation with Applications to optimization, control, learning, and games

Type: National Research Agency

Coordinator: INRIA Lille - Nord Europe (SequeL, Rémi Munos)

Other partners: INRIA Saclay - Île-de-France (TAO), HEC Paris (GREGHEC), École Nationale des Ponts et Chaussées (CERTIS), Université Paris 5 (CRIP5), Université Paris Dauphine (LAMSADE).

Activity Report: We developed bandit algorithms for planning in Markov Decision Processes based on the optimism-in-the-face-of-uncertainty principle.

Title: Brain computer co-adaptation for better interfaces

Type: National Research Agency

Duration: 2009-2013

Partners: INRIA Odyssee project (Maureen Clerc), the INSERM U821 team (Olivier Bertrand), the Laboratory of Neurobiology of Cognition (CNRS) (Boris Burle) and the laboratory of Analysis, topology and probabilities (CNRS and University of Provence) (Bruno Torresani).

Activity Report: In collaboration with Maureen Clerc and her student Joan Fruitet, we proposed a new Brain-Computer Interface procedure to select online a discriminative motor task based on a bandit algorithm. Efficiently trading off between exploration (getting information about each motor task) and exploitation (selecting those that have the highest classification rates) makes it possible to reduce the duration of the training session.

Title: Multifractal Analysis and Applications to Signal and Image Processing

Type: National Research Agency

Duration: 2011-2015

Partners: Univ. Paris-Est Créteil, Univ. Sciences et Technologies de Lille and INRIA (Lille), ENST (Telecom ParisTech), Univ. Blaise Pascal (Clermont-Ferrand), Univ. Bretagne Sud (Vannes), the Statistical Signal Processing group of the Physics Department at the École Normale Supérieure de Lyon, one researcher from the Mathematics Department of the Institut National des Sciences Appliquées de Lyon, and two researchers from the Laboratoire d'Analyse, Topologie et Probabilités (LATP) of Aix-Marseille University.

Coordinator: Univ. Paris-Est-Créteil (S. Jaffard)

Activity Report: Ideas from the multifractal framework are the basis of our current work on the development of a new Bayesian approach to the blind deconvolution of noisy images.

INRIA Nancy - Grand Est, Team MAIA, France.

Bruno Scherrer
*Collaborator*

We have had a collaboration on the topic of
*approximate dynamic programming and statistical learning* and published a conference paper
and a technical report
this year.

LITIS : Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes.

Stéphane Canu
*Collaborator*

Emmanuel Duflos and Hachem Kadri are collaborating with Prof. Stéphane Canu on functional RKHS.

Participants: the whole SequeL team is involved

Title: Pattern Analysis, Statistical Modeling, and Computational Learning

Type: Cooperation (ICT), Network of Excellence (NoE)

Duration: March 2008 - February 2013

Coordinator: Univ. Southampton

Other partners: many European organizations, universities, and research centers.

See also:
http://

Title: Sparse Reinforcement Learning in High Dimensions

Type: PASCAL-2 Pump Priming Programme

Duration: November 2009 - March 2012

Partners: INRIA Lille - Nord Europe, Shie Mannor (Technion, Israel)

Title: Composing Learning for Artificial Cognitive Systems

Type: Cooperation (ICT), Specific Targeted Research Project (STREP)

Duration: March 2011 - February 2015

Coordinator: University College of London

Partners: University College London, United Kingdom (John Shawe-Taylor, Stephen Hailes, David Silver, Yee Whye Teh), University of Bristol, United Kingdom (Nello Cristianini), Royal Holloway, United Kingdom (Chris Watkins), Radboud Universiteit Nijmegen, The Netherlands (Bert Kappen), Technische Universität Berlin, Germany (Manfred Opper), Montanuniversität Leoben, Austria (Peter Auer), Max-Planck Institute of Biological Cybernetics, Germany (Jan Peters).

See also:
http://

Title: New Paradigms for Preventing uncontrolled social influence in the future web

Type: FET-Open Young Explorer Scheme

Duration:
*Submitted*

Coordinator: Politecnico di Milano (Nicola Gatti)

Partners: University of Southampton, United Kingdom (Valentin Robu, Enrico Gerding, Nick Jennings).

*Title*: Decision-making under Uncertainty with Applications to Reinforcement Learning, Control, and Games

*INRIA principal investigator*: Rémi Munos

*International Partner*:

*Institution*: University of Alberta (Canada)

*Laboratory*: Department of Computer Science

*Principal investigator*: Csaba Szepesvári

*Duration*: January 2010 - January 2013

*Website*:
http://

This associate team aims at bridging researchers from the SequeL team-project at INRIA Lille with the Department of Computing Science of the University of Alberta in Canada. Our common interest lies in machine learning, especially reinforcement learning, bandit algorithms and statistical learning with applications to control and computer games. The Department of Computing Science at the University of Alberta is internationally renowned as a leading research institute on these topics. The research work spans from theory to applications. Grounded in an already existing scientific collaboration, this associate team will make it easier to collaborate further between the two institutes, and thus strengthen this relationship. We foresee that the associate team will boost our collaboration, create new opportunities for financial support, and open up a long-term fruitful collaboration between the two institutes. The collaboration will take the form of organizing workshops and exchanging researchers, postdoctoral fellows, and Ph.D. students between the two institutes.

University of Alberta, Edmonton, Alberta, Canada.

Prof. Csaba Szepesvari
*Collaborator*

We have been working on the topic of
*regularized reinforcement learning* over the last four years. This year, we have one journal paper submitted
and one that will be submitted soon
on this topic. We are also coordinators of an
*INRIA associate team program* with the University of Alberta.

Amir massoud Farahmand
*Collaborator*

We have been working on the topic of
*regularized reinforcement learning* over the last five years. This year, we have one journal paper submitted
and one that will be submitted soon
on this topic.

Technion - Israel Institute of Technology, Haifa, Israel.

Prof. Shie Mannor
*Collaborator*

We have been collaborating on the topic of
*Bayesian reinforcement learning* for the last six years, on the topic of
*regularized reinforcement learning* for the last four years, and on the topic of
*reinforcement learning in high dimensions* over the last two years. On the first topic, we have a journal paper (survey) in preparation
this year. On the second topic, we have one journal paper
under review
and one in preparation
this year. Finally, on the third topic, we were co-PIs of
a
*PASCAL2 pump-priming program* that ended in June 2011.

University of Waterloo, Waterloo, Ontario, Canada.

Prof. Pascal Poupart
*Collaborator*

We have been collaborating on the topic of
*Bayesian reinforcement learning* over the last five years. This year, we have a journal paper in preparation
on this topic.

Politecnico di Milano, Italy.

Prof. Marcello Restelli
*Collaborator*

We have been working on the topic of
*transfer in reinforcement learning* over the last year. In particular, we have one conference paper
and a journal paper in preparation.

Prof. Nicola Gatti
*Collaborator*

We have started a collaboration on the topic of
*bandit mechanisms for sponsored-search auctions*. This year, we have submitted a paper to AAMAS
and we have collaborated on a proposal for a Marie Curie
ITN and a FET-Open Young Researcher proposal.

University of Southampton, United Kingdom.

Prof. Enrico Gerding
*Collaborator*

We have been working on the topic of
*learning and mechanism design* over the last year. In particular, we have collaborated on a proposal for a Marie Curie ITN and a FET-Open Young Researcher proposal.

Brahim Chaib-Draa, from Université Laval, Québec.

His visit has been funded by Université de Lille 3 where he also taught.

Mohammad G. Azar, Ph.D. student at University of Nijmegen, The Netherlands.

Period: April 2011 - July 2011

He worked with Rémi Munos and Mohammad Ghavamzadeh on performance analysis of reinforcement learning algorithms. The outcome of this collaboration has been a conference paper and a technical report so far.

Matthew Hoffman, Ph.D. student at University of British Columbia, Canada.

Period: October 2010 - April 2011.

He worked with Alessandro Lazaric, Rémi Munos, and Mohammad Ghavamzadeh on our PASCAL2 Pump-Priming project on
*sparse reinforcement learning in high dimensions*. The outcome of this collaboration has been a conference paper
so far.

Sébastien Bubeck received the second prize for the best French Ph.D. in Artificial Intelligence (AI prize 2011).

*R. Munos* co-organized with J.-Y. Audibert a tutorial on “Introduction to Bandits: Algorithms and Theory” (
https://

*R. Munos* co-organized the
*Machine Learning Summer School 2011* (MLSS'11) in Bordeaux (2 weeks of lectures for about 80 international students), with François Caron, Manuel Davy, Pierre Del Moral, Pierrick
Legrand, Manuel Lopes.

*R. Munos* co-organized (with Florence Forbes, Bernard Espiau and Monique Thonnat) the
*Journées INRIA autour de l'apprentissage statistique*, December 2011.

*M. Ghavamzadeh*, Max Planck Institute for Intelligent Systems, Tübingen, Host: Prof. Jan Peters (June 2011).

*M. Ghavamzadeh*, University of Liège - Systems & Modeling Research Unit, Host: Prof. Damien Ernst (June 2011).

*M. Ghavamzadeh*, University of Waterloo - School of Computer Science, Host: Prof. Pascal Poupart (November 2010).

*M. Ghavamzadeh*, McGill University - School of Computer Science, Host: Prof. Joelle Pineau (November 2010).

*M. Ghavamzadeh*, University of Alberta - AI Seminar, Host: Prof. Csaba Szepesvári (November 2010).

*A. Lazaric*, University of Liège - Systems & Modeling Research Unit, Host: Prof. Damien Ernst (October 2011).

*R. Munos*, University of Liège, Department of Electrical Engineering, February 2011.

*R. Munos*, ICAPS, workshop Monte-Carlo Tree Search, Freiburg, June 2011.

*R. Munos*, Machine learning Summer School in Bordeaux, September 2011.

*R. Munos*, Oxford, Department of Computer Science, November 2011.

*
Participation to the program committees of international conferences
*

*E. Duflos and P. Vanheeghe* were members of the Fusion'2011 International Program Committee.

*E. Duflos* is reviewing papers for the following journals: IEEE Transactions on Signal Processing, International Journal of Approximate Reasoning, and Information Fusion.

*P. Vanheeghe* is reviewing papers for the journal IEEE Transactions on Signal Processing.

*D. Ryabko*: UAI 2011.

*M. Ghavamzadeh*: International Joint Conference on Artificial Intelligence (IJCAI 2011), European Workshop on Reinforcement Learning (EWRL 2011), International Conference on
Artificial Neural Networks (ICANN 2011), National Conference on Artificial Intelligence (AAAI 2011).

*R. Munos*: Area chair for NIPS 2011.

*P. Preux*: ADPRL 2011, ICPRAM 2012, EGC 2011.

**International journal and conference reviewing activities** *(in addition to the conferences in which we belong to the PC)*

*M. Ghavamzadeh* is an Editorial Board Member of the Machine Learning Journal (MLJ, 2011-2014).

*M. Ghavamzadeh*: Annual Conference on Neural Information Processing Systems (NIPS 2011), International Conference on Artificial Intelligence and Statistics (AISTATS 2011),
Neurocomputing, Machine Learning Journal (MLJ), Journal of Machine Learning Research (JMLR), Journal of Artificial Intelligence Research (JAIR).

*D. Ryabko*: IEEE Trans. Inf. Th., NIPS 2011.

*R. Munos*: ADPRL 2011, AISTATS 2011, ALT 2011, CAP 2011, ICML 2011, IJCAI 2011.

*P. Preux*: NIPS 2011, CAP 2011, IEEE Transactions on Neural Networks, Revue d'Intelligence Artificielle.

*A. Lazaric*: AAAI 2011, AAMAS 2011 & 2012, ACC, ALT, COLT, ICML, IJCAI, Journal of Artificial Intelligence Research (JAIR), Journal of Machine Learning Research (JMLR), IEEE
Transactions on Automatic Control (TAC).

*P. Chainais*: IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal of Statistical Physics, Physica A.

*Emmanuel Duflos* was appointed Director of Research of the École Centrale de Lille. He has also reviewed proposals for ANR programs.

*M. Ghavamzadeh* is a grant proposal reviewer for the Natural Sciences and Engineering Research Council of Canada (NSERC).

*J. Mary* is an expert for the “Ministère de l'Enseignement Supérieur et de la Recherche” on the evaluation of the “Crédit Impôt Recherche”, a member of the COS at the University of
Lille 3 for one assistant professor position in computer science, and a member of the COS at the École Centrale de Lille for one assistant professor position in computer science.

*R. Munos*: project evaluation for the Research Foundation Flanders (FWO), Belgium, 2011; member of the evaluation committee for Machine Learning at the Université Libre de Bruxelles (ULB);
member of the Comité de sélection Professeur 27ème section for Polytech Paris-Sud, 2011.

*P. Preux*: reviewer for the CNRS PEPPI biology-mathematics-computer science program, reviewer for the ANR CONTINT program and the ANR “blanc” program, president of the committee of
selection (COS) at the University of Lille 3 for one assistant professor position in computer science, and member of the committee of selection (COS) at the École Centrale de Lille for
one assistant professor position in computer science.

*D. Ryabko* is a member of the COST-GRI evaluation committee.

*Philippe Vanheeghe* has reviewed proposals for Discovery Grant applications of the Natural Sciences and Engineering Research Council of Canada (NSERC - CRSNG), as well as for the
ANR.

*D. Ryabko* is an examiner of the Ph.D. of K. Eltysheva.

*R. Munos* was examiner of the Ph.D. theses of Louis Dorard (University College London) and Raphael Fonteneau (University of Liège), and a member of the Ph.D. juries of Lei Yu
(Université de Cergy-Pontoise) and Wassim Jouini (Supélec Rennes).

*R. Munos* is a member of the HDR committees of Daniil Ryabko (INRIA Lille - Nord Europe) and Aurélien Garivier (Télécom ParisTech), 2011.

*P. Preux* is a member of the Ph.D. juries of Halim Benhabiles and Manuel Loth (Université de Lille 1).

*E. Duflos* was
*rapporteur* for the PhD theses of Michele Pace (INRIA Bordeaux), Pierre Neri (ENAC - University of Toulouse), Frédéric Faurie (University of Bordeaux), and Sébastien Rougerie
(University of Toulouse), as well as for the Habilitation à Diriger des Recherches of Frédéric Dambreville.

*R. Munos* has been Vice-Président du Comité des Projets at INRIA Lille - Nord Europe since September 2011.

*R. Munos* is a member of the Commission d'Évaluation of INRIA.

*R. Munos* is Président du jury d'admissibilité CR1-CR2 at INRIA Lille - Nord Europe.

*R. Munos* is a member of the jury d'admission DR2 of INRIA in 2011.

*R. Munos*, Master: “Introduction to Reinforcement Learning”, 30 hours, M2, Master “Mathématiques, Vision, Apprentissage”, ENS Cachan.

*J. Mary*, Master: “Programmation web avancée et design pattern”, 32h eq TD, M2, Université de Lille 3, France.

*J. Mary*, Master: “Introduction à la Programmation R”, 32h eq TD, M1, Université de Lille 3, France.

*J. Mary*, Master: “Programmation R avancée”, 32h eq TD, M1, Université de Lille 3, France.

*P. Chainais*, Master: “Ondelettes et Applications”, 24h, level M1, École Centrale de Lille, 2nd year.

*P. Chainais*, Master: “Décision et Apprentissage”, 24h, level M2, École Centrale de Lille, 3rd year.

*P. Preux*, Master: “Décision dans l'incertain”, 40h, level M2 computer science, Lille 1.

*P. Preux*, Master: “Mathématiques, Informatique, Modélisation”, 72h, level M1 psychology, Lille 3.

*E. Duflos*, Master: “Modélisation et Inférence Bayésienne”, 40h, level M2, École Centrale de Lille, 3rd year.

*P. Vanheeghe*, Master: “Estimation, Identification, Observation”, 32h, level M2, École Centrale de Lille, 3rd year.

HdR:
*Daniil Ryabko*, Learnability in Problems of Sequential Inference, Université de Lille 1, December 19, 2011.

PhD:
*Manuel Loth*,
*Active Set Algorithms for the LASSO*, Université de Lille 1, July 8, 2011, supervisor: Philippe Preux.

PhD:
*Odalric Maillard*,
*Apprentissage séquentiel : Bandits, Statistique et Renforcement*, Université de Lille 1 / Université de Toulouse, October 3, 2011, supervisors: Rémi Munos and Philippe Berthet.

PhD: *Nicolas Viandier*, June 2011 (see section 10, [4]), supervisors: Emmanuel Duflos, Juliette Marais (IFSTTAR).

PhD in progress:
*Boris Baldassari*, “Apprentissage automatique et développement logiciel”, Sep. 2011, supervisor: Ph. Preux.

PhD in progress:
*Victor Gabillon*, “Active Learning in Classification-based Policy Iteration”, Sep. 2009, supervisors: M. Ghavamzadeh, Ph. Preux.

PhD in progress:
*Azadeh Khaleghi*, “Unsupervised Learning of Sequential Data”, Sep. 2010, supervisors: D. Ryabko, Ph. Preux.

PhD in progress:
*Sami Naamane*, “Filtrage collaboratif adverse et dynamique”, Nov. 2011, supervisors: J. Mary, Ph. Preux.

PhD in progress:
*Olivier Nicol*, “Apprentissage par renforcement sous contrainte de ressources finies, dans un environnement non stationnaire, face à des flux de données massifs”, Nov. 2010,
supervisors: J. Mary, Ph. Preux.

PhD in progress:
*Christophe Salperwyck*, “Apprentissage incrémental et sur flux de données”, Dec. 2009, supervisor: Ph. Preux.

PhD in progress:
*Amir Sani*, “Learning under uncertainty”, Oct. 2011, supervisors: R. Munos, A. Lazaric.

PhD in progress:
*Jean-François Hren*, “Prise de décision et planification optimiste”, Oct. 2007, supervisor: R. Munos.

PhD in progress:
*Alexandra Carpentier*, “Allocation adaptative de ressources pour l'apprentissage actif”, Oct. 2007, supervisor: R. Munos.

PhD in progress:
*Emilie Kaufmann*, “Bayesian Bandits”, Oct. 2011, supervisors: R. Munos, O. Cappé, A. Garivier.