SequeL is a joint project with the LIFL (UMR 8022 of CNRS, University of Lille 1, and University of Lille 3) and the LAGIS (a joint lab of the École Centrale de Lille and the Lille 1 University).

SequeL means “Sequential Learning”. As such, SequeL focuses on the task of learning in artificial systems (either hardware or software) that gather information over time. Such systems are named *(learning) agents* (or learning machines) in the following. The data they gather may be used to estimate some parameters of a model, which in turn may be used for selecting actions in order to perform some long-term optimization task.

For the purpose of model building, the agent needs to represent information collected so far in some compact form and use it to process newly available data.

The acquired data may result from an observation process of an agent in interaction with its environment (the data thus represent a perception). This is the case when the agent makes decisions (in order to attain a certain objective) that impact the environment, and thus the observation process itself.

Hence, in SequeL, the term **sequential** refers to two aspects:

the **sequential acquisition of data**, from which a model is learned (supervised and unsupervised learning),

the **sequential decision making task**, based on the learned model (reinforcement learning).

Examples of sequential learning problems include:

Supervised learning tasks deal with the prediction of some response given a certain set of observations of input variables and responses. New sample points keep on being observed.

Unsupervised learning tasks deal with clustering objects that arrive as a flow. The (unknown) number of clusters typically evolves over time, as new objects are observed.

Reinforcement learning tasks deal with the control (a policy) of some system which has to be optimized (see ). We do not assume the availability of a model of the system to be controlled.

In all these cases, we mostly assume that the process can be considered stationary for at least a certain amount of time, and slowly evolving.

We wish to have any-time algorithms, that is, at any moment, a prediction may be required/an action may be selected making full use, and hopefully, the best use, of the experience already gathered by the learning agent.

The perception of the environment by the learning agent (using its sensors) is generally not the best representation for making a prediction or taking a decision (we deal with Partially Observable Markov Decision Problems). So, the perception has to be mapped in some way to a better, more relevant state (or input) space.

Finally, an important issue of prediction regards its evaluation: how wrong may we be when we perform a prediction? For real systems to be controlled, this issue cannot simply be left unanswered.

To sum up, in SequeL, the main issues regard:

the learning of a model: we focus on models that map some input space to ,

the observation to state mapping,

the choice of the action to perform (in the case of sequential decision problem),

the performance guarantees,

the implementation of usable algorithms,

all that being understood in a *sequential* framework.

This year, Sébastien Bubeck has defended his Ph.D. thesis, entitled “Bandits Games and Clustering Foundations” . Not only is it the first Ph.D. defence in SequeL, but it is also a highly successful one: Sébastien has been awarded a Gilles Kahn 2010 prize, a prize awarded by Specif to the best Ph.D. theses in Computer Science, in France (patronized by the Academy of Science). The thesis supervisor was Rémi Munos.

The research of Rémi Coulom on artificial Go players has received further recognition, in the form of two important international awards (see Section ). This work has also been highlighted in the popular science magazine “Pour la Science”, featuring on the cover .

SequeL is primarily grounded on two domains:

the problem of decision under uncertainty,

statistical analysis and statistical learning, which provide the general concepts and tools to solve this problem.

To help the reader who is unfamiliar with these questions, we briefly present key ideas below.

The phrase “Decision under uncertainty” refers to the problem of taking decisions when we have full knowledge neither of the situation nor of the consequences of the decisions, as well as when the consequences of decisions are non-deterministic.

We introduce two specific sub-domains, namely Markov decision processes, which model sequential decision problems, and bandit problems.

Sequential decision processes occupy the heart of the SequeL project; a detailed presentation of this problem may be found in Puterman's book .

A Markov Decision Process (MDP) is defined as the tuple $(\mathcal{X}, \mathcal{A}, P, r)$ where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the probabilistic transition kernel, and $r$ is the reward function. For the sake of simplicity, we assume in this introduction that the state and action spaces are finite. If the current state (at time $t$) is $x \in \mathcal{X}$ and the chosen action is $a \in \mathcal{A}$, then the Markov assumption means that the transition probability to a new state $x' \in \mathcal{X}$ (at time $t+1$) only depends on $(x, a)$. We write $p(x'|x,a)$ for the corresponding transition probability. During a transition $(x, a) \to x'$, a reward $r(x, a, x')$ is incurred.
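As an illustration of the definition above, here is a minimal sketch of a finite MDP in Python. The two states, two actions, transition probabilities, and reward function are all hypothetical values chosen for the example; only the structure (a kernel $p(x'|x,a)$ and a reward $r(x,a,x')$) comes from the text.

```python
import random

# Hypothetical two-state, two-action MDP; all numbers are illustrative.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[(x, a)] maps each successor state x' to p(x' | x, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}


def reward(x, a, x_next):
    """r(x, a, x'): hypothetical reward of 1 whenever the transition lands in s1."""
    return 1.0 if x_next == "s1" else 0.0


def step(x, a, rng):
    """Sample x' ~ p(. | x, a) and return (x', r(x, a, x'))."""
    successors = list(P[(x, a)].keys())
    weights = list(P[(x, a)].values())
    x_next = rng.choices(successors, weights=weights, k=1)[0]
    return x_next, reward(x, a, x_next)


rng = random.Random(0)
x = "s0"
trajectory = []
for t in range(5):
    a = rng.choice(ACTIONS)  # uniformly random action, for illustration only
    x_next, r = step(x, a, rng)
    trajectory.append((x, a, r, x_next))
    x = x_next
```

The Markov assumption shows up in the signature of `step`: the next state is sampled from the current pair $(x, a)$ only, never from earlier history.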

In the MDP $(\mathcal{X}, \mathcal{A}, P, r)$, each initial state $x_0$ and action sequence $a_0, a_1, \ldots$ gives rise to a sequence of states $x_1, x_2, \ldots$, satisfying $\mathbb{P}(x_{t+1} = x' \mid x_t = x, a_t = a) = p(x'|x,a)$, and rewards $r_1, r_2, \ldots$ defined by $r_t = r(x_t, a_t, x_{t+1})$ (note that $r_t$ itself is a random variable).

The history of the process up to time $t$ is defined to be $H_t = (x_0, a_0, \ldots, x_{t-1}, a_{t-1}, x_t)$. A policy $\pi$ is a sequence of functions $\pi_0, \pi_1, \ldots$, where $\pi_t$ maps the space of possible histories at time $t$ to the space of probability distributions over the space of actions $\mathcal{A}$. To follow a policy means that, in each time step, we assume that the process history up to time $t$ is $x_0, a_0, \ldots, x_t$ and the probability of selecting an action $a$ is equal to $\pi_t(x_0, a_0, \ldots, x_t)(a)$. A policy is called stationary (or Markovian) if $\pi_t$ depends only on the last visited state. In other words, a policy $\pi = (\pi_0, \pi_1, \ldots)$ is called stationary if $\pi_t(x_0, a_0, \ldots, x_t) = \pi_0(x_t)$ holds for all $t \geq 0$. A policy is called deterministic if the probability distribution prescribed by the policy for any history is concentrated on a single action. Otherwise it is called a stochastic policy.

We move from an MD process to an MD problem by formulating the goal of the agent, that is, what the sought policy $\pi$ has to optimize. It is very often formulated as maximizing (or minimizing), in expectation, some functional of the sequence of future rewards. For example, a usual functional is the infinite-time horizon sum of discounted rewards. For a given (stationary) policy $\pi$, we define the value function $V^\pi(x)$ of that policy $\pi$ at a state $x \in \mathcal{X}$ as the expected sum of discounted future rewards given that we start from the initial state $x$ and follow the policy $\pi$:

$$V^\pi(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; x_0 = x; \pi \right],$$

where $\mathbb{E}$ is the expectation operator and $\gamma \in (0, 1)$ is the discount factor. This value function $V^\pi$ gives an evaluation of the performance of a given policy $\pi$. Other functionals of the sequence of future rewards may be considered, such as the undiscounted reward (see the stochastic shortest path problems ) and average reward settings. Note also that, here, we considered the problem of maximizing a reward functional, but a formulation in terms of minimizing some cost or risk functional would be equivalent.
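To make the discounted value function concrete, the sketch below evaluates a fixed deterministic stationary policy by repeatedly applying the one-step expectation until the values stop changing. This iterative scheme is one standard way to compute the value of a policy in a finite MDP; the two-state MDP, its numbers, and the names `p`, `r`, and `evaluate_policy` are assumptions made for the example.

```python
# Hypothetical 2-state, 2-action MDP (illustrative numbers).
# p[x][a][y] = p(y | x, a); r[x][a][y] = r(x, a, y).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
r = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
gamma = 0.9  # discount factor in (0, 1)


def evaluate_policy(policy, p, r, gamma, tol=1e-10):
    """Iterate V(x) <- sum_y p(y|x,pi(x)) [r(x,pi(x),y) + gamma V(y)] to convergence."""
    n = len(p)
    v = [0.0] * n
    while True:
        new_v = [
            sum(p[x][policy[x]][y] * (r[x][policy[x]][y] + gamma * v[y])
                for y in range(n))
            for x in range(n)
        ]
        # Stop when the sup-norm change between two sweeps is below tol.
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


# Deterministic stationary policy: action 1 in state 0, action 0 in state 1.
policy = [1, 0]
v_pi = evaluate_policy(policy, p, r, gamma)
```

Because $\gamma < 1$, each sweep shrinks the error by a factor of at most $\gamma$, so the loop terminates for any tolerance.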

In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP) , which introduces the optimal value function $V^*(x)$, defined as the optimal expected sum of rewards when the agent starts from a state $x$. We have $V^*(x) = \sup_\pi V^\pi(x)$. Now, let us give two definitions about policies:

We say that a policy $\pi$ is optimal if it attains the optimal values $V^*(x)$ for any state $x \in \mathcal{X}$, *i.e.*, if $V^\pi(x) = V^*(x)$ for all $x \in \mathcal{X}$. Under mild conditions, deterministic stationary optimal policies exist . Such an optimal policy is written $\pi^*$.

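We say that a (deterministic stationary) policy $\pi$ is greedy with respect to (w.r.t.) some function $V$ (defined on $\mathcal{X}$) if, for all $x \in \mathcal{X}$,

$$\pi(x) \in \arg\max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x'|x,a) \left[ r(x,a,x') + \gamma V(x') \right],$$

where $\arg\max_{a \in \mathcal{A}} f(a)$ is the set of $a \in \mathcal{A}$ that maximize $f(a)$. For any function $V$, such a greedy policy always exists because $\mathcal{A}$ is finite.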
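The greedy construction can be sketched in a few lines. The function below picks, in each state, an action that maximizes the one-step lookahead value for a given function $V$; the MDP numbers, the function $V$, and the tie-breaking rule (lowest action index) are assumptions made for the example.

```python
def greedy_policy(v, p, r, gamma):
    """Return one greedy action per state w.r.t. the function v."""
    n_states = len(p)
    policy = []
    for x in range(n_states):
        # One-step lookahead value of each action a in state x.
        q = [
            sum(p[x][a][y] * (r[x][a][y] + gamma * v[y]) for y in range(n_states))
            for a in range(len(p[x]))
        ]
        # max() returns the first maximizer, so ties go to the lowest index.
        policy.append(max(range(len(q)), key=lambda a: q[a]))
    return policy


# Hypothetical 2-state, 2-action MDP: p[x][a][y] = p(y | x, a), r[x][a][y] = r(x, a, y).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
r = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
v = [0.0, 5.0]  # an arbitrary function V on the state space
pi = greedy_policy(v, p, r, 0.9)
```

Since the action set is finite, the maximum is always attained, which is the existence argument given above.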

The goal of Reinforcement Learning (RL), as well as that of dynamic programming, is to design an optimal policy (or a good approximation of it).

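The well-known Dynamic Programming equation (also called the Bellman equation) provides a relation between the optimal value function at a state $x$ and the optimal value function at the successor states $x'$ when choosing an optimal action: for all $x \in \mathcal{X}$,

$$V^*(x) = \max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x'|x,a) \left[ r(x,a,x') + \gamma V^*(x') \right].$$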

The benefit of introducing this concept of optimal value function relies on the property that, from the optimal value function $V^*$, it is easy to derive an optimal behavior by choosing the actions according to a policy greedy w.r.t. $V^*$. Indeed, we have the property that a policy greedy w.r.t. the optimal value function is an optimal policy:

$$\pi^*(x) \in \arg\max_{a \in \mathcal{A}} \sum_{x' \in \mathcal{X}} p(x'|x,a) \left[ r(x,a,x') + \gamma V^*(x') \right].$$
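When the model is known and the state and action spaces are finite, the Bellman equation can be solved numerically by value iteration, and an optimal policy is then read off greedily. The sketch below is a minimal version of that classical scheme; the MDP numbers and the name `value_iteration` are assumptions made for the example.

```python
def value_iteration(p, r, gamma, tol=1e-10):
    """Iterate the Bellman optimality operator; return (V, greedy policy)."""
    n = len(p)
    v = [0.0] * n
    while True:
        # q[x][a] = sum_y p(y|x,a) [r(x,a,y) + gamma v(y)]
        q = [
            [sum(p[x][a][y] * (r[x][a][y] + gamma * v[y]) for y in range(n))
             for a in range(len(p[x]))]
            for x in range(n)
        ]
        new_v = [max(row) for row in q]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            policy = [max(range(len(row)), key=row.__getitem__) for row in q]
            return new_v, policy
        v = new_v


# Hypothetical 2-state, 2-action MDP: p[x][a][y] = p(y | x, a), r[x][a][y] = r(x, a, y).
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
r = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
v_star, pi_star = value_iteration(p, r, gamma=0.9)
```

In this toy problem the reward is earned by landing in state 1, so the computed policy moves toward state 1 from state 0 and stays there afterwards.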

In short, we would like to mention that most of the reinforcement learning methods developed so far are built on one (or both) of the two following approaches ( ):

Bellman's dynamic programming approach, based on the introduction of the value function. It consists in learning a “good” approximation of the optimal value function, and then using it to derive a greedy policy w.r.t. this approximation. The hope (well justified in several cases) is that the performance $V^\pi$ of the policy $\pi$ greedy w.r.t. an approximation $V$ of $V^*$ will be close to optimality. This approximation issue of the optimal value function is one of the major challenges inherent to the reinforcement learning problem. **Approximate dynamic programming** addresses the problem of estimating performance bounds (*e.g.* the loss in performance $\|V^* - V^\pi\|$ resulting from using a policy $\pi$, greedy w.r.t. some approximation $V$, instead of an optimal policy) in terms of the approximation error $\|V^* - V\|$ of the optimal value function $V^*$ by $V$. Approximation theory and Statistical Learning theory provide us with bounds in terms of the number of sample data used to represent the functions, and the capacity and approximation power of the considered function spaces.
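A classical instance of such a performance bound, stated here for the discounted setting with the supremum norm, reads as follows. If $\pi$ is greedy w.r.t. a function $V$ and $\gamma \in (0,1)$ is the discount factor, then

$$\| V^* - V^\pi \|_\infty \;\leq\; \frac{2\gamma}{1-\gamma} \, \| V^* - V \|_\infty .$$

The factor $1/(1-\gamma)$ shows why approximation errors matter more as the effective horizon grows: the same error on $V$ can cost much more in policy performance when $\gamma$ is close to 1.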

Pontryagin's maximum principle approach, based on sensitivity analysis of the performance measure w.r.t. some control parameters. This approach, also called **direct policy search** in the Reinforcement Learning community, aims at directly finding a good feedback control law in a parameterized policy space without trying to approximate the value function. The method consists in estimating the so-called **policy gradient**, *i.e.* the sensitivity of the performance measure (the value function) w.r.t. some parameters of the current policy. The idea is that an optimal control problem is replaced by a parametric optimization problem in the space of parameterized policies. As such, deriving a policy gradient estimate leads to performing a stochastic gradient method in order to search for a locally optimal parametric policy.
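The following sketch illustrates the policy gradient idea in the simplest possible setting: a single-state problem with two actions and a softmax policy, updated with the score-function (REINFORCE-style) gradient estimate. The reward probabilities, step size, number of steps, and function names are all assumptions made for the example, and no variance-reduction baseline is used.

```python
import math
import random


def softmax(theta):
    """Action probabilities of a softmax policy with preferences theta."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(success_probs, steps=2000, step_size=0.1, seed=0):
    """Stochastic gradient ascent on expected reward for a one-state problem."""
    rng = random.Random(seed)
    theta = [0.0] * len(success_probs)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        r = 1.0 if rng.random() < success_probs[a] else 0.0
        # Gradient estimate: r * d/dtheta_b log pi_theta(a) = r * (1[a == b] - pi(b)).
        for b in range(len(theta)):
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[b] += step_size * r * grad_log
    return softmax(theta)


# Hypothetical Bernoulli rewards: action 1 succeeds more often than action 0.
final_probs = reinforce([0.2, 0.8])
```

The update never builds a value function: it only moves the policy parameters in the direction of the estimated gradient, which is the defining feature of direct policy search.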

Finally, many extensions of Markov decision processes exist, among which Partially Observable MDPs (POMDPs) cover the case where the current observation does not contain all the information required to decide for sure on the best action.

Bandit problems illustrate the fundamental difficulty of decision making in the face of uncertainty: a decision maker must choose between following what seems to be the best choice (“exploit”) and testing (“explore”) some alternative, hoping to discover a choice that beats the current best choice.

The classical example of a bandit problem is deciding what treatment to give each patient in a clinical trial when the effectiveness of the treatments is initially unknown and the patients arrive sequentially. These bandit problems became popular with the seminal paper , after which they found applications in diverse fields, such as control, economics, statistics, or learning theory.

Formally, a $K$-armed bandit problem ($K \geq 2$) is specified by $K$ real-valued distributions. In each time step a decision maker can select one of the distributions to obtain a sample from it. The samples obtained are considered as rewards. The distributions are initially unknown to the decision maker, whose goal is to maximize the sum of the rewards received, or equivalently, to minimize the regret, which is defined as the loss compared to the total payoff that can be achieved given full knowledge of the problem, *i.e.*, when the arm giving the highest expected reward is pulled all the time.
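In symbols, and with notation introduced here only for illustration ($\mu_k$ the mean of arm $k$, $\mu^* = \max_k \mu_k$, $I_t$ the arm pulled at time $t$, $T_k(n)$ the number of pulls of arm $k$ up to time $n$, and $\Delta_k = \mu^* - \mu_k$), the expected regret after $n$ steps can be written as

$$R_n = n \mu^* - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{I_t} \right] = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}\left[ T_k(n) \right].$$

The second form makes the goal explicit: a good strategy keeps the number of pulls of every suboptimal arm small.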

The name “bandit” comes from imagining a gambler playing with $K$ slot machines. The gambler can pull the arm of any of the machines, which produces a random payoff as a result: when arm $k$ is pulled, the random payoff is drawn from the distribution associated with $k$. Since the payoff distributions are initially unknown, the gambler must use exploratory actions to learn the utility of the individual arms. However, exploration has to be carefully controlled since excessive exploration may lead to unnecessary losses. Hence, to play well, the gambler must carefully balance exploration and exploitation. Auer *et al.* introduced the algorithm UCB (Upper Confidence Bounds) that follows what is now called the “optimism in the face of uncertainty principle”. Their algorithm works by computing upper confidence bounds for all the arms and then choosing the arm with the highest such bound. They proved that the expected regret of their algorithm increases at most at a logarithmic rate with the number of trials, and that the algorithm achieves the smallest possible regret up to some sub-logarithmic factor (for the considered family of distributions).
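A minimal sketch of an upper-confidence-bound strategy is given below, using the common index "empirical mean plus $\sqrt{2 \ln t / n_k}$" for rewards in $[0, 1]$. The Bernoulli arms, the horizon, and the function name `ucb1` are assumptions made for the example; the pseudo-regret is computed from the (normally unknown) true means purely for inspection.

```python
import math
import random


def ucb1(success_probs, horizon, seed=0):
    """Play a Bernoulli bandit with a UCB index; return (pull counts, pseudo-regret)."""
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k     # number of pulls of each arm
    sums = [0.0] * k     # sum of rewards collected from each arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize the estimates
        else:
            # Optimism: choose the arm with the highest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    best = max(success_probs)
    pseudo_regret = sum((best - mu) * n for mu, n in zip(success_probs, counts))
    return counts, pseudo_regret


# Hypothetical arm means; the last arm is the best one.
counts, pseudo_regret = ucb1([0.3, 0.5, 0.7], horizon=5000)
```

The confidence term shrinks as an arm is pulled more often, so rarely pulled arms are revisited from time to time while most pulls go to the arm that looks best.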

Many of the problems of machine learning can be seen as extensions of classical problems of mathematical statistics to their (extremely) non-parametric and model-free cases. Other machine learning problems are founded on such statistical problems. Statistical problems of sequential learning are mainly those that are concerned with the analysis of time series. These problems are as follows.

Given a series of observations $x_1, \ldots, x_n$, it is required to predict the probability distribution of the next outcome $x_{n+1}$, before it is revealed and the process continues. Different goals can be formulated in this setting. One can either make some assumptions on the probability measure that generates the sequence $x_1, \ldots, x_n, \ldots$, such as that the outcomes are independent and identically distributed (i.i.d.), or that the sequence is a Markov chain, that it is a stationary process, etc. More generally, one can assume that the data is generated by a probability measure that belongs to a certain set $\mathcal{C}$. In these cases the goal is to have the discrepancy between the predicted and the “true” probabilities go to zero, if possible, with guarantees on the speed of convergence.
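For the simplest case of i.i.d. binary outcomes, a classical sequential predictor is Laplace's rule of succession, which predicts the probability of a 1 as (number of ones seen + 1) / (number of outcomes seen + 2). The sketch below applies it online, issuing each prediction before the outcome is revealed; the function name and the sample sequence are assumptions made for the example.

```python
from fractions import Fraction


def laplace_predictor(observations):
    """Return, for each step, the predicted probability that the next outcome is 1.

    Each prediction is computed before the corresponding outcome is used.
    """
    ones = 0
    predictions = []
    for n, x in enumerate(observations):
        predictions.append(Fraction(ones + 1, n + 2))  # rule of succession
        ones += x                                      # outcome is revealed
    return predictions


preds = laplace_predictor([1, 1, 0, 1])
```

Under the i.i.d. assumption, these predicted probabilities converge to the true parameter as the number of observations grows.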

Alternatively, rather than making some assumptions on the data, one can change the goal: the predicted probabilities should be asymptotically as good as those given by the best reference predictor from a certain pre-defined set.

Given a series of observations $x_1, \ldots, x_n$ generated by some unknown probability measure $\mu$, the problem is to test a certain given hypothesis $H_0$ about $\mu$, versus a given alternative hypothesis $H_1$. There are many different examples of this problem. Perhaps the simplest one is testing a simple hypothesis “$\mu$ is a Bernoulli i.i.d. measure with probability of 0 equal to 1/2” versus “$\mu$ is Bernoulli i.i.d. with a parameter different from 1/2”. More interesting cases include the problems of model verification: for example, testing that $\mu$ is a Markov chain, versus that it is a stationary ergodic process but not a Markov chain. In the case when we have not one but several series of observations, we may wish to test the hypothesis that they are independent, or that they are generated by the same distribution. Applications of these problems to a more general class of machine learning tasks include the problem of feature selection, and the problem of testing that a certain behavior (such as pulling a certain arm of a bandit, or using a certain policy) is better (in terms of achieving some goal, or collecting some rewards) than another behavior, or than a class of other behaviors.
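The simplest example above (testing whether a Bernoulli i.i.d. source has parameter 1/2) can be handled with an exact binomial test. The sketch below computes a two-sided p-value by summing the probabilities of all counts that are no more likely than the observed one; the function name and the sample counts are assumptions made for the example.

```python
from math import comb


def binomial_two_sided_p_value(zeros, n):
    """Exact two-sided p-value for H0: P(outcome = 0) = 1/2, given `zeros` zeros in n draws."""
    observed = comb(n, zeros)
    # Under H0, a count k has probability comb(n, k) / 2**n; sum the
    # probabilities of all counts that are no more likely than the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


p_balanced = binomial_two_sided_p_value(10, 20)  # 10 zeros out of 20
p_skewed = binomial_two_sided_p_value(18, 20)    # 18 zeros out of 20
```

A small p-value, as for the skewed sample, is evidence against $H_0$; the balanced sample gives no such evidence.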

The problem of hypothesis testing can also be studied in its general formulation: given two (abstract) hypotheses $H_0$ and $H_1$ about the unknown measure that generates the data, find out whether it is possible to test $H_0$ against $H_1$ (with confidence), and if so, how one can do it.

The problem of clustering, while being a classical problem of mathematical statistics, belongs to the realm of unsupervised learning. For time series, this problem can be formulated as follows: given several samples $x^1, \ldots, x^N$, we wish to group similar objects together. While this is of course not a precise formulation, it can be made precise if we assume that the samples were generated by $k$ different distributions. Alternatively, one may assume some specific model on the data, leading to different formalizations of the problem.

Before detailing some issues of statistical learning, let us recall the definitions of a few terms.

Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness. (source: AAAI website)

Statistical learning is an approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/“common sense” to design an algorithm.

Generally speaking, a kernel function is a function that maps a pair of points to a real value. Typically, this value is a measure of similarity between the two points. Assuming a few properties on it, the kernel function implicitly defines a dot product in some function space. This very convenient formal property, as well as several others, has ensured a strong appeal for these methods in the last 10 years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce the representation of data in a certain “suitable” space where the problem to solve (classification, regression, ...) is expected to be simpler (non-linearity turns to linearity).
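As a concrete illustration, the sketch below defines the Gaussian (RBF) kernel, a widely used kernel function, and builds the Gram matrix of pairwise kernel values for a few points. The points, the bandwidth, and the function names are assumptions made for the example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)); equals 1 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(x, y) for y in points] for x in points]


# Hypothetical points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(points, gaussian_kernel)
```

Kernelized algorithms access the data only through such a Gram matrix, which is what allows a linear method to be restated in the implicit feature space.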

The fundamental tools used in SequeL come from the field of statistical learning . We briefly present the most important for us to date, namely, kernel-based non parametric function approximation, and non parametric Bayesian models.

In statistics in general, and applied mathematics, the approximation of a multi-dimensional real function given some samples is a well-known problem (known as either regression, or interpolation, or function approximation, ...). Regressing a function from data is a key ingredient of our research, or at the least, a basic component of most of our algorithms. In the context of sequential learning, we have to regress a function while data samples are being obtained one at a time, under the constraint of being able to predict points at any step along the acquisition process. In sequential decision problems, we typically have to learn a value function, or a policy.

Many methods have been proposed for this purpose. We are looking for suitable ones to cope with the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large; these are areas where the approximation is difficult, while these are also the areas where the accuracy of the approximation should be maximal to obtain a good policy (and where, otherwise, a bad choice of action may imply catastrophic consequences).

We particularly favor non parametric methods since they make few assumptions about the function to learn. In particular, we have strong interests in $\ell_1$-regularization, and the (kernelized-)LARS algorithm. $\ell_1$-regularization yields sparse solutions, and the LARS approach produces the whole regularization path very efficiently, which helps solve the regularization parameter tuning problem.
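To show how $\ell_1$-regularization produces sparse solutions, the sketch below solves a small lasso problem for a single value of the regularization parameter. Note that it uses cyclic coordinate descent with soft-thresholding, not the LARS algorithm mentioned above, because coordinate descent fits in a few lines; the data, the parameter `lam`, and the function names are assumptions made for the example.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, since w starts at 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of column j with the residual computed without w_j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Hypothetical design with orthogonal columns; y = 2 * column_1 + 0.05 * column_2.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, 1.0],
     [1.0, 1.0, -1.0],
     [1.0, -1.0, -1.0]]
y = [2.05, -1.95, 1.95, -2.05]
w = lasso_coordinate_descent(X, y, lam=0.1)
```

The weak coefficient is set exactly to zero while the strong one is only shrunk, which is the sparsity effect referred to above. Running the solver over a grid of `lam` values would trace an approximation of the regularization path that LARS computes directly.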

Numerous problems in signal processing may be solved efficiently by way of a Bayesian approach. The use of Monte-Carlo methods allows us to handle non-linear, as well as non-Gaussian, problems. In their standard form, they require the formulation of probability densities in a parametric form. For instance, it is common to use a Gaussian likelihood, because it is handy. However, in some applications such as Bayesian filtering, or blind deconvolution, the choice of a parametric form of the density of the noise is often arbitrary. If this choice is wrong, it may also have dramatic consequences on the estimation quality. To overcome this shortcoming, one possible approach is to consider that this density must also be estimated from data. A general Bayesian approach then consists in defining a probabilistic space associated with the possible outcomes of the *object* to be estimated. Applied to density estimation, it means that we need to define a probability measure on the probability density of the noise: such a measure is called a *random measure*. The classical Bayesian inference procedures can then be used. This approach being by nature non parametric, the associated framework is called *Non Parametric Bayesian*.

In particular, mixtures of Dirichlet processes provide a very powerful formalism. Dirichlet processes are a possible random measure, and mixtures of Dirichlet processes are an extension of well-known finite mixture models. Given a mixture density $f(x|\theta)$, and $G$, a Dirichlet process, we define a mixture of Dirichlet processes as:

$$F(x) = \int f(x|\theta) \, G(d\theta),$$

where $F(x)$ is the density to be estimated. The class of densities that may be written as a mixture of Dirichlet processes is very wide, so that they really fit a very large number of applications.
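One standard way to picture a draw from a Dirichlet process is the stick-breaking construction, in which the random measure is a countable sum of weighted atoms and the mixture density becomes a weighted sum of component densities. The sketch below uses a truncated version; the concentration parameter `alpha`, the Gaussian base measure, the unit-variance Gaussian component density, and the truncation level are all assumptions made for the example.

```python
import math
import random


def stick_breaking(alpha, base_sampler, truncation, rng):
    """Truncated draw from a Dirichlet process as a list of (weight, atom) pairs."""
    remaining = 1.0
    components = []
    for _ in range(truncation - 1):
        beta = rng.betavariate(1.0, alpha)   # fraction of the remaining stick
        components.append((remaining * beta, base_sampler(rng)))
        remaining *= 1.0 - beta
    # The leftover mass goes to a final atom so that the weights sum to one.
    components.append((remaining, base_sampler(rng)))
    return components


def gaussian_density(x, mean, std=1.0):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, components):
    """F(x) = sum_k w_k f(x | theta_k), with f a unit-variance Gaussian density."""
    return sum(w * gaussian_density(x, theta) for w, theta in components)


rng = random.Random(0)
G = stick_breaking(alpha=2.0, base_sampler=lambda r: r.gauss(0.0, 3.0),
                   truncation=50, rng=rng)
density_at_zero = mixture_density(0.0, G)
```

Because the weights decay quickly, only a few atoms carry most of the mass, which is why such mixtures adapt the effective number of components to the data.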

Given a set of observations, the estimation of the parameters of a mixture of Dirichlet processes is performed by way of a Markov chain Monte Carlo (MCMC) algorithm. Dirichlet process mixtures are also widely used in clustering problems. Once the parameters of a mixture are estimated, they can be interpreted as the parameters of a specific cluster defining a class as well. Dirichlet processes are well known within the machine learning community, and their potential in statistical signal processing still needs to be developed.

In the general multi-sensor multi-target Bayesian framework, an unknown (and possibly varying) number of targets with states $x_1, \ldots, x_n$ are observed by several sensors, which produce a collection of measurements $z_1, \ldots, z_m$ at every time step $k$. Well-known models for this problem are track-based models, such as the joint probability data association (JPDA), or joint multi-target probabilities, such as the joint multi-target probability density. Common difficulties in multi-target tracking arise from the fact that the system state and the collection of measurements from sensors are unordered and their sizes evolve randomly through time. Vector-based algorithms must therefore account for state coordinate exchanges and missing data within an unknown time interval. Although this approach is very popular and has resulted in many algorithms in the past, it may not be the optimal way to tackle the problem, since the state and the data are in fact *sets* and not vectors.

The random finite set theory provides a powerful framework to deal with these issues. Mahler's work on finite sets statistics (FISST) provides a mathematical framework to build multi-object densities and derive the Bayesian rules for state prediction and state estimation. Randomness on the number of objects and their states is encapsulated into random finite sets (RFS), namely multi-target (state) sets $X = \{x_1, \ldots, x_n\}$ and multi-sensor (measurement) sets $Z_k = \{z_1, \ldots, z_m\}$. The objective is then to propagate the multi-target probability density $f_{k|k}(X|Z^{(k)})$ by using the Bayesian set equations at every time step $k$:

$$f_{k+1|k}(X|Z^{(k)}) = \int f_{k+1|k}(X|W) \, f_{k|k}(W|Z^{(k)}) \, \delta W,$$

$$f_{k+1|k+1}(X|Z^{(k+1)}) = \frac{f_{k+1}(Z_{k+1}|X) \, f_{k+1|k}(X|Z^{(k)})}{\int f_{k+1}(Z_{k+1}|W) \, f_{k+1|k}(W|Z^{(k)}) \, \delta W},$$

where:

$X = \{x_1, \ldots, x_n\}$ is a multi-target state, i.e. a finite set of elements $x_i$ defined on the single-target space; the state $x_i$ of a target is usually composed of its position, its velocity, etc.

$Z_{k+1} = \{z_1, \ldots, z_m\}$ is the current multi-sensor observation, i.e. a collection of measurements $z_i$ produced at time $k+1$ by all the sensors;

$Z^{(k)}$ is the collection of observations up to time $k$;

$f_{k|k}(W|Z^{(k)})$ is the current multi-target posterior density in state $W$;

$f_{k+1|k}(X|W)$ is the current multi-target Markov transition density, from state $W$ to state $X$;

$f_{k+1}(Z|X)$ is the current multi-sensor/multi-target likelihood function.
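The prediction/update structure of such a Bayesian recursion is easiest to see in a drastically simplified analogue: a single target moving on a finite set of cells, where the set integral reduces to a finite sum. The sketch below is only that analogue, not a random finite set filter; the three-cell state space, the transition matrix, and the likelihood values are assumptions made for the example.

```python
def predict(posterior, transition):
    """Prediction step: prior[x] = sum_w transition[w][x] * posterior[w]."""
    n = len(posterior)
    return [sum(transition[w][x] * posterior[w] for w in range(n)) for x in range(n)]


def update(prior, likelihood_of_z):
    """Update step: posterior[x] is proportional to likelihood(z | x) * prior[x]."""
    unnormalized = [l * p for l, p in zip(likelihood_of_z, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical single-target model on three cells.
transition = [[0.8, 0.2, 0.0],   # transition[w][x] = P(next cell = x | current cell = w)
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
posterior = [1 / 3, 1 / 3, 1 / 3]    # uniform belief at time k
likelihood_of_z = [0.1, 0.7, 0.2]    # P(observed z | target in cell x) at time k + 1

prior = predict(posterior, transition)
posterior = update(prior, likelihood_of_z)
```

In the random finite set setting, the sums become set integrals over all possible multi-target states, which is what makes the exact recursion much harder to compute.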

SequeL aims at solving problems of prediction, as well as problems of optimal and adaptive control. As such, the application domains are very numerous.

The application domains have been organized as follows:

adaptive control,

signal analysis and processing,

functional prediction,

neuroscience.

Adaptive control is an important application of the research being done in SequeL. Reinforcement learning precisely aims at controlling the behavior of systems and may be used in situations with more or less information available. Of course, the more information, the better, in which case methods of (approximate) dynamic programming may be used . But, reinforcement learning may also handle situations where the dynamics of the system is unknown, situations where the system is partially observable, and non stationary situations. Indeed, in these cases, the behavior is learned by interacting with the environment and thus naturally adapts to the changes of the environment. Furthermore, the adaptive system may also take advantage of expert knowledge when available.

Clearly, the spectrum of potential applications is very wide: as soon as an agent (a human, a robot, a virtual agent) has to take a decision, in particular in cases where it lacks some information to take the decision, this enters the scope of our activities. To exemplify the potential applications, let us cite:

game software: in the 1990s, RL was the basis of a very successful Backgammon program, TD-Gammon, which learned to play at an expert level by basically playing a very large number of games against itself;

Today, various games are studied with RL techniques.

many optimization problems that are closely related to operations research, but taking into account the uncertainty and the stochasticity of the environment: see the job-shop scheduling, or the cellular phone frequency allocation problems, and resource allocation in general;

we can also foresee that some progress may be made by using RL to design adaptive conversational agents, or system-level as well as application-level operating systems that adapt to their users' habits.

More generally, these ideas fall into what adaptive control may bring to human beings, in making their life simpler, by being embedded in an environment that is made to help them, an idea phrased as “ambient intelligence”.

The sensor management problem consists in determining the best way to task several sensors when each sensor has many modes and search patterns. In the detection/tracking applications, the tasks assigned to a sensor management system are for instance:

detect targets,

track the targets in the case of a moving target and/or a smart target (a smart target can change its behavior when it detects that it is under analysis),

combine all the detections in order to track each moving target,

dynamically allocate the sensors in order to achieve the previous three tasks in an optimal way. The allocation of sensors, and their modes, thus defines the action space of the underlying Markov decision problem.

In the more general situation, some sensors may be localized at the same place while others are dispatched over a given volume. Tasking a sensor may include, at each moment, such choices as where to point and/or what mode to use. Tasking a group of sensors includes the tasking of each individual sensor but also the choice of collaborating sensors subgroups. Of course, the sensor management problem is related to an objective. In general, sensors must balance complex trade-offs between achieving mission goals such as detecting new targets, tracking existing targets, and identifying existing targets. The word “target” is used here in its most general meaning, and the potential applications are not restricted to military applications. Whatever the underlying application, the sensor management problem consists in choosing at each time an action within the set of available actions.

Sequential decision processes are also very well known in economics. They may be used as a decision aid tool, to help in the design of social assistance schemes, or the siting of plants (see , for such applications).

Applications of sequential learning in the field of signal processing are also very numerous. A signal is naturally sequential as it flows. It usually comes from the recording of the output of sensors, but the recording of any sequence of numbers may be considered as a signal, like the evolution of stock-exchange rates with respect to time and/or place, the number of consumers at a mall entrance, or the number of connections to a web site. Signal processing has several objectives: predict, estimate, remove noise, characterize or classify. The signal is often considered as sequential: we want to predict, estimate or classify a value (or a feature) at time $t$ knowing the past values of the parameter of interest or past values of data related to this parameter.

Signals may be processed in several ways. One of the best ways is time-frequency analysis, in which the frequencies of each signal are analyzed with respect to time. This concept has been generalized to the time-scale analysis obtained by a wavelet transform. Both analyses are based on the projection of the original signal onto a well-chosen function basis. Signal processing is also closely related to the probability field, as the uncertainty inherent to many signals leads us to consider them as stochastic processes: the Bayesian framework is actually one of the main frameworks within which signals are processed for many purposes. However, there exist alternatives like belief functions. Belief functions were introduced by Dempster a few decades ago and have been successfully used in the past few years in fields, such as classification, where probability had no alternative for many years. Belief functions can be viewed as a generalization of probabilities which can capture both imprecision and uncertainty. Belief functions are also closely related to data fusion, where once more they can be considered as a serious alternative to probabilities.

One of the current trends in machine learning aims at dealing with data that are functions, rather than points or vectors. Generally speaking, functions represent a behavior (of a person, of an apparatus, or of an algorithm, or a response of a system, ...).

One application of functional prediction which is particularly emphasized these days, is the understanding of client behavior, either in material shops, or in virtual shops on the web. This understanding may then be used for different ends, such as the management of stocks according to sales, the proposition of products according to those already bought, the “instantaneous” management of some resource in the shop (advisors, cashiers, instant promotions, personalized advertisement, ...).

Machine learning methods may be used for at least two purposes in neuroscience:

as in any other (experimental) scientific domain, machine learning methods rely heavily on statistics and may therefore be used to analyze experimental data,

machine learning deals with inductive learning, that is, the ability to generalize from facts, which is considered to be one of the basic components of “intelligence”; it may thus be considered as a model of learning in living beings. In particular, temporal difference methods for reinforcement learning have strong ties with various concepts of psychology (Thorndike's law of effect and the Rescorla-Wagner law, to name the two best known).
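
The tie between temporal difference learning and the Rescorla-Wagner law can be illustrated with a minimal sketch. When a trial ends right after the reward, the TD(0) update below reduces to the Rescorla-Wagner form, in which the value changes by the learning rate times the gap between the reward and the current prediction. The stimulus name, learning rate and number of trials are illustrative assumptions.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update in place and return the prediction error."""
    # A terminal transition (next_state is None) has no future value.
    next_value = values.get(next_state, 0.0) if next_state is not None else 0.0
    td_error = reward + gamma * next_value - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


# Conditioning trials: a stimulus ("tone") is always followed by reward 1.0,
# after which the trial ends.
values = {}
first_error = td0_update(values, "tone", reward=1.0, next_state=None)
for _ in range(199):
    td0_update(values, "tone", reward=1.0, next_state=None)
```

The prediction error is largest on the first trial and shrinks as the stimulus comes to predict the reward, which is the qualitative behavior described by the Rescorla-Wagner law.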

Crazy Stone is a top-level Go-playing program that has been developed by Rémi Coulom since 2005. Crazy Stone has won several major international Go tournaments. In 2010, a license of Crazy Stone was sold to a Japanese company, Unbalance Corporation. Crazy Stone should be available for sale in Japan in 2011.

New results are organized in the following sections:

decision under uncertainty,

foundations of machine learning,

supervised learning,

unsupervised learning (clustering),

signal processing (sensor networks).

The main objective here is to use tools from statistical learning theory to derive finite-sample performance bounds for approximate dynamic programming (ADP) algorithms. The goal is to bound the performance of the policies induced by these algorithms in terms of the number of simulation samples and of the capacity and approximation power of the function and policy spaces considered. We believe that the results of this study give a better understanding of how these algorithms behave and help design them more efficiently. We derived the first performance bounds for linear function spaces for two widely used ADP algorithms: least-squares temporal-difference learning and Bellman residual minimization. We also presented the first complete analysis of classification-based policy iteration algorithms, a relatively new and not yet well-studied class of ADP methods. These algorithms work without an explicit value function representation, and cast the evaluation of the policy at each iteration, together with the generation of the next policy, as a classification problem. We have also studied algorithmic methods that can improve sample efficiency in this class of algorithms.
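
For readers unfamiliar with least-squares temporal-difference learning (LSTD), the following is a minimal textbook-style sketch with a linear function space. It is not the analyzed implementation. The two-state chain, the one-hot features and the helper names are assumptions made for illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: collect more samples or regularize")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def lstd(transitions, features, gamma):
    """Return weights theta such that V(s) is approximated by features(s) . theta.

    transitions: (state, reward, next_state) triples sampled under a fixed policy.
    """
    transitions = list(transitions)
    dim = len(features(transitions[0][0]))
    a = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    for state, reward, next_state in transitions:
        phi, phi_next = features(state), features(next_state)
        for i in range(dim):
            b[i] += phi[i] * reward
            for j in range(dim):
                # A accumulates phi(s) * (phi(s) - gamma * phi(s'))^T.
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear_system(a, b)


def one_hot(state):
    """Tabular features for a hypothetical two-state chain."""
    return [1.0 if state == i else 0.0 for i in range(2)]


# State 0 moves to state 1 with reward 0; state 1 moves to state 0 with reward 1.
samples = [(0, 0.0, 1), (1, 1.0, 0)]
theta = lstd(samples, one_hot, gamma=0.5)
```

With one-hot features and one sample per transition, the weights coincide with the exact values of the chain (2/3 and 4/3). The finite-sample bounds discussed above concern how far such estimates can be from the best approximation when samples are limited and features are not tabular.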

In this work, we consider the multi-task RL (MTRL) scenario in which the learner is provided with a number of MDPs with common state and action spaces. For any given policy, only a small number of samples can be generated in each MDP, which may not be enough to evaluate the policy accurately. In such an MTRL problem, it is necessary to identify classes of tasks with similar structure and to learn them jointly. We considered a particular class of MTRL problems in which the tasks share structure in their value functions. To allow the value functions to share a common structure, they are all assumed to be sampled from a common prior. We adopted the Gaussian process temporal-difference value function model for each task, modeled the distribution over the value functions using a hierarchical Bayesian model, and developed solutions to the following problems: (i) joint learning of the value functions (multi-task learning), and (ii) efficient transfer of the information acquired in (i) to facilitate learning the value function of a newly observed task (transfer learning).

A primary goal here is to devise RL algorithms whose sample and computational complexities do not grow rapidly with the dimension of the state space. We have particularly looked into recent directions popularized in compressive sensing concerning the preservation of properties, such as norms or inner products, of high-dimensional objects when they are projected onto possibly much lower-dimensional random subspaces. We have derived and analyzed a least-squares policy iteration algorithm with random projections.
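
The norm-preservation property mentioned above can be checked numerically with a short sketch. It assumes a Gaussian projection matrix with i.i.d. entries of variance 1/d, where d is the low dimension. The dimensions and the seed are arbitrary illustrative choices, and the code only shows the projection step, not the policy iteration algorithm itself.

```python
import math
import random


def random_projection_matrix(low_dim, high_dim, rng):
    """Gaussian matrix with i.i.d. N(0, 1/low_dim) entries."""
    scale = 1.0 / math.sqrt(low_dim)
    return [[rng.gauss(0.0, scale) for _ in range(high_dim)] for _ in range(low_dim)]


def project(matrix, vector):
    """Multiply the projection matrix by a high-dimensional vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def norm(vector):
    return math.sqrt(sum(v * v for v in vector))


rng = random.Random(0)
high_dim, low_dim = 500, 100
x = [rng.uniform(-1.0, 1.0) for _ in range(high_dim)]
matrix = random_projection_matrix(low_dim, high_dim, rng)
# The ratio concentrates around 1 as low_dim grows.
ratio = norm(project(matrix, x)) / norm(x)
```

In a least-squares RL algorithm, the same kind of matrix would be applied to a high-dimensional feature vector, so that the linear system to solve has the size of the low dimension only.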

Mohammad Ghavamzadeh continued his collaboration with Yaakov Engel (Haifa, Israel) on this topic. In this work, we used Bayesian reasoning to develop more sample-efficient policy gradient and actor-critic algorithms. We proposed a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates, resulting in faster convergence than the conventional Monte-Carlo-based policy gradient and actor-critic algorithms. Moreover, estimates of the natural gradient and a measure of the uncertainty in the gradient estimates, namely the gradient covariance, are provided at little extra cost.

Mohammad Ghavamzadeh continued his collaboration with Amir-massoud Farahmand and Csaba Szepesvári at the University of Alberta, Canada, and Shie Mannor at Technion, Israel, on using regularization methods for automatic model selection for value function approximation in RL. We have devised and analyzed the first $\ell_2$-regularized RL algorithms by adding $\ell_2$-regularization to three well-known ADP algorithms: fitted Q-iteration, modified Bellman residual minimization, and least-squares temporal-difference learning. The designed algorithms work in both linear and reproducing kernel Hilbert spaces.

The first months of work on this project, during the Fall of 2009, led us to the conclusion that substantial work on product recommendation was required to help such a virtual seller. We therefore focused on this issue, considering that recommendation is a very general setting that can be tailored to solve many problems (ad selection on the web, for instance, see below), and that our group needs to become better acquainted with this domain of research.

In 2010, this work was mostly of a technological nature. We designed a recommendation system to recommend products on a commercial website. This recommendation system comes as a plug-in for the Firefox web browser that can be enabled at will, and that automatically embeds recommended products in product pages.

See also the contract section of the report for specific details about the contract itself.

We continued the work initiated in 2009 on the selection of displayed ads on web pages. We have been able to propose algorithms that significantly improve the resolution of this problem. In particular, we have shown that optimizing advertisement display, handling finite budgets and finite lifetimes in a dynamic and non-stationary setting, is feasible within realistic computational time constraints (such as serving several dozen ads per second). We have also given some insights into what can be gained by handling these constraints, depending on the properties of the advertisements to display.

Furthermore, Orange provided us with some web log files related to the ad service. We have begun to mine these web logs to get more accurate figures about real data, and about the real behavior of human beings facing ads on web pages. However, due to the enormous size of these logs, the work has not gone as far as we wished. We have acquired a computer with a very large main memory (256 Gbytes) to handle such large datasets. In the coming years, we wish to work on very large datasets (terabytes and more), so this work is a first step in that direction.

The problem of sequence prediction consists in forecasting, at each time step n, the probabilities of the next outcome of the observed sequence of data. In the most general formulation of the problem, we assume that we are given a set of probability measures (on the space of infinite sequences). We can then assume that the sequence is generated by an unknown measure that belongs to this set.

This general formulation is motivated by the diversity of sequential prediction problems: they include the analysis of biological, financial, textual or web-generated data, to mention a few. Naturally, one has to have different models for these problems, and therefore one is interested in finding a general procedure for constructing a predictor, given only some weak probabilistic constraints on the data; this is formalized by saying that the data-generating process comes from a known but arbitrary family of measures.

It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family.

The realizable case of the sequence prediction problem is when the measure generating the data belongs to an arbitrary but known class of process measures. The non-realizable case is when this measure is completely arbitrary, but the prediction performance is measured with respect to a given set of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists, and in finding these solutions. In this work we show that if the quality of prediction is measured by total variation distance, then these problems coincide, while if it is measured by expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists, it can be obtained as a Bayes mixture over a countable subset of the class. As an illustration of the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes.
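
To illustrate what a Bayes mixture predictor looks like, here is a minimal sketch for a deliberately simple class: a finite set of i.i.d. Bernoulli measures with a uniform prior. The results above concern far more general (non-i.i.d., countable) classes. The class, the prior and the function name are assumptions for illustration only.

```python
def bayes_mixture_predict(history, biases, prior=None):
    """Probability that the next bit is 1 under a Bayes mixture of Bernoulli measures.

    biases: candidate values of P(bit = 1), each strictly between 0 and 1,
    one per measure in the (finite) class.
    """
    weights = list(prior) if prior is not None else [1.0 / len(biases)] * len(biases)
    for bit in history:
        # Posterior weight is proportional to prior weight times likelihood.
        weights = [w * (p if bit == 1 else 1.0 - p) for w, p in zip(weights, biases)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # The mixture forecast averages the forecasts of the individual measures.
    return sum(w * p for w, p in zip(weights, biases))


biases = [0.2, 0.5, 0.8]
before_data = bayes_mixture_predict([], biases)
after_ones = bayes_mixture_predict([1] * 30, biases)
```

As data accumulate, the posterior weights concentrate on the measures that explain the observations best, so the mixture forecast approaches theirs. In the realizable case this is the mechanism by which the mixture predicts as well as the unknown generating measure.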

We have developed a new theoretical framework that has allowed us to solve some classical problems of mathematical statistics in a radically more general setting. Namely, the data are assumed to be generated by a stationary ergodic process (or processes, depending on the problem), and no assumptions of independence, mixing rates, etc., and no parametric assumptions, are made. The results obtained include a general hypothesis testing procedure, a consistent change point estimator, and a consistent classification procedure. Previous results on these problems concerned only much more restricted settings (e.g. i.i.d. data).

We have shown that consistent homogeneity testing is impossible in this setting, which means that given two growing samples of data which are only known to be generated by stationary ergodic processes, one cannot in general tell whether they are generated by the same or by different process distributions, even in the weakest asymptotic setting, and even if the processes are binary-valued. This is particularly remarkable in view of our result that establishes a consistent change point estimator. This also solves an open problem about discrimination between ergodic processes.

Revisiting Osborne's papers on the resolution of the LASSO problem, M. Loth proposed a new algorithm, named the Iso-Regularization descent, to solve this problem. To our knowledge, this algorithm is currently the most efficient one known; in particular, it is more efficient than cyclic coordinate descent, and it computes the regularization path of the LASSO as efficiently as the LARS. This algorithm is also able to solve other regularized problems, such as the grouped LASSO and the elastic net. A complete presentation of this algorithm will appear in Manuel Loth's PhD dissertation, in early 2011, and will be submitted to a journal.
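
For reference, the baseline mentioned above, cyclic coordinate descent for the LASSO, can be sketched in a few lines. This is the standard soft-thresholding scheme, not the Iso-Regularization descent. The objective scaling, the fixed number of sweeps and the toy data are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, and clip to zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(x_rows, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n_features = len(x_rows[0])
    coef = [0.0] * n_features
    residual = list(y)  # residual = y - X @ coef, kept up to date
    col_sq_norm = [sum(row[j] ** 2 for row in x_rows) for j in range(n_features)]
    for _ in range(n_sweeps):
        for j in range(n_features):
            if col_sq_norm[j] == 0.0:
                continue
            # Correlation of column j with the residual computed without feature j.
            rho = sum(row[j] * (r + row[j] * coef[j]) for row, r in zip(x_rows, residual))
            new_coef = soft_threshold(rho, lam) / col_sq_norm[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                residual = [r - row[j] * delta for row, r in zip(x_rows, residual)]
                coef[j] = new_coef
    return coef


# Orthonormal toy design: the solution is the soft-thresholded response.
design = [[1.0, 0.0], [0.0, 1.0]]
response = [3.0, 0.5]
coef = lasso_coordinate_descent(design, response, lam=1.0)
```

Each call solves the problem for a single value of the regularization parameter, whereas path algorithms such as the LARS follow the solution as this parameter varies.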

The work led by Hachem Kadri on functional regression made much progress in 2010. In our work, “functional” regression means that we consider regression problems, and more generally supervised learning problems, in which the observations and the response(s) are functions. The usual approach to this problem is to consider vectors instead of functions. A kernel approach for this purpose involves functions acting on functions, that is, operators; moreover, to be valid, these operators must satisfy certain properties. Exhibiting operators that satisfy these properties is difficult, but it is a basic requirement if we want to use such an approach in any application. Different functional kernels have been exhibited, and different algorithms to solve the minimization problem have been proposed. The multi-task setting, in which more than one functional response has to be predicted, has also been investigated. An application to speech inversion has been tackled. To compare our functional approach with the more traditional vector-based approach, we have recently written a paper describing the resolution of such a speech inversion task, which exhibits state-of-the-art performance.

As a follow-up to the successful work performed in 2009 related to computer graphics, experimental work has been carried out to investigate the potential of the ECON algorithm for representing photometric solids, compared with a neural network approach.

Jérémie Mary collaborated informally with an INSERM lab (ERI-12) of the University of Amiens on the analysis of pictures of cells, in order to detect effects on the cellular mobility of muscles (based on the observation of vinculin and actin). Another piece of work, on the analysis of human gesture, was conducted with the psychology lab of the University of Lille 3 by Geoffrey Megardon under the supervision of J. Mary.

Some software has been developed by Jérémie Mary and Antoine Chamot to optimize thermal models which can be used with the software Energy Plus. It achieved a drop in error rate of 40% versus full human modelling, on the data collected by Effigenie (see also Section ).

We have applied the approach to the statistical analysis of time series described earlier in this report to the problem of clustering time series samples. Thus, we have considered the problem of clustering for the case when each data point is a sample generated by a stationary ergodic process. We proposed a very natural asymptotic notion of consistency, and showed that simple consistent algorithms exist under the most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We showed that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.

An implementation of the “magic distance” of Daniil Ryabko (the empirical estimate of distributional distance) has been made by Jérémie Mary. The clustering process works well on some artificial ergodic data, and we are looking for some real ergodic data (ideally non-Markovian and continuous) to test the developed algorithm on.
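
As a rough idea of what an empirical estimate of a distributional distance computes, the sketch below compares the frequencies of short words in two finite-alphabet sequences and weights word length k by 2^(-k). It is a simplified, truncated illustration under our own assumptions (finite alphabet, maximal word length of 3, these particular weights), not the implementation mentioned above.

```python
from collections import Counter


def word_frequencies(sequence, length):
    """Empirical frequency of every word of the given length in the sequence."""
    n_windows = len(sequence) - length + 1
    if n_windows <= 0:
        return {}
    counts = Counter(tuple(sequence[i:i + length]) for i in range(n_windows))
    return {word: count / n_windows for word, count in counts.items()}


def empirical_distributional_distance(seq_a, seq_b, max_length=3):
    """Weighted sum, over word lengths, of the gaps between word frequencies."""
    distance = 0.0
    for length in range(1, max_length + 1):
        freq_a = word_frequencies(seq_a, length)
        freq_b = word_frequencies(seq_b, length)
        words = set(freq_a) | set(freq_b)
        gap = sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in words)
        distance += 2.0 ** (-length) * gap
    return distance


alternating = [0, 1] * 50
shifted = [1, 0] * 50   # same process, different starting point
constant = [0] * 100    # a different process

same = empirical_distributional_distance(alternating, alternating)
near = empirical_distributional_distance(alternating, shifted)
far = empirical_distributional_distance(alternating, constant)
```

A clustering procedure can then group samples whose pairwise distances are small, for example by assigning each sample to the closest of a set of well-separated cluster centers.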

The aim of this work is to manage a set of sensors to track vehicles or groups of people in land applications. Our work focuses on sensor management in the framework of random finite sets, where the Probability Hypothesis Density (PHD) is a well-known method for single-sensor multi-target tracking problems in a Bayesian framework, but where the extension to the multi-sensor case remains a challenge. We have proposed an extension of Mahler's work to the multi-sensor case by providing an expression of the true PHD multi-sensor data update equation. Then, based on the configuration of the sensors' fields of view (FOVs), a joint partitioning of both the sensors and the state space provides an equivalent yet more practical expression of the data update equation, allowing a more effective implementation in specific FOV configurations. This work is done in collaboration with Thales Communications. Besides this main point, we have finalized the optimization of the detection step of a radar.

This work is done in collaboration with Prof. Carl Haas of the University of Waterloo (Canada) and is a continuation of previous research: how can we automatically track building materials on a construction site? This is a real problem because a lot of time (hence money) is lost finding materials that have often been moved. The ability to detect dislocations automatically for tens of thousands of items can ultimately improve project performance significantly. The proposed solution is to equip each piece with an RFID tag and each person working on the construction site with an RFID receiver, a GPS for localization, and a transmitter. We obtained a PICS (International Project for Scientific Cooperation) from the CNRS in 2008 for 3 years to work on this. During the past two years, we have developed a method based on belief functions to track the materials. In 2010 we focused on dislocation detection performance by tuning the basic belief masses. ROC curves obtained on experimental data show a real improvement at low false alarm rates.

Today, Global Navigation Satellite Systems (GNSS) have penetrated the transport field through applications such as the monitoring of containers. These applications do not necessarily require high availability, integrity and accuracy from the positioning system. For safety applications (such as the complete guidance of autonomous vehicles), the performance requirements are more stringent. Indeed, sensors may deliver very erroneous measurements because of harsh external conditions that significantly reduce the possibility of receiving direct signals. The consequences of environmental obstructions are the unavailability of the service and the reception of reflected signals, which degrades in particular the accuracy of the positioning. NLOS (Non Line Of Sight) signals, i.e. signals received after reflections on the surrounding obstacles, frequently occur in dense environments and degrade localization accuracy because the delays observed on the propagation time measurement create an additional error on the pseudorange estimation. In previous years we proposed new algorithms to improve localization precision. These algorithms are based on two principles: a jump multi-model approach and a joint state-noise density estimation. This year we have focused on an approach using a Dirichlet Process Mixture to track the noise density in urban canyons while estimating the position of the vehicle. The algorithms have been validated on real data.

The term "Internet of Things" has come to describe a number of technologies and research disciplines that enable the Internet to reach out into the real world of physical objects. Technologies like RFID, short-range wireless communications, real-time localization and sensor networks are now becoming increasingly common, bringing the Internet of Things into commercial use. In such applications, the data sent by one *thing* to another may generate impulse noise in the reception channel of neighboring objects. The noise appearing in such applications can be considered as $\alpha$-stable. In this context, we have tackled the problem of interference mitigation in ad hoc networks, where the multiple access interference (MAI) is known to be of an impulsive nature. The conventional Gaussian assumption therefore cannot be used to model this type of interference; in contrast, it can be accurately modeled by stable distributions. Here, this issue is addressed within an Orthogonal Frequency Division Multiplexing (OFDM) transmission link, assuming a symmetric $\alpha$-stable model for the signal distortion due to MAI. We have proposed a method for the joint estimation of the transmitted multicarrier signal and the noise parameters. Based on sequential Monte Carlo (SMC) methods, the proposed scheme allows online estimation using a Rao-Blackwellized particle filter. At the time of writing, these results have been submitted to the ICASSP conference.
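
To give a feel for the impulsive behavior of stable noise, the following sketch draws symmetric stable samples with the Chambers-Mallows-Stuck transform (unit scale, zero location). It only illustrates the noise model, not the estimation scheme described above, and the function names and parameter values are ours.

```python
import math
import random


def sas_transform(alpha, u, w):
    """Map u in (-pi/2, pi/2) and w > 0 to a standard symmetric alpha-stable value.

    Chambers-Mallows-Stuck transform for the symmetric case (zero skewness).
    """
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    scale_term = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    tail_term = (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return scale_term * tail_term


def sample_sas(alpha, size, rng):
    """Draw `size` symmetric alpha-stable samples."""
    samples = []
    for _ in range(size):
        u = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
        w = rng.expovariate(1.0)                    # unit-mean exponential
        samples.append(sas_transform(alpha, u, w))
    return samples


rng = random.Random(0)
impulsive_noise = sample_sas(1.5, 1000, rng)
```

With a characteristic exponent of 2 the transform yields Gaussian samples (of variance 2), and with an exponent of 1 it yields Cauchy samples. Smaller exponents give heavier tails, hence more frequent large impulses.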

A contracted, and funded, collaboration has begun this Fall between SequeL and a company (SME) named “Addressing Business” located in Roubaix. The aim of this contract is to design and implement a software prototype. For confidentiality and competitiveness reasons, we will not detail this collaboration further than mentioning that it is related to recommendation systems, however in a far from academic setting (data are large and complex, there is a sequential aspect in the data, ...).

Two contracts are ongoing with Orange Labs.

First, there is an on-going CIFRE contract, funding a PhD on sequential supervised learning (Ch. Salperwyck, 2009-2012).

Second, there is a one-year CRE (externalized research contract) that was negotiated and signed in late 2010. This contract deals with the study of sequential machine learning under constraints, with application to the ad selection problem.

Last year's work by Jérémie Mary on adaptive quizzes has been polished and is now used in production.

Effigenie is a future start-up (to be created in January 2011) that plans to sell a solution to optimize the thermal control of a building with respect to its planned utilization and the weather. Preliminary tests on a real building during winter 2010 allow us to expect savings of around 20% in energy consumption. The approach requires a good thermal model of the building. This is a problem, as building such a model is time-consuming and requires a human expert. Jérémie Mary therefore worked on the day-after-day optimization of a rough model. Using this kind of approach, we were able to optimize models and to reach a prediction error 40% lower than that of a hand-made model. The model is quite hard to optimize because there are more than one hundred variables and each evaluation takes several seconds. Next year, we plan to improve the optimization further by making more local adjustments, and to test the new model in winter. Another promising development is to use RL methods to control the building without having to build a model. Such a solution would be a fast and very low-cost way to improve the efficiency of thermal control.

Unbalance Corporation is a Japanese company that bought a license of Crazy Stone in 2010. Unbalance specializes in selling games.

SequeL is taking part in a project named “Ubiquitous Virtual Seller” (VVU) of the Pôle de Compétitivité “Industrie du Commerce” (PICOM). See more details in Section .

SequeL is taking part in a project named “Ubiquitous Virtual Seller” (VVU) of the Pôle de Compétitivité “Industrie du Commerce” (PICOM). This project began on September 1st, 2009 and will last 2 years. The VVU project involves three computer science laboratories (Laboratoire d'Informatique Fondamentale de Lille, INRIA Lille Nord Europe, and Mines de Douai), a marketing school (Skema-Lille), and private companies (Becquet, Oxylane, France Telecom, Artificial Solutions, Nextstage). In this project, we are funded by the Région Nord-Pas de Calais and the FEDER; funding is mostly for a post-doc over a period of 18 months. The work involves a close collaboration with other computer science teams at the Laboratoire d'Informatique Fondamentale de Lille and the Mines de Douai. More details about the 2010 activities on this contract are given elsewhere in this report.

This was the first year of this ANR project. In participating in this project, our goal is at least two-fold: getting acquainted with the management of large datasets, both at the fundamental level and at the technical level; and getting acquainted with working on data more complex than mere real vectors or real functions, such as qualitative data, trees, or graphs. The underlying assumption is also that data come as a flow. Notably, our work on ad selection with Orange Labs led us to handle very large web log files, which is in line with work on very large streams of data. The aforementioned contract with Addressing Business is also perfectly compatible with this policy.

Olivier Nicol is beginning his PhD in this context. Furthermore, Ph. Preux is co-advising Gabriel Arnold-Dulac, who is beginning his PhD in the Malire team of the LIP'6, under the supervision of P. Gallinari and L. Denoyer (both of whom participate in the Lampada project).

The work on sensor management went on this year, focusing on the extension of the PHD filter to the multisensor case. This work is carried out as part of the PhD thesis of Emmanuel Delande (DGA/CNRS grant), in collaboration with Thales Communications.

MaBI stands for “Machine Learning for Brain Computer Interfaces”; the scientific coordinator of this ARC project is Stéphane Canu. Members of SequeL involved: Rémi Munos, Daniil Ryabko, Philippe Vanheeghe, and Emmanuel Duflos. The ARC MaBI started in 2010 for 2 years.

EXPLO-RA, an acronym for EXPLOration - EXPLOitation for efficient Resource Allocation with Applications to optimization, control, learning, and games, is an ongoing 3-year ANR-funded project which started in 2009. This is a collaboration between 2 INRIA project teams (SequeL and TAO), HEC Paris (GREGHEC), Les Ponts (CERTIS), Paris 5 (CRIP5), and the Université Paris Dauphine (LAMSADE). Rémi Munos is the coordinator.

The “Brain computer co-adaptation for better interfaces” project started at the end of 2009 (for 4 years). It is carried out in collaboration with the INRIA Odyssee project (Maureen Clerc), the INSERM U821 team (Olivier Bertrand), the Laboratory of Neurobiology of Cognition (CNRS) (Boris Burle) and the Laboratory of Analysis, Topology and Probabilities (CNRS and University of Provence) (Bruno Torresani). Rémi Munos is the SequeL coordinator.

Emmanuel Duflos and Hachem Kadri are collaborating with Pr. Stéphane Canu on Functional RKHS.

In 2009, SequeL joined the Pascal-2 European network of excellence dedicated to machine learning. SequeL has created a new node of this NoE in collaboration with the EPI Mostrare and Stéphane Canu's group in Rouen. R. Munos is the head of this node.

Sparse Reinforcement Learning in High Dimensions, with Shie Mannor (Technion, Israel), Mohammad Ghavamzadeh and Rémi Munos. This is a 2-year project that started in November 2009.

The title of the joint team is Decision-making under Uncertainty with Applications to Reinforcement Learning, Control, and Games. The coordinators on the INRIA side are Mohammad Ghavamzadeh and Rémi Munos. The coordinator on the University of Alberta side is Csaba Szepesvári. Other collaborators at the University of Alberta are Prof. Richard Sutton and Amir-massoud Farahmand.

A “Programme Interdisciplinaire de Coopération Scientifique” (PICS) is running over the period 2008–2010 which concerns Ph. Vanheeghe, and E. Duflos, in relation with the Centre for Pavement and Transportation Technology (CPATT), headed by prof. Carl Haas at the University of Waterloo, Canada.

The optimal use of the data provided by the sensors must necessarily lie within a dynamic process suitable for controlling the acquisition of information. This project proposes to define principles and methods for the management of multisensor systems in the context of civil engineering. This work requires the development of specific methodological tools. These tools will be tested on a real civil engineering application, the characterization of new materials for highway pavement. Multisensor management is integrated in this very ambitious Canadian civil engineering project. The Canadian team will carry out the instrumentation and the validation, whereas the definition of the tools and methods will be carried out in tight partnership and controlled by the French team.

Mohammad Ghavamzadeh has collaborated with Yaakov Engel on the topic of *regularized reinforcement learning* over the last four years. This year, we have two journal papers on this topic that will be submitted soon.

Mohammad Ghavamzadeh has also been working with Prof. Shie Mannor on the topic of *Bayesian reinforcement learning* for the last five years, on the topic of *regularized reinforcement learning* for the last three years, and on the topic of *reinforcement learning in high dimensions* for the last year. On the first topic, we have a journal paper (survey) in preparation this year. On the second topic, we have two journal papers in preparation this year. Finally, on the third topic, we are co-PIs of a *PASCAL2 pump-priming program*.

Prof. Pascal Poupart and Mohammad Ghavamzadeh have been collaborating on the topic of *Bayesian reinforcement learning* for the last four years. This year, we have a journal paper in preparation on this topic.

Emmanuel Duflos and Philippe Vanheeghe visited Prof. Carl Haas at the University of Waterloo (Canada) twice, from September 5th to September 10th and from December 4th to December 10th.

Rémi Coulom has been working with Shih-Chieh Huang, a PhD student from the Department of Computer Science and Information Engineering, National Taiwan Normal University. Shih-Chieh Huang's main advisor is Professor Shun-Shii Lin, and he is co-advised by Rémi Coulom. In 2010 they worked on simulation balancing and time management. Shih-Chieh Huang also won the gold medal of the Computer Olympiad (see award section).

Rémi Coulom has been working with Łukasz Lew, a PhD student at Warsaw University, in Poland. Łukasz's main advisor is Professor Krzysztof Diks. Łukasz visited SequeL for one month in April. During his visit, he wrote a paper with Rémi Coulom about Monte-Carlo search of combinatorial games.

Awards:

Sébastien Bubeck's Ph.D. thesis, entitled “Bandits Games and Clustering Foundations”, has been awarded a Gilles Kahn 2010 prize, ranking second. This prize is awarded by Specif to the best Ph.D. theses in Computer Science in France (under the patronage of the Academy of Sciences). The thesis supervisor was Rémi Munos.

Gold medal at the Computer Olympiad: The Go-program developed by Shih-Chieh Huang under the supervision of Rémi Coulom, Erica, won the gold medal in the 2010 Computer Olympiad in Kanazawa, Japan. The Computer Olympiad is regarded as the most important international computer-Go tournament. All the major commercial and academic programs participated.

2008 ICGA Journal Award. This award is given every year by the ICGA (International Computer Games Association) for the best paper of a first-time author, published in the ICGA Journal. The award for year 2008 was given in 2010 to a paper by Rémi Coulom (that was actually published in the December 2007 issue).

Victor Gabillon, Jérémie Mary and Philippe Preux have received a best paper award at the conference “Extraction et Gestion des Connaissances” for their work on the ad selection problem on Internet portals.

Alessandro Lazaric gave a tutorial at AAMAS 2010: Reinforcement Learning and Beyond.

Participation in the program committees of international conferences:

R. Coulom: CG'2010: International Conference on Computers and Games, Kanazawa, Japan

R. Coulom: TAAI'2010: Technologies and Applications of Artificial Intelligence, Taipei, Taiwan

E. Duflos and Ph. Vanheeghe: Fusion'2010 workshop on the Theory of Belief Function (Brest, April 1-2, 2010),

E. Duflos: workshop on the Theory of Belief Function (Brest, April 1-2, 2010),

M. Ghavamzadeh: European Conference on Machine Learning (ECML 2010), International Conference on Machine Learning (ICML 2010)

R. Munos: Area chair for NIPS 2010

Ph. Preux: ICML 2010, CAP 2010, EGC 2010.

D. Ryabko: UAI 2010.

GDR ISIS: following a request of Jean-Yves Tourneret (in charge of theme A in the GDR ISIS), Emmanuel Duflos organized a one-day workshop in June 2010: *Advances in Signal Processing and Data Fusion for Localisation*. 68 researchers attended this workshop, which was co-organized with the GT2 of the GDR Robotics (with Roland Chapuis, LASMEA).

Invited talks:

R. Munos was an invited speaker (in addition to the conferences) at Journées MAS 2010 (Modélisation Aléatoire et Statistique), SMILE 2010 (Statistical Machine Learning in Paris), and a NIPS 2010 Workshop (Learning and planning from batch time series data).

O. Maillard was an invited speaker at GDR ISIS “Apprentissage et parcimonie”.

M. Ghavamzadeh gave an invited talk at the University of Alberta AI Seminar (host: Prof. Csaba Szepesvári), 2010.

D. Ryabko gave an invited talk at the GdR ISIS workshop “Journée spéciale Stéganographie et Stéganalyse”.

Jérémie Mary gave invited talks at Bilab (ENST Paris) and SMILE 2010 (Statistical Machine Learning in Paris).

International journal and conference reviewing activities (in addition to the conferences whose program committees we belong to):

E. Duflos: IEEE Transactions on Signal Processing, International Journal of Approximate Reasoning, Information Fusion.

M. Ghavamzadeh: Annual Conference on Neural Information Processing Systems (NIPS 2010), Neurocomputing, Machine Learning Journal (MLJ), Journal of Machine Learning Research (JMLR), Journal of Artificial Intelligence Research (JAIR).

R. Munos: Machine Learning, IEEE Transactions on Automatic Control, Revue d'Intelligence Artificielle, ALT 2010, ICML 2010.

Ph. Preux: NIPS 2010, STACS 2010, ECML 2010

D. Ryabko: IEEE Trans. Inf. Th., NIPS 2010.

Evaluation activities, expertise

E. Duflos and Ph. Vanheeghe have reviewed proposals for the ANR programs

Ph. Preux has reviewed project proposals for the ECOS-Sud program (France), ANRT (France), and the IWT (Belgium)

Ph. Preux is an expert for the AERES. He evaluated master's programs in computer science, as well as a laboratory.

Ph. Preux is a member of the “Gilles Kahn”/Specif jury, which selects an outstanding PhD thesis in computer science each year.

Ph. Preux served as president of the committee for an INRIA-Université de Lille 3 chair in computer science/statistical learning (Spring 2010) and of the committee to recruit an assistant professor in statistical learning (Fall 2010), and as a member of the committee to recruit 4 assistant professors in computer science at the Université d'Artois (Spring 2010).

R. Munos has been a member of the following committees:

Member of the DR2 INRIA recruitment jury, 2010

Member of the CR2 INRIA Lille recruitment jury, 2010

Scientific organizer of the INRIA evaluation seminar for the theme “Optimisation, apprentissage et méthodes statistiques” (optimization, learning and statistical methods), March 2010.

Member of the steering committee (Comité d'animation) of the INRIA thematic domain “Applied Mathematics, Computation and Simulation”.

Delegate (on behalf of the Commission d'Evaluation) for the creation of the INRIA projects CLASSIC and SIERRA

Recommendation letter for a promotion concerning a Senior Lecturer position (kept anonymous) at the Technion, Israel

Referee for the PASCAL2 Pump Priming programme

Participation in PhD and HDR juries:

R. Munos was a rapporteur for the PhD theses of Jia Yuan Yu (McGill University), Olga Kozlova (Université Paris 6), Christophe Thiery (INRIA Nancy), and Sarah Philippi (Telecom ParisTech).

R. Munos was a member of the HDR committee of Jean-Yves Audibert (Ecole Nationale des Ponts et Chaussées).

Emmanuel Duflos was *rapporteur* for the PhD theses of Gregory Mallet (INSA Rouen) and Adrien Chen (ENAC).

We list the classes that are related to the research activities in SequeL that were going on in 2010.

Rémi Munos teaches a class in reinforcement learning in the M2 “Mathematics-Vision-Learning” (MVA) at the ENS-Cachan.

Philippe Preux teaches:

in the Master 2 MIASHS (Maths and computer science for humanities): 2 data mining classes (data mining, and web mining)

in a Master 2 of psychology (neuro-cognitive processes), and in a Master 1 of psychology (analysis of behavior): machine learning, reinforcement learning, models of adaptive behavior, models of learning in animals (including human beings).

Jérémie Mary is head of the speciality “Informatique et Documents” of the Master MIASHS.

Jérémie Mary supervised 4 M2 students and 2 M1 students during their internships in external companies.

Otherwise, each of the 4 professors and assistant professors of the SequeL team teaches 192 hours per year. Taught classes include machine learning, data mining, and signal processing classes.