SequeL is a joint project with the LIFL (UMR 8022 of the CNRS and the Universities of Lille 1 and Lille 3) and the LAGIS (UMR 8021 of the École Centrale de Lille and the University of Lille 1).

SequeL means “Sequential Learning”. As such, SequeL focuses on the task of learning in artificial systems (either hardware or software) that gather information over time. Such systems are named *(learning) agents* in the following.

For the purpose of model building, the agent needs to gather the information collected so far in some compact representation and combine it with newly available data.

The acquired data may result from an observation process of an agent in interaction with its environment (the data thus represent a perception). This is the case when the agent makes decisions (in order to fulfill a certain goal) that impact the environment, and thus the observation process itself.

Hence, in SequeL, the term **sequential** refers to two aspects:

the **sequential acquisition of data**, from which a model is learned (supervised and unsupervised learning),

the **sequential decision-making task**, based on the learned model (reinforcement learning).

We exemplify these various problems:

*Supervised learning* tasks deal with the prediction of some response given a certain set of observations of input variables and responses. New sample points keep on being observed.

*Clustering* tasks deal with grouping objects that arrive as a flow. The (unknown) number of clusters typically evolves over time, as new objects are observed.

*Control* tasks deal with finding a policy to control some system which has to be optimized (see ). We do not assume the availability of a model of the system to be controlled.

In all these cases, we assume that the process can be considered stationary for at least a certain amount of time, and only slowly evolving.

We wish to have anytime algorithms, that is, at any moment a prediction may be required or an action may be selected, making full use, and hopefully the best use, of the experience already gathered by the learning agent.

The perception of the environment by the learning agent (using its sensors) is generally not the best one for making a prediction, nor for taking a decision (we deal with Partially Observable Markov Decision Problems). So, the perception has to be mapped in some way to a better, relevant state (or input) space.

Finally, an important issue of prediction regards its evaluation: how wrong may we be when we perform a prediction? For real systems to be controlled, this issue cannot simply be left unanswered.

To sum up, in SequeL, the main issues regard:

the learning of a model: we focus on models that map some input space to the reals,

the observation to state mapping,

the choice of the action to perform (in the case of sequential decision problems),

the bounding of the performance,

the implementation of usable algorithms,

all that being understood in a *sequential* framework.

A major activity of the year is the incubation of the Predict & Control spin-off under the responsibility of Pierre-Arnaud Coquelin.

Predict & Control was officially created in December 2007. A significant part of the year has been dedicated to the creation of the spin-off, the definition of its relations with SequeL and INRIA, the search for scientific advisors, and the search for companies to work with.

Predict & Control aims at providing expertise in machine learning, targeting in particular the field of commerce. This is the topic of one of the “pôles de compétitivité” of the Nord-Pas de Calais region. The expertise may span from a feasibility study to the design and realization of software to solve a particular problem.

SequeL is primarily grounded in two domains:

Markov decision problems, which provide the general setting of the problem we want to solve,

statistical learning, which provides the general concepts and tools to solve this problem.

We briefly present key ideas below.

Sequential decision problems occupy the heart of the SequeL project .

A Markov Decision Process is defined as the tuple (X, A, P, r), where X is the state space, A is the action space, P is the probabilistic transition kernel, and r is the reward function. For the sake of simplicity, we assume in this introduction that the state and action spaces are finite. If the current state (at time t) is x ∈ X and the chosen action is a ∈ A, then the Markov assumption means that the transition probability to a new state x' ∈ X (at time t+1) only depends on (x, a). We write p(x'|x, a) for the corresponding transition probability. During a transition (x, a) → x', a reward r(x, a, x') is incurred.
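As a small illustration of this formalism, a toy finite MDP and its sampling step can be sketched as follows; the two-state chain, its transition probabilities, and its rewards are hypothetical examples, not taken from the report.

```python
import random

# A toy finite MDP: states 0, 1 and actions "stay", "move".
# p[(x, a)] maps each next state x' to its probability p(x'|x, a).
p = {
    (0, "stay"): {0: 0.9, 1: 0.1},
    (0, "move"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9},
    (1, "move"): {0: 0.8, 1: 0.2},
}

def reward(x, a, x_next):
    # Reward incurred during the transition (x, a) -> x'.
    return 1.0 if x_next == 1 else 0.0

def step(x, a, rng=random):
    # Sample x' ~ p(.|x, a) and return (x', r(x, a, x')).
    u, acc = rng.random(), 0.0
    for x_next, prob in p[(x, a)].items():
        acc += prob
        if u <= acc:
            return x_next, reward(x, a, x_next)
    return x_next, reward(x, a, x_next)  # numerical safety fallback
```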

In the MDP (X, A, P, r), each initial state x_0 and action sequence a_0, a_1, ... gives rise to a sequence of states x_1, x_2, ..., satisfying P(x_{t+1} = x' | x_t, a_t) = p(x'|x_t, a_t), and to a sequence of rewards r_1, r_2, ... defined by r_t = r(x_t, a_t, x_{t+1}) (each r_t is itself a random variable).

The history of the process up to time t is defined to be H_t = (x_0, a_0, ..., x_{t-1}, a_{t-1}, x_t). A policy π is a sequence of functions π_0, π_1, ..., where π_t maps the space of possible histories at time t to the space of probability distributions over the action space A. To follow a policy means that, at each time step, if the process history up to time t is (x_0, a_0, ..., x_t), then the probability of selecting an action a is equal to π_t(x_0, a_0, ..., x_t)(a). A policy is called stationary (or Markovian) if π_t depends only on the last visited state. In other words, a policy π = (π_0, π_1, ...) is called stationary if π_t(x_0, a_0, ..., x_t) = π_0(x_t) holds for all t ≥ 0. A policy is called deterministic if the probability distribution prescribed by the policy for any history is concentrated on a single action. Otherwise it is called a stochastic policy.

The goal of the Markov Decision Problem is to find a policy π that maximizes in expectation some functional of the sequence of future rewards. For example, a usual functional is the infinite-time-horizon sum of discounted rewards. For a given (stationary) policy π, we define the value function V^π(x) of that policy at a state x ∈ X as the expected sum of discounted future rewards, given that we start from the initial state x and follow the policy π:

V^π(x) = E[ Σ_{t ≥ 0} γ^t r_t | x_0 = x; π ],

where E is the expectation operator and γ ∈ (0, 1) is the discount factor. This value function gives an evaluation of the performance of a given policy π. Other functionals of the sequence of future rewards may be considered, such as the undiscounted reward (see the stochastic shortest path problems ) and average-reward settings. Note also that, here, we considered the problem of maximizing a reward functional, but a formulation in terms of minimizing some cost or risk functional would be equivalent.
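For a finite MDP and a fixed stationary policy, this value function can be computed by iterating the associated fixed-point equation V ← r + γ P V; the tiny two-state chain below is a hypothetical example used only to illustrate the computation.

```python
# Policy evaluation on a toy 2-state chain (one action per state).
# p[x] maps next states to probabilities under the policy's action;
# r[x] is the expected immediate reward from state x under the policy.
p = {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}
r = {0: 0.0, 1: 1.0}
gamma = 0.9

def evaluate_policy(p, r, gamma, n_iter=1000):
    # Iterate V <- r + gamma * P V until (numerical) convergence.
    V = {x: 0.0 for x in p}
    for _ in range(n_iter):
        V = {x: r[x] + gamma * sum(prob * V[y] for y, prob in p[x].items())
             for x in p}
    return V

V = evaluate_policy(p, r, gamma)
# State 1 is absorbing with reward 1 each step: V(1) = 1 / (1 - gamma) = 10.
```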

In order to maximize a given functional in a sequential framework, one usually applies Dynamic Programming (DP) , which introduces the optimal value function V^*(x), defined as the optimal expected sum of rewards when the agent starts from a state x. We have V^*(x) = sup_π V^π(x). Now, let us give two definitions about policies:

We say that a policy π is optimal if it attains the optimal value V^*(x) for any state x ∈ X, *i.e.*, if V^π(x) = V^*(x) for all x ∈ X. Under mild conditions, deterministic stationary optimal policies exist . Such an optimal policy is written π^*.

We say that a (deterministic stationary) policy π is greedy with respect to (w.r.t.) some function V (defined on X) if, for all x ∈ X,

π(x) ∈ arg max_{a ∈ A} Σ_{x'} p(x'|x, a) [r(x, a, x') + γ V(x')],

where arg max_{a ∈ A} f(a) is the set of a ∈ A that maximize f(a). For any function V, such a greedy policy always exists because A is finite.

The goal of Reinforcement Learning (as well as that of dynamic programming) is to design an optimal policy (or a good approximation of it).

The well-known Dynamic Programming equation (also called the Bellman equation) provides a relation between the optimal value function at a state x and the optimal value function at the successor states x' when choosing an optimal action: for all x ∈ X,

V^*(x) = max_{a ∈ A} Σ_{x'} p(x'|x, a) [r(x, a, x') + γ V^*(x')].

The benefit of introducing this concept of optimal value function relies on the property that, from the optimal value function V^*, it is easy to derive an optimal behavior by choosing the actions according to a policy greedy w.r.t. V^*. Indeed, we have the property that a policy greedy w.r.t. the optimal value function is an optimal policy.
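A minimal sketch of value iteration built on this Bellman equation, followed by the extraction of a greedy (hence optimal) policy; the two-state, two-action MDP is a hypothetical example.

```python
# Value iteration: V <- max_a sum_x' p(x'|x, a) [r(x, a, x') + gamma V(x')].
p = {
    (0, "stay"): {0: 1.0}, (0, "move"): {1: 1.0},
    (1, "stay"): {1: 1.0}, (1, "move"): {0: 1.0},
}
def r(x, a, x_next):
    return 1.0 if x_next == 1 else 0.0

states, actions, gamma = [0, 1], ["stay", "move"], 0.9

def value_iteration(n_iter=500):
    V = {x: 0.0 for x in states}
    for _ in range(n_iter):
        V = {x: max(sum(prob * (r(x, a, y) + gamma * V[y])
                        for y, prob in p[(x, a)].items())
                    for a in actions)
             for x in states}
    # A policy greedy w.r.t. V is then optimal (up to numerical error).
    policy = {x: max(actions,
                     key=lambda a: sum(prob * (r(x, a, y) + gamma * V[y])
                                       for y, prob in p[(x, a)].items()))
              for x in states}
    return V, policy

V, pi = value_iteration()
```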

In short, we would like to mention that most of the reinforcement learning methods developed so far are built on one (or both) of the two following approaches ( ):

Bellman's dynamic-programming approach, based on the introduction of the value function. It consists in learning a “good” approximation of the optimal value function, and then using it to derive a greedy policy w.r.t. this approximation. The hope (well justified in several cases) is that the performance of the policy greedy w.r.t. an approximation V of V^* will be close to optimality. This approximation of the optimal value function is one of the major challenges inherent to the RL problem. **Approximate dynamic programming** addresses the problem of estimating performance bounds (*e.g.*, the loss in performance resulting from using a policy greedy w.r.t. some approximation V instead of an optimal policy) in terms of the approximation error ||V^* - V|| of the optimal value function V^* by V. Approximation theory and Statistical Learning theory provide us with bounds in terms of the number of sample data used to represent the functions, and the capacity and approximation power of the considered function spaces.

Pontryagin's maximum principle approach, based on sensitivity analysis of the performance measure w.r.t. some control parameters. This approach, also called **direct policy search** in the RL community, aims at directly finding a good feedback control law in a parameterized policy space, without trying to approximate the value function. The method consists in estimating the so-called **policy gradient**, *i.e.*, the sensitivity of the performance measure (the value function) w.r.t. some parameters of the current policy. The idea is that the optimal control problem is replaced by a parametric optimization problem in the space of parameterized policies. Deriving a policy gradient estimate then allows performing a stochastic gradient method in order to search for a locally optimal parametric policy.
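As an illustration of the policy-gradient idea, a REINFORCE-style estimator on a trivial one-step problem can be sketched as follows; the two-action setting, the softmax parameterization, and the learning rate are hypothetical choices, not the algorithms studied in the project.

```python
import math, random

# REINFORCE on a one-step problem: two actions with rewards 0 and 1.
# The policy is a softmax over a preference parameter theta[a].
def softmax(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [v / s for v in z]

def reinforce(n_iter=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(n_iter):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        rwd = float(a == 1)            # action 1 is the better arm
        # grad log pi(a) w.r.t. preference b is 1[b = a] - pi(b)
        for b in range(2):
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[b] += lr * rwd * grad
    return softmax(theta)

probs = reinforce()
# Stochastic gradient ascent concentrates probability on the better action.
```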

Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness. (source: AAAI website)

An approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an
algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/“common sense” to design an algorithm. (source:
http://

Generally speaking, a kernel function is a function that maps a couple of points to a real value. Typically, this value is a measure of similarity between the two points. Assuming a few properties on it, the kernel function implicitly defines a dot product in some function space. This very nice formal property, as well as a number of others, has ensured a strong appeal for these methods over the last 10 years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce a representation of the data in a certain “suitable” space where the problem to solve (classification, regression, ...) is expected to be simpler (non-linearity turns into linearity).

The fundamental tools used in SequeL come from the field of statistical learning . We briefly present the most important for us to date, namely kernel-based non-parametric function approximation, sequential Monte Carlo methods, and non-parametric Bayesian models.

In SequeL, the model to be learned is a real-valued function defined on a multi-dimensional space.

Many methods have been proposed for this purpose. We are looking for suitable ones to cope with the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large; these are areas where the approximation is difficult, while they are also the areas where the accuracy of the approximation should be highest to obtain a good policy (and where, otherwise, a bad choice of action may have catastrophic consequences).

For the moment, we consider non-parametric methods since they do not make any assumption about the function to learn. Locally weighted regression has yielded efficient methods to learn a policy in reinforcement learning, as well as good performance in regression settings. The kernelized version gives us wide latitude to handle sample points and combine them to obtain the approximation. To keep computation times of practical interest, a sparse representation is sought.
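As a minimal sketch of such kernel-based non-parametric approximation, a Nadaraya-Watson (locally weighted) regressor can be written as follows; the Gaussian kernel, the bandwidth, and the target function are hypothetical choices.

```python
import math

def nadaraya_watson(x_query, xs, ys, bandwidth=0.5):
    # Predict f(x_query) as a kernel-weighted average of the observed ys.
    weights = [math.exp(-((x_query - x) ** 2) / (2 * bandwidth ** 2))
               for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Noise-free samples of f(x) = x^2 on a regular grid.
xs = [i / 10 for i in range(-20, 21)]
ys = [x ** 2 for x in xs]
```

With a small bandwidth, the prediction at an interior point is close to the true function value, since only nearby samples receive significant weight.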

We currently devote a lot of effort to LARS-like approximators , which we have fitted into the reinforcement learning framework .

Sequential Monte Carlo (or particle filtering, see ) methods are currently used for various purposes in SequeL:

the estimation of the state of the agent given its current observation as well as its history;

the estimation of parameters of a model.

Numerous problems in signal processing may be solved efficiently by way of a Bayesian approach. The use of Monte Carlo methods lets us handle non-linear as well as non-Gaussian problems. In their standard form, they require the formulation of probability densities in parametric form. For instance, it is common practice to use a Gaussian likelihood, because it is handy.
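A bootstrap particle filter for a scalar state-space model can be sketched as follows; the linear-Gaussian dynamics, noise levels, and number of particles are hypothetical choices used only for illustration.

```python
import math, random

def particle_filter(observations, n_particles=1000, seed=0):
    # Bootstrap particle filter for a scalar model:
    #   x_t = 0.9 * x_{t-1} + process noise,   y_t = x_t + observation noise.
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the (assumed) dynamics.
        particles = [0.9 * x + rng.gauss(0.0, 0.5) for x in particles]
        # Weight by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * (y - x) ** 2 / 0.5 ** 2)
                   for x in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Posterior-mean estimate, then multinomial resampling.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```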

However, in some applications such as Bayesian filtering or blind deconvolution, the choice of a parametric form for the density of the noise is often arbitrary. If this choice is wrong, it may have dramatic consequences on the estimation.

To overcome this shortcoming, non-parametric methods provide another approach to this problem. In particular, mixtures of Dirichlet processes provide a very powerful formalism.

Mixtures of Dirichlet Processes are an extension of finite mixture models. Given a mixture density f and a Dirichlet process G, the resulting model is a countable mixture Σ_k w_k f(·|U_k), where the locations U_k are distributed according to a *base distribution* G_0, and where the weights w_k follow a certain *stick-breaking* law with parameter α.

A mixture of Dirichlet processes is fully parameterized by the mixture density, as well as the parameters of the Dirichlet process, namely its base distribution and the parameter of the stick-breaking law.

The class of densities that may be written as a mixture of Dirichlet processes is very wide, so that these models fit a very large range of applications.
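The stick-breaking construction mentioned above can be sketched as follows, truncated at a finite number of atoms; the Gaussian base distribution and the parameter values are hypothetical choices.

```python
import random

def stick_breaking(alpha, n_atoms, rng):
    # Draw weights w_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha).
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

def sample_dp(alpha=1.0, n_atoms=100, seed=0):
    # Truncated draw from a Dirichlet process with a N(0, 1) base distribution.
    rng = random.Random(seed)
    weights = stick_breaking(alpha, n_atoms, rng)
    atoms = [rng.gauss(0.0, 1.0) for _ in range(n_atoms)]
    return atoms, weights

atoms, weights = sample_dp()
```

The truncation keeps only the first atoms; with 100 atoms and alpha = 1, the leftover stick mass is negligible.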

Given a set of observations, the estimation of the parameters of a mixture of Dirichlet processes is performed by way of a *Markov chain Monte Carlo (MCMC)* algorithm.

SequeL aims at solving problems of prediction, as well as problems of optimal and adaptive control. As such, the application domains are very numerous.

The application domains have been organized as follows:

adaptive control,

signal analysis and processing,

functional prediction,

sensor management problem,

neurosciences.

Adaptive control is an important potential application of the research being done in SequeL. Reinforcement learning precisely aims at controlling the behavior of systems and may be used in situations with more or less information available. Of course, the more information, the better, in which case methods of (approximate) dynamic programming may be used . But reinforcement learning can also handle situations where the dynamics of the system is unknown, situations where the system is partially observable, and non-stationary situations. Indeed, in these cases, the behavior is learned by interacting with the environment and thus naturally adapts to changes in the environment. Furthermore, the adaptive system may also take advantage of expert knowledge when available.

Clearly, the spectrum of potential applications is very wide: whenever an agent (a human, a robot, a virtual agent) has to take a decision, in particular when it lacks some of the information needed to take that decision, this falls within the scope of our activities.

Applications of sequential learning in the field of signal processing are also very numerous. A signal is naturally sequential as it flows.

The signal may be mono-channel: audio, video, magnetic, or more generally electromagnetic (*e.g.*, RFID, Bluetooth, WiFi, or signals sent by GPS satellites), or of some other nature. There might also be several (multi-channel) signals of different natures.

One of the current trends in machine learning aims at dealing with data that are functions, rather than points or vectors. Generally speaking, functions represent a behavior (of a person, of an apparatus, of an algorithm, a response of a system, ...).

One application of functional prediction which is particularly emphasized these days is the understanding of client behavior, either in material shops, or in virtual shops on the web. This understanding may then be used for different ends, such as the management of stocks according to sales, the proposition of products according to those already bought, the “instantaneous” management of some resource in the shop (advisors, cashiers, instant promotions, personalized advertisement, ...).

The sensor management problem consists in determining the best way to task several sensors when each sensor has many modes and search patterns. In the detection/tracking applications, the tasks assigned to a sensor management system are for instance:

detect targets,

track the targets in the case of a moving target and/or a smart target (a smart target can change its behavior when it detects that it is under analysis),

combine all the detections in order to track each moving target,

dynamically allocate the sensors in order to achieve the previous three tasks in an optimal way. The allocation of sensors, and their modes, thus defines the action space of the underlying Markov decision problem.

In the more general situation, some sensors may be localized at the same place while others are dispatched over a given volume. Tasking a sensor may include, at each moment, such choices as where to point and/or what mode to use. Tasking a group of sensors includes the tasking of each individual sensor but also the choice of collaborating sensors subgroups. Of course, the sensor management problem is related to an objective. In general, sensors must balance complex trade-offs between achieving mission goals such as detecting new targets, tracking existing targets, and identifying existing targets. The word “target” is used here in its most general meaning, and the potential applications are not restricted to military applications. Whatever the underlying application, the sensor management problem consists in choosing at each time an action within the set of available actions.

Machine learning methods may be used for at least two purposes in neurosciences:

as in any other (experimental) scientific domain, since machine learning methods rely heavily on statistics, they may be used to analyse experimental data,

dealing with induction learning, that is, the ability to generalize from facts, which is considered to be one of the basic components of “intelligence”, machine learning may be considered as a model of learning in living beings. In particular, temporal difference methods for reinforcement learning have strong ties with various concepts of psychology (Thorndike's law of effect and the Rescorla-Wagner law, to name the two most well-known).

Some software has begun to be developed in SequeL, along different threads. For the moment, this software is still in a rather crude form. We have begun to make it available through our website and via the INRIA forge. It will be developed further in the coming years, in its functionalities as well as in its accessibility for general users (including GUIs, documentation, examples, tutorials, ...). This software falls into two varieties: either the implementation of research-level algorithms, or the implementation of software tools to make research easier.

SMO (Sequential Minimal Optimization) is a numerical optimizer for quadratic programming problems. It does not require the creation of large matrices, thus allowing consideration of problems with a few million samples (like, for instance, the automatic transcription of speech problem, see ). However, it is relatively slow. Stéphane Rossignol worked on an optimized and fast C version of SMO for the 1-class SVM problem. His software will be made available online on the SequeL website.

The kernel of a simulator of electronic radar has been developed as part of Thomas Huguerre's internship. This kernel has been developed in C++. More work is ongoing to make it usable by general users and to make it available through our website.

Crazystone is an award-winning Go-playing program, designed and developed by Rémi Coulom.

Being a research tool related to high worldwide competition, it is no longer freely available.

Brennus is a poker bot, that is, a program designed to play poker against other programs, interacting via the Internet (Brennus may play against human players as well). This is the first release of this program. Brennus is related to a new track of research in SequeL and is the result of Raphaël Maîtrepierre's research for his master's thesis, and now his PhD.

New results are organized in the following sections:

reinforcement learning,

sensor management problem,

signal processing.

We have worked on several aspects of reinforcement learning and optimal control, including the use of function approximation to represent the value function or the policy. We have worked in collaboration mainly with Csaba Szepesvári (University of Alberta, Canada), András Antos (Hungarian Academy of Sciences), Jean-Yves Audibert (CERTIS, Ecole des Ponts et Chaussées), Guillaume Deffuant and Sophie Martin (Cemagref, Clermont-Ferrand), Hasnaa Zidani (ENSTA), Olivier Bokanowski (Paris VII), Olivier Teytaud (LRI, Orsay). This work can be summarized as follows:

**Establishing links between statistical learning and reinforcement learning**. Performance bounds on the policies deduced by approximate dynamic programming methods (such as
approximate value iteration, approximate policy iteration) when using sampling devices are established in terms of the capacity (using VC dimension, covering numbers) of the function
space considered in the approximations. See
,
,

**Analysis of dynamic programming using L_p-norms**. This work extends the usual analysis in L_∞-norm to L_p-norms.

**Policy gradient estimation in continuous time**. This method allows one to search directly for a locally optimal controller in a class of parameterized policies, in the case of continuous-time state dynamics . An application of this method to a control problem in finance has been worked on .

**Analysis of the exploration-exploitation tradeoff using variance estimates**. We investigate the multi-armed bandit framework using new deviation inequalities that take into account the variance estimate. This results in a great sharpening of the regret bounds. See , , .
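The multi-armed bandit setting can be illustrated with a basic UCB index policy; the variance-aware versions studied here refine the exploration term, and the Bernoulli arms below are hypothetical.

```python
import math, random

def ucb1(arm_means, n_rounds=5000, seed=0):
    # UCB1: pull the arm maximizing  mean_i + sqrt(2 ln t / n_i).
    rng = random.Random(seed)
    n = [0] * len(arm_means)      # pull counts
    s = [0.0] * len(arm_means)    # reward sums
    for t in range(1, n_rounds + 1):
        if t <= len(arm_means):
            i = t - 1             # pull each arm once to initialize
        else:
            i = max(range(len(arm_means)),
                    key=lambda j: s[j] / n[j]
                    + math.sqrt(2 * math.log(t) / n[j]))
        n[i] += 1
        s[i] += 1.0 if rng.random() < arm_means[i] else 0.0
    return n

counts = ucb1([0.2, 0.5, 0.8])
# The best arm (index 2) ends up pulled far more often than the others.
```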

**Use of bandit algorithms for performing tree search.** We investigate the recursive use of bandit algorithms for designing efficient tree exploration policies. The resulting methods explore the tree in an asymmetric way, expanding and exploring first the most promising branches. We analyzed the UCT algorithm (the UCB algorithm applied to trees) and used it for tree search in the game of Go (see sec. below). With Pierre-Arnaud Coquelin, we further investigated improved bandit algorithms for tree search, providing performance guarantees . Several master students have done an internship on related topics (Jean-François Hren and Amine Chouia ) and are currently beginning their PhDs on several extensions of this domain. Yizao Wang has done his second-year master internship at SequeL, in collaboration with Jean-Yves Audibert from the CERTIS (ENPC), on bandit algorithms applied to tree search when using variance estimates . This domain stands as one of our privileged research directions.

We use several ideas from the ultra-bee schemes used for transport equations with discontinuous solutions, and the dynamic programming approach combined with function approximation, to approximate the viability kernel of a viability problem .

This is a joint work between Pierre-Arnaud Coquelin and Romain Deguest at Centre de Mathématiques Appliquées de l'École Polytechnique (CMAP).

We have proposed a numerical method to estimate the gradient of a Feynman-Kac flow. The idea is to perform a sensitivity analysis along a Markov chain canonically associated to the Feynman-Kac model. One can use classical methods, such as the likelihood ratio method or infinitesimal perturbation analysis, to estimate the gradient. We have proposed more efficient algorithms, i.e., algorithms with a lower variance, to estimate this gradient.

The range of applications is quite broad: maximum likelihood parameter estimation in Hidden Markov Models, direct policy search in Partially Observable Markov Decision Processes, policy optimization in risk-sensitive-cost Markov decision problems, and sensitivity analysis of rare events with respect to some parameter of the model .

After the 2006 work on the adaptation of the kernelized LARS algorithm to fit the reinforcement learning problem , which yielded what we named the “equi-gradient descent” algorithm (EGD), a major advance in 2007 has been to extend the traditional LARS approach to an infinite number of features. This opens up many applications of EGD to model fitting, such as radial basis function networks, neural networks, wavelets, ..., beyond the reinforcement learning scope. Work on this point is ongoing.

We have also proposed a unified view of many algorithms (TD, residual TD, iLSTD, LSTD, LSPE and our equi-gradient TD algorithm), showing that they all fit into a single, and simple, formalism .

After the major breakthrough in Go realized in 2006 by Rémi Coulom's Crazystone program, the latter has evolved further. It won a bronze medal in the 9×9 game and a silver medal in the 19×19 game at the 12^{th} Computer Olympiad in Amsterdam.

The main research advance in 2007 has consisted in incorporating domain knowledge into his Go-playing program by supervised learning from expert games. This is a follow-up on his previous research on Monte-Carlo tree search. This new technique led to huge strength improvements. Even on the large 19×19 board, his program is now stronger than the strongest classical non-Monte-Carlo programs , .

In parallel, another track was followed in the team by Rémi Munos, who collaborated with Yizao Wang, first-year master student at CMAP, École Polytechnique, Sylvain Gelly, PhD student, and Olivier Teytaud at INRIA TAO. The resulting program, MoGo, is currently the world's best computer-Go program , .

We began work on games with incomplete information. The key point is to adapt some techniques used in to balance exploration and exploitation, and to take advantage of some theoretical bounds on bandit problems to create new solutions that adapt the computer's strategy to its opponent during the game. We chose poker to apply these ideas. Poker blends interesting research topics (partially observable problems, time-varying problems) with high economic stakes.

Raphaël Maîtrepierre, Jérémie Mary, and Rémi Munos created Brennus, a computer Texas Hold'em poker player , . This new bot, based on genuinely new techniques in the field of computer poker, performed quite well: it defeated all opponents of the 2006 challenge and ranked 8th out of 17 at the AAAI poker challenge, held in August 2007 in Vancouver.

An activity is also going on regarding the practical application of reinforcement learning, based on recently published ideas, and the combination of various approaches. In particular, we have worked on the following lines:

investigate natural gradients to obtain better approximations in value-based approaches, as well as in policy search approaches.

investigate the transfer of knowledge, learning a task in a simple setting and using what has been learnt in a larger problem setting.

investigate automatic feature selection to find better representations of the problem that make it easier to learn the value function or the policy.

This work requires some effort to obtain experimental results and to be able to draw experimental guidelines for the practitioner. This work is ongoing and preliminary results are available in . We use Grid'5000 to speed up the experimental work.

This is a joint work between Pierre-Arnaud Coquelin, Andrea Brovelli and Driss Boussaoud of the “Institut de Neurosciences Cognitives de la Méditerranée” (INCM). The goal was to find the neural learning model used by a monkey during a sequential associative learning task. We developed an approach based on maximum likelihood estimation in Hidden Markov Model. The results were very interesting and the approach is quite novel in this field , .

This is a special case of a sensor management problem. Here, we suppose that all the sensors are of the same type: Electronically Scanned Array (ESA) radars. These radars are used in a context of target detection and target tracking. This is a common application in sensor management, and typically a military one.

First results have been developed in the framework of the scheduling of radars in a multitarget environment. The scheduling is based on the modeling of the probability of detection of a target. The detection process has been improved in order to maximize this probability. This optimization leads to a specific form for the probability of detection, which allows analytical derivation of scheduling strategies for one radar in a multitarget environment: if the radar has to spend a given time T to detect N targets (N is supposed to be known), how much time must it allocate to each target if the criterion to optimize is the sum of the probabilities of detection? Some results have been extended to the multisensor/multitarget environment ( ).

A method based on the modeling of the probability of detection of the radar and on the posterior probability of presence of a target in a radar resolution cell (knowing the past actions and the past measures), P_p, has been developed (the context is supposed to be multitarget and monosensor). After each analysis of a direction, the posterior probability of presence of a target in a cell is updated. The presence of a target in a cell being modeled by a Bernoulli random variable of probability P_p, the choice of the next action is based on the minimization of the variance of this random variable. A first algorithm, very simple although not very realistic, was proposed first: the targets are supposed not to move during the observation. This algorithm was then modified to take into account the movement of the targets, which is supposed to be Markovian with respect to the cells , .
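One simple reading of this selection rule can be sketched as follows: maintain a posterior probability of presence per cell and observe next the cell whose Bernoulli variance P_p(1 - P_p) is largest, since observing the most uncertain cell yields the greatest variance reduction; the cell probabilities below are hypothetical.

```python
def most_uncertain_cell(presence_probs):
    # Bernoulli variance p(1-p) is maximal at p = 0.5: cells whose
    # occupancy is most uncertain are observed first.
    variances = [p * (1.0 - p) for p in presence_probs]
    return max(range(len(presence_probs)), key=variances.__getitem__)

# Hypothetical posterior probabilities of target presence in four cells.
cells = [0.05, 0.45, 0.90, 0.20]
```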

The sensor management problem (SMP) is a practical application to which we have decided to devote a large effort in SequeL. We want to consider the SMP as a particular reinforcement learning problem. Hence, the first step in this effort is to formulate the SMP as some Markov decision problem.

The use of classical Q-learning methods for ESA radars has also been evaluated. This method was recently described in . It works quite well when the number of cells is not too large, which is not the case with an ESA radar , .

A new and original approach, consisting in deriving the optimal parameterized policy based on stochastic gradient estimation, has also been developed . Two different techniques, namely Infinitesimal Perturbation Analysis (IPA) and the Likelihood Ratio (LR) method, have been used to address the problem. This work is based on the PhD results of Pierre-Arnaud Coquelin.

Time-varying clustering with first-order stationary Pitman-Yor processes. There is a need for models that cluster evolving data, where the number and the composition of the clusters may evolve and adapt sequentially. We have developed a new class of Pitman-Yor processes which have a given fixed marginal distribution at each time, together with a given cluster dynamic model.
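For intuition, a (static) Pitman-Yor partition can be sampled with the two-parameter Chinese-restaurant scheme. This toy sketch shows only the marginal clustering mechanism with invented parameter values; it does not reproduce the time-varying construction developed in the work:

```python
import numpy as np

def pitman_yor_partition(n, d=0.5, theta=1.0, seed=0):
    """Sample a random partition of n items by the two-parameter
    (Pitman-Yor) Chinese restaurant scheme: item i joins an existing
    cluster c with probability proportional to (size_c - d), or opens
    a new cluster with probability proportional to (theta + d * k)."""
    rng = np.random.default_rng(seed)
    assignments, sizes = [0], [1.0]
    for _ in range(1, n):
        k = len(sizes)
        probs = np.array([s - d for s in sizes] + [theta + d * k])
        probs /= probs.sum()
        c = int(rng.choice(k + 1, p=probs))
        if c == k:
            sizes.append(1.0)        # open a new cluster
        else:
            sizes[c] += 1.0
        assignments.append(c)
    return assignments, sizes

assignments, sizes = pitman_yor_partition(200)
```

With discount d > 0 the number of clusters grows as a power of n (rather than logarithmically, as for the Dirichlet process obtained at d = 0), which yields the heavier-tailed cluster-size distributions the Pitman-Yor family is used for.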

Maximum likelihood in latent variable models. For this kind of problem, the EM algorithm is the well-known and efficient approach used by many researchers. However, the EM algorithm is a local ascent method, meaning that it may converge to local optima. We have developed a Monte Carlo algorithm to be used whenever the EM approach fails. It applies a Sequential Monte Carlo strategy which can, to some extent, be related to genetic algorithms, with the major difference that convergence results hold for our approach.
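The flavor of such a population-based alternative to EM can be sketched as follows: parameter particles are pushed towards the likelihood maximum by resampling against increasingly tempered versions of the likelihood (selection) followed by Gaussian jitter (mutation). The target function, tempering schedule, and all constants are invented for the example and do not reproduce the actual algorithm:

```python
import numpy as np

def loglik(t):
    """Toy multimodal log-likelihood: local mode near -4, global mode near 3."""
    return np.log(0.3 * np.exp(-0.5 * (t + 4.0) ** 2)
                  + 0.7 * np.exp(-0.5 * (t - 3.0) ** 2))

def smc_maximize(loglik, n=500, betas=np.linspace(0.1, 30.0, 60),
                 jitter=0.2, seed=0):
    """Temper a particle population through the targets L(theta)^beta:
    importance resampling acts as selection, Gaussian jitter as mutation."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-10.0, 10.0, n)        # initial population
    prev = 0.0
    for beta in betas:
        ll = loglik(theta)
        w = np.exp((beta - prev) * (ll - ll.max()))   # incremental weights
        w /= w.sum()
        theta = theta[rng.choice(n, n, p=w)]          # selection
        theta = theta + jitter * rng.standard_normal(n)  # mutation
        prev = beta
    return theta[np.argmax(loglik(theta))]

best = smc_maximize(loglik)
```

A local ascent method started near -4 would stay at the local mode; the population, in contrast, keeps mass on both modes early on and concentrates on the global one as the temperature drops.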

With E. Jackson (PhD student) and A. Doucet (Professor at the University of British Columbia), we have investigated Bayesian functional clustering. Given observations obtained by sampling different functions at random locations (different from one function to another), we have developed an algorithm that clusters these functions into coherent groups. For instance, each observation may be a signal (in one or more dimensions), such that the sampling instants differ from one signal to another. We have modeled the underlying signals using Gaussian processes, and the clustering itself is performed using a Dirichlet process mixture. The target applications concern the clustering of messenger-RNA expression data, as well as sampled data originating from geostatistics.

Certain electrical devices include a set of electric cables which, under certain circumstances, may give rise to electrical arcs (a typical example is the command circuitry of an airplane). Detecting such arcs is basically a problem of detecting ruptures (abrupt changes) in a multichannel environment.

We have designed a Bayesian algorithm to detect these ruptures, which models the signal on each cable individually by an autoregressive model, and which assumes a correlation between the rupture instants in the different cables.
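A single-channel, maximum-likelihood caricature of rupture detection with autoregressive models (not the Bayesian multichannel algorithm itself; the model order, segment lengths, and noise levels are invented) could read:

```python
import numpy as np

def ar1_loglik(x):
    """Gaussian log-likelihood of a least-squares AR(1) fit to segment x."""
    a = np.dot(x[1:], x[:-1]) / max(float(np.dot(x[:-1], x[:-1])), 1e-12)
    e = x[1:] - a * x[:-1]
    s2 = max(float(np.mean(e ** 2)), 1e-12)
    return -0.5 * len(e) * (np.log(2.0 * np.pi * s2) + 1.0)

def detect_rupture(x, margin=10):
    """Return the split point maximizing the joint AR(1) likelihood of
    the two segments (single rupture, single channel)."""
    ks = range(margin, len(x) - margin)
    scores = [ar1_loglik(x[:k]) + ar1_loglik(x[k:]) for k in ks]
    return margin + int(np.argmax(scores))

# Toy signal: a white segment, then an AR(1) segment with larger variance.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
x2 = np.zeros(100)
for t in range(1, 100):
    x2[t] = 0.8 * x2[t - 1] + 2.0 * rng.standard_normal()
k = detect_rupture(np.concatenate([x1, x2]))
```

The split that separates the two regimes maximizes the total likelihood; every misassigned sample pays a per-sample penalty in fitted noise variance, so the estimate lands close to the true rupture instant (here, 100).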

This work is done in collaboration with Prof. Carl Haas of the University of Waterloo (Canada). This collaboration is related to a problem arising in civil engineering: how can we automatically localize building materials on a construction site? This is a real problem, because a lot of time (hence money) is lost looking for materials that have often been moved away. The proposed solution is to equip each piece of material with an RFID tag, and each person working on the construction site with an RFID receiver, a GPS for localization, and a transmitter. We then learn sequentially the position of the pieces, using the detection information sent automatically by the transmitter to a central processor whenever workers walk near these pieces and detect them. RFID systems and localization systems such as GPS make it possible to treat such a problem in the more general context of the localization of randomly distributed communication nodes. When the nodes are moving, the problem is even more complicated. Our work shows how the Transferable Belief Model can be used to learn the positions of the communication nodes and to detect potential movements. This study also shows how to deal with the computational load.

This work has also been applied to land vehicle localization. The vehicle is equipped with three sensors, including a GPS sensor.

This work is done in collaboration with Juliette Marais, junior researcher at INRETS and Fleury Donnay NAHIMANA, a PhD Student supervised by Emmanuel Duflos and Juliette Marais.

Lots of Global Navigation Satellite System (GNSS) applications today deal with transportation. However, the main transport applications, whether by rail or road, are used in dense urban areas or, at the least, in suburban areas. In both, the reception conditions of the available satellite signals are not ideal. The consequences of environmental obstructions are unavailability of the service and multipath reception, which degrades, in particular, the accuracy of the positioning. In order to enhance GNSS performance, several research axes can be found in the literature, dealing with multi-sensor fusion, electronic enhancement, or receiver processing. We focus here on the multisensor approach, where each satellite is considered as a sensor.

Today most GNSS receivers, like the well-known GPS, assume that the received noise is Gaussian and use a Kalman filter. This assumption is false in urban canyons, and we must find new models for the noise and derive new methods to estimate the position accurately from the signals sent by the satellites and from all the other information each satellite provides. Such a problem is all the more relevant since the future Galileo constellation will provide receivers with information such as the integrity of the signals, leading to new services for industry.

This problem can be modeled in the framework of the sensor management problem, each satellite being considered as a sensor with several
*modes*. Moreover, the receiver generally being in motion, it is necessary to estimate over time the non-stationary noise probability density function at the same time as the position.

We have shown that in narrow urban canyons the noise resulting from multipath is multimodal and can be modeled, in a first approximation, by a Gaussian mixture model, leading to a nonlinear and non-Gaussian estimation process. When urban canyons are wide, reception conditions are close to those found outside towns, which means a Gaussian noise. When the receiver is moving, the overall localization process must therefore be modeled by a jump Markov system. An estimation process based on a particle filter has been implemented. Simulation results show an improvement in performance with respect to the classical receiver.
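A minimal sketch of this kind of estimation process, assuming a random-walk position model and an invented two-component Gaussian mixture for the measurement noise (all numeric values are illustrative; the actual system is multidimensional and jump-Markov), is:

```python
import numpy as np

def gmm_loglik(err, w=(0.8, 0.2), mu=(0.0, 5.0), sig=(1.0, 3.0)):
    """Log-density of a pseudorange error under a two-component Gaussian
    mixture: direct path centered at 0, multipath biased by +5 (toy values)."""
    p = sum(wi * np.exp(-0.5 * ((err - mi) / si) ** 2) / (si * np.sqrt(2.0 * np.pi))
            for wi, mi, si in zip(w, mu, sig))
    return np.log(p + 1e-300)

def particle_filter(measurements, n=2000, q=0.5, seed=0):
    """Bootstrap particle filter: random-walk position model, Gaussian
    mixture measurement noise (non-Gaussian, so a Kalman filter does not apply)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 5.0, n)                  # initial particle cloud
    est = []
    for z in measurements:
        x = x + q * rng.standard_normal(n)       # propagate
        logw = gmm_loglik(z - x)                 # weight under the mixture
        w = np.exp(logw - logw.max())
        w /= w.sum()
        est.append(float(np.sum(w * x)))         # posterior-mean estimate
        x = x[rng.choice(n, n, p=w)]             # resample
    return est

# Static true position at 0; 20% of the measurements carry the multipath bias.
rng = np.random.default_rng(1)
multipath = rng.random(40) < 0.2
z = np.where(multipath, rng.normal(5.0, 3.0, 40), rng.normal(0.0, 1.0, 40))
est = particle_filter(z)
```

Because the filter knows the mixture shape, biased multipath measurements are down-weighted rather than averaged in, and the estimate stays near the true position where a Gaussian-noise filter would drift towards the biased mean.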

Stéphane Rossignol (post-doctoral fellow) and Manuel Davy, in collaboration with France Télécom, continued to work on the automatic transcription of speech using kernel methods (mostly the 1-class SVM). Their work focused on the classification problem. This is a particularly complicated problem, due to the fact that it comprises a few thousand classes and that these classes are very unbalanced (some classes are represented by only a few data points, others by millions). Stéphane Rossignol has demonstrated that it is possible to train a 1-class SVM on the largest classes in a relatively short amount of time. He has shown as well that the dissimilarity measures used to assign a sample to one of the trained 1-class SVMs are effective even for this extremely unbalanced problem. On a small test dataset, he has shown that the obtained performance is at least as good as that of the existing France Télécom system. In the first half of 2007, Stéphane Rossignol and Manuel Davy began to evaluate and improve the classification models they built in 2006. From June 2007 to September 2007, Michel Moyart, a trainee from the ISEN engineering school, worked on this topic. He improved the code, allowing the evaluations to be performed on a huge set of test data and the improvement procedure to be fast enough. These evaluations and improvements are under way.

Stéphane Rossignol and Manuel Davy, within the framework of the technology-transfer support contract (“aide au transfert technologique”) number 510416 between the CNRS and a company manufacturing test instruments for the professional audio industry, worked on the detection and characterization of loudspeaker flaws on the production line using kernel methods. They also worked on the test signal and on the time-frequency representation (TFR) to use in order to effectively highlight the characteristics of each flaw, and on the overall methodology to follow: some flaws are hardly separable, thus requiring an additional feature extraction step; furthermore, the flaws evolve slightly over time; and so forth. These various issues require the system to be as flexible as possible. Stéphane Rossignol and Manuel Davy have demonstrated that it is possible to use the 1-class SVM technique to discriminate between the flaws. A completely functional system is under way.

Stéphane Rossignol worked on the automatic characterization of environmental sounds using kernel methods. First, concerning the segmentation and indexing of musical audio signals, he improved the performance of the techniques he developed during and after his PhD thesis. Using the KCD (Kernel Change Detection) technique to highlight the transitions between notes reduces the number of false alarms by 80%. Second, he worked with Asma Rabaoui on the application of the one-class SVM technique to audio surveillance systems. Nine classes of environmental sounds (gunshots, cries, barking, etc.) have been effectively discriminated using the one-class SVM technique, with a correct classification rate of more than 90%.
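To illustrate the classification scheme, here is a toy sketch using scikit-learn's `OneClassSVM`: one model is trained per class on that class's data only, and a test sample is assigned to the class whose model scores it highest. The 2-D feature space, class distributions, and hyperparameters are invented; the actual systems work on real audio features.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy "sound features": three classes as Gaussian blobs in 2-D.
means = {0: (0.0, 0.0), 1: (6.0, 0.0), 2: (0.0, 6.0)}
train = {k: rng.normal(m, 1.0, size=(100, 2)) for k, m in means.items()}

# One 1-class SVM per class, each trained on that class's data only.
models = {k: OneClassSVM(kernel="rbf", gamma=0.2, nu=0.1).fit(X)
          for k, X in train.items()}

def classify(x):
    """Assign x to the class whose 1-class SVM gives the highest score."""
    scores = {k: float(m.decision_function(x.reshape(1, -1))[0])
              for k, m in models.items()}
    return max(scores, key=scores.get)
```

The appeal for unbalanced problems is that each model sees only its own class, so a class with millions of samples and a class with a handful are trained independently rather than fighting in one joint optimization.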

Bayesian supervised classification with generative models. In classification problems, it sometimes happens that a model describing the data generation process is available (this is called a generative model). Learning a Bayesian classifier comes down to learning the posterior distribution of the model parameters, which requires defining prior distributions over these parameters. However, simple choices like the Gaussian are often inaccurate. Therefore, we propose to use a Dirichlet process mixture prior, so as to gain learning flexibility while not ending up with an overly complex model. This has been applied to radar altimetry data classification. To be more specific, satellites like Topex-Poseidon carry radars to measure their height very precisely. However, the response of the earth surface to the electromagnetic pulse carries more information than just the altitude: its shape depends on the type of surface it hits (ocean, forest, etc.), and it is interesting to recover this extra information.
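The generative classification rule itself is just Bayes' rule over class-conditional densities. The sketch below uses fixed 1-D Gaussians as stand-ins for the flexible Dirichlet-process-mixture densities of the actual work; all densities, priors, and class names are invented:

```python
import numpy as np

# Toy class-conditional densities (mean, std) and class priors.
classes = {"ocean": (0.0, 1.0), "forest": (4.0, 1.5)}
prior = {"ocean": 0.6, "forest": 0.4}

def log_gauss(x, mu, sig):
    """Log-density of N(mu, sig^2) at x."""
    return -0.5 * ((x - mu) / sig) ** 2 - np.log(sig * np.sqrt(2.0 * np.pi))

def posterior(x):
    """Bayes rule: p(class | x) proportional to p(x | class) p(class),
    computed in log space for numerical stability."""
    logp = {k: np.log(prior[k]) + log_gauss(x, *v) for k, v in classes.items()}
    m = max(logp.values())
    unnorm = {k: np.exp(v - m) for k, v in logp.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}
```

In the Bayesian treatment described above, the fixed `(mu, sig)` pairs are replaced by posterior distributions over mixture parameters, but the final classification step is this same rule, averaged over that posterior.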

As an INRIA team, SequeL has not signed any contract through INRIA. However, various works in 2007 have been done under contract, for instance with France Télécom, Auchan, and others (confidentiality required).

This work was described in sec. .

In 2007, we have dealt with a large-scale application of functional prediction, involving state-of-the-art supervised and non-supervised learning methods. More precisely, we have worked with Auchan, a major international group which operates more than 150 hypermarkets worldwide. Among other important issues, the prediction of the number of customers reaching the cashiers at a given time of the day has been worked on in 2007. For each day, this is a functional prediction problem (number of open cashiers against the time slot of each particular day), which may be seen as non-stationary, because customer habits evolve.

Emmanuel Duflos takes part in the PEPSAT regional project, as the coordinator for the LAGIS. PEPSAT deals with global navigation satellite systems and is supported by the Région Nord-Pas de Calais.

This project is headed by Prof. S. Canu at the INSA-Rouen. It deals with the study of kernel methods for signal processing.

In 2007, Manuel Davy and Alain Rakotomamonjy (LITIS, Rouen) have worked on the regularization path of the 1-class SVM.

A two-year ARC project named CODA (“Optimal control of an anaerobic digestor”), in collaboration with the INRA laboratory in Narbonne, the INRIA project-team COMORE in Sophia-Antipolis, and the spin-off Naskeo Environment, started in 2007. A post-doctoral fellow (Djalel Mazouni) has been hired for one year.

Several approaches for solving this partially observable Markov decision problem have been developed by two PhD students: Pierre-Arnaud Coquelin, using a sensitivity analysis combined with a particle filtering approach, and Robin Jaulmes, using a Bayesian setting.

We refer the interested reader to the website
`http://sequel.futurs.inria.fr/munos/arc-coda` for more, and up-to-date, information.

Philippe Vanheeghe has visited the Centre for Pavement and Transportation Technology (CPATT), headed by Prof. Carl Haas at the University of Waterloo, Canada, from February 23rd to March 11th, 2007. This visit dealt with sensor management for locating building materials on construction sites using RFID tags.

Martin Zinkevich spent a week in Villeneuve d'Ascq, from October 22nd to 27th, to work on Poker software with Jérémie Mary and Raphaël Maitrepierre. M. Zinkevich chaired the 2007 AAAI Poker challenge. He used to be a member of the University of Alberta at Edmonton and is now with Yahoo! Research in Silicon Valley.

Rémi Coulom has been invited by Takeshi Ito, University of Electro-Communications, Tokyo, from November 6th to 9th, 2007, with regard to Crazy Stone, his Go-playing software. He gave an invited talk at the 12th Game Programming Workshop.

Stepan Albrecht, University of Bohemia, is visiting Manuel Davy for 3 months, to work on music analysis.

Manuel Davy is an associate editor of the IEEE Transactions on Signal Processing. He has also reviewed papers for the IEEE Transactions on Signal Processing, IEEE Signal Processing Letters, Speech Communication, Signal Processing, and the IEEE Transactions on Circuits and Systems I.

Emmanuel Duflos and Philippe Vanheeghe have organized two special sessions on the sensor management problem at the Fusion 2007 conference, held in July in Quebec City, Canada.

Emmanuel Duflos is working towards bringing the FUSION 2010 conference to Lille. Accordingly, he sent a formal proposal to the International Society of Information Fusion to organize the FUSION conference in Lille in 2010. The organization is supported at the moment by the École Centrale de Lille and INRIA-Futurs. The proposal will be refined in 2008; the final decision will be taken in July 2008.

Emmanuel Duflos is also involved in the organization of the 5th Computational Engineering in Systems Applications conference, which will be held in 2009 in South Korea.

Rémi Munos is a member of the scientific boards of the Journal of Machine Learning Research, the Machine Learning journal, the Artificial Intelligence journal, and the Revue d'Intelligence Artificielle. He has been a member of the program committees of the 2006 Neural Information Processing Systems and International Conference on Machine Learning conferences, and of the Conférence Francophone sur l'Apprentissage Automatique.

Rémi Munos was an invited speaker at the Workshop on Reinforcement Learning, held in Tübingen in July 2007. He was also an invited speaker at the “Diffusion des savoirs” seminar (École Normale Supérieure) in January 2007.

Rémi Munos has been an Associate Researcher with CREA (Centre de Recherche en Epistémologie Appliquée), École Polytechnique, since September 2007.

Rémi Munos is Co-chair of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.

Philippe Preux is a member of the program committee of the IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, the 2007 European Conference on Machine Learning, Artificial Evolution, Reconnaissance des Formes et Intelligence Artificielle 2008, Extraction et Gestion des Connaissances 2008.

Philippe Preux and Rémi Coulom, together with Samuel Delepoulle, have edited a special issue of the “Revue d'Intelligence Artificielle” on Markov Decision Processes.

We list the courses related to the research activities of SequeL that took place in 2007.

Rémi Munos teaches a class on reinforcement learning in the M2 “Mathematics-Vision-Learning” (MVA) at the ENS-Cachan; he also teaches a cognitive science class in the M1 at the EHESS (Paris).

Philippe Preux teaches a class on reinforcement learning in the M2 of computer science at the University of Lille.

Stéphane Rossignol gives a 16-hour class on Data Mining to the students of the Research Master “Génie Industriel” of the École Centrale de Lille.

Otherwise, each of the 5 professors and assistant professors of the SequeL team teaches 192 hours per year, mostly at the master level. Taught classes include machine learning, data mining, and signal processing.