Section: Research Program

Statistical Learning and Bayesian Analysis

Before detailing some issues in these fields, let us recall the definitions of a few terms.

Machine learning

refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness.

Statistical learning

is an approach to machine intelligence that is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/“common sense” to design an algorithm.

Bayesian Analysis

applies to data that can be seen as observations in the most general sense of the term. These data may come not only from classical sensors but also from any device recording information. From an operational point of view, as in statistical learning, uncertainty about the data is modeled by a probability measure, which defines the so-called likelihood function. The likelihood depends on parameters defining the state of the world we focus on for decision purposes. Within the Bayesian framework, the uncertainty about these parameters is also modeled by probability measures, the priors, which are subjective probabilities. Using probability theory and decision theory, one then defines new algorithms to estimate the parameters of interest and/or the associated decisions. According to the International Society for Bayesian Analysis (source: http://bayesian.org), and from a more general point of view, this overall process can be summarized as follows: one assesses the current state of knowledge regarding the issue of interest, gathers new data to address remaining questions, and then updates and refines one's understanding to incorporate both new and old data. Bayesian inference provides a logical, quantitative framework for this process based on probability theory.
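The assess, gather, update cycle described above can be illustrated with a minimal sketch. It assumes a Beta prior on the success probability of a Bernoulli experiment (a conjugate pair chosen purely for convenience); the prior parameters and the data batches are made up for illustration and do not come from the report.

```python
from fractions import Fraction

def update_beta(alpha, beta, observations):
    """Return the Beta posterior parameters after observing 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

# Hypothetical prior: Beta(1, 1), i.e. uniform over the success probability.
prior = (1, 1)
first_batch = [1, 0, 1, 1]       # made-up data
second_batch = [0, 1, 1, 1, 1]   # made-up data

# Yesterday's posterior becomes today's prior.
posterior = update_beta(*prior, first_batch)
posterior = update_beta(*posterior, second_batch)

# Processing all the data at once gives the same posterior.
batch_posterior = update_beta(*prior, first_batch + second_batch)

print("posterior parameters:", posterior)
print("posterior mean:", float(beta_mean(*posterior)))
```

Updating batch by batch and updating once on the pooled data yield the same posterior, which is what makes the framework suitable for incorporating "both new and old data".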

Kernel method

Generally speaking, a kernel function is a function that maps a pair of points to a real value. Typically, this value is a measure of similarity between the two points. Provided it satisfies a few properties (symmetry and positive semi-definiteness), the kernel function implicitly defines a dot product in some function space. This formal property, along with several others, has made these methods very appealing over the last 10 years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce a representation of the data in a “suitable” space where the problem to solve (classification, regression, ...) is expected to be simpler (non-linearity turns into linearity).
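To make the "implicit dot product" concrete, here is a small sketch using the homogeneous polynomial kernel of degree 2 on two-dimensional inputs. The kernel choice, the explicit feature map, and the sample points are illustrative assumptions, not something prescribed by the text.

```python
import math

def dot(u, v):
    """Ordinary Euclidean dot product."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def poly2_features(x):
    """Explicit feature map for 2-D inputs with phi(x) . phi(y) = k(x, y)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)

# Evaluated directly in the input space, without building any features.
implicit = poly2_kernel(x, y)
# Evaluated as a dot product in the 3-D feature space.
explicit = dot(poly2_features(x), poly2_features(y))

print(implicit, explicit)
```

Both evaluations agree up to floating-point rounding, but the kernel never forms the feature vectors. For kernels such as the Gaussian kernel, the corresponding feature space is infinite-dimensional, so only the implicit route is available.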

The fundamental tools used in SequeL come from the field of statistical learning [45]. We briefly present the most important ones for us to date, namely kernel-based non-parametric function approximation and non-parametric Bayesian models.

Non-parametric methods for Function Approximation

In statistics in general, and in applied mathematics, the approximation of a multi-dimensional real function given some samples is a well-known problem (known as regression, interpolation, function approximation, ...). Regressing a function from data is a key ingredient of our research or, at the very least, a basic component of most of our algorithms. In the context of sequential learning, we have to regress a function while data samples are obtained one at a time, under the constraint of being able to make predictions at any step of the acquisition process. In sequential decision problems, we typically have to learn a value function or a policy.
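As an illustration of regressing while samples arrive one at a time, the following sketch uses recursive least squares on a model that is linear in its features. This is a deliberately simple parametric stand-in, not one of the methods of the report; the initial matrix `P`, the feature choice `[x, 1]`, and the noise-free data stream are assumptions made for the example.

```python
def rls_update(w, P, x, y):
    """One recursive least-squares step for the model y ~ w . x.

    w: current weights, P: current inverse of the (regularized) Gram matrix,
    x: feature vector of the new sample, y: target of the new sample.
    """
    n = len(w)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    error = y - sum(w[i] * x[i] for i in range(n))
    w_new = [w[i] + gain[i] * error for i in range(n)]
    # Rank-one downdate of P (P is symmetric, so P x x^T P = Px Px^T).
    P_new = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return w_new, P_new

# Hypothetical stream generated by y = 2 x + 1; features are [x, 1].
w = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]  # large P corresponds to a very weak prior on w
stream = [(float(x), 2.0 * x + 1.0) for x in range(5)]

for x_value, y_value in stream:
    w, P = rls_update(w, P, [x_value, 1.0], y_value)
    # A prediction is available after every single sample.
    print("prediction at x = 10:", w[0] * 10.0 + w[1])
```

Each update costs a fixed amount of work regardless of how many samples have been seen, and the current estimate can be queried at any point of the acquisition process.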

Many methods have been proposed for this purpose. We are looking for those suited to the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large; the approximation is difficult in these areas, yet they are also where the accuracy of the approximation should be maximal to obtain a good policy (and where, otherwise, a bad choice of action may have catastrophic consequences).

We particularly favor non-parametric methods since they make very few assumptions about the function to learn. In particular, we have a strong interest in l1-regularization and the (kernelized) LARS algorithm. l1-regularization yields sparse solutions, and the LARS approach produces the whole regularization path very efficiently, which helps solve the problem of tuning the regularization parameter.
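The sparsity induced by l1-regularization can be seen on a toy problem. The sketch below solves the lasso objective (1/2n)·||y − Xw||² + λ·||w||₁ by plain coordinate descent with soft-thresholding; note that this is not the LARS algorithm, only a simpler solver for the same objective at a single value of λ. The design matrix, targets, and λ values are made up for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n) ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_sweeps):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for xi, yi in zip(X, y):
                pred_without_j = sum(xi[k] * w[k] for k in range(d) if k != j)
                rho += xi[j] * (yi - pred_without_j)
                z += xi[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w

# Hypothetical data: y = 3 * x0 + 0.1 * x1, so feature 1 is nearly irrelevant.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]

w_unregularized = lasso_coordinate_descent(X, y, lam=0.0)
w_sparse = lasso_coordinate_descent(X, y, lam=0.5)
print("lam = 0.0:", w_unregularized)
print("lam = 0.5:", w_sparse)
```

With λ = 0 both coefficients are non-zero; with λ = 0.5 the weak coefficient is set exactly to zero and the strong one is shrunk. A path algorithm such as LARS would return the solutions for all values of λ at once instead of requiring one run per value.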

Nonparametric Bayesian Estimation

Numerous problems may be solved efficiently by a Bayesian approach. The use of Monte Carlo methods allows us to handle non-linear, as well as non-Gaussian, problems. In their standard form, these methods require the formulation of probability densities in a parametric form. For instance, it is common to use a Gaussian likelihood because it is convenient. However, in some applications such as Bayesian filtering or blind deconvolution, the choice of a parametric form for the density of the noise is often arbitrary. If this choice is wrong, it may have dramatic consequences on the estimation quality. To overcome this shortcoming, one possible approach is to consider that this density must also be estimated from data. A general Bayesian approach then consists in defining a probability space associated with the possible outcomes of the object to be estimated. Applied to density estimation, this means that we need to define a probability measure on the probability density of the noise: such a measure is called a random measure. The classical Bayesian inference procedures can then be used. Since this approach is non-parametric by nature, the associated framework is called non-parametric Bayesian.

In particular, mixtures of Dirichlet processes [44] provide a very powerful formalism. Dirichlet Processes are a possible random measure and Mixtures of Dirichlet Processes are an extension of well-known finite mixture models. Given a mixture density $f(x \mid \theta)$, and $G(d\theta) = \sum_{k=1}^{\infty} \omega_k \delta_{U_k}(d\theta)$, a Dirichlet process, we define a mixture of Dirichlet processes as:

$$F(x) = \int_{\Theta} f(x \mid \theta)\, G(d\theta) = \sum_{k=1}^{\infty} \omega_k f(x \mid U_k) \tag{4}$$

where F(x) is the density to be estimated. The class of densities that may be written as a mixture of Dirichlet processes is very wide, so that they really fit a very large number of applications.
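One standard way to draw the weights and atoms of a Dirichlet process is the stick-breaking construction. The sketch below truncates the infinite sum to a finite number of atoms and builds the corresponding mixture density. The concentration parameter `alpha`, the Gaussian base measure for the atoms, the unit-variance Gaussian component density, and the truncation level are all assumptions chosen for illustration; none of them is specified in the text.

```python
import math
import random

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking weights: w_k = b_k * prod_{j<k} (1 - b_j), b_k ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(n_atoms):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as the component f(x | theta)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def sample_truncated_mixture(alpha, n_atoms, rng):
    """Draw weights and atoms, and return the resulting (truncated) mixture density."""
    weights = stick_breaking_weights(alpha, n_atoms, rng)
    # Atoms U_k drawn from an assumed base measure N(0, 3^2).
    atoms = [rng.gauss(0.0, 3.0) for _ in range(n_atoms)]

    def density(x):
        return sum(w * gaussian_pdf(x, u, 1.0) for w, u in zip(weights, atoms))

    return weights, atoms, density

rng = random.Random(0)
weights, atoms, density = sample_truncated_mixture(alpha=2.0, n_atoms=50, rng=rng)
print("mass captured by the truncation:", sum(weights))
print("density at 0:", density(0.0))
```

The weights decay quickly on average, so a moderate truncation level already captures most of the mass; the leftover mass is what the truncation ignores.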

Given a set of observations, the parameters of a mixture of Dirichlet processes are estimated by means of a Markov chain Monte Carlo (MCMC) algorithm. Dirichlet process mixtures are also widely used in clustering problems. Once the parameters of a mixture are estimated, they can also be interpreted as the parameters of a specific cluster defining a class. Dirichlet processes are well known within the machine learning community, and their potential in statistical signal processing still needs to be developed.
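The link with clustering can be sketched through the Chinese restaurant process, which describes the partition of the data induced by a Dirichlet process once the random measure is integrated out. The code below only samples from this prior over partitions; it is not an MCMC posterior sampler, and the concentration parameter `alpha` and the number of points are illustrative assumptions.

```python
import random

def chinese_restaurant_process(n_points, alpha, rng):
    """Sample cluster assignments from the Chinese restaurant process prior.

    Point i joins existing cluster k with probability counts[k] / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha).
    """
    assignments = []
    counts = []
    for i in range(n_points):
        r = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        for k, count in enumerate(counts):
            cumulative += count
            if r < cumulative:
                counts[k] += 1
                assignments.append(k)
                break
        else:
            # No existing cluster was selected: open a new one.
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments, counts

rng = random.Random(0)
assignments, counts = chinese_restaurant_process(n_points=50, alpha=1.0, rng=rng)
print("number of clusters:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters is not fixed in advance; it grows with the number of points, which is the property that makes these models attractive when the number of classes is unknown.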

Random Finite Sets for multisensor multitarget tracking

In the general multi-sensor multi-target Bayesian framework, an unknown (and possibly varying) number of targets, whose states are $x_1, \ldots, x_n$, is observed by several sensors, which produce a collection of measurements $z_1, \ldots, z_m$ at every time step $k$. Well-known models for this problem are track-based models, such as joint probabilistic data association (JPDA), or joint multi-target probabilities, such as the joint multi-target probability density. Common difficulties in multi-target tracking arise from the fact that the system state and the collection of measurements from the sensors are unordered and that their sizes evolve randomly through time. Vector-based algorithms must therefore account for exchanges of state coordinates and for missing data within an unknown time interval. Although this approach is very popular and has resulted in many algorithms in the past, it may not be the optimal way to tackle the problem, since the state and the data are in fact sets and not vectors.

Random finite set theory provides a powerful framework to deal with these issues. Mahler's work on finite set statistics (FISST) provides a mathematical framework to build multi-object densities and derive the Bayesian rules for state prediction and state estimation. The randomness in the number of objects and in their states is encapsulated into random finite sets (RFS), namely the multi-target (state) set $X = \{x_1, \ldots, x_n\}$ and the multi-sensor (measurement) set $Z_k = \{z_1, \ldots, z_m\}$. The objective is then to propagate the multi-target probability density $f_{k|k}(X \mid Z^{(k)})$ by using the Bayesian set equations at every time step $k$:

$$
\begin{aligned}
f_{k+1|k}(X \mid Z^{(k)}) &= \int f_{k+1|k}(X \mid W)\, f_{k|k}(W \mid Z^{(k)})\, \delta W \\
f_{k+1|k+1}(X \mid Z^{(k+1)}) &= \frac{f_{k+1}(Z_{k+1} \mid X)\, f_{k+1|k}(X \mid Z^{(k)})}{\int f_{k+1}(Z_{k+1} \mid W)\, f_{k+1|k}(W \mid Z^{(k)})\, \delta W}
\end{aligned}
\tag{5}
$$

where:

  • $X = \{x_1, \ldots, x_n\}$ is a multi-target state, i.e. a finite set of elements $x_i$ defined on the single-target space 𝒳 (the state $x_i$ of a target is usually composed of its position, its velocity, etc.);

  • $Z_{k+1} = \{z_1, \ldots, z_m\}$ is the current multi-sensor observation, i.e. a collection of measurements $z_i$ produced at time $k+1$ by all the sensors;

  • $Z^{(k)} = \bigcup_{t \le k} Z_t$ is the collection of observations up to time $k$;

  • $f_{k|k}(W \mid Z^{(k)})$ is the current multi-target posterior density in state $W$;

  • $f_{k+1|k}(X \mid W)$ is the current multi-target Markov transition density, from state $W$ to state $X$;

  • $f_{k+1}(Z \mid X)$ is the current multi-sensor/multi-target likelihood function.
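For intuition, the recursion has the same predict/update structure as the classical single-target Bayes filter. The sketch below implements that simpler, vector-valued analogue on a finite state space, where the integrals reduce to sums; it is not the set-valued recursion itself. The three-cell state space, the transition matrix, and the likelihood values are made up for illustration.

```python
def predict(prior, transition):
    """Prediction step: p(x) = sum over previous states w of p(x | w) * p(w)."""
    n = len(prior)
    return [sum(transition[w][x] * prior[w] for w in range(n)) for x in range(n)]

def update(predicted, likelihood):
    """Update step: multiply by the likelihood of the new measurement, then normalize."""
    unnormalized = [l * p for l, p in zip(likelihood, predicted)]
    evidence = sum(unnormalized)  # the denominator of Bayes' rule
    return [u / evidence for u in unnormalized]

# Hypothetical single target moving among three cells.
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
transition = [          # transition[w][x] = probability of moving from cell w to cell x
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
likelihood = [0.1, 0.7, 0.2]  # probability of the received measurement in each cell

predicted = predict(prior, transition)
posterior = update(predicted, likelihood)
print("predicted:", predicted)
print("posterior:", posterior)
```

In the multi-target case the sums over states become set integrals over all finite subsets of the single-target space, which is the source of the computational difficulty.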

Although equations (5) may seem similar to the classical single-sensor/single-target Bayesian equations, they are generally intractable because of the presence of the set integrals. Indeed, an RFS $\Xi$ is characterized by the family of its Janossy densities $j_{\Xi,1}(x_1), j_{\Xi,2}(x_1, x_2), \ldots$ and not just by one density, as is the case with vectors. Mahler then introduced the probability hypothesis density (PHD), defined on the single-target state space. The PHD is the quantity whose integral over any region $S$ is the expected number of targets inside $S$. Mahler proved that the PHD is the first-moment density of the multi-target probability density. Although defined on the single-target state space 𝒳, the PHD encapsulates information on both the number of targets and their states.
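The defining property of the PHD can be checked numerically on a toy example. The sketch below assumes a one-dimensional single-target space and a PHD written as a weighted sum of Gaussians; this Gaussian-mixture form and all numerical values are assumptions made for illustration, not something stated in the text. The weights need not sum to one, since the PHD is not a probability density.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def expected_targets(phd_components, a, b):
    """Integral of the PHD over [a, b], i.e. the expected number of targets in [a, b].

    phd_components: list of (weight, mean, std) triples.
    """
    return sum(
        weight * (normal_cdf(b, mean, std) - normal_cdf(a, mean, std))
        for weight, mean, std in phd_components
    )

# Hypothetical PHD: one well-established target near 0, one uncertain target near 10.
phd = [(1.0, 0.0, 1.0), (0.5, 10.0, 2.0)]

total = expected_targets(phd, -1e9, 1e9)
near_origin = expected_targets(phd, -3.0, 3.0)
print("expected number of targets overall:", total)
print("expected number of targets in [-3, 3]:", near_origin)
```

The total mass of the PHD estimates the number of targets, while its peaks indicate where they are likely to be, which is how a single function on the single-target space carries both kinds of information.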