Section: Scientific Foundations
Statistical learning
Before detailing some issues of statistical learning, let us recall the definitions of a few terms.
 Glossary
 Machine learning
refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness. (source: http://www.aaai.org/AITopics/html/machine.html AAAI website)
 Statistical learning
is an approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/“common sense” to design an algorithm.
 Kernel method
Generally speaking, a kernel function is a function that maps a pair of points to a real value. Typically, this value measures the similarity between the two points. Under a few assumptions (symmetry and positive semi-definiteness), the kernel function implicitly defines a dot product in some function space. This formal property, along with several others, has made these methods very attractive over the last 10 years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce a representation of the data in a “suitable” space where the problem to solve (classification, regression, ...) is expected to be simpler (nonlinearity turns into linearity).
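As a small illustration of the implicit dot product, the sketch below assumes a homogeneous polynomial kernel of degree 2 on the plane, for which the feature map can be written out explicitly. The names `poly2_kernel` and `poly2_features` are illustrative and not taken from any library.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2 on R^2: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def poly2_features(x):
    """Explicit feature map phi such that k(x, y) = phi(x) . phi(y)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)                         # evaluated in the input space
explicit = dot(poly2_features(x), poly2_features(y))  # evaluated in the feature space
```

Both quantities agree up to rounding, although the kernel never builds the three-dimensional feature vectors. For kernels such as the Gaussian kernel the feature space is infinite-dimensional, and only the implicit evaluation is available.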
The fundamental tools used in SequeL come from the field of statistical learning [73]. We briefly present the most important ones for us to date, namely kernel-based nonparametric function approximation and nonparametric Bayesian models.
Kernel methods for nonparametric function approximation
In statistics, and in applied mathematics more generally, the approximation of a multidimensional real function given some samples is a well-known problem (known as regression, interpolation, function approximation, ...). Regressing a function from data is a key ingredient of our research or, at the least, a basic component of most of our algorithms. In the context of sequential learning, we have to regress a function while data samples are obtained one at a time, while remaining able to make predictions at any step of the acquisition process. In sequential decision problems, we typically have to learn a value function or a policy.
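The sequential constraint can be sketched as follows, assuming a one-dimensional input and a Nadaraya-Watson kernel estimator used as a simple stand-in for the regressors discussed in this section. The class name, the bandwidth, and the target function `sin` are all illustrative assumptions.

```python
import math


class SequentialKernelRegressor:
    """Nadaraya-Watson estimator that accepts samples one at a time."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.xs = []
        self.ys = []

    def update(self, x, y):
        """Store one new sample; no retraining step is needed."""
        self.xs.append(x)
        self.ys.append(y)

    def predict(self, x, default=0.0):
        """Kernel-weighted average of the targets seen so far."""
        if not self.xs:
            return default
        scale = 2.0 * self.bandwidth ** 2
        weights = [math.exp(-((x - xi) ** 2) / scale) for xi in self.xs]
        total = sum(weights)
        if total == 0.0:
            return default
        return sum(w * yi for w, yi in zip(weights, self.ys)) / total


model = SequentialKernelRegressor(bandwidth=0.3)
errors = []
for step in range(200):
    # Deterministic, well-spread inputs in [0, 2*pi).
    x = ((step * 0.618034) % 1.0) * 2.0 * math.pi
    y = math.sin(x)
    errors.append(abs(model.predict(x) - y))  # predict before seeing the sample
    model.update(x, y)
```

The estimator can be queried at any step of the acquisition, at the price of storing every sample. Keeping that memory and computation under control is one reason to look for sparse representations.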
Many methods have been proposed for this purpose. We are looking for those suited to the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large. The approximation is difficult in these areas, yet this is also where its accuracy should be maximal to obtain a good policy (and where, otherwise, a bad choice of action may have catastrophic consequences).
We particularly favor nonparametric methods since they make few assumptions about the function to learn. In particular, we have a strong interest in ${l}_{1}$-regularization and the (kernelized) LARS algorithm. ${l}_{1}$-regularization yields sparse solutions, and the LARS approach produces the whole regularization path very efficiently, which helps solve the problem of tuning the regularization parameter.
Nonparametric Bayesian models
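The sparsity induced by the ${l}_{1}$ penalty can be sketched as follows, assuming a plain linear model and a small orthogonal design chosen so that the result can be checked by hand. The solver is cyclic coordinate descent for a single value of the regularization parameter, not LARS, which would instead trace the whole path. All names (`walsh_design`, `lasso_coordinate_descent`, `lam`) are illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sign(bit):
    """Map a bit (0 or 1) to +1.0 or -1.0."""
    return 1.0 - 2.0 * bit


def walsh_design():
    """8 x 5 design matrix whose columns are mutually orthogonal +/-1 patterns."""
    rows = []
    for i in range(8):
        b0, b1, b2 = i & 1, (i >> 1) & 1, (i >> 2) & 1
        rows.append([sign(b0), sign(b1), sign(b2), sign(b0 ^ b1), sign(b0 ^ b2)])
    return rows


def lasso_coordinate_descent(X, y, lam, n_sweeps=50):
    """Minimize (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw for w = 0
    col_norm = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            if col_norm[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_norm[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


X = walsh_design()
# Two strong effects and one weak effect (coefficient 0.05 on the last feature).
y = [2.0 * row[0] - 3.0 * row[2] + 0.05 * row[4] for row in X]
w = lasso_coordinate_descent(X, y, lam=0.1)
# w is approximately [1.9, 0.0, -2.9, 0.0, 0.0]
```

The weak coefficient falls below the threshold and is set exactly to zero, while the two strong ones are shrunk by `lam`. Choosing `lam` is the tuning problem that a path algorithm such as LARS addresses.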
Numerous problems in signal processing may be solved efficiently by way of a Bayesian approach. The use of Monte Carlo methods allows us to handle nonlinear, as well as non-Gaussian, problems. In its standard form, the Bayesian approach requires probability densities to be formulated in a parametric form. For instance, it is common to use a Gaussian likelihood because it is convenient. However, in some applications such as Bayesian filtering or blind deconvolution, the choice of a parametric form for the density of the noise is often arbitrary. If this choice is wrong, it may have dramatic consequences on the estimation quality. To overcome this shortcoming, one possible approach is to consider that this density must also be estimated from data. A general Bayesian approach then consists in defining a probability space associated with the possible outcomes of the object to be estimated. Applied to density estimation, this means that we need to define a probability measure on the probability density of the noise: such a measure is called a random measure. The classical Bayesian inference procedures can then be used. Since this approach is by nature nonparametric, the associated framework is called nonparametric Bayesian.
In particular, mixtures of Dirichlet processes [72] provide a very powerful formalism. Dirichlet processes are one possible random measure, and mixtures of Dirichlet processes extend the well-known finite mixture models. Given a mixture density $f(x \mid \theta)$ and a Dirichlet process $G(d\theta) = \sum_{k=1}^{\infty} \omega_k \delta_{U_k}(d\theta)$, we define a mixture of Dirichlet processes as:
$F(x) = \int_{\Theta} f(x \mid \theta)\, G(d\theta) = \sum_{k=1}^{\infty} \omega_k f(x \mid U_k)$  (4)
where $F\left(x\right)$ is the density to be estimated. The class of densities that may be written as a mixture of Dirichlet processes is very wide, so they suit a very large number of applications.
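The infinite sum in equation (4) can be approximated by truncation. The sketch below assumes the stick-breaking construction of the weights, a Gaussian base measure for the atoms, and a unit-variance Gaussian for the mixture density. None of these choices comes from the text, and the names `stick_breaking`, `alpha`, and `truncation` are illustrative.

```python
import math
import random


def stick_breaking(alpha, base_sampler, truncation, rng):
    """Draw truncated weights and atoms of a Dirichlet process by stick-breaking."""
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(truncation - 1):
        beta = rng.betavariate(1.0, alpha)
        weights.append(remaining * beta)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - beta
    # Give the leftover mass to a final atom so that the weights sum to one.
    weights.append(remaining)
    atoms.append(base_sampler(rng))
    return weights, atoms


def normal_pdf(x, mean, std=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, atoms):
    """Truncated version of equation (4): sum over k of omega_k * f(x | U_k)."""
    return sum(w * normal_pdf(x, u) for w, u in zip(weights, atoms))


rng = random.Random(0)
weights, atoms = stick_breaking(
    alpha=2.0,
    base_sampler=lambda r: r.gauss(0.0, 3.0),  # assumed base measure N(0, 3^2)
    truncation=50,
    rng=rng,
)
density_at_zero = mixture_density(0.0, weights, atoms)
```

A small `alpha` concentrates the mass on a few atoms, whereas a large `alpha` spreads it over many. In practice the weights and atoms are not sampled from the prior as here but inferred from observations.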
Given a set of observations, the parameters of a mixture of Dirichlet processes are estimated by way of a Markov chain Monte Carlo (MCMC) algorithm. Dirichlet process mixtures are also widely used in clustering problems. Once the parameters of a mixture are estimated, they can also be interpreted as the parameters of a specific cluster defining a class. Dirichlet processes are well known within the machine learning community, and their potential in statistical signal processing still needs to be developed.
Random Finite Sets for multisensor multitarget tracking
In the general multisensor multitarget Bayesian framework, an unknown (and possibly varying) number of targets, whose states are ${x}_{1},...,{x}_{n}$, are observed by several sensors which produce a collection of measurements ${z}_{1},...,{z}_{m}$ at every time step $k$. Well-known approaches to this problem are track-based models, such as joint probabilistic data association (JPDA), and joint multitarget probabilities, such as the joint multitarget probability density. Common difficulties in multitarget tracking arise from the fact that the system state and the collection of measurements from the sensors are unordered and that their sizes evolve randomly through time. Vector-based algorithms must therefore account for exchanges of state coordinates and for missing data within an unknown time interval. Although this approach is very popular and has resulted in many algorithms in the past, it may not be the optimal way to tackle the problem, since the state and the data are in fact sets and not vectors.
Random finite set theory provides a powerful framework to deal with these issues. Mahler's work on finite set statistics (FISST) provides a mathematical framework to build multiobject densities and derive the Bayesian rules for state prediction and state estimation. The randomness in the number of objects and in their states is encapsulated in random finite sets (RFS), namely the multitarget (state) set $X = \{x_1, ..., x_n\}$ and the multisensor (measurement) set $Z_k = \{z_1, ..., z_m\}$. The objective is then to propagate the multitarget probability density $f_{k|k}(X \mid Z^{(k)})$ by using the Bayesian set equations at every time step $k$:
$\begin{array}{c} f_{k+1|k}(X \mid Z^{(k)}) = \int f_{k+1|k}(X \mid W)\, f_{k|k}(W \mid Z^{(k)})\, \delta W \hfill \\ \hfill f_{k+1|k+1}(X \mid Z^{(k+1)}) = \frac{f_{k+1}(Z_{k+1} \mid X)\, f_{k+1|k}(X \mid Z^{(k)})}{\int f_{k+1}(Z_{k+1} \mid W)\, f_{k+1|k}(W \mid Z^{(k)})\, \delta W} \end{array}$  (5)
where:

$X = \{x_1, ..., x_n\}$ is a multitarget state, i.e. a finite set of elements $x_i$ defined on the single-target space $\mathcal{X}$ (the state $x_i$ of a target is usually composed of its position, its velocity, etc.);

$Z_{k+1} = \{z_1, ..., z_m\}$ is the current multisensor observation, i.e. a collection of measurements $z_i$ produced at time $k+1$ by all the sensors;

$Z^{(k)} = \bigcup_{t \leqslant k} Z_t$ is the collection of observations up to time $k$;

$f_{k|k}(W \mid Z^{(k)})$ is the current multitarget posterior density in state $W$;

$f_{k+1|k}(X \mid W)$ is the current multitarget Markov transition density, from state $W$ to state $X$;

$f_{k+1}(Z \mid X)$ is the current multisensor/multitarget likelihood function.
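As a point of comparison, the prediction and update steps of equations (5) have a simple single-sensor/single-target analogue in which the state is one cell of a finite grid and the set integrals become ordinary sums. The sketch below assumes three cells arranged in a ring, a target that tends to move to the next cell, and a sensor that reports the right cell with probability 0.7. All of these values are illustrative.

```python
def predict(posterior, transition):
    """Prediction step: prior[x] = sum over w of transition[w][x] * posterior[w]."""
    n = len(posterior)
    return [sum(transition[w][x] * posterior[w] for w in range(n)) for x in range(n)]


def update(prior, likelihood):
    """Update step: posterior[x] is proportional to likelihood[x] * prior[x]."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # denominator of the Bayes rule
    return [u / evidence for u in unnormalized]


# transition[w][x]: probability of moving from cell w to cell x.
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.8, 0.0, 0.2],
]
# Likelihood of the measurement "target seen in cell 1" for each true cell.
likelihood_cell_1 = [0.15, 0.7, 0.15]

posterior_0 = [1.0 / 3.0] * 3                    # uniform initial belief
prior_1 = predict(posterior_0, transition)       # still uniform
posterior_1 = update(prior_1, likelihood_cell_1)  # approximately [0.15, 0.7, 0.15]
prior_2 = predict(posterior_1, transition)       # approximately [0.15, 0.26, 0.59]
```

In the multitarget case the state is a finite set of unknown size, so the sums over cells are replaced by set integrals over all possible target configurations.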
Although equations (5) may seem similar to the classical single-sensor/single-target Bayesian equations, they are generally intractable because of the presence of the set integrals. Indeed, an RFS $\Xi $ is characterized by the family of its Janossy densities ${j}_{\Xi ,1}\left({x}_{1}\right)$, ${j}_{\Xi ,2}({x}_{1},{x}_{2})$, ... and not just by one density, as is the case with vectors. Mahler then introduced the Probability Hypothesis Density (PHD), defined on the single-target state space. The PHD is the quantity whose integral over any region $S$ is the expected number of targets inside $S$. Mahler proved that the PHD is the first-moment density of the multitarget probability density. Although defined on the single-target state space $\mathcal{X}$, the PHD encapsulates information on both the number of targets and their states. The PHD is a well-known method for single-sensor multitarget tracking problems in a Bayesian framework, but its extension to the multisensor case remains a challenge.
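The defining property of the PHD can be sketched numerically, assuming a one-dimensional state space and an intensity written as a weighted sum of Gaussians. This representation and the component values are illustrative assumptions; note that the weights need not sum to one.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def expected_targets(components, a, b):
    """Integral of the intensity over [a, b]: expected number of targets there."""
    return sum(
        weight * (normal_cdf(b, mean, std) - normal_cdf(a, mean, std))
        for weight, mean, std in components
    )


# Each component is (weight, mean, standard deviation) on a 1-D state space.
phd = [(0.9, -5.0, 0.5), (1.0, 0.0, 0.5), (0.6, 5.0, 0.5)]

total = expected_targets(phd, -1000.0, 1000.0)  # close to 0.9 + 1.0 + 0.6 = 2.5
near_origin = expected_targets(phd, -2.0, 2.0)  # close to 1.0
```

The integral over the whole space gives the expected number of targets, and the peaks of the intensity indicate where they are likely to be, which is how a single function on the single-target space carries information on both quantities.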