SEQUEL - 2011 - Annual activity report

Section: Scientific Foundations

Statistical learning

Before detailing some issues of statistical learning, let us recall the definitions of a few terms.

Machine learning: refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness. (source: AAAI website, http://www.aaai.org/AITopics/html/machine.html)
Statistical learning: is an approach to machine intelligence which is based on statistical modeling of data. With a statistical model in hand, one applies probability theory and decision theory to get an algorithm. This is opposed to using training data merely to select among different algorithms or using heuristics/“common sense” to design an algorithm.
Kernel method: Generally speaking, a kernel function is a function that maps a pair of points to a real value. Typically, this value measures the similarity between the two points. Provided it satisfies a few properties (symmetry and positive semi-definiteness), the kernel function implicitly defines a dot product in some function space. This formal property, along with several others, has given these methods a strong appeal over the last 10 years in the field of function approximation. Many classical algorithms have been “kernelized”, that is, restated in a much more general way than their original formulation. Kernels also implicitly induce a representation of the data in a “suitable” space where the problem to solve (classification, regression, ...) is expected to be simpler (non-linearity turns into linearity).
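As an illustration of the kernel definition above, the following minimal sketch builds the Gram matrix of a Gaussian kernel, which takes values near 1 for nearby points and near 0 for distant ones. The choice of the Gaussian kernel, the bandwidth, the sample points and all function names are assumptions made for this example; they are not taken from the report.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel values for every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, c):
    """Compute c^T K c; it is non-negative for any c when K is positive semi-definite."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Hypothetical sample points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, gaussian_kernel)
value = quadratic_form(K, [1.0, -1.0, 0.5])
print(K)
print(value)
```

The Gram matrix is symmetric, and the quadratic form stays non-negative for any coefficient vector. This is the property that lets the kernel act as a dot product in an implicit feature space.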

The fundamental tools used in SequeL come from the field of statistical learning [73]. We briefly present those that are most important to us to date, namely kernel-based non parametric function approximation and non parametric Bayesian models.

Kernel methods for non parametric function approximation

In statistics, and in applied mathematics more generally, approximating a multi-dimensional real function from samples is a well-known problem (known as regression, interpolation, or function approximation, ...). Regressing a function from data is a key ingredient of our research or, at the least, a basic component of most of our algorithms. In the context of sequential learning, we have to regress a function while data samples arrive one at a time, under the constraint that we must be able to make predictions at any step of the acquisition process. In sequential decision problems, we typically have to learn a value function or a policy.
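The sequential setting described above can be sketched with a very simple non parametric regressor that accepts one sample at a time and can be queried at any step. This is only an illustrative sketch: the Nadaraya-Watson kernel smoother, the class name `OnlineKernelSmoother`, the bandwidth and the sine target are assumptions chosen for the example, not the algorithms used by the team.

```python
import math

class OnlineKernelSmoother:
    """Nadaraya-Watson regressor that stores samples as they arrive."""

    def __init__(self, bandwidth=0.3):
        self.bandwidth = bandwidth
        self.inputs = []
        self.targets = []

    def update(self, x, y):
        """Incorporate a single new sample (x, y)."""
        self.inputs.append(x)
        self.targets.append(y)

    def predict(self, x):
        """Kernel-weighted average of stored targets; usable at any step."""
        if not self.inputs:
            return 0.0  # arbitrary default before any data is seen
        weights = [
            math.exp(-((x - xi) ** 2) / (2.0 * self.bandwidth ** 2))
            for xi in self.inputs
        ]
        total = sum(weights)
        if total == 0.0:
            return sum(self.targets) / len(self.targets)
        return sum(w * y for w, y in zip(weights, self.targets)) / total

model = OnlineKernelSmoother(bandwidth=0.3)
abs_errors = []
for step in range(200):
    # Deterministic, well-spread inputs in [0, 2*pi) (hypothetical data stream).
    x = ((step * 0.618034) % 1.0) * 2.0 * math.pi
    y = math.sin(x)
    abs_errors.append(abs(model.predict(x) - y))  # predict before seeing y
    model.update(x, y)

print(model.predict(math.pi / 2.0))
```

Storing every sample keeps the sketch short, but the cost of a prediction grows with the number of samples. This is one reason why sparse approximations are attractive in this setting.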

Many methods have been proposed for this purpose. We are looking for those best suited to the problems we wish to solve. In reinforcement learning, the value function may have areas where the gradient is large. These areas are difficult to approximate, yet they are also the areas where the accuracy of the approximation should be maximal to obtain a good policy (and where, otherwise, a bad choice of action may have catastrophic consequences).

We particularly favor non parametric methods since they make few assumptions about the function to learn. In particular, we have a strong interest in $l_{1}$-regularization and the (kernelized) LARS algorithm. $l_{1}$-regularization yields sparse solutions, and the LARS approach produces the whole regularization path very efficiently, which helps solve the problem of tuning the regularization parameter.
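To illustrate how $l_{1}$-regularization produces sparse solutions, here is a minimal sketch that solves the lasso problem by cyclic coordinate descent for a few values of the regularization parameter. Note that this is not the LARS algorithm: LARS computes the whole path exactly and far more efficiently, whereas this sketch simply re-solves the problem on a coarse grid. The toy data, the function names and the grid of parameter values are assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - X w||^2 + lam * ||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Hypothetical toy data with orthogonal columns: y = 2 * x0 + 0.05 * x1.
X = [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [2.05, 1.95, -1.95, -2.05]

path = {}
for lam in (0.01, 0.1, 3.0):
    path[lam] = lasso_coordinate_descent(X, y, lam)
    print(lam, path[lam])
```

As the regularization parameter grows, more coefficients are set exactly to zero, which is the sparsity effect mentioned above.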

Non–parametric Bayesian models

Numerous problems in signal processing may be solved efficiently by way of a Bayesian approach. The use of Monte-Carlo methods allows us to handle non-linear, as well as non-Gaussian, problems. In their standard form, these methods require probability densities to be formulated in a parametric form. For instance, it is common to use a Gaussian likelihood because it is convenient. However, in some applications such as Bayesian filtering or blind deconvolution, the choice of a parametric form for the density of the noise is often arbitrary. If this choice is wrong, it may have dramatic consequences on the estimation quality. To overcome this shortcoming, one possible approach is to consider that this density must also be estimated from data. A general Bayesian approach then consists in defining a probability space associated with the possible outcomes of the object to be estimated. Applied to density estimation, this means that we need to define a probability measure on the probability density of the noise: such a measure is called a random measure. The classical Bayesian inference procedures can then be used. Since this approach is non parametric by nature, the associated framework is called non parametric Bayesian.

In particular, mixtures of Dirichlet processes [72] provide a very powerful formalism. Dirichlet processes are one possible random measure, and mixtures of Dirichlet processes are an extension of the well-known finite mixture models. Given a mixture density $f (x | θ)$ and a Dirichlet process $G (d θ) = \sum_{k = 1}^{\infty} ω_{k} δ_{U_{k}} (d θ)$, we define a mixture of Dirichlet processes as:

$$F (x) = \int_{Θ} f (x | θ) G (d θ) = \sum_{k = 1}^{\infty} ω_{k} f (x | U_{k}) \tag{4}$$

where $F (x)$ is the density to be estimated. The class of densities that can be written as a mixture of Dirichlet processes is very wide, so the model suits a large number of applications.
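The discrete form of $G$ used in the mixture above can be simulated with the stick-breaking construction, in which each weight $ω_{k}$ is a Beta-distributed fraction of the mass left over by the previous weights. The sketch below truncates the infinite sum to a finite number of atoms. The concentration parameter `alpha`, the Gaussian base measure for the atoms $U_{k}$, the unit-variance Gaussian used for $f (x | θ)$ and the truncation level are all assumptions made for this example.

```python
import math
import random

def stick_breaking(alpha, base_sampler, n_atoms, rng):
    """Truncated stick-breaking draw of G = sum_k w_k * delta_{U_k}."""
    weights, atoms = [], []
    remaining = 1.0  # mass of the stick not yet assigned
    for _ in range(n_atoms):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - fraction
    return weights, atoms, remaining

def gaussian_pdf(x, mean, std=1.0):
    """Component density f(x | theta), assumed Gaussian with mean theta."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, atoms):
    """Truncated version of F(x) = sum_k w_k f(x | U_k)."""
    return sum(w * gaussian_pdf(x, u) for w, u in zip(weights, atoms))

rng = random.Random(0)
weights, atoms, leftover = stick_breaking(
    alpha=2.0,
    base_sampler=lambda r: r.gauss(0.0, 5.0),  # assumed base measure
    n_atoms=50,
    rng=rng,
)
print(sum(weights), leftover)
print(mixture_density(0.0, weights, atoms))
```

The mass left over after truncation shrinks geometrically on average, so a moderate number of atoms usually captures almost all of the weight. In practice the weights and atoms are not simulated from the prior like this but inferred from the observations.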

Given a set of observations, the parameters of a mixture of Dirichlet processes are estimated by way of a Markov chain Monte Carlo (MCMC) algorithm. Dirichlet process mixtures are also widely used in clustering problems. Once the parameters of a mixture are estimated, they can also be interpreted as the parameters of a specific cluster defining a class. Dirichlet processes are well known within the machine learning community, and their potential in statistical signal processing still needs to be developed.

Random Finite Sets for multisensor multitarget tracking

In the general multi-sensor multi-target Bayesian framework, an unknown (and possibly varying) number of targets, whose states are $x_{1}, \ldots, x_{n}$, are observed by several sensors which produce a collection of measurements $z_{1}, \ldots, z_{m}$ at every time step $k$. Well-known models for this problem are track-based models, such as joint probabilistic data association (JPDA), or joint multi-target probabilities, such as the joint multi-target probability density. Common difficulties in multi-target tracking arise from the fact that the system state and the collection of measurements from the sensors are unordered and that their sizes evolve randomly through time. Vector-based algorithms must therefore account for exchanges of state coordinates and for missing data within an unknown time interval. Although this approach is very popular and has resulted in many algorithms in the past, it may not be the optimal way to tackle the problem, since the state and the data are in fact sets and not vectors.

Random finite set theory provides a powerful framework to deal with these issues. Mahler's work on finite set statistics (FISST) provides a mathematical framework to build multi-object densities and derive the Bayesian rules for state prediction and state estimation. The randomness in the number of objects and in their states is encapsulated into random finite sets (RFS), namely the multi-target (state) set $X = \{x_{1}, \ldots, x_{n}\}$ and the multi-sensor (measurement) set $Z_{k} = \{z_{1}, \ldots, z_{m}\}$. The objective is then to propagate the multi-target probability density $f_{k | k} (X | Z^{(k)})$ by using the Bayesian set equations at every time step $k$:

$$\begin{aligned} f_{k + 1 | k} (X | Z^{(k)}) &= \int f_{k + 1 | k} (X | W) \, f_{k | k} (W | Z^{(k)}) \, δ W \\ f_{k + 1 | k + 1} (X | Z^{(k + 1)}) &= \frac{f_{k + 1} (Z_{k + 1} | X) \, f_{k + 1 | k} (X | Z^{(k)})}{\int f_{k + 1} (Z_{k + 1} | W) \, f_{k + 1 | k} (W | Z^{(k)}) \, δ W} \end{aligned} \tag{5}$$

where:

$X = \{x_{1}, \ldots, x_{n}\}$ is a multi-target state, i.e. a finite set of elements $x_{i}$ defined on the single-target space $𝒳$ (the state $x_{i}$ of a target is usually composed of its position, its velocity, etc.);
$Z_{k + 1} = \{z_{1}, \ldots, z_{m}\}$ is the current multi-sensor observation, i.e. a collection of measurements $z_{i}$ produced at time $k + 1$ by all the sensors;
$Z^{(k)} = ⋃_{t ⩽ k} Z_{t}$ is the collection of observations up to time $k$;
$f_{k | k} (W | Z^{(k)})$ is the current multi-target posterior density in state $W$;
$f_{k + 1 | k} (X | W)$ is the current multi-target Markov transition density, from state $W$ to state $X$;
$f_{k + 1} (Z | X)$ is the current multi-sensor/multi-target likelihood function.

Although equations (5) may seem similar to the classical single-sensor/single-target Bayesian equations, they are generally intractable because of the presence of the set integrals. Indeed, an RFS $Ξ$ is characterized by the family of its Janossy densities $j_{Ξ, 1} (x_{1})$, $j_{Ξ, 2} (x_{1}, x_{2})$, ... and not by a single density, as is the case with vectors. Mahler then introduced the probability hypothesis density (PHD), defined on the single-target state space. The PHD is the quantity whose integral over any region $S$ is the expected number of targets inside $S$. Mahler proved that the PHD is the first-moment density of the multi-target probability density. Although defined on the single-target state space $𝒳$, the PHD encapsulates information on both the number of targets and their states. The PHD is a well-known approach for single-sensor multi-target tracking problems in a Bayesian framework, but its extension to the multi-sensor case seems to remain a challenge.
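For reference, the set integral and the defining property of the PHD can be written out as follows. This is the standard form found in the finite set statistics literature, given here as a reminder rather than as a statement from the report. The symbol $D_{Ξ}$ for the PHD of the RFS $Ξ$ is a notation assumed for this example.

```latex
% Set integral of a function f defined on finite subsets W of the single-target space:
% one term per possible number n of elements, each an ordinary integral.
\int f(W) \, δW = f(\emptyset) + \sum_{n = 1}^{\infty} \frac{1}{n!} \int f(\{w_{1}, \ldots, w_{n}\}) \, dw_{1} \cdots dw_{n}

% Defining property of the PHD D_{Ξ}: its integral over a region S
% is the expected number of elements of Ξ that fall inside S.
\int_{S} D_{Ξ}(x) \, dx = \mathbb{E}\left[ \, |Ξ \cap S| \, \right]
```

The infinite sum over the number of elements is what makes the recursion costly, whereas the PHD is an ordinary function on the single-target space and can be propagated with ordinary integrals.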
