## Section: Scientific Foundations

### Regression models of supervised learning

The most obvious contribution of statistics to machine learning is to consider the supervised learning scenario as a special case of regression estimation: given $n$ independent pairs of observations $({X}_{i},{Y}_{i})$, $i=1,\cdots ,n$, the aim is to “learn” the dependence of ${Y}_{i}$ on ${X}_{i}$. Thus, classical results about statistical regression estimation apply, with the caveat that the hypotheses we can reasonably assume about the distribution of the pairs $({X}_{i},{Y}_{i})$ are much weaker than what is usually considered in statistical studies. The aim here is to assume very little, maybe only independence of the observed sequence of input-output pairs, and to validate model and variable selection schemes. These schemes should produce the best possible approximation of the joint distribution of $({X}_{i},{Y}_{i})$ within some restricted family of models. Their performance is evaluated according to some measure of discrepancy between distributions, a standard choice being to use the Kullback-Leibler divergence.

#### PAC-Bayes inequalities

One of the specialties of the team in this direction is to use PAC-Bayes inequalities to combine thresholded exponential moment inequalities. The name of this theory comes from its founder, David McAllester, and may be misleading. Indeed, its cornerstone is rather made of non-asymptotic entropy inequalities, and a perturbative approach to parameter estimation. The team has made major contributions to the theory, first focussed on classification [6] , then on regression [1] . It has introduced the idea of combining the PAC-Bayesian approach with the use of thresholded exponential moments, in order to derive bounds under very weak assumptions on the noise.

#### Sparsity and ${\ell}_{1}$–regularization

Another line of research in regression estimation is the use of sparse models, and its link with ${\ell}_{1}$–regularization. Regularization is the joint minimization of some empirical criterion and some penalty function; it should lead to a model that not only fits well the data but is also as simple as possible.

For instance, the Lasso uses a ${\ell}^{1}$–regularization instead of a ${\ell}^{0}$–one; it is
popular mostly because it leads to *sparse* solutions (the estimate has only a few nonzero coordinates),
which usually have a clear interpretation in many settings (e.g., the influence or lack of influence of some variables).
In addition, unlike ${\ell}^{0}$–penalization, the Lasso is *computationally feasible* for high-dimensional data.

#### Pushing it to the extreme: no assumption on the data

The next brick of our scientific foundations explains why and how, in certains cases, we may formulate absolutely no assumption on the data $({x}_{i},{y}_{i})$, $i=1,\cdots ,n$, which is then considered a deterministic set of input–output pairs.