

Section: New Results

Learning to Learn

Auto-*

Participants: Guillaume Charpiat, Isabelle Guyon, Marc Schoenauer, Michèle Sebag

PhDs: Léonard Blier, François Gonard, Zhengying Liu, Herilalaina Rakotoarison, Lisheng Sun, Pierre Wolinski

Collaboration: Vincent Renault (SME Artelys); Olivier Bousquet (Google Zurich), Yann Ollivier (Facebook)

Tau is an active player in the Auto-* field, having organized the sixth COSEAL workshop in Paris in September 2018. Auto-* studies at Tau investigate several complementary directions.

As discussed in Section 3.3, the most widely used approach relies on meta-features describing datasets, and builds upon past work in the team, such as the PhD of Nacim Belkhir (defended in 2017 [71]), who won a GECCO competition in 2018 (Section 5.1.1), and François Gonard's PhD [11], defended in May 2018: an empirical performance model is built from the meta-features and used to choose the best algorithm and its parameter configuration for an unknown dataset. One key difficulty is to design useful meta-features: taking inspiration from equivariant learning [136] and learning from distributions [124], ongoing work aims to learn such meta-features, based on the OpenML archive [153]. This extensive archive reports the test predictive accuracy obtained by a few hundred algorithm configurations over a few thousand datasets.
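
As a toy illustration of this meta-feature approach (a hedged sketch only: the meta-features, the random-forest surrogate and all names below are assumptions, not the team's actual system), an empirical performance model can be fit on past (meta-features, configuration, score) records and queried for a new dataset:

    # Sketch: meta-feature-based algorithm selection (assumed setup, not the actual Tau pipeline).
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Past experience: one row per (dataset meta-features, configuration encoding) pair,
    # with the observed test accuracy as target (e.g. harvested from OpenML runs).
    X_past = np.random.rand(1000, 8)   # placeholder: 5 meta-features + 3 config parameters
    y_past = np.random.rand(1000)      # placeholder: observed accuracies

    surrogate = RandomForestRegressor(n_estimators=200).fit(X_past, y_past)

    def recommend(meta_features, candidate_configs):
        """Return the candidate configuration with highest predicted accuracy."""
        rows = [np.concatenate([meta_features, c]) for c in candidate_configs]
        preds = surrogate.predict(np.array(rows))
        return candidate_configs[int(np.argmax(preds))], preds.max()

    new_dataset_meta = np.random.rand(5)              # e.g. n_samples, n_features, class entropy, ...
    configs = [np.random.rand(3) for _ in range(50)]  # candidate hyper-parameter vectors
    best_config, predicted_score = recommend(new_dataset_meta, configs)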

Also mentioned in Section 3.3, another popular approach to algorithm selection is collaborative filtering. Active learning was used on top of the CofiRank matrix factorization algorithm [156], improving both the accuracy and the time-to-solution of the recommendation algorithm [62].
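
The collaborative-filtering view can be sketched as follows (toy code: a plain SGD matrix factorization of a partially observed dataset x algorithm performance matrix; the cited work relies on CofiRank and active learning, which are not reproduced here):

    # Sketch: recommending algorithms by factorizing a partially observed
    # (dataset x algorithm) performance matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n_datasets, n_algos, k = 50, 20, 5
    P = rng.random((n_datasets, n_algos))           # placeholder performance matrix
    mask = rng.random(P.shape) < 0.3                # only 30% of the entries are observed

    U = 0.1 * rng.standard_normal((n_datasets, k))  # dataset latent factors
    V = 0.1 * rng.standard_normal((n_algos, k))     # algorithm latent factors

    lr, reg = 0.05, 0.01
    for _ in range(200):
        for i, j in zip(*np.where(mask)):
            ui, vj = U[i].copy(), V[j].copy()
            err = P[i, j] - ui @ vj                 # error on one observed entry
            U[i] += lr * (err * vj - reg * ui)
            V[j] += lr * (err * ui - reg * vj)

    # Predicted performance of every algorithm on dataset 0, including unobserved ones:
    scores = U[0] @ V.T
    print("recommended algorithm:", int(np.argmax(scores)))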

An original approach to Auto-*, explored in Herilalaina Rakotoarison's PhD, extends and adapts Monte-Carlo Tree Search (MCTS) to explore the structured space of preprocessing + learning algorithm configurations and gradually determine the best pipeline [40]; the resulting algorithm yields promising results compared to AutoSklearn. One difficulty consists in managing the exploration together with the resource allocation (considering subsampled datasets and/or limited computational resources in the early MCTS stages, akin to [91]).
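
A minimal sketch of tree search over a discrete pipeline space is given below (a UCB-based toy over two levels of choices; the actual algorithm handles full hyper-parameter configurations and resource allocation, none of which appear here, and all names are illustrative):

    # Sketch: UCB-style tree search over (preprocessing, learner) pipeline choices.
    import math, random
    from collections import defaultdict

    PREPROC = ["none", "scale", "pca"]
    LEARNER = ["knn", "tree", "svm"]

    def evaluate(pipeline):
        """Placeholder for the cross-validated accuracy of a pipeline on the target dataset."""
        random.seed(hash(pipeline) % 2**32)
        return random.random()

    visits = defaultdict(int)
    value = defaultdict(float)

    def ucb_pick(parent, children, c=1.4):
        def score(ch):
            if visits[ch] == 0:
                return float("inf")
            return value[ch] / visits[ch] + c * math.sqrt(math.log(visits[parent] + 1) / visits[ch])
        return max(children, key=score)

    root = ("root",)
    for _ in range(200):                        # MCTS-like iterations
        prep = ucb_pick(root, [(p,) for p in PREPROC])
        pipe = ucb_pick(prep, [prep + (l,) for l in LEARNER])
        reward = evaluate(pipe)                 # rollout = direct evaluation in this toy
        for node in (root, prep, pipe):         # backpropagate the reward along the path
            visits[node] += 1
            value[node] += reward

    best = max([(p, l) for p in PREPROC for l in LEARNER],
               key=lambda pl: value[pl] / max(visits[pl], 1))
    print("best pipeline found:", best)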

Most real-world domains evolve with time, and an important issue in real-world applications is life-long learning, as static models can rapidly become obsolete. An extension of AutoSklearn was proposed, as part of Lisheng Sun's PhD, that detects concept drift and corrects the current model accordingly [38].
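
A possible drift-aware wrapper can be sketched as follows (assumed setup: a sliding window and an accuracy-drop test triggering retraining; the actual extension of AutoSklearn in [38] is more elaborate):

    # Sketch: retrain the current model when accuracy on a sliding window drops.
    import numpy as np
    from collections import deque
    from sklearn.linear_model import SGDClassifier

    class DriftAwareModel:
        def __init__(self, window=500, tol=0.10):
            self.model = SGDClassifier()
            self.window = deque(maxlen=window)   # sliding window of recent (x, y) pairs
            self.ref_acc = None                  # accuracy right after (re)training
            self.tol = tol                       # allowed accuracy drop before retraining

        def fit(self, X, y):
            self.model.fit(X, y)
            self.ref_acc = self.model.score(X, y)

        def observe(self, x, y):
            """Process one labelled example arriving from the stream."""
            self.window.append((x, y))
            Xw = np.array([p[0] for p in self.window])
            yw = np.array([p[1] for p in self.window])
            acc = self.model.score(Xw, yw)
            if self.ref_acc is not None and acc < self.ref_acc - self.tol:
                # Drift suspected: retrain on the recent window only.
                self.fit(Xw, yw)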

Two ongoing works focus on the adjustment of hyper-parameters specifically for neural nets: deriving rules for the network architecture (Pierre Wolinski's PhD), and attaching a fixed learning rate to each neuron while calibrating the learning-rate distribution so that neurons become active sequentially, learning in an optimally agile manner during a given learning phase and remaining stable in later phases (Léonard Blier's PhD).
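
For the second line of work, the mechanism of attaching a fixed learning rate to each neuron can be illustrated as below (a toy PyTorch sketch: the log-uniform distribution and the per-row gradient rescaling are placeholder choices, not the calibration rule studied in the PhD):

    # Sketch: a fixed, per-neuron learning rate drawn from a chosen distribution.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Linear(100, 64)                         # 64 output neurons
    # One fixed learning rate per neuron, e.g. log-uniformly spread over [1e-4, 1e-1].
    neuron_lr = 10 ** torch.empty(64).uniform_(-4, -1)

    opt = torch.optim.SGD(layer.parameters(), lr=1.0)  # global lr folded into per-neuron rates

    x = torch.randn(32, 100)
    target = torch.randn(32, 64)
    loss = ((layer(x) - target) ** 2).mean()
    loss.backward()

    with torch.no_grad():
        # Row i of the weight matrix and bias i belong to neuron i: rescale their gradients.
        layer.weight.grad *= neuron_lr.unsqueeze(1)
        layer.bias.grad *= neuron_lr
    opt.step()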

A last direction of investigation concerns the design of challenges, which contribute to the collective advancement of research in the Auto-* field. The team has been very active in the series of AutoML challenges [42], and continuously contributes to the organization of new challenges (Section 7.6).

Deep Learning: Practical Theoretical Insights

Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag

PhDs: Léonard Blier, Corentin Tallec

Collaboration: Yann Ollivier (Facebook AI Research, Paris), the Altschuler and Wu lab. (UCSF, USA)

Even though a full mathematical understanding of deep learning is not available today, theoretical insights from information theory or from dynamical systems can bring significant improvements to practical deep learning algorithms or offer strong explanations for the success of some architectures compared to others.

In [32] we derive the LSTM structure in full from first axiomatic principles, using an axiom of robustness to temporal deformations (warpings) in the data. The LSTM architecture, introduced in the 1990s, is currently the dominant architecture for modeling temporal sequences (such as text) in deep learning. Yet the LSTM architecture itself is quite complex and appears very much ad hoc at first sight. We prove that LSTMs necessarily arise if one wants the model to handle time warpings in the data (such as arbitrary accelerations or decelerations of the signal). In fact, LSTM-like structures are the only way to provide robustness to such deformations: their complex equations can be derived axiomatically.
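
For reference, the standard LSTM update can be written out as a single step (sketch below; the gate ordering, dimensions and random parameters are placeholders). The forget and input gates act as a learned, per-unit, data-dependent rescaling of time, which is the mechanism tied to time-warping robustness in [32]:

    # Sketch: one step of a standard LSTM cell, written out to make the gating explicit.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h, c, W, U, b):
        """x: input, h: hidden state, c: cell state; W, U, b stack the 4 gate parameter blocks."""
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c_new = f * c + i * g          # forget/input gates: per-unit contraction of the past
        h_new = o * np.tanh(c_new)
        return h_new, c_new

    d_in, d_h = 8, 16
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4 * d_h, d_in))
    U = rng.standard_normal((4 * d_h, d_h))
    b = np.zeros(4 * d_h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.standard_normal(d_in), h, c, W, U, b)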

In [28] (long oral presentation at ICML) we tackle the problem of mode loss in generative models via information theory. The goal is to train generative models that produce new samples similar to those in a dataset (e.g., realistic images). The standard GAN approach couples a generative network with an adversary network whose job is to tell generated and genuine images apart. This suffers from mode loss: the generator focuses on producing a few kinds of images well rather than covering the full variety of the data. Instead, we propose to have the discriminator predict the proportion of true and fake images in a set of images, via an information-theoretic criterion. This makes the discriminator work at the level of the overall distribution of images from the generator rather than on individual images. By working on sets of images, the discriminator can detect statistical imbalances between different types of images created by the generator, thus reducing mode loss. An adapted architecture is derived for this, provably able to detect (in principle) all permutation-invariant statistics in a set of images.
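
A minimal sketch of a permutation-invariant, set-level discriminator is given below (assumed architecture: a shared per-image encoder followed by mean pooling, predicting the fake proportion of the set; an illustration in the spirit of the paper, not its exact architecture):

    # Sketch: a discriminator that looks at a whole set of images and predicts the
    # proportion of generated ("fake") samples in it; mean pooling over the set makes
    # the prediction invariant to the ordering of the images.
    import torch
    import torch.nn as nn

    class SetDiscriminator(nn.Module):
        def __init__(self, feat_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(       # per-image feature extractor (shared weights)
                nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, feat_dim))
            self.head = nn.Sequential(          # maps the pooled set feature to a proportion in [0, 1]
                nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

        def forward(self, image_set):
            # image_set: (set_size, 1, 28, 28); mean pooling gives permutation invariance
            feats = self.encoder(image_set)
            return self.head(feats.mean(dim=0))

    disc = SetDiscriminator()
    images = torch.randn(32, 1, 28, 28)             # a set of 32 (real + generated) images
    predicted_fake_proportion = disc(images)        # scalar in [0, 1]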

In [43] we tackle the problem of recurrent network training via the theory of dynamical systems. Recurrent networks deal with temporal data sequences exhibiting temporal dependencies. Backpropagation then becomes backpropagation through time: for every new data point, training must rewind the network's computations backward in time over all past data to update the model parameters. This is unrealistic in any real-time application where the data arrive online. Two years ago we presented a fully online solution avoiding this "time rewind" step, based on real-time, noisy but unbiased approximations of the model gradients. That solution was mathematically well motivated but extremely complex to implement for standard models such as LSTMs. We now have a simpler variant which can be implemented easily, in a black-box fashion, on top of any recurrent model, and which is just as well justified mathematically. The price to pay is more variance. In the long run, this could considerably extend the applicability of recurrent models to real-time situations.
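
To illustrate the forward-in-time bookkeeping that avoids rewinding past computations, the sketch below runs exact real-time recurrent learning on a single scalar unit (a toy example: the cited method instead maintains a cheap, unbiased stochastic approximation of these forward sensitivities so as to scale to real networks):

    # Sketch: fully online gradients for a scalar recurrent unit h_t = tanh(w*h_{t-1} + u*x_t),
    # computed forward in time with no backward pass over past data.
    import numpy as np

    rng = np.random.default_rng(0)
    w, u = 0.5, 0.3                   # recurrent and input weights
    h = 0.0                           # hidden state
    dh_dw, dh_du = 0.0, 0.0           # forward sensitivities, updated online
    lr = 0.01

    for t in range(1000):
        x = rng.standard_normal()
        target = np.sin(0.1 * t)      # placeholder online target
        h_new = np.tanh(w * h + u * x)
        # Forward-in-time update of d h_t / d w and d h_t / d u (h still holds h_{t-1} here).
        dh_dw = (1 - h_new**2) * (h + w * dh_dw)
        dh_du = (1 - h_new**2) * (x + w * dh_du)
        h = h_new
        loss_grad = 2 * (h - target)  # d/dh of the squared error at time t
        w -= lr * loss_grad * dh_dw
        u -= lr * loss_grad * dh_du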

In [31] (to be presented at ICLR 2019), we introduce a multi-domain adversarial learning algorithm in the semi-supervised setting. We extend the single-source H-divergence theory for domain adaptation to the case of multiple domains, and obtain bounds on the average- and worst-domain risk in multi-domain learning. This leads to a new loss accommodating semi-supervised multi-domain learning and domain adaptation. We obtain state-of-the-art results on two standard image benchmarks, and propose as a new benchmark a novel bioimage dataset, CELL, from automated microscopy, where cultured cells are imaged after being exposed to known and unknown chemical perturbations, and in which each domain displays a significant experimental bias.
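
A DANN-style toy sketch of multi-domain adversarial feature learning is given below (gradient reversal plus a multi-class domain discriminator; the actual loss and bounds derived in [31] are not reproduced, and all dimensions are placeholders):

    # Sketch: learn features that predict the class label while fooling a domain classifier.
    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lamb * grad_out, None   # reverse the gradient flowing to the features

    n_domains, feat_dim = 4, 64
    feature_net = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, feat_dim))
    label_head = nn.Linear(feat_dim, 10)          # class prediction (supervised where labels exist)
    domain_head = nn.Linear(feat_dim, n_domains)  # tries to guess the domain of origin

    x = torch.randn(32, 100)
    y = torch.randint(0, 10, (32,))
    d = torch.randint(0, n_domains, (32,))

    feats = feature_net(x)
    class_loss = nn.functional.cross_entropy(label_head(feats), y)
    # The reversal makes the feature net *maximize* the domain loss, i.e. learn domain-invariant features.
    domain_loss = nn.functional.cross_entropy(domain_head(GradReverse.apply(feats, 1.0)), d)
    (class_loss + domain_loss).backward()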

Analyzing and Learning Complex Systems

Participants: Cyril Furtlehner, Aurélien Decelle, François Landes

PhDs: Giancarlo Fissore

Collaboration: Jacopo Rocchi (LPTMS Paris Sud), the Simons team: Rahul Chako (post-doc), Andrea Liu (UPenn), David Reichman (Columbia), Giulio Biroli (ENS), Olivier Dauchot (ESPCI).

The information content of a trained model, for instance a restricted Boltzmann machine (RBM), can be analyzed by comparing the singular values/vectors of its weight matrix, referred to as data modes, to those of a random RBM (typically following a Marchenko-Pastur distribution) [83]. The general strategy here is to replace the analysis of the learning process of a single instance by that of a well-chosen statistical ensemble of models. In G. Fissore's PhD, the learning trajectory of an RBM is shown to start with a linear phase recovering the dominant modes of the data, followed by a non-linear regime where the interaction among the modes is characterized [15]. While the mean-field analysis conducted in closed form requires simplifying assumptions, it suggests simple heuristics to speed up the convergence and to simplify the models. Ongoing work concerns extensions of these results to settings with missing inputs on the practical side, and the analysis of exactly solvable RBMs - i.e., non-linear RBMs for which the contrastive divergence can be computed in closed form - on the theoretical side. Additionally, we are collaborating with J. Rocchi (LPTMS, Univ. Paris Sud) to investigate the landscape of RBMs learned from different initial conditions and to characterize it as a function of the number of parameters (hidden nodes) of the system.
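
The spectral comparison underlying this analysis can be sketched as follows (toy code: a random matrix stands in for a trained RBM weight matrix, and the Marchenko-Pastur bulk edge is used as the threshold above which singular values qualify as data modes):

    # Sketch: compare the singular-value spectrum of an RBM weight matrix with the
    # Marchenko-Pastur bulk expected for a purely random matrix with i.i.d. entries.
    import numpy as np

    n_visible, n_hidden, sigma = 784, 256, 0.01
    rng = np.random.default_rng(0)
    W = sigma * rng.standard_normal((n_visible, n_hidden))   # stand-in for a trained weight matrix

    s = np.linalg.svd(W, compute_uv=False)
    # Largest singular value of a random n_visible x n_hidden matrix with entry std sigma:
    bulk_edge = sigma * (np.sqrt(n_visible) + np.sqrt(n_hidden))

    data_modes = s[s > bulk_edge]        # modes emerging above the random bulk
    print(f"{len(data_modes)} singular values above the Marchenko-Pastur bulk edge")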

A long-standing application of the aforementioned mean-field inference methods based on probabilistic modelling concerns road-traffic forecasting. In [49] we wrap up some of the techniques developed in these past works and, thanks to PTV-SISTeMA, perform comprehensive experimental tests on various real-world urban traffic datasets, illustrating the effectiveness of our method under a variety of conditions. As a by-product, we show to some extent how to disentangle the model bias from errors caused by corrupted data, and shed some light on the nature of the data themselves.

An emerging research topic, which we started to investigate thanks to exchanges with Lenka Zdeborova's group [96], is to revisit the Information Bottleneck framework [151] and analyze, on non-toy neural networks, the gradual distillation of mutual information (MI) along the network layers: minimizing the MI with the input while preserving the MI with the sought output (the labels). More generally, information-theoretic concepts could also be used to analyze the behavior of a network, for instance to detect adversarial attacks through unusual neural activity.
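
A crude sketch of the kind of mutual-information estimate involved is given below (a binning estimator on a random 1-D projection of a layer's activations; real Information Bottleneck analyses use more careful estimators, and all quantities here are placeholders):

    # Sketch: binning-based estimate of I(layer activations; labels) in nats.
    import numpy as np

    def mutual_information(acts, labels, n_bins=30):
        """Estimate I(T; Y) where T = binned 1-D projection of the activations, Y = labels."""
        proj = acts @ np.random.default_rng(0).standard_normal(acts.shape[1])  # 1-D summary of the layer
        t = np.digitize(proj, np.histogram_bin_edges(proj, bins=n_bins))
        joint = np.zeros((t.max() + 1, labels.max() + 1))
        for ti, yi in zip(t, labels):
            joint[ti, yi] += 1
        joint /= joint.sum()
        pt, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / (pt @ py)[nz])).sum())

    acts = np.random.randn(1000, 50)                 # placeholder layer activations
    labels = np.random.randint(0, 10, size=1000)     # placeholder class labels
    print("I(layer; labels) ~", mutual_information(acts, labels), "nats")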

As mentioned earlier, the use of ML to address fundamental physics problems is growing quickly. One example is the domain of glasses (how the structure of glasses relates to their dynamics), one of the major open problems in modern theoretical physics. The idea is to let ML models automatically find the hidden structures (features) that control the flowing or non-flowing state of matter, discriminating liquid from solid states. These models could then help identify "computational order parameters", which would advance the understanding of the physical phenomena on the one hand, and support the development of more complex models on the other hand. Furthermore, this problem is new to the ML community and could provide an original, non-trivial example for engineering, testing and benchmarking explainability protocols.