Section: Research Program
Learning to learn
According to Ali Rahimi's test of times award speech at NIPS 17, the current ML algorithms have become a form of alchemy. Competitive testing and empirical breakthroughs gradually become mandatory for a contribution to be acknowledged; an increasing part of the community adopts trials and errors as main scientific methodology, and theory is lagging behind practice. This style of progress is typical of technological and engineering revolutions for some; others ask for consolidated and well-understood theoretical advances, saving the time wasted in trying to build upon hardly reproducible results.
Basically, while practical achievements have often passed the expectations, there exist caveats along three dimensions. Firstly, excellent performances do not imply that the model has captured what was to learn, as shown by the phenomenon of adversarial examples. Following Ian Goodfellow, some well-performing models might be compared to Clever Hans, the horse that was able to solve mathematical exercizes using non verbal cues from its teacher [111]; it is the purpose of Pillar I. to alleviate the Clever Hans trap (section 3.1).
Secondly, some major advances, e.g. related to the celebrated adversarial learning [101], [97], establish proofs of concept more than a sound methodology, where the reproducibility is limited due to i) the computational power required for training (often beyond reach of academic labs); ii) the numerical instabilities (witnessed as random seeds happen to be found in the codes); iii) the insufficiently documented experimental settings. What works, why and when is still a matter of speculation, although better understanding the limitations of the current state of the art is acknowledged to be a priority. After Ali Rahimi again, simple experiments, simple theorems are the building blocks that help us understand more complicated systems. Along this line, [128] propose toy examples to demonstrate and understand the defaults of convergence of gradient descent adversarial learning.
Thirdly, and most importantly, the reported achievements rely on carefully tuned learning architectures and hyper-parameters. The sensitivity of the results to the selection and calibration of algorithms has been identified since the end 80s as a key ML bottleneck, and the field of automatic algorithm selection and calibration, referred to as AutoML or Auto- in the following, is at the ML forefront.
Tau aims to contribute to the ML evolution toward a more mature stage along three dimensions. In the short term, the research done in Auto- will be pursued (section 3.3.1). In the medium term, an information theoretic perspective will be adopted to capture the data structure and to calibrate the learning algorithm depending on the nature and amount of the available data. In the longer term, our goal is to leverage the methodologies forged in statistical physics to understand and control the trajectories of complex learning systems (section 3.3.3).
Auto-*
Participants: Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhD: Guillaume Doquet, Zhengying Liu, Herilalaina Rakotoarison, Lisheng Sun
Collaboration: Olivier Bousquet, André Elisseeff (Google Zurich)
The so-called Auto- task, concerned with selecting a (quasi) optimal algorithm and its hyper-parameters depending on the problem instance at hand, remained a key issue in ML for the last three decades [75], as well as in optimization at large [110], including combinatorial optimization and constraint satisfaction [117], [100] and continuous optimization [71]. This issue, tackled by several European projects along the decades, governs the knowledge transfer to industry, due to the shortage of data scientists. It becomes even more crucial as models are more complex and their training requires more computational resources. This has motivated several international challenges devoted to Auto-ML [45] (see also Section 3.4), including the on-going AutoDL [37] (see also Section 7.6).
Several approaches have been used to tackle Auto- in the literature, and TAU has been particularly active in the first two. Meta-learning aims to build a surrogate performance model, estimating the performance of an algorithm configuration on any problem instance characterized from its meta-feature values [138], [100], [72], [71] [11]. Collaborative filtering, considering that a problem instance "likes better" an algorithm configuration yielding a better performance, learns to recommend good algorithms to problem instances [147], [130]. Bayesian optimization proceeds by alternatively building a surrogate model of algorithm performances on the problem instance at hand, and tackling it [92]. This last approach currently is the prominent one; as shown in [130], the meta-features developed for AutoML are hardly relevant, hampering both meta-learning and collaborative filtering.
Beyond these, current research directions in TAU include the design of more efficient features, as well as the design of an original approach based on MCTS algorithm (see Section 7.2.1).
Information theory: adjusting model complexity and data fitting
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhD: Corentin Tallec, Pierre Wolinski, Léonard Blier
Collaboration: Yann Ollivier (Facebook)
In the 60s, Kolmogorov and Solomonoff provided a well-grounded theory for building (probabilistic) models best explaining the available data [139], [103], that is, the shortest programs able to generate these data. Such programs can then be used to generate further data or to answer specific questions (interpreted as missing values in the data). Deep learning, from this viewpoint, efficiently explores a space of computation graphs, described from its hyperparameters (network structure) and parameters (weights). Network training amounts to optimizing these parameters, namely, navigating the space of computational graphs to find a network, as simple as possible, that explain the past observations well.
This vision is at the core of variational auto-encoders [116], directly optimizing a bound on the Kolmogorov complexity of the dataset. More generally variational methods provide quantitative criteria to identify superfluous elements (edges, units) in a neural network, that can potentially be used for structural optimization of the network (Leonard Blier's PhD, started Oct. 2018).
The same principles apply to unsupervised learning, aimed to find the maximum amount of structure hidden in the data, quantified using this information-theoretic criterion.
The known invariances in the data can be exploited to guide the model design (e.g. as translation invariance leads to convolutional structures, or LSTM is shown to enforce the invariance to time affine transformations of the data sequence [150]). Scattering transforms exploit similar principles [77]. A general theory of how to detect unknown invariances in the data, however, is currently lacking.
The view of information theory and Kolmogorov complexity suggests that key program operations (composition, recursivity, use of predefined routines) should intervene when searching for a good computation graph. One possible framework for exploring the space of computation graphs with such operations is that of genetic programming [70]. It is interesting to see that evolutionary computation appeared in the last two years among the best candidates to explore the space of deep learning structures [137], [122]. Other approaches might proceed by combining simple models into more powerful ones, e.g. using “Context Tree Weighting” [158] or switch distributions [88]. Another option is to formulate neural architecture design as a reinforcement learning problem [73]; the value of the building blocks (predefined routines) might be defined using e.g., Monte-Carlo Tree Search. A key difficulty is the computational cost of retraining neural nets from scratch upon modifying their architecture; an option might be to use neutral initializations to support warm-restart.
Analyzing and Learning Complex Systems
Participants: Cyril Furtlehner, Aurélien Decelle, François Landes, Michèle Sebag
PhD: Giancarlo Fissore
Collaboration: Enrico Camporeale (CWI); Jacopo Rocchi (LPTMS Paris Sud), the Simons team: Rahul Chako (post-doc), Andrea Liu (UPenn), David Reichman (Columbia), Giulio Biroli (ENS), Olivier Dauchot (ESPCI).
Methods and criteria from statistical physics have been widely used in ML. In early days, the capacity of Hopfield networks (associative memories defined by the attractors of an energy function) was investigated by using the replica formalism [68]. Restricted Boltzmann machines likewise define a generative model built upon an energy function trained from the data. Along the same lines, Variational Auto-Encoders can be interpreted as systems relating the free energy of the distribution, the information about the data and the entropy (the degree of ignorance about the micro-states of the system) [157]. A key promise of the statistical physics perspective and the Bayesian view of deep learning is to harness the tremendous growth of the model size (billions of weights in recent machine translation netwowrks), and make them sustainable through e.g. weight quantization [125], posterior drop-out [129], and probabilistic binary networks [126]. Such "informational cooling" of a trained deep network can reduce its size by several orders of magnitude while preserving its performance.
Statistical physics is among the key expertises of Tau , originally only represented by Cyril Furtlehner, later strenghtened by Aurélien Decelle's and François Landes' arrivals in 2014 and 2018. On-going studies are conducted along several directions.
Generative models are most often expressed in terms of a Gibbs distributions , where energy involves a sum of building blocks, modelling the interactions among variables. This formalization makes it natural to use mean-field methods of statistical physics and associated inference algorithms to both train and exploit such models. The difficulty is to find a good trade-off between the richness of the structure and the efficiency of mean-field approaches. One direction of research pursued in TAU, [94] in the context of traffic forecasting, is to account for the presence of cycles in the interaction graph, to adapt inference algorithms to such graphs with cycles, while constraining graphs to remain compatible with mean-field inference.
Another direction, explored in TAO in the recent years, is based on the definition and exploitation of self-consistency properties, enforcing principled divide-and-conquer resolutions. In the particular case of the message-passing Affinity Propagation algorithm for instance [159], self-consistency imposes the invariance of the solution when handled at different scales, thus enabling to characterize the critical value of the penalty and other hyper-parameters in closed form (in the case of simple data distributions) or empirically otherwise [95].
A more recent research direction examines the quantity of information in a (deep) neural net along the random matrix theory framework [80]. It is addressed in Giancarlo Fissore's PhD, and is detailed in Section 7.2.3.
A collaboration with L. Zbederova's group at Ecole Normale Supérieure is just starting, thanks to François Landes' arrival, based on the study of some information-theoretic indicators for some class of Neural Network. It is, too, described in more details in Section 7.2.3.
Finally, we note the recent surge in using ML to address fundamental physics problems: from turbulence to high-energy physics and soft matter as well (with glasses at its core). TAU's dual expertise in Deep Networks and in statistical physics places it in an ideal position to significantly contribute to this domain and shape the methods that will be used by the physics community in the future. François Landes' recent arrival in the team makes TAU a unique place for such interdisciplinary research, thanks to his collaborators from the Simons Collaboration Cracking the Glass Problem (gathering 13 statistical physics teams at the international level). This project is detailed in Section 7.2.3.