Since its creation in 2003, TAO's activities had evolved constantly but slowly, as old problems were solved and new applications arose, bringing new fundamental issues to tackle. Recent abrupt progress in Machine Learning (in particular in Deep Learning) has greatly accelerated these changes within the team. This change of slope coincided with more practical changes in the TAO ecosystem: following Inria's 12-year rule, the team definitely ended in December 2016. The new team TAU (for TAckling the Underspecified) was proposed, and formally created in July 2019. At the same time, important staff changes took place, which justified even sharper changes in the team's focus. During 2018, the second year of this new era for the (remaining) members of the team, our research topics stabilized around a final version of the TAU project.
Following the dramatic changes in TAU staff during the years 2016-2017 (see the team's 2017 activity report for details), research on continuous optimization has definitely faded out in TAU (while the research axis on hyperparameter tuning has refocused on Machine Learning algorithms), the Energy application domain has slightly changed direction under Isabelle Guyon's supervision (Section ), after the completion of the work started by Olivier Teytaud, and a few new directions have emerged around the robustness of ML systems (Section ). The other research topics have been continued, as described below.
Building upon the expertise in machine learning (ML) and optimization of the Tao team, the Tau project tackles some under-specified challenges behind the New Artificial Intelligence wave. The simultaneous advent of massive data and massive computational power, blurring the boundaries between data, structure, knowledge and common sense, seemingly makes it possible to fulfill all promises of the good old AI, now or soon.
This makes NewAI under-specified in three respects. A first dimension regards the relationships between AIs and human beings. The necessary conditions for AI systems to be accepted by mankind and/or to contribute to the common good are yet to be formally defined; it is hard to believe that a general and computable definition of “ethical behavior" can be set once and for all. Some of these necessary conditions (explainable and causal modeling; unbiased data and models; model certification) can nevertheless be cast as ambitious yet realistic goals for public research.
A second dimension regards the relationships between AI, data and knowledge. Closed-world AIs can manage and acquire sufficient data to reach human-level performance from scratch . In open worlds however, prior knowledge is used in various ways to overcome the lack of direct interactions with the world, e.g. through i) exploiting domain-dependent data invariances in intension or in extension (ranging from convolution to domain augmentation); ii) taking advantage of the low-rank structure (generative learning) or known properties (equivariant learning) of the observed data; iii) leveraging diverse domains and datasets, assumedly related to each other (domain adaptation; multi-task learning). A general and open question is how available prior knowledge can best be leveraged by an AI algorithm, all the more so as domains with small to medium-size data are considered.
A third dimension regards the intrinsic limitations of AI in terms of information theory. Long-established theories, e.g. those rooted in Occam's razor, currently hardly account for the practical leaps of deep learning, where the solution dimension outnumbers the input dimension. Beyond trial and error, a long-term goal is to characterize the learning landscape w.r.t. order parameters yet to be defined, and to estimate a priori the regions of problem instances where it is likely/possible/unlikely to learn accurate models.
The above under-specified AI issues define three core research pillars (Section 3), examining three interdependent aspects of AI:
I. The first pillar aims to answer the question of what 'good AI' means, and how to build it. More specifically, our goal is to advance the state of the art concerning robust learning (against adversarial attacks), causal modeling (aimed at supporting explanations and prescriptions), and models that are unbiased in the sense of prescribed neutrality constraints (including the assessment and repair of the data).
II. The second pillar tackles the "innate vs acquired" question: how to best combine available human knowledge, and agnostic machine learning. Tau will examine this question focusing on domains with spatial and temporal multi-scale structure, as pervasive in natural sciences (where domain knowledge is expressed using PDEs, or through powerful compact representations as in signal processing), taking advantage of the pluri-disciplinary expertise and scientific collaborations of the Tau members.
III. The third pillar aims to understand the learning landscape. In the short term, it tackles the so-called AutoML question of automatic algorithm selection and calibration; in the longer term, the goal is to characterize the regions of the learning landscape where accurate models can be learned.
The above research pillars will take inspiration and be validated with three application domains (Section 4):
1. Energy management encompasses a variety of scientific problems related to research pillars I. (fair learning, privacy-compliant modelling, safety-related guarantees) and II. (spatio-temporal multi-scale modelling, distributional learning). It is also a strategic application for the planet, where Tau benefits from the Tao expertise and long-established relationships with Artelys (ILab Metis) and RTE.
2. Computational Social Sciences raise questions, and offer methodological lessons on how to address them in a spirit of common decency, along research pillar I. On-going studies at Tau include the learning and randomized assessment of prescriptive models for Human Resources (hiring and vocational studies; quality of life at work and economic performance) and nutrition habits (in relation with social networks and health), where i) learned models must be unbiased although the data are undoubtedly biased; ii) prior knowledge must be accounted for and interpretation of the learned models is mandatory; iii) causal modelling is key, as the models are deployed for prescription and self-fulfilling prophecies must be avoided at all costs.
3. Optimal data-driven design considers several physical or simulated phenomena, ranging from high-energy physics to space weather, from population biology to medical imaging, from signal processing to the certification of autonomous vehicle controllers, with: i) medium-size data; ii) extensive prior knowledge, notably concerning the symmetries and properties of the sought models; iii) computationally expensive simulators. All three characteristics are relevant to pillars II and III.
As discussed by , the topic of ethical AI was non-existent until 2010, was laughed at in 2016, and became a hot topic in 2017, as the disruptive impact of AI on the fabric of life (travel, education, entertainment, social networks, politics, to name a few) became unavoidable , together with its expected impact on the nature and number of jobs. As of now, it seems that the risk of a new AI Winter might arise from legal and ethical concerns.
The ambition of Tau is to mitigate the BNBB risk along several intertwined dimensions, and build i) causal and explainable models; ii) fair data and models; iii) provably robust models.
Participants: Isabelle Guyon, Michèle Sebag, Philippe Caillou, Paola Tubaro
PhD: Diviyan Kalainathan
Collaboration: Olivier Goudet (Université d'Angers), David Lopez-Paz (Facebook)
The extraction of causal models, a long-standing goal of AI , , , has become a strategic issue as the usage of learned models has gradually shifted from prediction to prescription in recent years. This evolution, in line with Auguste Comte's vision of science (Savoir pour prévoir, afin de pouvoir), indeed reflects the exuberant optimism about AI: Knowledge enables Prediction; Prediction enables Control.
However, although predictive models can be based on correlations, prescriptions can only be based on causal models.
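The distinction can be illustrated on purely synthetic data (all numbers below are illustrative): a hidden confounder makes X and Y strongly correlated, yet intervening on X leaves Y unchanged, so a prescription based on the observed correlation would fail.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder Z drives both X and Y; X has no causal effect on Y.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

# Observational correlation is strong ...
corr = np.corrcoef(x, y)[0, 1]

# ... yet intervening on X (setting it independently of Z) leaves Y unchanged.
x_do = rng.normal(size=n)            # do(X): X no longer depends on Z
y_do = z + 0.5 * rng.normal(size=n)
corr_do = np.corrcoef(x_do, y_do)[0, 1]

print(round(corr, 2), round(abs(corr_do), 2))
```

The predictive model (regressing Y on X) remains accurate; it is only the prescription "change X to change Y" that the correlation does not support.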
The research applications at Tau concerned with causal modeling, predictive modeling or collaborative filtering include all the projects described in Section (see also Section ), studying the relationships between: i) the educational background of persons and job openings (FUI project JobAgile and DataIA project Vadore); ii) the quality of life at work and the economic performance indicators of companies (ISN Lidex project Amiqap); iii) the nutritional items bought by households (at the granularity of the barcode) and their health status, as approximated by their body-mass index (IRS UPSaclay Nutriperso); iv) the actual offer of restaurants and their scores on online rating systems. In these projects, a wealth of data is available (though hardly sufficient for applications ii, iii and iv), and there is little doubt that these data reflect the imbalances and biases of the world as it is, ranging from gender to racial to economic prejudice. Preventing the learned models from perpetuating such biases is essential to deliver an AI endowed with common decency.
In some cases, the bias is known; for instance, the cohorts in the Nutriperso study are better-off than the average French population, and the Kantar database includes explicit weights to address this bias through importance sampling. In other cases, the bias is only guessed; for instance, the companies for which Secafi data are available hardly correspond to a uniform sample, as these data have been gathered upon the request of the company trade union.
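The known-bias case can be sketched as follows (a minimal synthetic example, not the actual Kantar data): explicit group weights, defined as the ratio of population to sample shares, de-bias the estimate of a population mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: a cohort over-samples well-off households.
# Population: income groups 0 (modest) and 1 (well-off), 50/50,
# with different mean consumption of some nutritional item.
pop_share = np.array([0.5, 0.5])
group_mean = np.array([2.0, 4.0])           # true group means
true_mean = float(pop_share @ group_mean)   # 3.0

# Biased sample: 80% of respondents are well-off.
sample_share = np.array([0.2, 0.8])
groups = rng.choice(2, size=50_000, p=sample_share)
obs = group_mean[groups] + rng.normal(scale=0.5, size=groups.size)

naive = obs.mean()                          # biased towards the well-off group

# Importance weights = population share / sample share, per group.
w = (pop_share / sample_share)[groups]
reweighted = float(np.sum(w * obs) / np.sum(w))

print(round(naive, 2), round(reweighted, 2))
```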
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhDs: Julien Girard, Marc Nabhan, Nizham Makhoud
Collaboration: Zakarian Chihani (CEA); Hiba Hage, and Yves Tourbier (Renault); Jérôme Kodjabachian (Thalès THERESIS)
Due to their outstanding performance, deep neural networks, and more generally machine learning-based decision making systems (referred to as MLs in the following), have raised hopes in recent years of achieving breakthroughs in critical systems, ranging from autonomous vehicles to defense. The main pitfall for such applications lies in the lack of guarantees for the robustness of MLs.
Specifically, MLs are used when the mainstream software design process does not apply, that is, when no formal specification of the target software behavior is available and/or when the system is embedded in an open, unpredictable world. The extensive body of knowledge developed to deliver guarantees about mainstream software thus hardly applies to MLs.
These downsides, which currently prevent the dissemination of MLs in safety-critical systems (SCS), call for a considerable amount of research in order to understand when, and to what extent, MLs can be certified to provide the desired level of guarantees.
Julien Girard's PhD (CEA scholarship), started in Oct. 2018 and co-supervised by Guillaume Charpiat and Zakaria Chihani (CEA), is devoted to the extension of abstract interpretation to deep neural nets, and the formal characterization of the transition kernel from input to output space achieved by a DNN (robustness by design, coupled with formally assessing the coverage of the training set). This approach is tightly related to the inspection and opening of black-box models, aimed at characterizing the patterns in the input instances responsible for a decision – another step toward explainability.
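A minimal sketch of the abstract-interpretation idea, using interval bounds (the simplest abstract domain) propagated through a toy two-layer ReLU network with illustrative weights: the output interval is a sound over-approximation of every output reachable from an L-infinity ball around the input.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Propagate an input box [lo, hi] through x -> W @ x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

# Toy 2-layer network (weights are illustrative, not from a real model).
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.0])

lo, hi = np.array([-0.1, -0.1]), np.array([0.1, 0.1])  # L_inf ball, eps = 0.1
lo, hi = relu_bounds(*affine_bounds(lo, hi, W1, b1))
lo, hi = affine_bounds(lo, hi, W2, b2)
print(lo, hi)  # sound over-approximation of the reachable outputs
```

If the certified output interval stays within the safe region, robustness holds for the whole input ball; the over-approximation may however be loose, which is where finer abstract domains come in.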
On the other hand, experimental validation of MLs, akin to statistical testing, also faces three limitations: i) real-world examples are notoriously insufficient to ensure good coverage in general; ii) for this reason, simulated examples are extensively used, but their use raises the reality-gap issue of the distance between the real and simulated worlds; iii) independently, the real world is naturally subject to domain shift (e.g. due to the technical improvement and/or aging of sensors). Our collaborations with Renault tackle such issues in the context of the autonomous vehicle (see Section ).
Participants: Alessandro Bucci, Guillaume Charpiat, Cécile Germain, Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhD: Théophile Sanchez, Loris Felardos, Wenzhuo Liu
In sciences and engineering, human knowledge is commonly expressed in closed form, through equations or mechanistic models characterizing how a natural or social phenomenon, or a physical device, will behave/evolve depending on its environment and external stimuli, under some assumptions and up to some approximations. The field of numerical engineering, and the simulators based on such mechanistic models, are at the core of most approaches to understand and analyze the world, from solid mechanics to computational fluid dynamics, from chemistry to molecular biology, from astronomy to population dynamics, from epidemiology and information propagation in social networks to economy and finance.
Most generally, numerical engineering supports the simulation, and when appropriate the optimization and control
At the other extreme, machine learning offers the opportunity to model phenomena from scratch, using any available data gathered through experiments or simulations. Recent successes of machine learning in computer vision, natural language processing and games, to name a few, have demonstrated the power of such agnostic approaches and their efficiency in terms of prediction , inverse problem solving , and sequential decision making , , despite their lack of any "semantic" understanding of the universe. Even before these successes, Anderson claimed that the data deluge [might make] the scientific method obsolete , as if a reasonable option were to throw away the existing equational or software bodies of knowledge and let Machine Learning rediscover all models from scratch. Such a claim is hampered, among others, by the fact that not all domains offer a wealth of data, as any academic involved in an industrial collaboration around data has discovered.
Another approach will be considered in Tau, investigating how existing mechanistic models and related simulators can be partnered with ML algorithms: i) to achieve the same goals with the same methods with a gain of accuracy or time; ii) to achieve new goals; iii) to achieve the same goals with new methods.
Toward more robust numerical engineering: In domains where satisfying mechanistic models and simulators are available, ML can contribute to improving their accuracy or usability. A first direction is to refine or extend the models and simulators to better fit the empirical evidence. The goal is to finely account for the different biases and uncertainties attached to the available knowledge and data, distinguishing the different types of known unknowns. Such known unknowns include the model hyper-parameters (coefficients), the systematic errors due to e.g., experiment imperfections, and the statistical errors due to e.g., measurement errors. A second approach is based on learning a surrogate model of the phenomenon under study that incorporates domain knowledge from the mechanistic model (or its simulation). See Section for case studies.
A related direction, typically when considering black-box simulators, aims to learn a model of the error, or equivalently, a post-processor of the software. The discrepancy between simulated and empirical results, referred to as reality gap , can be tackled in terms of domain adaptation , . Specifically, the source domain here corresponds to the simulated phenomenon, offering a wealth of inexpensive data, and the target domain corresponds to the actual phenomenon, with rare and expensive data; the goal is to devise accurate target models using the source data and models.
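A minimal sketch of the error-model idea (the phenomenon and simulator below are purely illustrative): the discrepancy between a few real measurements and the black-box simulator is itself learned, here with a low-degree polynomial, and used as a post-processor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative phenomenon and an imperfect simulator of it.
real = lambda x: np.sin(x) + 0.1 * x**2    # "ground truth"
sim  = lambda x: np.sin(x)                 # the simulator misses a term

# Wealth of cheap simulated data vs. a handful of real measurements.
x_real = rng.uniform(0, 3, size=20)
gap = real(x_real) - sim(x_real)           # observed reality gap

# Learn the error model: a low-degree polynomial fit of the discrepancy.
coeffs = np.polyfit(x_real, gap, deg=2)
corrected = lambda x: sim(x) + np.polyval(coeffs, x)

x_test = np.linspace(0, 3, 50)
err_sim = np.max(np.abs(sim(x_test) - real(x_test)))
err_corr = np.max(np.abs(corrected(x_test) - real(x_test)))
print(err_sim, err_corr)
```

In the domain-adaptation view, the abundant simulated data fix the overall shape of the model, and the rare real data only need to pin down the (hopefully simple) correction.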
Extending numerical engineering: ML, using both experimental and numerical data, can also be used to tackle new goals that are beyond the current state of the art of standard approaches. Inverse problems are such goals, identifying the parameters or the initial conditions of phenomena for which the model is neither differentiable nor amenable to the adjoint state method.
A slightly different kind of inverse problem is that of recovering the ground truth when only noisy data is available. This problem can be formulated as a search for the simplest model explaining the data. The question then becomes to formulate and efficiently exploit such a simplicity criterion.
Another goal can be to model the distribution of given quantiles for some system: the challenge is to exploit the available data to train a generative model aimed at sampling the target quantiles.
Examples tackled in TAU are detailed in Section . Note that the "Cracking the Glass Problem", described in Section is yet another instance of a similar problem.
Data-driven numerical engineering: Finally, ML can also be used to sidestep numerical engineering limitations in terms of scalability, or to build a simulator emulating the resolution of the (unknown) mechanistic model from data, or to revisit the formal background.
When the mechanistic model is known and sufficiently accurate, it can be used to train a deep network on an arbitrary set of (space, time) samples, resulting in a meshless numerical approximation of the model , supporting differentiable programming by construction .
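A minimal meshless sketch in the same spirit, using radial basis functions instead of a deep network for the sake of brevity: the model u'(t) = -u(t), u(0) = 1 is enforced at random collocation points, with no mesh involved, and the resulting approximation is compared with the exact solution exp(-t).

```python
import numpy as np

rng = np.random.default_rng(3)

# Meshless collocation: approximate u with Gaussian radial basis functions
# and enforce the ODE u'(t) = -u(t) with u(0) = 1 at random sample points.
centers = np.linspace(0, 2, 15)
eps = 2.0
phi  = lambda t: np.exp(-(eps * (t[:, None] - centers))**2)
dphi = lambda t: -2 * eps**2 * (t[:, None] - centers) * phi(t)

t_col = rng.uniform(0, 2, size=200)            # arbitrary time samples
A = np.vstack([dphi(t_col) + phi(t_col),       # residual u' + u = 0
               10 * phi(np.array([0.0]))])     # initial condition (weighted)
rhs = np.concatenate([np.zeros(t_col.size), [10.0]])
coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)

u = lambda t: phi(t) @ coef
t_test = np.linspace(0, 2, 20)
err = np.max(np.abs(u(t_test) - np.exp(-t_test)))
print(err)
```

A deep network plays the same role as the basis expansion here, with automatic differentiation replacing the closed-form derivative of the basis functions.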
When no mechanistic model is sufficiently efficient, the model must be identified from the data only. Genetic programming has been used to identify systems of ODEs , through the identification of invariant quantities from data, as well as for the direct identification of control commands of nonlinear complex systems, including some chaotic systems . Another recent approach uses two deep neural networks, one for the state of the system and the other for the equation itself . The critical issues for both approaches include the scalability and the explainability of the resulting models. This line of research will benefit from TAU's unique mixed expertise in Genetic Programming and Deep Learning.
Finally, in the realm of signal processing (SP), the question is whether and how deep networks can be used to revisit mainstream feature extraction based on Fourier decomposition, wavelet and scattering transforms . E. Bartenlian's PhD (started Oct. 2018), co-supervised by M. Sebag and F. Pascal (Centrale-Supélec), focusing on musical audio-to-score translation , inspects the effects of supervised training, taking advantage of the fact that convolution masks can be initialized and analyzed in terms of frequency.
According to Ali Rahimi's test-of-time award speech at NIPS 2017, current ML algorithms have become a form of alchemy. Competitive testing and empirical breakthroughs have gradually become mandatory for a contribution to be acknowledged; an increasing part of the community adopts trial and error as its main scientific methodology, and theory is lagging behind practice. For some, this style of progress is typical of technological and engineering revolutions; others call for consolidated and well-understood theoretical advances, saving the time wasted trying to build upon hardly reproducible results.
Basically, while practical achievements have often exceeded expectations, caveats exist along three dimensions. Firstly, excellent performance does not imply that the model has captured what was to be learned, as shown by the phenomenon of adversarial examples. Following Ian Goodfellow, some well-performing models might be compared to Clever Hans, the horse that was able to solve mathematical exercises using non-verbal cues from its teacher ; it is the purpose of Pillar I to alleviate the Clever Hans trap (Section ).
Secondly, some major advances, e.g. related to the celebrated adversarial learning , , establish proofs of concept more than a sound methodology, with reproducibility limited by i) the computational power required for training (often beyond the reach of academic labs); ii) numerical instabilities (witnessed by the random seeds found in the code); iii) insufficiently documented experimental settings. What works, why and when is still a matter of speculation, although a better understanding of the limitations of the current state of the art is acknowledged to be a priority. Quoting Ali Rahimi again, simple experiments, simple theorems are the building blocks that help us understand more complicated systems. Along this line, toy examples have been proposed to demonstrate and understand the convergence defaults of gradient-descent adversarial learning.
Thirdly, and most importantly, the reported achievements rely on carefully tuned learning architectures and hyper-parameters. The sensitivity of the results to the selection and calibration of algorithms has been identified since the late 80s as a key ML bottleneck, giving rise to the field of automatic algorithm selection and calibration, referred to as AutoML.
Tau aims to contribute to the evolution of ML toward a more mature stage along three dimensions. In the short term, the research done in AutoML directly tackles the sensitivity of results to the selection and calibration of algorithms.
Participants: Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhD: Guillaume Doquet, Zhengying Liu, Herilalaina Rakotoarison, Lisheng Sun
Collaboration: Olivier Bousquet, André Elisseeff (Google Zurich)
The so-called AutoML problem consists in automatically selecting the best-suited learning algorithm, and calibrating its hyper-parameters, for the task at hand.
Several approaches have been used to tackle AutoML.
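The simplest such approach, random search over the hyper-parameter space with cross-validated selection, can be sketched as follows (toy data, closed-form ridge regression; all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: linear target with unit-variance noise.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(scale=1.0, size=200)

def ridge_fit(X, y, lam):
    """Closed-form ridge regression solution."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(lam, k=5):
    """k-fold cross-validated mean squared error for penalty lam."""
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(len(y)), f)
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((X[f] @ w - y[f])**2))
    return float(np.mean(errs))

# Random search: sample the hyper-parameter on a log scale, keep the best.
candidates = 10 ** rng.uniform(-3, 3, size=30)
best_lam = min(candidates, key=cv_error)
print(best_lam, cv_error(best_lam))
```

More elaborate AutoML strategies (surrogate modelling of the error landscape, meta-learning across datasets) refine this basic search loop rather than replace it.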
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhD: Corentin Tallec, Pierre Wolinski, Léonard Blier
Collaboration: Yann Ollivier (Facebook)
In the 60s, Kolmogorov and Solomonoff provided a well-grounded theory for building (probabilistic) models that best explain the available data , , , that is, the shortest programs able to generate these data. Such programs can then be used to generate further data or to answer specific questions (interpreted as missing values in the data). Deep learning, from this viewpoint, efficiently explores a space of computation graphs, described by its hyperparameters (network structure) and parameters (weights). Network training amounts to optimizing these parameters, namely navigating the space of computational graphs to find a network, as simple as possible, that explains the past observations well.
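Schematically, in this two-part code view, the best model minimizes the total description length of the data:

```latex
\theta^{*} \;=\; \arg\min_{\theta}\;
\underbrace{L(\theta)}_{\text{bits to encode the model}}
\;+\;
\underbrace{\bigl(-\log_2 p(D \mid \theta)\bigr)}_{\text{bits to encode the data given the model}}
```

where the first term penalizes complex computation graphs and the second measures how well the model compresses the observations.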
This vision is at the core of variational auto-encoders , directly optimizing a bound on the Kolmogorov complexity of the dataset. More generally, variational methods provide quantitative criteria to identify superfluous elements (edges, units) in a neural network, which can potentially be used for the structural optimization of the network (Léonard Blier's PhD, started Oct. 2018).
The same principles apply to unsupervised learning, aimed to find the maximum amount of structure hidden in the data, quantified using this information-theoretic criterion.
The known invariances in the data can be exploited to guide the model design (e.g., translation invariance leads to convolutional structures, and LSTMs are shown to enforce invariance to affine time transformations of the data sequence ). Scattering transforms exploit similar principles . A general theory of how to detect unknown invariances in the data, however, is currently lacking.
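The convolution case can be checked numerically in a few lines: a (circular) 1-D convolution commutes with translation, which is precisely the structural prior encoded by convolutional layers.

```python
import numpy as np

rng = np.random.default_rng(5)

# A 1-D circular convolution commutes with translation: shifting the
# input then convolving equals convolving then shifting the output.
x = rng.normal(size=32)
k = rng.normal(size=5)

def circ_conv(x, k):
    n = len(x)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(len(k)))
                     for i in range(n)])

shift = lambda v, s: np.roll(v, s)

lhs = circ_conv(shift(x, 3), k)
rhs = shift(circ_conv(x, k), 3)
print(np.max(np.abs(lhs - rhs)))  # zero: the two orders coincide
```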
The view of information theory and Kolmogorov complexity suggests that key program operations (composition, recursivity, use of predefined routines) should intervene when searching for a good computation graph. One possible framework for exploring the space of computation graphs with such operations is that of Genetic Programming. It is interesting to see that evolutionary computation appeared in the last two years among the best candidates to explore the space of deep learning structures , . Other approaches might proceed by combining simple models into more powerful ones, e.g. using “Context Tree Weighting” or switch distributions . Another option is to formulate neural architecture design as a reinforcement learning problem ; the value of the building blocks (predefined routines) might be defined using e.g., Monte-Carlo Tree Search. A key difficulty is the computational cost of retraining neural nets from scratch upon modifying their architecture; an option might be to use neutral initializations to support warm-restart.
Participants: Cyril Furtlehner, Aurélien Decelle, François Landes, Michèle Sebag
PhD: Giancarlo Fissore
Collaboration: Enrico Camporeale (CWI); Jacopo Rocchi (LPTMS Paris Sud), the Simons team: Rahul Chako (post-doc), Andrea Liu (UPenn), David Reichman (Columbia), Giulio Biroli (ENS), Olivier Dauchot (ESPCI), Hufei Han (Symantec).
Methods and criteria from statistical physics have been widely used in ML. In the early days, the capacity of Hopfield networks (associative memories defined by the attractors of an energy function) was investigated using the replica formalism . Restricted Boltzmann machines likewise define a generative model built upon an energy function trained from the data. Along the same lines, Variational Auto-Encoders can be interpreted as systems relating the free energy of the distribution, the information about the data, and the entropy (the degree of ignorance about the micro-states of the system) . A key promise of the statistical physics perspective and the Bayesian view of deep learning is to harness the tremendous growth of model sizes (billions of weights in recent machine translation networks), and make them sustainable through, e.g., posterior drop-out , weight quantization and probabilistic binary networks . Such "informational cooling" of a trained deep network can reduce its size by several orders of magnitude while preserving its performance.
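A crude sketch of the weight-quantization idea (uniform 8-bit quantization of an illustrative random weight matrix, not a trained network): storage shrinks fourfold with respect to 32-bit floats while the layer output barely moves.

```python
import numpy as np

rng = np.random.default_rng(6)

# Uniform 8-bit quantization of a weight matrix: a minimal sketch of
# "informational cooling" -- shrink storage while preserving outputs.
W = rng.normal(scale=0.1, size=(64, 64))
scale = np.max(np.abs(W)) / 127.0
W_q = np.round(W / scale).astype(np.int8)      # stored as 1 byte per weight
W_deq = W_q.astype(np.float64) * scale         # dequantized at inference time

x = rng.normal(size=64)
rel_err = np.linalg.norm(W @ x - W_deq @ x) / np.linalg.norm(W @ x)
print(rel_err)  # small relative deviation of the layer output
```

Principled variants choose the quantization grid (or drop weights entirely) by trading output distortion against code length, in the information-theoretic spirit described above.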
Statistical physics is among the key areas of expertise of Tau, originally represented only by Cyril Furtlehner, later strengthened by Aurélien Decelle's and François Landes' arrivals in 2014 and 2018. On-going studies are conducted along several directions.
Generative models are most often expressed in terms of a Gibbs distribution.
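Schematically, a Gibbs distribution assigns to each configuration x a probability governed by an energy function E; for instance, for a restricted Boltzmann machine with visible units v, hidden units h, biases a, b and weights W:

```latex
p(x) \;=\; \frac{e^{-E(x)}}{Z},
\qquad Z \;=\; \sum_{x} e^{-E(x)},
\qquad E(v,h) \;=\; -\,a^{\top} v \;-\; b^{\top} h \;-\; v^{\top} W h .
```

Training shapes the energy landscape so that low-energy (high-probability) configurations match the data.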
Another direction, explored in TAO/TAU in recent years, is based on the definition and exploitation of self-consistency properties, enforcing principled divide-and-conquer resolutions. In the particular case of the message-passing Affinity Propagation algorithm , for instance, self-consistency imposes the invariance of the solution when handled at different scales, thus making it possible to characterize the critical value of the penalty and other hyper-parameters in closed form (in the case of simple data distributions) or empirically otherwise .
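For reference, the Affinity Propagation messages (responsibilities r and availabilities a, over a similarity matrix s) are updated as follows, where the self-similarity s(k,k), the preference, plays the role of the penalty mentioned above:

```latex
r(i,k) \;\leftarrow\; s(i,k) \;-\; \max_{k' \neq k} \bigl[ a(i,k') + s(i,k') \bigr],
\qquad
a(i,k) \;\leftarrow\; \min\Bigl(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr)\Bigr),\quad i \neq k .
```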
A more recent research direction examines the quantity of information in a (deep) neural net along the random matrix theory framework . It is addressed in Giancarlo Fissore's PhD, and is detailed in Section .
Finally, we note the recent surge in using ML to address fundamental physics problems: from turbulence to high-energy physics and soft matter as well (with amorphous materials at its core) . TAU's dual expertise in Deep Networks and in statistical physics places it in an ideal position to significantly contribute to this domain and shape the methods that will be used by the physics community in the future. François Landes' recent arrival in the team makes TAU a unique place for such interdisciplinary research, thanks to his collaborators from the Simons Collaboration Cracking the Glass Problem (gathering 13 statistical physics teams at the international level). This project is detailed in Section .
Independently, François Landes is actively collaborating with statistical physicists (Alberto Rosso, LPTMS, Univ. Paris-Saclay) and physicists at the frontier with geophysics (Eugenio Lippiello, Second Univ. of Naples) . A possible CNRS grant (80Prime) may fund a shared PhD at the frontier between seismicity and ML (Alberto Rosso, Marc Schoenauer and François Landes).
Participants: Cécile Germain, Isabelle Guyon, Marc Schoenauer, Michèle Sebag
Challenges have been an important drive for Machine Learning research for many years, and TAO members have played important roles in the organization of many such challenges: Michèle Sebag was head of the challenge programme in the Pascal European Network of Excellence (2005-2013); Isabelle Guyon, as mentioned, was the PI of many challenges, ranging from causation challenges , to AutoML . The Higgs challenge , the most attended Kaggle challenge ever, was jointly organized by TAO (C. Germain), LAL-IN2P3 (D. Rousseau and B. Kegl) and I. Guyon (not yet at TAO), in collaboration with CERN and Imperial College.
TAU was also deeply involved in the ChaLearn Looking At People (LAP) challenge series in Computer Vision, in collaboration with the University of Barcelona, including the Job Candidate Screening Coopetition ; the Real Versus Fake Expressed Emotion Challenge (ICCV 2017) ; the Large-scale Continuous Gesture Recognition Challenge (ICCV 2017) ; and the Large-scale Isolated Gesture Recognition Challenge (ICCV 2017) .
Other challenges have been organized in 2019, or are planned for the near future, detailed in Section . In particular, many of them now run on the Codalab platform, managed by Isabelle Guyon and maintained at LRI.
Participants: Philippe Caillou, Isabelle Guyon, Michèle Sebag, Paola Tubaro
Collaboration: Jean-Pierre Nadal (EHESS); Marco Cuturi, Bruno Crépon (ENSAE); Thierry Weil (Mines); Jean-Luc Bazet (RITM)
Computational Social Sciences (CSS) studies social and economic phenomena, ranging from technological innovation to politics, from media to social networks, from human resources to education, from inequalities to health. It combines perspectives from different scientific disciplines, building upon the tradition of computer simulation and modeling of complex social systems on the one hand, and data science on the other hand, fueled by the capacity to collect and analyze massive amounts of digital data.
The emerging field of CSS raises formidable challenges along three dimensions. Firstly, the definition of the research questions, the formulation of hypotheses and the validation of the results require a tight pluridisciplinary interaction and dialogue between researchers from different backgrounds. Secondly, the development of CSS is a touchstone for ethical AI. On the one hand, CSS gains ground in major, data-rich private companies; on the other hand, public researchers around the world are engaging in an effort to use it for the benefit of society as a whole . The key technical difficulties related to data and model biases, and to self-fulfilling prophecies, have been discussed in Section . Thirdly, CSS does not only concern scientists: it is essential that civil society participate in the science of society .
Tao has been involved in CSS for the last five years, and its activities have been strengthened thanks to P. Tubaro's and I. Guyon's expertise, respectively in sociology and economics, and in causal modeling. Details are given in Section .
Participants: Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhD: Victor Berger, Benjamin Donnot, Balthazar Donon, Herilalaina Rakotoarison
Collaboration: Antoine Marot, Patrick Panciatici (RTE), Vincent Renault (Artelys)
Energy Management has been an application domain of choice for Tao since the late 2000s, with main partners SME Artelys (METIS Ilab Inria; ADEME project POST; on-going ADEME project NEXT), RTE (See.4C European challenge; two CIFRE PhDs), and, since Oct. 2019, IFPEN. The goals concern i) optimal planning over several spatio-temporal scales, from investments on the continental Europe/North Africa grid at the decade scale (POST) to daily planning of local or regional power networks (NEXT); ii) monitoring and control of the French grid to prevent power breaks (RTE); iii) improvement of in-house numerical methods using data-intensive learning in all aspects of IFPEN activities (as described in Section ).
Optimal planning over long periods of time amounts to optimal sequential decision making under high uncertainties, ranging from stochastic uncertainties (weather, market prices, demand prediction), handled based on massive data, to non-stochastic uncertainties (e.g., political decisions about the nuclear policy), handled by defining and selecting a tractable number of scenarios. Note that non-anticipativity constraints forbid the use of dynamic programming-related methods; this led to the proposal of the Direct Value Search method at the end of the POST project.
The daily maintenance of power grids requires building approximate predictive models on top of any given network topology. Deep Networks are natural candidates for such modelling, considering the size of the French grid (
Furthermore, predictive models of local grids are based on the estimated consumption of end-customers: Linky meters only provide coarse-grained information for privacy reasons, and very few samples of fine-grained consumption are available (from volunteer customers). A first task is to transfer knowledge from these small data to the whole domain of application. A second task is to directly predict consumption peaks based on the user cluster profiles and their representativeness (see Section ).
Participants: Alessandro Bucci, Guillaume Charpiat, Cécile Germain, Isabelle Guyon, Flora Jay, Marc Schoenauer, Michèle Sebag
PhD and Post-doc: Victor Estrade, Loris Felardos, Adrian Pol, Théophile Sanchez, Wenzhuo Liu
Collaboration: D. Rousseau (LAL), M. Pierini (CERN)
As noted (section ), in domains where both first principle-based models and equations, and empirical or simulated data are available, their combined usage can support more accurate modelling and prediction, and, when appropriate, optimization, control and design. This section describes such applications, with the goal of improving the time-to-design chain through fast interactions between the simulation, optimization, control and design stages. The expected advances regard: i) the quality of the models or simulators (through data assimilation, e.g. coupling first principles and data, or repairing/extending closed-form models); ii) the exploitation of data derived from different distributions and/or related phenomena; and, most interestingly, iii) the task of optimal design and the assessment of the resulting designs.
The proposed approaches are based on generative and adversarial modelling , , extending both the generator and the discriminator modules to take advantage of the domain knowledge.
A first challenge regards the design of the model space, and the architecture used to enforce the known domain properties (symmetries, invariance operators, temporal structures). When appropriate, data from different distributions (e.g. simulated vs real-world data) will be reconciled, for instance taking inspiration from real-valued non-volume preserving transformations in order to preserve the natural interpretation.
Another challenge regards the validation of the models and solutions of the optimal design problems. The more flexible the models, the more intensive the validation must be, as pointed out by Léon Bottou. Along this line, generative models will be used to support the design of "what if" scenarios, and to enhance anomaly detection and monitoring via refined likelihood criteria.
In the application case of dynamical systems such as fluid mechanics, the goal of incorporating machine learning into classical simulators is to speed up the simulations. Many tracks are possible: for instance, one can seek to provide better initialization heuristics to solvers (which make sure that physical constraints are satisfied, and which are responsible for most of the computational complexity of simulations) at each time step; one can also aim at directly predicting the state at
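The initialization idea can be sketched minimally as follows (everything here is invented for the sketch): an iterative solver for one implicit diffusion step converges in fewer iterations when warm-started from the previous state, which plays the role of a cheap stand-in for a learned predictor; an ML model would aim to supply an even better starting point.

```python
import math

def jacobi_solve(b, k, x0, tol=1e-8, max_iter=10_000):
    """Jacobi iterations for the tridiagonal system
    (1+2k) x_i - k (x_{i-1} + x_{i+1}) = b_i, zero boundary values.
    Returns (solution, iteration count)."""
    n = len(b)
    x = list(x0)
    for it in range(max_iter):
        new = [(b[i] + k * ((x[i-1] if i > 0 else 0.0)
                            + (x[i+1] if i < n - 1 else 0.0))) / (1 + 2*k)
               for i in range(n)]
        err = max(abs(a - c) for a, c in zip(new, x))
        x = new
        if err < tol:
            return x, it + 1
    return x, max_iter

# Previous time step of a smooth field; here it serves as the "predicted"
# initial guess that a learned model would otherwise provide.
u_old = [math.sin(math.pi * i / 49) for i in range(50)]
k = 0.1

_, iters_cold = jacobi_solve(u_old, k, [0.0] * 50)   # naive initialization
_, iters_warm = jacobi_solve(u_old, k, u_old)        # warm initialization
print(iters_cold, iters_warm)
```

The gap widens for stiffer systems; the hope is that a trained predictor makes the residual to absorb even smaller while keeping physical constraints satisfied.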
Best Paper Award in Machine Learning at ECML-PKDD 2019 in Würzburg to Guillaume Doquet and Michèle Sebag for their paper Agnostic feature selection
Nacim Belkhir, winner of the ACM GECCO 2019 BBComp single-objective, two-objective and three-objective tracks. The winning program is a slightly modified version of the one Nacim wrote during his PhD in TAU in 2017 , co-supervised by Marc Schoenauer, Johann Dréo and Pierre Savéant (Thalès TRT).
Marc Schoenauer, expert seconding Guillaume Klossa, special advisor to European Vice-President Ansip, for the report Toward European Media Sovereignty, giving strategic advice on the opportunities and challenges related to the use of artificial intelligence, with a focus on the media sector.
Input Output Data Science
Keywords: Open data - Semantic Web - FAIR (Findable, Accessible, Interoperable, and Reusable)
Functional Description: io.datascience (Input Output Data Science) is the instance of the Linked Wiki platform developed specifically in Paris-Saclay University as part of its Center for Data Science.
The goal of io.datascience: to facilitate the sharing and use of scientific data. The technological concept of io.datascience: the exploitation of semantic web advances, and in particular wiki technologies.
io.datascience is both a data sharing platform and a framework for further development. It realizes a practical implementation of FAIR (Findable, Accessible, Interoperable, and Reusable - Wilkinson, M., Nature Scientific Data 2016) principles through a user-centric approach.
Partners: Border Cloud - Paris Saclay Center for Data Science - Université Paris-Sud
Contact: Cécile Germain
Publications: Data acquisition for analytical platforms: Automating scientific workflows and building an open database platform for chemical analysis metadata - A platform for scientific data sharing - TFT, Tests For Triplestores : Certifying the interoperability of RDF database systems using a continuous delivery workflow - Une autocomplétion générique de SPARQL dans un contexte multi-services - Certifying the interoperability of RDF database systems - Transforming Wikipedia into an Ontology-based Information Retrieval Search Engine for Local Experts using a Third-Party Taxonomy - The Grid Observatory 3.0 - Towards reproducible research and open collaborations using semantic technologies
Keywords: Benchmarking - Competition
Functional Description: Challenges in machine learning and data science are competitions running over several weeks or months to resolve problems using provided datasets or simulated environments. Challenges can be thought of as crowdsourcing, benchmarking, and communication tools. They have been used for decades to test and compare competing solutions in machine learning in a fair and controlled way, to eliminate “inventor-evaluator" bias, and to stimulate the scientific community while promoting reproducible science. See our slide presentation.
As of June 2019, Codalab exceeded 40,000 users and 1,000 competitions (300 public), and had over 300 submissions per day. Some of the areas in which Codalab is used include computer vision and medical image analysis, natural language processing, time series prediction, causality, and automatic machine learning. Codalab was selected by the Région Île-de-France to organize its challenges over the next three years.
TAU is going to continue expanding Codalab to accommodate new needs. One of our current focuses is to support the use of challenges for teaching (i.e., include a grading system as part of Codalab) and to support hooking up data simulation engines in the backend of Codalab, to enable Reinforcement Learning challenges and simulate interactions of machines with an environment. For the fifth year, we are using Codalab for student projects: M2 AIC students create mini data science challenges in teams of 6 students, which L2 math and informatics students then solve as part of their mini projects. We are collaborating with RPI (New York, USA) and Université de Grenoble to use this platform as part of a curriculum for medical students. We created a special application called ChaGrade to grade homework using challenges. Our PhD students are involved in co-organizing challenges to expose the research community at large to the topic of their PhD; this helps them formalize a task with rigor and allows them to disseminate their research.
Contact: Isabelle Guyon
Keyword: Information visualization
Functional Description: The goal of Cartolabe is to build a visual map representing the scientific activity of an institution/university/domain from published articles and reports. Using the HAL database, Cartolabe provides the user with a map of the thematics, authors and articles . ML techniques are used for dimensionality reduction and for cluster and topic identification; visualisation techniques are used for a scalable 2D representation of the results.
News Of The Year: This year, Cartolabe was applied to the Grand Débat dataset (3M individual propositions from French citizens, see https://cartolabe.fr/map/debat). The results were used to test both the scaling capabilities of Cartolabe and its flexibility to non-scientific and non-English corpora. We also added sub-map capabilities to display the result of a year/lab/word filtering as an online-generated heatmap containing only the filtered points, to facilitate exploration.
Participants: Philippe Caillou, Jean-Daniel Fekete, Jonas Renault and Anne-Catherine Letournel
Partners: LRI - Laboratoire de Recherche en Informatique - CNRS
Contact: Philippe Caillou
Participants: Philippe Caillou, Isabelle Guyon, Michèle Sebag;
PhDs: Diviyan Kalainathan
Collaboration: David Lopez-Paz (Facebook).
The search for causal models relies on quite a few hardly testable assumptions, e.g. causal sufficiency ; it is a data-hungry task, as it has the identification of independent and conditionally independent pairs of variables at its core. A new approach, investigated through the Cause-Effect Pairs (CEP) Challenge, formulates causality search as a supervised learning problem: joint distributions of pairs of variables (e.g., (Age, Salary)) are labelled with the proper causation relationship between both variables (e.g., Age "causes" Salary), and learning algorithms apt to learn from distributions have been proposed . An edited book has been published , which somewhat summarizes the whole history of Cause-Effect Pairs research. Several chapters of this book have co-authors in TAU: Evaluation methods of cause-effect pairs , Learning Bivariate Functional Causal Models , Discriminant Learning Machines , and Results of the Cause-Effect Pair Challenge .
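The idea of reading the causal direction off the joint distribution can be illustrated by a much-simplified additive-noise heuristic (not one of the methods in the book): regress each variable on the other non-parametrically and prefer the direction that leaves the smaller residual noise.

```python
import random
from statistics import mean

def fit_residual(xs, ys, bins=10):
    """Residual variance of ys around bin-wise means of ys given xs
    (a crude non-parametric regression)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(min(int((x - lo) / width), bins - 1), []).append(y)
    sq = []
    for g in groups.values():
        m = mean(g)
        sq.extend((y - m) ** 2 for y in g)
    return mean(sq)

def infer_direction(xs, ys):
    """Return 'X->Y' if regressing Y on X leaves less residual noise."""
    return "X->Y" if fit_residual(xs, ys) < fit_residual(ys, xs) else "Y->X"

random.seed(0)
# Synthetic pair: X is the uniform cause, Y = X**2 plus small noise.
xs = [random.uniform(-1, 1) for _ in range(2000)]
ys = [x * x + random.gauss(0, 0.02) for x in xs]
print(infer_direction(xs, ys))
```

In the anti-causal direction the regression must average over the two branches x = ±√y, leaving a large residual, so the score is strongly asymmetric; real CEP methods work with far richer distribution features and a learned classifier.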
In D. Kalainathan's PhD and O. Goudet's postdoc, the search for causal models has been tackled in the framework of generative networks , trained to minimize the Maximum Mean Discrepancy loss; the resulting Causal Generative Neural Network improves on the state of the art on the CEP Challenge. CGNN favorably compares with the state of the art w.r.t. usual performance indicators (AUPR, SID) on main causal benchmarks, though with a large computational cost.
In an attempt to scale up causal discovery, we proposed the Structural Agnostic Model approach . Working directly on the observational data, this global approach implements a variant of the popular adversarial game between a discriminator, attempting to distinguish actual samples from fake ones obtained by generating each variable given real values of all others. A sparsity
An innovative usage of causal models is for educational training in sensitive domains, such as medicine, along the following line. Given a causal generative model, artificial data can be generated using a marginal distribution of causes; such data will enable students to test their diagnosis inference (with no misleading spurious correlations in principle), while forbidding to reverse-engineer the artificial data and guess the original data. Some motivating applications for causal modeling are described in section .
Participants: Isabelle Guyon, François Landes, Marc Schoenauer, Michèle Sebag
PhD: Marc Nabhan
Causal modeling is one particular method to tackle explainability, and TAU has been involved in other initiatives toward explainable AI systems. Following the LAP (Looking At People) challenges, Isabelle Guyon and co-organizers have edited a book that presents a snapshot of explainable and interpretable models in the context of computer vision and machine learning. Along the same line, they propose an introduction and a complete survey of the state-of-the-art of the explainability and interpretability mechanisms in the context of first impressions analysis .
The team is also involved in the proposal for the IPL HyAIAI (Hybrid Approaches for Interpretable AI), coordinated by the LACODAM team (Rennes) dedicated to the design of hybrid approaches that combine state of the art numeric models (e.g., deep neural networks) with explainable symbolic models, in order to be able to integrate high level (domain) constraints in ML models, to give model designers information on ill-performing parts of the model, to provide understandable explanations on its results. Kickoff took place in September 2019, and we are still looking for good post-doc candidates.
Note also that the on-going work on the identification of the border of the failure zone in the parameter space of the autonomous vehicle simulator (Section ) also pertains to explainability.
Finally, a completely original approach to DNN explainability might arise from the study of structural glasses (), with a parallel to Graph Neural Networks (GNNs), that could become an excellent non-trivial example for developing explainability protocols.
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhDs: Julien Girard, Marc Nabhan, Nizam Makdoud
Collaboration: Zakarian Chihani (CEA); Hiba Hage and Yves Tourbier (Renault); Johanne Cohen (LRI-GALAC) and Christophe Labreuche (Thalès)
As said (Section ), Tau is considering two directions of research related to the certification of ML systems. The first direction, related to formal approaches, is the topic of Julien Girard's PhD (see also Section ). Conversely, the second axis aims to increase the robustness of systems that can only be experimentally validated. Two paths are investigated in the team: assessing the coverage of the datasets (here, those used to train an autonomous vehicle controller), the topic of Marc Nabhan's CIFRE PhD with Renault; and detecting flaws in the system by reinforcement learning, as done in Nizam Makdoud's CIFRE PhD with Thalès THERESIS.
Formal validation of Neural Networks
The topic of provable deep neural network robustness has raised considerable interest in recent years. Most research in the literature has focused on adversarial robustness, which studies the robustness of perceptive models in the neighbourhood of particular samples; other works have proved global properties of smaller neural networks. Yet formally verifying perception remains uncharted, notably due to the lack of relevant properties to verify, as the distribution of possible inputs cannot be formally specified. In Julien Girard-Satabin's PhD thesis, we propose to take advantage of the simulators often used either to train machine learning models or to check them with statistical tests, a growing trend in industry. Our formulation allows us to formally express and verify safety properties on perception units, covering all cases that could ever be generated by the simulator, unlike statistical tests, which only cover the examples seen. Along with this theoretical formulation, we provide a tool to translate deep learning models into standard logical formulae. As a proof of concept, we train a toy example mimicking an autonomous car perceptive unit, and we formally verify that it will never fail to capture the relevant information in the provided inputs.
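As a rough illustration of what certifying a property "for all inputs" means (the thesis tool relies on genuine logical encodings; the sketch below uses only a simple interval relaxation on a tiny hand-written network), one can propagate an input box through the layers and bound the output everywhere in the box:

```python
def interval_affine(lo, hi, W, b):
    """Propagate an axis-aligned input box through x -> W x + b."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, a, c in zip(row, lo, hi):
            l += w * (a if w >= 0 else c)   # worst case for the lower bound
            h += w * (c if w >= 0 else a)   # worst case for the upper bound
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def interval_relu(lo, hi):
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

# Tiny hand-written network: 2 inputs -> 2 hidden ReLU units -> 1 output.
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [1.0]

def output_bounds(lo, hi):
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = interval_relu(lo, hi)
    return interval_affine(lo, hi, W2, b2)

# Property: for every input in the box [0,1]^2, the output stays positive.
lo, hi = output_bounds([0.0, 0.0], [1.0, 1.0])
print(lo, hi)
```

Since the computed lower bound is already positive, the property holds for the whole box, not just for tested samples; SMT-based encodings trade this relaxation's looseness for completeness.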
Experimental validation of Autonomous Vehicle Command
Statistical guarantees (e.g., less than
TAU is collaborating with Renault on step ii) within Marc Nabhan's CIFRE PhD (defense expected in Sept. 2020). The current target scenario is the insertion of a car on a motorway, the "drosophila" of autonomous car scenarios, and the goal is the identification of the conditions of failure of the autonomous car controller. Only simulations are considered here, with one scenario being defined as a parameter setting of the in-house simulator SCANeR. The goal is to detect as many failures as possible while running as few simulations as possible, and to identify the borders of the failure zone using as simple a description as possible, thus allowing engineers to understand the reasons for the flaws. A first paper was published proposing several approaches for the identification of failures. Ongoing work is concerned with a precise yet simple definition of the border of the failure zone.
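The boundary-identification idea can be sketched as follows, with a stub standing in for the actual simulator and an invented ground-truth failure rule: label sampled parameter settings by running the (stub) simulator, then fit the simplest explainable rule, a single threshold on an engineer-suggested feature.

```python
import random

def simulator_fails(speed, gap):
    """Stub standing in for a real driving simulator: the controller is
    assumed to fail when the insertion gap is too small for the speed."""
    return gap < 0.04 * speed   # hidden ground-truth failure boundary

random.seed(1)
runs = [(random.uniform(10, 40), random.uniform(0.0, 2.0)) for _ in range(500)]
labels = [simulator_fails(s, g) for s, g in runs]

def best_threshold(values, labels):
    """Simplest explainable rule: one threshold on one feature, chosen to
    minimize misclassified runs (predict failure when value < threshold)."""
    def errors(t):
        return sum((v < t) != y for v, y in zip(values, labels))
    return min(sorted(values), key=errors)

ratio = [g / s for s, g in runs]    # engineer-suggested feature: gap/speed
t = best_threshold(ratio, labels)
print("failure when gap/speed <", round(t, 3))
```

With enough labelled runs the recovered threshold sits next to the hidden boundary (0.04 here), and the rule "failure when gap/speed < t" is directly readable by engineers; the real problem additionally minimizes the number of simulations needed to locate that border.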
Reinforcement Learning from Advice In the context of his CIFRE PhD with Thalès, Nizam Makdoud tests (in simulation) physical security systems, using reinforcement learning to learn the best sequence of actions that will break through the system. This led him to propose an original approach called LEarning from Advice (LEA), which uses knowledge from several policies learned on different tasks. Whereas learning by imitation uses the actions of the known policy, the proposed method uses the different Q-functions of the known policies. The main advantage of this strategy is its robustness to poor advice, as the policy then reverts to standard DDPG . The results (submitted) demonstrate that LEA is able to learn faster than DDPG when given good-enough policies, and only slightly slower when given poor advice.
Learning Multi-Criteria Decision Aids (Hierarchical Choquet models) In collaboration with Johanne Cohen (LRI-GALAC) and Christophe Labreuche (Thalès), the representation and data-driven elicitation of hierarchical Choquet models have been tackled. A specific neural architecture, enforcing by design the model constraints (monotonicity, additivity), and supporting the end-to-end training of the Multi-Criteria Decision aid, has been proposed in Roman Bresson's PhD. Under mild assumptions, an identifiability result (existence and uniqueness of the sought model in the neural space) is obtained. The approach is empirically validated and successfully compared to the state of the art.
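Enforcing a constraint such as monotonicity "by design" can be illustrated with a generic device (the actual hierarchical Choquet architecture is more structured): store unconstrained weights and square them before use, so every effective weight is non-negative and the network is monotone increasing in each input by construction.

```python
import math

# Unconstrained raw weights (invented values); squaring them at use time
# guarantees non-negative effective weights, hence monotonicity, because
# tanh is an increasing activation.
RAW_W1 = [[0.5, -0.8], [1.2, 0.3]]
RAW_W2 = [0.9, -0.7]

def monotone_net(x):
    """Tiny 2-input network, monotone increasing in each input by design."""
    hidden = [math.tanh(sum((w * w) * xi for w, xi in zip(row, x)))
              for row in RAW_W1]
    return sum((w * w) * h for w, h in zip(RAW_W2, hidden))

a = monotone_net([0.2, 0.5])
b = monotone_net([0.3, 0.5])   # increase the first input only
c = monotone_net([0.2, 0.9])   # increase the second input only
print(a, b, c)
```

Gradient descent acts on the raw weights, so no projection step is needed during training; the constraint can never be violated, which is the same philosophy as the architecture proposed in the thesis.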
Participants: Guillaume Charpiat, Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhDs: Léonard Blier, Guillaume Doquet, Zhengying Liu, Herilalaina Rakotoarison, Lisheng Sun, Pierre Wolinski
Collaboration: Vincent Renault (SME Artelys); Yann Ollivier (Facebook)
Auto-
As mentioned in Section , a popular approach for algorithm selection is collaborative filtering. In Lisheng Sun's PhD , active learning was used on top of the CofiRank algorithm for matrix factorization , improving the results and the time to solution of the recommendation algorithm. Furthermore, most real-world domains evolve with time, and an important issue in real-world applications is that of life-long learning, as static models can rapidly become obsolete. Another contribution in Lisheng's PhD is an extension of AutoSklearn that detects concept drifts and corrects the current model accordingly.
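The collaborative-filtering view of algorithm selection can be sketched with a toy performance matrix and plain SGD matrix factorization (CofiRank itself optimizes a ranking loss, and all numbers below are invented): observed algorithm-on-dataset scores are factorized, and the missing entry is predicted from the learned factors.

```python
import random

# Hypothetical performance matrix: rows = algorithms, cols = datasets,
# None = not yet evaluated (the entry we want to predict).
PERF = [
    [0.90, 0.80, 0.85],
    [0.60, 0.55, None],
    [0.88, 0.82, 0.86],
]

def factorize(M, rank=2, steps=5000, lr=0.05, seed=0):
    """Plain SGD matrix factorization, skipping missing entries."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    U = [[rng.uniform(0.1, 0.9) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0.1, 0.9) for _ in range(rank)] for _ in range(m)]
    obs = [(i, j) for i in range(n) for j in range(m) if M[i][j] is not None]
    for _ in range(steps):
        i, j = rng.choice(obs)
        err = M[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))
        for k in range(rank):
            u, v = U[i][k], V[j][k]
            U[i][k] += lr * err * v
            V[j][k] += lr * err * u
    return U, V

U, V = factorize(PERF)
missing = sum(U[1][k] * V[2][k] for k in range(2))
print("predicted score of algorithm 1 on dataset 2:", round(missing, 3))
```

An active-learning layer, as in the PhD work, would additionally choose which entry to evaluate next so that such predictions improve fastest.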
An original approach to Auto-
Auto-
A key building block in Auto-
Several works have focused on the adjustment of specific hyper-parameters for neural nets. Pierre Wolinski's PhD (to be defended in January 2020, publication submitted) studies three such hyper-parameters: i) network width (number of neurons in each layer); ii) regularizer importance in the objective function to minimize (the factor balancing data term and regularizer); and iii) learning rate. The network width is adjusted during training thanks to a criterion quantifying each neuron's importance, naturally leading to a sparsification effect (as with L1-norm minimization); this study actually extends not only to layer widths but also to layer connectivity (e.g., in modern networks where each layer may be connected to any other layer with 'skip' connections). The regularizer weight is formulated as a probabilistic prior from a Bayesian perspective, which leads to a particular value that it should take in order for the network to satisfy some property. Regarding the learning rate, Pierre Wolinski and Léonard Blier proposed to attach fixed learning rates to each neuron (picked randomly) and to calibrate this learning-rate distribution in such a way that neurons are sequentially active, learning in an optimally agile manner during a first learning phase and being stable in later phases. This removes the need to tune the learning rate.
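The sparsification effect of width adjustment can be illustrated with a deliberately naive importance score (the sum of absolute outgoing weights; the criterion studied in the thesis is more principled): neurons whose score falls below the chosen budget are dropped.

```python
# Outgoing weight vectors of 5 hidden neurons (invented values).
HIDDEN_OUT = [
    [0.9, -1.1],
    [0.01, 0.02],
    [-0.7, 0.6],
    [0.001, -0.003],
    [1.5, 0.2],
]

def prune(weights, keep_ratio=0.6):
    """Keep the most important neurons; importance = L1 norm of the
    outgoing weights (a naive stand-in for a principled criterion)."""
    importance = [sum(abs(w) for w in ws) for ws in weights]
    k = max(1, int(len(weights) * keep_ratio))
    kept = sorted(range(len(weights)), key=lambda i: -importance[i])[:k]
    return sorted(kept)

print(prune(HIDDEN_OUT))   # → [0, 2, 4]
```

Neurons 1 and 3 contribute almost nothing to the next layer and are removed; in the thesis this selection happens continuously during training rather than as a post-hoc step.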
A last direction of investigation concerns the design of challenges, that contribute to the collective advance of research in the Auto-
Participants: Guillaume Charpiat, Marc Schoenauer, Michèle Sebag
PhDs: Léonard Blier, Corentin Tallec
Collaboration: Yann Ollivier (Facebook AI Research, Paris), the Altschuler and Wu lab. (UCSF, USA), Y. Tarabalka (Inria Titane)
Although a comprehensive mathematical theory of deep learning is yet to come, theoretical insights from information theory or from dynamical systems can deliver principled improvements to deep learning and/or explain the empirical successes of some architectures compared to others.
In his PhD , Corentin Tallec presents several contributions along these lines:
In , it is shown that the LSTM structure can be derived from axiomatic principles enforcing the robustness of the learned model to temporal deformations (warpings) in the data. The complex LSTM architecture, introduced in the 90's and currently the dominant architecture for modeling temporal sequences (such as text) in deep learning, necessarily arises if one wants the model to be able to handle time warpings (such as arbitrary accelerations or decelerations in the signal), and its complex equations can be derived axiomatically.
In (oral presentation at ICML 2018), the issue of mode dropping in adversarial generative models is tackled using information theory. The adversary's (discriminator's) task is set to predict the proportion of true and fake images in a set of images, via an information-theoretic criterion, thus working at the level of the overall distribution of images. The discriminator is thereby better able to detect statistical imbalances between the modes created by the generator, reducing the mode-dropping phenomenon. The proposed architecture, inspired from equivariant approaches, is provably able to detect all permutation-invariant statistics in a set of images.
In , the problem of recurrent network training is tackled via the theory of dynamical systems by proposing a simple fully online solution avoiding the "time rewind" step, based on real-time, noisy but unbiased approximations of model gradients, which can be implemented easily in a black-box fashion on top of any recurrent model, and which is well-justified mathematically. The price to pay is an increase of variance.
In , we identify sensitivity to time discretization of Deep RL in near continuous-time environments as a critical factor. Empirically, we find that Q-learning-based approaches collapse with small time steps. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performances over a wide range of time discretizations, and confirm this robustness empirically.
Several other directions have been investigated: In , we introduce a multi-domain adversarial learning algorithm in the semi-supervised setting. We extend the single source H-divergence theory for domain adaptation to the case of multiple domains, and obtain bounds on the average- and worst-domain risk in multi-domain learning. This leads to a new loss to accommodate semi-supervised multi-domain learning and domain adaptation. We obtain state-of-the-art results on two standard image benchmarks, and propose as a new benchmark a novel bioimage dataset, CELL, in the domain of automated microscopy data, where cultured cells are imaged after being exposed to known and unknown chemical perturbations, and in which each dataset displays significant experimental bias.
Another direction regards the topology induced by a trained neural net, i.e., how similar two samples are from the NN's perspective. The definition proposed in relies on varying the NN parameters and examining whether the impacts of this variation on both samples are aligned. The mathematical properties of this similarity measure are investigated and the similarity is shown to define a kernel on the input space. This kernel can be used to tractably estimate the sample density, and it leads to new directions for the statistical learning analysis of NN, e.g. in terms of additional loss (requiring that similar examples have a similar latent representation in the above sense) or in terms of resistance to noise. Specifically, a multimodal image registration task is presented where almost perfect accuracy is reached, despite a high label noise (see Section ). Such an impressive self-denoising phenomenon can be explained and quantified as a noise averaging effect over the labels of similar examples.
Participants: Cyril Furtlehner, Aurélien Decelle, François Landes
PhDs: Giancarlo Fissore
Collaboration: Jacopo Rocchi (LPTMS Paris Sud); the Simons team: Rahul Chako (post-doc), Andrea Liu (UPenn), David Reichman (Columbia), Giulio Biroli (ENS), Olivier Dauchot (ESPCI); Clément Vignac (EPFL); Yufei Han (Symantec).
The information content of a trained restricted Boltzmann machine (RBM) can be analyzed by comparing the singular values/vectors of its weight matrix, referred to as modes, to those of a random RBM (typically following a Marchenko-Pastur distribution) . The analysis of a single learning trajectory is replaced by the analysis of the distribution of a well-chosen ensemble of models. In G. Fissore's PhD, the learning trajectory of an RBM is shown to start with a linear phase recovering the dominant modes of the data, followed by a non-linear regime where the interaction among the modes is characterized . Although simplifying assumptions are required for a closed-form mean-field analysis of the above distribution, it nevertheless delivers some simple heuristics to speed up the learning convergence and to simplify the models.
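The random-matrix baseline in this comparison can be sketched numerically (matrix sizes are arbitrary): for an i.i.d. Gaussian weight matrix, the largest singular value should sit at the Marchenko-Pastur upper edge, so modes that emerge above that edge during training are the ones carrying information about the data.

```python
import math
import random

def top_singular_value(A, iters=100, seed=0):
    """Largest singular value of A via power iteration on A^T A."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    v = [rng.gauss(0, 1) for _ in range(m)]
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
        w = [sum(A[i][j] * u[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(m)) for i in range(n)]
    return math.sqrt(sum(x * x for x in u))

random.seed(42)
n, m = 200, 100   # visible x hidden sizes of a hypothetical random RBM
A = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

# Marchenko-Pastur upper edge for i.i.d. entries of unit variance:
mp_edge = math.sqrt(n) * (1 + math.sqrt(m / n))
sv = top_singular_value(A)
print(round(sv, 1), "vs MP edge", round(mp_edge, 1))
```

The empirical top singular value lands close to the theoretical edge; for a trained RBM, singular values separating from this bulk signal learned modes.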
This analysis will be extended along two directions: handling missing data ; and considering exactly solvable RBMs (non-linear RBMs for which the contrastive divergence can be computed in closed form, e.g. using a spherical model) . W.r.t. missing data, state-of-the-art results have been obtained on semi-supervised tasks in the context of Internet-of-Things security, considering a high rate of missing inputs and labels. On the theoretical side, exact generic RBM learning trajectories have been characterized, showing intriguing connections with a Bose-Einstein condensation mechanism associated with information storing. Our collaboration with J. Rocchi (LPTMS, Univ. Paris Sud) aims to characterize the landscape of RBMs learned from different initial conditions, and to relate this landscape to the number of parameters (hidden nodes) of the system.
An emerging research topic concerns the interpretation of deep learning by means of Gaussian processes and the associated neural tangent kernel, in the thermodynamic limit obtained by letting layer widths go to infinity . Several questions are to be investigated on the basis of this theoretical tool, in particular how it translates to the RBM or DBM setting, and whether a double-descent behaviour is also to be expected for generative models.
As mentioned earlier, the use of ML to address fundamental physics problems is growing quickly. This leads to some methodological mistakes from newcomers, which were investigated by Rémi Perrier (2-month internship). One example is the domain of glasses (how the structure of glasses is related to their dynamics), which is one of the major problems in modern theoretical physics. The idea is to let ML models automatically find the hidden structures (features) that control the flowing or non-flowing state of matter, discriminating liquid from solid states. These models can then help identify "computational order parameters" that would advance the understanding of physical phenomena , on the one hand, and support the development of more complex models, on the other hand. More generally, attacking the problem of amorphous condensed matter with novel Graph Neural Network (GNN) architectures is a very promising lead, regardless of the precise quantity one may want to predict. Currently, GNNs are engineered to deal with molecular systems and/or crystals, but not with amorphous matter. This second axis is being pursued in collaboration with Clément Vignac (PhD student at EPFL), using GNNs. Furthermore, this problem is new to the ML community, and it provides an original non-trivial example for engineering, testing and benchmarking explainability protocols.
Computational Social Sciences (CSS) is making significant progress in the study of social and economic phenomena thanks to the combination of social science theories and new insights from data science. While the simultaneous advent of massive data and unprecedented computational power has opened exciting new avenues, it has also raised new questions and challenges.
Several studies are being conducted in TAU, about labor (labor markets, platform "micro-work", quality of life and economic performance), about nutrition (health, food, and socio-demographic issues), around Cartolabe, a platform for scientific information system and visual querying and around GAMA, a multi-agent based simulation platform.
Participants: Philippe Caillou, Isabelle Guyon, Michèle Sebag, Paola Tubaro
PhDs: Diviyan Kalainathan, Guillaume Bied, Armand Lacombe
Post-Docs: Saumya Jetley
Engineers: Raphael Jaiswal, Victor Alfonso Naya
Collaboration: Jean-Pierre Nadal (EHESS); Marco Cuturi, Bruno Crépon (ENSAE); Antonio Casilli, Ulrich Laitenberger (Telecom Paris); Odile Chagny (IRES); Alessandro Delfanti (University of Toronto)
A first area of activity of TAU in Computational Social Sciences is the study of labor, from the functioning of the job market, to the rise of new, atypical forms of work in the networked society of internet platforms, and the quality of life at work.
Job markets Two projects deal with the domain of job markets and machine learning. The DataIA project Vadore, in collaboration with ENSAE and Pôle Emploi, has two goals. The first is to improve the recommendation of jobs to applicants (and of applicants to job offers). The main originalities of this project are: i) to use both machine learning and optimal transport to improve the recommendation, by learning a matching function from past hirings and then applying an optimal-transport-like bias to tackle market congestion (e.g., to avoid assigning many applicants to the same job offer); ii) to use randomized tests on micro-markets (A/B testing), in collaboration with Pôle Emploi, to assess the global impact of the algorithms.
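The congestion-avoiding, optimal-transport-like step can be sketched with entropy-regularized (Sinkhorn) matching on an invented score matrix: even though every applicant scores highest on the same offer, the balanced plan spreads assignments across offers instead of congesting one.

```python
import math

def sinkhorn(scores, iters=200, reg=0.1):
    """Entropy-regularized balanced matching between applicants (rows)
    and job offers (columns), both given uniform capacity."""
    n, m = len(scores), len(scores[0])
    K = [[math.exp(s / reg) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):  # alternately rescale rows and columns
        u = [1.0 / (n * sum(K[i][j] * v[j] for j in range(m)))
             for i in range(n)]
        v = [1.0 / (m * sum(K[i][j] * u[i] for i in range(n)))
             for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical match scores: every applicant prefers offer 0.
scores = [[0.9, 0.4, 0.3],
          [0.8, 0.5, 0.2],
          [0.9, 0.3, 0.6]]
plan = sinkhorn(scores)
for row in plan:
    print([round(p, 3) for p in row])
```

Each row and column of the resulting plan sums to 1/3, so no offer is flooded; a greedy argmax recommendation would instead send all three applicants to offer 0.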
The JobAgile project (BPI-PIA contract, coll. EHESS, Dataiku and Qapa) deals with low-salary interim job recommendations. A main difference with the Vadore project lies in the high reactivity of the Qapa and Dataiku startups: i) to actually implement A/B testing; ii) to explore related functionalities, typically the recommendation of training programs; iii) to propose a visual querying of the job market, using the Cartolabe framework (below).
The platform economy and digital labor
Another topic concerns the digital economy and the transformations of labor that accompany it. One part of the platform economy carries promises of social, not only techno-economic, innovation. While enthusiasm for a new "sharing economy" or "collaborative economy" has progressively faded away, values of decentralization, autonomy, and flatter coordination are still commonly associated with platforms. A conference paper by P. Tubaro studies how events constitute places where actors of the platform economy negotiate values and collectively drive forms of social change .
The platform economy and its effects on labor are also linked to the current developments of AI . In collaboration with A.A. Casilli (Telecom ParisTech), P. Tubaro has received funding from the Union Force Ouvrière, from France Stratégie (a Prime Minister's service), and from MSH Paris-Saclay, to map "micro-work" in France (DiPLab project). The term micro-work refers to small, data-related tasks that are performed online against low remunerations, such as tagging objects in images, transcribing bits of text, and recording utterances aloud. Specialized platforms such as Amazon Mechanical Turk, Clickworker and Microworkers recruit online providers to execute these tasks for their clients, mostly for data-intensive production processes. In addition to poor working conditions and low pay, micro-work raises issues in terms of privacy and data protection, insofar as outside providers are entrusted with data that may include personal information .
The results of the DiPLab study were published in a report that attracted significant media attention , and were presented as part of a large event on micro-work at the headquarters of France Stratégie in June 2019. Two articles relying on DiPLab results, one of which was first published as a working paper , , are now under review.
A joint franco-Canadian grant obtained by P. Tubaro and A. Delfanti (University of Toronto) enabled the creation of an "International Network on Digital Labor", aiming to bring together scholars interested in various forms of digital platform labor. Inauguration of the network involved the organization of two workshops, one in Paris (June 2019) and the other in Toronto (October 2019).
Further research on how digital platforms transform labor practices and affect the very definition of professions is being undertaken by P. Tubaro as part of a two-year (2018-2020) grant from DARES (French Ministry of Labor), in collaboration with O. Chagny (IRES, a union-funded think-tank) and A.A. Casilli (Telecom ParisTech).
A newly-obtained grant from ANR (with A.A. Casilli and U. Laitenberger, Telecom ParisTech) will enable P. Tubaro to further explore the global production networks that link AI developers and producers to data-related work across national boundaries, following outsourcing chains that extend from France to French-speaking African countries, and from Spain and the USA to parts of Latin America. This new project, entitled "The HUman Supply cHain behind smart technologies" (HUSH), will start in January 2020.
Participants: Philippe Caillou, Michèle Sebag, Paola Tubaro
Post-doc: Ksenia Gasnikova
Collaboration: Louis-Georges Soler, Olivier Allais (INRA)
Another area of activity concerns the relationships between eating practices, socio-demographic features and health.
The Nutriperso project (IRS Univ. Paris-Saclay, coll. INRA, CEA, CNRS, INSERM, Telecom ParisTech and Univ. Paris-Sud) aims to: i) determine the impact of food items on health (e.g., related to type-2 diabetes); ii) identify alternative food items, admissible in terms of taste and budget, and better in terms of health; iii) emit personalized food recommendations. One project motivation is the fact that general recommendations (e.g., eat 5 fruits and vegetables per day) are hardly effective on populations at risk. Based on the Kantar database, reporting the food habits of 20,000 households over 20 years, our challenge is to analyze the food purchases at an unprecedented fine-grained scale (at the barcode level), and to investigate the relationship between diets, socio-demographic features, and body mass index (BMI). The challenge also regards the direction of causality; while some diets are strongly correlated to high BMI, the question is to determine whether, e.g., sugar-free sodas are a cause of obesity, or a consequence thereof, or both a cause and a consequence. A main difficulty is the lack of control populations to assess a diet's impact. Such a control population could be approximated in the case of the organic diet, showing a statistically significant impact of this diet on the BMI distribution. The question of finding confounders (e.g., based on wealth or education) or "backdoor" variables is under study.
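The role of such backdoor variables can be illustrated on synthetic data. The sketch below (where the confounder, treatment and outcome are all invented for illustration: "wealth" drives both diet-soda purchase and BMI, with zero true causal effect) contrasts the naive correlational estimate with the backdoor-adjusted one:

```python
import numpy as np

def backdoor_adjusted_effect(y, treatment, confounder):
    """Backdoor adjustment for a discrete confounder Z:
    sum_z P(z) * (E[Y|T=1,z] - E[Y|T=0,z])."""
    effect = 0.0
    for z in np.unique(confounder):
        mask = confounder == z
        pz = mask.mean()
        y1 = y[mask & (treatment == 1)].mean()
        y0 = y[mask & (treatment == 0)].mean()
        effect += pz * (y1 - y0)
    return effect

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 10000)                         # confounder (e.g. wealth)
t = (rng.random(10000) < 0.2 + 0.6 * z).astype(int)   # treatment depends on z
y = 25 + 3 * z + rng.normal(0, 1, 10000)              # outcome depends on z only

naive = y[t == 1].mean() - y[t == 0].mean()           # biased by the confounder
adjusted = backdoor_adjusted_effect(y, t, z)          # close to the true effect, 0
```

The naive difference of means is large and spurious, while stratifying on the confounder recovers the (null) causal effect; the practical difficulty in Nutriperso is precisely that the relevant confounders are not known in advance.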
Participants: Philippe Caillou, Michèle Sebag
Engineers: Anne-Catherine Letournel, Jonas Renault
Collaboration: Jean-Daniel Fekete (AVIZ, Inria Saclay)
A third area of activity concerns the 2D visualisation and querying of a corpus of documents. Its initial motivation was related to scientific organizations, institutes or universities, using their scientific production (set of articles, authors, titles, abstracts) as corpus. The Cartolabe project started as an Inria ADT (coll. Tao and AVIZ, 2015-2017). It received a grant from CNRS (coll. Tau, AVIZ and HCC-LRI, 2018-2019). Further extensions, as an open-source platform, are under submission at the time of writing.
The originality of the approach is to rely on the content of the documents (as opposed to, e.g., the graph of co-authoring and citations). This specificity made it possible to extend Cartolabe to various corpora, such as Wikipedia, the Bibliothèque nationale de France, or the Software Heritage. Cartolabe was also applied in 2019 to the Grand Débat dataset: to support the interactive exploration of the 3 million propositions, and to check the consistency of the official results of the Grand Débat with the data.
Among its intended functionalities are: the visual assessment of a domain and its structure (who is expert in a scientific domain, how related are the domains); the coverage of an institute's expertise relative to the general expertise; the evolution of domains along time (identification of rising topics). A round of interviews with beta-user scientists has been under way since late 2019. Cartolabe usage raises questions at the crossroads of human-centered computing, data visualization and machine learning: i) how to deal with stressed items (whose 2D projection poorly reflects their similarities in the high-dimensional document space); ii) how to customize the similarity and exploit the users' feedback about relevant neighborhoods.
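The content-based mapping idea can be sketched in a few lines: documents are embedded via TF-IDF and projected to 2D, so that nearby points have similar content. This is a minimal toy (TF-IDF plus rank-2 SVD on an invented four-document corpus), not the actual Cartolabe implementation:

```python
import numpy as np

def tfidf(counts):
    """Term frequency - inverse document frequency, for a docs x terms count matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1
    return tf * idf

def project_2d(X):
    """Rank-2 SVD of the centered tf-idf matrix: each document becomes a 2D point."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :2] * s[:2]

docs = ["deep learning networks", "neural networks training",
        "power grid topology", "grid topology optimization"]
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], float)
coords = project_2d(tfidf(counts))   # 4 documents -> 4 points on the 2D map
```

On this toy corpus the two thematic clusters (neural networks vs. power grids) end up well separated on the map; the "stressed items" issue mentioned above arises when the rank-2 projection cannot preserve all high-dimensional similarities.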
Participants: Philippe Caillou
Collaboration: Patrick Taillandier (INRA), Alexis Drogoul and Nicolas Marilleau (IRD), Arnaud Grignard (MediaLab, MIT), Benoit Gaudou (Université Toulouse 1)
Since 2008, P. Caillou contributes to the development of the GAMA platform, a multi-agent based simulation framework. Its evolution is driven by the research projects using it, which makes it very well suited for social sciences studies and simulations.
The focus of the development team in 2019 was on the stability of the platform and on the documentation to provide a stable and well documented framework to the users.
Participants: Isabelle Guyon, Marc Schoenauer
PhDs: Benjamin Donnot, Balthazar Donon
Collaboration: Antoine Marot, Patrick Panciatici (RTE), Olivier Teytaud (Facebook)
Benjamin Donnot's CIFRE PhD with RTE dealt with power grid safety: the goal is to assess in real time the so-called "(n-1) safety" (see Section ) of possible recovery actions modifying the topology of the grid after some problem occurred somewhere on the grid. However, the HADES simulator, which computes the power flows in the whole network, is far too slow to simulate in real time all n-1 possible failures of a tentative topology. A simplified simulator is also available, but its accuracy is too poor to give good results. Deep surrogate models can be trained off-line for a given topology, based on the results of the slow simulator, with high enough accuracy, but training as many models as there are possible failures (i.e., n-1) obviously does not scale up: the topology of the grid must be an input of the learned model, allowing to instantly compute the power flows, at least for grid configurations close to the usual running state of the grid. A standard approach is the one-hot encoding of the topology, where n additional boolean inputs are added to the neural network, encoding the presence or absence of each line. Nevertheless, this approach generalizes poorly to topologies outside the distribution of those used for training.
An original "guided dropout" approach was first proposed , in which the topology directly acts on the connections of the deep network: a missing line suppresses some connections. Whereas the standard dropout method disconnect random connections for every batch, in order to improve the generalization capacity of the network, the "guided dropout" method removes some connections based on the actual topology of the network. This approach is experimentally validated against the one-hot encoding on small subsets of the French grid (up to 308 lines). Interestingly, and rather surprisingly, even though only examples with a single disconnected line are used in the training set, the learned model is able of some additive generalization, and predictions are also accurate enough in the case 2 lines are disconnected. The guided dropout approach was later robustified by learning to rapidly rank higher order contingencies including all pairs of disconnected lines, in order to prioritize the cases where the slow simulator is run: Another neural network is trained to rank all (n-1) and (n-2) contingencies in decreasing order of presumed severity.
The guided dropout approach has been further extended and generalized with the LEAP (Latent Encoding of Atypical Perturbation) architecture , , by crossing out connections between the encoder and the decoder parts of the ResNet architecture. LEAP then performs transfer learning over spaces of distributions of topology perturbations, allowing to better handle more complex actions on the topology, going beyond (n-1) and (n-2) perturbations by also including node splitting, a common action in the real world. The LEAP approach was theoretically studied in the case of additive perturbations, and experimentally validated on an actual sub-grid of the French grid with 46 consumption nodes, 122 production nodes, 387 lines and 192 substations.
LEAP is also the first part of Balthazar Donon's on-going PhD, which currently develops a completely different approach to approximating the power flows on a grid, namely Graph Neural Networks (GNNs). From a power grid perspective, GNNs can be viewed as including the topology in the very structure of the neural network, and learning some generic transfer function amongst nodes that will perform well on any topology. First results use a loss based on a large dataset of actual power flows computed using the slow HADES simulator. The results indeed generalize to very different topologies than the ones used for training, and in particular to very different sizes of power grids. On-going work removes the need to run HADES thanks to a loss that directly aims to minimize the violation of Kirchhoff's law on all lines.
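The two ingredients can be sketched separately: a message-passing layer whose weights are shared across all nodes (so the same network applies to any grid size or topology), and a physics-based loss measuring the violation of Kirchhoff's current law at each node, which needs no simulator labels. This is an illustrative sketch under invented sizes, not the thesis architecture:

```python
import numpy as np

def message_passing(node_feat, edges, W_msg, W_upd, steps=2):
    """Minimal GNN layer: each node sums messages from its neighbors,
    then updates its embedding; the same weights fit any topology."""
    h = node_feat
    for _ in range(steps):
        msg = np.zeros_like(h)
        for i, j in edges:               # undirected line i--j
            msg[i] += h[j] @ W_msg
            msg[j] += h[i] @ W_msg
        h = np.tanh(h @ W_upd + msg)
    return h

def kirchhoff_loss(flows, edges, injections):
    """Squared violation of Kirchhoff's current law: at every node,
    injection plus inflow minus outflow should be zero."""
    balance = injections.astype(float).copy()
    for k, (i, j) in enumerate(edges):
        balance[i] -= flows[k]           # flow leaves node i
        balance[j] += flows[k]           # and enters node j
    return float((balance ** 2).sum())

rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))             # 3 buses, 4 features each
Wm, Wu = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
h = message_passing(h0, [(0, 1), (1, 2)], Wm, Wu)   # one embedding per bus
```

On a 3-node line with injections (1, 0, -1), the flows (1, 1) satisfy the law exactly (loss 0), while any other flow assignment is penalized; training the GNN to minimize this loss is what removes the need for HADES labels.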
Participants: Isabelle Guyon, Marc Schoenauer, Michèle Sebag
PhDs: Victor Berger, Herilalaina Rakotoarison; Post-doc: Berna Batu
Collaboration: Vincent Renaut (Artelys)
One of the goals of the ADEME Next project, in collaboration with SME Artelys (see also Section ), is the sizing and capacity design of regional power grids. Though smaller than the national grid, regional and urban grids nevertheless raise scaling issues, in particular because much more fine-grained information must be taken into account for their design and growth prediction.
Regarding the design of such grids, and provided accurate predictions of consumption are available (see below), off-the-shelf graph optimization algorithms can be used. Berna Batu is gathering different approaches. Herilalaina Rakotoarison's PhD tackles the automatic tuning of their parameters (see Section ); while the Mosaic algorithm is validated on standard AutoML benchmarks , its application to Knitro, Artelys' in-house optimizer, is on-going, and compared to the state of the art in parameter tuning (confidential deliverable).
In order to get accurate consumption predictions, V. Berger's PhD tackles the identification of the peak of energy consumption, defined as the level of consumption that is reached during at least a given duration with a given probability, depending on consumers (profiles and contracts) and weather conditions. The peak identification problem is currently tackled using Monte-Carlo simulations based on consumer-profile- and weather-dependent individual models, at a high computational cost. The challenge is to exploit individual models to train a generative model, aimed at sampling the collective consumption distribution in the quantiles with the highest peak consumption. The concept of Compositional Variational Auto-Encoder was proposed: it is amenable to multi-ensemblist operations (addition or subtraction of elements in the composition), enabled by the invariance and generality of the whole framework w.r.t., respectively, the order and number of the elements. It was first tested on synthetic problems .
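The peak definition above (a level reached during at least a given duration, with a given probability) translates directly into a Monte-Carlo estimate: per simulated day, the level held for at least d steps is the d-th largest value of the trajectory, and the probability requirement selects a quantile over days. A minimal sketch on invented gamma-distributed consumption data (not the project's individual models):

```python
import numpy as np

def peak_level(samples, duration, prob):
    """Monte-Carlo peak: the highest level that, with probability >= prob
    over simulated days, is reached during at least `duration` steps."""
    # Per day, the level exceeded during >= duration steps is the
    # duration-th largest value of that day's trajectory.
    per_day = np.sort(samples, axis=1)[:, -duration]
    # The level guaranteed with probability prob is the (1-prob) quantile.
    return float(np.quantile(per_day, 1 - prob))

rng = np.random.default_rng(0)
# 1000 simulated days x 48 half-hour steps, aggregating 200 consumers
days = rng.gamma(shape=2.0, scale=1.0, size=(1000, 48, 200)).sum(axis=2)
level = peak_level(days, duration=4, prob=0.9)
```

Note the monotonicities that sanity-check the definition: demanding a higher probability, or a longer duration, can only lower the guaranteed level. The cost of the approach is that the whole population must be simulated many times, which is what the generative model aims to avoid.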
Participants: Cécile Germain, Isabelle Guyon
PhD: Victor Estrade, Adrian Pol
Collaboration: D. Rousseau (LAL), M. Pierini (CERN)
The role and limits of simulation in discovery is the subject of V. Estrade's PhD, specifically uncertainty quantification and calibration, that is, how to handle the systematic errors arising from the differences ("known unknowns") between simulation and reality, coming from uncertainty in the so-called nuisance parameters. In the specific context of HEP analysis, where relatively numerous labelled data are available, the problem lies at the crossroads of domain adaptation and representation learning. We have investigated how to directly enforce invariance w.r.t. the nuisance in the sought embedding, through the learning criterion (tangent back-propagation) or an adversarial approach (pivotal representation). The results contrast the superior performance of incorporating a priori knowledge on a well-separated-classes problem (MNIST data) with a real case setting in HEP, in relation with the Higgs Boson Machine Learning challenge and the TrackML challenge . More indirect approaches, based on either incorporating variance reduction for the parameter of interest or constraining the representation in a variational auto-encoder framework, are currently considered.
Anomaly detection (AD) is the subject of A. Pol's PhD. Reliable data quality monitoring is a key asset in delivering collision data suitable for physics analysis in any modern large-scale high energy physics experiment. The thesis focuses on supervised and semi-supervised methods addressing the identification of anomalies in the data collected by the CMS muon detectors. The combination of DNN classifiers capable of detecting the known anomalous behaviors, and convolutional autoencoders addressing unforeseen failure modes, has shown unprecedented efficiency. The result has been included in the production suite of the CMS experiment at CERN. Recent work has focused on improving AD for the trigger system, the first stage of the event selection process in most experiments at the LHC at CERN. The hierarchical structure of the trigger process called for exploiting the advances in modeling complex structured representations that perform probabilistic inference effectively, and specifically variational autoencoders (VAEs). Previous works argued that training VAE models only with inliers is insufficient, and that the framework should be significantly modified in order to discriminate the anomalous instances. In this work, we exploit the deep conditional variational autoencoder (CVAE) and define an original loss function together with a metric that targets hierarchically structured data AD , . This results in an effective, yet easily trainable and maintainable, model.
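The autoencoder-based part of this pipeline rests on a simple principle: a model trained to reconstruct normal data reconstructs inliers well and unforeseen failure modes poorly, so the reconstruction error serves as an anomaly score. The following sketch uses a linear stand-in (rank-k PCA) for the convolutional autoencoders, on invented "detector readouts" living near a low-dimensional subspace:

```python
import numpy as np

def fit_linear_ae(X, k):
    """Fit a rank-k linear 'autoencoder' (PCA): encoder = projection on the
    top-k principal components, decoder = the transpose map."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def anomaly_score(x, mu, components):
    """Reconstruction error: small for data resembling the training set,
    large for off-manifold (anomalous) inputs."""
    z = (x - mu) @ components.T
    recon = mu + z @ components
    return float(((x - recon) ** 2).sum())

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))            # normal data: 3 latent factors
basis = rng.normal(size=(3, 20))
X = latent @ basis + 0.01 * rng.normal(size=(500, 20))
mu, comps = fit_linear_ae(X, k=3)

score_typical = anomaly_score(X[0], mu, comps)                    # small
score_anomaly = anomaly_score(rng.normal(size=20) * 3.0, mu, comps)  # large
```

Thresholding such a score flags the unforeseen failure modes; the supervised DNN classifiers handle the known ones.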
The highly visible TrackML challenge is described in section .
Participants: Guillaume Charpiat
Collaboration: Yuliya Tarabalka, Armand Zampieri, Nicolas Girard, Pierre Alliez (Titane team, Inria Sophia-Antipolis)
The analysis of satellite or aerial images has been a long-standing topic of research, but the remote sensing community moved only very recently to a principled vision of the tasks in a machine learning perspective, with sufficiently large benchmarks for validation. The main topics are the segmentation of (possibly multispectral) remote sensing images into objects of interest, such as buildings, roads, forests, etc., and the detection of changes between two images of the same place taken at different moments. The main difference with classical computer vision is that images are large (covering whole countries, typically cut into tiles).
These last years, deep learning techniques took over classical approaches in most labs, adapting neural network architectures to the specifics of the tasks. This is due notably to the creation of several large scale benchmarks (including one by us and, soon after, larger ones by GAFAM).
This year, we continued the work started in about the registration of remote sensing images (RGB pictures) with cadastral maps (made of polygons indicating buildings and roads). We extended it in to the case of real datasets, i.e. to noisy data. Indeed, in remote sensing, datasets are often large but of poor ground truth annotation quality. It turns out that, when training on datasets with noisy labels, one can still obtain accuracy scores far better than the noise variance in the training set, due to averaging effects over the labels of similar examples. To properly explain this, a theoretical study was conducted (cf. Section ). Given any already trained neural network and its noisy training set, without knowing the real ground truth, we were then able to quantify this noise averaging effect .
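The noise-averaging effect can be demonstrated on a toy regression problem: when a predictor effectively averages the noisy labels of similar examples, the noise cancels at a rate of roughly sigma over the square root of the number of averaged labels, so the prediction error falls far below the label-noise level. A minimal sketch with an invented 1D ground truth and a k-nearest-neighbor averager standing in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 2000, 0.5
x = rng.uniform(0, 1, n)
clean = np.sin(2 * np.pi * x)             # true annotation
noisy = clean + rng.normal(0, sigma, n)   # noisy labels, as in remote sensing

def knn_predict(x_train, y_train, x_query, k=50):
    """Average the noisy labels of the k most similar examples:
    the label noise cancels roughly as sigma / sqrt(k)."""
    idx = np.argsort(np.abs(x_train[None, :] - x_query[:, None]), axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

x_test = rng.uniform(0, 1, 500)
pred = knn_predict(x, noisy, x_test)
err = float(np.sqrt(np.mean((pred - np.sin(2 * np.pi * x_test)) ** 2)))
# err ends up well below sigma, the noise level of the training labels
```

The quantification result cited above goes further: it bounds this averaging effect for an already trained network without access to the real ground truth.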
We also tackled the problem of pansharpening, i.e., producing a high-resolution color image given a low-resolution color image and a high-resolution greyscale one , with deep convolutional neural networks as well.
Participants: Cyril Furtlehner, Michèle Sebag
PhD: Mandar Chandorkar
Collaboration: Enrico Camporeale (CWI)
Space Weather is broadly defined as the study of the relationships between the variable conditions on the Sun and the space environment surrounding Earth. Aside from its scientific interest from the point of view of fundamental space physics phenomena, Space Weather plays an increasingly important role on our technology-dependent society. In particular, it focuses on events that can affect the performance and reliability of space-borne and ground-based technological systems, such as satellite and electric networks that can be damaged by an enhanced flux of energetic particles interacting with electronic circuits.
Since 2016, in the context of the Inria-CWI partnership, a collaboration between Tau and the Multiscale Dynamics Group of CWI aims at long-term Space Weather forecasting. The goal is to take advantage of the data produced every day by satellites surveying the Sun and the magnetosphere, and more particularly to relate solar images and the quantities (e.g., electron flux, proton flux, solar wind speed) measured at the L1 libration point between the Earth and the Sun (about 1,500,000 km and 1 hour time forward of Earth). A challenge is to formulate such goals as supervised learning problems, while the "labels" associated to solar images are recorded at L1 (thus with a varying and unknown time lag). In essence, while typical ML models aim to answer the question What, our goal here is to answer both questions What and When. This project has been articulated around Mandar Chandorkar's PhD thesis, defended this year in Eindhoven. One of the main results obtained concerns the prediction of the solar wind impacting Earth's magnetosphere from solar images. In this context we encountered an interesting sub-problem related to the non-deterministic travel time of a solar eruption to Earth's magnetosphere. We have formalized it as the joint regression task of predicting the magnitude of signals as well as the time delay with respect to their driving phenomena. We have provided in an approach to this problem combining deep learning and an original Bayesian forward attention mechanism. A theoretical analysis based on linear stability has been proposed to put this algorithm on firm ground. From the practical point of view, encouraging tests have been performed both on synthetic data and real data, with results slightly better than those present in the specialized literature on a small dataset. Various extensions of the method, of the experimental tests and of the theoretical analysis are planned.
Participants: Guillaume Charpiat, Flora Jay, Aurélien Decelle, Cyril Furtlehner
PhD: Théophile Sanchez – PostDoc: Jean Cury
Collaboration: Bioinfo Team (LRI), Estonian Biocentre (Institute of Genomics, Tartu, Estonia), Pasteur Institute (Paris), TIMC-IMAG (Grenoble)
Thanks to the constant improvement of DNA sequencing technology, large quantities of genetic data should greatly enhance our knowledge about evolution and in particular the past history of a population. This history can be reconstructed over the past thousands of years, by inference from present-day individuals: by comparing their DNA, identifying shared genetic mutations or motifs, their frequency, and their correlations at different genomic scales. Still, the best way to extract information from large genomic data remains an open problem; currently, it mostly relies on drastic dimensionality reduction, considering a few well-studied population genetics features.
We developed an approach that extracts features from genomic data using deep neural networks and combines them with a Bayesian framework to approximate the posterior distribution of demographic parameters. The key difficulty is to build flexible problem-dependent architectures, supporting transfer learning and in particular handling data with variable size. We designed new generic architectures that take into account DNA specificities for the joint analysis of a group of individuals, including its variable-size aspects, and compared their performance to state-of-the-art approaches . In the short term, these architectures can be used for demographic inference or selection inference in bacterial populations (ongoing work with a postdoctoral researcher, J. Cury, and the Pasteur Institute); the longer-term goal is to integrate them in various systems handling genetic data or other biological sequence data.
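The variable-size requirement points to exchangeable architectures: a shared per-individual transform followed by a pooling operation makes the output invariant to the individuals' order and well defined for any sample size. The sketch below illustrates this general idea on invented genotype matrices (it is not the exact published network):

```python
import numpy as np

def invariant_embed(genotypes, W_ind, W_pool):
    """Exchangeable embedding of a sample of individuals: a shared
    per-individual transform, then mean pooling, so the output does not
    depend on the individuals' order or number."""
    h = np.tanh(genotypes @ W_ind)   # same weights for every individual
    pooled = h.mean(axis=0)          # order- and size-invariant summary
    return np.tanh(pooled @ W_pool)

rng = np.random.default_rng(0)
W_ind = rng.normal(size=(100, 16))
W_pool = rng.normal(size=(16, 8))
sample = rng.integers(0, 2, size=(30, 100)).astype(float)  # 30 individuals x 100 SNPs

out = invariant_embed(sample, W_ind, W_pool)
shuffled = invariant_embed(sample[rng.permutation(30)], W_ind, W_pool)
# out == shuffled: permuting individuals leaves the embedding unchanged,
# and the same weights accept a sample of any size.
```

Such an embedding can then feed the Bayesian machinery estimating the posterior over demographic parameters.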
In collaboration with the Institute of Genomics of Tartu (Estonia; B. Yelmen, 3-month visitor at LRI), we leveraged two types of generative neural networks (Generative Adversarial Networks and Restricted Boltzmann Machines) to learn the high-dimensional distributions of real genomic datasets and create artificial genomes . These artificial genomes retain important characteristics of the real genomes (genetic allele frequencies and linkage, hidden population structure, ...) without copying them, and have the potential to be valuable assets in future genetic studies by providing anonymous substitutes for private databases (such as the ones held by companies or public institutes like the Institute of Genomics of Tartu). Yet, ensuring anonymity is a challenging point, and we measured the privacy loss by using and extending the Adversarial Accuracy score developed by the team for synthetic medical data .
In collaboration with TIMC-IMAG, we proposed a new factor analysis approach that processes genetic data of multiple individuals from present-day and ancient populations to visualize population structure and estimate admixture coefficients (that is, the probability that an individual belongs to different groups given the genetic data). This method corrects the traditionally used PCA by accounting for time heterogeneity, and enables a more accurate dimension reduction of paleogenomic data .
Participants: Guillaume Charpiat
PhD: Loris Felardos
Collaboration: Jérôme Hénin (IBPC), Bruno Raffin (InriAlpes)
Numerical simulations on massively parallel architectures, routinely used to study the dynamics of biomolecules at the atomic scale, produce large amounts of data representing the time trajectories of molecular configurations, with the goal of exploring and sampling all possible configuration basins of given molecules. The configuration space is high-dimensional (10,000+), hindering the use of standard data analytics approaches. The use of advanced data analytics to identify intrinsic configuration patterns could be transformative for the field.
The high-dimensional data produced by molecular simulations live on low-dimensional manifolds; the extraction of these manifolds will make it possible to drive detailed large-scale simulations further into the configuration space. This year, we studied how to bypass simulations by directly predicting, given a molecule formula, its possible configurations. This is done using Graph Neural Networks in a generative way, producing 3D configurations. The goal is to sample all possible configurations, with the right probability.
Participants: Guillaume Charpiat
Collaboration: Sophie Giffard-Roisin (IRD), Claire Monteleoni (Boulder University), Balazs Kegl (LAL)
Cyclones, hurricanes and typhoons all designate a rare and complex event characterized by strong winds surrounding a low-pressure area. Their trajectory and intensity forecasts, crucial for the protection of people and goods, depend on many factors at different scales and altitudes. Additionally, storms have been more numerous since the 1990s, leading to both more representative and more consistent error statistics.
Currently, track and intensity forecasts are provided by numerous guidance models. Dynamical models solve the physical equations governing motions in the atmosphere. While they can provide precise results, they are computationally demanding. Statistical models are based on historical relationships between storm behavior and other parameters . Current national forecasts are typically driven by consensus methods able to combine different dynamical models.
Statistical models perform poorly compared to dynamical models, although they rely on steadily increasing data resources. ML methods have scarcely been considered, despite their successes in related forecasting problems . A main difficulty is to exploit spatio-temporal patterns. Another difficulty is to select and merge data coming from heterogeneous sensors. For instance, temperature and pressure are real values on a 3D spatial grid, while sea surface temperature or land indication rely on a 2D grid, wind is a 2D vector field, while many indicators such as geographical location (ocean, hemisphere...) are just real values (not fields), and displacement history is a 1D vector (time). An underlying question regards the innate vs acquired issue, and how to best combine physical models with trained models. The continuation of the work started last year shows that with deep learning one can outperform the state-of-the-art in many cases .
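The sensor-merging difficulty described above suggests a multi-stream design: each data type gets its own small encoder, and the embeddings are concatenated before a prediction head. The sketch below illustrates this general idea with invented shapes and a hypothetical displacement output (it is not the published architecture):

```python
import numpy as np

def fuse_streams(grid3d, grid2d, scalars, history, weights):
    """Fuse heterogeneous cyclone data: one encoder per sensor type,
    then concatenation and a shared prediction head."""
    W3, W2, Ws, Wh, Wout = weights
    e3 = np.tanh(grid3d.ravel() @ W3)    # e.g. temperature on a 3D grid
    e2 = np.tanh(grid2d.ravel() @ W2)    # e.g. sea surface temperature (2D)
    es = np.tanh(scalars @ Ws)           # e.g. hemisphere, ocean indicator
    eh = np.tanh(history.ravel() @ Wh)   # past displacements (time x 2)
    return np.concatenate([e3, e2, es, eh]) @ Wout   # predicted displacement

rng = np.random.default_rng(0)
weights = (rng.normal(size=(4 * 8 * 8, 16)),   # 4 altitude levels x 8 x 8
           rng.normal(size=(8 * 8, 16)),
           rng.normal(size=(3, 16)),
           rng.normal(size=(6 * 2, 16)),
           rng.normal(size=(64, 2)))
pred = fuse_streams(rng.normal(size=(4, 8, 8)), rng.normal(size=(8, 8)),
                    rng.normal(size=3), rng.normal(size=(6, 2)), weights)
```

In practice each encoder would be a convolutional or recurrent network matched to its stream's structure; the point here is only the fusion of heterogeneous inputs into a single prediction.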
Participants: Cécile Germain, Isabelle Guyon, Adrien Pavao, Anne-Catherine Letournel, Michèle Sebag
PhD: Zhengying Liu, Lisheng Sun, Balthazar Donon
Collaborations: D. Rousseau (LAL), André Elisseeff (Google Zurich), Jean-Roch Vilmant (CERN), Antoine Marot and Benjamin Donnot (RTE), Kristin Bennett (RPI), Magali Richard (Université de Grenoble).
The Tau group uses challenges (scientific competitions) as a means of stimulating research in machine learning, and of engaging a diverse community of engineers, researchers, and students in learning and contributing to advancing the state of the art. The Tau group is community lead of the open-source Codalab platform, hosted by Université Paris-Saclay. The project grew in 2019 and now includes an engineer dedicated full-time to administering the platform and developing challenges (Adrien Pavao), financed by a new project just starting with the Région Ile-de-France. This project will also receive the support of Isabelle Guyon's Chaire Nationale d'Intelligence Artificielle for the next four years.
Following the highly successful ChaLearn AutoML Challenges (NIPS 2015 – ICML 2016 – PKDD 2018 ), a series of challenges on the theme of AutoDL was run in 2019 (see http://
Part of the High Energy Physics activities of the team, the first phase of TrackML was run and co-sponsored by Kaggle, until September 2018. The second phase was run on Codalab until March 2019, requiring code submission; algorithms were then ranked by combining accuracy and speed. The best submissions largely outperform the existing solutions. The challenge was presented at NeurIPS , and at a CERN workshop.
A new challenge series in Reinforcement Learning was started with the company RTE France, on the theme “Learning to run a power network” (L2RPN, http://
The HADACA project (EIT Health) aims to run a series of challenges to promote and encourage innovations in data analysis and personalized medicine. Université de Grenoble organized a challenge on matrix factorization (https://
It is important to introduce challenges in ML teaching. This has been done (and is on-going) in I. Guyon's Licence and Master courses: some assignments for Master students are to design small challenges, which are then given to Licence students in labs, and both types of students seem to love it. Codalab has also been used to implement reinforcement learning homework in the form of challenges by Victor Berger and Heri Rakotoarison for Michèle Sebag's class.
Along similar lines, F. Landes proposed a challenge in the context of S. Mallat's course at the Collège de France. Finally, in collaboration with aiforgood.org, Heri Rakotoarison has put in place a hackathon for the conference Data Science Africa (https://
In terms of dissemination, four books were published in 2019 in the Springer series on challenges in machine learning, see http://
Tau will continue Tao's policy about technology transfer: accepting any informal meeting following industrial requests for discussion (and we are happy to be heavily solicited), and deciding about the follow-up based upon the originality, feasibility and possible impact of the foreseen research directions, provided they fit our general canvas. This led to the following 5 on-going CIFRE PhDs, with the corresponding side-contracts with the industrial supervisor, plus 3 other bilateral contracts. In particular, we now have a first “Affiliate” partner, the SME DMH, and hope to further develop this form of transfer in the future. Note that it can also sometimes lead to collaborative projects, as listed in the following sections.
DMH 2019 (1 year, 45 kEuros) related to consulting activities with DMH (Digital for Mental Health)
Coordinator: Aurélien Decelle and Simon Moulieras (DMH)
Participants: Michèle Sebag
CIFRE Renault 2017-2020 (45 kEuros), related to Marc Nabhan's CIFRE PhD Sûreté de fonctionnement d’un véhicule autonome - évaluation des fausses détections au travers d’un profil de mission réduit
Coordinator: Marc Schoenauer and Hiba Hage (Renault)
Participants: Marc Nabhan (PhD), Yves Tourbier (Renault)
BOBCAT The new B-tO-B work intermediaries: comparing business models in the "CollaborATive" digital economy, 2018-2020 (100k euros), funded by DARES (French Ministry of Labor).
Coordinator : Odile Chagny (IRES)
Participants: Paola Tubaro and Antonio A. Casilli (Telecom Paris)
INDL-KW International Network on Digital Labor - Kickoff Workshops, 2019 (10k euros), funded by CNRS and the University of Toronto.
Coordinator: Paola Tubaro and Alessandro Delfanti (UoT)
Participants: Antonio A. Casilli (Telecom Paris)
CIFRE Thalès 2018-2021 (45 kEuros), with Thales Teresis, related to Nizam Makdoud's CIFRE PhD
Coordinator: Marc Schoenauer and Jérôme Kodjabatchian
Participants: Nizam Makdoud
CIFRE RTE 2018-2021 (72 kEuros), with Réseau Transport d'Electricité, related to Balthazar Donon's CIFRE PhD
Coordinator: Isabelle Guyon and Antoine Marot (RTE)
Participants: Balthazar Donon, Marc Schoenauer
CIFRE FAIR 2018-2021 (45 kEuros), with Facebook AI Research, related to Leonard Blier's CIFRE PhD
Coordinator: Marc Schoenauer and Yann Ollivier (Facebook)
Participants: Guillaume Charpiat, Michèle Sebag, Léonard Blier
IFPEN (Institut Français du Pétrole Energies Nouvelles) 2019-2023 (300 kEuros), to hire an Inria Starting Research Position (PhD + 4-6 years) to work in all topics mentioned in Section relevant to IFPEN activity (see also Section ). Started October 2019.
Coordinator: Marc Schoenauer
Participants: Alessandro Bucci, Guillaume Charpiat
EPITOME 2017-2020 (225kEuros), Efficient rePresentatIon TO structure large-scale satellite iMagEs (Section ).
Coordinator: Yuliya Tarabalka (Titane team, Inria Sophia-Antipolis)
Participant: Guillaume Charpiat
HUSH 2020-2023 (348kEuros), The HUman Supply cHain behind smart technologies.
Coordinator: Antonio A. Casilli (Telecom Paris)
Participant: Paola Tubaro
Nutriperso 2017-2020, 122 kEuros. Personalized recommendations toward healthier eating practices (Section ).
U. Paris-Saclay IRS (Initiative de Recherche Stratégique)
Partners: INRA (coordinator), INSERM, Agro Paristech, Mines Telecom
Participants: Philippe Caillou, Flora Jay, Michèle Sebag, Paola Tubaro
IRS CDS 2017-2020, 75 kEuros. Personalized recommendations toward healthier eating practices
U. Paris-Saclay IRS (Initiative de Recherche Stratégique)
Partners: INRA (coordinator), INSERM, Agro Paristech, Mines Telecom
Participants: Philippe Caillou, Flora Jay, Michèle Sebag, Paola Tubaro
PIA Adamme 2015-2019 (258 kEuros) Machine Learning on a mass-memory architecture.
Coordinator: Bruno Farcy (Bull SAS)
Participants: Marc Schoenauer, Guillaume Charpiat, Cécile Germain-Renaud
NEXT 2017-2021 (675 kEuros). Simulation, calibration, and optimization of regional or urban power grids (Section ).
ADEME (Agence de l'Environnement et de la Maîtrise de l'Energie)
Coordinator: ARTELYS
Participants: Isabelle Guyon, Marc Schoenauer, Michèle Sebag, Victor Berger (PhD), Herilalaina Rakotoarison (PhD), Berna Bakir Batu (Post-doc)
DATAIA Vadore 2018-2020 (105 kEuros) VAlorizations of Data to imprOve matching in the laboR markEt, with CREST (ENSAE) and Pôle Emploi (Section ).
Coordinator: Michèle Sebag
Participants: Philippe Caillou, Isabelle Guyon
PIA JobAgile 2018-2021 (379 kEuros) Evidence-based Recommandation pour l’Emploi et la Formation (Section ).
Coordinator: Michèle Sebag and Stéphanie Delestre (Qapa)
Participants: Philippe Caillou, Isabelle Guyon
HADACA 2018-2019 (50 kEuros), within EIT Health, for the organization of challenges toward personalized medicine (Section ).
Coordinator: Magali Richard (Inria Grenoble)
Participant: Isabelle Guyon
IPL (Inria Project Lab)
Coordinator: Bruno Raffin (Inria Grenoble)
Participants: Guillaume Charpiat, Loris Felardos (PhD)
ScGlass 2016-2020 (10 M$), “Cracking the Glass Problem", an international collaboration funded by the Simons Foundation (New York, USA).
Coordinator: 13 PIs around the world (see https://
Participant: François Landes (alumnus of the collaboration, actively collaborating with its members)
CERN: collaboration with two major CERN experiments (ATLAS and CMS) on the role of machine learning at all stages of the scientific discovery process. C. Germain supervises a CERN-funded PhD.
IIL CWI-Inria
Associate Team involved in the International Lab:
Title: Data-driven simulations for Space Weather predictions
International Partner (Institution - Laboratory - Researcher):
CWI (Netherlands) - Multiscale Dynamics Group - Enrico Camporeale
Start year: 2017
See also: http://
We propose an innovative approach to Space Weather modeling: the synergetic use of state-of-the-art simulations with Machine Learning and Data Assimilation techniques, in order to adjust for errors due to non-modeled physical processes and parameter uncertainties. We envision a truly multidisciplinary collaboration between experts in Computational Science and Data Assimilation on one side (CWI), and experts in Machine Learning and Data Mining on the other (Inria). Our research objective is to realistically tackle long-term Space Weather forecasting, which would represent a giant leap in the field. This proposal is extremely timely, since the huge amount of (freely available) space mission data has not yet been systematically exploited in current computational methods for Space Weather. We therefore believe that this work will yield cutting-edge results and open further research topics in Space Weather and Computational Plasma Physics.
Isabelle Guyon - Competition co-chair, ECML-PKDD 2019
Michèle Sebag - Area Chair NIPS 2017-2019, ICLR 2020
Flora Jay - PASADENA workshop co-chair, Paris, 2019
Guillaume Charpiat - Organizing & scientific committee of the ForMaL summer school, at ENS Cachan, June 2019, and of WAISE (Second International Workshop on Artificial Intelligence Safety Engineering) held at SafeComp in September 2019.
Flora Jay - Organizing & scientific committee of Research Program "Ecosystem dynamics: stakes, data and models" at Institut Pascal, Paris-Saclay, 2019.
Isabelle Guyon - Advisory committee BayLearn 2019; Co-organizer NeurIPS 2019 workshop on Challenges in Machine Learning; Co-organizer NeurIPS 2019 NewInML workshop.
Marc Schoenauer - Steering Committee, Parallel Problem Solving from Nature (PPSN); Steering Committee, Learning and Intelligent OptimizatioN (LION).
Michèle Sebag - President of Steering Committee, Eur. Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).
All TAU members are reviewers of the main conferences in their respective fields of expertise.
Isabelle Guyon - Action editor, Journal of Machine Learning Research (JMLR); series editor, Springer series Challenges in Machine Learning (CiML).
Marc Schoenauer - Advisory Board, Evolutionary Computation Journal, MIT Press, and Genetic Programming and Evolutionary Machines, Springer Verlag; Action editor, Journal of Machine Learning Research (JMLR).
Michèle Sebag - Editorial Board, Machine Learning, Springer Verlag.
Paola Tubaro - Associate Editorial Board Sociology, Sage; Co-editor, Revue Française de Sociologie, Presses de Sciences Po.
All members of the team reviewed numerous articles for the most prestigious journals in their respective fields of expertise.
Guillaume Charpiat - Deep Learning for Satellite Imagery in the Imagine team (ENPC), at Champs-sur-Marne, March 2019; and at LRDE lab (EPITA), at Kremlin-Bicêtre, April 2019; Deep Learning for Storm Trajectory Prediction and Remote Sensing at the seminar AI for Climate, at Jussieu (Paris), December 2019.
Cyril Furtlehner - A machine learning approach to solar wind speed forecasting from solar images, Machine Learning in Heliophysics conference, Amsterdam September 2019.
Flora Jay - Machine Learning and Deep Learning for Population Genetics, Statistics/Machine Learning at Paris-Saclay, Bures-sur-Yvette, Jan 2019; Inferring Demography? A Deep Learning Approach for Population Genetic Data, ALPHY 7-8 Feb. 2019; Génomes présents, histoires d'antan: Apprentissage profond pour la génétique des populations, Inria Seminar Unithé ou Café with G Charpiat, Feb 2019; When AI & Big Data meet life sciences - Advances in research and ethical questions, Round Table at YLRS2019, Paris, Jun 2019; Inferring past history from genetic data using ABC and Deep Learning approaches, Seminar at Lille University, Jan 2019; Creating Artificial Human Genomes Using Generative Models, Seminars at LCQB and Evolmol IBENS, Paris, Nov and Dec 2019.
Julien Girard - Formal validation for machine learning, DataIA day on Safety & AI, Palaiseau, Sept. 11th.
Isabelle Guyon - Invited talk at the 2019 Sackler Colloquium on the Science of Deep Learning, and Keynote IJCNN 2019: "Neural network solvers for power transmission problems" (https://
Michèle Sebag - Meta-Learning, Kickoff of the Kompetenz Center in Machine Learning, Rhine Ruhr Region, Jan. 2019; Structural Agnostic Models, Oberwolfach Symposium on Causality, May 2019; Artificial Intelligence & Causal Modeling, CREST Big Data Applications Symposium, Tokyo, Sept. 24, 2019; Some news and questions about AI and Machine Learning, Arenberg Symposium, Leuven, Nov. 2019.
Marc Schoenauer - Intelligence Artificielle Mythes et réalités, audition par la Fédération Française du Bâtiment, Feb. 13, 2019; Intelligence Artificielle : du congrès de Dartmouth au rapport Villani, Séminaire IFPEN, Apr. 4, 2019; Intelligence Artificielle : de Dartmouth à l'apprentissage profond et au rapport Villani, Rencontres Franciliennes de Mécanique, May 28, 2019; A gentle introduction to AI and (Deep) Learning, Mexican-French Workshop, Aug. 27, 2019, Mexico; Intelligence artificielle : certification, transparence, et impact sur la société, Data Science Day, Mines Paris Tech, Sept. 18, 2019; When Big Data and Machine Learning meet Partial Differential Equations, CREST Big Data Applications Symposium, Tokyo, Sept. 25, 2019.
Paola Tubaro - Sélectionné.e par une IA ? Algorithmes, inégalités, et les "humains dans la boucle", Centre D'Alembert, Orsay, 18/04/2019; Dans la fabrique des algorithmes : plateformes, micro-travail et dynamiques d'externalisation, INSEE, 14/05/2019; The human labour that makes AI possible: An empirical study of micro-work in France, Alan Turing Institute, London, 22/01/2019; Que font les big data aux sciences sociales ? Retour sur une ‘crise’ annoncée, EHESS, 21/02/2019.
Isabelle Guyon - President and co-founder of ChaLearn, a not-for-profit organization dedicated to organizing challenges.
Marc Schoenauer - Chair of ACM-SIGEVO (Special Interest Group on Evolutionary Computation), 2015-2019, now on its Advisory Board; Founding President (since 2015) of SPECIES (Society for the Promotion of Evolutionary Computation In Europe and its Surroundings), which organizes the yearly EvoStar conference series.
Michèle Sebag - Elected Chair of Steering Committee, ECML-PKDD; Board member, Institut de Convergence DataIA.
Paola Tubaro - Convenor of the Social Network Analysis Group of the British Sociological Association; co-founder of the European Network on Digital Labor.
Guillaume Charpiat - Member of the Inria Saclay Commission Scientifique, and as such, member of jury committees for PhD, postdoc and professor-delegation grants; reviewer for DigiCosme grants, ANRT CIFRE PhD grants, and the GPU platforms Jean Zay (GENCI) and Lab-IA (Saclay plateau). Discussion panel of the workshop day IA & Océan-Atmosphère-Climat (IMT Atlantique Rennes), February 6th. Panel of the Machine Learning workshop at CRiP ITES, Deauville, April 5th.
Cécile Germain - Evaluator for the H2020-2016-CNECT program; member of the DFG review panel within Germany’s excellence strategy selection process.
Isabelle Guyon - Member of the NeurIPS foundation board.
Marc Schoenauer - Comité Scientifique IA, SCube (Scientipôle Savoirs & Société), Orsay; Scientific Committee, TrackML (see Section ); Comité de sélection, Chaire ABEONA-ENS "Biais et Equité en IA"; Conseil Scientifique, IFPEN; Scientific Advisory Board, BCAM, Bilbao, Spain; Scientific Advisory Board, Tara Oceans, Paris.
Michèle Sebag - Hiring juries: LRI; Centrale-Supélec; ENS-Paris; UCA-Nice; U. Freiburg, Germany. Selection juries: NIPS 2019 awards; NSERC proposals, Canada; STIC department proposals; Prix de thèse AFIA. Expert Committee for Finland's Ministry of Economic Affairs, AI Strategy, February 2019.
Isabelle Guyon - Representative of UPSud in the DataIA Institut de Convergence Program Committee, University of Paris-Saclay. Head of the AIC Master program (becoming the Paris-Saclay Master in Artificial Intelligence).
Marc Schoenauer - Deputy Scientific Director of Inria (in French, Directeur Scientifique Adjoint, DSA), in charge of AI.
Michèle Sebag - Deputy director of LRI, CNRS UMR 8623; elected member of the Research Council of Univ. Paris-Saclay; member of the STIC department council of Univ. Paris-Saclay; member of the Scientific Council of Labex AMIES (Applications des Mathématiques dans l'Industrie, l'Entreprise et la Société).
Paola Tubaro - Representative of CNRS in the DataIA Institut de Convergence Program Committee, University of Paris-Saclay; member of the Board, Maison des Sciences de l'Homme Paris-Saclay; member of CLIP, Institut Pascal, University of Paris-Saclay.
Licence : Philippe Caillou, Computer Science for students in Accounting and Management, 192h, L1, IUT Sceaux, Univ. Paris Sud.
Licence : Aurélien Decelle, Computer Architecture, 42h, L2, Univ. Paris-Sud.
Licence : Aurélien Decelle, Introduction to Machine Learning, 57h, L2, Univ. Paris-Sud.
Licence : François Landes, Mathematics for Computer Scientists, 51h, L2, Univ. Paris-Sud.
Licence : François Landes, Intro to Machine Learning, 48h, L2, Univ. Paris-Sud.
Licence and Polytech : Cécile Germain, Computer Architecture, Univ. Paris-Sud.
Licence : Isabelle Guyon: Introduction to Data Science, L1, Univ. Paris-Sud.
Licence : Isabelle Guyon, Project: Resolution of mini-challenges (created by M2 students), L2, Univ. Paris-Sud.
Master : François Landes, Machine Learning, 34h, M1 Polytech, Univ. Paris-Sud.
Master : Aurélien Decelle, Machine Learning, 26h, M1, Univ. Paris-Sud.
Master : Aurélien Decelle, Probability and statistics, 26h, M1, Univ. Paris-Sud.
Master : Aurélien Decelle, Probability, statistics and information theory, M1, Univ. Paris-Sud.
Master : Guillaume Charpiat, Deep Learning in Practice, 24h, M2 Recherche, Centrale-Supelec + MVA.
Master : Isabelle Guyon, Project: Creation of mini-challenges, M2, Univ. Paris-Sud.
Master : Michèle Sebag, Machine Learning, 12h; Deep Learning, 9h; Reinforcement Learning, 12h; M2 Recherche, Univ. Paris-Sud.
Master : François Landes, Machine Learning, 22h, M2 Recherche, Univ. Paris-Sud.
Master : Paola Tubaro, Start -up project for engineering students, 24h, Telecom ParisTech.
Master : Paola Tubaro, Sociology of social networks, 24h, M2, EHESS/ENS.
Master : Paola Tubaro, Social and economic network science, 24h, M2, ENSAE.
Doctorate: Paola Tubaro, Research Methods, 12h, University of Insubria, Italy.
Summer school: Guillaume Charpiat, Machine Learning & Deep Learning Tutorial, 4h30, ForMaL, ENS Cachan, June 4th
HdR - Paola Tubaro, Decoding the platform society: Organizations, markets and networks in the digital economy, 11/12/2019, Sciences Po Paris.
PhD - Benjamin DONNOT, Deep learning methods for predicting flows in power grids: novel architectures and algorithms, 13/02/2019, Isabelle Guyon and Antoine Marot (RTE)
PhD - Corentin TALLEC, Reinforcement Learning and Recurrent Neural Networks: Dynamical approaches, 7/10/2019, Université Paris-Saclay, Yann Ollivier
PhD - Mandar CHANDORKAR, Machine Learning in Space Weather, 14/11/2019, Eindhoven University of Technology, Enrico Camporeale, Cyril Furtlehner, and Michèle Sebag
PhD - Guillaume DOQUET, Agnostic Feature Selection, 29/11/2019, Université Paris-Saclay, Michèle Sebag
PhD - Diviyan KALAINATHAN, Generative Neural Networks to Infer Causal Mechanisms: Algorithms and Applications, 17/12/2019, Université Paris-Saclay, Michèle Sebag and Isabelle Guyon
PhD - Lisheng SUN, Meta-Learning as a Markov Decision Process, 19/12/2019, Université Paris-Saclay, Isabelle Guyon and Michèle Sebag
PhD in progress - Eléonore BARTENLIAN, Deep Learning pour le traitement du signal, 1/10/2018, Michèle Sebag and Frédéric Pascal (Centrale-Supélec)
PhD in progress - Victor BERGER, Variational Anytime Simulator, 1/10/2017, Michèle Sebag
PhD in progress - Guillaume BIED, Valorisation des Données pour la Recherche d’Emploi, 1/10/2019, Bruno Crepon (CREST-ENSAE) and Philippe Caillou
PhD in progress - Leonard BLIER, Vers une architecture stable pour les systèmes d’apprentissage par renforcement, 1/09/2018, Yann Ollivier (Facebook AI Research, Paris) and Marc Schoenauer
PhD in progress - Tony BONNAIRE, Reconstruction de la toile cosmique, from 1/10/2018, Nabila Aghanim (Institut d'Astrophysique Spatiale) and Aurélien Decelle
PhD in progress - Balthazar DONON, Apprentissage par renforcement pour une conduite stratégique du système électrique, 1/10/2018, Isabelle Guyon and Antoine Marot (RTE)
PhD in progress - Victor ESTRADE, Robust domain-adversarial learning, with applications to High Energy Physics, 01/10/2016, Cécile Germain and Isabelle Guyon.
PhD in progress - Loris FELARDOS, Neural networks for molecular dynamics simulations, 1/10/2018, Guillaume Charpiat, Jérôme Hénin (IBPC) and Bruno Raffin (Inria Grenoble)
PhD in progress - Giancarlo FISSORE, Statistical physics analysis of generative models, 1/10/2017, Aurélien Decelle and Cyril Furtlehner
PhD in progress - Julien GIRARD, Vérification et validation des techniques d'apprentissage automatique, 1/10/2018, Zakaria Chihani (CEA) and Guillaume Charpiat
PhD in progress - Nicolas GIRARD, Satellite image vectorization using neural networks, 1/10/2017, Yuliya Tarabalka & Pierre Alliez (Inria Sophia-Antipolis) and Guillaume Charpiat
PhD in progress - Armand LACOMBE, Recommandation de Formations: Application de l'apprentissage causal dans le domaine des ressources humaines, 1/10/2019, Michele Sebag and Philippe Caillou
PhD in progress - Zhengying LIU, Automation du design des réseaux de neurones profonds, 1/10/2017, Isabelle Guyon
PhD in progress - Nizam MAKDOUD, Motivations intrinsèques en apprentissage par renforcement. Application à la recherche de failles de sécurité, 1/02/2018, Marc Schoenauer and Jérôme Kodjabachian (Thalès ThereSIS, Palaiseau).
PhD in progress - Marc NABHAN, Sûreté de fonctionnement d’un véhicule autonome - évaluation des fausses détections au travers d’un profil de mission réduit, 1/10/2017, Marc Schoenauer and Hiba Hage (Renault)
PhD in progress - Adrian POL, Machine Learning Anomaly Detection, with application to CMS Data Quality Monitoring, 01/10/2016, Cécile Germain.
PhD in progress - Herilalaina RAKOTOARISON, Automatic Algorithm Configuration for Power Grid Optimization, 1/10/2017, Marc Schoenauer and Michèle Sebag
PhD in progress - Théophile SANCHEZ, Reconstructing the past: deep learning for population genetics, 1/10/2017, Guillaume Charpiat and Flora Jay
PhD in progress - Pierre WOLINSKI, Learning the Architecture of Neural Networks, from 1/9/2016, Yann Ollivier (Facebook AI Research, Paris) and Guillaume Charpiat
PhD in progress - Wenzhuo LIU, Machine Learning for Numerical Simulation of PDEs, from 1/11/2019, Mouadh Yagoubi (IRT SystemX) and Marc Schoenauer
PhD in progress - Marion ULLMO, Reconstruction de la toile cosmique, from 1/10/2018, Nabila Aghanim (Institut d'Astrophysique Spatiale) and Aurélien Decelle
Marc Schoenauer: Reviewer for Dennis Wilson, IRIT, Université Toulouse; PhD jury of Patricio Cerda Reyes, U. Paris-Saclay, 28/11/2019.
François Landes: PhD rapporteur of Martina Teruzzi (Condensed matter and Machine Learning PhD, at SISSA, Trieste, Italy).
Paola Tubaro: PhD jury of Sophie Balech, U. Paris Nanterre, 09/07/2019; PhD jury of Linda Rua, U. Paris Dauphine, 13/12/2019.
Guillaume Charpiat: half-way PhD committee ("à mi-parcours") of Rodrigo Daudt (ONERA), of Julie Rivet (EPITA) and of Nissim Zerbib (Institut de la Vision)
Isabelle Guyon: PhD jury Justine Falque, U. Paris-Saclay (29/11/2019).
Michèle Sebag: Reviewer for A. Chemla (PhD U. Roma & IRCAM); Pierre Fournier (PhD ISIR).
Michèle Sebag - Il faut dissiper le malentendu sur “les prétentions infondées" de l'intelligence artificielle, op-ed in Le Monde, August 12, 2019.
Paola Tubaro - BBC World service, The 'microworkers' making your digital life possible, 02/08/2019; Radio France Inter, Travailleurs du clic, les soutiers du clavier, 23/03/2019; Science & Vie, La force de l'IA repose sur des travailleurs bien réels, 24/10/2019; AFP/L'Express, Les travailleurs du clic, petites mains invisibles de l'économie numérique, 21/06/2019; CNRS Le Journal, Ces microtravailleurs de l'ombre, 24/05/2019; Le Monde, Jobs du clic : la France compte plus de 250 000 micro-travailleurs, 15/02/2019; L'Humanité Dimanche, Travailleuse du clic, la double invisibilité, 02/05/2019.
Guillaume Charpiat - Génomes présents, histoires d'antan: Apprentissage profond pour la génétique des populations, Inria Seminar Unithé ou Café with Flora Jay, Feb 2019.
Marc Schoenauer - Intelligence Artificielle Mythes et réalités, Classe virtuelle de la Délégation Académique du Numérique Educatif, Feb. 2, 2019; Intelligence Artificielle Mythes et réalités, Médiathèque George Sand, Palaiseau, Feb. 9, 2019.
Michèle Sebag - Journées Scientifiques Inria, Lyon 2019; Centre de Recherches Interdisciplinaires, Workshop on Artificial Intelligence and Women Empowerment (Oct. 2019); CRI, Open AI: From big data to smart data (Nov. 2019).
Paola Tubaro - Le Micro-Travail en France : derrière l'automatisation, de nouvelles précarités au travail ? CFE-CGC Orange, Paris, 01/07/2019; Le Micro-Travail en France, Fondation Gabriel Péri, Paris, 26/06/2019; Big data et société : vie privée, travail, inégalités, Banque de France, 06/06/2019.