Section: New Results

Data-driven Numerical Modelling

High Energy Physics

Participants: Cécile Germain, Isabelle Guyon

PhD: Victor Estrade, Adrian Pol

Collaboration: D. Rousseau (LAL), M. Pierini (CERN)

The role and limits of simulation in discovery are the subject of V. Estrade’s PhD, specifically uncertainty quantification and calibration, i.e., how to handle the systematic errors arising from the differences (“known unknowns”) between simulation and reality, which stem from uncertainty in the so-called nuisance parameters. In the specific context of HEP analysis, where relatively abundant labelled data are available, the problem lies at the intersection of domain adaptation and representation learning. We have investigated how to directly enforce invariance w.r.t. the nuisance parameters in the sought embedding, either through the learning criterion (tangent back-propagation) or through an adversarial approach (pivotal representation). The results [25] contrast the superior performance of incorporating a priori knowledge on a problem with well-separated classes (MNIST data) with a real-case setting in HEP, related to the Higgs Boson Machine Learning challenge [66]. More indirect approaches, based either on variance reduction for the parameter of interest or on constraining the representation in a variational auto-encoder framework, are currently under consideration.
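To illustrate the adversarial (pivotal) route, the sketch below trains an encoder whose embedding a small adversary tries to use to recover the nuisance parameter; a gradient-reversal layer makes the encoder learn to hide that information. This is a minimal PyTorch sketch of the general idea, not the code of [25]; all dimensions, the networks and the training_step helper are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical dimensions, for illustration only.
n_features, n_hidden = 30, 64
encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
classifier = nn.Linear(n_hidden, 1)   # signal vs background
adversary = nn.Linear(n_hidden, 1)    # tries to recover the nuisance parameter

opt = torch.optim.Adam(list(encoder.parameters())
                       + list(classifier.parameters())
                       + list(adversary.parameters()), lr=1e-3)
bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

def training_step(x, y, z, lam=10.0):
    """x: inputs, y: class labels, z: nuisance values (e.g. a systematic shift)."""
    h = encoder(x)
    loss_cls = bce(classifier(h).squeeze(1), y)
    # The adversary sees a gradient-reversed copy of the embedding:
    # it learns to recover z, while the encoder learns to hide it.
    loss_adv = mse(adversary(grad_reverse(h, lam)).squeeze(1), z)
    loss = loss_cls + loss_adv
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```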

Anomaly detection is the subject of A. Pol's PhD. Reliable data quality monitoring is a key asset in delivering collision data suitable for physics analysis in any modern large-scale high energy physics experiment. [60] focuses on supervised and semi-supervised methods addressing the identification of anomalies in the data collected by the CMS muon detectors. The combination of DNN classifiers capable of detecting the known anomalous behaviors, and of convolutional autoencoders addressing unforeseen failure modes, has shown unprecedented efficiency compared to either the production solution or classical anomaly detection methods (one-class SVM or Isolation Forest). The result has been included in the production suite of the CMS experiment at CERN.
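The semi-supervised part of this approach can be illustrated by a convolutional autoencoder trained on nominal detector data only, with the reconstruction error used as anomaly score. The sketch below is a minimal, hypothetical version; the ConvAE architecture and the 32x32 occupancy maps are illustrative, not the CMS production model.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Toy convolutional autoencoder for 1-channel detector occupancy maps."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_score(model, x):
    """Per-sample mean squared reconstruction error; high score = suspicious map."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).flatten(1).mean(dim=1)

# Usage sketch on hypothetical 32x32 occupancy maps (train on "good" runs only,
# then flag maps whose score exceeds a threshold chosen on a validation set).
model = ConvAE()
maps = torch.rand(8, 1, 32, 32)
scores = anomaly_score(model, maps)
```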

The highly visible TrackML challenge is described in section 7.6.

Remote Sensing Imagery

Participants: Guillaume Charpiat

Collaboration: Yuliya Tarabalka, Armand Zampieri, Nicolas Girard, Pierre Alliez (Titane team, Inria Sophia-Antipolis)

The analysis of satellite or aerial images has been a long-standing topic of research, but the remote sensing community moved only very recently to a principled vision of the tasks from a machine learning perspective, with sufficiently large benchmarks for validation. The main topics are the segmentation of (possibly multispectral) remote sensing images into objects of interest, such as buildings, roads, forests, etc., and the detection of changes between two images of the same place taken at different times. The main differences with classical computer vision are that the images are large (covering whole countries, typically cut into 5000×5000-pixel tiles) and contain many small, potentially similar objects (rather than one big object per image), that every pixel needs to be annotated (as opposed to assigning a single label to a full image), and that the ground truth is often not reliable (spatially mis-registered, missing new constructions).

This year, deep learning techniques took over classical approaches in most labs, adapting neural network architectures to the specifics of the tasks. This is due notably to the creation of several large scale benchmarks (including one by us [127] and, soon after, larger ones by GAFAM). A still ongoing issue is the ability to generalize across datasets (as urban and rural areas look different in different parts of the world, or even within the same country, e.g. roof types in France).

The task of segmenting satellite images comes together with that of registering them with cadastral maps. Indeed, the ground truth in remote sensing benchmarks (cadastral maps) is often imperfect, due to spurious deformations. We tackle this issue by learning how to register images of different modalities (RGB pictures vs. binary cadastral maps). If one tries to predict, given an RGB photograph and an associated cadastral map, the deformation that warps one onto the other, by outputting a 2D vector field indicating the predicted displacement of each pixel (which can be as large as ±32 px), then the problem is too hard (32×32 possibilities for each pixel's 2D displacement vector). Instead, we simplify the problem by decomposing it into a cascade of increasing resolutions. The idea is that if one zooms out by a factor 32, while knowing that the maximum possible displacement has magnitude 32 px, then at this low resolution one has to move pixels by at most 1 pixel. Learning the task at this low resolution is thus easy. Once this is done, if we zoom in by a factor 2, thus reaching a resolution lower than the original one by a factor 16, then the maximum displacement is again 1 pixel (since larger displacements have been dealt with at the previous scale), and so on. In the end, we train a multi-scale chain of neural networks (double U-nets) [34], later combined with a segmentation task [27] in order to benefit from multi-task training, known to improve results.
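The coarse-to-fine cascade can be sketched as follows: at each scale the current displacement estimate is upsampled (its pixel magnitude doubling), the cadastral map is warped accordingly, and a per-scale network only has to predict a residual displacement of roughly ±1 px. The nets[s] modules below are hypothetical stand-ins for the double U-nets of [34]; this is a sketch of the scheme, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp an image (N,C,H,W) with a dense flow (N,2,H,W) given in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalise coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack((2 * grid[:, 0] / (w - 1) - 1,
                        2 * grid[:, 1] / (h - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def coarse_to_fine_register(rgb, cad, nets, n_scales=6):
    """nets[s] is a hypothetical per-scale network whose output flow is bounded
    to about +/-1 px at its own resolution (e.g. via a tanh output)."""
    n, _, h, w = rgb.shape
    flow = None
    for s in range(n_scales - 1, -1, -1):      # s=5: zoom-out factor 32, ..., s=0: full resolution
        size = (h // 2 ** s, w // 2 ** s)
        rgb_s = F.interpolate(rgb, size=size, mode="bilinear", align_corners=False)
        cad_s = F.interpolate(cad, size=size, mode="bilinear", align_corners=False)
        if flow is None:
            flow = torch.zeros(n, 2, *size)
        else:
            # Upsample the previous estimate and double it (flow is in pixel units).
            flow = 2 * F.interpolate(flow, size=size, mode="bilinear", align_corners=False)
        cad_s = warp(cad_s, flow)              # apply what has already been explained
        flow = flow + nets[s](rgb_s, cad_s)    # residual step of at most ~1 px
    return flow                                # full-resolution displacement field
```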

Space Weather Forecasting

Participants: Cyril Furtlehner, Michèle Sebag

PhD: Mandar Chandorkar

Collaboration: Enrico Camporeale (CWI)

Space Weather is broadly defined as the study of the relationships between the variable conditions on the Sun and the space environment surrounding the Earth. Aside from its scientific interest from the point of view of fundamental space physics phenomena, Space Weather plays an increasingly important role in our technology-dependent society. In particular, it focuses on events that can affect the performance and reliability of space-borne and ground-based technological systems, such as satellites and electric power grids, which can be damaged by an enhanced flux of energetic particles interacting with electronic circuits. (According to a recent survey conducted by the insurance company Lloyd's, an extreme Space Weather event could produce up to $2.6 trillion in financial damage.)

Since 2016, in the context of the Inria-CWI partnership, a collaboration between Tau and the Multiscale Dynamics Group of CWI aims at long-term Space Weather forecasting. The project is extremely timely, as the huge amount of (freely available) space mission data has not yet been systematically exploited by current computational methods for space weather. Specifically, the goal is to take advantage of the data produced every day by satellites surveying the Sun and the magnetosphere, and more particularly to relate solar images to the quantities (e.g., electron flux, proton flux, solar wind speed) measured at the L1 libration point between the Earth and the Sun (about 1,500,000 km from Earth, i.e., roughly one hour upstream of Earth in the solar wind). The project is very ambitious: the accurate prediction of, e.g., geomagnetic storms or solar wind speed from solar images would represent a giant leap in the field. A challenge is to formulate such goals as a supervised learning problem, while the "labels" associated with the solar images are recorded at L1 (thus with a varying and unknown time lag). In essence, while typical ML models aim to answer the question What, our goal here is to answer both questions What and When. Concerning the prediction of the solar wind impacting the Earth's magnetosphere from solar images, we encountered an interesting sub-problem related to the non-deterministic travel time of a solar eruption to the Earth's magnetosphere. We have formalized it as the joint regression task of predicting both the magnitude of signals and the time delay with respect to their driving phenomena, and provided a solution tested on synthetic data.
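A deliberately simplified sketch of the joint "what and when" regression is given below: a shared network with two heads predicts the signal magnitude and a non-negative time delay, assuming, as with synthetic data, that both targets are available for direct supervision; the actual difficulty addressed in the work is the latent, varying delay. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeAndDelay(nn.Module):
    """Two-headed regressor: 'what' (signal magnitude) and 'when' (travel-time delay)."""
    def __init__(self, n_inputs, n_hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU(),
                                  nn.Linear(n_hidden, n_hidden), nn.ReLU())
        self.magnitude_head = nn.Linear(n_hidden, 1)
        self.delay_head = nn.Sequential(nn.Linear(n_hidden, 1), nn.Softplus())  # delay >= 0

    def forward(self, x):
        h = self.body(x)
        return self.magnitude_head(h).squeeze(1), self.delay_head(h).squeeze(1)

def joint_loss(pred_mag, pred_delay, true_mag, true_delay, alpha=1.0):
    """Weighted sum of the two regression errors; alpha balances 'what' vs 'when'."""
    return F.mse_loss(pred_mag, true_mag) + alpha * F.mse_loss(pred_delay, true_delay)
```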

Genomic Data and Population Genetics

Participants: Guillaume Charpiat, Flora Jay

PhD: Théophile Sanchez

Collaboration: TIMC-IMAG (Grenoble), Estonian Biocentre (Institute of Genomics, Tartu, Estonia)

Thanks to the constant improvement of DNA sequencing technology, large quantities of genetic data should greatly enhance our knowledge of evolution, and in particular of the past history of populations. This history can be reconstructed over the past thousands of years by inference from present-day individuals: by comparing their DNA, and identifying shared genetic mutations or motifs, their frequencies, and their correlations at different genomic scales. Still, the best way to extract information from large genomic data remains an open problem; currently, it mostly relies on drastic dimensionality reduction, considering a few well-studied population genetics features.

On-going work at TAU, around Théophile Sanchez's PhD, co-supervised by G. Charpiat and Flora Jay, aims at extracting information from genomic data using deep neural networks; the key difficulty is to build flexible, problem-dependent architectures, supporting transfer learning and in particular handling data of variable size. In collaboration with the Bioinfo group at LRI, we designed new generic architectures that take into account DNA specificities for the joint analysis of a group of individuals, including the variable size of the data [141]. In the short term these architectures can be used for demographic inference; the longer-term goal is to integrate them in various systems handling genetic data (e.g., epidemiological statistics) or other biological sequence data. In collaboration with the Estonian Biocentre (Tartu, Estonia), applications will consider thousands of sequenced human genomes, and expand our knowledge of past human history. To this aim, Burak Yelmen (PhD student at the Estonian Biocentre) will visit the lab from February to April 2019. Indeed, Tau's expertise in handling missing and noisy data, and the resulting modeling biases, can contribute to enhancing these novel population genetics methods, particularly those relying heavily on simulated data (and thus potentially suffering from the reality gap).
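A key ingredient of such architectures is invariance to the number and ordering of the sampled individuals. The sketch below illustrates this with a shared 1D convolution applied to each individual's SNP row, followed by symmetric (mean) pooling across individuals; it is a toy, permutation-invariant network in the spirit of [141], with illustrative dimensions, not the actual architecture.

```python
import torch
import torch.nn as nn

class ExchangeableSNPNet(nn.Module):
    """Toy permutation-invariant network over a (n_individuals x n_snps) genotype matrix."""
    def __init__(self, n_out=3, n_channels=32):
        super().__init__()
        self.per_individual = nn.Sequential(
            nn.Conv1d(1, n_channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),          # handles a variable number of SNPs
        )
        self.head = nn.Sequential(nn.Linear(n_channels * 16, 64), nn.ReLU(),
                                  nn.Linear(64, n_out))   # e.g. demographic parameters

    def forward(self, snp_matrix):
        # snp_matrix: (n_individuals, n_snps) binary matrix for one population sample
        h = self.per_individual(snp_matrix.unsqueeze(1).float())  # (n_ind, C, 16)
        h = h.flatten(1).mean(dim=0)                              # pool over individuals
        return self.head(h)

# Usage on two samples of different sizes: the same network handles both.
net = ExchangeableSNPNet()
print(net(torch.randint(0, 2, (50, 1000))).shape)   # 50 individuals, 1000 SNPs
print(net(torch.randint(0, 2, (20, 800))).shape)    # works unchanged on 20 x 800
```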

We also contributed to tess, a method for fast inference of population genetic structure, through a collaboration with TIMC-IMAG. This method analyses SNP data and estimates admixture coefficients (that is, the probability that an individual belongs to each group given the genetic data) via matrix factorization. The observed high-dimensional genetic data are automatically reduced via the rank-k approximation of the matrix factorization, thereby highlighting the latent structure of the data: the matrix factorization scores correspond to the admixture coefficients, while the loadings give the genetic characteristics of each cluster. This method is faster than the hierarchical Bayesian models we had previously developed, and hence well suited to large NGS data. We participated in the tess3 R package, which implements this algorithm, and facilitates the visualization of population genetic structure and its projection on maps [14]. We are currently adapting closely related algorithms to enable dimension reduction of temporal data, with an application to paleogenomics.
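The matrix-factorization view can be illustrated with a plain non-negative factorization of the genotype matrix, where the normalized scores play the role of admixture coefficients and the loadings describe each group; tess3 additionally incorporates a geographic regularization and dedicated solvers, both omitted in this toy sketch.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy sketch: the (individuals x SNPs) genotype matrix is approximated by Q @ G,
# where rows of Q give admixture coefficients and rows of G the genetic profile
# of each of the k ancestral groups. Data and sizes are synthetic placeholders.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 500)).astype(float)   # hypothetical 0/1/2 SNP counts

k = 3                                            # number of ancestral groups
model = NMF(n_components=k, init="nndsvda", max_iter=500)
Q = model.fit_transform(genotypes)               # (n_individuals, k) scores
G = model.components_                            # (k, n_snps) loadings

admixture = Q / Q.sum(axis=1, keepdims=True)     # normalise rows into probabilities
print(admixture[:5])
```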

Sampling molecular conformations

Participants: Guillaume Charpiat

PhD: Loris Felardos

Collaboration: Jérôme Hénin (IBPC), Bruno Raffin (InriAlpes)

Numerical simulations on massively parallel architectures, routinely used to study the dynamics of biomolecules at the atomic scale, produce large amounts of data representing the time trajectories of molecular configurations. The configuration space is high-dimensional (10,000+), hindering the use of standard data analytics approaches. The use of advanced data analytics to identify intrinsic configuration patterns could be transformative for the field.

The high-dimensional data produced by molecular simulations live on low-dimensional manifolds; extracting these manifolds will make it possible to drive detailed large-scale simulations further into the configuration space. Among the possible options are i) learning a parameterization of the local, low-dimensional manifold and performing a geometric extrapolation of the molecule trajectories; ii) learning a coarse description of the system and its dynamics, supporting a fast prediction of its evolution. In both cases, the states estimated from the time- or configuration-simplified models will be used to steer large-scale simulations, thus accelerating the sampling of stable molecular conformations.

This task will be tackled by combining manifold learning (to find a relevant low-dimensional representation space) and reinforcement learning (for the efficient exploration of this space), taking inspiration from Graph Neural Networks [86]. On-going studies use graph auto-encoders to extract a meaningful representation of molecular conformations and to predict their dynamics.
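The graph auto-encoder idea can be sketched as follows: atoms are nodes carrying 3D coordinates, messages are passed along the bond graph, and a low-dimensional per-atom code is decoded back to coordinates, with the reconstruction error as training objective. The dense message passing and all dimensions below are illustrative, not the architecture actually used in the on-going studies.

```python
import torch
import torch.nn as nn

class GraphAutoEncoder(nn.Module):
    """Toy graph auto-encoder over a molecular conformation (dense adjacency)."""
    def __init__(self, n_hidden=32, n_code=2):
        super().__init__()
        self.embed = nn.Linear(3, n_hidden)
        self.message = nn.Linear(n_hidden, n_hidden)
        self.to_code = nn.Linear(n_hidden, n_code)
        self.decode = nn.Sequential(nn.Linear(n_code, n_hidden), nn.ReLU(),
                                    nn.Linear(n_hidden, 3))

    def forward(self, coords, adjacency):
        # coords: (n_atoms, 3), adjacency: (n_atoms, n_atoms) with self-loops
        h = torch.relu(self.embed(coords))
        for _ in range(3):                          # 3 rounds of message passing
            h = torch.relu(adjacency @ self.message(h))
        code = self.to_code(h)                      # per-atom low-dimensional code
        return self.decode(code), code

# Usage on a hypothetical 10-atom chain molecule.
coords = torch.randn(10, 3)
adj = torch.eye(10)
adj[torch.arange(9), torch.arange(1, 10)] = adj[torch.arange(1, 10), torch.arange(9)] = 1.0
recon, code = GraphAutoEncoder()(coords, adj)
loss = ((recon - coords) ** 2).mean()               # reconstruction objective
```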

Storm trajectory prediction

Participants: Mo Yang, Guillaume Charpiat

Collaboration: Claire Monteleoni, Sophie Giffard-Roisin (LAL / Boulder University), Balazs Kegl (LAL)

Cyclones, hurricanes and typhoons all designate the same rare and complex event, characterized by strong winds surrounding a low-pressure area. Forecasting their trajectory and intensity, which is crucial for the protection of people and goods, depends on many factors at different scales and altitudes. Additionally, storms have been more numerous since the 1990s, leading to both more representative and more consistent error statistics.

Currently, track and intensity forecasts are provided by numerous guidance models. Dynamical models solve the physical equations governing motions in the atmosphere. While they can provide precise results, they are computationally demanding. Statistical models are based on historical relationships between storm behavior and other parameters [82]. Current national forecasts are typically driven by consensus methods able to combine different dynamical models.

Statistical models perform poorly compared to dynamical models, although they rely on steadily increasing data resources. ML methods have scarcely been considered, despite their successes in related forecasting problems [160]. A main difficulty is to exploit spatio-temporal patterns. Another difficulty is to select and merge data coming from heterogeneous sensors: for instance, temperature and pressure are real values on a 3D spatial grid, sea surface temperature and land indication lie on a 2D grid, wind is a 2D vector field, many indicators such as geographical location (ocean, hemisphere...) are plain scalar values (not fields), and the displacement history is a 1D time series. An underlying question regards the innate vs. acquired issue, and how to best combine physical models with trained models. On-going studies, conducted in collaboration with S. Giffard-Roisin and C. Monteleoni (now at Univ. Boulder), outperform the state of the art in many cases [26], [36], [35].
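The fusion of heterogeneous inputs can be sketched with a two-branch network: a convolutional branch ingests the gridded atmospheric fields cropped around the storm centre, a dense branch ingests the scalar indicators and past displacements, and the concatenated features feed a forecast head. Grid sizes, channel counts and the displacement target below are illustrative, not those of [26].

```python
import torch
import torch.nn as nn

class StormFusionNet(nn.Module):
    """Toy fusion network: gridded fields (CNN branch) + scalar features (dense branch)."""
    def __init__(self, n_field_channels=6, n_scalars=10):
        super().__init__()
        self.grid_branch = nn.Sequential(
            nn.Conv2d(n_field_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.scalar_branch = nn.Sequential(nn.Linear(n_scalars, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(32 + 32, 64), nn.ReLU(),
                                  nn.Linear(64, 2))     # e.g. 24h displacement (dlat, dlon)

    def forward(self, fields, scalars):
        return self.head(torch.cat([self.grid_branch(fields),
                                    self.scalar_branch(scalars)], dim=1))

# Usage on hypothetical crops centred on the storm.
net = StormFusionNet()
fields = torch.randn(4, 6, 25, 25)    # e.g. wind / pressure / temperature levels
scalars = torch.randn(4, 10)          # e.g. latitude, hemisphere, past displacements
print(net(fields, scalars).shape)     # torch.Size([4, 2])
```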

Analyzing Brain Activity

Participants: Guillaume Charpiat

Collaboration: Hugo Richard, Bertrand Thirion (Parietal team, Inria Saclay / CEA)

With the goal of understanding brain functional architecture, the brain activity of ten subjects is recorded by an fMRI scanner while they are watching movies (sequences of short excerpts of real movies). The analysis of the ensuing complex stimulation streams proceeds by extracting relevant features from the stimuli and correlating the occurrence of these features with the brain activity recorded simultaneously with the presentation of the stimuli. The analysis of video streams has been carried out in [87] and [108] using a deep convolutional network trained for image classification. The question is then to build good descriptors of videos, possibly involving motion.

We consider a deep neural network trained for action recognition on the largest dataset available [115], and use its activations as descriptors of the input video. This provides deep representations of the watched movies, from an architecture that relies either on optical flow, or on image content, or on both simultaneously. We then train a linear model to predict brain activity from these features. From the different layers of the deep neural network, we build video representations that allow us to segregate (1) occipital and lateral areas of the visual cortex (reproducing the results of [108]) and (2) foveal and peripheral areas of the visual cortex. We also introduce an efficient spatial compression scheme for deep video features that allows us to speed up the training of our predictive algorithm [41]. We show that our compression scheme outperforms PCA by a large margin.
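The encoding-model step, mapping deep video features to voxel responses, amounts to a regularized multi-output linear regression. The sketch below uses ridge regression on random placeholder data with illustrative shapes; the actual pipeline includes the spatial compression scheme and proper cross-validation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder data: deep video features and preprocessed BOLD signal, with
# illustrative dimensions only.
rng = np.random.default_rng(0)
n_timepoints, n_features, n_voxels = 300, 512, 1000
video_features = rng.standard_normal((n_timepoints, n_features))   # activations of the action-recognition net
brain_activity = rng.standard_normal((n_timepoints, n_voxels))     # one column per voxel

# Fit one regularized linear model per voxel (RidgeCV handles multi-output y).
model = RidgeCV(alphas=np.logspace(-1, 4, 6))
model.fit(video_features[:200], brain_activity[:200])
predicted = model.predict(video_features[200:])

def columnwise_corr(a, b):
    """Voxelwise correlation between predicted and measured activity."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

print(columnwise_corr(predicted, brain_activity[200:]).mean())
```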