Section: New Results

Applications to E-science

Participants : Cécile Germain-Renaud [correspondent] , Marco Bressan, Philippe Caillou, Dawei Feng, Cyril Furtlehner, Blaise Hanczar, Karima Rafes, Balázs Kégl, Michèle Sebag.

The E-S-SIG explores issues related to E-Science applications, ranging from modeling and optimizing very large scale computational grids, in particular in the context of Physics, to modeling in the social sciences with multi-agent systems.

The Higgs boson Machine Learning challenge

The HiggsML challenge (https://www.kaggle.com/c/higgs-boson) was set up to promote collaboration between high-energy physicists and computer scientists. The challenge, hosted by Kaggle, drew a remarkably large audience (with 1700+ teams, it is one of the most popular Kaggle challenges of all time) and wide coverage both on social networks and in the media.

The goal of the challenge is to improve the procedure that classifies events produced by the decay of the Higgs boson versus events produced by other (background) processes, based on a training set of 250,000 examples. The challenge is a first: never before had a CERN experiment (ATLAS) made public such a large set of its official event and detector simulations. It also features a unique formal objective, an approximation of the median significance (AMS) of a discovery (counting) test, which raises interesting algorithmic and theoretical questions beyond the usual task of finding and tuning the best classification algorithm [55].
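To make the objective concrete, here is a minimal Python sketch of the AMS. It assumes the regularized form given in the challenge documentation, where s and b are the weighted numbers of signal and background events falling in the selection region and a constant regularization term of 10 is added to b. The function and parameter names are illustrative and are not taken from the challenge toolkit.

```python
import math


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate median significance of a counting test.

    s     -- weighted count of selected signal events (assumed >= 0)
    b     -- weighted count of selected background events (assumed >= 0)
    b_reg -- regularization term; 10 is the value assumed here
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    b_tot = b + b_reg
    return math.sqrt(2.0 * ((s + b_tot) * math.log(1.0 + s / b_tot) - s))


def selected_counts(labels, weights, selected):
    """Sum event weights over the selection region, split by true label.

    labels   -- 1 for signal, 0 for background
    weights  -- per-event weights
    selected -- True where the classifier places the event in the region
    """
    s = sum(w for y, w, keep in zip(labels, weights, selected) if keep and y == 1)
    b = sum(w for y, w, keep in zip(labels, weights, selected) if keep and y == 0)
    return s, b
```

When s is small compared with b, the value approaches the familiar s / sqrt(b) estimate, which is one reason the objective behaves differently from classification accuracy: only the events placed in the selection region contribute.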

As a follow-up, the HEPML workshop was organized at NIPS14 (http://nips.cc/Conferences/2014/Program/event.php?ID=4292), reporting on the results and the winning algorithms. The dataset and a software toolkit are available from the CERN Data Portal (http://opendata.cern.ch).

The Center for Data Science

The Center for Data Science (CDS) is a Lidex of the Université Paris-Saclay (UPSay), headed by Balázs Kégl and Arnak Dalalyan, gathering over 52 research teams and 34 labs with the goal of designing and applying automated methods to analyze massive and complex scientific datasets in order to extract useful information. Data science projects require expertise from a vast spectrum of disciplines (statistics, signal processing, machine learning, data mining, data visualization, high performance computing), in addition to mastery of the scientific domain from which the data originate.

The goal of CDS is to establish an institutionalized agora in which scientists can find each other, exchange ideas, initiate and nurture interdisciplinary projects, and share their experience of past data science projects. To foster synergy between data analysts and data producers, CDS organizes actions that provide initial resources to help collaborations get off the ground, mitigate the non-negligible risk taken by researchers venturing into interdisciplinary data science projects, and encourage the unconventional forms of information transmission and dissemination that are essential in this communication-intensive research area. The CDS is part of the recent surge of similar initiatives, at both the international and the national level, and it has the potential to make the Université Paris-Saclay one of the international forerunners of data science (http://www.datascience-paris-saclay.fr/en).

Fault management

As Lamport formulated decades ago, fault management in distributed systems exemplifies the unreachability of exact prior knowledge. Real-world large scale systems add a further complexity: non-stationarity.

  • [12] models the system state and its ruptures (non-stationarity) through the flow of jobs as a stream (scalability), with a traceability goal (interpretability). These new streaming approaches involve self-calibration of the model based on scale invariance.

  • D. Feng’s PhD thesis [3] formulates the problem of probe selection for fault prediction based on end-to-end probing as a Collaborative Prediction (CP) problem, based on the reasonable assumption of an underlying factorial model. [26] extends the matrix completion/compressed sensing setup to a sequential (tensor) context. We propose and evaluate a new algorithm, Sequential Matrix Factorization (SMF), that combines matrix completion with a self-calibrating exploration/exploitation balancing heuristic. Its active learning version (SMFA) exhibits superior performance over state-of-the-art methods.
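The matrix completion setting underlying the second item can be illustrated with a generic sketch. The code below is not SMF or SMFA: it is plain alternating least squares on a partially observed matrix, assuming rows stand for probe sources, columns for probe targets, and observed entries for probe outcomes. The function name `als_complete` and all of its parameters are hypothetical, and the sequential (tensor) aspect and the exploration/exploitation heuristic are deliberately left out.

```python
import numpy as np


def als_complete(M, mask, rank=2, reg=1e-2, n_iters=100, seed=0):
    """Fill in a partially observed matrix with a low-rank factor model.

    M    -- 2-D array; only entries where mask is True are used
    mask -- boolean array of the same shape, True for observed entries
    Returns the dense reconstruction U @ V.T.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = M.shape
    U = rng.normal(scale=0.1, size=(n_rows, rank))
    V = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row factor from the entries observed in that row.
        for i in range(n_rows):
            Vo = V[mask[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ M[i, mask[i]])
        # Update each column factor from the entries observed in that column.
        for j in range(n_cols):
            Uo = U[mask[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ M[mask[:, j], j])
    return U @ V.T
```

In a probing scenario of this kind, the unobserved entries of the reconstruction play the role of predicted outcomes for probes that were never launched. An active variant would additionally choose which entries to observe next, which is where an exploration/exploitation balance comes in.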

Distributed system observation

The work on automated analysis and description of distributed systems [7] has been pursued through the continued development of the GAMA multi-agent framework (https://code.google.com/p/gama-platform/wiki/GAMA). Philippe Caillou is associated with the new ANR young researcher project ACTEUR, coordinated by Patrick Taillandier (IDEES, Rouen university), which will provide an additional structure for further collaborations.

Identifying leaders in Social Networks

The Modyrum contract with the SME Augure (funding Marco Bressan's postdoc) aims at providing criteria to identify trend leaders from blogs, tweets and other website posts. The same methods are being applied to fashion leaders in business as well as to opinion leaders in politics.