Section: New Results

Distributed Systems

Participants: Cécile Germain-Renaud [correspondent], Philippe Caillou, Dawei Feng, Cyril Furtlehner, Victorin Martin, Michèle Sebag.

The DIS-SIG explores the issues related to modeling and optimizing distributed systems, ranging from very large scale computational grids to multi-agent systems and large scale traffic management.

Fault management.

As Lamport formulated decades ago, fault management in distributed systems exemplifies the unreachability of exact prior knowledge. Real-world large scale systems additionally face the non-stationarity issue.

[20] models the system state and its ruptures (non-stationarity) by treating the flow of jobs as a stream (scalability), with a traceability goal (interpretability). It addresses a key difficulty in data streaming: the timely detection of a change (drift) in the generative process underlying the data stream. A statistical model based on spatial distance and time frequency is proposed, together with adaptive thresholding. Theoretical and experimental validation shows the robustness of the method.

D. Feng's PhD formulates the problem of probe selection for fault prediction based on end-to-end probing as a Collaborative Prediction (CP) problem, based on the reasonable assumption of an underlying factorial model [13]. Monitoring large scale distributed systems differs from CP's usual applications (personalized recommendation) in two major ways. On the bright side, while users cannot be queried for specific recommendations, probes can be launched at will. On the downside, the distribution of the probe results is highly skewed, faults being a small fraction of the total population, and some of the faults are transient. Amongst the numerous approaches addressing CP, Maximum Margin Matrix Factorization (MMMF) is easily amenable to active learning, which addresses fault sparsity at both the spatial (skewed distributions) and temporal (transients) levels. From extensive experiments on real-world data, we have shown that modelling probe-based fault prediction as a CP task and addressing this task through MMMF is an extremely efficient strategy for fault prediction. Comparative analysis and experiments demonstrate the critical advantage of active learning, which offers a scalable alternative to direct AUC optimization. Similarly, comparison with bias-aware methods (Mixed Membership Matrix Factorization) indicates that the capacity to actively select the most informative probes provides the most efficient way to capture the time variability of the system.
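The interplay between factorization and active probe selection can be illustrated with a minimal sketch. It is not the implementation evaluated in [13]: it assumes probe outcomes coded as +1/-1, a low-rank factorization trained by stochastic gradient descent on a hinge loss with L2-regularized factors (a common simplification of MMMF), and a selection rule that picks the unobserved probe whose predicted score is closest to zero. The names fit_factors, score and next_probe, as well as all hyperparameter values, are hypothetical.

```python
import random

def score(U, V, i, j):
    """Predicted score for probe (i, j): inner product of the two factor rows."""
    return sum(a * b for a, b in zip(U[i], V[j]))

def fit_factors(observed, n_rows, n_cols, rank=2, reg=0.05, lr=0.05,
                epochs=200, seed=0):
    """Fit X ~ U V^T on observed +1/-1 entries with hinge loss and L2 penalty.

    `observed` maps (row, col) -> +1 or -1. All defaults are illustrative.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), y in entries:
            # The hinge loss max(0, 1 - y * s) has a non-zero gradient
            # only when the margin y * s is below 1.
            violated = y * score(U, V, i, j) < 1.0
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                grad_u = reg * u - (y * v if violated else 0.0)
                grad_v = reg * v - (y * u if violated else 0.0)
                U[i][k] = u - lr * grad_u
                V[j][k] = v - lr * grad_v
    return U, V

def next_probe(U, V, observed):
    """Return the unobserved (row, col) whose predicted score is closest to 0.

    Raises ValueError when every entry has already been observed.
    """
    candidates = [(i, j) for i in range(len(U)) for j in range(len(V))
                  if (i, j) not in observed]
    return min(candidates, key=lambda ij: abs(score(U, V, *ij)))
```

In an active loop, the selected probe would be launched, its outcome added to `observed`, and the factors refitted; because probes can be launched at will, the labelling budget is spent on the entries the current model is least certain about.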

Multi-agent and games.

Resuming earlier work, our goal is to provide an automated abstract description of simulation results. Data mining methods are used to identify groups in complex simulations [11]. Using activity indicators to identify the most interesting agent groups [17], the groups and their evolution are described across one or several simulations [10]. To facilitate the dissemination of the algorithms, we have participated in the development of a generic multi-agent platform (GAMA), in collaboration with IRD, University of Rouen (IDEES), and University of Toulouse (IRIT) [34], [35].

A statistical physics perspective.

With motivating applications in large scale traffic congestion inference problems, we have:

  • Established a method for encoding real data with pairwise dependencies into an Ising copula [58], suitable for inferring real-valued data (travel time) from the computed probabilities of the corresponding latent binary states (congested/not congested).

  • Investigated, in parallel, the inverse Ising problem, proposing among other methods a loop analysis based on a duality transformation, which leads to a dual belief propagation algorithm running on the dual graph formed by the network of independent cycles. The aim is to find an MRF that represents pairwise correlated data and stays close to a dependence tree while accounting for the most important loops [14], [15].
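As a point of reference for the inverse Ising problem, the dependence-tree case admits a closed-form solution, sketched below. This is the classical tree baseline (a maximum spanning tree over pairwise mutual information, in the style of Chow and Liu), not the loop analysis or the dual belief propagation of [14], [15]. It assumes samples given as tuples of spins in {-1, +1} and uses add-one smoothing of the pairwise counts; the function names are hypothetical. On a tree, the coupling of an edge follows directly from its pairwise marginal: J_ij = (1/4) log[ p(+,+) p(-,-) / (p(+,-) p(-,+)) ].

```python
import math
from itertools import combinations

SPINS = (-1, 1)

def pair_marginal(samples, i, j):
    """Empirical joint distribution of spins i and j, with add-one smoothing."""
    counts = {(a, b): 1.0 for a in SPINS for b in SPINS}
    for s in samples:
        counts[(s[i], s[j])] += 1.0
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

def mutual_information(p):
    """Mutual information (in nats) of a 2x2 joint distribution."""
    pi = {a: p[(a, -1)] + p[(a, 1)] for a in SPINS}
    pj = {b: p[(-1, b)] + p[(1, b)] for b in SPINS}
    return sum(p[(a, b)] * math.log(p[(a, b)] / (pi[a] * pj[b]))
               for a in SPINS for b in SPINS)

def tree_coupling(p):
    """Ising coupling of a tree edge, read off its pairwise marginal."""
    return 0.25 * math.log(p[(1, 1)] * p[(-1, -1)] / (p[(1, -1)] * p[(-1, 1)]))

def dependence_tree(samples):
    """Return tree edges (i, j, J_ij), chosen greedily by mutual information."""
    n = len(samples[0])
    edges = []
    for i, j in combinations(range(n), 2):
        p = pair_marginal(samples, i, j)
        edges.append((mutual_information(p), i, j, tree_coupling(p)))
    edges.sort(reverse=True)

    parent = list(range(n))  # union-find forest used to reject cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j, coupling in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, coupling))
    return tree
```

Such a tree discards every cycle of the underlying network by construction, which is the limitation that motivates taking the most important loops into account.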