The research of the Delys team addresses the theory and practice of distributed systems, including multicore computers, clusters, networks, peer-to-peer systems, cloud, fog and edge computing systems, and other communicating entities such as swarms of robots. It addresses the challenges of correctly communicating, sharing information, and computing in such large-scale, highly dynamic computer systems. This includes addressing the core problems of communication, consensus and fault detection, scalability, replication and consistency of shared data, information sharing in collaborative groups, dynamic content distribution, and multi- and many-core concurrent algorithms.
Delys is a joint research team between LIP6 (Sorbonne University/CNRS) and Inria Paris.
DELYS addresses both theoretical and practical issues of
Computer Systems, leveraging our dual expertise in theoretical and experimental research.
Our approach is a “virtuous cycle,” triggered by issues arising in real systems: we design algorithms, prove them correct and evaluate them theoretically, then implement and test them experimentally, feeding the results back into theory.
The major challenges addressed by DELYS are sharing information and guaranteeing the correct execution of highly dynamic computer systems.
Our research covers a large spectrum of distributed computer systems: multicore computers,
mobile networks, cloud computing systems, and dynamic
communicating entities.
This holistic approach enables handling related problems at different levels.
Among such problems we can highlight consensus, fault detection, scalability, information retrieval, resource allocation, replication and consistency of shared
data, dynamic content distribution, and concurrent and parallel algorithms.
Two main evolutions in the Computer Systems area strongly influence our research project:
(1) Modern computer systems are increasingly distributed, dynamic and composed of multiple devices geographically spread over heterogeneous platforms, spanning multiple management domains.
Years of research in the field are now coming to fruition,
and are being used by millions of users of web systems, peer-to-peer
systems, gaming and social applications, cloud computing, and now fog computing.
These new uses bring new challenges, such as
adaptation to dynamically-changing conditions, where knowledge of the system
state can only be partial and incomplete.
(2) Heterogeneous architectures and virtualisation are everywhere. The parallelism
offered by distributed clusters and multicore architectures is opening highly parallel
computing to new application areas. To be successful, however, many
issues need to be addressed. Challenges include obtaining a consistent view of
shared resources, such as memory, and optimally distributing computations
among heterogeneous architectures. These issues arise at a finer grain than before, leading to the
need for different solutions down to the OS level itself.
Distributed computing systems raise scientific challenges shaped by many important features,
which include scalability, fault tolerance, dynamics, emergent behaviour, heterogeneity,
and virtualisation at many levels.
Algorithms designed for traditional distributed systems, such as resource
allocation, data storage and placement, and concurrent access to shared
data, need to be redefined or revisited in order to work properly under the constraints of
these new environments.
Sometimes, classical “static” problems (e.g., leader election, spanning tree construction) even need to be redefined to take into account the unstable nature of the distributed system.
In particular, DELYS will focus on a number of key challenges:
We target highly distributed infrastructures composed of multiple devices geographically spread over heterogeneous platforms including cloud, fog computing and IoT.
At OS level, we study multicore architectures and virtualized environments based on VM hypervisors and containers. Our research focuses on providing solutions to efficiently share memory between virtualized environments.
Francis Laniel received the Best Paper Award at IEEE NCA 2020 for “MemOpLight: Leveraging application feedback to improve container memory consolidation”
Sreeja Nair was awarded the “Séphora Berrebi Scholarship for Women in Advanced Mathematics & Computer Science” (2020, 3rd edition).
Nowadays, distributed systems are increasingly heterogeneous and versatile.
Computing units can join, leave or move inside a global infrastructure.
These features require the implementation of dynamic systems, that is, systems that can cope autonomously with changes
in their structure, in terms of both physical facilities and software. It therefore becomes necessary to define, develop, and validate
distributed algorithms able to manage such dynamic and large-scale systems, for instance mobile ad hoc
networks, (mobile) sensor networks, P2P systems, cloud environments, and robot networks, to name only a few.
The fact that computing units may leave, join, or move may or may not be intentional. In the latter case, the system may be subject to disruptions due to component faults, which can be permanent, transient, exogenous, malicious, etc. It is therefore crucial to come up with solutions that tolerate some types of faults.
In 2020, we obtained the following results.
Eventual leader election is an essential service for many reliable applications that require coordination actions on top of asynchronous fail-prone distributed systems. In 26 we proposed a new algorithm that eventually elects a leader for each connected component of a dynamic network where nodes can move or fail by crashing. A node only communicates with nodes in its transmission range and locally keeps a global view, denoted topological knowledge, of the communication graph of the network and its dynamic evolution. Every change in the topology or in node membership is detected by one or more nodes and propagated over the network, thus updating the topological knowledge of the nodes. As the choice of the leader has an impact on the performance of applications that use an eventual leader election service, our algorithm, thanks to the nodes' topological knowledge, exploits closeness centrality as the criterion for electing a leader. Experiments were conducted on top of the PeerSim simulator, comparing our algorithm to a representative flooding algorithm. Performance results show that our algorithm outperforms the flooding one when considering leader choice stability, number of messages, and average distance to the leader.
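As an illustration of the election criterion, closeness centrality can be computed with one BFS per node. The sketch below is ours, not the published algorithm (which maintains the topological knowledge under churn and elects within each connected component); it simply elects the node with the highest closeness on a static view:

```python
from collections import deque

def closeness(adj, node):
    """Closeness centrality of `node`: (number of reachable nodes) divided
    by the sum of their BFS distances; 0.0 for an isolated node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def elect_leader(adj):
    """Elect the node with the highest closeness; ties go to the smallest id."""
    return max(adj, key=lambda n: (closeness(adj, n), -n))

# On the path 0-1-2-3-4, node 2 is the most central, hence elected.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(elect_leader(path))  # → 2
```

Electing a central node keeps the average distance to the leader low, which is exactly the metric on which the published algorithm outperforms flooding.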
Essentially, self-stabilizing algorithms tolerate transient failures, since by definition such failures last a finite time (as opposed to crash failures, for example) and their frequency is low (as opposed to intermittent failures).
In 16, we initiate research on self-stabilization in highly dynamic identified message-passing systems, where the dynamics is modeled using Time-Varying Graphs (TVGs), in order to obtain solutions tolerating both transient faults and high dynamics. We reformulate the definition of self-stabilization to accommodate TVGs.
We investigate the self-stabilizing leader election problem. This problem is fundamental in distributed computing since it allows to synchronize and self-organize a network. In 17, we have studied this problem in three classes of TVGs: (i) the
A team of mobile agents, starting from different nodes of an unknown network, possibly at different times, have to meet at the same node and declare that they have all met. Agents have distinct labels, which are positive integers, and move in synchronous rounds along the links of the network. The above task is known as gathering and was traditionally considered under the assumption that when some agents are at the same node they can talk, i.e., exchange currently available information. In 24, we ask whether this ability to talk is needed for gathering. The answer turns out to be no.
Our main contributions are two deterministic algorithms that always accomplish gathering in a much weaker model. We only assume that at any time an agent knows how many agents are at the node that it currently occupies, but agents do not see the labels of other co-located agents and cannot exchange any information with them. They also do not see nodes other than the current one. Our first algorithm works under the assumption that agents have some a priori upper-bound knowledge about the network, but its complexity is exponential in the size of the network and in the labels of agents. Its purpose is to show the feasibility of gathering under this harsher scenario.
As a by-product of our techniques we obtain, in the same weak model, a
solution to the fundamental problem of leader election among agents: one agent is elected a leader and all agents learn its identity. As an application of our result we also solve, in the same model, the well-known gossiping problem: if each
agent has a message at the beginning, we show how to make all messages known to all agents, even without any a priori knowledge about the network. If agents know an upper bound
In 10, we investigate a special case of hereditary property in graphs, referred to as robustness. A property (or structure) is called robust in a graph if it is satisfied in this graph and in all of its connected spanning subgraphs. This notion is motivated by dynamic networks whose nodes remain able to communicate over time (i.e., they can always reach each other through temporal paths). Each context induces a different interpretation of the notion of robustness.
We start by motivating the definition and discussing the two interpretations, after which we consider the notion independently from its interpretation, taking as our focus the robustness of maximal independent sets (MIS). A graph may or may not admit a robust MIS. We characterize the set of graphs in which all MISs are robust. Then, we turn our attention to the graphs that admit a robust MIS. This class has a more complex structure; we give a partial characterization in terms of elementary graph properties, then a complete characterization by means of a (polynomial-time) decision algorithm that accepts if and only if a robust MIS exists. This algorithm can be adapted to construct such a solution if one exists.
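For intuition, on small graphs the robustness of an MIS can be checked by brute force: removing edges never breaks independence, but it can break maximality, so the set must remain a maximal independent set in every connected spanning subgraph. The naive sketch below is ours, for illustration only; it is not the paper's polynomial-time decision algorithm:

```python
from itertools import combinations

def is_mis(nodes, edges, s):
    """Independent: no edge inside s; maximal: every node outside s
    has at least one neighbor in s."""
    if any(u in s and v in s for u, v in edges):
        return False
    return all(any(n in (u, v) and ({u, v} - {n}) & s for u, v in edges)
               for n in nodes if n not in s)

def connected(nodes, edges):
    seen, stack = set(), [nodes[0]]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack += [v for a, v in edges if a == u] + [a for a, v in edges if v == u]
    return seen == set(nodes)

def is_robust_mis(nodes, edges, s):
    """s is robust iff it stays an MIS in every connected spanning subgraph."""
    return all(is_mis(nodes, sub, s)
               for k in range(len(edges) + 1)
               for sub in combinations(edges, k)
               if connected(nodes, sub))

# On the path 0-1-2 the MIS {0, 2} is robust (no edge can be removed without
# disconnecting the graph); on the triangle, the MIS {0} is not: removing
# edge (0, 1) leaves node 1 with no neighbor in the set.
print(is_robust_mis([0, 1, 2], [(0, 1), (1, 2)], {0, 2}))       # → True
print(is_robust_mis([0, 1, 2], [(0, 1), (0, 2), (1, 2)], {0}))  # → False
```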
In 9, we consider a mobile agent that, equipped with a compass and a measure of length, has to find an inert treasure in the Euclidean plane. Both the agent and the treasure are modeled as points.
In the beginning, the agent is at a distance at most
In 13, we deal with a team of autonomous robots that are endowed with motion actuators and visibility sensors. Those robots are weak and evolve in a discrete environment. By weak, we mean that they are anonymous, uniform, unable to explicitly communicate, and oblivious.
We propose optimal (w.r.t. the number of robots) deterministic
solutions for the terminating exploration of an anonymous
grid-shaped network by a team of asynchronous oblivious robots.
We first consider the semi-synchronous model. We show
that it is impossible to explore a grid of at least 3 nodes with
less than 3 robots. Next, we show that it is impossible to
explore a
We then consider the asynchronous model. The latter
being strictly weaker than the semi-synchronous model, all
the aforementioned impossibility results still hold in this context.
We then propose deterministic algorithms that exhibit the optimal
number of robots required to explore a given grid. Our results
show that except in two particular cases, 3 robots are necessary
and sufficient to deterministically explore a grid of at least 3
nodes. The optimal number of robots for the two remaining cases is:
4 for the
Two mobile agents, represented by points freely moving in the plane and starting at two different positions, have to meet. The meeting, called rendezvous, occurs when the agents are at a distance at most equal to the visibility radius. Agents are anonymous and execute the same deterministic algorithm. Each agent has a set of private attributes, some or all of which can differ between agents. These attributes are: the initial position of the agent, its system of coordinates (orientation and chirality), the rate of its clock, its speed when it moves, and the time of its wake-up. If all attributes (except the initial positions) are identical and the agents start at a distance larger than the visibility radius, then rendezvous is not feasible.
Our contribution in 23 is three-fold. We first give an exact characterization of feasible instances. It is then natural to ask whether there exists a single algorithm that guarantees rendezvous for all these instances. We give a strong negative answer to this question: we exhibit two sets of feasible instances that no single algorithm can solve. An algorithm can, however, be almost universal.
We explore the impact of approximation on time-polynomial distributed algorithms. In particular we show in 22 that approximation can help reduce the space used for self-stabilization. In the classic state model, where the nodes of a network communicate by reading the states of their neighbors, an important measure of efficiency is the space: the number of bits used at each node to encode the state. In this model, a classic requirement is that the algorithm has to be silent, that is, after stabilization the states should not change anymore. We design a silent self-stabilizing algorithm for the minimum spanning tree problem, with a trade-off between the quality of the solution and the space needed to compute it.
New distributed system models such as Fog computing are based on the decentralization of computing resources. However, this complicates the orchestration of distributed resources, whose large-scale, unreliable and highly dynamic nature prevents any efficient construction of a global and consistent view. Thus, we need to define new decentralized solutions in which different autonomous subsystems, each having a local view of its own resources, are able to make collaborative decisions in a reasonable time while limiting the communication cost. In this axis, we are currently working on a new model of collaborative decision-making based on consensus protocols (such as Paxos). In our model, we consider several concurrent sets of nodes on a common dynamic infrastructure, where each set runs an instance of a consensus protocol to decide the value of a shared and replicated variable. A given node can belong to several consensus sets. This implies that several decisions can be taken asynchronously in the system by several subsets of nodes, and decision conflicts may occur. In case of a decision conflict, we need to revoke some decisions in order to guarantee the invariants of nodes, which consequently modifies the initial definition of the consensus problem. Our work focuses on 1) optimising the execution of a high number of concurrent consensus instances and 2) the problem of decision revocability.
This work has been submitted for publication.
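To illustrate the kind of conflict at stake, the toy sketch below is purely illustrative (the class, method names, and the lowest-set-id priority rule are our assumptions, not the model under submission): a node belonging to two consensus sets detects conflicting decisions on a shared variable and revokes one of them:

```python
class Node:
    """A node that may belong to several consensus sets. Each set decides
    a value for a shared variable; conflicting decisions are resolved by
    an arbitrary fixed priority (here: the lowest set id wins)."""

    def __init__(self, nid):
        self.nid = nid
        self.decisions = {}  # variable -> (set_id, decided_value)

    def learn(self, set_id, var, value):
        current = self.decisions.get(var)
        if current is None:
            self.decisions[var] = (set_id, value)
            return "adopted"
        if current[1] == value:
            return "compatible"
        # Conflict: keep the highest-priority decision, revoke the other.
        winner = min(current, (set_id, value))
        if winner == current:
            return "rejected"          # incoming decision loses
        self.decisions[var] = winner
        return "revoked"               # previously adopted decision loses

shared = Node("n3")
print(shared.learn(1, "x", "A"))  # set 1 decides x=A → adopted
print(shared.learn(2, "x", "B"))  # set 2 decides x=B → conflict, rejected
```

Revoking an already-learned decision is precisely what departs from the classical consensus specification, where a decided value is irrevocable.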
Network operators expect to accurately satisfy a wide range of users' needs by providing fully customized services relying on Network Slicing. The efficiency of Network Slicing depends on an optimized management of network resources and Quality of Service (QoS). We focus on the Network Slice placement optimization problem.
In 21 we propose a Proof-of-Concept (PoC) illustrated by a time-sensitive Interactive Gaming use case. In 20, we focus on the Virtual Network Function (VNF) Placement and Chaining problem. Contrary to most studies related to VNF placement, we deal with the most complete and complex Network Slice topologies, and we pay special attention to the geographic location of Network Slice users. We propose a data model adapted to Integer Linear Programming. Extensive numerical experiments assess the relevance of taking user location constraints into account. We also propose in 19 an online heuristic algorithm for the network slice placement optimization problem. The solution is adapted to support placement on large-scale networks and integrates Edge-specific and URLLC constraints. We rely on an approach called the Power of Two Choices to build the heuristic. The evaluation results show the good performance of the heuristic, which solves the problem in a few seconds under a large-scale scenario. The heuristic also improves the acceptance ratio of network slice placement requests when compared against a deterministic online Integer Linear Programming (ILP) solution.
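The Power of Two Choices principle behind the heuristic can be sketched as follows. This is a simplified illustration with normalized node capacities and single-node slices; the actual heuristic handles full slice topologies, Edge and URLLC constraints:

```python
import random

def place_slice(nodes, load, demand, rng=random):
    """Power of Two Choices: sample two candidate nodes uniformly and
    place the slice on the lesser-loaded one; if even that candidate
    cannot host the demand, reject the request."""
    a, b = rng.sample(nodes, 2)
    best = a if load[a] <= load[b] else b
    if load[best] + demand <= 1.0:  # capacities normalized to 1.0
        load[best] += demand
        return best
    return None                     # request rejected

nodes = ["n1", "n2", "n3", "n4"]
load = {n: 0.0 for n in nodes}
rng = random.Random(42)
placed = sum(place_slice(nodes, load, 0.3, rng) is not None for _ in range(10))
print(placed, max(load.values()))
```

Sampling only two candidates keeps each placement decision cheap (no global scan), which is why the heuristic scales to large networks, while the "pick the lesser-loaded of two" rule keeps the load well balanced.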
In 27, we study slicing in the context of 5G networks, which allows multiple users to share a common infrastructure. The chaining of Network Functions (NFs) introduces constraints on the order in which NFs are allocated. We first model the allocation of resources for chains of NFs in 5G slices. Then we introduce a distributed mutual exclusion algorithm to address the problem of resource allocation. We show with selected metrics that choosing an order of allocation of the resources that differs from the order in which resources are used can give better performance. We then show experimental results where we improve the usage rate of resources by more than 20% compared to the baseline algorithm in some cases. The experiments run on our own simulator based on SimGrid.
Cloud platforms usually offer several types of Virtual Machines (VMs) with different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per use for on-demand VMs, while spot VMs are instances available at lower prices. However, a spot VM can be terminated or hibernated by EC2 at any moment. In 14, we proposed the Hibernation-Aware Dynamic Scheduler (HADS), which schedules Bag-of-Tasks (BoT) applications with deadline constraints on both hibernation-prone spot VMs and on-demand VMs. HADS aims at minimizing the monetary cost of executing BoT applications on clouds while ensuring that their deadlines are respected even in the presence of multiple hibernations. Results collected from experiments on Amazon EC2 VMs, using synthetic applications and a NAS benchmark application, show the effectiveness of HADS in terms of monetary cost when compared to on-demand-only solutions.
To provide high availability in distributed systems, object replicas allow concurrent updates. Although replicas eventually converge, they may diverge temporarily, for instance when the network fails. This makes it difficult for the developer to reason about the object's properties, and in particular, to prove invariants over its state. For the sub-class of state-based distributed systems, we propose a proof methodology for establishing that a given object maintains a given invariant, taking into account any concurrency control. Our approach allows reasoning about individual operations separately. We demonstrate that our rules are sound, and we illustrate their use with some representative examples. We automate the rules using Boogie, an SMT-based tool.
This work was published at the 29th European Symposium on Programming (ESOP), April 2020, Dublin, Ireland 31.
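The need for such proof rules can be seen on a minimal example (ours, in Python rather than Boogie): each replica's operation preserves the invariant x ≥ 0 locally, yet the converged state violates it, because the two effects interfere when merged:

```python
class Replica:
    """A state-based replicated counter with intended invariant x >= 0.
    Each replica records its local effect as a delta."""

    def __init__(self, initial):
        self.base = initial
        self.delta = 0

    def decrement(self):
        # Local precondition: the *locally visible* value stays >= 0.
        assert self.base + self.delta - 1 >= 0, "local precondition violated"
        self.delta -= 1

    def merged(self, other):
        # Convergence applies the deltas of both replicas.
        return self.base + self.delta + other.delta

r1, r2 = Replica(1), Replica(1)
r1.decrement()        # locally fine: 1 -> 0
r2.decrement()        # locally fine: 1 -> 0
print(r1.merged(r2))  # → -1 : the invariant x >= 0 is violated
```

The proof methodology catches exactly this: it requires showing that an operation's effect preserves the invariant even when applied to states that include concurrent effects, which here fails unless concurrency control (e.g., escrow of decrement rights) is added.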
The tree is a basic data structure present in many applications. We consider the case where the tree is replicated across a distributed system, for instance in a distributed file system. To improve performance and availability, it is desirable to support concurrent updates to different replicas without coordination. Such concurrent updates converge if the effects commute. However, in a naïve implementation, concurrent moves might violate the tree invariant. To avoid this issue, previous approaches would either eschew atomic moves, require preventative cross-replica coordination, or totally order move operations after-the-fact, requiring roll-back and compensation operations.
In this work, we study a novel replicated tree data structure that supports coordination-free concurrent atomic moves, and provably maintains the tree invariant. Our analysis identifies cases where concurrent moves are inherently safe, and we devise a coordination-free, rollback-free algorithm for the remaining cases. The trade-off is that in some cases a move operation “loses” (i.e., is interpreted as skip).
We present a detailed analysis of the concurrency issues with trees, justifying our replicated tree data structure. We provide mechanized proof that the data structure is convergent and maintains the tree invariant. Finally, we compare the response time and availability of our design against the literature.
This work has been submitted for publication.
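The "losing move" rule can be sketched as follows. This is a simplified, sequential illustration of the conflict case only (names are ours); the actual data structure is replicated, coordination-free, and applies moves in a common total order at every replica:

```python
def ancestors(parent, n):
    """Chain of ancestors of n in the forest encoded by `parent`."""
    chain = []
    while n in parent:
        n = parent[n]
        chain.append(n)
    return chain

def apply_moves(parent, moves):
    """Apply (node, new_parent) moves in a fixed total order. A move that
    would create a cycle "loses", i.e. is interpreted as a skip, which
    provably preserves the tree invariant without rollback."""
    applied = []
    for node, new_parent in moves:
        if node == new_parent or node in ancestors(parent, new_parent):
            continue                       # losing move -> skip
        parent[node] = new_parent
        applied.append((node, new_parent))
    return applied

# Concurrent moves "a under b" and "b under a" would form a cycle;
# in the common order the second move loses and is skipped.
parent = {"a": "root", "b": "root"}
print(apply_moves(parent, [("a", "b"), ("b", "a")]))  # → [('a', 'b')]
print(parent)  # → {'a': 'b', 'b': 'root'}
```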
Large-scale applications are typically built on top of geo-distributed databases running in multiple datacenters (DCs) situated around the globe. Network failures are unavoidable, but in most internet services availability is not negotiable; in this context, the CAP theorem proves that it is impossible to provide both availability and strong consistency at the same time. Sacrificing strong consistency exposes developers to anomalies that are complex to program against. AntidoteDB is a database designed for geo-replication. As it aims to provide high availability with the strongest possible consistency model, it guarantees Transactional Causal Consistency (TCC) and supports CRDTs. TCC means that: (1) if one update happens before another, they are observed in the same order (causal consistency), and (2) updates in the same transaction are observed all-or-nothing. In AntidoteDB, the database is persisted as a journal of operations. In the current implementation, the journal grows without bound. The main objective of this work is to specify a mechanism for pruning the journal safely, by storing recent checkpoints. This will enable faster reads and crash recovery.
Work in cooperation with Annette Bieniusa (Uni Kaiserslautern), Carla Ferreira (Universidade NOVA de Lisboa) and Gustavo Petri (ARM, Cambridge, UK).
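A minimal sketch of checkpoint-based pruning (our simplification over a counter journal; AntidoteDB's actual journal stores arbitrary CRDT operations and must respect causal dependencies):

```python
def checkpoint(journal, upto):
    """Materialize the state summarizing all operations with seq <= upto."""
    state = sum(op for seq, op in journal if seq <= upto)
    return upto, state

def prune(journal, ckpt_seq):
    """Safely drop entries already reflected in the checkpoint."""
    return [(seq, op) for seq, op in journal if seq > ckpt_seq]

def read(ckpt, journal):
    """A read replays only the suffix on top of the checkpoint."""
    seq, state = ckpt
    return state + sum(op for s, op in journal if s > seq)

journal = [(1, +5), (2, -2), (3, +1), (4, +4)]
ckpt = checkpoint(journal, 2)      # state 3 at sequence number 2
journal = prune(journal, ckpt[0])  # only (3, +1) and (4, +4) remain
print(read(ckpt, journal))         # → 8
```

Reads and crash recovery start from the checkpoint instead of replaying the whole history, which is the speed-up the work targets.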
Modern applications are highly distributed and data-intensive. Programming a distributed system is challenging because of asynchrony, failures and trade-offs. In addition, application requirements vary with the use-case and throughout the development cycle. Moreover, existing tools come with restricted expressiveness or limited runtime customizability. Our work aims to address this by improving reuse while maintaining fine-grain control and enhancing dependability. We argue that an environment for composable distributed computing will facilitate the process of developing distributed systems. We use high-level composable specification, verification tools and a distributed runtime.
This work was presented at the EuroSys Doctoral Workshop by Benoît Martin and Laurent Prosperi 37.
The container mechanism supports consolidating several servers on the same machine, thus amortizing cost. To ensure performance isolation between containers, Linux relies on memory limits. However these limits are static, whereas application needs are dynamic; this results in poor performance. To solve this issue, MemOpLight reallocates memory to containers based on dynamic applicative feedback. It rebalances physical memory allocation in favor of under-performing containers, with the aim of improving overall performance. Our research explores the issues, addresses the design of MemOpLight, and validates it experimentally. Our approach increases total satisfaction by 13% compared to the default.
It is standard practice in Infrastructure as a Service to
consolidate several logical servers on the same physical machine,
thus amortizing cost.
However, the execution of one logical server should not disturb the
others: the logical servers should remain isolated from one
another.
To ensure both consolidation and isolation, a recent approach is
“containers,” a group of processes with sharing and isolation
properties.
To ensure memory performance isolation, i.e.,
guaranteeing to each container enough memory for it to perform well,
the administrator limits the total amount of physical memory that a
container may use at the expense of others.
In previous work, we showed that these limits impede memory
consolidation.
Furthermore, the metrics available to the kernel to evaluate its
policies (e.g., frequency of page faults, I/O requests, use of
CPU cycles, etc.), are not directly relevant to performance as
experienced from the application perspective, which is better
characterized by, for instance, response time or throughput measured at
application level.
To solve these problems, we propose a new approach, called the Memory Optimization Light (MemOpLight). It is based on application-level feedback from containers. Our mechanism aims to rebalance memory allocation in favor of unsatisfied containers, while not penalizing the satisfied ones. By doing so, we guarantee application satisfaction, while consolidating memory; this also improves overall resource consumption.
Our main contributions are the following:
These results were published at NCA 2020.
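The feedback-driven rebalancing idea can be sketched as follows (the step and floor values are illustrative, and the names are ours; the real mechanism operates inside the Linux kernel on per-container limits, driven by application-level satisfaction feedback):

```python
STEP, FLOOR = 64, 128  # MiB; illustrative values, not MemOpLight's

def rebalance(limits, satisfied):
    """Shift memory from satisfied containers to unsatisfied ones in
    fixed steps, never driving a donor below a safety floor."""
    donors = [c for c, ok in satisfied.items()
              if ok and limits[c] - STEP >= FLOOR]
    takers = [c for c, ok in satisfied.items() if not ok]
    for donor, taker in zip(donors, takers):
        limits[donor] -= STEP   # satisfied container gives up memory
        limits[taker] += STEP   # unsatisfied container receives it
    return limits

limits = {"web": 512, "db": 512, "batch": 512}
satisfied = {"web": True, "db": False, "batch": True}
print(rebalance(limits, satisfied))  # db gains one step from a donor
```

Because donations come only from containers that report being satisfied, the loop consolidates memory without penalizing applications that are already performing well.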
In modern server CPUs, individual cores can run at different frequencies, which allows for fine-grained control of the performance/energy tradeoff. Adjusting the frequency, however, incurs a high latency. We find that this can lead to a problem of frequency inversion, whereby the Linux scheduler places a newly active thread on an idle core that takes dozens to hundreds of milliseconds to reach a high frequency, just before another core already running at a high frequency becomes idle. In 28, we first illustrate the significant performance overhead of repeated frequency inversion through a case study of scheduler behavior during the compilation of the Linux kernel on an 80-core Intel® Xeon-based machine. Following this, we propose two strategies to reduce the likelihood of frequency inversion in the Linux scheduler. When benchmarked over 60 diverse applications on the Intel® Xeon, the better performing strategy, Smove, improves performance by more than 5% (at most 56% with no energy overhead) for 23 applications, and worsens performance by more than 5% (at most 8%) for only 3 applications. On a 4-core AMD Ryzen we obtain performance improvements up to 56%.
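The core-selection intuition can be sketched as follows. This is a toy illustration of avoiding frequency inversion, not the actual Smove strategy or kernel code; the data layout and function are our assumptions:

```python
def pick_core(cores):
    """cores: list of (core_id, is_idle, freq_mhz) tuples.
    Among idle cores, prefer the one already running at the highest
    frequency, so the new thread avoids a slow frequency ramp-up."""
    idle = [(cid, freq) for cid, is_idle, freq in cores if is_idle]
    if not idle:
        return None
    return max(idle, key=lambda c: c[1])[0]

# Core 0 is idle but slow; core 2 is idle and already near max frequency.
cores = [(0, True, 1200), (1, False, 3000), (2, True, 2800)]
print(pick_core(cores))  # → 2
```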
DELYS has a CIFRE contract with Scality SA:
DELYS has three contracts with Orange within the I/O Lab joint laboratory:
Marc Shapiro received support from Inria Startup Studio to incubate start-up concordant.io, developing CRDT-based solutions for geo-scale and edge distribution of data. ISS supports two software engineers for 12 months.
The core of ESTATE consists in laying the foundations of a new algorithmic framework for enabling Autonomic Computing
in distributed and highly dynamic systems and networks.
We plan to design a model that includes the minimal algorithmic basis allowing the emergence of dynamic distributed
systems with self-* capabilities, e.g., self-organization, self-healing, self-configuration, self-management,
self-optimization, self-adaptiveness, or self-repair.
In order to do this, we consider three main research streams:
The coordinator of ESTATE is Franck Petit.
RainbowFS proposes a “just-right” approach to storage and consistency, for developing distributed, cloud-scale applications. Existing approaches shoehorn the application design to some predefined consistency model, but no single model is appropriate for all uses. Instead, we propose tools to co-design the application and its consistency protocol. Our approach reconciles the conflicting requirements of availability and performance vs. safety: common-case operations are designed to be asynchronous; synchronisation is used only when strictly necessary to satisfy the application's integrity invariants. Furthermore, we deconstruct classical consistency models into orthogonal primitives that the developer can compose efficiently, and provide a number of tools for quick, efficient and correct cloud-scale deployment and execution. Using this methodology, we will develop an enterprise-grade, highly-scalable file system, exploring the rainbow of possible semantics, and we demonstrate it in a massive experiment.
The coordinator of RainbowFS is Marc Shapiro.
The goal is to propose an autonomic Fog system designed in a generic way. To this end, we will address several open challenges: 1) Provide an Architecture Description Language (ADL) for modeling Fog systems and their specific features, such as the locality concept, QoS constraints applied to resources and their dependencies, the dynamicity of considered workloads, etc. This ADL should be generic and customizable to address any possible kind of Fog system. 2) Support collaborative decision-making between a fleet of small autonomic controllers distributed over the Fog. Tackling the convergence of local decisions to obtain a shared and consistent decision among these autonomic controllers requires new distributed agreement protocols based on distributed consensus algorithms. 3) Support the automatic generation and coordination of reconfiguration plans between the autonomic controllers. Even if each controller gets a new local target configuration to apply from the consensus, the execution plan of the overall reconfiguration needs to be generated and coordinated to minimize the disruption time and avoid failures. 4) Design and implement a fully open-source framework usable in a standalone way or integrated with standard solutions (e.g., Kubernetes). The project targets in particular the future generation of Fog architects and DevOps engineers. We plan to evaluate the solution both on simulated Fog infrastructures and on real infrastructures.
The local coordinator of SeMaFor in Delys is Jonathan Lejeune.
Marc Shapiro organised a series of seminars on the Loi de programmation pluriannuelle de la recherche (French bill organising the next 10 years of publicly-funded research) open to all scholars in Informatics.
Speakers: Sebastian Stride (SIRIS Academic), Antoine Petit (head of CNRS), Patrick Lemaire (leader of the assembly of French learned societies), Christine Musselin (Institut d'études politiques, Paris), Pierre Ouzoulias (CNRS, senator, member of OPECST).
Transaction on Computers (P. Sens), Journal of Parallel and Distributed Computing (L. Arantes, P. Sens), Theoretical Computer Science (S. Dubois), Transactions on Parallel and Distributed Computing (M. Shapiro,
Marc Shapiro, Vice President for Research of Société informatique de France (SIF), the French learned society in Informatics.
Pierre Sens was the reviewer of:
Pierre Sens was Chair of
Colette Johnen was the reviewer of