Section: New Results

Fault Tolerance in Distributed Networks

Verification of population protocols

Participants : Hugues Fauconnier, Carole Gallet-Delporte.

In [15], we address the problem of verifying, by model checking, the basic population protocol (PP) model of Angluin et al. This problem has received special attention in the last two years, and new tools have been proposed to deal with it. We show that the problem can be solved with existing model-checking tools, e.g., Spin and Prism. To do so, we apply the counter abstraction to obtain an abstraction of the PP model that these tools can verify efficiently. Moreover, this abstraction preserves the correct stabilization property of PP models. To deal with the fairness assumed by PP models, we provide two new recipes. The first gives sufficient conditions under which the PP model fairness can be replaced by the weak fairness implemented in Spin. We show that this recipe can be applied to several PP models. In the second recipe, we show how to use probabilistic model checking and, in particular, Prism to take the fairness of PP models fully into account. The correctness of this recipe is based on existing theorems on finite discrete Markov chains. An abstract of this paper has also been published in [34].
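As a rough illustration of the counter abstraction mentioned above, the following sketch represents a configuration of a population protocol only by the number of agents in each state, and enumerates the abstract configurations reachable from an initial one. The protocol (a one-way epidemic in which an informed agent informs an uninformed one), the state names and the function names are assumptions chosen for illustration; they are not taken from [15], and the sketch does not model fairness or use Spin or Prism.

```python
from collections import deque

# Hypothetical protocol (assumption, not from [15]): one-way epidemic.
# States: "I" (informed) and "U" (uninformed).
# A rule maps an ordered pair (initiator, responder) to a new pair of states.
STATES = ("I", "U")
RULES = {("I", "U"): ("I", "I"), ("U", "I"): ("I", "I")}
INDEX = {s: i for i, s in enumerate(STATES)}


def successors(config):
    """Yield the counter configurations reachable by one interaction.

    config is a tuple of counters, one per state in STATES.
    """
    for (p, q), (p2, q2) in RULES.items():
        # Two distinct agents are needed, so a rule (p, p) needs two agents in p.
        needed_p = 2 if p == q else 1
        if config[INDEX[p]] < needed_p or config[INDEX[q]] < 1:
            continue
        counts = list(config)
        counts[INDEX[p]] -= 1
        counts[INDEX[q]] -= 1
        counts[INDEX[p2]] += 1
        counts[INDEX[q2]] += 1
        yield tuple(counts)


def reachable(initial):
    """Breadth-first exploration of the abstract (counter) state space."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for nxt in successors(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    # One informed agent and three uninformed agents.
    print(sorted(reachable((1, 3))))
```

With one informed and three uninformed agents, the abstract state space contains four configurations, whereas tracking each of the four agents individually would give up to 2^4 concrete configurations.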

Failure Detection

Participants : Hugues Fauconnier, Carole Gallet-Delporte.

What does it mean to solve a distributed task? In Paxos, Lamport proposed a definition of solvability in which every process is split into a proposer that submits commands to be executed, an acceptor that takes care of the command execution order, and a learner that receives the outcomes of executed commands. The resulting perspective of computation in which every proposed command can be executed, be its proposer correct or faulty, proved to be very useful when processes take steps on behalf of each other, i.e., in simulations.

Most interesting tasks cannot be solved asynchronously, and failure detectors were proposed to circumvent these impossibilities. Alas, when it comes to solving a task using a failure detector, we cannot leverage simulation-based techniques. A process cannot perform steps of failure detector-based computation on behalf of another process, since it cannot access the remote failure-detector module.

In [17], we propose a new definition of solving a task with a failure detector in which computation processes that propose inputs and provide outputs are treated separately from synchronization processes that coordinate using a failure detector. In the resulting framework, any failure detector is shown to be equivalent to the availability of some k-set agreement. As a corollary, we obtain a complete classification of tasks, including ones that evaded comprehensible characterization so far, such as renaming.

Shared objects such as atomic registers, test-and-set, and compare-and-swap are classical hardware primitives that help in developing fault-tolerant distributed applications. To compare shared objects, in [41] we consider their implementations in message-passing models. Using the minimal failure detector for each object, we obtain a new hierarchy that has only two levels. This paper summarizes recent work and results on this topic.

In [7], we first define the basic notions of local and non-local tasks for distributed systems. Intuitively, a task is local if, in a system with no failures, each process can compute its output value locally by applying some local function to its own input value (so the output value of each process depends only on the process's own input value, not on the input values of the other processes); a task is non-local otherwise. All the interesting distributed tasks, including all those that have been investigated in the literature (e.g., consensus, set agreement, renaming, atomic commit, etc.), are non-local.

In this paper we consider non-local tasks and determine the minimum information about failures that is necessary to solve such tasks in message-passing distributed systems. As part of this work, we also introduce weak set agreement, a natural weakening of set agreement, and show that, in some precise sense, it is the weakest non-local task in message-passing systems.

Adversary disagreement and Byzantine agreement

Participants : Hugues Fauconnier, Carole Gallet-Delporte.

At the heart of distributed computing lies the fundamental result that the level of agreement that can be obtained in an asynchronous shared memory model where $t$ processes can crash is exactly $t+1$. In other words, an adversary that can crash any subset of size at most $t$ can prevent the processes from agreeing on $t$ values. But what about all the other $2^{2^n-1} - (n+1)$ adversaries that are not uniform in this sense and might crash certain combinations of processes and not others? In [6], we present a precise way to classify all adversaries. We introduce the notion of disagreement power: the biggest integer $k$ for which the adversary can prevent processes from agreeing on $k$ values. We show how to compute the disagreement power of an adversary and derive $n$ equivalence classes of adversaries.

So far, the distributed computing community has either assumed that all the processes of a distributed system have distinct identifiers or, more rarely, that the processes are anonymous and have no identifiers. These are two extremes of the same general model: namely, $n$ processes use $\ell$ different authenticated identifiers, where $1 \le \ell \le n$. In [18], we ask how many identifiers are actually needed to reach agreement in a distributed system with $t$ Byzantine processes.

We show that having $3t+1$ identifiers is necessary and sufficient for agreement in the synchronous case but, more surprisingly, the number of identifiers must be greater than $(n+3t)/2$ in the partially synchronous case. This demonstrates two differences from the classical model (which has $\ell = n$): there are situations where relaxing synchrony to partial synchrony renders agreement impossible; and, in the partially synchronous case, increasing the number of correct processes can actually make it harder to reach agreement. The impossibility proofs use the fact that a Byzantine process can send multiple messages to the same recipient in a round. We show that removing this ability makes agreement easier: then, $t+1$ identifiers are sufficient for agreement, even in the partially synchronous model.
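The two bounds can be checked numerically. The sketch below only encodes the inequalities quoted above, reading "having 3t+1 identifiers" as "at least 3t+1 identifiers"; the function name, the parameter names and the model labels are assumptions made for illustration, and the function says nothing about how agreement is actually reached.

```python
def meets_identifier_bound(n, num_ids, t, model):
    """Return True if num_ids identifiers satisfy the bound quoted for `model`.

    n       -- number of processes
    num_ids -- number of distinct authenticated identifiers (1 <= num_ids <= n)
    t       -- number of Byzantine processes
    model   -- "synchronous" or "partially_synchronous" (hypothetical labels)
    """
    if not 1 <= num_ids <= n:
        raise ValueError("expected 1 <= num_ids <= n")
    if model == "synchronous":
        # Assumed reading: at least 3t + 1 identifiers.
        return num_ids >= 3 * t + 1
    if model == "partially_synchronous":
        # num_ids > (n + 3t) / 2, written with integers only.
        return 2 * num_ids > n + 3 * t
    raise ValueError("unknown model: " + model)


if __name__ == "__main__":
    t, num_ids = 1, 4
    for n in (4, 5):
        for model in ("synchronous", "partially_synchronous"):
            print(n, model, meets_identifier_bound(n, num_ids, t, model))
```

With t = 1 and four identifiers, both bounds hold for n = 4. Adding one more process (n = 5) leaves the synchronous bound satisfied but violates the partially synchronous one, since 4 is not greater than (5 + 3)/2 = 4. This small instance exhibits both differences from the classical model noted above.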

Fast and compact self stabilizing verification, computation, and fault detection of an MST

Participants : Amos Korman [CNRS LIAFA, University of Paris Diderot, France] , Shay Kutten [Technion, Israel] , Toshimitsu Masuzawa [Osaka University, Japan] .

In [27], we address the impact of optimizing the memory size on the time complexity, and show that this carries at most a small cost in terms of time in the context of MST. Specifically, we present a self-stabilizing distributed verification algorithm whose time complexity is $O(\log^2 n)$ in synchronous networks, or $O(\Delta \log^2 n)$ in asynchronous networks, where $\Delta$ denotes the largest degree of a node. More importantly, the memory size at each node remains optimal, $O(\log n)$ bits, throughout the execution. This answers an open problem posed by Awerbuch and Varghese (FOCS 1991). We also show that $\Omega(\log n)$ time is necessary if the memory size is restricted to $O(\log n)$ bits, even in synchronous networks. We demonstrate the usefulness of our verification scheme by using it as a module in a new self-stabilizing MST construction algorithm. This algorithm has the important property that, if faults occur after the construction has ended, they are detected by some nodes within $O(\log^2 n)$ time in synchronous networks, or within $O(\Delta \log^2 n)$ time in asynchronous networks. The rest of the nodes detect within $O(D \log n)$ time, where $D$ denotes the diameter. Moreover, if a constant number of faults occur, then, within the required detection time above, they are detected by some node in the $O(\log n)$ locality of each of the faults. The memory size of the self-stabilizing MST construction is $O(\log n)$ bits per node (optimal), and the time complexity is $O(n)$. This time complexity is significantly better than the best time complexity of previous self-stabilizing MST algorithms, which was $\Omega(n^2)$ even when using memory of $\Omega(\log^2 n)$ bits, and even without having the above localized fault detection property. The time complexity of previous algorithms that used $O(\log n)$ memory size was $O(n|E|)$.
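To make the verification task concrete, here is a minimal centralized reference check, assuming a connected weighted graph given as an edge list. It is not the self-stabilizing distributed scheme of [27]: it simply compares the weight of a candidate spanning tree with the weight found by Kruskal's algorithm. The function names and the edge representation are assumptions made for illustration.

```python
def _forest_weight(n, edges, reject_cycles):
    """Greedily add edges with union-find; return total weight if n-1 edges join."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, used = 0, 0
    for u, v, w in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            if reject_cycles:
                return None  # the candidate contains a cycle
            continue
        parent[ru] = rv
        total += w
        used += 1
    return total if used == n - 1 else None


def is_mst(n, edges, tree):
    """Check whether `tree` is a minimum spanning tree of the graph.

    n     -- number of nodes, labelled 0..n-1
    edges -- list of (u, v, weight) for the whole graph
    tree  -- list of (u, v, weight) for the candidate tree
    """
    def norm(e):
        return (min(e[0], e[1]), max(e[0], e[1]), e[2])

    if not {norm(e) for e in tree} <= {norm(e) for e in edges}:
        return False  # candidate uses an edge that is not in the graph
    tree_weight = _forest_weight(n, tree, reject_cycles=True)
    best_weight = _forest_weight(n, sorted(edges, key=lambda e: e[2]), False)
    return tree_weight is not None and tree_weight == best_weight


if __name__ == "__main__":
    graph = [(0, 1, 1), (1, 2, 2), (0, 2, 3)]
    print(is_mst(3, graph, [(0, 1, 1), (1, 2, 2)]))
    print(is_mst(3, graph, [(0, 1, 1), (0, 2, 3)]))
```

Such a check needs a global view of the graph, which is what a distributed verification algorithm with $O(\log n)$ bits per node has to avoid.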