CEPAGE is an INRIA team run jointly with the University of Bordeaux (UB1 and ENSEIRB) and CNRS (LaBRI, UMR 5800).
The development of interconnection networks has led to the emergence of new types of computing platforms. These platforms are characterized by heterogeneity of both processing and communication resources, geographical dispersion, and instability in terms of the number and performance of participating resources. These characteristics restrict the nature of the applications that can perform well on these platforms. Due to middleware and application deployment times, applications must be long-running and involve large amounts of data; also, only loosely-coupled applications may currently be executed on unstable platforms.
The new algorithmic challenges associated with these platforms have been approached from two different directions. On the one hand, the parallel algorithms community has largely concentrated on the problems associated with heterogeneity and large amounts of data. On the other hand, the distributed systems community has focused on scalability and fault-tolerance issues. The success of file sharing applications demonstrates the capacity of the resulting algorithms to manage huge volumes of data and users on large unstable platforms. Algorithms developed within this context are completely distributed and based on peer-to-peer (P2P for short) communication.
The goal of our project is to establish a link between these two directions, by gathering researchers from the distributed algorithms and data structures, parallel algorithms, and randomized algorithms communities. More precisely, the objective of our project is to broaden the range of applications that can be executed on large scale distributed platforms. Indeed, whereas protocols designed for P2P file exchange are actually distributed, computationally intensive applications executed on large scale platforms (BOINC
Projects must meet three basic technological requirements, to ensure benefits from grid computing:
Projects should have a need for millions of CPU hours of computation to proceed. However, humanitarian projects with smaller CPU hour requirements are able to apply.
The computer software algorithms required to accomplish the computations should be such that they can be subdivided into many smaller independent computations.
If very large amounts of data are required, there should also be a way to partition the data into sufficiently small units corresponding to the computations.
Given these constraints, applications using large data sets should be such that they can be arbitrarily split into small pieces of data (such as Seti@home
These constraints are related to both security and algorithmic issues. Security is of course an important issue, since executing non-certified code on non-certified data on a large scale, open, distributed platform is clearly unacceptable. Nevertheless, we believe that external techniques, such as sandboxing and the certification of data and code through hashcode mechanisms, should be used to solve these problems. Therefore, the focus of our project is on algorithmic issues and, in what follows, we assume a cooperative environment of well-intentioned users, and we assume that security and cooperation can be enforced by external mechanisms. Our goal is to demonstrate that the gains in performance and the extension of the application field justify these extra costs but that, just as operating systems do for multi-user environments, security and cooperation issues should not affect the design of efficient algorithms nor reduce the application field.
We will concentrate on the design of new services for computationally intensive applications, consisting of mostly independent tasks sharing data, with application to distributed storage, molecular dynamics and distributed continuous integration, that will be described in more detail in Section .
Most of the research (including ours) currently carried out on these topics relies on a centralized knowledge of the whole (topology and performances) execution platform, whereas recent evolutions in computer networks technology yield a tremendous change in the scale of these networks. The solutions designed for scheduling and managing compact data structures must be adapted to these systems, characterized by a high dynamism of their entities (participants can join and leave at will), a potential instability of the large scale networks (on which concurrent applications are running), and the increasing probability of failure.
P2P systems have achieved stability and fault-tolerance, as witnessed by their wide and intensive usage, by changing the view of the networks: all communication occurs on a logical network (fixed even though resources change over time), thus abstracting the actual performance of the underlying physical network. Nevertheless, disconnecting physical and logical networks leads to low performance and a waste of resources. Moreover, due to their original use (file exchange), those systems are well suited to exact search using Distributed Hash Tables (DHTs) and are based on fixed regular virtual topologies (hypercubes, De Bruijn graphs, ...). In the context of the applications we consider, more complex queries will be required (finding the set of edges used for content distribution, finding a set of replicas covering the whole database) and, in order to reach efficiency, unstructured virtual topologies must be considered.
In this context, the main scientific challenges of our project are:
Models:
At a low level, to understand the underlying physical topology and to obtain models that are both realistic and instantiable. This requires expertise in graph theory (all the members of the project) and platform modelling (Olivier Beaumont, Nicolas Bonichon, Lionel Eyraud and Ralf Klasing). The obtained results will be used to focus the algorithms designed in Sections and .
At a higher level, to derive models of the dynamism of targeted platforms, both in terms of participating resources and resource performances (Olivier Beaumont, Philippe Duchon). Our goal is to derive suitable tools to analyze and prove algorithm performances in dynamic conditions rather than to propose stochastic modeling of evolutions (Section ).
Overlays and distributed algorithms:
To understand how to augment the logical topology in order to achieve the good properties of P2P systems. This requires knowledge in P2P systems and small-world networks (Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Nicolas Hanusse, Cyril Gavoille). The obtained results will be used for developing the algorithms designed in Sections and .
To build overlays dedicated to specific applications and services that achieve good performance (Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Lionel Eyraud, Ralf Klasing). The set of applications and services we target will be described in more detail in Section and .
To understand how to dynamically adapt scheduling algorithms (in particular collective communication schemes) to changes in network performance and topology, using randomized algorithms (Olivier Beaumont, Nicolas Bonichon, Nicolas Hanusse, Philippe Duchon, Ralf Klasing) (Section ).
Compact and distributed data structures:
To understand how to dynamically adapt compact data structures to changes in network performance and topology (Nicolas Hanusse, Cyril Gavoille) (Section ).
To design sophisticated labeling schemes in order to answer complex predicates using local labels only (Nicolas Hanusse, Cyril Gavoille) (Section ).
We will detail in Section how the various areas of expertise in the team will be employed for the considered applications.
We therefore tackle several problems related to two priorities that INRIA identified in its strategic plan (2008-2012): "Modeling, Simulation and Optimization of Complex Dynamic Systems" and "Information, Computation and Communication Everywhere".
The recent evolutions in computer networks technology, as well as their diversification, yield a tremendous change in the use of these networks: applications and systems can now be designed at a much larger scale than before. This growth in scale concerns the amount of data, the number of computers, the number of users, and the geographical diversity of these users. This race towards large scale computing has two major implications. First, new opportunities are offered to the applications, in particular as far as scientific computing, databases, and file sharing are concerned. Second, a large number of parallel or distributed algorithms developed for average size systems cannot be run on large scale systems without a significant degradation of their performance. In fact, one must probably relax the constraints that the system should satisfy in order to run at a larger scale. In particular, the coherence protocols designed for distributed applications are too demanding in terms of both message and time complexity, and must therefore be adapted for running at a larger scale. Moreover, most distributed systems deployed nowadays are characterized by a high dynamism of their entities (participants can join and leave at will), a potential instability of the large scale networks (on which concurrent applications are running), and an increasing individual probability of failure. Therefore, as the size of the system increases, it becomes necessary that it adapts automatically to the changes of its components, requiring self-organization of the system to deal with the arrival and departure of participants, data, or resources.
As a consequence, it becomes crucial to be able to understand and model the behavior of large scale systems, to efficiently exploit these infrastructures, in particular w.r.t. designing dedicated algorithms handling a large amount of users and/or data.
In the case of parallel computation solutions, some strategies have been developed in order to cope with the intrinsic difficulty induced by resource heterogeneity. It has been proved that changing the metric (from makespan minimization to throughput maximization) simplifies most scheduling problems, both for collective communications and parallel processing. This restricts the use of target platforms to simple and regular applications, but due to the time needed to develop and deploy applications on large scale distributed platforms, the risk of failures, and the intrinsic dynamism of resources, it is unrealistic to consider tightly coupled applications involving many tight synchronizations. Nevertheless, (1) it is unclear how the current models can be adapted to large scale systems, and (2) the current methodology requires the use of (at least partially) centralized subroutines that cannot be run on large scale systems. In particular, these subroutines assume the ability to gather all the information regarding the network at a single node (topology, resource performance, etc.). This assumption is unrealistic in a general purpose large size platform, in which the nodes are unstable, and whose resource characteristics can vary abruptly over time. Moreover, the proposed solutions for small to average size, stable, and dedicated environments do not satisfy the minimal requirements for self-organization and fault-tolerance, two properties that are unavoidable in a large scale context. Therefore, there is a strong need to design efficient and decentralized algorithms. This requires, in particular, defining new metrics adapted to large scale dynamic platforms in order to analyze the performance of the proposed algorithms.
As already noted, P2P file sharing applications have been successfully deployed on large scale dynamic platforms. Nevertheless, since our goal is the design of efficient algorithms in terms of actual performance and resource consumption, we need to concentrate on specific P2P environments. Indeed, P2P protocols are mostly designed for file sharing applications, and are not optimized for scientific applications, nor are they adapted to sophisticated database applications. This is mainly due to the original goal of designing file sharing applications, where anonymity is crucial, only exact queries are used, and all large file communications are made at the IP level.
Unfortunately, the context strongly differs for the applications we consider in our project, and some of the constraints appear to be in contradiction with performance and resource consumption optimization. For instance, in these systems, due to anonymity, the number of neighboring nodes in the overlay network (i.e. the number of IP addresses known to each peer) is kept relatively low, much lower than what the memory constraints on the nodes actually impose. Such a constraint induces longer routes between peers, and is therefore in contradiction with performance. In those systems, with the main exception of the LAND overlay, the overlay network (induced by the connections of each peer) is kept as far as possible separate from the underlying physical network. This property is essential in order to cope with malicious attacks, i.e. to ensure that even if a geographic site is attacked and disconnected from the rest of the network, the overall network will remain connected. Again, since actual communications occur between peers connected in the overlay network, communications between two close nodes (in the physical network) may well involve many wide area messages, and therefore such a constraint is in contradiction with performance optimization. Fortunately, in the case of file sharing applications, only queries are transmitted using the overlay network, and the communication of large files is made at IP level. On the other hand, in the case of more complex communication schemes, such as broadcast or multicast, the communication of large files is done using the overlay network, due to the lack of support, at IP level, for those complex operations. In this case, in order to achieve good results, it is crucial that virtual and physical topologies be as close as possible.
Our aim is to target large scale platforms. From parallel processing, we keep the idea that resource heterogeneity dramatically complicates scheduling problems, which forces us to restrict ourselves to simple applications. The dynamism of both the topology and the performance reinforces this constraint. We will also adopt the throughput maximization objective, though it needs to be adapted to more dynamic platforms and resources.
From previous work on P2P systems, we keep the idea that there is no centralized large server and that all participating nodes play a symmetric role (according to their performance in terms of memory, processing power, incoming and outgoing bandwidths, etc.), which imposes the design of self-adapting protocols, where any kind of central control should be avoided as much as possible.
Since dynamism constitutes the main difficulty in the design of algorithms on large scale dynamic platforms, we will consider several layers in dynamism:
Stable: In order to establish the complexity induced by dynamism, we will first consider fully heterogeneous (in terms of both processing and communication resources) but fully stable platforms (where both topology and performance are constant over time).
Semi-stable: In order to establish the complexity induced by fault-tolerance, we will then consider fully heterogeneous platforms where resource performance varies over time, but topology is fixed.
Unstable: Finally, we will target systems facing the arrival and departure of participants, data or resources.
The International Symposium on Distributed Computing (DISC) is one of the leading international conferences in the area of foundations of distributed computing (together with PODC). It has a long tradition (it is now in its 22nd edition) and awards (together with PODC) the prestigious Edsger W. Dijkstra Prize in Distributed Computing to an outstanding paper on the principles of distributed computing. DISC has about 100 participants every year from more than 20 different countries. CEPAGE was chosen by the steering committee of DISC to host the 22nd edition of DISC in 2008. The conference was organized by CEPAGE in Arcachon from 22 to 24 September 2008. There were around 140 participants in the conference and the co-located workshops.
The members of Cepage are involved in the following program committees in 2008 and 2009 (either as PC Chair or PC Member): PMAA'08, RenPar'08, HeteroPar'08, PODC'08, DISC'08, ICPADS'08, ISCIS'08, AlgoTel'08, JDIR'08, AlgoTel'09, SIROCCO'09, ISPDC'09, IPDPS'09. They also serve on the editorial boards of the following journals: Networks, Parallel Processing Letters, Algorithmic Operations Research, and Computing and Informatics.
Modeling the platform dynamics in a satisfying manner, in order to design and analyze efficient algorithms, is a major challenge. In a semi-stable platform, the performance of individual nodes (be they computing or communication resources) will fluctuate; in a fully dynamic platform, which is our ultimate target, the set of available nodes will also change over time, and algorithms must take these changes into account if they are to be efficient.
There are basically two ways one can model such evolution: one can use a stochastic process, or some kind of adversary model.
In a stochastic model, the platform evolution is governed by some specific probability distribution. One obvious advantage of such a model is that it can be simulated and, in many well-studied cases, analyzed in detail. The two main disadvantages are that it can be hard to determine how much of the resulting algorithm performance comes from the specifics of the evolution process, and that it is difficult to estimate how realistic a given model is (none of the current project participants are metrology experts).
In an adversary model, it is assumed that these unpredictable changes are under the control of an adversary whose goal is to interfere with the algorithms' efficiency. Major assumptions on the system's behavior can be included in the form of restrictions on what this adversary can do (like maintaining a given level of connectivity). Such models are typically more general than stochastic models, in that many stochastic models can be seen as a probabilistic specialization of a nondeterministic model (at least for bounded time intervals, and up to negligible probabilities of adopting "forbidden" behaviors).
Since we aim at proving guaranteed performance for our algorithms, we want to concentrate on suitably restricted adversary models. The main challenge in this direction is thus to describe sets of restricted behaviors that both capture realistic situations and make it possible to prove such guarantees.
On the other hand, in order to establish complexity and approximation results, we also need to rely on a precise theoretical model of the targeted platforms.
At a lower level, several models have been proposed to describe interference between several simultaneous communications. In the 1-port model, a node cannot simultaneously send to (and/or receive from) more than one node. Most of the “steady state” scheduling results have been obtained using this model. On the other hand, some authors propose to model incoming and outgoing communication from a node using fictitious incoming and outgoing links, whose bandwidths are fixed. The main advantage of this model, although it might be slightly less accurate, is that it does not require strong synchronization and that many scheduling problems can be expressed as multi-commodity flow problems, for which decentralized efficient algorithms are known. Another important issue is to model the bandwidth actually allocated to each communication when several communications compete for the same long-distance link.
At a higher level, proving good approximation ratios on general graphs may be too difficult, and it has been observed that actual platforms often exhibit a simple structure. For instance, many real life networks satisfy small-world properties, and it has been proved that greedy routing protocols on small-world networks achieve good performance. It is therefore of interest to prove that logical platforms (given by the interactions between hosts) and physical platforms (given by the network links) exhibit some structure in order to derive efficient algorithms.
In order to analyze the performance of the proposed algorithms, we first need to define a metric adapted to the targeted platform. In particular, since resource performance and topology may change over time, the metric should also be defined from the optimal performance of the platform at any time step. For instance, if throughput maximization is concerned, the objective is to provide for the proposed algorithm an approximation ratio with respect to
or at least
For instance, Awerbuch and Leighton developed a very nice distributed algorithm for computing multi-flows. Their algorithm consists in associating queues and potentials with each commodity at each node for all incoming or outgoing edges. These regular queues store the flow that has not yet reached its destination. Using a very simple and very natural framework, flow goes from high potential areas (the sources) to low potential areas (the sinks). This algorithm is fully decentralized since nodes make their decisions depending on their state (the size of their queues), the state of their neighbors (the size of their queues), and the capacity of neighboring links.
The remarkable property of this algorithm is that if, at any time step, the network is able to ship $(1+\varepsilon)\,d_i$ flow units for each commodity $i$, then the algorithm will ship at least $d_i$ units of flow at steady state. The proof of this property is based on the overall potential of all the queues in the network, which remains bounded over time.
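The local balancing idea behind such algorithms can be illustrated with a minimal sketch. The code below is not the algorithm of Awerbuch and Leighton; it is a simplified single-commodity variant written for illustration, assuming a static undirected network, a hypothetical function name `simulate_balancing`, and a damping factor of twice the maximum degree so that no queue can become negative. At each step the source injects `demand` units, every link moves flow from the longer queue to the shorter one (bounded by the link capacity), and the sink absorbs whatever reaches it.

```python
def simulate_balancing(edges, source, sink, demand, steps):
    """Simplified single-commodity queue balancing (illustrative only).

    edges: dict mapping an undirected link (u, v) to its capacity per step.
    Returns the total flow delivered to the sink and the final queue sizes.
    """
    nodes = {n for link in edges for n in link}
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Assumption: damping by 2 * max degree keeps every queue non-negative.
    divisor = 2 * max(degree.values())

    queue = {n: 0.0 for n in nodes}
    delivered = 0.0
    for _ in range(steps):
        queue[source] += demand
        # Decisions use only the local state observed at the start of the step.
        snapshot = dict(queue)
        for (u, v), capacity in edges.items():
            diff = snapshot[u] - snapshot[v]
            high, low = (u, v) if diff > 0 else (v, u)
            flow = min(capacity, abs(diff) / divisor)
            queue[high] -= flow
            queue[low] += flow
        # The sink consumes everything it has received during this step.
        delivered += queue[sink]
        queue[sink] = 0.0
    return delivered, queue


if __name__ == "__main__":
    path = {("s", "a"): 2.0, ("a", "t"): 2.0}
    total, queues = simulate_balancing(path, "s", "t", demand=1.0, steps=1000)
    print("average delivered rate:", total / 1000)
    print("final queues:", queues)
```

On this two-link path, whose capacity exceeds the injected demand, the queues stay bounded and the delivered rate approaches the injected rate, which is the behaviour that the potential argument captures.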
It is worth noting that this algorithm is quasi-optimal for the metrics we defined above, since the overall throughput can be made arbitrarily close to
In this context, the approximation result is given under an adversary model, where the adversary can change both the topology and the performances of communication resources between any two steps, provided that the network is able to ship $(1+\varepsilon)\,d_i$.
Most scheduling problems are NP-complete, and inapproximability results exist in on-line settings, especially when resources are heterogeneous. Therefore, we need to rely on simplified communication models (see next section) to prove theoretical results. In this context, resource augmentation techniques are very useful. They consist in identifying a weak parameter (a parameter whose value can be slightly increased without breaking any strong modeling constraint) and then comparing the solution produced by a polynomial time algorithm (with this relaxed constraint) with the optimal solution of the NP-complete problem (without resource augmentation). This technique is both pertinent in a difficult setting and useful in practice.
In the context of large scale dynamic platforms, it is unrealistic to determine precisely the actual topology and the contention of the underlying network at application level. Indeed, existing tools such as Alnem are very much based on quasi-exhaustive determination of interferences, and it takes several days to determine the actual topology of a platform made up of a few tens of nodes. Given the dynamism of the platforms we target, we need to rely on less sophisticated models, whose parameters can be evaluated at runtime.
Therefore, we propose to model each node by an incoming and an outgoing bandwidth and to neglect interference that appears at the heart of the network (Internet), in order to concentrate on local constraints. We are currently implementing a script, based on Iperf, to determine the achieved bit-rates for one-to-one, one-to-many and many-to-one transfers, given the number of TCP connections and the maximal size of the TCP windows. The next step will be to build a communication protocol that enforces a prescribed sharing of the network resources. In particular, if in the optimal solution a node $P_0$ must send data at rate $x_i^{out}$ to node $P_i$ and receive data at rate $y_j^{in}$ from node $P_j$, the goal is to achieve the prescribed bitrates, provided that all capacity constraints are satisfied at each node. Our aim is to implement, using Java RMI, a protocol able both to evaluate the parameters of our model (incoming and outgoing bandwidths) and to ensure a prescribed sharing of communication resources.
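The local capacity constraints of this model can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `rates_feasible`, the dictionary layout and the numerical tolerance are hypothetical and are not part of the project's Java RMI protocol. It tests whether a set of prescribed point-to-point rates respects the outgoing bandwidth of every sender and the incoming bandwidth of every receiver, ignoring interference inside the network as the model does.

```python
def rates_feasible(rates, out_bw, in_bw, tolerance=1e-9):
    """Check prescribed rates against per-node bandwidth limits.

    rates:  dict mapping (sender, receiver) to the prescribed rate.
    out_bw: dict mapping each node to its outgoing bandwidth.
    in_bw:  dict mapping each node to its incoming bandwidth.
    """
    sent = {}
    received = {}
    for (sender, receiver), rate in rates.items():
        if rate < 0:
            return False
        sent[sender] = sent.get(sender, 0.0) + rate
        received[receiver] = received.get(receiver, 0.0) + rate

    # Only local constraints are checked: one per sender, one per receiver.
    senders_ok = all(total <= out_bw[n] + tolerance for n, total in sent.items())
    receivers_ok = all(total <= in_bw[n] + tolerance for n, total in received.items())
    return senders_ok and receivers_ok


if __name__ == "__main__":
    bandwidth = {"P0": 10.0, "P1": 5.0, "P2": 5.0}
    prescribed = {("P0", "P1"): 4.0, ("P0", "P2"): 5.0, ("P2", "P0"): 3.0}
    print(rates_feasible(prescribed, bandwidth, bandwidth))
```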
Under this communication model, it is possible to obtain pathological results. For instance, if we consider a master-slave setting (corresponding to the distribution of independent tasks on a Volunteer Computing platform such as BOINC), the number of slaves connected to the master may be unbounded. In fact, simultaneously opening a large number of TCP connections may lead to a bad sharing of communication resources. Therefore, we propose to add a bound on the number of connections that can be handled simultaneously by a given node. Estimating this bound is an important issue in obtaining realistic communication models.
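The effect of such a bound can be illustrated on a simplified master-slave instance. The sketch below assumes that each slave is limited only by its incoming bandwidth and that the master is limited by its outgoing bandwidth and by a maximum number of simultaneous connections; the name `master_throughput` and its parameters are hypothetical. Under these assumptions, serving the slaves with the largest bandwidths first maximizes the aggregate rate.

```python
def master_throughput(master_out_bw, slave_in_bw, max_connections=None):
    """Aggregate rate a master can sustain towards its slaves.

    master_out_bw:   outgoing bandwidth of the master.
    slave_in_bw:     list of incoming bandwidths, one per slave.
    max_connections: bound on simultaneous connections (None means unbounded).
    """
    # Keep the slaves with the largest incoming bandwidth.
    selected = sorted(slave_in_bw, reverse=True)
    if max_connections is not None:
        selected = selected[:max_connections]
    return min(master_out_bw, sum(selected))


if __name__ == "__main__":
    slaves = [4, 3, 2, 1]
    print(master_throughput(10, slaves))                     # no degree bound
    print(master_throughput(10, slaves, max_connections=2))  # at most two slaves
```

With many slow slaves and no bound, the model lets the master saturate its bandwidth by opening one connection per slave, which is the pathological case described above.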
Once low level modeling has been obtained, it is crucial to be able to test the proposed algorithms. To do this, we will first rely on simulation rather than direct experimentation. Indeed, in order to be able to compare heuristics, it is necessary to execute those heuristics on the same platform. In particular, all changes in the topology or in the resource performance should occur at the same time during the execution of the different heuristics. In order to be able to replicate the same scenario several times, we need to rely on simulations. Moreover, the metric we have tentatively defined for providing approximation results in the case of dynamic platforms requires computing the optimal solution at each time step, which can be done off-line if all traces for the different resources are stored. Using simulation rather than experiments can be justified if the simulator itself has been proved valid. Moreover, the modeling of communications, processing and their interactions may be much more complex in the simulator than in the model used to provide a theoretical approximation ratio, such as in SimGrid. In particular, sophisticated TCP models for bandwidth sharing have been implemented in SimGrid.
At a higher level, the derivation of realistic models for large scale platforms is out of the scope of our project. Therefore, in order to obtain traces and models, we will collaborate with the MESCAL, GANG and ASAP projects. We have already worked on these topics with the members of GANG in the ACI Pair-A-Pair (ACI Pair-A-Pair finished in 2006, but we have proposed a follow-up, with the members of the GANG and Cepage projects, to the ANR Blanche program). On the other hand, we also need to rely on an efficient simulator in order to test our algorithms. We have not yet chosen the discrete event simulator we will use for simulations. One attractive possibility would be to adapt SimGrid, developed in the MESCAL project, to large scale dynamic environments. Indeed, a parallel version of SimGrid, based on activations, is currently under development. This version will be able to deal with platforms containing more than $10^5$ resources. SimGrid has been developed by Henri Casanova (U.C. San Diego) and Arnaud Legrand during his PhD (under the co-supervision of O. Beaumont).
Finally, we propose several applications that will be described in detail in Section . These applications cover a large set of fields (molecular dynamics, distributed storage, continuous integration, distributed databases, ...). All these applications will be developed and tested with an academic or industrial partner. In all these collaborations, our goal is to prove that the services that we propose in Section can be integrated as steering tools in already developed software. Our goal is to demonstrate the practical interest of the services we develop and then to integrate and distribute them as a library for large scale computing.
In order to test our algorithms, we propose to implement these services using Java RMI. The main advantages of Java RMI in our context are the ease of use and the portability. Multithreading is also a crucial feature in order to schedule concurrent communications and it does not interfere with ad-hoc routing protocols developed in the project.
A prototype has already been developed in the project as a steering tool for molecular dynamics simulations (see Section ). All the applications will first be tested on small scale platforms (using desktop workstations in the laboratory). Then, in order to test their scalability, we propose to implement them either on the GRID 5000 platform or the partner's platform.
The optimization schemes for content distribution processes or for handling standard queries require a good knowledge of the physical topology or performance (latencies, throughput, ...) of the network. Assuming that some rough estimate of the physical topology is given, former theoretical results described in Section show how to pre-process the network so that local computations are performed efficiently. Due to the dynamism of large distributed platforms, some requirements on the coding of local data structures and the updating mechanism are needed. This last process is done using the maintenance of light virtual networks, so-called overlay networks (see Section ). In our approach, we focus on:
Compression.
The emergence of huge distributed networks does not allow the topology of the network to be totally known to each node without any compression scheme. There are at least two reasons for this:
In order to guarantee that local computations are done efficiently, that is, avoiding external memory requests, it may be of interest that the coding of the underlying topology can be stored within fast memory space.
The dynamism of the network implies many basic message communications to update the knowledge of each node. The smaller the message size is, the better the performance.
The compression of any topology description should not lead to an extra cost for standard requests: distance between nodes, adjacency tests, ... Roughly speaking, a decoding process should not be necessary.
Routing tables.
Routing queries and broadcasting information on large scale platforms are tasks involving many basic message communications. The maximum performance objective imposes that basic messages are routed along paths of cost as low as possible. On the other hand, local routing decisions must be fast and the algorithms and data structures involved must support a certain amount of dynamism in the platform.
Local computations.
Although the size of the data structures is less constrained in comparison with P2P systems (due to security reasons), even in our collaborative framework it is unrealistic for each node to manage a complete view of the platform with the full resource characteristics. Thus, a node has to manage data structures concerning only a fraction of the whole system. In fact, a partial view of the network will be sufficient for many tasks: for instance, in order to compute the distance between two nodes (distance labeling).
Overlay and small world networks.
The processes we consider can be highly dynamic. The preprocessing usually assumed takes polynomial time. Hence, when a new process arrives, it must be dealt with in an on-line fashion, i.e., we do not want to totally re-compute, and the (partial) re-computation has to be simple.
In order to meet these requirements, overlay networks are normally implemented. These are light virtual networks, i.e., they are sparse and a local change of the physical network will only lead to a small change of the corresponding virtual network. As a result, small address books are sufficient at each node.
A specific class of overlay networks is that of small-world networks. These are efficient overlay networks for (greedy) routing tasks assuming that distance requests can be performed easily.
Of course, the main difficulty is to adapt the maintenance of local data structures to the dynamism of the network.
As mentioned in Section , solutions provided by the parallel algorithm community are dedicated to stable platforms whose resource performances can be gathered at a single node that is responsible for computing the optimal solution. On the other hand, P2P systems are fully distributed but the set of available queries in these systems is much too poor for computationally intensive applications. Therefore, actual solutions for large scale distributed platforms such as BOINC
Requests and Task scheduling on large scale platforms;
New services for processing on large scale platforms.
Another interesting scheduling problem is the case of applications sharing (large) files stored in replicated distributed databases. We deal here with a particular instance of the scheduling problem mentioned in Section . This instance involves applications that require the manipulation of large files, which are initially distributed across the platform.
It may well be the case that some files are replicated. In the target application, all tasks depend upon the whole set of files. The target platform is composed of many distant nodes, with different computing capabilities, and which are linked through an overlay network (to be built). To each node is associated a (local) data repository. Initially, the files are stored in one or several of these repositories. We assume that a file may be duplicated, and thus simultaneously stored on several data repositories, thereby potentially speeding up the next request to access them. There may be restrictions on the possibility of duplicating the files (typically, each repository is not large enough to hold a copy of all the files). The techniques developed in Section will be used to dynamically maintain efficient data structures for handling files.
Our aim is to design a prototype for both maintaining data structures and distributing files and tasks over the network.
This framework occurs for instance in the case of Monte-Carlo applications where the parameters of new simulations depend on the average behavior of the simulations previously performed. The general principle is the following: several simulations (independent tasks) are launched simultaneously with different initial parameters, and then the average behavior of these simulations is computed. Then other simulations are performed with new parameters computed from the average behavior. These parameters are tuned to ensure a much faster convergence of the method. Running such an application on a semi-stable platform is a particular instance of the scheduling problem mentioned in Section .
We will focus on a particular algorithm picked from Molecular Dynamics: calculation of the Potential of Mean Force (PMF) using the technique of Adaptive Bias Force (ABF). This work is done in collaboration with Juan Elezgaray, IECB, Bordeaux. Here is a quick presentation of this context. Estimating the time needed for a molecule to go through a cellular membrane is an important issue in biology and medicine. Typically, the diffusion time is far too long to be computed with atomistic molecular simulations (the average time to be simulated is of the order of 1 s and the integration step cannot be chosen larger than $10^{-15}$ s, due to the nature of physical interactions). Classical parallel approaches, based on domain decomposition methods, lead to very poor results due to the number of barriers. Another method to estimate this time is to calculate the PMF of the system, which is in this context the average force the molecule is subject to at a given position within or around the membrane. Recently, Darve et al. presented a new method, called ABF, to compute the PMF. The idea is to run a small number of simulations to estimate the PMF, and then add to the system a force that cancels the estimated PMF. With this new force, new simulations are performed starting from different configurations (distributed over the computing platform) of the system computed during the previous simulations, and so on. Iterating this process, the algorithm converges quite quickly to a good estimation of the PMF with a uniform sampling along the axis of diffusion. This application has been implemented and integrated into the well-known molecular dynamics software NAMD .
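To make the iteration concrete, here is a toy one-dimensional sketch of the adaptive biasing idea. It is not the NAMD implementation: the double-well potential, the overdamped Langevin dynamics, the binning and all function and parameter names are assumptions chosen for illustration. The walker accumulates the force it feels in each bin along the axis and applies the opposite of the running average as a bias, so that the free-energy barrier is progressively flattened.

```python
import math
import random

def true_force(x):
    # Force derived from an assumed toy double-well potential U(x) = (x^2 - 1)^2.
    return -4.0 * x * (x * x - 1.0)

def run_abf(steps=20000, dt=1e-3, temperature=0.2,
            lo=-1.5, hi=1.5, bins=30, seed=0):
    """Toy 1-D adaptive biasing: returns (mean force per bin, samples per bin)."""
    rng = random.Random(seed)
    width = (hi - lo) / bins
    force_sum = [0.0] * bins
    count = [0] * bins
    x = -1.0  # start in the left well
    for _ in range(steps):
        b = min(bins - 1, int((x - lo) / width))
        f = true_force(x)
        # Update the running estimate of the mean force in the current bin.
        force_sum[b] += f
        count[b] += 1
        # The bias cancels the current estimate of the mean force.
        bias = -force_sum[b] / count[b]
        noise = math.sqrt(2.0 * temperature * dt) * rng.gauss(0.0, 1.0)
        x += (f + bias) * dt + noise
        # Reflecting walls keep the walker inside [lo, hi].
        if x < lo:
            x = 2.0 * lo - x
        elif x > hi:
            x = 2.0 * hi - x
        x = min(max(x, lo), hi)
    mean_force = [force_sum[i] / count[i] if count[i] else None
                  for i in range(bins)]
    return mean_force, count

mean_force, count = run_abf()
visited = sum(1 for c in count if c > 0)
print(f"{visited} of {len(count)} bins visited")
```

In a distributed setting, each participating node would run its own walkers and only the per-bin sums and counts would need to be merged, which is why losing a few tasks does not invalidate the estimate.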
Our aim is to propose a distributed implementation of the ABF method using NAMD. It is worth noting that NAMD is designed to run on high-end parallel platforms or clusters, but not to run efficiently on unstable and distributed platforms. The different problems to be solved in order to design this application are the following:
Since we need to start a simulation from a valid configuration (which can represent several Mbytes) with a particular position of the molecule in the membrane, and these configurations are spread among participating nodes, we need to be able to find and download such a configuration. Therefore, the first task is to find an overlay such that those requests can be handled efficiently. This requires expertise in overlay networks, compact data structures and graph theory. Olivier Beaumont, Nicolas Bonichon, Philippe Duchon, Nicolas Hanusse, Cyril Gavoille and Ralf Klasing will work on this part.
In our context, each participating node may offer some space for storing some configurations, some bandwidth and some computing power to run simulations. The question arising here is how to distribute the simulations to nodes such that the computing power of all nodes is fully used. Since nodes may join and leave the network at any time, redistributions of configurations and tasks between nodes will also be necessary (but all tasks only contribute to updating the PMF, so that some tasks may fail without changing the overall result). The techniques designed for content distribution will be used to spread and redistribute the set of configurations over the set of participating nodes. This requires expertise in task scheduling and distributed storage. Olivier Beaumont, Nicolas Bonichon, Philippe Duchon and Lionel Eyraud-Dubois will work on this part.
A prototype of a steering tool for NAMD has been developed in the project. It may be used to validate our approach and has been tested on GRID'5000 on up to 200 processors. This prototype supports the dynamicity of the platform: contributing processors can join and leave. The management of the location of configurations is now performed using a distributed hash table. This was done by integrating the Bamboo library into the prototype. We still have to solve numerical instability issues.
Continuous Integration is a development method in which developers commit their work in a version control system (such as CVS or Subversion) very frequently (typically several times per day) and the project is automatically rebuilt. One of the advantages of this technique is that merge problems are detected and corrected early.
The build process not only generates the binaries, it also runs automated tests, generates documentation, checks the code coverage of tests and analyzes code style...
The whole process can take several hours for large projects. Therefore, the efficiency of this development method relies on the speed of the feedback. There is a real need to speed up the build process, and thus to distribute it. This is one of the goals of the continuous integration server xooctory
In order to obtain an efficient distribution of the build, the build process can be decomposed into nearly independent sub processes, executed on different nodes. Nevertheless, to be completed, a sub process must be run on a node that holds the appropriate version of the tools (compiler, code auditing software, ...), the appropriate version of the libraries, and the appropriate version of source code. Of course, if the target node does not have all these items, it can download them from another node, but these communications may be more expensive than the execution of the sub processes.
This raises several challenging problems:
Build a distributed data structure that can efficiently provide:
one of the nodes that stores a certain set $S$ of files;
one of the nodes that stores a maximum subset $S'$ of a set $S$ of files;
one of the nodes that can quickly obtain a certain set $S$ of files (i.e. a node that can efficiently download the files of $S$ that it does not already hold).
Design distribution strategies of the build that take advantage of the processing and communication capabilities of the nodes.
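The second query above (a node holding a maximum subset of a requested set of files) can be illustrated with a small centralized sketch. The function name `best_node_for`, the node names and the file names are hypothetical; a real solution would have to answer this query from a distributed data structure rather than from a global dictionary.

```python
def best_node_for(required, holdings):
    """Return (node, missing files) for the node holding the most files of `required`.

    `holdings` maps a node name to the set of files it stores. Ties are broken
    by node name so that the answer is deterministic.
    """
    required = set(required)
    best = max(sorted(holdings), key=lambda node: len(required & holdings[node]))
    return best, required - holdings[best]

# Hypothetical build farm: tools, libraries and source revisions per node.
holdings = {
    "n1": {"gcc-4.2", "libfoo-1.0"},
    "n2": {"gcc-4.2", "libfoo-1.0", "src-r120"},
    "n3": {"src-r120"},
}
node, missing = best_node_for({"gcc-4.2", "libfoo-1.0", "src-r121"}, holdings)
print(node, sorted(missing))
```

The returned set of missing files is what the chosen node would still have to download, which is the quantity to weigh against the execution time of the sub process.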
We are collaborating with Xavier Hanin and Jayasoft in order to solve distribution problems in the context of distributed continuous integration. Our goal is to incorporate some of the services developed in Cepage to obtain a large scale distributed version of the continuous integration server xooctory.
Data cube queries represent an important class of On-Line Analytical Processing (OLAP) queries in decision support systems. They consist in a pre-computation of the different group-bys of a database (aggregation for every combination of GROUP BY attributes), which is a very costly task. For instance, databases of some megabytes may lead to the construction of a datacube requiring terabytes of memory; parallel computation has been proposed, but only for a static and well-identified platform . This application is typically an interesting example for which distributed computation and storage can be useful in a heterogeneous and dynamic setting. We have just started a collaboration with Sofian Maabout (Assistant Professor in Bordeaux) and Noel Novelli (Assistant Professor at Marseille University), who is a specialist in datacube computation. Our goal is to rely on the set of services defined in Section to compute and maintain huge datacubes. For the moment, we have developed a centralized tool that summarizes a whole datacube up to dimension 13 and that outperforms the usual data cube reduction schemes. The next step consists in turning our approach into a distributed algorithm.
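As a minimal illustration of why the pre-computation is costly, the sketch below naively materializes every group-by of a tiny fact table: with $D$ dimensions there are $2^D$ of them. The function name `data_cube`, the column names and the rows are assumptions made for the example, and the sum is used as the aggregate.

```python
from collections import defaultdict
from itertools import combinations

def data_cube(rows, dimensions, measure):
    """Compute the sum of `measure` for every subset of `dimensions`."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

# Hypothetical fact table with two dimensions and one measure.
rows = [
    {"city": "Bordeaux", "year": 2007, "sales": 3},
    {"city": "Bordeaux", "year": 2008, "sales": 4},
    {"city": "Marseille", "year": 2008, "sales": 5},
]
cube = data_cube(rows, ("city", "year"), "sales")
print(len(cube), "group-bys")
print(cube[("city",)])
```

Each group-by is independent of the others here, which is the property a distributed version could exploit.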
We are working with Cyril Banino (Yahoo Research, Trondheim) on data management for large scale distributed databases. In the context of the Yahoo platform, data is stored among several thousands of nodes, so that centralized solutions are no longer valid, and the system must rely on self-organization to balance the load. In this context, the platform is relatively stable (although nodes frequently experience failures and nodes are frequently added), but the set of stored data is highly dynamic, since data are frequently added and their popularity changes very quickly over time.
We work on data-management issues and the adaptation of CRUSH and Sorrento protocols used to localize data. An important issue is the design of mechanisms to distribute data over the set of participating nodes. The objective is both to balance the load in terms of storage among the different storage devices and to balance the load in terms of processed requests among the different processing units. Given the dynamism of the requests and the files to be stored, the scale of the system and the risk of failure due to the large number of storage and processing units, we believe that the techniques developed in the context of P2P systems may also be used in the context of large distributed databases. To balance both loads (storage and requests), we plan to rely on the services described in Section .
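To illustrate decentralized data placement, the sketch below uses rendezvous (highest-random-weight) hashing. This is a deliberately simpler stand-in technique, not the CRUSH or Sorrento protocol; the function names, the node names and the replica count are assumptions. Any node can compute the location of a key locally from the list of nodes, with no central directory.

```python
import hashlib

def score(node, key):
    # Pseudo-random weight of the pair (node, key), derived from a hash.
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(key, nodes, replicas=2):
    """Return the `replicas` nodes with the highest weight for `key`."""
    return sorted(nodes, key=lambda node: score(node, key), reverse=True)[:replicas]

nodes = [f"node{i}" for i in range(10)]
print(place("file42", nodes))
```

A useful property of this scheme is that removing a node only relocates the keys that were stored on it, so the placement of all other keys is unchanged. It balances storage only in expectation and does not by itself account for request popularity.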
There are several techniques to manage sub-linear size routing tables (in the number of nodes of the platform) while guaranteeing almost shortest paths (cf. for a survey of routing techniques).
Some techniques provide routes of length at most $1 + \varepsilon$ times the length of the shortest one (which is the definition of a stretch factor of $1 + \varepsilon$) while maintaining a poly-logarithmic number of entries per routing table , , . However, these techniques are not universal in the sense that they apply only to some classes of underlying topologies. Universal schemes exist. Typically they achieve $\tilde{O}(\sqrt{n})$-entry local routing tables for a stretch factor of 3 in the worst case , . Some experiments have shown that such methods, although universal, work very well in practice, on average, on realistic scale-free or existing topologies .
While the fundamental question is to determine the best stretch-space trade-off for universal schemes, the challenge for platform routing would be to design specific schemes supporting reasonable dynamic changes in the topology or in the metric, at least for a limited class of relevant topologies. In this direction, network topologies have been constructed (in polynomial time) for which nodes can be labeled once such that, however the link weights vary over time, shortest path routing tables with compactness $k$ can be designed, i.e., for each routing table the set of destinations using the same first outgoing edge can be grouped in at most $k$ ranges of consecutive labels.
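The notion of compactness can be checked on a small routing table. In the sketch below, a table maps each destination label to the first outgoing edge, and we count, for each edge, the number of maximal ranges of consecutive labels it serves. The function names and the example tables are hypothetical, and labels are treated as a linear (not cyclic) order for simplicity.

```python
def ranges_per_edge(table):
    """Count, for each outgoing edge, the maximal runs of consecutive labels."""
    runs = {}
    previous_label, previous_edge = None, None
    for label in sorted(table):
        edge = table[label]
        continues_run = (previous_label is not None
                         and edge == previous_edge
                         and label == previous_label + 1)
        if not continues_run:
            runs[edge] = runs.get(edge, 0) + 1
        previous_label, previous_edge = label, edge
    return runs

def compactness(table):
    """Largest number of ranges needed by any single outgoing edge."""
    return max(ranges_per_edge(table).values(), default=0)

# Hypothetical routing table of one node: destination label -> outgoing edge.
table = {1: "e1", 2: "e1", 3: "e2", 4: "e1", 5: "e2", 6: "e2"}
print(ranges_per_edge(table), compactness(table))
```

A table with compactness $k$ can be stored as at most $k$ ranges per outgoing edge instead of one entry per destination.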
One other aspect of the problem would be to model a realistic typical platform topology. A natural characteristic for this is low dimensionality: low Euclidean or near Euclidean networks, low growth dimension, or more generally, low doubling dimension.
In 2007, we improved compact routing schemes for planar networks, and more generally for networks excluding a fixed minor . This latter family of networks includes (but is not restricted to) networks embeddable on surfaces of bounded genus and networks of bounded treewidth. The stretch factor of our scheme is constant and the size of each routing table is only polylogarithmic (independently of the degree of the nodes), and the scheme does not require renaming (or a new addressing) of the nodes: it is name-independent. More importantly, the scheme can be constructed efficiently in polynomial time, and the complexities do not hide large constants as one may encounter in Graph Minor Theory. This construction has been achieved by the design of a new sparse cover for planar graphs, solving a problem open since STOC '93.
In , we have shown that routing in outerplanar networks can be done along shortest paths with $O(\log n)$-bit labels, where $n$ is the number of nodes in the network, extending a result of Fraigniaud et al. obtained for trees. The solution can actually be generalized to $k$-cellular networks, which are roughly networks that are the union of $k$ outerplanar networks. It is worth mentioning that such a scheme can be constructed in quadratic time.
In 2007, we also gave an invited lecture on compact routing schemes at a workshop on Peer-to-Peer, Routing in Complex Graphs, and Network Coding in Thomson Labs in Paris.
In 2008, we proposed a minimum stretch compact name-independent routing scheme .
In order to optimize applications, the platform topology itself must be discovered, and thus represented in memory with some data structures. The size of the representation is an important parameter, for instance, in order to optimize the throughput during the exploration phase of the platform.
Classical data structures for representing a graph (matrix or list) can be significantly improved when the targeted graph falls in some specific classes or obeys some properties: the graph has bounded genus (embeddable on a surface of fixed genus), bounded tree-width (or is $c$-decomposable), or bounded page number , . Typically, planar topologies with $n$ nodes (thus embeddable on the plane with no edge crossings) can be efficiently coded in linear time with at most $5n + o(n)$ bits supporting adjacency queries in constant time. This improves on the classical adjacency list by a non-negligible $\log n$ factor on the size (the size is about $6n \log n$ bits for an edge list), and also on the query time , , .
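The $6n \log n$ figure follows from a simple count: a planar graph with $n \geq 3$ nodes has at most $3n - 6$ edges, and an edge list stores two endpoint identifiers of about $\log n$ bits per edge. The short sketch below works out both sizes for a given $n$. The function names are hypothetical and the $o(n)$ term of the compact encoding is ignored, so the second figure is only the leading term.

```python
def edge_list_bits(n):
    """Bits used by an edge list of a maximal planar graph on n nodes."""
    edges = max(0, 3 * n - 6)              # planar graphs have at most 3n - 6 edges
    id_bits = max(1, (n - 1).bit_length())  # bits needed to name one node
    return 2 * edges * id_bits

def compact_bits(n):
    """Leading term of the 5n + o(n) bound (lower-order term ignored)."""
    return 5 * n

n = 1_000_000
print(edge_list_bits(n), compact_bits(n), edge_list_bits(n) / compact_bits(n))
```

For a million nodes, node identifiers take 20 bits, so the edge list is larger than the compact encoding by a factor close to 24, in line with the $\log n$ gap stated above.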
In 2008, we gave a compact encoding scheme of pagenumber $k$ graphs .
The basic routing scheme and the overlay networks must also allow us to route application-driven queries other than routing queries. Typically, divide-and-conquer parallel algorithms require computing many nearest common ancestor (NCA) queries in some tree decomposition. In a large scale platform, if the current tree structure is fully or partially distributed, then the physical location of the NCA in the platform must be optimized. More precisely, the NCA computation must be performed from distributed pieces of information, and then addressed via the routing overlay network (cf. for distributed NCA algorithms).
Recently, a theory of localized data structures has been developed (initiated by ; see for a survey). One associates with each node a label such that some given function (or predicate) of the nodes can be extracted from two or more labels. These labels are usually attached to the addresses or inserted into a global database index.
In relation with the project, queries involving the flow computation between any source-target pair of a capacitated network are of great interest . Dynamic labeling schemes are also available for tree models , , and need further work for their adaptation to more general topologies.
Finally, localized data structures have applications to platforms implementing large database XML file types. Roughly speaking pieces of a large XML file are distributed along some platform, and some queries (typically some SELECT ... FROM extractions) involve many tree ancestor queries , the XML file structure being a tree. In this framework, distributed label-based data structures avoid the storing of a huge classical index database.
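A classical example of a localized data structure for tree ancestor queries is the interval labeling obtained from a depth-first traversal: each node receives its preorder number together with the largest preorder number in its subtree, and ancestry is decided from the two labels alone. The sketch below shows this textbook scheme, which uses about $2 \log n$ bits per label; it is not one of the shorter schemes discussed in this report, and the function names and the example tree are assumptions.

```python
def ancestry_labels(children, root):
    """Label each node with (preorder number, largest preorder number in its subtree)."""
    labels, entered, clock = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            labels[node] = (entered[node], clock - 1)
        else:
            entered[node] = clock
            clock += 1
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return labels

def is_ancestor(label_u, label_v):
    """True if the node labelled label_u is an ancestor of (or equal to) the other node."""
    return label_u[0] <= label_v[0] and label_v[1] <= label_u[1]

# Hypothetical XML-like document tree.
children = {"html": ["head", "body"], "body": ["div", "p"], "div": ["span"]}
labels = ancestry_labels(children, "html")
print(labels["body"], labels["span"], is_ancestor(labels["body"], labels["span"]))
```

Because the test only reads the two labels, the pieces of the file can live on different nodes and no global index has to be consulted.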
In 2007, we proved that it is possible to assign to each node of an $n$-node planar network a label of $2\log n + O(\log\log n)$ bits so that adjacency between two nodes can be retrieved from their labels . Classical representations of planar graphs in the distributed setting were based on the Three Schnyder Trees decomposition, leading to $3\log n + O(\log^* n)$-bit labels (FOCS '01). An intriguing question is whether a $c\log n$-bit representation exists for planar graphs with $c < 2$.
For trees, we can solve $k$-ancestry and distance-$k$ queries with shorter labels , . Previous solutions achieve $\log n + O(k^2 \log\log n)$-bit labels [Alstrup-Bille-Rauhe 2005], whereas we have proved that $\log n + O(k \log\log n)$-bit labels suffice. For interval graphs, we have given an optimal distance labeling scheme , and we proposed a localized and compact data structure for comparability graphs .
In , , , we also analyzed the locality of the construction of sparse spanners. In , we proposed an efficient first-order model checking using short labels.
We also mention a keynote talk about “Localized Data Structures” at the LOCALITY workshop, an event co-located with PODC '07 in Portland.
Finally, we have started a collaboration with Andrew Twigg (Thomson - Labs) and Bruno Courcelle (LaBRI) about connectivity in semi-dynamic planar networks (see preliminary results here and here ). In this model, one must precompute some localized data structure (given as a label associated with each node) for a planar graph $G$, so that connectivity between any two nodes in $G \setminus X$, where $X$ is any subset of nodes or edges, can be determined from the labels of the two nodes and the labels of the nodes (or end-points of edges) of $X$. This field looks promising since it captures a kind of dynamicity of the network, and we hope to generalize this model and our results.
Distributed Greedy Coloring is an interesting and intuitive variation of the standard Coloring problem. It still consists in coloring in a distributed setting each node of a given graph in such a way that two adjacent nodes do not get the same color, but it adds a further constraint. Given an order among the colors, a coloring is said to be greedy if there does not exist a node whose color can be replaced by a color of lower position in this order without violating the coloring property. In , we provide lower and upper bounds for this problem in Linial's model and we relate them to other well-known problems, namely Coloring, Maximal Independent Set (MIS), and Largest First Coloring. Whereas the best known upper bounds for Coloring, MIS, and Greedy Coloring are the same, we prove a lower bound which is strong in the sense that it now makes a difference between Greedy Coloring and MIS. In , we analyse the information sensitivity of graph coloring.
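The greedy property defined above is easy to check on a given coloring. The sketch below is a centralized illustration, not the distributed algorithm studied in Linial's model: colors are the integers 0, 1, 2, ... in their natural order, `greedy_coloring` is the usual sequential first-fit procedure, and `is_greedy` tests the definition. All names and the example graph are assumptions.

```python
def greedy_coloring(adjacency, order):
    """Sequential first-fit: each node takes the lowest color unused by its colored neighbors."""
    color = {}
    for node in order:
        used = {color[nb] for nb in adjacency[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color

def is_greedy(adjacency, color):
    """True if the coloring is proper and no node can switch to a lower color."""
    for node, c in color.items():
        neighbour_colors = {color[nb] for nb in adjacency[node]}
        if c in neighbour_colors:
            return False  # not even a proper coloring
        if any(lower not in neighbour_colors for lower in range(c)):
            return False  # a lower color is available for this node
    return True

# Path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(greedy_coloring(path, "abcd"))
print(is_greedy(path, {"a": 0, "b": 1, "c": 2, "d": 1}))
```

The second coloring is proper but not greedy, since node c could take color 0.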
In , we propose a new view selection algorithm. Such an algorithm takes as input a fact table and computes a set of views to store in order to speed up queries. The performance of a view selection algorithm is usually measured by three criteria: (1) the amount of memory needed to store the selected views, (2) the query response time and (3) the time complexity of the algorithm. The first two measurements deal with the output of the algorithm. No existing solution gives a good trade-off between amount of memory and query cost with a small time complexity. We propose in this paper an algorithm guaranteeing a constant approximation factor of query response time with respect to the optimal solution. Moreover, the time complexity for a $D$-dimensional fact table is $O(D \cdot 2^D)$, corresponding to the fastest known algorithm. We provide an experimental comparison with two other well-known algorithms showing that our approach also gives good performance in terms of memory. Experiments are done in a centralized setting but our algorithm can easily be adapted to a parallel setting.
An overlay network is a virtual network whose nodes correspond either to processors or to resources of the network. Virtual links may depend on the application; for instance, different overlay networks can be designed for routing and broadcasting.
These overlay networks should support insertion and deletion of users/resources, and thus they inherently have a high dynamism.
We should distinguish structured and unstructured overlay networks:
In the first case, one aims at designing a network in which queries can be answered efficiently: greedy routing should work well (without backtracking), and the spreading of a piece of information should take a very short time and few messages. The natural topologies for these networks are graphs of small diameter and bounded degree (De Bruijn graphs for instance). However, dynamic maintenance of a precise structure is difficult, and after any perturbation of the topology there is no longer any guarantee for the desired tasks.
In the case of unstructured networks, there is no strict topology control. For the information retrieval task, the only attempt to bound the total number of messages consists of optimizing a flooding by taking into account statistics stored at each peer: number of requests that found an item traversing a given link, ...
In both approaches, the physical topology is not involved. To our knowledge, there exists only one attempt in this direction. The work of Abraham and Malkhi deals with the design of routing tables for stable platforms.
We are interested in designing overlay topologies that take into account the physical topology.
Another line of work is promising. If we relax the condition of designing an overlay network with a precise topology and only require some topological properties, we might construct very efficient overlay networks. Two directions can be considered: random graphs and small-world networks.
Random graphs are promising for broadcast and have been proposed for the update of replicated databases in order to minimize the total number of messages and the time complexity , . The underlying topology is the complete graph but the communication graph (pairs of nodes that effectively interact) is much sparser. At each pulse of its local clock, each node tries to send or receive any new piece of information. The advantage of this approach is fault-tolerance. However, this epidemic spreading leads to a waste of messages since any node can receive the same update many times. We are interested in fixing this drawback and we think that it should be possible.
For several queries, recent solutions use small-world networks. This approach is inspired by experiments in social sciences . It suggests that adding a few (non uniform) random and uncoordinated virtual long links to every node drastically shrinks the diameter of the network. Moreover, paths with a small number of hops can be found , , .
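The waste of messages in epidemic spreading can be observed on a small simulation. The sketch below implements a synchronous push protocol on the complete graph, in which every informed node sends the update to one uniformly random node per round. It is one simple variant among many, and the function name, the round-based model and the parameters are assumptions.

```python
import random

def push_gossip(n, seed=0):
    """Simulate push gossip from node 0; return (rounds, messages, redundant messages)."""
    rng = random.Random(seed)
    informed = {0}
    rounds = messages = redundant = 0
    while len(informed) < n:
        rounds += 1
        newly_informed = set()
        for _ in range(len(informed)):
            # Each informed node pushes the update to one random node.
            target = rng.randrange(n)
            messages += 1
            if target in informed or target in newly_informed:
                redundant += 1  # the target already had the update
            else:
                newly_informed.add(target)
        informed |= newly_informed
    return rounds, messages, redundant

rounds, messages, redundant = push_gossip(1000)
print(rounds, messages, redundant)
```

Exactly $n - 1$ messages are useful, so every other message is redundant; the redundancy is what buys tolerance to lost messages and failed nodes.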
Solutions based on network augmentation (i.e. by adding virtual links to a base network) have proved to be very promising for large scale networks. This technique is referred to as turning a network into a small-world network, also called the small-worldization process. Indeed, it allows us to transform many arbitrary networks into networks in which search operations can be performed in a greedy fashion and very quickly (typically in time poly-logarithmic in the size of the network). This property implies that some information, like the distance between nodes, can be easily (or locally) accessed. More formally, a network is $f$-navigable if greedy routing can be used to get routing paths of $O(f)$ hops. Recently, many authors have aimed at finding networks that can be turned into $\log^{O(1)} n$-navigable networks.
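As a minimal sketch of network augmentation and greedy routing, the example below takes a ring as base network, gives every node one long-range link whose length is drawn with probability inversely proportional to the distance (the one-dimensional analogue of Kleinberg's model), and routes greedily using only local information. The function names, the ring size and the choice of a single long link per node are assumptions.

```python
import random

def ring_distance(a, b, n):
    d = abs(a - b)
    return min(d, n - d)

def add_long_links(n, seed=0):
    """Give each node one long-range link, with length d chosen with probability ~ 1/d."""
    rng = random.Random(seed)
    distances = list(range(1, n // 2 + 1))
    weights = [1.0 / d for d in distances]
    links = {}
    for u in range(n):
        d = rng.choices(distances, weights=weights, k=1)[0]
        links[u] = (u + rng.choice((-d, d))) % n
    return links

def greedy_route(source, target, n, links):
    """Forward to the neighbor (ring or long link) closest to the target; return hop count."""
    hops, current = 0, source
    while current != target:
        candidates = [(current - 1) % n, (current + 1) % n, links[current]]
        current = min(candidates, key=lambda v: ring_distance(v, target, n))
        hops += 1
    return hops

n = 1024
links = add_long_links(n)
print(greedy_route(0, n // 2, n, links), "hops instead of", n // 2)
```

Greedy routing always terminates here because one of the two ring neighbors is strictly closer to the target, so the long links can only shorten the route.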
Our goal is to study more precisely the algorithmic performance of these new small-world networks (w.r.t. time, memory, pertinence, fault-tolerance, auto-stabilization, ...) and to propose new networks of this kind, i.e. to construct the augmentation of the base network as well as to conceive the corresponding navigation algorithm. Like classical algorithms for routing and navigation (that are essentially based on greedy algorithms), the proposed solutions have to take into account that no entity has a global knowledge of the network. A first result in this direction is promising. In , we proposed an economic distributed algorithm to turn a bounded growth network into a small-world. Moreover, the practical challenge will be to adapt such constructions to dynamic networks, at least under the models that are identified as relevant.
Can the small-worldization process be supported in dynamic platforms? Up to now, the literature on small-world networks only deals with the routing task. We are convinced that small-world topologies are also relevant for other tasks: quick broadcast, search in presence of faulty nodes, ... In general, we think that maintaining a small-world topology can be much more realistic than maintaining a rigidly structured overlay network and much more efficient for several tasks in unstructured overlay networks.
In 2007, we had two contributions dealing with overlay networks: (1) in , there is a formal description of an algorithm turning any network into an $n^{1/3}$-navigable network. This article is particularly interesting since it is the first one that considers any input network in the small-worldization process; (2) in , , we prove that local knowledge is not enough to search quickly for a target node in scale-free networks. Recent studies showed that many real networks are scale-free: the distribution of node degrees follows a power law of the form $k^{-\alpha}$ with $\alpha \in [2, 3]$, that is, the number of nodes of degree $k$ is proportional to $k^{-\alpha}$. More precisely, we formally prove that in usual scale-free models, it takes $\Omega(n^{1/2})$ steps to reach the target.
In 2008, we gave a small stretch polylogarithmic network navigability scheme using compact metrics .
In , we describe a randomized algorithm for assigning neighbors to vertices joining a dynamic distributed network. The aim of the algorithm is to maintain connectivity, low diameter and constant vertex degree. On joining, each vertex donates a constant number of tokens to the network. These tokens contain the address of the donor vertex. The tokens make independent random walks in the network. A token can be used by any vertex it is visiting to establish a connection to the donor vertex. This allows joining vertices to be allocated a random set of neighbors although the overall vertex membership of the network is unknown. The network we obtain in this way is robust under adversarial deletion of vertices and edges and actively reconnects itself.
In , we propose a new concept for browsing and searching in large collections of content-based indexed images. Our approach is inspired by greedy routing algorithms used in distributed networks. We define a navigation graph whose vertices represent images. The edges of the navigation graph are computed according to a similarity measure between indexed images. The resulting graph can be seen as an ad-hoc network of images in which a greedy routing algorithm can be applied for retrieval purposes. Experiments are done in a centralized setting and could be easily adapted to a distributed setting.
In , we consider networks in which there exists a harmful node, called a black hole, destroying any incoming mobile agent. The black hole search problem consists, for a team of mobile agents, in locating the black hole in the network. We prove that, for this problem, the pebble model is computationally as powerful as the whiteboard model; furthermore, the complexity is exactly the same. More precisely, we prove that a team of two asynchronous agents, each endowed with a single identical pebble (that can be placed only on nodes, and with no more than one pebble per node), can locate the black hole in an arbitrary network of known topology; this can be done with $\Theta(n \log n)$ moves, where $n$ is the number of nodes, even when the links are not FIFO.
In , we consider the problem of designing the fastest Black Hole Search, given the map of the network, the starting node and, possibly, a subset of nodes of the network initially known to be safe. We study the version of this problem that assumes that there is at most one black hole in the network and there are two agents, which move in synchronized steps. We prove that this problem is not polynomial-time approximable within any constant factor less than (unless P=NP). We give a 6-approximation algorithm, thus improving on the previous 9.3-approximation algorithm from the literature. We also prove APX-hardness and give a -approximation algorithm for a restricted version of the problem, in which only the starting node is initially known to be safe.
In , we consider a fixed, undirected, known network and a number of “mobile agents” which can traverse the network in synchronized steps. Some nodes in the network may be faulty and the agents are to find the faults and repair them. The agents could be software agents, if the underlying network represents a computer network, or robots, if the underlying network represents some potentially hazardous physical terrain. Assuming that the first agent encountering a faulty node can immediately repair it, it is easy to see that the number of steps necessary and sufficient to complete this task is $\Theta(n/k + D)$, where $n$ is the number of nodes in the network, $D$ is the diameter of the network, and $k$ is the number of agents. We consider the case where one agent can repair only one faulty node. After repairing the fault, the agent dies. We show that a simple deterministic algorithm for this problem terminates within $O(n/k + D \log f / \log\log f)$ steps, where $f = \min\{n/k, n/D\}$, assuming that the number of faulty nodes is at most $k/2$. We also demonstrate the worst-case asymptotic optimality of this algorithm by showing a network such that for any deterministic algorithm, there is a placement of $k/2$ faults forcing the algorithm to work for $\Omega(n/k + D \log f / \log\log f)$ steps.
In the effort to understand the algorithmic limitations of computing by a swarm of robots, the research has focused on the minimal capabilities that allow a problem to be solved. The
weakest of the commonly used models is
Asynchwhere the autonomous mobile robots, endowed with visibility sensors (but otherwise unable to communicate), operate in Look-Compute-Move
cycles performed asynchronously for each robot. The robots are often assumed (or required to be) oblivious: they keep no memory of observations and computations made in previous cycles. In
the paper
, we consider the setting when the robots are dispersed in an anonymous and unlabeled graph, and they must
perform the very basic task of
exploration: within finite time every node must be visited by at least one robot and the robots must enter a quiescent state. The complexity measure of a solution is the number of
robots used to perform the task. We study the case when the graph is an arbitrary tree and establish some unexpected results. We first prove that there are
$n$-node trees where $\Omega(n)$ robots are necessary; this holds even if the maximum degree is 4. On the other hand, we show that if the maximum degree is 3, it is possible to explore with only
robots. The proof of the result is constructive. Finally, we prove that the size of the team is asymptotically
optimal: we show that there are trees of degree 3 whose exploration requires
robots.
In , , we consider the problem of gathering identical, memoryless, mobile robots in one node of an anonymous unoriented ring. Robots start from different nodes of the ring. They operate in Look-Compute-Move cycles and have to end up in the same node. In one cycle, a robot takes a snapshot of the current configuration (Look), makes a decision to stay idle or to move to one of its adjacent nodes (Compute), and in the latter case makes an instantaneous move to this neighbor (Move). Cycles are performed asynchronously for each robot. For an odd number of robots we prove in that gathering is feasible if and only if the initial configuration is not periodic, and we provide a gathering algorithm for any such configuration. For an even number of robots we decide feasibility of gathering except for one type of symmetric initial configurations, and provide gathering algorithms for initial configurations proved to be gatherable. In , we close the open problem of characterizing symmetric situations on the ring which admit a gathering for configurations of more than 18 robots.
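The periodicity condition can be checked mechanically. The sketch below assumes that a configuration is described by the occupancy of each ring node and that "periodic" means invariant under a non-trivial rotation of the ring; both the representation and the function names are illustrative assumptions, not taken from the papers.

```python
def ring_configuration(size: int, robots: list[int]) -> list[bool]:
    """Occupancy vector of a ring with `size` nodes and robots at the given nodes."""
    occupied = [False] * size
    for position in robots:
        occupied[position % size] = True
    return occupied


def is_periodic(occupied: list[bool]) -> bool:
    """True if some rotation by r, with 0 < r < size, maps the configuration to itself."""
    size = len(occupied)
    return any(occupied[r:] + occupied[:r] == occupied for r in range(1, size))


if __name__ == "__main__":
    # Three robots evenly spaced on a ring of 6 nodes: periodic.
    print(is_periodic(ring_configuration(6, [0, 2, 4])))
    # Three robots on a ring of 7 nodes: no non-trivial rotation preserves it.
    print(is_periodic(ring_configuration(7, [0, 1, 3])))
```

The check is quadratic in the ring size, which is enough for experimenting with small configurations.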
In , we consider the problem of periodic exploration of all nodes in undirected anonymous graphs by using a finite state automaton (or robot). The nodes in the graph are neither labelled nor colored. However, while visiting a node $v$ the robot can distinguish between edges incident to it. The edges are ordered and labelled by consecutive integers $1, \ldots, d(v)$ called port numbers, where $d(v)$ is the degree of $v$. Periodic graph exploration requires that the automaton has to visit every node infinitely many times in a periodic manner. In this work, we are interested in minimisation of the length of the exploration period. In other words, we want to minimise the maximum number of edge traversals performed by the robot between two consecutive visits of a generic node, in the same state and entering the node by the same port. We present an efficient deterministic algorithm arranging the port numbers, and a robot equipped with a constant number of bits able to complete the traversal period in at most $3.75n - 2$ steps, hence disproving a previous conjecture by which $4n - O(1)$ steps are required.
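To make the port-number model concrete, here is a small simulation of a memoryless walk on a port-labelled graph. It uses the classical rule "after entering by port p, leave by port p + 1 (cyclically)", which is a swapped-in textbook rule and not the algorithm of the work described above. The graph encoding (`ports[v][i]` is the neighbour reached from `v` through port `i + 1`) and the function name are assumptions, and the graph is assumed simple and connected.

```python
def basic_walk_period(ports: dict[int, list[int]], start: int) -> tuple[int, set[int]]:
    """Follow the rule 'enter by port p, leave by port (p mod d(v)) + 1'.

    Returns the number of edge traversals before the walk is about to
    repeat its first move, together with the set of visited nodes.
    """
    node, out_port = start, 1
    first_move = (node, out_port)
    visited = {node}
    steps = 0
    while True:
        neighbour = ports[node][out_port - 1]
        # Port by which the robot enters the neighbour (simple graph assumed).
        in_port = ports[neighbour].index(node) + 1
        node = neighbour
        out_port = in_port % len(ports[node]) + 1
        visited.add(node)
        steps += 1
        if (node, out_port) == first_move:
            return steps, visited


if __name__ == "__main__":
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(basic_walk_period(star, 0))
```

On a tree this rule traverses every edge once in each direction, so the period is 2(n - 1). On general graphs the behaviour depends on how the ports are arranged, which is precisely the degree of freedom exploited in the result above.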
Within the wider context of the project, we have published a book on information dissemination in optical networks , and two book chapters on data gathering and energy consumption in wireless networks, respectively , . We have also considered the problems of modeling of wireless networks , energy efficiency in wireless networks , , and bandwidth allocation and broadcasting in radio networks , . We have also investigated the problems of designing survivable networks , and of constructing identifying codes and locating-dominating codes in graphs .
Even though the application field for large scale platforms is currently too narrow, the targeted platforms are clearly not suited to tightly coupled codes, and we need to concentrate on simple scheduling problems in the context of large scale distributed unstable platforms. Indeed, most scheduling problems are already NP-complete, with poor approximation ratios, even for static homogeneous platforms when communication costs are not taken into account.
Recently, many algorithms have been derived, under several communication models, for master slave tasking , and Divisible Load Scheduling (DLS) , , .
In this case, we aim at executing a large bag of independent, same-size tasks. First we assume that there is a single master, that initially holds all the (data needed for all) tasks. The problem is to determine an architecture for the execution. Which processors should the master enroll in the computation? How many tasks should be sent to each participating processor? In turn, each processor involved in the execution must decide which fraction of the tasks must be computed locally, and which fraction should be sent to which neighbor (these neighbors must be determined too).
Parallelizing the computation by spreading the execution across many processors may well be limited by the induced communication volume. Rather than aiming at makespan minimization, a more relevant objective is the optimization of the throughput in steady-state mode. There are three main reasons for focusing on the steady-state operation. First is simplicity, as the steady-state scheduling is in fact a relaxation of the makespan minimization problem in which the initialization and clean-up phases are ignored. One only needs to determine, for each participating resource, which fraction of time is spent computing for which application, and which fraction of time is spent communicating with which neighbor; the actual schedule then arises naturally from these quantities.
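As an illustration of how simple the steady-state relaxation can be, consider a deliberately restricted setting that is an assumption of this sketch and not the general problem above: a single master connected to workers in a star, sending to one worker at a time, where worker i needs `c` time units to receive a task and `w` time units to compute it. Maximizing the number of tasks processed per time unit then reduces to a fractional knapsack, solved greedily by serving the workers with the cheapest communication first. The function name and data layout are hypothetical.

```python
def steady_state_throughput(workers: list[tuple[float, float]]) -> tuple[float, list[float]]:
    """Greedy steady-state rates on a star platform (one-port master).

    workers[i] = (c, w): time to send one task to worker i and time for
    worker i to compute it (both assumed positive).
    Returns (tasks per time unit, per-worker task rates).
    """
    rates = [0.0] * len(workers)
    link_time = 1.0  # fraction of each time unit the master's port is still free
    # Workers with the smallest communication cost are served first.
    for i in sorted(range(len(workers)), key=lambda j: workers[j][0]):
        c, w = workers[i]
        # A worker cannot compute more than 1/w tasks per time unit,
        # nor receive more than the remaining link time allows.
        rates[i] = min(1.0 / w, link_time / c)
        link_time = max(0.0, link_time - rates[i] * c)
        if link_time == 0.0:
            break
    return sum(rates), rates


if __name__ == "__main__":
    print(steady_state_throughput([(1.0, 4.0), (2.0, 2.0), (1.0, 1.0)]))
```

In this example the third worker is fast but the master's link saturates first, so the second worker, whose communications are the most expensive, receives no task at all.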
In , we have considered task scheduling for parallel multi-frontal methods, which corresponds to mapping a set of tasks whose dependencies are described by a tree. In , we have proposed several distributed scheduling algorithms for the case where several applications are to be simultaneously mapped onto a heterogeneous platform.
In , we discuss complexity issues for DLS on heterogeneous systems under the bounded multi-port model. To the best of our knowledge, this is the first attempt to consider DLS under a realistic communication model, where the master node can communicate simultaneously with several slaves, provided that bandwidth constraints are not exceeded. We concentrate on one-round distribution schemes, where a given node starts its processing only once all of its data has been received. Our main contributions are (i) the proof that processors start working immediately after receiving their work, (ii) the study of the optimal schedule in the case of 2 processors, and (iii) the proof that scheduling divisible load under the bounded multi-port model is NP-complete. This last result differs strongly from the divisible load literature and represents the first NP-completeness result when latencies are not taken into account.
Another important and still open issue for Divisible Load Scheduling concerns return communication. Under the classical model, it is assumed that the time needed to communicate the results from the slaves to the master node can be neglected, which strongly limits the application field. In particular, the complexity of the problem with return messages is still open. This question has been studied in cooperation with Abhay Ghatpande, from Waseda University, in , , . In particular, we have proposed two heuristics for scheduling return messages with different computational costs.
In this context, we have participated in the writing of two book chapters, one on the different possible models of communications and one on steady-state scheduling.
In many distributed applications on large distributed systems, nodes may offer some local resources and request some remote resources. For instance, in a distributed storage environment, nodes may offer some space to store remote files and request some space to duplicate remotely some of their files. In the context of broadcasting, offer may be seen as the outgoing bandwidth and request as the incoming bandwidth. In the context of load balancing, overloaded nodes may request to get rid of some tasks whereas underloaded nodes may offer to process them. In this context, we propose a distributed algorithm, called dating service, which is meant to randomly match demands and supplies of some resource of many nodes into couples. In a given round, it produces a matching between demands and supplies which is of linear size (compared to the optimal one), even if available resources of individual nodes are very heterogeneous, and is chosen uniformly at random from all matchings of this size.
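The flavour of one matching round can be conveyed by a small centralized simulation. This is only a sketch under assumptions of ours: every unit of offer and every unit of request is sent to a node chosen uniformly at random, and each node pairs up the offers and requests it has received. The function name, the data layout and the uniform choice of meeting points are illustrative, and the sketch does not claim the guarantees stated above.

```python
import random


def dating_round(offers: list[int], requests: list[int],
                 rng: random.Random) -> list[tuple[int, int]]:
    """Simulate one round; returns a list of (supplier, demander) pairs.

    offers[i] / requests[i] is the number of resource units that node i
    offers / requests in this round.
    """
    n = len(offers)
    offer_box = [[] for _ in range(n)]    # offers received by each meeting point
    request_box = [[] for _ in range(n)]  # requests received by each meeting point
    for node in range(n):
        for _ in range(offers[node]):
            offer_box[rng.randrange(n)].append(node)
        for _ in range(requests[node]):
            request_box[rng.randrange(n)].append(node)
    dates = []
    for meeting_point in range(n):
        rng.shuffle(offer_box[meeting_point])
        rng.shuffle(request_box[meeting_point])
        # zip stops at the shorter list: unmatched units are simply dropped.
        dates.extend(zip(offer_box[meeting_point], request_box[meeting_point]))
    return dates


if __name__ == "__main__":
    print(dating_round([3, 0, 2, 1], [1, 2, 0, 2], random.Random(0)))
```

Note that in this simplified version a node that both offers and requests resources may be matched with itself; a practical implementation would need to handle that case.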
We believe that this basic operation can be of great interest in many practical applications and could be used as a building block for writing efficient software on large distributed unstable platforms. We plan to demonstrate its practical efficiency for content distribution, management of large databases and distributed storage applications described in Section .
We also have ongoing work on using this dating service for the maintenance of a randomized overlay network against arbitrary arrivals and departures of nodes, and are trying to remove the requirement for the algorithm to work in a succession of rounds.
In the context of our collaboration with Yahoo!, we have presented a new algorithm for disk reconfiguration in the context of Vespa, a scalable platform for storing, retrieving, processing and searching large amounts of data developed by Yahoo! Technologies Norway. The corresponding scheduling problem is closely related to independent related tasks scheduling on heterogeneous platforms, when communication costs are taken into account, and when each task can only be processed on a prescribed set of processors. We show how to derive, from a linear programming formulation in rational numbers, an approximation algorithm whose approximation ratio is close to 1 under the conditions of use of Vespa.
As already noted in Section with the example of WCG call for proposal, the application field of Grid computing is limited by several constraints. In particular, the target application should be easy to divide into small independent pieces of work, so that each individual piece can be executed on a single node. This strongly limits the application field since in many cases, data may be too large to fit into the memory of a single node.
In this context, we would like to propose a distributed algorithm to dynamically build clusters of nodes able to process large tasks. These sets of nodes should satisfy constraints on their overall available memory and processing power, together with constraints on the maximal latency and the minimal bandwidth between any two participating nodes.
We believe that such a distributed service would make it possible to consider a much larger application field. We plan to demonstrate first its practical efficiency for the application of molecular dynamics (based on NAMD) described in more detail in Section .
In , we present a model of this problem, called the bin-covering problem with distance constraint, and we propose a distributed approximation algorithm for the case where the elements lie in a space of dimension 1. In , we describe a generic 2-phase algorithm, based on resource augmentation, whose approximation ratio is 1/3.
We also propose a distributed version of this algorithm when the metric space is
(for a small value of
D) and the
norm is used to define distances. This algorithm takes $O(4^D \log^2 n)$ rounds and $O(4^D n \log n)$ messages both in expectation and with high probability, where $n$ is the total number of hosts.
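For intuition, the one-dimensional version of the problem can be explored with a simple centralized sweep. The sketch below is not the distributed algorithm described above and comes with no approximation guarantee; it only builds disjoint groups of hosts whose total weight reaches a required capacity and whose positions span at most a given diameter. The function name and the `(position, weight)` encoding are assumptions.

```python
from collections import deque


def greedy_clusters(hosts: list[tuple[float, float]], capacity: float,
                    max_diameter: float) -> list[list[tuple[float, float]]]:
    """Left-to-right sweep over hosts given as (position, weight) pairs.

    Each returned cluster has total weight >= capacity and all of its
    positions lie within max_diameter of each other.
    """
    clusters = []
    window = deque()  # candidate hosts for the cluster being built
    total = 0
    for position, weight in sorted(hosts):
        window.append((position, weight))
        total += weight
        # Drop hosts that are too far from the newest one.
        while position - window[0][0] > max_diameter:
            total -= window.popleft()[1]
        if total >= capacity:
            clusters.append(list(window))
            window.clear()
            total = 0
    return clusters


if __name__ == "__main__":
    hosts = [(0, 2), (1, 2), (5, 1), (6, 1), (7, 3), (20, 4)]
    print(greedy_clusters(hosts, capacity=4, max_diameter=2))
```

Hosts dropped from the window are never reconsidered, which keeps the sweep linear after sorting but can miss clusters that a more careful algorithm would find.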
In many applications on large scale distributed platforms, the application data files are distributed across the platform, and the volatility of resource availability makes it impossible to rely on a centralized system to locate data.
In this context, complex queries, such as finding a node holding a given set of files, or holding a file whose index is close to a given value, or a set of (close) nodes covering a given set of files, should be treated in a distributed manner. The query mechanisms built for P2P systems are far too limited to handle such requests.
We plan to demonstrate the usefulness and efficiency of such requests on the molecular dynamics application and on the continuous integration application described in Section . Again, we strongly believe that these operations can be considered as useful building blocks for most large scale distributed applications that cannot be executed in a client-server model, and that providing a library with such mechanisms would be of great interest.
A sound approach is to structure overlay networks in such a way that they reflect the structure of the application. Peers represent objects of the application, so that neighbours in the peer-to-peer network are objects having similar characteristics from the application's point of view. Such structured peer-to-peer overlay networks provide natural support for range and complex queries. We have proposed to use complex structures such as a Voronoï tessellation, where each peer is associated with a cell in the space. Moreover, since the cost of computing and maintaining these structures is usually extremely high for dimensions larger than 2, we have proposed to weaken the Voronoï structure to deal with higher dimensional spaces.
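The association between peers and cells can be stated very compactly: a point of the space belongs to the Voronoï cell of the peer closest to it. The sketch below illustrates only this membership rule with a centralized lookup over hypothetical peer names and coordinates; it does not attempt to reproduce the overlay maintenance or routing proposed in the cited work.

```python
import math


def cell_owner(point: tuple[float, ...], peers: dict[str, tuple[float, ...]]) -> str:
    """Return the peer whose Voronoï cell contains `point` (the nearest peer)."""
    return min(peers, key=lambda name: math.dist(peers[name], point))


def objects_by_owner(objects: dict[str, tuple[float, ...]],
                     peers: dict[str, tuple[float, ...]]) -> dict[str, list[str]]:
    """Group application objects by the peer responsible for their location."""
    placement = {name: [] for name in peers}
    for label, location in objects.items():
        placement[cell_owner(location, peers)].append(label)
    return placement


if __name__ == "__main__":
    peers = {"a": (0.0, 0.0), "b": (10.0, 0.0), "c": (0.0, 10.0)}
    objects = {"x": (1.0, 1.0), "y": (9.0, 2.0), "z": (2.0, 8.0)}
    print(objects_by_owner(objects, peers))
```

Because nearby points fall into the same or adjacent cells, a range query only needs to contact the peers whose cells intersect the queried region.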
We are currently adapting the techniques proposed in these papers to the molecular dynamics application in collaboration with Juan Elezgaray from IECB.
Cyril Banino (Yahoo!, Trondheim, Norway) did his Master's degree at the University of Bordeaux in 2002 under the supervision of Olivier Beaumont and his PhD in Trondheim (N.T.N.U.). During his PhD, he worked with Olivier Beaumont on decentralized algorithms for independent tasks scheduling. This collaboration has resulted in several research visits (for a total of 5 weeks since 2003) and several joint papers (IEEE TPDS, Europar'06, IPDPS'03). He has recently been appointed at Yahoo! (Trondheim), and we have started an informal collaboration with Yahoo! Research, which led to the publication of . We now plan to establish a formal collaboration on document storage in large distributed databases, request scheduling and independent tasks distribution across large distributed platforms.
We started an informal collaboration with Xavier Hanin (Jayasoft) who has developed Xooctory
Alpage, led by Olivier Beaumont, focuses on the design of algorithms on large scale platforms. In particular, we will tackle the following problems:
Large scale distributed platforms modeling
Overlay network design
Scheduling for regular parallel applications
Scheduling for applications sharing large files.
The project involves the following INRIA and CNRS teams: Cepage, Graal, Mescal, Algorille, ASAP, LRI and LIX.
The scientific objectives of ALADDIN are to solve what are identified as the most challenging problems in the theory of interaction networks. The ALADDIN project is thus an opportunity to create a full continuum from fundamental research to applications in coordination with both INRIA projects CEPAGE and GANG.
The objective of the ADT INRIA Aladdin
The objective of USS SimGrid is to create a simulation framework that will answer (i) the need for simulation scalability arising in the HPC community; (ii) the need for simulation accuracy arising in distributed computing. The Cepage team will be involved in the development of tools to provide realistic model instantiations.
The project involves the following INRIA and CNRS teams: AlGorille, ASAP, Cepage, Graal, MESCAL, SysCom, CC IN2P3.
Travel grant, 2006-2008, on "Models and Algorithms for Scale-Free Structures", in collaboration with the Department of Computer Science, King's College London, and the Department of Computer Science, the University of Liverpool. Funded by the EPSRC. Main investigators on the UK side: Colin Cooper (King's College London) and Michele Zito (University of Liverpool). Ralf Klasing is the principal investigator on the French side.
European COST Action: "COST 293, Graal", 2004-2008. The main objective of this COST action is to elaborate global and solid advances in the design of communication networks by letting
experts and researchers with strong mathematical background meet peers specialized in communication networks, and share their mutual experience by forming a multidisciplinary scientific
cooperation community. This action has more than 25 academic and 4 industrial partners from 18 European countries. (
http://
The COST 295 is an action of the European COST program (European Cooperation in the Field of Scientific and Technical Research) inside of the Telecommunications, Information Science and
Technology domain (TIST). The acronym of the COST 295 Action, is DYNAMO and stands for "Dynamic Communication Networks". The COST295 Action is motivated by the need to supply a convincing
theoretical framework for the analysis and control of all modern large networks induced by the interactions between decentralized and evolving computing entities, characterized by their
inherently dynamic nature. (
http://
Ralf Klasing is a member of the Editorial Board of Networks, Parallel Processing Letters, Algorithmic Operations Research, and Computing and Informatics.
Olivier Beaumont
ISPDC'09 8th International Symposium on Parallel and Distributed Computing, Lisbon, Portugal
RenPar 09, Rencontre du Parallélisme, Toulouse, France, 2009
IPDPS'09 IEEE International Parallel and Distributed Processing Symposium, Rome, Italy, 2009
ICPADS'08, IEEE International Conference on Parallel and Distributed Systems, Melbourne, Victoria, Australia
ISCIS'08, International Symposium on Computer and Information Sciences (co-chair, Parallel Distributed and Grid Systems Symposium), Istanbul, Turkey
PMAA'08 International Workshop on Parallel Matrix Algorithms and Applications, Rennes, France
RenPar'08 Rencontre du Parallélisme, Le Croisic, France
HeteroPar'08 International Workshop on Algorithms, models, and tools for parallel computing on heterogeneous networks (Cork, Ireland)
Philippe Duchon
AlgoTel'08 (May 13-16, Saint-Malo, France) Rencontres Francophones sur les aspects Algorithmiques des Télécommunications
Cyril Gavoille
DISC'08 (Sep. 22-24, Arcachon, France) International Symposium on Distributed Computing
PODC'08 (August 18-21, Toronto, Canada) Annual ACM Symposium on Principles of Distributed Computing
AlgoTel '08 (May 13-16, Saint-Malo, France) Rencontres Francophones sur les aspects Algorithmiques des Télécommunications
JDIR'08 (Jan. 16-18, Villeneuve d'Ascq, France) Journées Doctorales en Informatiques et Réseaux
David Ilcinkas
AlgoTel'09 (June 16-19, Marseille, France) Rencontres Francophones sur les aspects Algorithmiques des Télécommunications
SIROCCO'09 (May 24-26, Piran, Slovenia) International Colloquium on Structural Information and Communication Complexity
Ralf Klasing
SIROCCO'09 (May 24-26, Piran, Slovenia) International Colloquium on Structural Information and Communication Complexity
Cyril Gavoille, Nicolas Hanusse, David Ilcinkas and Ralf Klasing were in charge of the organization of DISC'2008 (International Symposium on Distributed Computing).
David Ilcinkas participated in the organization of IMAGINE'2008 (workshop co-located with ICALP'2008).
Cyril Gavoille, Nicolas Hanusse, Philippe Duchon, Ralf Klasing and Nicolas Bonichon will be part of the organizing committee of EuroComb 2009, to be held in September 2009 in Bordeaux.
Ralf Klasing was in charge of the organization of the GRAAL Workshop (18/02-20/02/2008, Bordeaux). He was also a member of the Organizing Committees of STACS 2008 (21/02-23/02/2008, Bordeaux) and DISC 2008 (22/09-24/09/2008, Arcachon).
Ralf Klasing is responsible for the working group on "Distributed Algorithms" at the LaBRI.
Ralf Klasing was external Ph.D. reviewer and member of the Ph.D. committee of Peter Korteweg (Eindhoven University of Technology, The Netherlands, April 2008).
Olivier Beaumont was external Ph.D. reviewer (rapporteur) and member of the Ph.D. committee of Abhay Ghatpande (Waseda University, Japan)
Olivier Beaumont was external Ph.D. reviewer (rapporteur) and member of the Ph.D. committee of Tchimou N'Takpe (INRIA Project Team ALGORILLE, LORIA, Nancy, France).
Cyril Gavoille was external Ph.D. reviewer (rapporteur) and member of the Ph.D. committee of Morgan Seston (Université de la Méditerranée, Luminy, France).
Ralf Klasing gave an invited talk on “Searching for black-hole faults in a network using multiple agents” at the 12th COST 293 GRAAL meeting in Bertinoro, Italy.
Hubert Larchevêque gave an invited talk on “Distributed Bin Covering Problems” at the University of Neuchatel, Switzerland.
Noel Novelli (03/05/2008 – 05/05/2008) (Assistant Professor at Marseille University)
1 talk on data mining in 2008
Tobias Mömke, ETH Zürich, Switzerland (21/01/2008 – 20/02/2008, STSM DYNAMO)
collaboration on: "Algorithm design for TSP and related problems in large interaction networks."
1 talk (GT Algorithmique Distribuée 11/02/2008): "Reoptimization Problems: Hardness and Algorithms."
Leszek Gasieniec, University of Liverpool, UK and Jurek Czyzowicz, Université du Québec en Outaouais, Canada (02/05/2008 – 09/05/2008)
collaboration on: "graph exploration with small memory"
Colin Cooper, King's College London, UK (02/05/2008 – 09/05/2008, Royal Society Grant)
collaboration on: "algorithms for the web graph, peer-to-peer networks, graph exploration, black hole search"
1 talk (GT Algorithmique Distribuée): "Multiple random walks on random regular graphs."
Tomasz Radzik, King's College London, UK (05/05/2008 – 04/06/2008, Royal Society Grant + Invited Prof. Université Bordeaux 1)
collaboration on: "algorithms for the web graph, peer-to-peer networks, graph exploration, black hole search"
1 talk (GT Algorithmique Distribuée): "Locating and repairing faults in a network with mobile agents."
Miroslaw Korzeniowski, Wroclaw University of Technology, Poland (30/05/2008 – 13/06/2008, STSM DYNAMO)
collaboration on: "Dating service - a tool to cope with heterogeneity in distributed systems."
1 talk (GT Algorithmique Distribuée 09/06/2008): "Skew-CCC - degree 3 suffices to build an overlay network in a distributed way."
Colin Cooper, King's College London, UK (29/09/2008 – 28/10/2008, Royal Society Grant + Invited Prof. Université Bordeaux 1)
collaboration on: "algorithms for the web graph, peer-to-peer networks, graph exploration, black hole search"
David Ilcinkas, visiting University of Liverpool, UK (04/06 – 11/06/2008)
collaboration on: "graph exploration" and "efficient broadcasting"
Cyril Gavoille, visiting The Weizmann Institute, IL (9-14 Nov. 2008)
collaboration with David Peleg on "graph spanners"
Planned visits:
Olivier Beaumont visiting Miroslaw Korzeniowski, Wroclaw University of Technology (Feb. 2009)
The members of CEPAGE are heavily involved in teaching activities at undergraduate and graduate levels (Licence 1, 2 and 3, Master 1 and 2, Engineering Schools ENSEIRB). The teaching is carried out by members of the University as part of their teaching duties, and by CNRS researchers (at Master 2 level) as extra work. It represents more than 500 hours per year.
At Master 2 level, here is a list of courses taught over the last two years:
Olivier Beaumont
Routing and P2P Networks (last year of engineering school ENSEIRB, 2008)
Nicolas Bonichon
C++ programming (2nd year MASTER "Computer Science"- 2008, 2009)
Cyril Gavoille
Algorithm Analysis (2nd year MASTER "Models and Algorithms" - 2008)
Communication and Routing (last year of engineering school ENSEIRB 2008)
Real-World Algorithms (2nd year MASTER "Models and Algorithms" - 2008)
Ralf Klasing
Communication Algorithms in Networks (2nd year MASTER "Models and Algorithms" - 2008, 2009)