The MESCAL project is a joint project supported by CNRS, INPG, UJF and INRIA, located in the ID-IMAG laboratory (UMR 5132).
MESCAL is a new INRIA team, created in 2005. The former APACHE project was closed in 2004 and gave birth to two new teams, MOAIS and MESCAL. These two projects still have strong collaborations, which are pointed out throughout this document.
The recent evolutions in computer network technology, as well as their diversification, have led to a tremendous change in the use of these networks: applications and systems can now be designed at a much larger scale than before. This change of scale concerns the amount of data, the number of computers, the number of users, and the geographical diversity of these users.
This race towards large-scale computing questions many of the hypotheses underlying parallel and distributed algorithms and common management middleware. Tools developed for medium-sized systems cannot be run on large-scale systems without a significant degradation of their performance.
The goal of the MESCAL project is to design and validate exploitation mechanisms (middleware and system services) for large distributed infrastructures.
MESCAL's target applications are intensive scientific computations such as cellular micro-physiology, protein conformations, particle detection, combinatorial optimization, Monte Carlo simulations, and others. Such applications consist of a large set of independent, equal-sized tasks and may therefore benefit from large-scale computing platforms. Initially executed on large dedicated clusters (CRAY, IBM, COMPAQ), they have recently been deployed on collections of homogeneous clusters aggregating a large number of commodity components. Experience has shown that such clusters offer huge computing power at a very reasonable price. MESCAL's target infrastructures are aggregations of commodity components and/or commodity clusters at metropolitan, national or international scale. Examples of target infrastructures are grids obtained through the mutualization of resources available inside autonomous computing services, lightweight grids (such as the local CIMENT grid) which are limited to trusted autonomous systems, clusters of intranet resources (Condor), or aggregations of Internet resources (SETI@home, XtremWeb).
MESCAL's methodology for ensuring the efficiency and scalability of the proposed mechanisms is based on systematic modeling and performance evaluation of target architectures, software layers and applications.
Understanding qualitative and quantitative properties of distributed systems and parallel applications is a major issue. The a posteriori analysis of the behavior of the system and the design of predictive models are notoriously challenging problems.
Indeed, large distributed systems contain many different entities (processes, threads, jobs, messages, packets) with intricate interactions between them (communications, synchronizations). Analyzing the global behavior of the system requires taking large data sets into account.
As for a priori models, our current research focuses on capturing the distributed behavior of large dynamic architectures. Both formal models and numerical tools are being used to obtain predictions on the behavior of large systems.
For large parallel systems, the non-determinism of parallel composition, the unpredictability of execution times and the influence of the outside world are usually expressed in the form of multidimensional stochastic processes which are continuous in time with a discrete state space. The state space is often infinite or very large, and several specific techniques have been developed to deal with what is often termed the ``curse of dimensionality''.
MESCAL deals with this problem using several complementary tracks:
Behavior analysis of highly distributed systems,
Simulation algorithms able to deal with very large systems,
Fluid limits (used for simulation and analysis),
Decomposition of the state space,
Structural and qualitative analysis.
The development of highly distributed architectures running widely spread applications requires the elaboration of new methodologies to analyze the behavior of systems. Indeed, runtime systems on such architectures are empirically tuned. Analyses of executions are generally performed manually on post-mortem traces extracted with very specific tools. This tedious methodology is largely motivated by the difficulty of characterizing the resources of systems such as big clusters, grids or peer-to-peer (P2P) systems.
Since the advent of distributed computer systems, an active field of research has been the investigation of scheduling strategies for parallel applications. The common approach is to employ scheduling heuristics that approximate an optimal schedule. Unfortunately, it is often impossible to obtain analytical results to compare the efficiency of these heuristics. One possibility is to conduct large numbers of back-to-back experiments on real platforms. While this is possible on tightly-coupled platforms, it is infeasible on modern distributed platforms (e.g. grids or peer-to-peer environments), as it is labor-intensive and does not enable repeatable results. The solution is to resort to simulations. Simulations not only enable repeatable results but also make it possible to explore wide ranges of platform and application scenarios.
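The value of simulation for comparing heuristics can be illustrated with a minimal sketch, unrelated to any of the tools mentioned in this report: it compares the makespan obtained by a greedy minimum-completion-time heuristic against random placement of independent tasks on heterogeneous processors. All numbers and names are illustrative.

```python
import random

def mct_schedule(tasks, speeds):
    """Greedy 'minimum completion time' heuristic: place each task on the
    processor that would finish it earliest given current loads."""
    finish = [0.0] * len(speeds)
    for w in tasks:
        p = min(range(len(speeds)), key=lambda i: finish[i] + w / speeds[i])
        finish[p] += w / speeds[p]
    return max(finish)

def random_schedule(tasks, speeds, rng):
    """Baseline: assign each task to a uniformly random processor."""
    finish = [0.0] * len(speeds)
    for w in tasks:
        p = rng.randrange(len(speeds))
        finish[p] += w / speeds[p]
    return max(finish)

rng = random.Random(42)
speeds = [1.0, 2.0, 4.0]                      # relative processor speeds
tasks = [rng.uniform(1, 10) for _ in range(200)]
print("MCT  makespan:", round(mct_schedule(tasks, speeds), 1))
print("rand makespan:", round(random_schedule(tasks, speeds, rng), 1))
```

Repeating such back-to-back runs over many random platform and workload scenarios is exactly what is impractical on a real distributed platform and cheap in simulation.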
The SimGrid framework enables the simulation of distributed applications in distributed computing environments, for the specific purpose of developing and evaluating scheduling algorithms. This software is the result of a long-standing collaboration with Henri Casanova (University of California, San Diego).
Using a constructive representation of a Markovian queueing network based on events (often called GSMPs), we have designed a perfect simulation tool that computes samples distributed according to the stationary distribution of the Markov process, with no bias. Two software tools have been developed: the first analyzes a Markov chain using its transition matrix and provides perfect samples of cost functions of the stationary state; the second samples the stationary measure of Markov processes directly from the queueing network description. Some monotone networks with up to 10^50 states can be handled within minutes on a regular PC.
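The principle of perfect simulation can be sketched on a toy monotone chain, a bounded birth-death queue. The following Python sketch implements Propp-Wilson coupling from the past; it illustrates the technique only and is not the code of the tools mentioned above. Parameter values are arbitrary.

```python
import random

def update(x, u, K, p):
    """Monotone update rule of a bounded birth-death chain (M/M/1/K-like):
    an arrival with probability p, a departure otherwise."""
    return min(x + 1, K) if u < p else max(x - 1, 0)

def perfect_sample(K=20, p=0.45, seed=0):
    """Coupling from the past: run the chain from the minimal (0) and
    maximal (K) states starting at time -T with the SAME random numbers;
    if both trajectories coalesce by time 0, the common value is an exact
    (unbiased) sample of the stationary distribution."""
    rng = random.Random(seed)
    noise = []            # noise[t] drives the step ending at time -t
    T = 1
    while True:
        while len(noise) < T:
            noise.append(rng.random())   # extend further into the past
        lo, hi = 0, K
        for t in range(T - 1, -1, -1):   # from time -T up to time 0
            lo = update(lo, noise[t], K, p)
            hi = update(hi, noise[t], K, p)
        if lo == hi:
            return lo                    # coalesced: exact sample
        T *= 2                           # go further back in the past

samples = [perfect_sample(seed=s) for s in range(100)]
print("mean of 100 exact samples:", sum(samples) / len(samples))
```

Monotonicity is what makes this tractable: it suffices to couple the extremal trajectories instead of one trajectory per state, which is the same structural property exploited for the large monotone networks mentioned above.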
Web caches as well as peer-to-peer systems must be able to serve a set of customers which is both large (several tens of thousands) and highly volatile (with short connection times). These features make analysis difficult when classical approaches (such as Markovian models or simulation) are used. We have designed simple fluid models to get rid of one dimension of the problem. This approach has been applied to several systems of web caches (such as Squirrel) and to peer-to-peer systems (such as BitTorrent). This helps to get a better understanding of the behavior of the system and to solve several optimization problems.
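As an illustration of the fluid approach, the following sketch integrates a simple two-population fluid model of a BitTorrent-like system, in the spirit of Qiu and Srikant's model: instead of tracking tens of thousands of individual peers, two ordinary differential equations track the populations of downloaders and seeds. The function name and all parameter values are illustrative.

```python
def p2p_fluid(lam=1.0, mu=0.5, c=2.0, eta=0.8, gamma=0.2, dt=0.01, horizon=200.0):
    """Euler integration of a fluid model of a BitTorrent-like system:
    x(t) downloaders ('leechers') and y(t) seeds.  The completion rate is
    limited either by download capacity c*x or by the aggregate upload
    capacity mu*(eta*x + y).  lam is the arrival rate of new downloaders
    and gamma the departure rate of seeds."""
    x = y = 0.0
    for _ in range(int(horizon / dt)):
        done = min(c * x, mu * (eta * x + y))  # download completion rate
        x += (lam - done) * dt                 # arrivals minus completions
        y += (done - gamma * y) * dt           # completions minus departures
    return x, y

x, y = p2p_fluid()
print(f"steady state: x = {x:.3f} leechers, y = {y:.3f} seeds")
```

With these values the system settles at x = lam/c = 0.5 and y = lam/gamma = 5: the fixed point can be read directly from the equations, which is precisely the kind of insight that is hard to extract from a huge Markov chain.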
The first class of models we use is continuous-time Markov chains (CTMCs). The usefulness of Markov models is undisputed, as attested by the large number of modeling tools implementing Markov solvers. However, their practical application is limited by the state-space explosion problem, which puts excessive demands on memory and execution time when studying large real-life systems. Continuous-time Stochastic Automata Networks describe a system as a set of subsystems that interact. Each subsystem is modeled by a stochastic automaton, and rules between the states of the automata describe the interactions between subsystems. The main challenge is to compute the asymptotic (or transient) behavior of the system without ever generating the whole state space. Several techniques have been developed in our group based on bounds, lumpability, symmetry and properties of the Kronecker product. Most of them have been integrated in a software tool (PEPS), which is openly available.
As seen before, the interaction of several processes through synchronization, competition or superposition within a distributed system is the source of the main difficulties, because it induces a state-space explosion and non-linear dynamic behavior. Here, the use of exotic algebras, such as (min,max,plus), can help. Highly synchronous systems become linear in this framework and are therefore amenable to formal solutions in simple cases. More complicated systems are linear neither in (max,plus) nor in the classical algebra. Several qualitative properties (sub-additivity, monotonicity or convexity) have been established for a large class of such systems called free-choice Petri nets. Such qualitative properties are sometimes enough to identify the class of routing policies optimizing the global behavior of the system. They are also useful for designing efficient numerical tools that compute the asymptotic behavior.
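A minimal (max,+) example: iterating x(k+1) = A (x(k) in the (max,+) sense, where A[i][j] is the delay between successive firings of synchronized events, the asymptotic growth rate of the firing dates is the (max,+) eigenvalue of A, i.e. the maximum mean weight over the cycles of A (here max(3, 4, (7+2)/2) = 4.5). The matrix below is purely illustrative.

```python
def maxplus_matvec(A, x):
    """(max,+) matrix-vector product: y[i] = max_j (A[i][j] + x[j]).
    If x[j] is the date of the k-th firing of event j and A[i][j] the
    delay imposed on event i by event j, then y[i] is the date of the
    (k+1)-th firing of event i."""
    return [max(a + b for a, b in zip(row, x)) for row in A]

# Two synchronized machines; entries are illustrative delays.
A = [[3.0, 7.0],
     [2.0, 4.0]]

x = [0.0, 0.0]
for _ in range(40):              # let the transient die out
    x = maxplus_matvec(A, x)
x_ref = list(x)
for _ in range(10):              # average over 10 further firings
    x = maxplus_matvec(A, x)
cycle_time = (x[0] - x_ref[0]) / 10
print("asymptotic cycle time:", cycle_time)   # → 4.5
```

The synchronized system, non-linear in classical algebra, is linear here, and its throughput (1/cycle_time) follows from a simple spectral computation.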
As already mentioned, the race towards large-scale computing questions many of the hypotheses underlying parallel and distributed algorithms and common management middleware. Most distributed systems deployed nowadays are characterized by the high dynamism of their entities (participants can join and leave at will), the potential instability of the large-scale networks on which concurrent applications are running, and an increasing probability of failure. Therefore, as the size of the system increases, it must adapt automatically to changes in its components, which requires a self-organization of the system with respect to the arrival and departure of participants, data, or resources.
As a consequence, it becomes crucial to understand and model the behavior of large-scale systems in order to exploit these infrastructures efficiently. In particular, it is essential to design dedicated algorithms and infrastructures able to handle a large number of users and/or a large amount of data.
MESCAL deals with this problem using several complementary tracks:
Fairness in large-scale distributed systems,
Deployment and management tools,
Scalable batch scheduler for clusters and grids.
Large-scale distributed platforms (grid computing platforms, enterprise networks, peer-to-peer systems) result from the collaboration of many people. Thus, the scaling evolution we are facing concerns not only the amount of data and the number of computers, but also the number of users and the diversity of their behavior. In a high-performance computing framework, the rationale behind this joining of forces is that most users need a larger amount of resources than they have on their own. Some only need these resources for a limited amount of time. Conversely, some others need as many resources as possible but have no particular deadlines. Some may have mainly tightly-coupled applications while others may have mostly embarrassingly parallel applications. This variety of user profiles makes resource sharing a challenge. However, resources have to be fairly shared between users, otherwise users will leave the group and join another one. Large-scale systems therefore have a real need for fairness, and this notion is missing from classical scheduling models.
The MESCAL project studies and develops a set of tools designed to help the installation and use of a cluster of PCs. The first version was developed for the exploitation of the icluster1 platform. The main tools are a scalable tool for cloning nodes (KA-Deploy) and a parallel launcher based on the Taktuk project (now developed by the MOAIS project). The use of the first versions raised many interesting issues, among which environment deployment, robustness and batch-scheduler integration. A second generation of these tools is thus under development to meet these requirements.
The new KA-Deploy has been retained as the primary deployment tool for the experimental national grid GRID'5000.
This software is co-developed with the MOAIS project.
Most known batch schedulers (PBS, LSF, Condor, ...) follow a dated, monolithic design intended to fulfill most exploitation needs. This results in systems of high software complexity (150,000 lines of code for OpenPBS), offering a growing number of functions that are, most of the time, not used. In such a context, it becomes hard to control both the robustness and the scalability of the whole system.
OAR is an attempt to address these issues. Firstly, OAR is written in a very high-level language (Perl) and makes intensive use of high-level tools (MySQL and Taktuk), resulting in concise code (around 5,000 lines) that is easy to maintain and extend. This small code base, as well as the choice of widespread tools (MySQL), are essential elements that ensure the strong robustness of the system. Secondly, OAR uses SQL requests to perform most of its job-management tasks, thereby taking advantage of the strong scalability of most database management tools. This scalability is further improved in OAR by using Taktuk to manage the nodes themselves.
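The idea of delegating job-management logic to the database can be sketched with a toy example (using SQLite from the Python standard library so that the sketch is self-contained; OAR actually relies on MySQL, and its real schema is much richer than this hypothetical one):

```python
import sqlite3

# Hypothetical minimal job table -- not OAR's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
                  id        INTEGER PRIMARY KEY,
                  user      TEXT,
                  nb_nodes  INTEGER,
                  state     TEXT DEFAULT 'Waiting',
                  submitted INTEGER)""")
conn.executemany(
    "INSERT INTO jobs (user, nb_nodes, submitted) VALUES (?, ?, ?)",
    [("alice", 4, 100), ("bob", 2, 101), ("alice", 8, 102)])

free_nodes = 6
# FIFO selection of the next runnable job expressed as a single query:
# the scheduling bookkeeping is pushed into the database engine.
row = conn.execute(
    """SELECT id, user, nb_nodes FROM jobs
       WHERE state = 'Waiting' AND nb_nodes <= ?
       ORDER BY submitted LIMIT 1""", (free_nodes,)).fetchone()
conn.execute("UPDATE jobs SET state = 'Running' WHERE id = ?", (row[0],))
print("launched:", row)   # → launched: (1, 'alice', 4)
```

Because job state lives in the database, robustness (crash recovery, concurrent access) and scalability come largely for free from the database engine rather than from ad hoc scheduler code.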
Current development in OAR focuses on its extension to grids and on advanced scheduling techniques. The extension of OAR to grids has already started with support for best-effort jobs. The integration of advanced scheduling techniques is in progress and aims at adding both state-of-the-art batch-scheduling algorithms and new task models to the system.
Making a distributed system reliable has been and remains an active research domain. Nonetheless, this has not so far led to results usable in an intranet or federated architecture for computing. Most proposals address only a given application or service. This may be because, until clusters and intranet architectures arose, it was obvious that client and server nodes were independent: a fault or a predictable disconnection on most of the nodes did not lead to a complete failure of the system. This is not the case in parallel scientific computing, where a fault on one node can lead to data loss on thousands of other nodes. The reliability of the system is hence a crucial point. MESCAL's work on this topic is based on the idea that each process of a parallel application is executed by a group of nodes instead of a single node: when the node in charge of a process fails, another node in the same group can replace it, transparently for the application.
There are two main problems to solve in order to achieve this objective. The first is the ability to migrate processes of a parallel, and thus communicating, application without requiring modifications to it. The second is the ability to maintain a group structure in a completely distributed way. The first relies on a close interaction with the underlying operating systems and networks, since processes can be migrated in the middle of a communication. This can only be done by knowing how to save, and later replay, all ongoing communications. Freezing a process to restore it on another node is also an operation that requires the collaboration of the operating system and a good knowledge of its internals. The other main problem (keeping a group structure) belongs to the distributed-algorithms domain and is of a much higher-level nature.
The software resulting from this research topic is called SAMORY; it is able to keep a set of processes alive on a given set of nodes, even in the presence of hardware or software faults. It has been used on seismic applications (wave propagation) as part of the RNTL IGGI project, and is freely available at http://iggi.imag.fr. It is composed of a Linux kernel module and a daemon that monitors the processes and their communications and reacts when a fault is detected or suspected.
Future work will concern the behavior analysis of checkpointing systems, in order to predict critical operations precisely and optimize resource usage (network and disk bandwidth).
In order to use large data sets, it is necessary (but not always sufficient, as seen later) to store them efficiently and transfer them to a given site (a set of nodes) where they are going to be used. The first step toward this achievement is a file system that extends NFS to the grid environment. The second step is an efficient transfer tool providing throughput close to optimal (i.e. the capacity of the underlying hardware).
NFSp is a distributed file system for clusters that enables data to be stored over a set of nodes (instead of a single one). It was designed to allow a set of disks to be used in order to optimize memory allocation. It is important for performance and simplicity that this new file system incur little overhead for accesses and updates. From the user's point of view, it is used just like a classical NFS. From the server's point of view, however, the storage is distributed over several nodes (possibly including the users' nodes).
The mounting point is only in charge of the meta-data of the files (name, owner, access permissions, size, inodes, etc.), while their content is stored on separate nodes. Every read or write request is received by the meta-server (the mounting point), which forwards it to the relevant storage nodes, called IODs (Input/Output Daemons); these serve the request and send the result to the client.
Two implementations were produced, one at the user level and one at the kernel level. Read performance is good: for example, 150 MB/s for 16 IODs connected through 100 Mb/s links, serving 16 clients. Write performance is limited by the bandwidth available to the meta-server, which is a significant bottleneck.
Distributing storage over a large set of disks raises a reliability problem: more disks mean a higher fault rate. To address this problem we introduced redundancy on the IODs (the storage nodes) in NFSp by defining VIODs (Virtual IODs): a VIOD is a set of IODs that contain exactly the same data. When an IOD fails, another one can serve the same data, so continuity of service is ensured. This does not modify the way the file system is used by the clients: distribution and replication remain transparent. Several consistency protocols are proposed, with various levels of performance; they all enforce at least the NFS consistency expected by the client.
To transfer files efficiently across a grid, a ``beowulf-like'' solution consists in creating a set of point-to-point connections to parallelize the transfer of a file or a set of files. This approach was chosen, for instance, in GridFTP. It implies duplicating the data to transfer, or distributing them over separate nodes before the transfer begins. We use the distributed-storage property of NFSp to perform parallel transfers transparently. However, since a grid is heterogeneous from both a hardware and a software point of view, we decided to build our own solution in a generic way: it can be used with any kind of data server (SAN, local file systems, NFS or NFSp). The component in charge of transfers across the grid is called Gxfer (Grid Transfer); its goal is to copy files between sites. A copy is done in parallel if both sender and receiver can handle it and have distributed-storage capability. Gxfer can be used as an external program, in which case it behaves like the classic scp command, or as a library inside an application.
Gxfer's performance is good: a 1 GB file is transferred in less than 10 seconds (9.6 s) between sites in Grenoble and Lyon connected by a 1 Gb/s link, with NFSp servers on both sides. Further experiments exhibited good scaling properties.
Applications in the fields of numerical simulation, image synthesis, and processing are typical of the user demand for high performance computing. In order to confront our proposed solutions for parallel computing with real applications, the project is involved in collaborations with end-users to help them to parallelize their applications.
This joint work involves the GRAAL project.
The problem of searching large-scale genomic sequence databases is an increasingly important bioinformatics problem. We have obtained results on the deployment of such applications in heterogeneous parallel computing environments. These results are based on the analysis of the GriPPS protein comparison application. The GriPPS framework is based on large databases of information about proteins; each protein is represented by a string of characters denoting the sequence of amino acids of which it is composed. Biologists need to search such sequence databases for specific patterns that indicate biologically homologous structures. The GriPPS software enables such queries in grid environments, where the data may be replicated across a distributed heterogeneous computing platform.
In fact, this application is a part of a larger class of applications, in which each task in the application workload exhibits an ``affinity'' for particular nodes of the targeted computational platform. In the genomic sequence comparison scenario, the presence of the required data on a particular node is the sole factor that constrains task placement decisions. In this context, task affinities are determined by location and replication of the sequence databanks in the distributed platform.
Such biological sequence comparison algorithms are, however, typically computationally intensive, embarrassingly parallel workloads. In the scheduling literature, this computational model is effectively a divisible workload scheduling problem with negligible communication overheads. This framework has enabled us to propose online scheduling algorithms whose output is fair and efficient: the slowdown experienced by every user due to the load incurred by the others is as uniform as possible.
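The fairness criterion (uniform slowdown) can be illustrated on a single unit-speed resource: compared with first-come-first-served, an egalitarian processor-sharing discipline keeps the slowdowns experienced by jobs of very different sizes much more uniform. This sketch is purely illustrative and is not the project's actual scheduling algorithm.

```python
def fcfs_slowdowns(sizes):
    """Slowdown (stretch) of each job under first-come-first-served on a
    unit-speed resource: completion time divided by processing time."""
    t, out = 0.0, []
    for w in sizes:
        t += w
        out.append(t / w)
    return out

def ps_slowdowns(sizes):
    """Slowdown under egalitarian processor sharing, all jobs present at
    time 0: jobs finish in size order, the resource being equally shared
    among the jobs still alive."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    out = [0.0] * len(sizes)
    t, done, alive = 0.0, 0.0, len(sizes)
    for i in order:
        t += (sizes[i] - done) * alive   # time to finish the next-smallest job
        out[i] = t / sizes[i]
        done = sizes[i]
        alive -= 1
    return out

sizes = [10.0, 1.0]                      # one big job, one small job
print("FCFS slowdowns:", fcfs_slowdowns(sizes))   # → [1.0, 11.0]
print("PS   slowdowns:", ps_slowdowns(sizes))     # → [1.1, 2.0]
```

Under FCFS the small job suffers a slowdown of 11 while the big one is unaffected; under processor sharing the slowdowns are 1.1 and 2, i.e. far more uniform, which is the property the online algorithms above aim to optimize.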
This joint work involves the UMR 8504 Géographie-Cité, LSR-IMAG, UMS RIATE and the Maisons de l'Homme et de la Société.
Improvements in Web technologies have opened new perspectives in interactive cartography. Nevertheless, existing architectures have difficulties performing spatial-analysis methods that require complex calculations over large data sets. This situation limits the query capabilities and analysis methods offered to users. The HyperCarte consortium, with LSR-IMAG, Géographie-cité and UMR RIATE, proposes innovative solutions to these problems. Our approach deals with various areas related to the spatial organization of social phenomena, such as spatio-temporal modelling, parallel computing and cartographic visualization.
Nowadays, analyses are done on huge heterogeneous data sets. For example, demographic data sets at NUTS 5 level represent more than 100,000 territorial units with 40 social attributes. Many spatial-analysis algorithms, in particular potential analysis, are quadratic in the size of the data set. Adapted methods are therefore needed to provide ``user real time'' analysis tools.
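The quadratic cost of potential analysis is easy to see from a naive implementation: every territorial unit receives a distance-discounted contribution from the mass of every other unit. The exponential kernel, the toy coordinates and the parameter names below are illustrative, not those of the HyperCarte tools.

```python
import math

def potentials(units, scope):
    """Naive O(n^2) potential computation: each territorial unit receives
    a distance-discounted contribution from the mass of every unit
    (itself included).  'scope' is the span of the smoothing kernel."""
    out = []
    for x1, y1, _ in units:
        v = sum(m * math.exp(-math.hypot(x1 - x2, y1 - y2) / scope)
                for x2, y2, m in units)
        out.append(v)
    return out

# Three units given as (x, y, mass) -- toy values, not real NUTS data.
units = [(0.0, 0.0, 5.0), (1.0, 0.0, 1.0), (0.0, 2.0, 2.0)]
print([round(v, 3) for v in potentials(units, scope=1.0)])
```

With 100,000 units the double loop performs 10^10 kernel evaluations, which is why adapted (parallel or approximate) methods are needed for interactive use.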
Numerical modelling of seismic wave propagation in complex three-dimensional media is an important research topic in seismology. Several approaches will be studied, and their suitability with respect to the specific constraints of NUMA architectures shall be evaluated. These modelling approaches will rely on modern numerical schemes such as spectral elements, high-order finite differences or finite elements applied to realistic 3D models. The NUMASIS project (see Section ) will focus on issues related to parallel algorithms (distribution, scheduling) in order to optimize computations based on such numerical schemes by taking advantage of execution frameworks developed for NUMA architectures.
These approaches will be tested and validated on applications related to seismic risk assessment. Recent seismic events, such as those in Asia, have highlighted the crucial research and development needs in this field. Some regions in France may also be prone to such risks (French Riviera, Alps, French Antilles, ...) and the experiments in the NUMASIS project will be carried out using some of the available data from these regions.
The CIMENT project (Intensive Computing, Numerical Modeling and Technical Experiments, http://ciment.ujf-grenoble.fr/) gathers a wide scientific community involved in numerical modeling and computing (from numerical physics and chemistry to astrophysics, mechanics, biomodeling and imaging) and the distributed computer science teams from Grenoble. Among these various application domains, there is a huge demand for managing the execution of large sets of independent jobs, each set containing between 10,000 and 100,000 jobs. Providing a middleware able to steer such a number of jobs is a challenge. The CiGri middleware project addresses this issue in a grid infrastructure.
The aim of the CiGri project is to gather the unused computing resources of an intranet infrastructure and to make them available for large-scale applications. This grid is based on two software tools. The CiGri server software is based on a database and offers a user interface for launching grid computations (scripts and web tools). It interacts with the computing clusters through batch-scheduler software. CiGri is compatible with classical batch systems like PBS, but an efficient batch system (OAR, http://oar.imag.fr/) has been developed by the MESCAL and MOAIS projects for easy integration and experimentation with scheduling tools.
Large clusters and grids reveal serious limitations in many basic system software components. Indeed, launching a parallel application is a slow and costly operation in heterogeneous configurations, and the broadcast of data and executable files is largely left to the users. Available tools do not scale because they are implemented sequentially, mainly as a single sequence of commands applied over all the cluster nodes. In order to reach a high level of scalability, we propose a new design approach based on parallel execution. We have implemented a parallelization technique based on spanning trees, with a recursive start of programs on the nodes. Industrial collaborations were carried out with Mandrake, BULL, HP and Microsoft.
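The benefit of the spanning-tree approach can be quantified with a back-of-the-envelope model: if every node that is already running can in turn launch one further node per round, the number of rounds grows logarithmically in the cluster size instead of linearly. This is an illustrative model only; the real launcher also overlaps connections and transfers.

```python
def sequential_rounds(n):
    """A sequential launcher issues one remote start after another."""
    return n

def tree_rounds(n):
    """Recursive spanning-tree start: in each round, every node already
    running launches one further node, so the running count doubles."""
    started, rounds = 1, 0
    while started < n:
        started *= 2
        rounds += 1
    return rounds

for n in (16, 256, 1000):
    print(f"{n:5d} nodes: {sequential_rounds(n)} sequential steps "
          f"vs {tree_rounds(n)} tree rounds")
```

For a 1,000-node cluster the sequential launcher needs 1,000 steps while the tree needs only 10 rounds, which is the scalability gap motivating the parallel design.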
KA-Deploy is an environment deployment toolkit that provides automated software installation and reconfiguration mechanisms for large clusters and light grids. The main contribution of the KA-Deploy toolkit is a simple idea, aiming to become a new trend in cluster and grid exploitation: letting users concurrently deploy computing environments exactly fitted to their experimental needs on different sets of nodes. To reach this goal, KA-Deploy must cooperate with batch schedulers, like OAR, and use a parallel launcher like Taktuk (see below).
Taktuk is a tool to launch or deploy parallel applications efficiently on large clusters and simple grids. Efficiency is obtained by overlapping all independent steps of the deployment. We have shown that this problem is equivalent to the well-known single-message broadcast problem. The performance gap between the cost of a network communication and that of a remote execution call enables us to use a work-stealing algorithm to obtain a near-optimal schedule of remote execution calls. A complete rewrite based on a high-level language (namely Perl) is currently in progress, the aim being a light and robust implementation. This development is led by the MOAIS project.
When deploying a cluster of PCs, there is a lack of tools giving a global view of the available space on the drives, which leads to suboptimal use of most of this space. To address this problem, NFSp was developed as an extension to NFS that divides file-system handling into two components: one responsible for the stored data and the other for the metadata (inodes, access permissions, etc.). The metadata are handled by a fully NFS-compliant server, which contacts the associated data servers to access the contents of the files. This approach provides full client-side compatibility with NFS, the standard for distributed file systems, while permitting the use of the space available on the cluster nodes. Moreover, bandwidth is used efficiently because several data servers can send data to the same client node, which is not possible with a usual NFS server. The prototype has now reached a mature state.
As the use of clusters grows, many scientific applications (biology, climatology, nuclear physics, ...) have been rewritten to fully exploit the extra CPU power and storage capacity. This kind of software uses and creates huge amounts of data, with typical parallel I/O access patterns. Several issues already known in a local context (on SMP nodes for example), such as out-of-core limitations or efficient parallel input/output access, have to be handled in a distributed environment such as a cluster. The local hardware facilities that reduce response times and access constraints on an SMP node cannot provide optimal performance with respect to the CPU and network power available in a cluster. Several solutions have been proposed by the scientific community to handle these issues, such as parallel file systems or parallel I/O libraries, but their specific APIs limit portability and require a good knowledge of their internal mechanisms.
We have designed aIOLi, an efficient I/O library for parallel access to remote storage in SMP clusters. Thanks to SMP kernel features, our framework provides parallel I/O without inter-process synchronization mechanisms, as well as a simple interface based on the classic UNIX system calls (create/open/read/write/close). The aIOLi solution allows us to achieve performance close to the limits of the remote storage system. This was done in several steps:
build a local framework that aggregates requests at the application level. This is done by inserting a layer between the application and the kernel that delays individual requests in order to merge them and thus improve performance. The key factor here is the delay: it should be large enough to discover aggregation patterns, yet bounded to avoid excessive latency. This is achieved by constraining the delay between a minimum and a maximum value that the aIOLi layer has to respect.
schedule all I/O requests on a cluster globally, in order to avoid the congestion on a server that leads to bad performance. The idea was to set up a server that, knowing every I/O operation planned in the system, could schedule them all. This solution did not work: false predictions of request durations, which are necessary to schedule requests without overlaps, led to a suboptimal use of the server and thus low performance.
schedule I/O requests locally on the server, so that aggregation and interleaving of client requests can be used to improve performance. For that, aIOLi had to be ported to the kernel and placed both at the VFS level and at the lower file-system level.
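The aggregation step described above can be sketched as follows: requests collected during the waiting window are sorted, and contiguous or overlapping byte ranges are merged into fewer, larger requests (illustrative Python, not the kernel implementation):

```python
def aggregate(requests):
    """Merge contiguous or overlapping (offset, length) requests collected
    during the waiting window into fewer, larger requests."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # The new range touches or overlaps the previous one: extend it.
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_len, off + length - last_off))
        else:
            merged.append((off, length))
    return merged

# Four 4 KiB requests issued by different processes within the window:
reqs = [(0, 4096), (8192, 4096), (4096, 4096), (20480, 4096)]
print(aggregate(reqs))   # → [(0, 12288), (20480, 4096)]
```

Three small contiguous reads become a single 12 KiB request, trading a bounded extra delay for far fewer, larger operations on the storage server.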
The results have been impressive: without any modification of the applications, aIOLi gives better performance than the best MPI-IO implementation, sometimes by a factor of 4. aIOLi can be downloaded from http://aioli.imag.fr, in both the user-library and the Linux-kernel-module versions.
SAMORY is an architecture providing resiliency to parallel applications running on top of virtual clusters, typically built from an intranet or an enterprise network.
SAMORY is a runtime aiming at providing resiliency to high-performance computing applications running on a virtual cluster, typically the hosts of an intranet. It is composed of a Linux kernel module that must be loaded at runtime and of a distributed architecture for monitoring, checkpointing and restarting communicating processes. The size of the replication group of a process, that is, the number of copies kept of it, is a parameter that can be set, and modified, at runtime depending on the availability of the hosts. Communications between processes are also taken into account by SAMORY: when a process fails, all pending communications are transparently redirected to the site where a backup of it resumes.
The main advantage of SAMORY is its total transparency with respect to the applications: starting the monitoring of an application simply consists in telling the runtime to do so, without changing the application in any way, and this can be done at runtime. Checkpointing hinders performance, but to a reasonable extent: the cost of checkpointing a process is directly proportional to the amount of memory it uses, with, for example, a checkpoint time of 400 ms for a 100 MB process on a PC with a Pentium 4 at 2 GHz and 512 MB of RAM. Virtual memory management, and the occurrence of page faults, increases these values for large processes (10 s for a 400 MB process). By saving only the state-related part of the process, excluding for example code and shared libraries, this cost can be reduced by 75%. The time necessary to restart a process is low, typically milliseconds, since most data is loaded lazily: execution can resume quickly but will generate page faults later. Since the final restart time heavily depends on the behavior of the application, it is not possible to give generic performance results for this step. The overhead on communications is negligible for small amounts of data and becomes significant for messages around 8 MB.
This software was formerly developed by members of the Apache project. Even though no active research effort is devoted to this software anymore, many members of the MESCAL project use it in their everyday research and promote its use. The software is now mainly maintained by Benhur Stein from the Federal University of Santa Maria (UFSM), Brazil.
Paje allows application programmers to define what is visualized and how new objects should be drawn. To achieve such flexibility, the hierarchy of events and the visualization commands may be defined by the programmers inside the applications. The visualization of the parallel execution of Athapascan applications was achieved without any addition to the Paje software itself: inserting a few trace events into the Athapascan runtime allows the visualization of different facets of the program, not only the application computation time but also the user task graph management and the scheduling of these tasks. Paje is also used, among others, to visualize Java program executions and to monitor large clusters. Paje is actively used by the SimGrid users' community and the NUMASIS project (see Section ).
OAR is a batch scheduler that emphasizes simplicity, extensibility, modularity, efficiency, robustness and scalability. It is based on a high-level design that drastically reduces its software complexity. Its internal architecture is built on top of two main components: a generic and scalable tool for cluster administration (launching, node administration, ...) and a database as the only way to share information between its internal modules. Completely written in Perl, OAR is also extremely modular and straightforward to extend. It thus constitutes a privileged platform on which to develop and evaluate scheduling algorithms and new kinds of services.
Most well-known batch schedulers (PBS, LSF, Condor, ...) follow an older, monolithic design intended to fulfill most exploitation needs. This results in systems of high software complexity (150,000 lines of code for OpenPBS), offering a growing number of functions that are, most of the time, not used. In such a context, it becomes hard to control both the robustness and the scalability of the whole system.
The OAR project focuses on robust and highly scalable batch scheduling for clusters and grids. Its main objectives are the validation of grid administration tools such as Taktuk, the development of new paradigms for grid scheduling, and the experimentation of various scheduling algorithms and policies.
The grid development of OAR has already started with the integration of best-effort jobs, whose purpose is to take advantage of idle resources. Managing such jobs requires support from the whole system, from the highest level (the scheduler has to know which tasks can be cancelled) down to the lowest level (the execution layer has to be able to cancel awkward jobs). Thanks to its highly modular architecture, OAR is perfectly suited to such developments. Moreover, this development is used in the CiGri grid middleware project.
The OAR system can also be viewed as a platform for experimenting with new scheduling algorithms. Current developments focus on integrating theoretical batch scheduling results into the system so that they can be validated experimentally.
SimGrid implements realistic fluid network models that enable very fast yet precise simulations. SimGrid enables the simulation of distributed scheduling agents, which has become critical for current scheduling research on large-scale platforms.
Sources and documentation of SimGrid are available at http://simgrid.gforge.inria.fr/.
Psi and Psi2 are two software packages implementing perfect simulation of Markov chain stationary distributions using the coupling-from-the-past technique. Psi starts from the transition kernel to derive the simulation program, while Psi2 uses a monotone constructive definition of a Markov chain. They are available at http://www-id.imag.fr/Logiciels/psi/.
The main objective of PEPS is to facilitate the solution of large discrete-event systems in situations where classical methods fail. PEPS may be applied to the modeling of computer systems, telecommunication systems, road traffic, or manufacturing systems. The software is available at http://www-id.imag.fr/Logiciels/peps/.
The Hyperatlas software has been jointly developed with LSR-IMAG in the framework of parts 3.1 and 3.2 of the ESPON European project. It provides visualization and analysis of socio-economic data in Europe at NUTS 1, NUTS 2 or NUTS 3 level, including the analysis of dependence and spatial interaction. This software is available for European partners at http://www-lsr.imag.fr/HyperCarte/.
As explained in Section , some systems cannot be modeled with classical approaches because of their size and their dynamics. By mixing fluid models and discrete models, it is possible to alleviate the combinatorial explosion of such systems. This year, we have successfully used this approach in the two following settings.
This is a collaborative work with Landy Rabehasaina (LMC Laboratory).
We used recent results on admission control to solve problems coming from fluid flow models and from risk theory in . More precisely, we consider a stochastic process Q(t) satisfying a linear differential equation or a stochastic differential equation, driven by a jump process which is modulated by a binary sequence. We prove multi-modular properties related to the process Q(t) and show that the expectation of a functional of the jump process is minimized by a randomized bracket sequence, when that sequence has to satisfy a constraint on its Cesàro limit. Applications related to optimal strategies in storage system models and to ruin problems are given.
In , we introduce a rather general hybrid system made of deterministic differential equations and random discrete jumps. We then show how to construct a perfect simulation, i.e., a simulation providing samples of the system state whose distribution is exactly the asymptotic distribution of the system, without bias, using coupling from the past.
The applicability of the method is then illustrated on peer-to-peer systems. We model two such systems, a cooperative web cache (called Squirrel) and a P2P file-sharing system, by hybrid systems with a continuous part corresponding to a fluid limit of files and a discrete part corresponding to customers.
An experimental study is then carried out to show the influence that the different parameters (such as time-to-live, request rate, connection time) have on the behavior of large peer-to-peer systems, and also to show the effectiveness of this approach for the numerical solution of stochastic hybrid systems.
Perfect simulation makes it possible to compute samples distributed according to the stationary distribution of a Markov process with no bias. The following sections summarize the various new results obtained using, or about, this technique.
Simulations of Markov chains are usually based on an algorithmic representation of the chain. This corresponds to a stochastic recurrence equation and can be interpreted as a random iterated function system (RIFS). In particular, for the perfect simulation of Markov chains, the structure of the RIFS has a deep impact on the execution time of the simulation. Links between the structure of the RIFS and the coupling time of the algorithm are detailed in . Conditions for coupling and upper bounds on the simulation time are given for Doeblin matrices. Finally, it is shown that aliasing techniques build an RIFS with a particular binary structure.
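The coupling-from-the-past scheme underlying these results can be illustrated on a toy monotone chain. The sketch below is illustrative only (it is not the Psi implementation): the chain is written as a stochastic recurrence x(k+1) = phi(x(k), u(k)), and the two extremal trajectories sandwich all others, so their coalescence yields an exact stationary sample.

```python
import random

# Monotone coupling from the past for a random walk on {0, ..., N}.
# phi is a monotone update rule: the same random input u moves every
# state in the same direction, so the min and max trajectories bound all others.

N = 10

def phi(x, u):
    return min(N, x + 1) if u < 0.5 else max(0, x - 1)

def cftp(seed=42):
    rng = random.Random(seed)
    inputs = []          # random inputs, reused when going further into the past
    horizon = 1
    while True:
        while len(inputs) < horizon:
            inputs.append(rng.random())
        lo, hi = 0, N    # extremal initial states at time -horizon
        for u in reversed(inputs[:horizon]):   # run forward to time 0
            lo, hi = phi(lo, u), phi(hi, u)
        if lo == hi:     # coalescence: exact sample of the stationary law
            return lo
        horizon *= 2     # double the horizon into the past and retry

print(cftp())  # an unbiased sample from the stationary distribution
```

Note that the same random inputs must be reused when the horizon is extended backward; drawing fresh inputs at each attempt would bias the sample.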
By combining coupled trajectories with the perfect simulation scheme, it is possible to reduce the variance of the samples. This technique is presented in . The difficulty is to find the number of antithetic variates that minimizes the error for a fixed amount of computation. The method has been tested on queueing network models within the Psi2 simulation framework.
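The antithetic-variates idea can be demonstrated on a toy Monte Carlo estimator (this is a generic illustration of the principle, not the method of the paper): pairing each uniform draw u with 1-u induces negative correlation between the two evaluations and reduces the variance of the estimator for monotone functionals, at the same computational cost.

```python
import random

def crude_estimate(f, n, rng):
    """Plain Monte Carlo estimate of E[f(U)] with n independent draws."""
    return sum(f(rng.random()) for _ in range(n)) / n

def antithetic_estimate(f, n, rng):
    """Same budget of n evaluations, but organized as n/2 antithetic pairs."""
    total = 0.0
    for _ in range(n // 2):
        u = rng.random()
        total += f(u) + f(1.0 - u)   # u and 1-u are negatively correlated
    return total / n

f = lambda u: u * u      # E[f(U)] = 1/3 for U uniform on [0, 1]
rng = random.Random(0)
runs_crude = [crude_estimate(f, 1000, rng) for _ in range(200)]
runs_anti = [antithetic_estimate(f, 1000, rng) for _ in range(200)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(runs_anti) < var(runs_crude))  # True: the antithetic pairs win
```

For a monotone f the covariance between f(u) and f(1-u) is negative, so the empirical variance of the antithetic runs is markedly lower than that of the crude runs for the same number of function evaluations.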
In , monotonicity properties of index-based routing in queueing networks are established and implemented in the perfect simulation framework Psi. This has been applied in the context of grid scheduling to compare several scheduling heuristics.
This is a collaborative work with Jantien Dopper (Leiden University) and Anne Bouillard (IRISA).
In , the duration of perfect simulations for Markovian finite-capacity queueing networks is studied. This corresponds to a hitting-time (or coupling-time) problem in a Markov chain over the Cartesian product of the state spaces of the queues. We establish an analytical formula for the expected simulation time in the one-queue case, which provides simple bounds for acyclic networks of queues with losses. These bounds correspond to sums of the coupling times of the individual queues and are either almost linear in the queue capacities, under light or heavy traffic assumptions, or quadratic, when service and arrival rates are similar.
This is a collaborative work with Anne Bouillard (IRISA).
In , we show how to design a perfect sampling algorithm for stochastic free-choice Petri nets by backward coupling. For Markovian event graphs, the simulation time can be greatly reduced by using extremal initial states, namely blocking markings, even though such nets do not exhibit any natural monotonicity property. Another approach for the perfect simulation of non-Markovian event graphs is based on a (max,plus) representation of the system and on the theory of (max,plus) stochastic systems. Next, we show how to extend this approach to one-bounded free-choice nets, at the expense of keeping all states. Finally, experimental runs show that the (max,plus) approach needs a larger simulation time than the Markovian approach.
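The (max,plus) representation mentioned above can be illustrated on a tiny timed event graph (the matrix below is made up for illustration): the vector x(n) of n-th firing dates satisfies the linear (max,plus) recurrence x(n) = A (x) x(n-1), where addition is replaced by max and multiplication by +.

```python
# Toy (max,plus) evolution of a timed event graph: firing dates grow
# linearly at the (max,plus) spectral rate of the matrix A.

def maxplus_matvec(A, x):
    """One step of x(n) = A (x) x(n-1) in the (max,plus) semiring."""
    return [max(a + xi for a, xi in zip(row, x)) for row in A]

A = [[3.0, 7.0],
     [2.0, 4.0]]     # holding/firing delays, made up for illustration
x = [0.0, 0.0]       # both transitions fire initially at time 0
for _ in range(3):
    x = maxplus_matvec(A, x)
print(x)  # [16.0, 13.0]
```

Absent entries of A would be encoded as float("-inf"), the neutral element of max, so that missing arcs never contribute to a firing date.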
The current methods used to test and study peer-to-peer systems (namely modeling, simulation, or execution on real testbeds) often show limits regarding scalability, realism and accuracy. In , we describe and evaluate P2PLab, a framework for studying peer-to-peer systems by combining emulation (running the real studied application within a configured synthetic environment) and virtualization. P2PLab is scalable (it uses a distributed network model) and has good virtualization characteristics (many virtual nodes can be executed on the same physical node thanks to process-level virtualization). Experiments with the BitTorrent file-sharing system complete this paper and demonstrate the usefulness of the platform.
This is a collaborative work with Dinard Van der Laan (Vrije University) and Arie Hordijk (Leiden University).
In , we consider deterministic (both fluid and discrete) polling systems with N queues with infinite buffers and we show how to compute the best polling sequence (minimizing the average total workload). With two queues, we show that the best polling sequence is always periodic when the system is stable and forms a regular sequence. The fraction of time spent by the server in the first queue is highly non-continuous in the parameters of the system (arrival rate and service rate) and shows a fractal behavior. Moreover, convexity properties are shown and are used in a generalization of the computation of the optimal control policy (in open-loop) for the stochastic exponential case.
This is a collaborative work with Vandy Berten (Université Libre de Bruxelles).
We show in how index routing policies can be used in practice for task allocation in computational grids. We provide a fast algorithm that can be used off-line or even on-line to compute the index tables. We also report numerous simulations providing numerical evidence of the great efficiency of our index routing policy, as well as of its robustness with respect to parameter changes.
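The flavor of index routing can be sketched as follows. This is a deliberately simplified illustration, not the policy or the index computation of the paper: each processor is given an index reflecting the expected completion time of a new task on it, and an incoming task is routed to the processor of minimal index.

```python
# Hedged sketch of index-based routing on a heterogeneous platform.
# The index chosen here (backlog plus one task, divided by speed) is a
# made-up example; real index tables are computed off-line by the algorithm
# described in the paper.

def route(loads, speeds):
    """Return the processor whose index is minimal for the next task."""
    indices = [(load + 1.0) / speed for load, speed in zip(loads, speeds)]
    return min(range(len(indices)), key=indices.__getitem__)

loads = [4.0, 4.0, 0.0]    # pending work units per processor
speeds = [2.0, 1.0, 0.5]   # work units per second
tasks_assigned = []
for _ in range(5):
    p = route(loads, speeds)
    loads[p] += 1.0
    tasks_assigned.append(p)
print(tasks_assigned)  # [2, 0, 0, 0, 0]
```

Note how the policy first exploits the idle slow machine, then keeps feeding the fast one: the index balances backlog against speed rather than simply picking the least-loaded processor.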
This is a collaborative work with Frédéric Vivien (GRAAL project) and Alan Su (Google).
In , we have considered the problem of scheduling comparisons of motifs against biological databanks, a problem that arises in distributed biological sequence comparison applications. This problem lies in the divisible load framework with negligible communication costs. Thus far, very few results have been proposed in this model. We discuss and select relevant metrics for this framework, namely max-stretch and sum-stretch. We explain the relationship between our model and the preemptive uniprocessor case, and we show how to extend algorithms proposed in the literature for the uniprocessor model to the divisible multiprocessor problem domain. We recall known results on closely related problems, derive new lower bounds on the competitive ratio of any on-line algorithm, present new competitiveness results for existing algorithms, and develop several new on-line heuristics. We then extensively study the performance of these algorithms and heuristics in realistic scenarios. Our study shows that all previously proposed heuristics with guarantees on max-stretch for the uniprocessor model prove particularly inefficient in practice. In contrast, our on-line algorithms based on linear programming turn out to be near-optimal solutions for max-stretch. Our study also clearly identifies heuristics that are efficient for both metrics, although a combined optimization is in theory not possible in the general case.
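The two metrics discussed above are easy to compute on a toy instance. The stretch of a job is its flow time (completion minus release date) divided by its processing time; max-stretch takes the worst job, sum-stretch the total. The schedules below are made-up examples for illustration.

```python
# Stretch metrics on a two-job, one-machine example.

def stretches(jobs, completion):
    """jobs: {name: (release, processing)}, completion: {name: time}."""
    return {name: (completion[name] - r) / p
            for name, (r, p) in jobs.items()}

jobs = {"A": (0.0, 4.0), "B": (1.0, 1.0)}

# FIFO schedule: A runs in [0, 4], B in [4, 5]
fifo = stretches(jobs, {"A": 4.0, "B": 5.0})
# Preemptive schedule favoring the short job: A in [0,1] and [2,5], B in [1,2]
preempt = stretches(jobs, {"A": 5.0, "B": 2.0})

print(max(fifo.values()), sum(fifo.values()))        # 4.0 5.0
print(max(preempt.values()), sum(preempt.values()))  # 1.25 2.25
```

Preempting the long job slightly delays it but cuts the short job's stretch from 4 to 1, improving both metrics here; in general, however, max-stretch and sum-stretch cannot always be optimized simultaneously.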
This is a collaborative work with Yves Robert (GRAAL project), Olivier Beaumont (SCALAPPLIX project) and Larry Carter and Jeanne Ferrante (University of California San Diego).
In the previous study, all tasks are supposed to originate from different users. Therefore, all these tasks are in competition, and our algorithms ensure that the slowdown incurred by each is as ``uniform'' as possible. We have also studied the situation where the number of users is small but the number of tasks is very large. Indeed, many applications (cellular micro-physiology, protein conformations, particle detection and others) are constituted of a very large set of independent, equal-sized tasks. All the tasks are generally held by a master who is in charge of distributing them to the different slaves. Results have been proposed in the past for optimizing the throughput of a single application on complex heterogeneous platforms; we have extended these results to the multiple-application case . In the single-application setting with a tree-shaped platform graph, i.e. when the underlying interconnection network is an oriented tree rooted at the master, it is possible to derive a closed-form formula that characterizes the optimal steady state, which can then be computed via a simple bottom-up traversal of the tree. This property makes it possible to directly derive autonomous task scheduling algorithms (i.e. algorithms whose decisions are based only on local information). In the multiple-application setting, deriving efficient local algorithms seems, however, to be much more difficult.
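The bottom-up traversal can be sketched on a toy tree. This is a deliberately simplified model, not the closed-form formula of the paper (which also accounts for communication/computation overlap): here the throughput of a subtree is simply its root's own compute rate plus the children's subtree throughputs, each capped by the bandwidth of the link to the parent.

```python
# Simplified bottom-up computation of steady-state throughput on a
# tree-shaped master-worker platform. All names and rates are made up.

def subtree_throughput(node, tree, rate, bandwidth):
    """tree: {node: [children]}; rate: tasks/s computed locally;
    bandwidth: tasks/s deliverable on the link from a node to its parent."""
    total = rate[node]
    for child in tree.get(node, []):
        total += min(subtree_throughput(child, tree, rate, bandwidth),
                     bandwidth[child])
    return total

tree = {"master": ["w1", "w2"], "w1": ["w3"]}
rate = {"master": 0.0, "w1": 2.0, "w2": 1.0, "w3": 5.0}
bandwidth = {"w1": 4.0, "w2": 3.0, "w3": 1.0}   # link child -> parent
print(subtree_throughput("master", tree, rate, bandwidth))  # 4.0
```

Each node only needs its own rate, its children's reported throughputs and its link bandwidths, which is exactly why such formulas lend themselves to autonomous, local scheduling decisions: the fast worker w3 contributes only 1 task/s because its uplink is the bottleneck.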
Last, we have studied the situation where multiple applications execute concurrently on a heterogeneous platform, competing for CPU and network resources. In , we analyze the behavior of K non-cooperative schedulers, each using the optimal strategy that maximizes its efficiency, while fairness is ensured at the system level, ignoring application characteristics. We limit our study to simple single-level master-worker platforms and to the case where applications consist of a large number of independent tasks. The tasks of a given application all have the same computation and communication requirements, but these requirements can vary from one application to another. Therefore, each scheduler aims at maximizing its throughput. We give a closed-form formula for the equilibrium reached by such a system and study its performance. We characterize the situations where this Nash equilibrium is Pareto-optimal and show that, even though no catastrophic situation (Braess-like paradox) can occur, such an equilibrium can be arbitrarily bad for any classical performance measure.
Large-scale distributed systems like grids are difficult to study from theoretical models and simulators only. Most grids deployed at large scale are production platforms, which makes them inappropriate research tools because of their limited reconfiguration, control and monitoring capabilities. In , we present Grid5000, a 5000-CPU nationwide infrastructure for research in grid computing. Grid5000 is designed to provide a scientific tool for computer scientists, similar to the large-scale instruments used by physicists, astronomers and biologists. We describe the motivations, design considerations, architecture, control and monitoring infrastructure of this experimental platform, and we present configuration examples and performance results for the reconfiguration subsystem.
Focused on the exploitation and administration of high-performance large-scale parallel systems, describes the work carried out on the deployment of environments on high-performance computing clusters and grids. We first present the problems involved in the installation of an environment (OS, middleware, libraries, applications...) on a cluster or grid, and how an effective deployment tool, Kadeploy2, can become a new way of exploiting this type of infrastructure. We present the tool's design choices and its architecture, and we describe the various stages of the deployment method introduced by Kadeploy2. Moreover, we propose methods, on the one hand, to improve the deployment time of a new environment and, on the other hand, to support various operating systems. Finally, to validate our approach, we present tests and evaluations performed on various clusters of the experimental grid Grid5000.
Computing grids are often used to perform high-performance computations with pre-existing applications that need to access data in various ways (e.g. files or more evolved requests). These needs differ from one application to another, but most of the time applications need to select data according to complex criteria on meta-data (e.g. file names or even applicative meta-data). Grids are potentially interesting as they harness a large amount of storage capacity, but new middleware is required to make efficient use of these capacities.
In , we present Gedeon, a data management middleware specifically designed for scientific data management on grids. Gedeon comprises a low-level I/O library, a remote access broker and various data access interfaces. It relies on a hierarchy of caches of data and requests to provide efficient data access. Bioinformatics applications and microscope image databases are among the targeted applications.
Caches are generally used to improve the performance of applications and are often very specific and thus hard to reuse. For that reason, designing cache services is generally very costly. In , we propose the Adaptable Cache Service (ACS), a generic software template for designing adaptive cache services. ACS has been successfully used in a data middleware for grids to design a dual cache: using a semantic approach, two caches (one for the requests and one for the objects) built with ACS cooperate to provide a high level of performance.
Distributed applications, especially I/O-intensive ones, often access the storage subsystem in a non-sequential way (strided requests). Since such behaviors lower the overall system performance, many applications use parallel I/O libraries such as ROMIO to gather and reorder requests. In the meantime, as cluster usage grows, several applications are often executed concurrently, competing for access to the storage subsystems and thus potentially canceling the optimizations brought by parallel I/O libraries. The aIOLi project aims at optimizing I/O accesses within the cluster while providing a simple POSIX API. In and , we present an extension of aIOLi that addresses the issue of disjoint accesses generated by different concurrent applications in a cluster. In such a context, a good trade-off has to be found between performance, fairness and response time. To achieve this, an I/O scheduling algorithm and a request aggregator that consider both application access patterns and global system load have been designed and merged into aIOLi. This improvement led to the implementation of a new generic framework pluggable into any I/O file-system layer. A test composed of two concurrent IOR benchmarks has shown improvements on read accesses by a factor ranging from 3.5 to 35 with POSIX calls and from 3.3 to 5 with ROMIO; both reference benchmarks were executed on a traditional NFS server without any additional optimization.
Leveraging the existing storage space in a cluster is a desirable characteristic of a parallel file system. While undoubtedly an advantage from the point of view of resource management, this possibility may face the administrator with a wide variety of alternatives for configuring the file server, whose optimal layout is not always easy to devise. Given the diversity of parameters, such as the number of processors on each node and the capacity and topology of the network, decisions regarding the placement of server components like metadata servers and I/O servers have a direct impact on performance and scalability. In , we explore the capabilities of the dNFSp file system on a large cluster installation, observing how the system scales in different scenarios and comparing it to a dedicated parallel file system. The results obtained show that the design of dNFSp allows for a scalable and resource-saving configuration for clusters with a large number of nodes.
Scientific computing has evolved over the last few years, and commodity-based components can now run a wide range of applications. In , we focus on the software environment required to carry out large-scale parametric simulations, and we also report on the experience of deploying such an infrastructure in the day-to-day life of an intranet.
The IGGI software suite is mainly based on two components. ComputeModeTM smoothly aggregates idle user machines into a virtual computing cluster; this is done through a transparent switch of the PC to a secondary, protected mode from which it boots from the ComputeMode server, taking advantage of the PXE protocol. The second IGGI component is CiGri, which schedules and executes computing tasks on the idle nodes of clusters.
Special attention has been paid to the end-users' ability to easily perform parametric simulations. Transparent access to the batch scheduler, checkpointing and migration of the application are exposed for the test case of an uncertainty analysis with TOUGHREACT: "Uncertainty in predictions of transfer model response to a thermal and alkaline perturbation in clay" .
The calculations reported were carried out using idle computing capacity on personal computers inside the BRGM. This new architecture is particularly well suited to explore the wide range of perturbations in highly coupled problems.
With an excellent cost/performance trade-off and good scalability, multiprocessor systems are becoming attractive alternatives when high performance, reliability and availability are needed, and they are now popular in universities, research labs and industry. In these communities, life-critical applications requiring high degrees of precision and performance are executed and controlled. It is therefore important for the developers of such applications to analyze, during the design phase, how hardware, software and performance-related failures affect the quality of service delivered to the users. This analysis can be conducted using modeling techniques such as transition systems. However, the high complexity of such systems (large state spaces) makes them difficult to analyze. In , we have presented an efficient way to model and analyze multiprocessor systems in a structured and compact manner using a Stochastic Automata Network (SAN). A SAN is a high-level formalism for modeling very large and complex Markov chains. The formalism permits complete systems to be represented as a collection of interacting sub-systems. The basic concept that makes SANs powerful is the use of tensor algebra for their representation and analysis. Furthermore, a new modeling alternative has recently been incorporated into SANs: the use of phase-type distributions, which allows a more accurate modeling of numerous real phenomena, such as repair and service times in multiprocessor systems.
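The tensor-algebra idea behind SANs can be shown on the simplest case, two independent automata, where the global generator is the Kronecker sum of the local ones (interacting automata add synchronizing tensor terms, which this sketch omits). All rates below are made up for illustration.

```python
# Kronecker-sum construction of a global CTMC generator from local automata,
# in pure Python. For k independent n-state automata the global chain has n^k
# states, but only the small local generators ever need to be written down.

def kron(a, b):
    """Kronecker product of two square matrices (lists of lists)."""
    na, nb = len(a), len(b)
    return [[a[i // nb][j // nb] * b[i % nb][j % nb]
             for j in range(na * nb)] for i in range(na * nb)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron_sum(generators):
    """Q = sum_i I (x) ... (x) Q_i (x) ... (x) I for independent automata."""
    sizes = [len(q) for q in generators]
    total = 1
    for s in sizes:
        total *= s
    Q = [[0.0] * total for _ in range(total)]
    for i, q in enumerate(generators):
        left = 1
        for s in sizes[:i]:
            left *= s
        right = total // (left * sizes[i])
        term = kron(kron(identity(left), q), identity(right))
        Q = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(Q, term)]
    return Q

# Two independent 2-state failure/repair automata (illustrative rates)
q1 = [[-1.0, 1.0], [2.0, -2.0]]
q2 = [[-0.5, 0.5], [3.0, -3.0]]
Q = kron_sum([q1, q2])
print(len(Q))                                    # 4 global states
print(all(abs(sum(row)) < 1e-12 for row in Q))   # True: valid generator
```

The 4x4 global generator is never built from scratch: it is assembled from two 2x2 blocks, which is precisely what lets SAN tools handle Markov chains far too large to store explicitly.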
Performance is a critical issue for massively parallel programs. In practice, it is almost impossible to observe and understand the behavior of such programs without the assistance of automated tools that offer the program developer insight into the execution behavior of complex applications. Most of these tools are based on event tracing techniques. As the size of the parallel system grows, it generates a huge number of events, and visualizing and animating the gathered traces often produces complex, hard-to-understand diagrams and displays. The Pajé visualization framework, recently developed in our laboratory ID (Informatique et Distribution), provides interactive and scalable behavioral visualizations of parallel and distributed applications. In , we present an empirical performance monitoring and visualization study with two large-scale applications, JOnAS (200,000 LOC) and JBoss (400,000 LOC). We have developed an event-based tracing framework for monitoring large-scale applications, and we use the Pajé framework to observe and visualize the large amount of harvested events.
Nowadays, peer-to-peer systems are widely studied, but in order to evaluate them in a realistic way, a better knowledge of their environments is needed. In , we focus on the availability of computers in these systems. We characterize the availability of hosts behind ADSL lines and link it to the availability of peer-to-peer system participants. We focus on a methodology that can be generalized to other systems such as grids or ad-hoc systems. We finally show how users of ADSL lines are related to peer-to-peer users, and we give some examples of possible practical uses of these results. The results are based on trace datasets obtained over the first five months of 2003 with around 5000 hosts.
In high-performance computing research, modeling system performance makes it possible to predict the behavior of a system according to the users' requirements. A model also provides a specification for the design, maintenance and scaling of a high-performance computing infrastructure. In , we present a model of high-bandwidth data transfers. The proposed model is used for performance analysis based on experimental data from two clusters, IDPOT and I-Cluster2, both part of the Grid5000 project.
SMP clusters are one of the most common HPC platforms used by scientific applications. The nodes of an SMP cluster contain several computing elements, and scientific applications may be executed over a large number of such nodes, introducing complex communication behaviors: using MPI, for instance, communications taking place on the same node within a common time interval create concurrent accesses to the node's resources. On SMP nodes, concurrent accesses imply resource sharing, which depends on the underlying network architecture and on the MPI implementation used. In , we present a model to predict the communication times of simultaneous MPI communications over SMP clusters. This model takes the concurrency over node resources into account and accurately predicts communication times even when many communications are in conflict.
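The effect that such a model must capture can be illustrated with a deliberately naive sharing rule (this is a toy illustration, not the paper's model, whose resource-sharing rules are more detailed): when k flows cross the same link at the same time, fair sharing of the bandwidth B multiplies each transfer time by roughly k.

```python
# Toy prediction of MPI transfer times under fair link sharing.
# Bandwidth value and message size are illustrative.

def transfer_time(size_bytes, bandwidth, concurrent_flows):
    """Predicted time for one message when the link is fairly shared."""
    return size_bytes / (bandwidth / concurrent_flows)

B = 125e6                          # 1 Gb/s link, expressed in bytes/s
alone = transfer_time(8e6, B, 1)   # 8 MB message with the link to itself
shared = transfer_time(8e6, B, 4)  # same message while 4 flows compete
print(alone, shared)               # 0.064 s vs 0.256 s
```

Even this crude rule shows why per-message benchmarks alone mispredict application behavior on SMP nodes: the same message is four times slower under contention, before accounting for memory-bus and NIC effects.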
In the context of a global partnership between BULL and INRIA, BULL and the MESCAL project collaborate to develop clustering software solutions aimed at very large computing infrastructures. These clusters feature a complete software environment, including management tools, efficient storage solutions and resource management. The partnership promotes cluster architectures based on the Intel Itanium 2 processor, which has established new records for floating-point processing. This processor provides the 64-bit addressing scheme needed by the large data sets of scientific applications and has up to 6 MB of on-chip cache to give the processor very fast access to data. BULL has developed FAME (Flexible Architecture for Multiple Environment), using standard component assemblies as the building blocks of larger systems.
IGGI stands for Infrastructure for Grids, Clusters and Intranets. This research project, partially funded by the French government, aims at developing technologies allowing access to, and aggregation of, all the computing resources spread over the intranet of a company, be they dedicated computing servers or personal computers. The project is a collaboration between BRGM, INRIA and Mandriva.
Adrien Lebre did his PhD thesis under a CIFRE contract with the BULL company. His work started in March 2003 and addressed high-performance I/O for clusters. He defended his PhD in September 2006.
Maxime Martinasso started a PhD thesis in January 2004 involving MESCAL and BULL under the terms of a CIFRE contract. The thesis deals with the behavior analysis of parallel applications on SMP/NUMA clusters, and more specifically with the performance modeling of communication contention and memory accesses.
Estelle Gabarron started a PhD thesis in January 2004 involving MESCAL and BULL under the terms of a CIFRE contract. The subject addresses resource management in grids in the presence of large numbers of jobs in the system. The main issues studied are scalability, error recovery and data management.
Yiannis Georgiou is doing his PhD thesis under a CIFRE contract with the BULL company. His work started in September 2006, will finish in September 2009, and addresses batch scheduling on grids.
Ahmed Harbaoui is doing his PhD thesis under a CIFRE contract with France Télécom R&D. His work started in September 2006, will finish in September 2009, and deals with load injection and performance evaluation issues in networks.
Vincent Roqueta is doing his PhD thesis under a CIFRE contract with the BULL company. His work started in September 2006, will finish in September 2009, and addresses localization and replication issues on grids.
The CIMENT project (Intensive Computing, Numerical Modeling and Technical Experiments, http://ciment.ujf-grenoble.fr/) gathers a wide scientific community involved in numerical modeling and computing (from numerical physics and chemistry to astrophysics, mechanics, biomodeling and imaging) and the distributed computer science teams from Grenoble. Several heterogeneous distributed computing platforms were set up (from PC clusters to IBM SP or alpha workstations) each being originally dedicated to a scientific domain. More than 600 processors are available for scientific computation. The MESCAL project provides expert skills in high performance computing infrastructures.
MENRT-UJF-INPG (800KF), the Rhône-Alpes Region (1.2MF), INRIA (2.5MF) and ENS-Lyon (300KF) have funded a 4.8MF cluster composed of 110 Itanium2 bi-processor nodes connected by a Myrinet high-performance network (donation of MyriCom). This project is led by MESCAL, MOAIS, ReMaP and SARDES. It is part of the CIMENT project, which aims at building high-performance distributed grids between several research labs (see above).
The MESCAL project is a member of the regional ``cluster'' project on computer science and applied mathematics; the focus of its participation is on handling large amounts of data on large-scale architectures. Other members of this subproject are the INRIA GRAAL project and the LSR-IMAG and IN2P3-LAPP laboratories.
Partners (INRIA-Apache, IRISA-Armor, PRISM-Epri).
In the area of distributed systems and networking, the objective of the project is to apply an expertise in mathematical tools, techniques, algorithms and software packages for performance, reliability or dependability studies.
Partners (LRI, LIP).
The goal of Data Grid Explorer is to build an emulation environment to study large-scale configurations. Today, it is difficult to evaluate new models for data placement and caching, network content distribution, peer-to-peer systems, etc. Options include writing simulation environments from scratch, employing detailed packet-level simulation environments such as NS, local testing within a controlled cluster setting, or deploying live code across the Internet or a testbed. Each approach has a number of limitations. Custom simulation environments typically simplify network and failure characteristics. Packet-level simulators add more realism but limit system scalability to a few hundred simultaneous nodes. Cluster-based deployment adds another level of realism by allowing the evaluation of real code, but unfortunately the network is highly over-provisioned and uniform in its performance characteristics. Finally, live Internet and testbed deployments provide the most realistic evaluation environment for wide-area distributed services. Unfortunately, there are significant challenges to deploying and evaluating real code running at a significant number of Internet sites. The main benefit of emulation is the ability to reproduce experimental conditions and results.
The project is structured horizontally into transverse working groups: Infrastructure, Emulation, Network, and Applications. The Regal team is leader for the Emulation working group.
Partners (IMAG-LSR).
File systems (FS) are commonly used to store data. In particular, they are intensively used by the large scientific computing community (astronomy, biology, weather prediction), which needs to store large amounts of data in a distributed manner. In a grid context (cluster of clusters), traditional distributed file systems have been adapted to manage a large number of hosts (like the Andrew File System). However, such file systems remain inadequate for managing huge data: they are suited to traditional, small Unix files. Thus, the grain of distribution is typically an entire file and not a piece of a file, which is essential for large files. Furthermore, the tools for managing data (e.g., interrogation, duplication, consistency) are unsuited to large sizes.
Database Management Systems (DBMS) provide different abstraction layers, high-level languages for data interrogation and manipulation, etc. However, the imposed data structure, the limited distribution, and the usually monolithic architecture of DBMSs limit their use in the scientific computing context.
The main idea of the Gedeon project is to merge the functions of file systems and DBMS, focusing on the structuring of metadata, duplication and coherency control. Our goal is NOT to build a DBMS describing a set of files. We will study how database management services can be used to improve the efficiency of file access and to increase the functionality provided to scientific programmers.
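A toy sketch of this idea: keep queryable metadata about pieces of large files in a small database, so that an application can locate the chunks it needs without scanning whole files. All names, the schema and the SQLite back-end are purely hypothetical illustrations, not the project's actual design:

```python
import sqlite3

# Hypothetical chunk-level metadata index (the Gedeon idea in miniature):
# a database query replaces a full scan of a huge scientific file.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE chunk_index (
        path    TEXT,     -- file the chunk belongs to
        offset  INTEGER,  -- byte offset of the chunk inside the file
        length  INTEGER,  -- chunk size in bytes
        host    TEXT,     -- node currently holding a replica
        tag     TEXT      -- domain-specific metadata (e.g. a time step)
    )
""")

def register_chunk(path, offset, length, host, tag):
    db.execute("INSERT INTO chunk_index VALUES (?, ?, ?, ?, ?)",
               (path, offset, length, host, tag))

def locate(path, tag):
    """Return (host, offset, length) for every replica of the chunks of
    `path` matching `tag`."""
    cur = db.execute(
        "SELECT host, offset, length FROM chunk_index "
        "WHERE path=? AND tag=? ORDER BY offset, host",
        (path, tag))
    return cur.fetchall()

register_chunk("/data/sim.out", 0, 2**20, "node-a", "step-0")
register_chunk("/data/sim.out", 2**20, 2**20, "node-b", "step-1")
register_chunk("/data/sim.out", 2**20, 2**20, "node-c", "step-1")  # replica
```

Here duplication is visible to the query layer (two replicas of the step-1 chunk), which is exactly the kind of functionality a plain file system does not expose.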
Partners (INRIA FUTURS, INRIA Sophia, IRISA, LORIA, IRIT, LABRI, LIP, LIFL).
The foundations of Grid'5000 have emerged from a thorough analysis and numerous discussions about methodologies used for scientific research in the Grid domain. A report presents the rationale for Grid'5000. In addition to theory, simulators and emulators, there is a strong need for large scale testbeds where real life experimental conditions hold. The size of Grid'5000, in terms of number of sites and number of CPUs per site, was established according to the scale of the experiments and the number of researchers involved in the project.
Partners (INRIA-FUTURS).
DSLlab is a research project aiming at building and using an experimental platform about distributed systems running on DSL Internet. The objective is twofold:
provide accurate and customized measures of availability, activity and performance in order to characterize and tune models of ADSL resources;
provide a validation and experimental tool for new protocols, services and simulators and emulators for these systems.
DSLlab consists of a set of low-power, low-noise computers spread over ADSL links. These computers are used simultaneously as active probes, to capture behavior traces, and as operational nodes, to launch experiments. We expect from this experiment a better knowledge of the behavior of ADSL hosts, which now represent a significant capability in terms of storage and computing power, and the design of accurate models for their emulation and simulation.
Future generations of multiprocessor machines will rely on NUMA architectures featuring multiple memory levels as well as nested computing units (multi-core chips, multithreaded processors, multi-module NUMA machines, etc.). To get the most out of the hardware's performance, parallel applications need powerful software that carefully distributes processes and data so as to limit non-local memory accesses. The ANR NUMASIS project addresses these issues.
The new algorithmic challenges associated with large-scale platforms have been approached from two different directions. On the one hand, the parallel algorithms community has largely concentrated on the problems associated with heterogeneity and large amounts of data. Algorithms have been based on a centralized single node responsible for computing the optimal solution; this approach induces significant computing times on the organizing node and requires centralizing all the information about the platform. These solutions therefore clearly suffer from scalability and fault tolerance problems.
On the other hand, the distributed systems community has focused on scalability and fault-tolerance issues. The success of file sharing applications demonstrates the capacity of the resulting algorithms to manage huge volumes of data and users on large unstable platforms. Algorithms developed within this context are completely distributed and based on peer-to-peer communications. They are well adapted to very irregular applications, for which the communication pattern is unpredictable. But in the case of more regular applications, they lead to a significant waste of resources.
The goal of the ALPAGE project is to establish a link between these directions, by gathering researchers (ID, LIP, LORIA, LaBRI, LIX, LRI) from the distributed systems and parallel algorithms communities. More precisely, the objective is to develop efficient and robust algorithms for some elementary applications, such as broadcast and multicast, distribution of tasks that may or may not share files, resource discovery. These fundamental applications correspond well to the spectrum of the applications that can be considered on large scale, distributed platforms.
The ACI SMS, ``Simulation et Monotonie Stochastique en évaluation de performances'', brings together two teams: the Performance Evaluation team from the PRiSM Laboratory (ACI leader) and the MESCAL project. The main objective is to study monotonicity properties of computer system models in order to speed up simulations and estimate performance indices more accurately.
The composition formalisms we have contributed to developing in recent years make it possible to build large Markov chains associated with complex systems in order to analyze their performance. However, it is often impossible to compute the stationary or transient distributions: analytical methods and simulations fail, for different reasons.
Raw performance figures, however, are not really useful by themselves: we need proof that the system is better than a given objective. It is therefore natural to compare random variables and sample paths, which brings in two important concepts: stochastic ordering and stochastic monotonicity. We chose to develop these two concepts and apply them to perfect simulation, distributed simulation and product-form queueing networks, as they appear frequently in various performance evaluation techniques. Using the monotonicity property, one can reduce the computation time of perfect simulation with coupling from the past. Coupling from the past samples the steady-state distribution in finite time, so we do not face the stopping problem that holds for ordinary simulations. Furthermore, some results show that the monotonicity property is often present in queueing networks even when they do not have a product form; we simply have to renormalize them to make the property appear. Using both properties it is also possible to derive more efficient distributed simulations. We will develop two ideas: sample-path transformations to avoid rollback in optimistic simulations (for which we compute a bound), and regenerative simulations.
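The role monotonicity plays in coupling from the past can be sketched on a toy example. The bounded birth-death chain, its parameters and all function names below are illustrative assumptions, not taken from the project's software; the point is that monotonicity lets the sandwiching argument track only the minimal and maximal trajectories instead of one trajectory per state:

```python
import random

K = 10           # state space {0, ..., K} (a finite-buffer queue)
P_ARRIVAL = 0.4  # illustrative arrival probability per event

def update(state, u):
    """Monotone update rule driven by a uniform random number u:
    u < P_ARRIVAL is an arrival, otherwise a departure.
    state1 <= state2 implies update(state1, u) <= update(state2, u)."""
    if u < P_ARRIVAL:
        return min(state + 1, K)
    return max(state - 1, 0)

def cftp(seed=0):
    """Coupling from the past for a monotone chain: simulate only the two
    extremal trajectories from time -T; if they coalesce by time 0, the
    common value is an exact steady-state sample."""
    rng = random.Random(seed)
    randoms = []  # randomness for steps ending at times 0, -1, -2, ...
    T = 1
    while True:
        while len(randoms) < T:   # extend further into the past,
            randoms.append(rng.random())  # REUSING earlier randomness
        lo, hi = 0, K
        for t in range(T - 1, -1, -1):  # run from time -T up to 0
            u = randoms[t]
            lo, hi = update(lo, u), update(hi, u)
        if lo == hi:
            return lo             # coalescence: exact sample, no burn-in
        T *= 2                    # otherwise restart from twice as far back

samples = [cftp(seed=s) for s in range(1000)]
```

Reusing the same random numbers when restarting from further in the past is essential for correctness; doubling T keeps the total work proportional to the coalescence time.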
Finally, these concepts can be used for product-form queueing networks, to explain why some transformations applied to customer synchronization can yield a product-form solution, and also how a solution of the traffic equations can be computed when they are unstable.
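As a reminder of what a product-form solution looks like, the following sketch solves the traffic equations of a small open Jackson network and evaluates its product-form stationary distribution. All rates, routing probabilities and names are made-up illustrations, not data from the project:

```python
import numpy as np

# Hypothetical 3-node open Jackson network.
gamma = np.array([1.0, 0.5, 0.0])   # external arrival rates
mu    = np.array([4.0, 3.0, 2.0])   # service rates
R = np.array([                      # routing matrix: R[i, j] = P(i -> j);
    [0.0, 0.5, 0.3],                # the missing row mass leaves the network
    [0.0, 0.0, 0.5],
    [0.2, 0.0, 0.0],
])

# Traffic equations: lambda = gamma + lambda R  =>  lambda = gamma (I - R)^-1
lam = gamma @ np.linalg.inv(np.eye(3) - R)
rho = lam / mu                      # per-node utilization

assert (rho < 1).all(), "stability is required for the product form to hold"

def stationary_prob(n):
    """Product-form probability P(n_1, n_2, n_3): each node behaves like an
    independent M/M/1 queue with utilization rho_i."""
    return float(np.prod((1 - rho) * rho ** np.array(n)))
```

The product form means the normalization constant factorizes over the nodes, which is precisely what breaks down for the unstable traffic equations mentioned above.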
The ANR Sceptre is a joint effort between STMicroelectronics (Divisions STS and HEG), INRIA Rhône-Alpes (MOAIS, Mescal, Arenaire, CompSys), TIMA/SLS, Verimag, CAPS-Entreprise and IRISA (CAPS). Participants work on tools and methods to develop embedded systems. The main working directions are software and hardware integration, scalable and configurable architectures, real time constraints, heterogeneous multiprocessing, and load-balancing.
The ``ACI blanche'' MEG, ``...'', brings together two teams: physicists working on electromagnetism at LAAS (Toulouse) and the MESCAL project. The main objective is to study scaling properties in electromagnetism simulation applications and grids.
The MESCAL project participates in the Network of Excellence CoreGrid.
The MESCAL project participates in the Network of Excellence EuroNGI (Next Generation Internet).
The MESCAL project participates in ESPON (European Spatial Planning Observation Network, http://www.espon.eu/). It is involved in action 3.1 on tools for the analysis of socio-economical data. This work is done in the HyperCarte consortium, which includes the laboratories LSR-IMAG (UMR 5526), Géographie-cité (UMR 8504) and RIATE (UMS 2414). The Hyperatlas tools have been applied to the European context in order to study spatial deviation indices on demographic and sociological data at NUTS 3 level.
MESCAL takes part in the SARIMA project.
NSF project with W. Stewart (NC State University), G. Ciardo (College of William and Mary), S. Donatelli (University of Turin), 2002-2006. The purpose of the project is to study structured methods for Markov chains in order to evaluate the performance of distributed systems.
PICS (2005-2007) CADIGE, funded by the CNRS, with the universities of Rio Grande do Sul, Brazil (UFRGS, UFSM, PUC, UNISINOS), on PC clusters, grids and performance evaluation tools.
CAPES/COFECUB grant (2006-2008) with UFRGS, Porto Alegre, Brazil, on grids and PC clusters.
Colombia: collaboration with the universities of Los Andes, Bogota, and UIS, Bucaramanga, on the topic of grids for computation and data management.
The MESCAL project manages a cluster computing center on the Grenoble campus. The center manages different architectures: a 48-node bi-processor PC cluster (ID-POT), and it is involved with a cluster of 110 Itanium2 bi-processor nodes (ICluster-2) located at INRIA.
More than 60 research projects in France have used these architectures, especially the 204-processor ICluster-2. Half of them have run typical numerical applications on this machine; the remainder have worked on middleware and new technology for cluster and grid computing.
The ICluster-2 and ID-POT platforms are now integrated into the Grid'5000 platform.
In the context of our collaborations with BULL (LIPS, NUMASIS), the MESCAL project acquired a NovaScale NUMA machine. The configuration is based on 8 Itanium 2 processors at 1.5 GHz and 16 GB of RAM. This platform is mainly used by the BULL PhD students. The machine is also connected to the CIMENT Grid.
The MESCAL project is involved in the development and management of the Grid'5000 platform. The ICluster-2 and ID-POT clusters are integrated in Grid'5000. Moreover, these two clusters take part in the CIMENT Grid: their unused resources may be exploited to execute jobs from partners of the CIMENT project (see Section ).
Researchers of the MESCAL project have been members of the following program committees:
Value Tools 2006
Bruno Gaujal is an editor of the special issue of the Journal of Discrete Event Dynamic Systems on Valuetools.
This seminar on probabilities and applications is targeted at computer scientists as well as mathematicians. One of its goals is to encourage collaborations between people from different laboratories with varied backgrounds. More information is available at http://www-fourier.ujf-grenoble.fr/~dpiau/page/.
This seminar is organized by Jean-Marc Vincent and Bruno Gaujal. It is tightly coupled with the PAGE group, and its main goal is to organize meetings between the various researchers in Grenoble using the same kind of mathematical tools (stochastic models, queueing networks, Petri nets, stochastic automata, Markov processes and chains, (max,+) algebra, fluid systems, ...). In the long term, this seminar should lead to inter-laboratory working groups on precise themes. More information is available at http://www-id.imag.fr/Laboratoire/Membres/Vincent_Jean-Marc/EPG/.
Members of the MESCAL team are actively involved in teaching. Their activities are balanced between graduate students and post-graduate students. Here are a few examples of their responsibilities:
2nd year of Research Master (Grenoble): Operating Systems and Software; head of the SAP track (operating systems, parallel and distributed applications, networks and multimedia). Here is a list of courses taught by researchers of the MESCAL project:
Cluster architectures for high-performance computing and high throughput data management.
Data measurement and analysis for network and operating systems performance evaluation.
Modeling and simulation for network and operating systems performance evaluation.
Building parallel and distributed applications (contributor).
Algorithms and basic techniques for parallel computing (contributor).
2nd year of Research Master (Paris, MPRI): Network algorithms.
2nd year of Research Master (Yaoundé): Operating systems and networks.
Magistère d'informatique Licence (Université Joseph Fourier)