The Paris Project-Team was created at Irisa in December 1999. In November
2001, it was established as a joint project-team (projet commun) between
Irisa and the Brittany Extension of Ens Cachan. Since then, the project's
activity has been jointly supervised by an ad-hoc committee on an annual
basis. For 2003, this committee met at Ens Cachan on April 28.
The Paris Project-Team aims at contributing to the programming of parallel
and distributed systems for large scale numerical simulation applications.
Its goal is to design operating systems and middleware to ease the use of
such computing infrastructures for the targeted applications. Such
applications speed up the design of complex manufactured products,
such as cars or aircraft, thanks to numerical simulation techniques. As the
computer performance increases rapidly, it is possible to foresee in the near
future comprehensive simulations of these designs that encompass
multi-disciplinary aspects (structural mechanics, computational fluid
dynamics, electromagnetism, noise analysis, etc.). Numerical simulation of
these different aspects cannot be carried out on a single computer, due to
the lack of computing and memory resources. Instead, several clusters of
inexpensive PCs, and probably clusters of clusters (aka Grids), will
have to be used simultaneously to keep simulation times within reasonable
bounds. Moreover, simulations will have to be performed by different research
teams, each contributing its own simulation code. These teams may all belong
to a single company, or to different companies owning the appropriate skills
and computing resources, thus adding geographical constraints. By their very
nature, such applications will require the use of
a computing infrastructure that is both parallel and distributed.
The Paris Project-Team is engaged in research along four themes:
Operating System and Runtime for Clusters, Middleware for
Computational Grids, Large-scale Data Management for Grids and
Advanced Models for the Grid. These research activities encompass both
basic research, seeking conceptual advances, and applied research, to
validate the proposed concepts against real applications. The project-team is
also involved in setting up a national grid computing infrastructure
(Grid 5000) enabling large-scale experiments.
As the performance of microprocessors, computer architectures and networks increases, a cluster of standard personal computers provides the level of performance that makes numerical simulation a handy tool. This tool should be used not only by researchers, but also by a large number of engineers designing complex physical systems. Simulation of mechanical structures, fluid dynamics or wave propagation can nowadays be carried out in a couple of hours. This is made possible by exploiting multi-level parallelism, simultaneously at a fine grain within a microprocessor, at a medium grain within a single multi-processor PC, and at a coarse grain within a cluster of such PCs. This unprecedented level of performance no doubt makes numerical simulation available to a larger number of users, such as SMEs. It also generates new needs and demands for more accurate numerical simulation. But traditional parallel processing alone cannot meet this demand.
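To make the multi-level organization concrete, the following minimal sketch combines coarse-grain parallelism across cluster nodes, expressed here with MPI, with medium-grain parallelism among the threads of a single multi-processor PC, expressed with OpenMP. It only illustrates how the two levels are combined; it is not a production simulation code, and the numerical kernel is a toy.

    // Minimal sketch of multi-level parallelism: MPI across cluster nodes
    // (coarse grain) and OpenMP threads within each node (medium grain).
    // Compile with, e.g.: mpicxx -fopenmp hybrid.cpp
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;               // total number of cells
        long begin = rank * n / size;         // block owned by this node
        long end   = (rank + 1) * n / size;

        // Medium-grain parallelism: OpenMP threads share the node's memory.
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = begin; i < end; ++i) {
            double x = (i + 0.5) / n;
            local += 4.0 / (1.0 + x * x);     // toy numerical kernel
        }

        // Coarse-grain parallelism: combine partial results across nodes.
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("pi ~ %.6f computed on %d nodes\n", global / n, size);

        MPI_Finalize();
        return 0;
    }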
These new needs and demands are motivated by the constraints imposed by a
worldwide economy: make things faster, better and cheaper. Large scale
numerical simulation will no doubt become one of the key technologies to meet
such constraints. In traditional numerical simulation, only one simulation code
is executed. In contrast, it is now needed to couple several such codes
together in a single simulation. A large-scale numerical simulation
application is typically composed of several codes, not only to simulate one
physics, but to perform multi-physics simulation. One can imagine that the
simulation times will be in the order of weeks and sometimes months depending
on the number of physics involved in the simulation, and depending on the
available computing resources. Parallel processing only extends the number of
computing resources locally: it cannot significantly reduce simulation times,
since the simulation codes will not be located at a single geographical site.
This is particularly true in the global economy, where complex products (such
as cars, aircraft, etc.) are not designed by a single company, but by several
of them, through the use of subcontractors. Each of these companies brings its
own expertise and tools, such as numerical simulation codes, and even its
private computing resources. Moreover, these companies are reluctant to give
access to their tools, as they may at the same time compete on other projects.
It is thus clear that distributed processing cannot be avoided to manage
large-scale numerical applications.
The design of large-scale simulation applications raises technical and
scientific challenges, both in applied mathematics and computer science. The
Paris Project-Team mainly focuses its effort on computer science. It
investigates new approaches to build software mechanisms that hide the
complexity of programming computing infrastructures that are both
parallel and distributed. Our contribution to the field can thus be summarized
as follows: combining parallel and distributed processing whilst
preserving performance and transparency. This contribution is
developed along four directions.
The challenge is to design and build an operating system for clusters that hides distributed resources (processors, memories, disks) from programmers and users. A PC cluster with such an operating system will look like a traditional multi-processor running a Single System Image (SSI).
The challenge is to design a middleware implementing a component-based approach for grids. Large-scale numerical applications will be designed by combining a set of components encapsulating simulation codes. The challenge is to mix parallel and distributed processing seamlessly.
One of the key challenges in programming grid computing infrastructures is data management. It has to be carried out at an unprecedented scale, and must cope with the native dynamicity of grids.
This theme aims at studying unconventional approaches to the programming of grids, based on the chemical metaphor.
Clusters, made up of homogeneous computers interconnected via high-performance networks, are now the most widely used general, high-performance computing platforms for scientific computing. While the cluster architecture is attractive with respect to price/performance, there still exists great potential for efficiency improvements at the software level. System software requires improvements to better exploit the cluster hardware resources. Programming environments need to be developed with both cluster efficiency and human programmer efficiency in mind.
We believe that cluster programming is still difficult, as clusters suffer from the lack of a dedicated operating system providing a single system image (SSI). A single system image provides the illusion of a single powerful and highly available computer to cluster users and programmers, rather than the vision of a set of independent computers, each with locally managed resources.
Several attempts to build an SSI have been made at the middleware level, as in
Beowulf or Mpi environments. The approach of the Paris Project-Team is to design
and implement a full SSI at the operating system level. Our objective is to combine ease of use, high
performance and high availability. All physical resources (processor,
memory, disk) and kernel resources (process, memory pages, data streams,
files) need to be visible and accessible from all cluster nodes. Cluster
reconfigurations due to a node addition, eviction or failure need to be
automatically dealt with by the system transparently to the applications.
Our SSI operating system is designed to perform global, dynamic and
integrated resource management.
As the execution time of scientific applications may be larger than the cluster's mean time between failures, checkpoint/restart facilities need to be provided not only for sequential applications but also for parallel applications, whatever communication paradigm they are based on. Even if backward error recovery (BER) has been extensively studied from a theoretical point of view, it is still challenging to implement BER protocols efficiently and transparently to applications. There are very few implementations of recovery for parallel applications. Our approach is to identify and implement, as part of the SSI OS, a set of building blocks that can be combined to implement different checkpointing strategies and their optimizations for parallel applications, whatever inter-process communication (IPC) layer they use.
In addition to our research activity on operating systems, we also study the
design of runtimes for supporting parallel languages on clusters. A runtime
is a software offering services dedicated to the execution of a particular
language. Its objective is to tailor the general system mechanisms (memory
management, communication, task scheduling, etc.) to achieve the best
performance from the target machine and its operating system. The main
originality of our approach is to use the concept of distributed shared
memory as the basic communication mechanism within the runtime. We are
essentially interested in Fortran and its OpenMP extensions. In particular, the
execution of OpenMP programs on a cluster requires a global address space
shared by threads deployed on different cluster nodes. We rely on the two
distributed shared memory systems we have designed, one at user level
implementing weak memory consistency models, and one at operating system
level implementing the sequential consistency model.
Computational grids are very powerful machines, as they aggregate huge computational resources. A lot of work has been carried out with respect to grid resource management. Existing grid middleware systems focus largely on resource management: discovery, registration, security, scheduling, etc. However, they provide very little support for grid-oriented programming models.
A suitable grid programming model should take into account the dual nature of a computational grid, which is a distributed set of (mainly) parallel resources. Our general objective is to propose such a programming model and to provide adequate middleware systems. Distributed object or component models seem to be a pertinent solution. However, they need to be tailored to scientific applications, in particular with respect to the encapsulation of parallel codes into objects or components, the communications between ``parallel'' objects or components, the required runtime support, the deployment and the adaptability.
The first issue is the relationship between object or component models, which should handle the distributed nature of the grid, and the parallelism of computational codes, which should take into account the parallelism of resources. It is thus required to efficiently combine both worlds into a coherent one.
The second issue concerns the simplicity and the scalability of communications between parallel codes. As the available bandwidth is larger than what a single resource can consume, parallel communication flows should allow a more efficient utilization of network resources. Advanced flow control should be used to avoid congesting networks. A crucial aspect of this issue is the support for data redistribution, which may be required in the communication between parallel codes.
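As an illustration of the data redistribution issue, the sketch below computes a redistribution schedule between two block-distributed codes with different numbers of processes; the process counts and array size are hypothetical. Each non-empty intersection of a sender block with a receiver block corresponds to one message, and messages between disjoint pairs can flow in parallel.

    // Hedged sketch: computing a redistribution schedule between two
    // block-distributed codes. Code A runs on mA processes, code B on nB
    // processes; both hold a 1-D array of nElems elements distributed by
    // contiguous blocks. The schedule lists, for each (sender, receiver)
    // pair, the global index range that must travel between them.
    #include <algorithm>
    #include <cstdio>

    struct Block { long lo, hi; };  // half-open global range [lo, hi)

    // Block owned by process p among nProcs processes.
    Block ownedBlock(long nElems, int nProcs, int p) {
        return { p * nElems / nProcs, (p + 1) * nElems / nProcs };
    }

    int main() {
        const long nElems = 100;
        const int mA = 3, nB = 4;   // hypothetical process counts

        for (int s = 0; s < mA; ++s) {
            Block a = ownedBlock(nElems, mA, s);
            for (int r = 0; r < nB; ++r) {
                Block b = ownedBlock(nElems, nB, r);
                long lo = std::max(a.lo, b.lo), hi = std::min(a.hi, b.hi);
                if (lo < hi)   // non-empty intersection => one message
                    std::printf("A[%d] -> B[%d]: elements [%ld, %ld)\n",
                                s, r, lo, hi);
            }
        }
        return 0;
    }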
Promoting a programming model that simultaneously supports distributed as
well as parallel middleware systems, independently of the actual resources,
raises three new issues. First, middleware systems should be decoupled from
the actual networks so as to be deployed on any kind of network. Second,
several middleware systems should be simultaneously active within the same
process. Third, solutions to the two previous issues should meet
high-performance constraints in order to be accepted by users.
The deployment of applications is another issue. Not only is it important to constrain the deployment by specifying requirements in terms of computational resources (Gflop/s, amount of memory, etc.), but it is also crucial to specify the constraints related to communication resources, such as the bandwidth or the latency between computational resources.
The last issue deals with the dynamic nature of computational grids. As targeted applications may run for a very long time, the grid environment is expected to change during execution. Not only should middleware systems support adaptability, but they should also be able to detect variations and to self-adapt. For example, an application may be partially redeployed to take advantage of new resources.
A major contribution of the grid computing environments developed so far is
to have decoupled computation from deployment. Deployment is
typically considered as an external service provided by the underlying
infrastructure, in charge of locating and interacting with the physical
resources. In contrast, as of today, no such sophisticated service exists
regarding data management on the grid: the user is still left to
explicitly store and transfer the data needed by the computation between
the computing sites. As with deployment, we claim that an adequate approach to this
problem consists in decoupling data management from
computation, through an external service tailored to the
requirements of scientific computation. We focus on the case of a grid
consisting of a federation of distributed clusters. Such a data sharing
service should meet two main properties: persistence and
transparency.
First, the data sets used by the grid computing applications may be very
large. Their transfer from one site to another may be costly (in terms of
both bandwidth and latency), so such data movements should be carefully
optimized. Therefore, the data management service should allow data to be
persistently stored on the grid infrastructure independently of the
applications, in order to allow their reuse in an efficient way.
Second, a data management service should provide transparent access to
data. It should handle data localization and transfer without any help from
the programmer. Yet, it should make good use of additional information and
hints provided by the programmer, if any. The service should also
transparently use adequate replication strategies and consistency protocols
to ensure data availability and consistency in a large-scale, dynamic
architecture.
Given that our target architecture is a federation of clusters, a few constraints need to be addressed. The clusters which make up the grid are not guaranteed to remain constantly available. Nodes may leave due to technical problems or because some resources become temporarily unavailable. This should obviously not result in disabling the data management service. Also, new nodes may dynamically join the physical infrastructure: the service should be able to dynamically take into account the additional resources they provide.
On the other hand, it should be noted that the algorithms proposed for
parallel computing have often been studied on small-scale configurations. Our
target architecture is typically made of thousands of computing nodes, say
tens of hundred-node clusters. It is well-known that designing low-level,
explicit Mpi programs is most difficult at such a scale. In contrast,
peer-to-peer approaches have proved to remain effective at large scales and
can serve as a source of inspiration.
Finally, in grid applications, data are generally shared and can be modified by multiple partners. Traditional replication and consistency protocols designed for DSM systems have often made the assumption of a small-scale, static, homogeneous architecture. These hypotheses need to be revisited and this should lead to new consistency models and protocols adapted to a dynamic, large-scale, heterogeneous architecture.
Until now, research activities related to the Grid have focused on the
design and implementation of middleware and tools to experiment with grid
infrastructures and applications. Very little attention has been paid to
programming models suitable for such widely distributed computing infrastructures.
Programming of such infrastructures is still very low-level. This
situation may somehow be compared to using assembly language to program
complex processors. Our objective is to study approaches for Grid
programming that do not expose the architectural details of the computing
infrastructure to the programmers. In particular, we are considering an
unconventional approach based on the chemical reaction paradigm, and
more precisely the Gamma model.
Gamma is based on multiset rewriting. The unique data structure in Gamma is
the multiset (a set that can contain several occurrences of the same
element), which can be seen as a chemical solution. A simple program
is a set of rules (chemical reactions). The result is obtained when a stable
state is reached, that is, when no more reactions apply. Our objective is to
express the coordination of Grid components or services through a set of
rules, while the multiset represents the services that have to be
coordinated.
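The following small sketch illustrates the Gamma execution model itself (not our coordination rules for Grid services): a multiset of integers reacts under a single rule that replaces any pair of elements by their sum, until a stable state, the singleton holding the total, is reached.

    // Hedged illustration of the Gamma execution model: a multiset of
    // values and a reaction rule applied until no reaction is possible.
    // The rule below replaces any pair {x, y} by {x + y}; the stable
    // state is the singleton containing the sum of the initial multiset.
    #include <cstdio>
    #include <iterator>
    #include <set>

    int main() {
        std::multiset<int> solution = {3, 1, 4, 1, 5, 9, 2, 6};

        // React while at least one pair of elements can be combined.
        while (solution.size() > 1) {
            auto x = solution.begin();
            auto y = std::next(x);
            int combined = *x + *y;         // the reaction: x, y -> x + y
            solution.erase(y);
            solution.erase(x);
            solution.insert(combined);
        }
        std::printf("stable state: %d\n", *solution.begin());
        return 0;
    }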
Research activities within the Paris Project-Team encompass several areas:
operating systems, middleware and programming models. We have chosen to provide
a brief presentation of some of the scientific foundations associated with
them.
A shared virtual memory system provides a global address space for a system
where each processor has physical access only to its local memory.
Implementation of such a concept relies on the use of complex cache coherence
protocols to enforce data consistency. To allow the correct execution of a
parallel program, it is required that a read access performed by one
processor returns the value of the last write operation previously performed
by another processor. Within a distributed or parallel system, the notion of
the last memory access is sometimes undefined, since there is no global clock
that gives a total order of the memory operations.
It has always been a challenge to design a shared virtual memory system for
parallel or distributed computers with distributed physical memories, capable
of providing performance comparable to other communication models such as
message passing. Sequential consistency, the most intuitive memory model, is
costly to implement efficiently on such architectures.
Several other memory models have thus been proposed to relax the requirements
imposed by sequential consistency. Among them, Release Consistency introduces
two synchronization operations: acquire and release. The aim of these two
operations is to specify when to propagate the modifications made to the
shared memory. Several implementations of Release Consistency have been
proposed: an eager one, for which modifications are propagated at the time of
a release operation; and a lazy one, for which modifications are propagated
at the time of an acquire operation. These two alternative implementations
differ in the number of messages that need to be sent/received, and in the
complexity of the implementation. Moreover, the target computing
infrastructures are often hierarchical, so that the consistency model should
take better advantage of this hierarchy.
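The acquire/release idea can be illustrated with the analogous memory-ordering primitives of standard C++; this is only an illustration of when modifications must become visible, not the implementation of a distributed shared memory system.

    // Illustration (not a DSM implementation) of acquire/release
    // semantics using standard C++ atomics: modifications made before the
    // release are guaranteed to be visible to the thread that performs
    // the matching acquire.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    int shared_data = 0;                    // ordinary shared variable
    std::atomic<bool> ready{false};         // plays the role of the lock

    void producer() {
        shared_data = 42;                               // modify shared memory
        ready.store(true, std::memory_order_release);   // "release": publish
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // "acquire": observe
            ;                                           // spin until published
        // All writes made before the release are now visible.
        std::printf("read %d\n", shared_data);
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
        return 0;
    }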
``A distributed system is one that stops you getting any work done when a machine you've never even heard about crashes.''(Leslie Lamport)
A system fails when it does not behave in a manner consistent with its
specification. An error is the consequence of a fault, when the faulty part
of the system is activated. It may lead to a system failure. In order to
provide highly available systems, fault tolerance techniques must be applied.
Error detection is the first step in any fault tolerance strategy.
Error treatment aims at preventing the error from leading to a system failure.
Fault treatment consists in preventing the fault from being activated again.
Two classes of techniques can be used for fault treatment: repair, which
consists in eliminating or replacing the faulty module; and reconfiguration,
which consists in transferring the load of the faulty element to valid
components.
Error treatment can be of two forms: error masking or error
recovery. Error masking is based on hardware or software redundancy in order
to allow the system to deliver its service despite the error. Error recovery
consists in restoring a correct system state from an erroneous state. In
forward error recovery techniques, the erroneous state is transformed
into a safe state. Backward error recovery consists in periodically
saving the system state, called a checkpoint, and rolling back to the
saved state if an error is detected.
A stable storage guarantees three properties in the presence of failures:
(1) integrity, data stored in stable storage is not altered by failures;
(2) accessibility, data stored in stable storage remains accessible
despite failures; (3) atomicity, updating data stored in stable storage
is an all-or-nothing operation. In the event of a failure during the update of a
group of data stored in stable storage, either all data remain in their initial
state or they all take their new value.
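On a single node, the atomicity property can be sketched with the classical shadow-update technique: the new version is written aside and installed with an atomic rename, so that a crash leaves either the old or the new version. This is only an illustration; a real stable storage also replicates data on several nodes to provide integrity and accessibility.

    // Hedged sketch of the atomicity property of stable storage on a
    // single node: the new version of the data is written to a temporary
    // file, forced to disk, then installed with an atomic rename(). A
    // failure at any point leaves either the old or the new version,
    // never a mix. (Real stable storage also replicates across nodes.)
    #include <cstdio>
    #include <string>
    #include <fcntl.h>
    #include <unistd.h>

    bool atomic_update(const std::string& path, const std::string& data) {
        std::string tmp = path + ".tmp";
        int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return false;
        bool ok = ::write(fd, data.data(), data.size()) == (ssize_t)data.size()
                  && ::fsync(fd) == 0;      // force the new version to disk
        ::close(fd);
        // rename() atomically replaces the old version by the new one.
        return ok && std::rename(tmp.c_str(), path.c_str()) == 0;
    }

    int main() {
        return atomic_update("checkpoint.dat", "state after iteration 42\n")
                   ? 0 : 1;
    }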
Recent research has focused on emerging peer-to-peer (P2P) systems.
First-generation P2P systems (e.g., Napster) still use a centralized
directory for data localization, then switch to direct P2P interaction for
the actual data transfers. Later, fully distributed,
flooding-based approaches (e.g., Gnutella) have been proposed. A
second generation of P2P systems (e.g., KaZaA) have combined the
previous techniques by integrating the notion of super-peer: localization
is flooding-based between the super-peers, which serve as local directories
for groups of regular peers. However, flooding strategies have one main
weakness: since they generate a lot of traffic, a limit has to be set on
the number of times queries are re-propagated. As a result, queries for
data may fail, whereas the data are actually stored in the system.
In order to provide both high fault tolerance and the guarantee to always
reach data available in the network, recent research has focused on
localization schemes based on Distributed Hash Tables (DHT). This
promising approach is illustrated by Chord (MIT), Pastry
(Microsoft Research) and Tapestry (UC Berkeley) and has also been
used for the latest major version (2.0) of the JXTA generic
environment for P2P services started by Sun Microsystems. Efforts are
currently under way to define a common API for such DHT-based systems.
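The principle shared by these DHT-based systems can be sketched as follows: node identifiers and data keys are hashed onto the same circular space, and a data item is located on the first node whose identifier follows the key. The sketch below is a naive, centralized illustration of this mapping; real systems distribute the routing over the peers and reach the successor in a logarithmic number of hops.

    // Hedged sketch of the principle behind DHT-based localization
    // (Chord/Pastry style): node identifiers and data keys are hashed
    // onto the same ring, and a data item is stored on the first node
    // whose identifier follows the key (its "successor"). Real systems
    // add routing tables so that a lookup takes O(log N) hops.
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>

    using Id = std::size_t;

    Id hashOf(const std::string& s) { return std::hash<std::string>{}(s); }

    struct Ring {
        std::map<Id, std::string> nodes;                 // id -> node name

        void join(const std::string& node) { nodes[hashOf(node)] = node; }

        // The node responsible for a key is its successor on the ring.
        const std::string& lookup(const std::string& key) const {
            auto it = nodes.lower_bound(hashOf(key));
            if (it == nodes.end()) it = nodes.begin();   // wrap around
            return it->second;
        }
    };

    int main() {
        Ring ring;
        for (auto n : {"node-A", "node-B", "node-C", "node-D"}) ring.join(n);
        std::printf("matrix.dat is stored on %s\n",
                    ring.lookup("matrix.dat").c_str());
        return 0;
    }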
High-performance communication relies on specific networking hardware: SCI,
Myrinet-1, VIA, Myrinet-2000, InfiniBand, etc. A dedicated low-level
communication library is often required to fully benefit from hardware-specific
features: GM or BIP for Myrinet, SISCI for SCI, etc. To face the diversity of
low-level communication
libraries, research has focused on generic high-performance environments such
as Active Message (Univ. of Berkeley), Fast Message (Univ. of
Illinois), Madeleine (LaBRI, Bordeaux), Panda/Ibis
(Univ. of Amsterdam) and Nexus (Globus Toolkit). Such generic
environments are usually not intended to be used directly by a programmer.
Higher-level communication environments are designed for this purpose:
PVM, Mpi, or software DSMs such as TreadMarks are examples in the field of
parallel computing.
While high performance communication research has mainly focused on
system-area networks, the emergence of grid computing enlarges its focus to
wide-area networks and, more specifically, to high-bandwidth, wide-area
networks. Research is needed to efficiently utilize such networks. Examples
include adaptive dynamic compression algorithms and, especially, parallel
stream communication.
Previous work has addressed the integration of communications with the thread
scheduler. This is particularly important as more and more middleware systems,
as well as applications, are multithreaded. Another related issue, which
deserves further research, is to minimize the network reaction time without
generating too much overhead.
Past research on distributed data management has led to three main approaches.
Currently, the most widely used approach to data management for distributed
grid computation relies on explicit data transfers between clients and
computing servers. As an example, the Globus toolkit provides GASS (Global
Access to Secondary Storage), based on the GridFTP protocol. Other explicit approaches (e.g.,
IBP) provide a large-scale data storage system, consisting of a set of
buffers distributed over the Internet. The user can ``rent'' these storage areas
and use them as temporary buffers for efficient data transfers across a
wide-area network. Transfer management is still left to the user.
Besides, IBP does not handle dynamic joins/departures of storage nodes and
provides no consistency guarantee for multiple copies of the same data.
In contrast, Distributed Shared Memory (DSM) systems provide
transparent data sharing, via a unique address space accessible to physically
distributed machines. Within this context, a variety of consistency models
and protocols have been defined. These systems do offer transparent access to
data: all nodes can read and write data in a uniform way, using a unique
identifier or a virtual address. It is the responsibility of the DSM system
to localize, transfer, replicate data, and guarantee their consistency
according to some semantics. Nevertheless, existing DSM systems have
generally shown satisfactory efficiency only on small-scale configurations,
typically, a few tens of nodes.
Recently, peer-to-peer (P2P) computing has proven to be an efficient approach
for large-scale data sharing, as illustrated by Napster, Gnutella and
KaZaA. Such systems have proven able to manage very large
configurations (millions of nodes) with a very high volatility. However, we
can note that most P2P systems focus on sharing immutable files: the shared
data are read-only and can be replicated at will. Recently, some
mechanisms for sharing mutable data in a P2P environment have been
proposed by systems like OceanStore and Ivy, with restricted use (no multiple
writers nor conflict resolution).
Software component technology promotes the building of applications by
assembly, that is, by manufacturing, rather than by development. The goals are
to focus expertise on domain fields, to improve software quality and to
decrease the time to market thanks to the reuse of existing codes.
The CORBA Component Model (Ccm) does not provide any support for parallel components.
The CCA (Common Component Architecture) Forum aims at defining a component
model for both parallel and distributed applications.
The project-team's research activities primarily address scientific computing, and specifically numerical applications that require the execution of several codes simultaneously. This kind of application requires the use of both parallel and distributed systems. Parallel processing is required to address performance issues, and distributed processing is needed to fulfill the constraints imposed by the localization and the availability of resources, or by confidentiality reasons. Such applications are being experimented with within contracts with industry or through our participation in application-oriented research grants.
While scientific computing is our primary target for applying the results obtained by the project-team, we do not exclude other kinds of applications, such as multimedia or discrete-event distributed applications, to which our research can be applied.
Christine Morin
Registered at APP, under Ref. IDDN.FR.001.480003.005.S.A.2000.000.10600
GNU General Public License version 2. Kerrighed is a registered trademark.
Kerrighed (formerly known as Gobelins) is a
Single System Image (SSI) operating system for high-performance
computing on clusters. It provides the user with the illusion that a cluster
is a virtual SMP machine.
In Kerrighed, all resources (processes, memory segments, files, data
streams) are globally and dynamically managed to achieve all the SSI
properties. Global resource management enables transparent distribution of
resources throughout the cluster nodes and allows demanding applications to
take advantage of the whole cluster's hardware resources. Dynamic resource
management enables transparent cluster reconfigurations (node addition or
eviction) for the applications and high availability in the event of node
failures. In addition, a checkpointing mechanism is provided by Kerrighed to
avoid restarting applications from the beginning when a node failure occurs.
To avoid mechanism redundancy and conflicting decisions in different
distributed resource management services and to decrease the software
complexity of such services, Kerrighed resource management services are
built in a unified and integrated way.
Kerrighed preserves the interface of a standard single node operating
system, which is familiar to programmers. Legacy sequential or parallel
applications running on this standard operating system may be executed
without modification on top of Kerrighed and further optimized if needed.
Kerrighed is not an entirely new operating system developed from scratch. On
the contrary, it has been designed and implemented as an extension to an
existing standard operating system. Kerrighed only addresses the distributed
nature of the cluster, while the native operating system running on each node
remains responsible for the management of local physical resources. Our
current prototype is based on Linux, which is extended using the
standard module mechanism. The Linux kernel itself has only been slightly
modified.
A public mailing list (kerrighed.users@irisa.fr) is available to support
users of Kerrighed.
Kerrighed includes 80,000 lines of code (mostly in C).
It represents 140 person-months of effort. The development of Kerrighed started in late 1999. The stable release of Kerrighed is Version V0.72. A major feature of this release is its Pthread support, which allows legacy OpenMP and multithreaded applications to be executed on a cluster without any recompilation.
Kerrighed currently includes 6 Linux modules and a limited patch to Linux
Kernel 2.2.13. A port to Linux Kernel 2.4.20 is currently in progress, and a
new version of Kerrighed will be released in early December.
Several demonstrations of Kerrighed have been presented this year at
Linux Expo (Paris, February 2003, Louis Rilling and Christine Morin),
the IPDPS Conference (Nice, April 2003, Geoffroy Vallée and Louis
Rilling), Edf R&D Printemps de la recherche (Clamart, May 2003,
Renaud Lottiaux) and Euro-Par Conference (Klagenfurt, Austria, August
2003, Pascal Gallard and Gaël Utard). Kerrighed is currently being experimented
with by Cap Gemini Ernst and Young, ONERA CERNT and Dga CELAR in the framework
of the COCA contract, as well as by Edf. More than 100 external downloads of
Kerrighed have been recorded since November 2002.
Christian Pérez
Registered at APP, under Ref. IDDN.FR.001.260013.000.S.P.2002.000.10000.
GNU General Public License version 2.
PadicoTM is an open integration framework for
communication middleware and runtimes. It enables several middleware systems
(such as CORBA, Mpi, Soap, etc.) to be used at the same time. It provides
an efficient and transparent access to all available networks with the
appropriate method.
PadicoTM is composed of a core, which provides a high-performance framework
for networking and multi-threading, and services, plugged into the core.
High-performance communications and threads are obtained thanks to Marcel and Madeleine, provided by Pm2.
An extended set of commands is provided with PadicoTM to ease the
compilation of its modules (padico-cc, padico-c++, etc.). In
particular, a very useful one aims at hiding the differences between
CORBA implementations. The first version was called Ugo
(available in the 0.1.x series). It has since been replaced by myCORBA
in the next version.
PadicoControl is a Java application that helps control the deployment of
PadicoTM applications. It allows a user to select the deployment nodes and to
perform individual or collective operations, such as loading or running a
PadicoTM module.
PadicoModule (still under development) is a Java application which assists
with the low-level administration of a PadicoTM installation. It allows
checking module dependencies, modifying module attributes, etc. It can work on
a local file system as well as across a network, thanks to a Soap daemon that
is part of the service.
A public mailing list (padico-users@listes.irisa.fr) is available to
support users of PadicoTM.
The development of PadicoTM started at the end of 2000. It
represents XXX person-months of effort.
The stable release of PadicoTM is Version 0.1.5 (November 2002). The
unstable version (CVS version) is 0.2.0beta3.
The stable version (0.1.x series) includes the PadicoTM core, PadicoControl,
Ugo and external software: a PadicoTM-enabled version of omniORB
(3.0.2), a PadicoTM-enabled version of MPICH (1.1.2), and a customized
version of Pm2.
PadicoTM 0.1.5 (without external software) includes 31,000 lines of C and
C++ (ca. 900 kB), 2,300 lines of Java (ca. 70 kB) and
7,000 lines of shell, make and configure scripts (ca. 200 kB).
The CVS version (0.2.x series) includes an updated version of PadicoTM core
(bug fixes as well as some internal rewriting), PadicoControl,
myCORBA (replaces Ugo) and includes external software: a
customized version of Pm2 and a regular version of
90 external downloads, from 63 unique IP addresses, were recorded between July 2002 and November 2003.
PadicoTM has been funded by the French ACI GRID RMI. As far as we are aware, it
is currently used by several French projects: ACI GRID HydroGrid and ACI GRID EPSN.
PadicoTM has been demonstrated at the SuperComputing 2002 Conference,
in Baltimore, MD, USA.
Christian Pérez
Prototype under development.
Yet to be decided.
The PaCO++ objectives are to allow a simple and
efficient embedding of a Spmd code into a parallel CORBA object and to
allow parallel communication flows and data redistribution during an
operation invocation on such a parallel CORBA object.
PaCO++ provides an implementation of the concept of parallel object
applied to CORBA. A parallel object is an object whose execution model is
parallel. It is accessible externally through an object reference whose
interpretation is identical to a standard CORBA object.
PaCO++ extends CORBA without modifying the model, because we aim at
defining a portable extension to CORBA that can be added to any
CORBA implementation. This choice also stems from the consideration that the
parallelism of an object appears to be an implementation issue of the object.
Thus, the Omg Idl does not need to be modified.
PaCO++ is made of two components: a compiler and a runtime library.
The compiler generates parallel CORBA stub and skeleton from an Idl file
which describes the CORBA interface and from an Xml file which describes
the parallelism of the interface. The compilation is done in two steps. The
first step involves a Java Idl-to-Idl compiler based on SableCC, a
compiler compiler, and on Xerces for Xml parsing. The second step,
written in Python, generates the stub files from templates configured with
inputs generated during the first step.
The runtime, currently written in C++, deals with the parallelism of
the parallel CORBA object. It is very portable thanks to the utilization of
abstract APIs for communications, threads and redistribution libraries.
PaCO++ is still in a development phase. A public
version should be released during the winter 2003-2004. PaCO++ has
been successfully tested on top of three CORBA implementations: Mico,
omniORB3 and omniORB4. Moreover, it supports PadicoTM.
The current version of PaCO++ includes 7,600 lines of Java (ca. 252 kB), 6,900 lines of Python (ca. 390 kB), 13,400 lines of C++ (ca. 390 kB) and 2,200 lines of shell, make and configure scripts
(66 kB).
PaCO++ is supported by the French ACI GRID RMI. Non-public beta
versions are currently used by several French projects: ACI GRID HydroGrid,
ACI GRID EPSN and Rntl VTHD ++.
Jean-Louis Pazat,
LGPL
CASPer aims at providing an Application Service
Provider (ASP) for Grid computing.
The server side is based on the Globus Toolkit (GTK 3): CASPer is
made of services that communicate with well-defined protocols, mainly
Xml-RPC calls (for Grid Services), JDBC connections (for databases)
and HTTP connections. The ASP manages authentication, user interface,
persistent data storage, job scheduling. Batch queues provide computing
power for jobs submitted by users through the ASP.
On the client side, CASPer can work with any standard Web browser, which
ensures that CASPer will be usable from most platforms.
CASPer is partly built using components off the shelf (COTS) for the
Web browser, Web server (TOMCAT), SQL database (MySQL). The
job managers currently targeted are OpenPBS, LSF and LoadLeveler. A
Distributed Job Manager (XtremWeb) will also be integrated within
CASPer as a special job manager.
Security in CASPer is managed at different layers: first, we secure the HTTP
connection between the client and the ASP (SSL and certificates), then we
secure the communications between the ASP and batch queues (services have
certificates). This is needed because batch queues may be spread across Virtual
Organizations.
CASPer provides a Job Scheduler as a service responsible for scheduling job
requests from users to the appropriate batch queue. Criteria for scheduling
include: the required architecture, the list of queues the user is authorized
to run jobs on, the current state of the queue, the job type (e.g., parallel
or distributed), etc.
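A hedged sketch of this scheduling decision is given below; all types and field names are hypothetical and only illustrate how the listed criteria (architecture, authorization, job type, queue state) can be combined to select a batch queue.

    // Hedged sketch of the kind of decision the CASPer job scheduler has
    // to make: filter the queues a user may submit to, then rank the
    // remaining ones. All types and field names here are hypothetical.
    #include <string>
    #include <vector>

    struct Queue {
        std::string name, architecture;
        bool parallel;        // supports parallel jobs
        int  waitingJobs;     // current state of the queue
    };

    struct Job {
        std::string requiredArchitecture;
        bool parallel;
        std::vector<std::string> authorizedQueues;   // from the user's ACL
    };

    const Queue* selectQueue(const Job& job, const std::vector<Queue>& queues) {
        const Queue* best = nullptr;
        for (const Queue& q : queues) {
            bool authorized = false;
            for (const auto& name : job.authorizedQueues)
                if (name == q.name) authorized = true;
            if (!authorized) continue;                        // ACL check
            if (q.architecture != job.requiredArchitecture) continue;
            if (job.parallel && !q.parallel) continue;        // job type
            if (!best || q.waitingJobs < best->waitingJobs)   // least loaded
                best = &q;
        }
        return best;     // nullptr if no suitable queue exists
    }

    int main() {
        std::vector<Queue> queues = {{"short", "x86", true, 3},
                                     {"long",  "x86", true, 1}};
        Job job{"x86", true, {"short", "long"}};
        return selectQueue(job, queues) ? 0 : 1;
    }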
User management is done by a module that takes care of generating certificates, and updating access control lists (ACLs).
A CASPer application is made of a GUI, whose main function is to select job
submission parameters, and a Job Runner, which requests a job submission from
the job scheduler in order to submit the code that executes the simulation.
Computation results are transferred from the batch queue to the ASP using
the RFT Grid Service (which relies on a secured FTP protocol). These files
will be stored on the ASP, with owner information. The result files can be
remotely viewed (if a suitable viewer applet is available), downloaded, or
deleted. The CASPer security manager controls access to the files.
The CASPer ASP is under development. The current
release is for internal testing only. This project started in October 2003
and is supported by a Rntl contract. The main industrial contractor is
EADS-CCR.
Yvon Jégou
Prototype under development
Yvon Jégou,
APP registration in the future, license type not yet defined (LGPL?).
The Mome DSM
provides a shared segment space to parallel programs running on distributed
memory computers or clusters. Individual processes can freely request
mappings between their local address space and Mome segments. The DSM
handles the consistency of mapped memory regions at the page-level. Two
consistency models are currently implemented and can be selected by the user
programs at the page level: the classical sequential model and an explicit
weak model. Mome's initial target was the execution of programs from the
high-performance community, which exploit loop-level parallelism using a Spmd computation model. The current release of Mome supports the following features (a usage sketch is given after this list):
page aliasing: the same page can be mapped twice (or more) in the same address space. This feature is used in the implementation of a DSM-based coupling library.
heterogeneous applications: different programs (binaries) can share data (be coupled) through the same instance of the DSM.
checkpointing: on a checkpointing request from the application, the DSM guarantees that two copies of each DSM page are present on two different nodes. In case of failure, it is possible to restart the DSM and to recover all the pages of the last checkpoint, as long as at most one node that participated in the checkpoint has crashed.
dynamic connection of processes: Mome can be started as a background
daemon and supports the dynamic connection of processes. This possibility
allows the implementation of persistent data repositories.
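The following sketch illustrates the Mome programming model described above (segment creation, mapping into the local address space, per-segment consistency selection). The mome_* names are illustrative only, not the real Mome API, and their bodies are single-process stand-ins so that the sketch is self-contained.

    // Hypothetical sketch of the Mome programming model: a process creates
    // a shared segment, maps it into its address space and selects a
    // consistency model. The mome_* names do not reproduce the real API;
    // the bodies are single-process stand-ins so the sketch compiles.
    #include <cstddef>
    #include <cstdlib>
    #include <map>

    enum Consistency { MOME_SEQUENTIAL, MOME_WEAK };

    static std::map<int, void*> segments;          // stand-in segment table
    static int next_id = 0;

    int mome_segment_create(std::size_t size) {    // would contact the DSM
        segments[next_id] = std::calloc(size, 1);
        return next_id++;
    }
    void* mome_segment_map(int seg, std::size_t offset) {    // would mmap pages
        return static_cast<char*>(segments[seg]) + offset;
    }
    void mome_set_consistency(int /*seg*/, Consistency /*m*/) {}  // per-segment choice
    void mome_barrier() {}                          // would synchronize processes

    int main() {
        int seg = mome_segment_create(1 << 20);               // 1 MB shared segment
        double* a = static_cast<double*>(mome_segment_map(seg, 0));
        mome_set_consistency(seg, MOME_WEAK);                 // explicit weak model

        for (int i = 0; i < 1000; ++i) a[i] = 2.0 * i;        // Spmd-style update
        mome_barrier();                                        // make updates visible
        return 0;
    }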
The current developments around Mome involve the implementation of an OpenMP
runtime system, the implementation of a DSM-based coupling library and the
development of a persistent data repository for the grid.
Mome is implemented in C (50,000 lines) and represents
a 24-person-month effort. The current release is Mome 0.8. The DSM is used
in the Alcatel collaboration (checkpointing), in the VTHD contracts (code
coupling using a DSM), in the e-Toile project (DSM-based data repository for
grid computing) and in the POP project (OpenMP runtime).
Gabriel Antoniu,
Not yet defined.
JuxMem is an experimental platform for building a
data-sharing service for grid computing. The service addresses the problem of
managing mutable data on dynamic, large-scale configurations. It can be seen
as a hybrid system combining the benefits of Distributed Shared Memory (DSM)
systems (transparent access to data, consistency protocols) and Peer-to-Peer
(P2P) systems (high scalability, support for resource volatility). The target
applications are numerical simulations, based on code coupling, with
significant requirements in terms of data storage and sharing. Several
studies on replication strategies for fault tolerance and consistency
protocols for volatile environments are under way within the framework
provided by JuxMem. A more detailed description of the approach followed by
JuxMem is given in .
Implemented in Java, based on the JXTA generic
platform for P2P services. 5,000 lines of code. Implementation started in
February 2003. JuxMem is the starting point of a Grid Data Service (GDS)
that will be built in collaboration with the ReMaP/GRAAL (Lyon) and REGAL
(Paris) research groups, within the framework of the GDS project of the
ACI MD (see Section ).
Clusters are not only the most widely used general high-performance computing
platforms for scientific computing; they have also become the dominant
platform for high-performance computing today.
Since 1999, the Paris Project-Team has been engaged in the design and
development of Kerrighed, a genuine Single System Image cluster operating
system for general high-performance computing. Kerrighed targets four
properties: (1) resource distribution transparency, i.e., offering
processes transparent access to all resources, and resource sharing between
processes whatever the resource and process location; (2) high
performance; (3) high availability, i.e., tolerating node failures
and allowing application checkpoint and restart; and (4) scalability,
i.e., dynamic system reconfiguration, node addition and eviction, transparently
to applications.
In 2003, thanks to the recruitment of two expert engineers within the COCA
Contract, we have integrated the results obtained last year by several PhD
students into a unique prototype. Two major releases (Kerrighed V0.70 and V0.72) have been delivered. Kerrighed has been significantly enhanced, and several new
functionalities have been implemented. Kerrighed V0.72 is now suitable for
the execution of applications provided by our industrial partners (Edf, Dga).
In 2003, we have focused on the implementation of a configurable global
scheduler for Kerrighed. The Kerrighed scheduler can be hot-plugged or hot-stopped.
A development framework allowing global scheduling policies to be easily
implemented has been designed and implemented. These schedulers may rely on new
components, developed without any kernel modification, or on components already
existing in Kerrighed. Some preliminary global scheduling policies have been
experimented with.
In the near future, we plan to implement a number of global scheduling policies
in Kerrighed and to experiment with them on workloads made up of real
sequential and parallel applications provided by Edf. We will also design and
implement a simple batch system exploiting Kerrighed features to meet users'
requirements.
In order to cope with the migration of communicating processes, we have
designed the Kernet system, which manages data streams (socket, pipe, char
devices) in Kerrighed. In 2003, Unix and Inet sockets have been
implemented on top of Kernet. Other data stream interfaces will be
implemented in 2004. Performance evaluations have been carried out with Mpi applications based on the MPICH environment. (Note that no modification of the
MPICH environment is needed to execute it within Kerrighed.) They demonstrate
that an Mpi process can be migrated in Kerrighed without any performance
degradation incurred by communications taking place after migration.
Kernet relies on the Gimli/Gloïn system. It is a portable,
high-performance communication system, providing a kernel-level send/receive
interface to Kerrighed distributed services. We have revisited the design of
Gimli/Gloïn to obtain better performance, and to extend its
interface with active messages and pack/unpack primitives. The implementation
of the new Gimli/Gloïn architecture, which offers
high-performance communication both at kernel level and at user level, is in
progress. In addition, we plan to implement a new Gloïn device to
better exploit the Myrinet technology.
Today, Kerrighed offers complete support for the Posix thread
standard. This important result has been obtained thanks to our previous
results on distributed shared memory. It also crucially relies on the work
carried out in 2003 on the design and implementation of distributed mechanisms
for proper thread termination, cluster-wide signal management, and distributed
synchronization facilities (locks, barriers, etc.) compatible with preemptive
thread migration. The Kerrighed Pthread interface has been validated by
executing existing OpenMP applications compiled with the unmodified Omni
1.4 OpenMP compiler targeting pthreads on SMP multiprocessors. The successful
execution of the test programs shipped with the Omni compiler to ensure its
correct installation on a given architecture demonstrates that Kerrighed is
now mature enough to support multithreading on a cluster. The OpenMP version
of the HRM1D Edf application has also been successfully executed on Kerrighed. It includes
7,000 lines of Fortran code. However, performance has to be improved. In the
future, we plan to explore cluster-aware compilation methods that produce more
efficient code for OpenMP programs. They should also provide OpenMP developers
with tools to better understand the performance bottlenecks of their
applications so that they can tune their parallelized algorithms.
Future work on Kerrighed is three-fold. First, we will continue the
development of Kerrighed with the design of a distributed file system based on
containers. Kerrighed will also be ported to a 64-bit architecture based
on Opteron processors. Second, we will work further on high-availability
issues. On the one hand, we will continue the work started this summer in
cooperation with Rutgers University during the internship of Pascal Gallard. It
is devoted to high-availability issues, exploiting the read and write RDMA
features provided by the latest generation of Myrinet adapters. On the other
hand, we will work on dynamic reconfiguration mechanisms for Kerrighed distributed services. Last, we will pursue our efforts to extend the community
of Kerrighed users. In cooperation with Edf, we plan to build an OSCAR
package based on Kerrighed (SSI-OSCAR) during the post-doctoral internship of
Geoffroy Vallée at Oak Ridge National Laboratory.
High-performance I/O is of primary importance for the applications executed on clusters. Some applications, like numerical computation or VOD, demand high-bandwidth sequential accesses. Others, like mail or Web servers, benefit from low-latency data and meta-data operations. As of today, no cluster file system provides good performance for the whole range of access patterns.
Rather than putting forward another middleware, we explore a new approach to
make the operating system capable of efficient distributed I/O. We
propose to manage a cluster-wide cache, consisting of both data and meta-data,
through Distributed Shared Memory (DSM) techniques.
A first prototype has been implemented based on a modified version of the Linux Kernel. We plan to complete this prototype to validate our approach with respect to standard benchmarks. We also plan to provide an MPI-IO interface.
Backward error recovery involving checkpointing and restart of tasks is an
important component of any system providing fault tolerance to applications
distributed over a network. In Kerrighed, one of our objectives is to be able
to checkpoint and restart any kind of scientific application: they may range
from sequential applications to parallel applications, with communication based
on message passing, shared memory, or even both of them.
We have identified common mechanisms for implementing a wide variety of
checkpointing and rollback recovery protocols for both message-passing and
distributed shared memory systems. The key mechanism tracks direct dependencies
among tasks and memory pages, and is thus common to both distributed shared
memory and message-passing systems.
It can moreover be efficiently implemented, since the overhead of each
interaction is very light, both in terms of computation and control
information. The proposed mechanism can finally support several optimizations
discussed in the literature.
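A hedged sketch of such direct-dependency tracking is given below: each interaction through a shared page records a dependency from the reader's current checkpoint interval to the writer's, so that a recovery algorithm can compute which tasks must roll back. The structures are illustrative only and do not reproduce Kerrighed's implementation.

    // Hedged sketch of direct-dependency tracking: every interaction
    // through a page records a dependency from the reader's checkpoint
    // interval to the writer's. On rollback, the transitive closure of
    // these dependencies gives the set of tasks that must restart from an
    // earlier checkpoint. All structures here are illustrative.
    #include <cstdio>
    #include <map>
    #include <set>
    #include <utility>

    using Task = int;
    using Interval = int;                                  // checkpoint interval number

    std::map<int, std::pair<Task, Interval>> lastWriter;   // page -> (task, interval)
    std::set<std::pair<std::pair<Task, Interval>,
                       std::pair<Task, Interval>>> deps;   // direct dependencies

    void onWrite(Task t, Interval i, int page) { lastWriter[page] = {t, i}; }

    void onRead(Task t, Interval i, int page) {
        auto it = lastWriter.find(page);
        if (it != lastWriter.end() && it->second.first != t)
            deps.insert({{t, i}, it->second});             // t depends on the writer
    }

    int main() {
        onWrite(/*task*/0, /*interval*/1, /*page*/42);     // task 0 writes page 42
        onRead (/*task*/1, /*interval*/3, /*page*/42);     // task 1 reads it later
        std::printf("%zu direct dependency(ies) recorded\n", deps.size());
        return 0;
    }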
We have carried out a first implementation of a coordinated checkpointing
protocol for multithreaded applications within the Kerrighed cluster operating
system. A mechanism to save application checkpoints incrementally has been
implemented. As checkpoint storage has a high impact on performance during
fault-free execution, Kerrighed can save checkpoints either in the memory of
two nodes or on disk. Synchronization is an important issue when designing a
checkpoint/recovery protocol for shared memory applications. In fact, parallel
applications traditionally use locks and barriers, which may induce causal
dependences between processes. We have studied how to extend our previous work
on dependence tracking to deal with synchronization.
This work will be continued next year. We will finalize the implementation in
Kerrighed of all the mechanisms needed to checkpoint and recover parallel
applications communicating by message passing (for instance, Mpi applications)
or by shared memory (for instance, OpenMP applications). A preliminary global
coordinated checkpointing strategy will be evaluated. Comparison with other
systems will be carried out in the framework of the Procope 2004
bilateral collaboration with the University of Ulm, Germany.
Federations of clusters (aka clusters of clusters) are very useful for
applications like large-scale code coupling. Faults may appear very frequently,
so that checkpointing strategies should definitely be provided to restart the
applications in the event of a node failure in a cluster. To take into account
the constraints introduced by the cluster federation architecture, we propose a
hierarchical checkpointing protocol.
Large companies exploit several medium-size clusters distributed on several
geographic sites. Some applications, such as those using code coupling, may
exceed the capacity of a single cluster. A solution is to run each
component of the code on a different cluster. Moreover, on a given cluster,
only a limited number of applications can be run at the same time. However, other
clusters in the same company may be idle or underloaded at the same time, and
hence could be enrolled in the computation. What is needed here is a
Grid-aware operating system, that could federate clusters in order to
make them cooperate, in particular for sharing resources.
We have worked on the design of such a Grid-aware operating system. It
should be able to manage a large number of nodes, and to deal with the
dynamicity inherent to a federation, where multiple reconfigurations (node
connection, disconnection, or failure) may be in progress at the same time.
Our proposal is based on a peer-to-peer infrastructure. The idea is to
build a virtual overlay network for a federation. Such a network
provides a key-based routing protocol, making transparent the physical location
of any object named by a key. The Grid-aware operating system would
encompass several distributed services, such as, for instance, services for
assembling a federation, managing and scheduling applications, controlling
resource access, managing a virtual shared memory and a distributed file
system, etc.
This year, we have implemented (using C) a peer-to-peer overlay network
inspired by Pastry.
The first service that we have studied is a service for executing distributed
applications using the shared memory paradigm in a cluster federation. This
raises the problem of executing shared memory parallel applications on dynamic
and large-scale systems. The shared memory is private to each application and
volatile; the application components transparently access shared memory
objects via their usual address space. The peer-to-peer system tolerates a
bounded number of simultaneous reconfiguration events (node failures,
disconnections, or joins) and an infinite number of reconfigurations over time.
To manage replicas of memory objects, we have designed a coherence protocol
similar to Kai Li's protocols.
Optimizations to this protocol will be studied in the near future and both theoretical and experimental evaluations will be performed. Furthermore, we plan to study a home-based coherence protocol implementing a release consistency memory model.
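For illustration, the sketch below shows the single-writer/multiple-readers invalidation scheme on which Kai Li-style protocols are built: a manager tracks the copy set of each page and invalidates all other copies before granting write access. The handling of node volatility through the peer-to-peer layer, which is the difficult part of our protocol, is not shown.

    // Hedged sketch of the single-writer / multiple-readers invalidation
    // scheme underlying Kai Li-style protocols: a page manager tracks the
    // copy set of each page; a write access invalidates all other copies
    // before granting exclusive access. Node failures and departures,
    // handled by the peer-to-peer layer in the real protocol, are not shown.
    #include <cstdio>
    #include <map>
    #include <set>

    using Node = int;

    struct PageManager {
        std::map<int, std::set<Node>> copies;   // page -> nodes holding a copy
        std::map<int, Node> owner;              // page -> current exclusive owner

        void readFault(Node n, int page) {
            copies[page].insert(n);             // add n to the copy set
        }

        void writeFault(Node n, int page) {
            for (Node c : copies[page])
                if (c != n)
                    std::printf("invalidate page %d on node %d\n", page, c);
            copies[page] = {n};                 // n holds the only valid copy
            owner[page] = n;                    // and becomes the exclusive owner
        }
    };

    int main() {
        PageManager mgr;
        mgr.readFault(1, 7);                    // nodes 1 and 2 read page 7
        mgr.readFault(2, 7);
        mgr.writeFault(3, 7);                   // node 3 writes: others invalidated
        return 0;
    }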
Mome's initial target was the execution of programs from the high-performance
community, which exploit loop-level parallelism using a Spmd execution model.
This implementation makes a clear distinction between shared and private data.
The allocation of the shared data in the shared space must be explicitly
requested by the application. This execution model is consistent with the HPF
language where the variables are implicitly private and the shared variables
must be explicitly specified.
The OpenMP specification targets SMP architectures: shared memory multiprocessors. In the OpenMP model, all variables are implicitly shared. The private variables (one instance per thread) must be explicitly specified. It is not possible through static analysis to decide at compile-time which objects are shared and which ones are private.
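The small example below illustrates this point: the array reached through the pointer is implicitly shared by all threads, whereas the loop index is private, and whether the pointer refers to heap, global or stack memory cannot be known at compile time.

    // Small illustration of why sharing cannot be decided statically in
    // OpenMP: the array reached through p is implicitly shared by all
    // threads, the loop index i is private, and whether *p refers to heap,
    // global or stack memory is unknown at compile time. On a cluster,
    // the DSM must therefore be able to back any such object.
    #include <cstdio>
    #include <vector>

    void scale(double* p, int n, double factor) {
        #pragma omp parallel for          // p and factor shared, i private
        for (int i = 0; i < n; ++i)
            p[i] *= factor;               // may touch heap, global or stack data
    }

    int main() {
        std::vector<double> v(1000, 1.0); // heap-allocated, shared between threads
        double onStack[16] = {0};         // stack-allocated, also legal to share
        scale(v.data(), (int)v.size(), 3.0);
        scale(onStack, 16, 3.0);
        std::printf("%f %f\n", v[0], onStack[0]);
        return 0;
    }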
The Mome DSM implementation and the associated runtime system have been
adapted in order to support standard OpenMP codes without adding complexity to
compilers: the thread stacks can now be allocated in the shared space, the
signal handlers are executed on private stacks, the DSM internal code never
reads or writes in the application space, the distributed synchronization
objects are allocated in the shared space but the primitives do not touch the
objects, etc.
A new implementation of the nth_lib runtime system from the IST POP project
on top of the Mome DSM is in progress, and experiments will start in the
near future. The integration of the release consistency model into Mome is
planned.
Computational grids differ from previous computing infrastructure as they
exhibit parallel and distributed aspects: a computational grid is a set of
various and widely distributed computing resources, which are often
parallel. Therefore, a grid usually contains various networking
technologies, from system-area networks to wide-area networks.
PadicoTM is a communication framework that decouples application middleware
systems from the actual networking environment. Hence, applications become able
to transparently and efficiently utilize any kind of communication middleware
(either parallel or distributed) on whatever network they are deployed
on. Moreover, to support advanced grid programming models, PadicoTM is able to
concurrently support several communication middleware systems.
PadicoTM achieves these functionalities by implementing a
dual-abstraction model which is organized in three layers: arbitration,
abstraction, and personalities.
The two paradigms, parallel and distributed, are present at each level.
Therefore, cross-paradigm translation is performed only when required (i.e.,
distributed middleware atop parallel hardware, or parallel middleware atop
distributed networks), without any feature bottleneck. The lowest-level layer is
the arbitration layer. Its goal is to provide arbitrated interfaces, i.e.,
consistent, reentrant and multiplexed access to every networking resource;
each resource is utilized with the most appropriate driver and method.
On top of the arbitration layer, an abstraction layer aims at providing
abstract interfaces well suited for their use by various middleware systems,
independently from the hardware. The abstract layer should be fully
transparent: the interfaces are the same whatever the underlying network is.
The last layer is a personality layer which is able to supply various
standard APIs on top of the abstract interfaces. Personalities are thin
wrappers which adapt a generic API to make it look like another API. They
perform no protocol adaptation nor paradigm translation; they only adapt the
syntax.
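The sketch below illustrates what such a personality looks like: a set of thin functions that expose a generic abstract interface under a socket-like syntax. The vlink_* names stand for the abstract interface and are illustrative only; their bodies are trivial stand-ins so that the sketch is self-contained.

    // Hedged sketch of a "personality": a thin, syntax-only wrapper that
    // makes a generic abstract interface look like another well-known API.
    // The vlink_* functions below stand for an abstract interface (names
    // are illustrative, with trivial stand-in bodies); the personality
    // exposes them under a socket-like syntax without adding any protocol
    // or paradigm translation.
    #include <cstddef>
    #include <cstring>

    // --- abstract interface (illustrative stand-ins) ---------------------
    struct VLink { /* connection state would live here */ };
    VLink* vlink_connect(const char*, int) { return new VLink; }
    long   vlink_send(VLink*, const void*, std::size_t len) { return (long)len; }
    long   vlink_recv(VLink*, void*, std::size_t) { return 0; }
    void   vlink_close(VLink* l) { delete l; }

    // --- socket-like personality: pure syntax adaptation -----------------
    using padico_socket_t = VLink*;

    padico_socket_t padico_connect(const char* host, int port) {
        return vlink_connect(host, port);
    }
    long padico_write(padico_socket_t s, const void* buf, std::size_t len) {
        return vlink_send(s, buf, len);
    }
    long padico_read(padico_socket_t s, void* buf, std::size_t len) {
        return vlink_recv(s, buf, len);
    }
    void padico_close(padico_socket_t s) { vlink_close(s); }

    int main() {
        padico_socket_t s = padico_connect("node-0", 4242);
        const char msg[] = "hello";
        padico_write(s, msg, std::strlen(msg));
        padico_close(s);
        return 0;
    }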
During 2003, we finalized the specification of the three-layer
dual-abstraction model and modified PadicoTM accordingly. The arbitration
layer in PadicoTM is called NetAccess; it contains two subsystems: SysIO for
access to system I/O (sockets, files), and MadIO, based on Madeleine, for
multiplexed access to high-performance networks. A core handles a
consistent interleaving of the concurrent polling loops. NetAccess is open
enough to allow the integration of other subsystems besides MadIO and
SysIO, for other paradigms such as shared memory on SMP nodes.
The two abstract interfaces in PadicoTM are VLink for distributed computing,
and Circuit for parallelism. The VLink interface has been implemented on top
of several drivers: MadIO, SysIO, parallel streams, AdOC (a dynamic adaptive
compression library) and loopback. The Circuit interface, which manages
communications over a fixed set of nodes, has been implemented on top of
MadIO, SysIO, loopback and VLink. Since a given instance of Circuit can
use different adapters for different links, it is possible to build a circuit
that mixes different kinds of communication.
The concept of (distributed) parallel object appears to be a key technology for
programming (distributed) numerical simulations. It combines the well-known
object-oriented model with a parallel execution model. Hence, data distributed
across a parallel object can be sent and/or received almost like a regular
piece of data, while taking advantage of (possibly) multiple communication
flows between the parallel sender and the parallel receiver.
The Paris Project-Team has been working on this topic for several years.
PaCO was the first attempt to extend CORBA with parallelism. PaCO++ is a
second attempt that supersedes PaCO on several points. It targets a
portable extension to CORBA, so that it can be added to any implementation of
CORBA. It advocates that the parallelism of an object is mainly an
implementation issue: it should not be visible to users except on some special
occasions. Hence, the Omg Idl is no longer modified.
In 2002, the development of PaCO++ was started to validate the
portable parallel CORBA object concept. A first implementation was produced
and validated with an EADS application that manipulates block-cyclic
distributed matrices of complex numbers. It was composed of an Idl-to-Idl
compiler and a (C++) runtime part. Both tools only require a compliant
CORBA implementation. Moreover, the PaCO++ runtime handles threads,
intra-parallel-object communications and data distributions through abstract
interfaces.
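The PaCO++ distribution interfaces themselves are not reproduced here; as background, the following self-contained sketch only shows the standard one-dimensional block-cyclic mapping (ScaLAPACK style) on which such an application relies:

    #include <cstdio>

    // Block-cyclic mapping along one dimension: global index g, block size, and
    // procs processes. owner() gives the process holding g, local() its local index.
    struct BlockCyclic {
        int block, procs;
        int owner(int g) const { return (g / block) % procs; }
        int local(int g) const {
            int blk = g / block;                      // global block number
            return (blk / procs) * block + g % block; // local block, then offset
        }
    };

    int main() {
        BlockCyclic map{/*block=*/4, /*procs=*/3};
        for (int g = 0; g < 12; ++g)
            std::printf("global %2d -> process %d, local index %d\n",
                        g, map.owner(g), map.local(g));
        return 0;
    }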
Continuing the development effort, we have worked on stabilizing the PaCO++
code so as to be able to deliver it to our partners in the HydroGrid
ACI GRID project and to the ReMaP/GRAAL research team (LIP, Lyon, France). We
plan to release a public version during the winter of 2003–2004.
The data distribution abstraction of PaCO++ was quite primitive. With
the support of the Inria ARC RedGrid, we are working on refining
the role of the PaCO++ runtime. An important motivation is to be able
to integrate a communication scheduling functionality that takes into account
the capabilities of the underlying networks. We succeeded in integrating the
scheduling library developed by the Algorille research team (Nancy, France)
into an experimental prototype. Its communication performance does not decrease
when the number of senders increases, contrary to other prototypes that face
network congestion when the number of senders becomes too large with respect
to the WAN bandwidth.
Another issue with parallel objects that we have started to study concerns exception management. The problem is to define the semantics of an exception raised during the invocation of a parallel operation. We have identified various scenarios (a single exception, several identical exceptions, several different exceptions, etc.) and the semantics that can be associated with them (standard exception mechanism, group of exceptions, priority of exceptions, etc.). This is still ongoing work.
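As an illustration of one possible semantics (hypothetical code, not the PaCO++ mechanism), the sketch below gathers the per-node outcomes of a parallel invocation and reduces them to a single exception using a priority policy:

    #include <cstdio>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Outcome reported by each node taking part in the parallel operation.
    struct NodeOutcome {
        bool failed;
        int priority;          // higher means more severe
        std::string message;
    };

    // Reduce all outcomes to a single exception: the most severe failure wins.
    void reduce_exceptions(const std::vector<NodeOutcome>& outcomes) {
        const NodeOutcome* worst = nullptr;
        for (const auto& o : outcomes)
            if (o.failed && (!worst || o.priority > worst->priority))
                worst = &o;
        if (worst)
            throw std::runtime_error(worst->message);
        // Other policies: throw a "group" exception carrying all failures, etc.
    }

    int main() {
        std::vector<NodeOutcome> out = {
            {false, 0, ""}, {true, 2, "node 3: out of memory"}, {true, 1, "node 5: bad index"}
        };
        try { reduce_exceptions(out); }
        catch (const std::exception& e) { std::puts(e.what()); }  // prints the most severe one
        return 0;
    }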
Future work can be divided into three parts. First, we will continue the
development of PaCO++ with the integration of support for
communication scheduling as well as a more open model for distributed data.
Second, the work on parallel exceptions will be finished and implemented in
PaCO++. Last, we would like to show that the concept of parallel object can
be applied to technologies other than CORBA, such as Web Services or
Peer-to-Peer systems.
Software component technology represents the next attempt to deal with the
complexity of software programming. The CORBA component model (Ccm) is the Omg
standard for component-based distributed programming. Like CORBA objects,
CORBA components suffer from limitations with respect to parallelism.
Our goal being to study the concept of parallel component, Ccm appears to be a
reasonable technological choice as it specifies the whole life cycle of a
component, including its packaging and its deployment.
We have proposed to define a parallel component as a collection of identical
sequential components that execute all or some parts of their services in
parallel. This definition allows us to apply the experience acquired with
PaCO++. The Omg Idl3, which is the abstract view of a component, does
not need to be modified. The parallelism, specified in an auxiliary file, can
be attached to the component implementation definition language (Cidl). Hence,
it should be possible to implement parallel components as an extension of
existing Ccm implementations.
To evaluate the pertinence of our parallel component definition, we have
implemented two prototypes of parallel CORBA components based on two
preliminary Ccm implementations: OpenCCM, a Java implementation, and MicoCCM,
a C++ implementation. For both prototypes, the definition of parallel
component proved pertinent, as no particular difficulty was encountered.
Moreover, both prototypes perform as expected. The latency measurements do not
show any significant overhead with respect to the latency of the plain Ccm
implementation. The bandwidth is correctly aggregated: it grows from 9.8 MB/s
for the C++ implementation (resp. 8.3 MB/s for the Java implementation)
for a one-node to one-node parallel component configuration, to 78.4 MB/s
(resp. 66.4 MB/s) for an 8-node to 8-node configuration. These numbers have
been obtained using a Fast-Ethernet network for CORBA communications.
When using PadicoTM to route CORBA communications through a Myrinet network,
the bandwidth for the C++ version scales from 43 MB/s (one-node to
one-node) to 280 MB/s (8-node to 8-node). These numbers are better than for a
Fast-Ethernet network. However, they are not very good with respect to the
performance of the Myrinet network. As we have shown with the PadicoTM
experiments, the problem lies in the data copies generated by Mico, which limit
the bandwidth. High-performance parallel components, like high-performance
parallel objects, require a high-performance CORBA implementation. OmniORB is
such an implementation for CORBA objects; an equivalent CORBA component
implementation is still missing.
In the future, we will focus on the relationship between parallel components
and Ccm containers. It seems that parallel versions of existing containers
should be defined in order to add parallelism support to container operations.
Another direction concerns the adaptability of parallel components to their environment. An example of adaptation is a modification of the number of components that belong to a parallel component. Adaptability appears as a new type of service brought by parallel containers.
Last, we aim to develop a fully operational prototype, called GridCCM, which
will extend Ccm with parallel components. It will be based on PaCO++ and on
tools such as a compiler for Idl3 transformation and new code generators.
The deployment of parallel-component-based applications is a critical issue in
the utilization of computational Grids. It consists in selecting a number of
nodes and in launching the application on them. We have started to work on an
accurate description of the resources. Previous work succeeds in describing
the compute nodes properly (CPU speed, memory size, operating system, etc.),
but generally fails to describe the network topology and its characteristics
in a simple, synthetic and complete way.
We have proposed a description model for grid networks. This model provides a
synthetic view of the network topology. In particular, it is able to
describe non-hierarchical topologies. It is also simple, notably thanks
to the possibility for a network group to inherit properties from its parent
network groups. However, this simplicity does not hinder the description of
complex network topologies (asymmetric links, firewalls, non-IP
networks, non-hierarchical topologies). Finally, our description model aims to
be complete by specifying the necessary information about the software
available to access particular network technologies.
The proposed model has been successfully integrated into the MDS2 of Globus.
This integration mainly consisted in defining around 40 LDAP schema entries
that describe the nature of networks, lists of open or closed ports, average
latencies and bandwidths, network software information, etc.
We foresee two complementary directions for future work. On the one hand, we
have to specify the interactions between a Grid middleware such as OGSA and
the CORBA component model. On the other hand, we have to integrate our
parallel component model into planner tools such as Sekitei.
In the area of wireless computing, where resources are a key issue, many dynamic adaptation techniques have been developed: by observing the environment, codes can adapt their behavior to fit the resource constraints. An efficient way to let an application evolve according to its environment is to provide mechanisms for dynamic self-adaptation, changing the behavior depending on the currently available resources. Since Grid architectures are also known to be highly dynamic, using resources efficiently on such architectures is a challenging problem too: software must be able to react dynamically to changes in the underlying execution environment.
In order to help developers to create reactive software for the Grid, we are investigating a model for the adaptation of parallel components.
We have combined a dynamic adaptation framework with parallelism and
distribution, allowing its use for Grid programming. Our prototype is built
using the ACEEL adaptation engine, designed for wireless and mobile
environments. Our tool takes into account the parallelism that can reside in
applications.
We have defined a parallel self-adaptable component as a component composed of
several processes working together, which is able to change its behavior
according to changes in its environment. The structure of such a component
includes an adaptation policy, a set of available implementations,
called behaviors, and a set of reaction steps. Reaction steps
are the means by which the component adapts itself; examples are the
replacement of the active behavior, the tuning of some parameters, or the
redistribution of arrays. The platform we have built mainly provides two kinds
of objects: the decider and the coordinators. The decider is the
object that makes the decisions: it decides when (which events to watch) and
how (which reactions to execute) the component should adapt itself, according
to the adaptation policy. The coordinators execute the directives given by the
decider: they serve as intermediaries between the code of the component and the
platform. Their role is to synchronize the adaptation mechanism with the
functional code and to coordinate the execution of the reactions.
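The following sketch uses hypothetical interfaces (it does not reproduce the ACEEL API) to illustrate the division of roles between the decider and the coordinators:

    #include <cstdio>
    #include <string>
    #include <vector>

    // A reaction step is the unit of adaptation: behavior swap, parameter
    // tuning, array redistribution...
    struct Reaction {
        virtual ~Reaction() = default;
        virtual void apply() = 0;
    };

    // The coordinator synchronizes the adaptation with the functional code of
    // one process of the parallel component, then runs the reaction.
    struct Coordinator {
        void execute(Reaction& r) { /* reach a safe point, then */ r.apply(); }
    };

    // The decider watches events and, following the adaptation policy, tells
    // every coordinator which reaction to execute.
    struct Decider {
        void on_event(const std::string& event,
                      std::vector<Coordinator>& coords, Reaction& reaction) {
            std::printf("event '%s': triggering adaptation\n", event.c_str());
            for (auto& c : coords) c.execute(reaction);
        }
    };

    // Example reaction: switch the active behavior of the component.
    struct SwitchBehavior : Reaction {
        void apply() override { std::printf("switching to low-bandwidth behavior\n"); }
    };

    int main() {
        std::vector<Coordinator> coords(4);   // one per process of the component
        SwitchBehavior sw;
        Decider{}.on_event("bandwidth drop", coords, sw);
        return 0;
    }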
We plan to define more formally the properties that the component is required to satisfy to adapt itself. This includes the properties of global states where an adaptation can occur and the constraints on behavior replacement. Studying the relationship between fault tolerance systems that use checkpointing and adaptation in the context of Grid computing is an important perspective too. Finding shared properties between checkpoints and adaptation points would be of great help in establishing properties and constraints on adaptation point placement.
Our long term goal is to build a generic platform to develop parallel adaptable components for the Grid. This platform would include both the toolbox to build parallel self-adaptable components and their runtime environment. Such a platform should ease the building of efficient applications for Grid architectures.
Providing data to applications is a major problem in grid computing. The execution of an application on some site is possible only when the application data are present in the ``data space'' of this site; it is therefore necessary to move the data from the production sites to the execution sites. Moreover, in the high-performance simulation domain, the applications are themselves parallel programs and the grid sites are clusters of computation nodes. Each process of the parallel application needs only part of the input data and produces part of the results. Duplicating the input data from a central server and then gathering the results after the execution can be expensive.
The contribution of the Paris Project-Team to the e-Toile project is a
persistent data repository for the grid built on top of the Mome DSM. A Mome
daemon process is launched in the background on each node of the grid. When
the execution of an application starts on Mome-aware computation nodes, each
of its parallel processes connects to the local Mome daemon. The
data-repository interface provides entry points for the creation and the
localization of segments in the DSM (through a kind of directory), and for the
mapping of these segments into the local address space of the process. The
data repository is persistent: the segments retain their data after all
application processes have disconnected. The application processes can fail
safely (or be killed) without impacting the DSM. The system thus provides a
kind of uniform data space to grid applications.
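The following purely local mock (hypothetical names, not the actual Mome interface) illustrates the entry points of such a data-repository interface: creation of a segment, localization through a directory, and mapping into the address space of the calling process:

    #include <cstddef>
    #include <cstdlib>
    #include <map>
    #include <string>
    #include <utility>

    // Purely local mock of a segment-based data repository; the real system
    // distributes these operations across the Mome daemons.
    class DataRepository {
        std::map<std::string, std::pair<void*, std::size_t>> directory_;
    public:
        // create a named segment in the shared space ("creation" entry point)
        void create(const std::string& name, std::size_t size) {
            directory_[name] = { std::calloc(1, size), size };
        }
        // locate the segment in the directory and map it ("localization" + "mapping")
        void* map(const std::string& name) {
            auto it = directory_.find(name);
            return it == directory_.end() ? nullptr : it->second.first;
        }
    };

    int main() {
        DataRepository repo;
        repo.create("results", 1 << 20);                      // a producer allocates a segment
        double* view = static_cast<double*>(repo.map("results"));
        view[0] = 3.14;                                       // a consumer later maps the same segment
        return 0;
    }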
The Mome DSM behaves as a COMA (Cache-Only Memory Architecture) for page
management: a copy of a page is present on the nodes that recently used the
page, and a page fault is served directly by one of the nodes holding a valid
copy of the page. This strategy is well suited to the case where the
computation nodes are clusters: the data never transit through a centralized
server.
The current version of Mome considers a flat organization of the DSM nodes.
On a grid infrastructure, the performance of the communication system inside a
grid node (a cluster) is higher than between grid nodes, and the DSM should be
aware of this structure. A new hierarchical organization of the Mome DSM has
been defined and will be implemented in the near future. This new organization
will favor local communications: inter-cluster communications will be avoided
as long as page faults can be served locally. This hierarchical structure will
also allow the exploitation of the DSM on a large number of nodes (hundreds of
DSM nodes).
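The following sketch (hypothetical data structures, not Mome code) captures the intended hierarchy-aware policy: a page fault is served by a node of the local cluster whenever one holds a valid copy, and remote clusters are contacted only as a fallback:

    #include <cstdio>
    #include <optional>
    #include <set>

    // Per-page directory entry distinguishing copies in the local cluster from
    // copies held in remote clusters.
    struct PageDirectory {
        std::set<int> copies_in_local_cluster;
        std::set<int> copies_in_remote_clusters;

        std::optional<int> pick_server() const {
            if (!copies_in_local_cluster.empty())
                return *copies_in_local_cluster.begin();   // favor local communication
            if (!copies_in_remote_clusters.empty())
                return *copies_in_remote_clusters.begin(); // inter-cluster fallback
            return std::nullopt;                           // no valid copy anywhere
        }
    };

    int main() {
        PageDirectory dir{{/*local*/ 2}, {/*remote*/ 7, 9}};
        if (auto server = dir.pick_server())
            std::printf("page fault served by node %d\n", *server);
        return 0;
    }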
Mome currently runs in user space; its implementation requires no kernel
modifications, but the applications must use specific library calls in order
to exploit the Mome data space. In the future, we plan to interface the Mome
daemons with the Linux kernel through a kernel module. The DSM space will then
become accessible through a classical file system interface, without any
modification of the applications.
With JuxMem, we propose the concept of a data sharing service for grid
computing, as a compromise between two rather different kinds of data sharing
systems: (1) DSM systems, which propose consistency models and protocols
for the efficient, transparent management of mutable data on static,
small-scale configurations (tens of nodes); (2) P2P systems, which
have proven adequate for the management of immutable data on
highly dynamic, large-scale configurations (millions of nodes).
These two classes of systems have been designed and studied in very different contexts. In DSM systems, the nodes are generally under the control of a single administration, and the resources are trusted. In contrast, P2P systems aggregate resources located at the edge of the Internet, with no trust guarantee, and loose control. Moreover these numerous resources are essentially heterogeneous in terms of processors, operating systems and network links, as opposed to DSM systems, where nodes are generally homogeneous. Finally, DSM systems are typically used to support complex numerical simulation applications, where data are accessed in parallel by multiple nodes. In contrast, P2P systems generally serve as a support for storing and sharing immutable files.
Our data sharing service targets physical architectures with features intermediate between DSM and P2P systems. We address scales of the order of thousands of nodes, organized as a federation of clusters, say tens of hundred-node clusters. At a global level, the resources are thus rather heterogeneous, while they can probably be considered as homogeneous within the individual clusters. The control degree and the trust degree are also intermediate, since the clusters may belong to different administrations, which set up agreements on the sharing protocol. Finally, we target numerical applications like heavy simulations, made by coupling individual codes. These simulations process large amounts of data, with significant requirements in terms of data storage and sharing.
The main contribution of such a service is to decouple data management from
grid computation, by providing location transparency as well as
data persistence in a dynamic environment.
In order to tackle the issues described above, we have defined an architecture
proposal for a data sharing service. This architecture mirrors a federation of
distributed clusters and is therefore hierarchical; it is illustrated
through a software platform called JuxMem (Juxtaposed Memory). The
architecture consists of a set of node groups (called cluster groups), each
of which generally corresponds to a cluster at the physical level. All the
groups are contained in a wider group which includes all the peers running the
service (the juxmem group). Each cluster group consists of a set of nodes
which provide memory for data storage (called providers). In each cluster
group, a node manages the memory made available by the providers of the group
(the cluster manager). Any node (including providers and cluster managers)
can use the service to allocate, read or write data, as a client. All
providers which host copies of the same data block make up a data group, to
which an ID is associated. To read or write a data block, clients only need to
specify this ID: the platform transparently locates the corresponding data
block. Consistency of replicated blocks is also handled transparently
(according to the sequential consistency model in the current version). In
order to tolerate the volatility of peers, the number of copies of each data
block is dynamically monitored and new copies are created when necessary, so
as to maintain a given redundancy degree. Cluster manager roles are also
replicated, to enhance cluster availability.
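As an illustration of this monitoring rule (hypothetical code, not the JuxMem implementation), the sketch below drops copies held by departed providers and recreates replicas on live providers until the redundancy degree is reached again:

    #include <cstdio>
    #include <iterator>
    #include <set>

    // State of one replicated data block: providers holding a copy, and the
    // redundancy degree requested for this block.
    struct DataBlock {
        std::set<int> providers;
        int redundancy;
    };

    void maintain_redundancy(DataBlock& block, const std::set<int>& alive) {
        // drop copies hosted on providers that have left the network
        for (auto it = block.providers.begin(); it != block.providers.end(); )
            it = alive.count(*it) ? std::next(it) : block.providers.erase(it);
        // recreate copies until the redundancy degree is reached again
        for (int p : alive) {
            if (static_cast<int>(block.providers.size()) >= block.redundancy) break;
            if (block.providers.insert(p).second)
                std::printf("replicating block on provider %d\n", p);
        }
    }

    int main() {
        DataBlock block{{1, 4}, 3};                 // two copies left, degree 3 requested
        maintain_redundancy(block, {1, 2, 3, 5});   // provider 4 has left the network
        return 0;
    }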
As a proof of concept, we have built a software prototype using the
JXTA generic peer-to-peer framework.
We are considering unconventional approaches for Grid programming and, more generally, for the programming of distributed applications.
It is well known that the task of programming is very difficult in general and even harder when the environment is distributed. As usual, the best way to proceed is by separation of concerns. Programs are first expressed in a model independent of any architecture, and then are refined taking into account the properties of the (distributed) environment. Several properties have to be taken into account, for example correctness, coordination/cooperation, mobility, load balancing, migration, efficiency, security, robustness, time, reliability, availability, computing/communication ratio, etc.
The models that we investigate are based on multisets. These models are often presented through metaphors, which make understanding easier and may provide new sources of inspiration. One well-known metaphor is the chemical one, but other metaphors can also be considered, such as biology (cells or DNA), animal societies (ant or bee colonies), etc.
Our present work relies on the chemical reaction paradigm and more precisely on
the Gamma model of programming. Our recent contributions, carried out in close
cooperation with Pascal Fradet, now at Inria Rhône-Alpes (Project-Team
POP ART), include the extension of Gamma to higher-order and the
generalization of multiplicity.
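As a reminder of the underlying paradigm, the following minimal sketch applies a Gamma-style reaction, (x, y) -> max(x, y), to a multiset until it is inert; in Gamma the choice of the reacting elements is arbitrary:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Chemical-reaction style computation: two elements react and are replaced
    // by their maximum, until a single (inert) element remains.
    int gamma_max(std::vector<int> multiset) {
        while (multiset.size() > 1) {
            // pick any two elements (the order of reactions is irrelevant)
            int x = multiset.back(); multiset.pop_back();
            int y = multiset.back(); multiset.pop_back();
            multiset.push_back(std::max(x, y));   // the reaction product
        }
        return multiset.front();
    }

    int main() {
        std::printf("%d\n", gamma_max({3, 7, 1, 9, 4}));   // prints 9
        return 0;
    }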
The extension of the basic Gamma model to a higher-order Gamma makes it
possible to consider a Gamma program as a member of a multiset, thus eligible
for reactions like any other element of the multiset. Such a facility can be
used to express properties such as code mobility. This model was presented at
the Workshop on Membrane Computing 2003, and a full presentation of this work
will be available as an Inria Research Report.
We are also investigating the generalization of the multiplicity of multisets.
Apart from completing the above research activities, our perspectives concern coordination in Grid applications and the definition of a chemical object model for Grid programming.
The VTHD Project started in March 2002 and runs until March 2004. The partners
are France Télécom Recherche & Développement (FT R&D), Inria, École Nationale
Supérieure des Télécommunications (ENST), École Nationale Supérieure des
Télécommunications de Bretagne, IMAG and the EURECOM institute. It is funded
by Rnrt (Platform program).
The Paris Project-Team is involved in the
VTHD++ Metacomputing Sub-Project. We study code coupling and
high-bandwidth data transfer between distant clusters. We have demonstrated
a sustained transfer rate of 1.9 Gb/s between a PC cluster located in
Rennes and another cluster located in Sophia-Antipolis (1000 km apart)
within a coupled numerical simulation.
The e-Toile Project started in December 2001 and was scheduled to end in
December 2003 (with an extension until April 2005). The partners are CEA,
CNRS, CS (Communication & Systems), Edf, ENS Lyon (LIP), Université de
Versailles Saint-Quentin (PRISM), Inria (Irisa, ID-IMAG, RESO), Sun France and
IBCP Lyon (since May 2002). It is funded by Rntl (Platform program).
The contribution of the Paris Project-Team to
the e-Toile Project focuses on the development of a Distributed
Shared Memory (DSM) environment for the implementation of a persistent
and distributed data repository for the Grid.
The CASPer Project aims at defining a Web-based
computing portal to use distributed computing resources. It runs from
October 2002 to September 2004. The partners are EADS CCR, Alcatel Space
Industries, IDEAMECH and Université de Paris Sud (LRI). It is funded by Rntl.
The Paris Project-Team defines the overall
architecture and implements an OGSA-based system for the core services of
CASPer.
The participation of the Paris Project-Team in the
Software-bus project aims at the evaluation of cluster technology and
cluster software for Internet routers. The project ran from March 2001 to
May 2003. The partners are Alcatel, Inria (Irisa) and ENS Lyon (LIP). It is
funded by Alcatel.
The Paris Project-Team
has developed a parallel multi-criteria routing algorithm and implemented
it on a cluster of PCs using the Mome DSM. Cluster technology allows
incremental adaptation of the computation power (nodes can be added) and
hardware redundancy in case of failure. Our implementation of the routing
algorithm exploits the checkpointing capability of Mome in case of the
failure of some node: the routing algorithm is automatically restarted from
the last checkpoint on the remaining nodes. New nodes can also be added
during restart. The Mome-based software demonstrator was delivered to
Alcatel at the end of the collaboration in May 2003.
The collaboration with Edf R&D aims at designing and
implementing an environment and tools for PC cluster management and use in
the area of high-performance computing. It ran from December 1st, 2000 to
November 30th, 2003. The partners are Edf R&D and Inria (Irisa, RESO). It is
funded by Edf R&D, including a PhD CIFRE Grant (Geoffroy Vallée).
The work carried out by the Paris Project-Team
relates to the design and implementation of the Kerrighed Single System Image
(SSI) operating system for high-performance computing on clusters. In the
framework of the Edf project, the Kerrighed configurable global scheduler has
been designed and implemented, as well as efficient global process
management mechanisms to replicate, migrate and checkpoint processes. A
development framework facilitating the implementation of dynamic scheduling
policies in Kerrighed has also been developed. Experiments with applications
provided by Edf R&D (HRM1D, Aster, Cyrano3, Cathare) have been conducted.
The COCA contract comprises two parts. The first one aims
at designing, evaluating and optimizing a prototype high-performance
computing infrastructure well suited for scientific numerical simulation.
The second one relates to the problem of the reusability of numerical
models. The Paris project-team contributes to the first part of the COCA
contract. The contract runs from March 1, 2003 to July 31, 2005. The partners
are Dga, CGEY and ONERA-CERT; it is funded by Dga.
The high-performance computing infrastructure
considered in the COCA contract is a federation of medium-size clusters,
each cluster running a Single System Image (SSI) operating system. The work
carried out by the Paris Project-Team relates to the design and
implementation of the Kerrighed SSI cluster operating system. Four successive
releases of Kerrighed will be delivered as part of the COCA contract with
an increasing set of functionalities: (1) Global memory management (V0.70);
(2) Global management of memory, processes, data streams and files (V1.0);
(3) Checkpointing mechanisms for parallel applications (V.1.10, based on
V1.0); and (4) Full-fledged SSI, highly available system (V2.0, based on
V.1.10). Moreover, the Paris Project-Team will study extensions to
the Kerrighed operating system to make it a Grid-aware operating system
for cluster federations.
The Paris Project-Team received a 50,000 Euro grant from
the Brittany Regional Council for its participation in the GRID 5000
Platform. The Scientific Council of University Rennes 1 also supported this initiative
by allocating 8,000 Euros through the local BQR Program.
The Brittany Regional Council provides half of the financial support for the PhD theses of Mathieu Jan (starting on October 1, 2003, for 3 years) and André Ribes (starting on October 1, 2001, for 3 years). This support amounts to a total of 28,000 Euros/year.
The Paris Project-Team is deeply involved in national initiatives related to
the Grid. An initiative was launched by the Ministry of Research through
the ACI program (Action Concertée Incitative). The ACI GRID (for
Globalisation des Ressources Informatiques et des Données) aims at
fostering French research activities in the area of Grid computing by providing
financial support to the best research groups. The ACI GRID initiative was
launched in 2001 and issued three calls for proposals (one every year). The
Paris Project-Team submitted proposals for each of them. The following
paragraphs present an overview of the projects funded by the ACI GRID in which
the project-team is involved.
The goal of this project is to promote a programming model for computing Grids.
This model should combine both parallel programming and distributed programming
models. It is based on the concept of distributed objects and software
components for distributed programming. This project also aims at designing and
experimenting with a high-performance communication software framework, enabling
both efficient communication between objects or components, and parallel
programming. This 2-year project, coordinated by the Paris Project-Team, ends
November 2003. The other partners are the Runtime Project-Team in Bordeaux, the
Jacquard Project-Team in Lille, and the Oasis Project-Team in Sophia-Antipolis.
The HydroGrid project is a 3-year multidisciplinary project, started in
September 2002. It aims at modeling and simulating fluid and solute transport
in subsurface geological media using a multiphysics approach. Such multiphysics
numerical simulations involve codes written in different languages and using
different communication libraries (FORTRAN, Mpi, OpenMP, etc.), to be run on a
common computational Grid. Therefore, the project relies on the results of the
ACI GRID RMI project. A strong point of the HydroGrid project is to group
together teams with different areas of expertise (from applications, scientific
computing and computer science). The partners are the Paris, Aladin and Estime
Project-Teams at Irisa, the Hydrodynamique et Transferts en Milieux
Poreux Team (IMFS Strasbourg) and the Transferts physiques et
chimiques Team (Géosciences Rennes).
This project gathers the national research communities interested in large-scale data management, including those involved in grid computing, operating systems, distributed systems and databases. The project aims at understanding the common research issues and at defining a common terminology. An important goal of the project is to encourage the emergence of software prototypes resulting from collaborative efforts between these communities.
The setup of the GDS Project of the ACI MD, coordinated by the Paris project-team, in collaboration with the ReMaP/GRAAL and REGAL Research Groups,
is a result of the discussions that took place within the framework provided by
DataGRAAL. The DataGRAAL community is the initiator of a project for a Spring
School on large-scale data management (DRUIDE 2004), which will take
place in May 2004 at Port-aux-Rocs (Le Croisic, Brittany). Gabriel Antoniu
chairs the Program Committee and the Organizing Committee of this school. The
project started in November 2002 and will end in November 2004. Gabriel Antoniu
is local correspondent of DataGRAAL for the Paris Project-Team.
Jean-Louis Pazat heads the GRID2 project. As many as 10
laboratories from various parts of France are involved in this 150,000-Euro
project granted by the Ministry of Research for 3 years. Christian Pérez is in
charge of the Run-Time System and Middleware Working Group. The
objective of this project is to federate the Computing GRID research community
by organizing meetings between researchers, providing training for young
researchers and disseminating information.
GRID2 is divided into the following working groups: (1) Software architecture
and languages; (2) Run-time systems and middleware; (3) Algorithms and models;
(4) Algorithms and high-performance applications. This project organized a
Winter School on Grid Computing in Aussois in December 2002 and two
workshops during the RenPar Conference in 2002 and 2003. A number of
Hands-On Days have taken place this year, enabling researchers to gain
practical experience of topics such as JXTA, CORBA, numerical
computing, etc.
Alta is a 2-year joint project funded by the ACI GRID of the French Ministry of
Research, in cooperation with the Inria Cooperative Research Initiatives. The
Paris Project-Team coordinates the project. It also involves the Runtime
Project-Team in Bordeaux and the Distribution and Parallelism Team in
Lille. It aims at studying the impact of loss-tolerant control in the context
of asynchronous iterative algorithms. An objective is to define and to
implement a dedicated API.
This project supported the Paris Project-Team in preparing a Network of
Excellence proposal in the area of Grid and Peer-to-Peer Computing. Four
European meetings were organized in 2003 and supported by this program:
Rennes, April 15-16; Paris, July 15-16; Klagenfurt, August 27; Venice,
September 1. These meetings led to the writing of the Core Grid proposal,
which was submitted to the 2nd Call (October 15) of the IST Program (Framework
Programme 6, European Commission) under Strategic Objective 2.3.2.8,
Grid-based systems for Complex Problems Solving.
GRID5000 is a nation-wide initiative to build a research platform (ca. 5000
processors) for Grid computing. This large-scale distributed platform will
enable experiments on operating systems, middleware and communication
libraries by the computer science research community in France. In 2003, the
Paris Project-Team submitted a proposal for building a GRID5000 node in
Rennes. The project has been selected by the French Ministry of Research
(ACI GRID) to be one of the 7 initial nodes of the Grid5000 Computing
Infrastructure and received a three-year grant of 200 kEuros. The integration
of the first 66 processor boards (dual Xeon) into the Paris cluster (50 PC and
Xserve dual-processor nodes) was initiated in November 2003.
The Paris project-team is involved in the ACI MD (for Masses de
Données), which aims at fostering research activities in the area of
large-scale data management, including Grid computing. The first call for
proposals was issued in 2003. The following paragraphs give a short overview of
the project-team's involvement in this initiative.
The GDS Project of the ACI MD gathers 3 research teams: Paris (Irisa), REGAL
(LIP6) and ReMaP/GRAAL (LIP). The main goal of this project is to specify,
design, implement and evaluate a data sharing service for mutable data and
integrate it into the DIET ASP environment developed by ReMaP/GRAAL. This
service will be built using the generic JuxMem platform for peer-to-peer data
management (currently under development within the Paris project-team).
JuxMem will serve to implement and compare
multiple replication and data consistency strategies defined together by the
Paris and REGAL research groups. The project started in September 2003 and
will end in September 2006. It is coordinated by Gabriel Antoniu (Paris).
The main objective of the ACI MD MDP2P project is to provide high-level
services for managing text and multimedia data in large-scale P2P
systems. The Paris Project-Team contributes to the development of
DSM-based (Mome and Kerrighed) data management techniques on clusters of
clusters for large-scale multimedia indexing.
The Data Grid Explorer (GdX) Project aims to implement a large-scale
emulation tool for the communities of a) distributed operating systems, b)
networks, and c) the users of Grid or P2P systems. This large-scale emulator
consists of a database of experimental conditions, a large cluster of 1000 PCs,
and tools to control and analyze experiments. The project includes studies
concerning the instrument itself, and others that make use of the instrument.
The GDS project of the ACI MD, coordinated by Paris, is partner of GdX, as a
user project. The project started in September 2003 and will end in
September 2006. Gabriel Antoniu is local correspondent of GdX for the Paris Project-Team.
This 2-year project is funded by the Inria Cooperative Research Initiative
(ARC); its partners are the ReMaP, Paris, Algorille and Scalapplix
Project-Teams. Its objective is to study the issues related to data
redistribution in a Grid environment, to develop data redistribution libraries
and to apply the results in the environments developed by the partners (DIET,
PaCO++, GridCCM and EPSN).
This one-year project, called Étude préparatoire pour une plate-forme de
grille expérimentale (a preparatory study for an experimental grid platform),
aims to identify scientific and technical issues and to propose solutions with
a view to building an experimental Grid platform gathering nodes
geographically distributed across France.
This is a one-year project, called Méthodologies de programmation des
grilles (grid programming methodologies), that aims at identifying future
research directions related to computational Grid programming. It encompasses
partners involved in applications, algorithms, runtime/middleware systems and
network protocols.
The POP Project (IST Project 2000-29245) targets performance portability of
OpenMP applications. It is a 3-year project which started in December 2001.
The partners are the European Center of Parallelism of Barcelona
(CEPBA-UPC, Barcelona, Spain), the Istituto di Cibernetica (IC-CNR,
Naples, Italy), the High Performance Information System Laboratory
(LHPCA-UP, Patras, Greece) and Inria.
The POP Project was motivated by the adoption by industry of the OpenMP
language as a standard for shared memory programming. However, this standard is
restricted to hardware shared memory machines. The POP project objective is to
build an environment that, starting from an OpenMP application, is able to
generate efficient code for different kinds of machine architectures. In
addition to hardware shared memory machines, the targeted architectures include
distributed memory machines and multithreaded machines.
In particular, the project focuses on three main goals. The first goal deals with extending the expressiveness of OpenMP to exploit parallelism in irregular task graphs, improving work-distribution schemes among groups of processors so as to enforce data locality, and adding support for inspector/executor techniques. The second goal is to study the dynamic adaptability of the runtime, using self-analysis to modify the behavior of the application at runtime and to run the same binary file regardless of the underlying architecture, the input data and the dynamic variation of available resources. The third goal concerns architectural modifications to efficiently execute OpenMP applications on distributed memory machines or multithreaded machines.
The POP Project builds on the results of the Nanos European project, in which
an OpenMP compilation and execution environment was developed for shared
memory machines such as the Origin 2000.
The Paris Project-Team focuses on the architectural modifications of existing
software DSMs required to provide adequate support for the efficient execution
of OpenMP applications on clusters. The set of critical SDSM features we
identified during 2002 is being applied to the Mome SDSM, used in
conjunction with PadicoTM. A first complete prototype of the POP Runtime is
available on top of Mome.
Thierry Priol is the Scientific Coordinator of a Network of Excellence
proposal, called Core Grid, in the area of Grid and Peer-to-Peer (P2P). This
proposal was submitted on October 15, 2003. As many as 42 partners, mostly from
European countries, are involved. The Core Grid Network of Excellence aims at
building a European-wide research laboratory that will achieve scientific and
technological excellence in the domain of large-scale distributed, Grid, and
Peer-to-Peer computing. It is the primary objective of the Core Grid Network of
Excellence to build solid foundations for Grid and Peer-to-Peer computing both
on a methodological basis and a technological basis. This will be achieved by
structuring research in the area, leading to integrated research among experts
from the relevant fields, and more specifically distributed systems and
middleware, programming models, knowledge discovery, intelligent tools, and
environments.
The research program is structured around five complementary research areas, selected on the basis of their strategic importance, their research challenges and the European expertise available to develop next-generation Grids: knowledge and data management, programming models, system architecture, resource management and scheduling, and problem solving environments, tools and Grid systems.
The Specific Support Action (SSA) ERA pilot on a co-ordinated
Europe-wide initiative in Grid Research addresses the Strategic Objective
2.3.2.8 Grid-based Systems for solving complex problems and the
Strategic Objective 2.3.6 General Accompanying actions as described in
the IST Work Programme 2003-04.
Several Grid Research initiatives are currently on-going or planned at national and European Community level. These initiatives propose the development of a rich set of advanced technologies, methodologies and applications; however, enhanced co-ordination among the funding bodies is required to achieve critical mass, avoid duplication and reduce fragmentation in order to solve the challenges ahead. If Europe wishes to compete with the leading global players, it would be sensible to better coordinate its various, fragmented efforts toward achieving a critical mass and a more visible impact at the international level.
The goal of the GridCoord SSA proposal is precisely to achieve such a
coordinated approach. It will require: (1) co-ordination among the funding
authorities; (2) collaboration among the individual researchers; (3) a
visionary research agenda. This proposal is thus tightly connected to the
Core Grid Network of Excellence proposal mentioned above, led by Thierry Priol
at the European level.
The GridCoord SSA proposal is led by Marco Vanneschi, University of
Pisa, Italy. It includes 10 partners from 6 European countries. The French
partners are Inria (Leader: Michel Cosnard) and Ens Cachan (Leader: Luc
Bougé); the Paris Project-Team participates in the GridCoord proposal through
these two partners.
A proposal for a bi-lateral research
collaboration with the distributed systems group of the University of Ulm has
been submitted to the 2004 Procope Program. During the collaboration,
we will study, design and implement new checkpointing strategies for real
applications running in different DSM environments. Three DSM systems will
be considered: the Plurix system, developed at the University of Ulm, which
is based on a DSM implemented at the lowest possible level; the Kerrighed
system, which implements a kernel-level DSM in Linux; and the Mome DSM,
implemented in user space on top of Linux. The two latter systems are
developed in the Paris Project-Team.
This funding has supported
our collaboration with the Parallel Computing group of the CS Department of
the University of New Hampshire (Phil Hatcher and Bob Russell, Professors at
UNH). The collaboration has focused on the Hyperion project, a distributed
compilation and execution environment for clusters. Gabriel Antoniu, Mathieu
Jan and Sebastien Monnet visited the UNH team for 10 days in
October 2003, in order to discuss further collaboration topics related
to large-scale data management. It has been decided that David Noblet (one of
Phil Hatcher's undergraduate students) will visit Irisa in May 2004 for a
2-month internship within the Paris Project-Team. He will be supervised by
Luc Bougé and Gabriel Antoniu. The internship will be funded by the
International Research Opportunities Program (IROP) of UNH.
Pascal Gallard spent 3 months (May–August
2003) at Rutgers University in the Discolab Research Team led by
Liviu Iftode. This internship was funded by Discolab. Pascal
Gallard worked on the use of R-DMA technology for the implementation
of a highly-available cluster architecture. He contributed to the design of
transparent migration mechanisms for TCP/IP streams that do not require any
modification of the TCP/IP protocol or of the client. A prototype has been
developed for a PC cluster based on the Myrinet XP networking technology.
Liviu Iftode visited Paris in November 2003 to discuss further
collaboration on the design and implementation of a novel, highly-available
cluster architecture based on the concept of remote healing.
Ramamurthy Badrinath,
assistant professor at IIT Kharagpur, was funded by the MAE and
Inria as an invited researcher in the Paris Project-Team from May 2002 to
May 2003. He worked with Christine Morin and Geoffroy Vallée on the
design of basic fault-tolerance building blocks supporting various backward
error recovery strategies for different kinds of parallel applications
executed on a cluster. The proposed mechanisms have been implemented as part
of the Kerrighed single system image operating system and evaluated with
multithreaded applications.
The Paris Project-Team, together with the
ReMaP/GRAAL Project-Team located at Inria Rhône-Alpes, has been selected by
the STAR program of the French Embassy in Seoul to conduct a 2-year
cooperation with the Department of Aerospace Engineering (Prof. Seung Jo Kim)
of Seoul National University. This cooperation, which started in June 2003,
aims at experimenting with a Grid infrastructure, built from the computing
equipment of the two participants, using aerospace applications (SNU) and
middleware and programming tools designed by Inria. Four researchers from
the two Inria project-teams visited SNU in December 2003 to start the
cooperation and attended a French-Korean workshop.
The Paris Project-Team hosted the 3-month
internship (May–July 2003) of Maka Hitashyam from IIT Kharagpur, in the
framework of the Inria International Internship Program. Maka
Hitashyam worked on the implementation of the Kerrighed Monitor, a tool to
help programmers tune parallel applications executed on a cluster running the
Kerrighed Single System Image operating system.
Michaël Schöttner, Junior Professor at the
University of Ulm, Germany, was invited by the Paris Project-Team in
January 2003 and presented the Plurix Project at the Irisa Network and System Seminar.
Liviu Iftode, Associate Professor at
Rutgers University and Head of the Discolab Laboratory, visited us
in November 2003 to discuss further cooperation on self-healing
cluster operating systems and to work on a joint proposal on this subject
to be submitted to the NSF/Inria bi-lateral program.
Stephen Scott is a Senior
Researcher at Oak Ridge National Laboratory. He was invited by the Paris
project-team in October 2003. The goal of his visit was to discuss a
collaboration between Inria, Edf R&D and ORNL on the integration of
Kerrighed into OSCAR: Kerrighed would be integrated as a new package,
SSI-OSCAR. Stephen Scott also presented the activities of his research group
and the OSCAR distribution at the Irisa Network and System Seminar.
Paulo Afonso Lopes is a PhD
student under the supervision of Prof. Dr. José C. Cunha, in the
Distributed System Group of the Computer Science Department of the
Universidade Nova de Lisboa, Portugal. He visited the Paris Project-Team for 3 days in November 2003. The goal of his visit was to
discuss high-performance I/O in clusters with Gaël Utard and the people
involved in the design and implementation of Kerrighed. Paulo
Lopes intends to use the Kerrighed software in the high-performance I/O
system prototype he will develop during his thesis.
L. Bougé chairs the CNRS Research Co-operative Federation
(Groupement de recherche, GDR) on Architecture, Networks and Systems,
and Parallelism (ARP, GDR d'animation), run by the CNRS STIC
Department. It was renewed for another 4-year term in 2002. Virtually
all French academic researchers active in these areas are registered in
the GDR; as of today, this amounts to ca. 900 persons.
J.-L. Pazat is the coordinator of the G2C (Grids and Clusters for
Computing) Working Group of GDR ARP. This working group aims at
information dissemination and contacts between researchers in the area of
Cluster and Grid computing.
L. Bougé chairs the (informal!) Co-ordination Committee of the 6 GDR of the CNRS STIC Department. He has been serving since Year 2001.
L. Bougé and Th. Priol are members of the
Scientific Committee of the ACI GRID, the national French
initiative in the area of Grid computing. This initiative was launched in
April 2001. Since then, several calls for proposals (one per year) have been
issued, and their evaluation has been carried out by the Scientific Committee.
L. Bougé serves as the Vice-Chair of the
Steering Committee of the Euro-Par annual conference series on
parallel computing (ca. 250 attendees).
J.-L. Pazat serves as the Chair of the
Steering Committee of the RenPar conference series (Rencontres francophones du
parallélisme).
Ch. Morin co-organized, with Jean-Yves Berthou from
Edf R&D, a workshop on Operating systems, tools and methods for
high-performance computing on Linux clusters. It was held in Clamart in
October 2003, in the framework of the Edf R&D conference series
Printemps de la recherche. About 180 persons attended the workshop.
L. Bougé is a member of the Steering
Committee of the IPDPS (International Parallel and Distributed
Processing Symposium). Together with colleagues from Inria Sophia, he
co-organized the 2003 edition in Nice in April 2003, supported by the Inria
Research Unit of Sophia-Antipolis, the I3S CNRS Research Unit and the
University of Nice. More than 500 researchers attended.
Several members of the Paris Project-Team are
involved in the Organization Committee of the 18th ACM International
Conference on Supercomputing (ICS '04). It will be held in Saint-Malo,
France, in June 2004, co-supported by Irisa/Inria and ENS Lyon. Ch. Morin
serves as Local Arrangements Co-Chair, L. Bougé as Finance Chair, Ch. Pérez
as Publication Chair, and Th. Priol as Workshop Chair. About 150
participants are expected.
J.-L. Pazat heads the GRID2 Project of the
ACI GRID. This project is devoted to the
dissemination and co-ordination of the academic French research groups
interested in Grid computing.
Ch. Pérez coordinates the Communication and middleware
systems Working Group of the GRID2 project.
Ch. Pérez heads the ACI GRID RMI Project. It started in
December 2001, for 2 years.
Ch. Pérez heads the Alta Project, co-supported by
ACI GRID and Inria. Alta started in 2003, for 2 years.
G. Antoniu heads the GDS (Grid Data Service) Project
supported by ACI MD. GDS started in September 2003, for 3 years.
G. Antoniu is the local correspondent of the GdX (Data
Grid Explorer) Project supported by ACI MD. GdX started in September 2003,
for 3 years.
Th. Priol is the European Scientific Coordinator
of the Core Grid Proposal.
Ch. Pérez is the Inria Scientific Correspondent of
the European IST POP project, started in December 2001 for 3 years.
L. Bougé is a member of the Steering Committee of CNRS
Thematic Committee (RTP 8) High-Performance and Distributed Computing
led by Yves Robert, Lyon, and Brigitte Plateau, Grenoble.
Ch. Pérez is the local correspondent of the CNRS Specific
Action (AS 115) of RTP 8, Methodology of Grid programming, led by
Raymond Namyst, Bordeaux.
served in the Program Committees for the following conferences:
International Workshop on Distributed Shared Memory on Clusters
(DSM 2003), organized in conjunction with IEEE International Symposium on
Cluster Computing and the Grid (CCGrid '03), Tokyo, Japan, May 2003.
IEEE International Conference on Distributed Computing Systems
(ICDCS-23), Providence, RI, USA, May 2003.
6th IEEE International Symposium on Object-Oriented Real-Time
Distributed Computing (ISORC '03), Hakodate, Hokkaido, Japan, May 2003.
2nd International Conference on Web-Based Learning (ICWL 2003),
Melbourne, Australia, August 2003.
International Workshop on Distributed Shared Memory on Clusters
(DSM 2004), organized in conjunction with IEEE International Symposium on
Cluster Computing and the Grid (CCGrid '04), Chicago, IL, USA, May 2004.
18th ACM International Conference on Supercomputing (ICS '04),
Saint-Malo, France, June 2004.
served in the Program Committees of the following conferences:
ParCo 2003, Dresden, September 2003.
RenPar 15, La Colle sur Loup, October 2003.
was one of the two co-chairs of Topic Grid Computing and
Middleware Systems of the Euro-Par 2003 Conference that was held in
Klagenfurt, Austria, in August 2003. He will be the global chair of this
topic for Euro-Par 2004, to be held in Pisa, Italy, in August 2004.
Th. Priol is a member of the Editorial Board of the Parallel Computing
journal.
He served in the Program Committees of the following conferences:
11th Euromicro Conference on Parallel, Distributed and
Network-based Processing, Genova, February 2003.
Workshop on Innovative Solutions for Grid Computing (InnoGrid),
Melbourne, Australia, June 2003.
International Conference on Parallel Processing (ICPP),
Kaohsiung, Taiwan, October 2003.
Intl. Symposium on High-Performance Computing (ISHPC-V), Tokyo,
Japan, October 2003.
2nd European Across Grids Conference, Nicosia, Cyprus, January
2004.
IEEE International Symposium on Cluster Computing and the Grid
(CCGRID), Chicago, IL, USA, April 2004.
First International Workshop on Programming Paradigms for Grids
and Metacomputing Systems, Kraków, Poland, June 2004.
International Meeting on High Performance Computing for
Computational Science (VecPar), Valencia, Spain, June 2004.
Fifth EuroGraphics Workshop on Parallel Graphics and
Visualization, Grenoble, France, June 2004.
Third Workshop on Advanced Collaborative Environments, Seattle,
WA, USA, June 2004
belongs to the Editorial Advisory Board of the
Scientific Programming Journal, IOS Press.
He chairs the Program Committee of Topic 18 Peer-to-Peer Computing of
the Euro-Par 2003 Conference in Klagenfurt, Austria, August 2003.
He will chair the Program Committee of the International Conference on
High Performance Computing (HiPC 2004), to be held in Bangalore, India, in
December 2004.
He served in the Program Committees for the following conferences:
8th International Workshop on High-Level Parallel Programming
Models and Supportive Environments (HIPS 2003), held in conjunction
with the IPDPS 2003 Conference, Nice, April 2003.
15e Rencontres francophones du parallélisme (RenPar 15), La
Colle-sur-Loup, October 2003.
has been solicited by the Selection Committee
(Commission de spécialistes, CSE) for Computer Science, University Rennes 1, as
an external reviewer.
is a member of the Selection Committee (Commission de
spécialistes, CSE) for Computer Science at University Paris Sud (UPS,
Orsay) and University of South Brittany (UBS, Vannes).
has been one of the two co-chairs of an Expert Group
convened by the DG Information Society of the European Commission (EC) to
outline a vision for Grid research priorities over the coming 5 to 7 years.
He was a member of an International Co-operative Working Group on Grid
research infrastructures convened by the DG Information Society of the EC
for the creation of a framework for international cooperation in the area of
Grid computing.
He served as a reviewer for the following EC-funded projects: IST DAMIEN, EUROGRID and P2PEOPLE. He was also involved in the evaluation of IST proposals.
serves in the Evaluation Committee of the Rntl Program.
He serves in the Expert Group convened by the Ministry of Research
(MSTP 9) to review the application for the Doctoral Supervision and Research
Awards (Primes d'encadrement doctoral et de recherche, PEDR).
He serves in the Scientific Committees of ACI GRID and ACI MD Programs
of the Ministry of Research.
has served as an external expert for the following programs:
Rntl and ACI MD programs of Ministry of Research, Regional funding program
of the Franche-Comté Region Authority, Post-Doctoral Grant program of Inria.
is responsible for a graduate teaching module High
Performance Computing on Clusters and Grids of the Master Program,
University Rennes 1. Within this module, she gave lectures on distributed operating
systems for clusters.
She gave a lecture on clusters, taught in the final year of the
Network Architecture Track at the Institut National des
Télécommunications (INT) in Évry in October 2003.
She gave lectures on synchronization, I/O and file systems within the
Operating system module, taught for the students of the Software
Engineering Diploma, University Rennes 1.
gave lectures on Distributed Shared Memory within the
High Performance Computing on Clusters and Grids Module of the
Master Program, University Rennes 1.
leads the Master Program of the 5th year of Computer
Science at Insa of Rennes.
He is responsible for a teaching module on Parallel Processing for
engineers at Insa of Rennes. Within this module, he gave lectures on
parallel and distributed programming.
He is responsible for a graduate teaching module Objects and
components for distributed programming for 5th-year students of Insa of
Rennes. Within this module, he gave lectures on Enterprise Java Beans.
gave lectures to 5th-year students of Insa of Rennes on
CORBA and Ccm within the course Objects and components for
distributed programming.
He gave a lecture in the course Multidisciplinary object-oriented
programming of the École Doctorale of the Ens Cachan.
He has given a seminar entitled PadicoTM: a component based
infrastructure for Grid Computing at ENS Lyon, Magistère
d'Informatique et Modélisation (MIM).
He gave tutorials on Object-Oriented Middleware and Components for
the Grid at two conferences: Middleware 2003 (Rio de Janeiro,
Brazil, June 2003) and IPDPS 2003 (Nice, France, April 2003). The
tutorials were jointly presented with Denis Caromel (Oasis Project-Team,
Inria Sophia).
has taught the tutorials of the Operating Systems
module of the DESS CCI Master Program (IFSIC). He is teaching part
of the Operating Systems Module at IUP 2 MIAGE, IFSIC. He is
giving lectures on peer-to-peer systems within the High Performance
Computing on Clusters and Grids Module of the Master program, University Rennes 1,
and within the Distributed Systems Module taught for the final year
engineering students of Insa Rennes.
leads the Master Program in Computer Science at the Brittany
Extension of Ens Cachan (Magistère Informatique et Télécommunications, MIT
Rennes). This program is jointly supported with University Rennes 1. It was launched in
September 2002. Olivier Ridoux, Lande Project-Team, Irisa, co-supervises
the program for University Rennes 1.
Only events not mentioned elsewhere in this report are listed below.
G. Antoniu, Ch. Pérez and J.-L. Pazat organized two tutorial days to
introduce CORBA components and the JXTA peer-to-peer environment
(ACI GRID). About 30 participants attended.
G. Antoniu and L. Bougé were invited and gave a talk at the
CEA seminar organized in Saint-Malo on Large-scale data management in
June 2003. Title: Large-Scale Data Management: a Peer-to-Peer Approach
Based on the JXTA Environment.
G. Antoniu and L. Bougé were invited and gave talks at
the Dagstuhl Seminar on Hardware and Software Consistency Models:
Programmability and Performance in October 2003. Talks: Peer-to-Peer
Distributed Shared Memory? (G. Antoniu), Hierarchy-Aware Consistency
Protocols for Large-Scale DSM (L. Bougé).
G. Antoniu gave an invited talk for Prof.
Arvind's group at LCS (MIT, Boston) in October 2003. Title:
Peer-to-Peer Distributed Shared Memory?.
G. Antoniu gave an
invited talk at a seminar of the local ACM organization of the University of
New Hampshire in October 2003. Title: Peer-to-Peer: Getting Serious?.
G. Antoniu visited the JXTA Team of Sun
Microsystems in Santa Clara, CA, where he presented the current status of the
JuxMem Project, in November 2003. Title: JuxMem: a JXTA-Based Service
for Data Sharing on the Grid.
Th. Priol gave an invited talk entitled Programming
High Performance Applications using Components at the 10th European
PVM/MPI Users Group conference in Venice, Italy in September 2003.
Ch. Pérez gave an invited talk entitled
Programming the Grid with parallel software components at the Sparse
Days and Grid Computing meeting at Saint-Girons, organized by CERFACS and
ENSEEIHT-IRIT.
Ch. Pérez gave an invited talk entitled Parallel computing
and Clusters at the Conference on cluster computing organized by Apple at
the Apple Executive Briefing Center, Paris, France, December 2003.
Ch. Morin gave an invited talk at the
Computer Science Department of UCC, Cork, Ireland, in February 2003. Title:
Kerrighed: a Single System Image Operating System for High Performance
Computing on Clusters.
Ch. Morin was invited to participate in the
7th Workshop on Distributed Computing (SOS7) organized by Sandia Labs
in Durango, Colorado, and to present a talk entitled Design and
Implementation of a Single System Image Operating System for High
Performance Computing on Clusters in March 2003.
Ch. Morin gave a talk at the Linux
Clusters Workshop on Operating Systems, Tools and Methods for High
Performance Computing on Linux Clusters, organized at Edf R&D in Clamart
in October 2003. Title: Single System Image OS for
Clusters: Kerrighed Approach.
Ch. Morin gave an invited talk at the
Computer Science department of the University of Ulm, Germany, in November
2003. Title: Kerrighed: a Linux based Operating system to Ease
Cluster Programming.
Geoffroy Vallée was invited to give a talk
entitled Global process Management in Kerrighed Cluster Operating
System at ID-IMAG, Grenoble, in June 2003.
Geoffroy Vallée was invited by the RESO Inria Project-Team to give a talk entitled Global process Management in
Kerrighed Cluster Operating System at LIP, ENS Lyon, in June 2003.
is in charge of European Affairs within the Department
for European and International Relations (DREI) at Inria.
chairs the Computer Science and Telecommunication Department
(Département Informatique et Télécommunications, DIT) of the Brittany
Extension of Ens Cachan on the Ker Lann Campus in Bruz, a close suburb of
Rennes.
is a member of the Inria Evaluation Committee.
She chairs the local Irisa Computing Infrastructure User Committee
(Commission des utilisateurs des moyens informatiques, CUMI).
is a member of the Administrative Committee of Insa of
Rennes.
He served as a member of the Administrative Committee of the Computing
Resource Center of Insa of Rennes.
was Deputy Director of the Inria Evaluation Committee until
June 2003.
was a member of the 2003 Selection Committee for the permanent Junior
Researcher position (CR2) at Irisa.
He is a member of the Project-Team Committee of Irisa, representing the
Paris Project-Team.
He is a deputy-member of the Selection Committee (Commission de
spécialistes, CSE) for Computer Science at Ens Cachan.
is a member of the Irisa Local Committee for Temporary
Positions (Commission locale des postes d'accueil, CPLA).
She has been a member of the editorial board of Inedit, the Inria Newsletter, since September 2003.
She was appointed in September 2003 as an external member of the
Course Advisory Board of the Information Technology School of Deakin
University (Australia).
is a deputy-member of the Selection Committee
(Commission de spécialistes, CSE) for Computer Science at University Rennes 1,
since December 2001.