ScAlApplix is a joint project of INRIA Futurs, LaBRI (Laboratoire Bordelais de Recherche en Informatique – CNRS UMR 5800, University of Bordeaux 1 and ENSEIRB) and MAB (Laboratoire de Mathématiques Appliquées de Bordeaux – CNRS UMR 5466, Universities of Bordeaux 1 and Bordeaux 2). ScAlApplix was created on 1 November 2002 (http://www.labri.fr/scalapplix).
The purpose of the ScAlApplix project is to analyze and solve scientific computing problems arising from complex research and industrial applications and involving scaling. These applications are characterized by the fact that they require enormous computing power, on the order of tens or hundreds of teraflops, and that they handle huge amounts of data. Solving such problems requires a multidisciplinary approach, involving both applied mathematics and computer science. In applied mathematics, it is essentially the field of numerical schemes that is concerned. In computer science, the relevant fields are parallel computing and the design of high-performance codes to be executed on today's major computing platforms (parallel computers organized as large networks of SMP nodes, grid platforms).
Through this approach, ScAlApplix intends to contribute to every step of the chain that leads from the design of new high-performance, more robust and more precise numerical schemes to the optimized implementation of algorithms and codes for the simulation of physical (fluid mechanics; inert and reactive flows; multimaterial and multiphase flows), biological (molecular dynamics simulations) and environmental (host-parasite systems in population dynamics) phenomena that are by nature multiscale and multiphysics.
Another domain we are currently investigating is the development of distributed environments for coupling numerical codes and for interactively steering numerical simulations. Computational steering is an effort to make the typical simulation work-flow (modeling, computing, analyzing) more efficient by providing on-line visualization of, and interactive steering over, on-going computational processes. On-line visualization is very useful for monitoring and detecting possible errors in long-running applications, and interactive steering allows the researcher to alter simulation parameters on the fly and immediately receive feedback on their effects. Scientists thus gain better insight into the cause-and-effect relationships within the simulation and can better grasp the complexity of the underlying models.
A large number of industrial problems can be translated into fluid mechanics ones, possibly coupled with one or more physical models. An example is provided by aeroelastic problems, which are studied in detail by other INRIA teams. Another example is given by flows in pipelines, where the fluid (a mixture of air, water and gas) has physical properties that are not well understood. One may also consider problems in aeroacoustics, which are becoming more and more important in everyday life. On some occasions, specific numerical tools are needed because the fluids have exotic equations of state, or because the amount of computation becomes huge, as for unsteady flows.
There are, of course, many commercial codes that can handle some of these examples, but the quality of the solutions is far from optimal. Moreover, the numerical tools in these codes are often not the most recent ones. Last, the know-how embodied in these codes is not available. An example is the noise generated by vortices crossing a shock wave: to the best of our knowledge, it is out of reach of the most recent technologies, because the computational resources that such simulations would require are tremendous. In the same spirit, the simulation of a compressible mixing layer in a complex geometry is also out of reach, because very different temporal and physical scales need to be captured.
In order to reach this goal of efficient simulation of complex physical problems, we are working on some fundamental aspects of the numerical analysis of nonlinear hyperbolic problems. Our goal is to develop schemes that can adapt to modern computer architectures. More precisely, we are working on a class of numerical schemes specifically tuned for unstructured meshes. They have the most compact stencil compatible with the expected order of accuracy, which typically ranges from two to four. Since the stencil is compact, the implementation on parallel machines becomes simple. The price to pay is that the scheme is necessarily implicit; however, the compactness of the scheme should allow us to use the high-performance parallel linear algebra tools developed by the team. Other research themes may emerge from these. In parallel to these aspects of fundamental numerical analysis, we also work on adapting more classical numerical tools to complex physical problems such as those encountered in interface flows, turbulent flows and multiphase flows.
Within a few years, we expect to be able to tackle physical problems that are currently difficult to compute, thanks to the know-how coming from our research on compact distribution schemes. These problems range from aeroacoustic ones to multiphysics problems, such as the ones mentioned above.
Due to the increase in available computer power, new applications such as reaction paths, free energy computations, biomolecular dynamics simulations or material failure simulations are now commonly performed by chemists. These computations simulate systems of up to several thousand atoms, over time scales of up to several nanoseconds. The larger the simulation, the cheaper the potential driving the phenomena must be, which results in lower-precision results. To achieve realistic results, simulations need to include the environment surrounding the molecules, such as water and membranes, resulting in system sizes of up to several hundred thousand atoms. Furthermore, simulating the aggregation of proteins, which is critical for biologists studying viruses, requires models of up to one million atoms, with simulation times of up to one millisecond. This implies that atomistic simulations must be sped up by several orders of magnitude. To obtain this speed, numerical and parallel algorithms must be improved, as well as their implementations on distributed or parallel architectures.
We are currently focusing on several aspects of these problems. First, we try to improve models and algorithms: we decrease the complexity of classical algorithms by introducing new approximations into the algorithms or into the model (this is the trick of linear scaling methods such as the divide-and-conquer method), and by proposing new algorithms.
Second, we apply multiscale methods to decrease the number of atoms that are considered at the finest level (electronic or atomistic). To do this, we introduce a coarser model like continuum media to take into account the electrostatic effect of the environment, or an elasticity model for crystals. The difficulty here is to build an efficient scheme which couples the two different scales without any loss of precision.
Finally, efficient implementation is necessary to reach the desired level of performance. For instance, we can rewrite our algorithms in the form of block computations, in order to use efficient computational routines such as BLAS vector-matrix operations, or to implement accurate load balancing strategies.
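As a minimal illustration of this blocking idea (the function name and block size below are ours, not part of any ScAlApplix code), a matrix product can be rewritten so that all arithmetic happens in dense block operations, which map directly onto optimized BLAS 3 routines:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Compute A @ B by accumulating dense block products.

    Each block update is a dense operation delegated to optimized
    BLAS 3 routines, far more cache-friendly than scalar triple
    loops over individual entries.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p))
    for i in range(0, n, bs):
        for k in range(0, m, bs):
            A_blk = A[i:i + bs, k:k + bs]
            for j in range(0, p, bs):
                # dense block update C_ij += A_ik * B_kj
                C[i:i + bs, j:j + bs] += A_blk @ B[k:k + bs, j:j + bs]
    return C
```

Ragged trailing blocks are handled automatically by the slicing; the same reorganization principle applies to the factorization kernels mentioned above.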
Another domain we are currently investigating is the development of parallel and distributed environments for coupling numerical codes, and for interactively steering numerical simulations in particular in the context of molecular dynamics.
Solving large sparse systems Ax = b of linear equations is a crucial
and time-consuming step, arising in many scientific and engineering
applications. Consequently, many parallel techniques for sparse matrix
factorization have been studied and implemented.
We started this research by working on the parallelization of an industrial code for structural mechanics: a 2D and 3D finite element code, nonlinear in time, that solves plasticity (or thermo-plasticity) problems, possibly coupled with large displacements. Since the matrices of these systems are very ill-conditioned, classical iterative methods are not viable. Therefore, to obtain a robust and versatile industrial software tool, high-performance direct sparse solvers are mandatory, and parallelism is then necessary for reasons of memory capacity and acceptable solution time. Moreover, in order to solve efficiently 3D problems with more than 10 million unknowns, which is now a reachable challenge with new SMP supercomputers, we must achieve good time scalability and control the memory overhead.
In the ScAlApplix project, we focused first on the block
partitioning and scheduling problem for high performance sparse
LDLT or LLT parallel factorization without dynamic pivoting
for large sparse symmetric positive definite systems. Our strategy is
suitable for non-symmetric sparse matrices with symmetric pattern, and
for general distributed heterogeneous architectures whose computation
and communication performances are predictable in advance. In order to
achieve efficient parallel sparse factorization, we study the
following pre-processing phases:
The ordering phase, which computes a symmetric
permutation of the initial matrix A such that the factorization process
will exhibit as much concurrency as possible while incurring low
fill-in. We use a tight coupling of the Nested Dissection and
Approximate Minimum Degree algorithms. The partition of the original
graph into supernodes is achieved by merging the partition of
separators computed by the Nested Dissection algorithm and the
supernodes amalgamated for each subgraph ordered by Halo Approximate
Minimum Degree.
The block symbolic factorization phase, which
determines the block data structure of the factorized matrix L
associated with the partition resulting from the ordering phase.
One can efficiently perform such a block symbolic factorization in
quasi-linear space and time complexities. From this block structure,
one can deduce the weighted task graph that captures all dependencies
between blocks, as well as the supernodal elimination tree.
The block repartitioning and scheduling phase, which refines the previous partition by splitting large supernodes in order to exploit concurrency within dense block computations, and which maps the resulting blocks onto the processors of the target architecture.
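The core of the symbolic factorization phase can be sketched as follows; this is a scalar (non-block) toy version with names of our choosing, whereas the actual solver works on supernodal block structures. The column structures of the factor L and the elimination tree are computed from the graph of A alone, without any numerical computation:

```python
def symbolic_cholesky(adj, n):
    """Symbolic factorization of a sparse symmetric matrix.

    `adj[j]` is the set of indices i with A[i, j] != 0 (i != j),
    for the ordering 0..n-1. Returns, for each column j, the set of
    below-diagonal rows of the factor L (original entries plus
    fill-in), together with the elimination tree: the parent of a
    column is its first below-diagonal nonzero in L.
    """
    struct = [set() for _ in range(n)]
    parent = [-1] * n
    children = [[] for _ in range(n)]
    for j in range(n):
        s = {i for i in adj[j] if i > j}   # entries of A below the diagonal
        for c in children[j]:              # merge structures of tree children
            s |= struct[c] - {j}           # (this is where fill-in appears)
        struct[j] = s
        if s:
            parent[j] = min(s)
            children[parent[j]].append(j)
    return struct, parent
```

For instance, on a 4-unknown pattern where unknown 0 is connected to both 2 and 3, eliminating 0 creates the fill-in entry L[3,2] even though A[3,2] = 0.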
These topics are also investigated for general sparse non-symmetric matrices with a more dynamic strategy due to the need for numerical pivoting during the factorization.
Research about high performance sparse direct solvers is carried on in collaboration with P. Amestoy (ENSEEIHT – IRIT) and J.-Y. L'Excellent (INRIA Rhône-Alpes), and has led to software developments (see section ) and to industrial contracts with CEA (Commissariat à l'Energie Atomique).
In most of the scientific computing applications considered today as computational challenges, such as biological systems, astrophysics or electromagnetism, the introduction of hierarchical methods based on an octree structure has dramatically reduced the amount of computation needed to simulate those systems for a given error tolerance.
Among these methods, the Fast Multipole Method (FMM) allows the computation of the interactions in, for example, a molecular dynamics system of N particles in O(N) time, against O(N^2) with a direct approach. The extension of these methods and their efficient implementation on current parallel architectures is still a critical issue. Moreover, the use of periodic boundary conditions, or of duplications of the system in 2 out of 3 space dimensions, as well as the use of higher-order approximations for integral equations, are also still relevant issues.
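The contrast between the direct O(N^2) evaluation and the multipole idea can be sketched with the lowest-order (monopole) term only, in one dimension; this is a toy illustration of the principle, not the actual FMM. For two well-separated clusters, each cluster is summarized by its total charge placed at its charge barycenter:

```python
def direct_potential(xs, charges):
    """Direct O(N^2) evaluation of a Coulomb-like pairwise energy:
    sum over i < j of q_i * q_j / |x_i - x_j| (1D positions)."""
    e = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            e += charges[i] * charges[j] / abs(xs[i] - xs[j])
    return e

def monopole_cluster_energy(xs1, q1, xs2, q2):
    """Monopole (zeroth-order multipole) approximation of the
    interaction energy between two well-separated clusters: each
    cluster is replaced by its total charge located at its charge
    barycenter. The relative error decays like (d/R)^2, where d is
    the cluster radius and R the distance between the clusters."""
    Q1, Q2 = sum(q1), sum(q2)
    c1 = sum(q * x for q, x in zip(q1, xs1)) / Q1
    c2 = sum(q * x for q, x in zip(q2, xs2)) / Q2
    return Q1 * Q2 / abs(c1 - c2)
```

The full FMM organizes such cluster-cluster approximations hierarchically in an octree, with higher-order expansion terms, which is what brings the cost down to O(N).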
In order to treat biological systems of up to several millions of atoms, these methods must be integrated in the QC++ platform (see section ). They can be used in the three (quantum, molecular and continuum) models for atom-atom interactions in quantum or molecular mechanics, atom-surface interactions for the coupling between continuum and the other models, and also for fast matrix-vector products in the iterative solving of the linear system given by the integral formulation of the continuum method. Moreover, the significant experience achieved by the Scotch and PaStiX projects (see section ) will be useful in order to develop efficient implementations of the FMM methods on parallel clusters of SMP nodes.
Recently, a lot of work has been devoted to computational grids. Such computing architectures differ from usual parallel platforms in terms of heterogeneity (of both processing and communication resources) and scale (use of long-distance network links with high latencies). Such platforms are usually not dedicated to one application, and therefore security and fault tolerance problems also arise. In our work, we do not consider security and fault tolerance, but rather concentrate on the additional difficulties arising from the heterogeneity and the dynamicity (in terms of performance rather than topology) of such platforms.
Our goal is to design efficient scheduling algorithms for heterogeneous and non-dedicated platforms. Scheduling computational tasks on a given set of processors is a key issue for high-performance computing. The traditional objective of scheduling algorithms is makespan minimization: given a task graph and a set of computing resources, find a mapping of the tasks onto the processors, and order the execution of the tasks so that (i) task precedence constraints are satisfied; (ii) resource constraints are satisfied; and (iii) a minimum schedule length is achieved. However, makespan minimization turns out to be an NP-complete problem in most practical situations, and the advent of more heterogeneous architectural platforms is likely to further increase the computational complexity of mapping applications to machines.
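Since exact makespan minimization is NP-complete, practical schedulers rely on polynomial-time heuristics. A minimal sketch of a greedy list-scheduling heuristic for independent tasks on heterogeneous processors (the function name and the longest-task-first priority are our illustrative choices, not a specific published algorithm):

```python
def list_schedule(task_costs, speeds):
    """Greedy list scheduling of independent tasks on heterogeneous
    processors: tasks are considered longest first, and each task is
    assigned to the processor that would finish it earliest. A task
    of cost c runs in time c / speed[p] on processor p.
    """
    finish = [0.0] * len(speeds)   # current finish time per processor
    assignment = {}
    for t, cost in sorted(enumerate(task_costs), key=lambda tc: -tc[1]):
        best = min(range(len(speeds)),
                   key=lambda p: finish[p] + cost / speeds[p])
        finish[best] += cost / speeds[best]
        assignment[t] = best
    return assignment, max(finish)
```

With task costs [4, 3, 2, 1] and processor speeds [1, 2], this heuristic reaches a makespan of 3.5, close to the lower bound 10/3 given by the total work divided by the aggregate speed.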
Much of the work presented in section  has been done in collaboration with the GRAAL project (INRIA Rhône-Alpes), with Arnaud Legrand (CNRS research scientist at ID-IMAG), and during the PhD thesis of Loris Marchal (begun 09/03), co-directed by Olivier Beaumont and Yves Robert (GRAAL project).
Thanks to the constant evolution of computational capacity, numerical simulations are becoming more and more complex; it is not uncommon to couple different models in different distributed codes running on heterogeneous networks of parallel computers (e.g. multi-physics simulations). For years, the scientific computing community has expressed the need for new computational steering tools to better grasp the complexity of the underlying models.
Computational steering is an effort to make the typical simulation work-flow (modeling, computing, analyzing) more efficient by providing on-line visualization of, and interactive steering over, on-going computational processes. On-line visualization is very useful for monitoring and detecting possible errors in long-running applications, and interactive steering allows researchers to alter simulation parameters on the fly and immediately receive feedback on their effects. The scientist thus gains better insight into the cause-and-effect relationships within the simulation.
A computational steering environment can be defined as a communication infrastructure coupling a simulation with a remote user interface, called the steering system. This interface usually provides on-line visualization and user interaction. Over the last decade, many steering environments have been developed; they distinguish themselves by critical features such as the simulation integration process, the communication infrastructure and the steering system design. A first solution for the integration is the problem solving environment (PSE) approach, as in SCIRun; it allows the scientist to construct a steering application according to a visual programming model. At the opposite extreme, environments like CAVEStudy interact with the application only through its standard input/output. Nevertheless, the majority of steering environments, such as the well-known CUMULVS, are based on the instrumentation of the application source code; this approach allows fine-grained steering functionalities and achieves good runtime performance. Regarding the communication infrastructure, there are many underlying issues, especially when considering parallel and distributed simulations: heterogeneous data transfers, network communication protocols and data redistribution.
In the EPSN project, we intend to explore the capabilities of CORBA technology, and of the parallel CORBA objects currently under development, for the computational steering of parallel and distributed simulations. The environment we are working on allows control, data exploration and data modification for numerical simulations involving an iterative process. In order to be as generic as possible, we introduce an abstract model of steerable simulations. This abstraction allows us to build steering clients independently of a given simulation. The model is described with an XML syntax and is used in the simulation through source code annotations. EPSN takes advantage of CORBA technology to design a communication infrastructure with portability, interoperability and network transparency. In addition, the in-progress parallel CORBA objects will give us a very attractive framework for extending the steering to parallel and distributed simulations.
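The source-code annotation idea, stripped of CORBA and XML, can be sketched as a simulation that exposes named parameters and calls a checkpoint hook once per iteration, where pending steering requests are applied and a data snapshot is published. All names below are hypothetical; this is an illustration of the instrumentation principle, not the EPSN API:

```python
class SteerableSimulation:
    """Toy sketch of source-code instrumentation for steering.

    The simulation exposes named parameters and calls `checkpoint`
    once per iteration; there, pending steering requests are applied
    and a data snapshot is published for on-line visualization.
    """
    def __init__(self, params):
        self.params = dict(params)
        self.pending = []     # steering requests from (remote) clients
        self.published = []   # snapshots made available to viewers

    def steer(self, name, value):
        """Called by a client to change a parameter on the fly."""
        self.pending.append((name, value))

    def checkpoint(self, iteration, data):
        while self.pending:   # apply parameter changes between iterations
            name, value = self.pending.pop(0)
            self.params[name] = value
        self.published.append((iteration, data))

    def run(self, n_iter):
        x = 0.0
        for it in range(n_iter):
            x += self.params["dt"]   # stand-in for the numerical scheme
            self.checkpoint(it, x)
        return x
```

A request issued before or during the run takes effect at the next checkpoint, so the client immediately observes its effect on the published snapshots.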
The main objective of the ScAlApplix project is to analyze and solve scientific computing problems coming from complex research and industrial applications and involving scaling. This allows us to validate the numerical schemes, the algorithms and the associated software that we develop. We currently have three reference application domains: fluid mechanics, molecular dynamics and host-parasite systems in population dynamics. In these three domains, we study and simulate phenomena that are by nature multiscale and multiphysics, and that require enormous computing power. A major part of this work leads to industrial collaborations, in particular with the French CEA/CESTA and CEA/Ile-de-France centers.
The numerical simulation of unsteady flows is still a challenge since efficient schemes and efficient implementations are needed. This challenge is even higher if large size problems are to be tackled, and if the meshes are not regular.
Among the problems to be considered, one may list the computation of mixing layers, shock-vortex interactions, and the noise generated by a flow. This last item clearly needs very high order schemes, and the best schemes today use regular structured meshes. Hence, one of our objectives is to construct very high order schemes for unstructured meshes.
Another example where large computer resources are needed is the simulation of multiphase flows. In that case, several difficulties have to be faced: unsteady flows, complex geometries and a very complex physical model.
Due to the increase in available computer power, new applications such as reaction paths, free energy computations, biomolecular dynamics simulations or material failure simulations are now commonly performed by chemists. These computations simulate systems of up to several thousand atoms, over time scales of up to several nanoseconds. The larger the simulation, the cheaper the potential driving the phenomena must be, which results in lower-precision results. To achieve realistic results, simulations need to include the environment surrounding the molecules, such as water and membranes, resulting in system sizes of up to several hundred thousand atoms. Furthermore, simulating the aggregation of proteins, which is critical for biologists studying viruses, requires models of up to one million atoms, with simulation times of up to one millisecond. This implies that atomistic simulations must be sped up by several orders of magnitude. To obtain this speed, numerical and parallel algorithms must be improved, as well as their implementations on distributed or parallel architectures.
In population dynamics, systems can present very complex behaviors and can be difficult to analyze from a purely mathematical point of view. The aim of this interdisciplinary project is to develop numerical tools for population dynamics models arising in the modelling of complex heterogeneous host-parasite systems. Typical heterogeneities we consider are spatial locations, the age of hosts or their ability to recruit macroparasites, and the age of macroparasites. Our main goals are: understanding the impact of host population structure on parasite population dynamics, developing accurate numerical simulations using parallelization, and designing prophylactic methods. For many host-parasite systems, the different time scales between the host population (e.g. a one-year period) and the virus (e.g. an infected host dies within a few weeks) require a small time step. Numerical schemes for the resulting nonlinear epidemiological models in spatially heterogeneous environments are complex to implement, and reliable numerical results become difficult to obtain as the size of the spatial domain increases. In addition, many input parameters (biological and environmental factors) are taken into account to compare simulation results with observations from field studies. Therefore, a realistic simulator has a significant computation cost, and parallelization is required.
Individual-Based Models (IBM) are becoming more and more useful for describing biological systems. Interactions between individuals are simple and local, yet can lead to complex patterns at a global scale. The principle is to replicate the simulation several times to obtain statistically meaningful results. The Individual-Based Model approach contrasts with more aggregate population modeling approaches in providing low-level mechanisms to manage population interactions. Stochastic simulations reproduce elementary processes and often lead to prohibitive computations; thus we need parallel algorithms.
In our developments of both stochastic and deterministic models, biological processes are combined to reach a good level of realism. For host-parasite systems, this sets them apart from purely mathematical models, whose numerical results could hardly be compared to observations. Parallel numerical simulations mimic some of the dynamics observed in the field, and supply a usable tool for validating the models.
This work is a collaborative effort in an interdisciplinary approach: population dynamics and biology with P. Silan (UPS CNRS 2561, Cayenne) and Agnès Calonnec (INRA - UMR Santé végétale, Villenave d'Ornon), applied mathematics with M. Langlais and S. Gaucel (MAB - Université Bordeaux 2) and computer science.
We develop two kinds of software. The first kind consists of generic libraries to be used in the applications: a partitioner for large irregular graphs and meshes (Scotch) and high-performance direct or hybrid solvers for very large sparse systems of equations (MUMPS, PaStiX). The second kind corresponds to dedicated software for molecular chemistry (QC++) and fluid mechanics (FluidBox), and to a platform for computational steering (EPSILON). For these parallel software developments, we use the message passing (MPI) paradigm, the OpenMP programming interface, threads, and the Java and/or CORBA technologies.
In the context of PARASOL (Esprit IV Long Term Project, 1996-1999), the CERFACS and ENSEEIHT-IRIT teams initiated the parallel sparse solver MUMPS (``MUltifrontal Massively Parallel Solver''). Since the first public release of MUMPS (March 2000), much research work has been done in collaboration with J.-Y. L'Excellent from the INRIA project GRAAL, and Sherry Li and Esmond Ng from NERSC (Lawrence Berkeley National Laboratory). This work, related to performance scalability, orderings for unsymmetric matrices, and dynamic scheduling, has been incorporated in the new improved version of the package (release 4.3.2, available since November 2003 at http://www.enseeiht.fr/apo/MUMPS).
MUMPS is a package for solving linear systems of equations
Ax = b, where the matrix A is sparse and can be
either unsymmetric, symmetric positive definite, or general symmetric.
It uses a multifrontal technique which is a direct method based on
either the LU or the LDLT factorization of the matrix.
The main features of the MUMPS package include numerical
pivoting during factorization, solution of the transposed system,
input of the matrix in assembled format (distributed or centralized)
or elemental format, error analysis, iterative refinement, scaling of
the original matrix, and return of a Schur complement matrix. It also
offers several built-in ordering algorithms, a tight interface to some
external ordering packages such as Scotch and is available in
various arithmetics (real or complex, single or double).
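Two of the features listed above, numerical pivoting during factorization and iterative refinement of the computed solution, can be illustrated on a small dense stand-in. This toy code is ours, written for illustration; MUMPS itself uses a sparse multifrontal method:

```python
import numpy as np

def lu_solve_with_refinement(A, b, refine_steps=2):
    """Dense stand-in for what a sparse direct solver does: LU
    factorization with partial (numerical) pivoting, followed by a
    few steps of iterative refinement on the computed solution."""
    n = len(A)
    LU = A.astype(float).copy()
    piv = list(range(n))                 # row permutation: PA = LU
    for k in range(n):
        p = k + int(np.argmax(np.abs(LU[k:, k])))  # numerical pivoting
        if p != k:
            LU[[k, p]] = LU[[p, k]]
            piv[k], piv[p] = piv[p], piv[k]
        LU[k + 1:, k] /= LU[k, k]
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])

    def solve(rhs):
        y = rhs[piv].astype(float)       # apply the permutation: Pb
        for k in range(n):               # forward substitution (unit L)
            y[k + 1:] -= LU[k + 1:, k] * y[k]
        for k in range(n - 1, -1, -1):   # back substitution (U)
            y[k] = (y[k] - LU[k, k + 1:] @ y[k + 1:]) / LU[k, k]
        return y

    x = solve(b)
    for _ in range(refine_steps):        # iterative refinement
        x += solve(b - A @ x)
    return x
```

Note that the test matrix below has a zero in position (1,1), so the factorization only succeeds because of the row pivoting.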
This work is supported by the French ``Commissariat à l'Energie Atomique CEA/CESTA'' in the context of structural mechanics and electromagnetism applications.
PaStiX (http://www.labri.fr/~ramet/pastix) is a scientific library that provides a high-performance solver for very large sparse linear systems, based on block complete and incomplete (ILU(k)) factorizations for direct and iterative solution. Many algorithms are implemented in single or double precision (real or complex): LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non-symmetric matrices having a symmetric pattern). This latter version will be used in FluidBox.
The PaStiX library uses the graph partitioning and sparse matrix block ordering package Scotch (see section ). PaStiX is based on efficient static scheduling and memory management, in order to solve problems with more than 10 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. It computes an efficient static scheduling of the block computations for our supernodal parallel solver, which uses total local aggregation of contribution blocks. This is done by taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations. We have also improved this static computation and communication scheduling algorithm to anticipate the sending of partially aggregated blocks, in order to free memory dynamically. By doing this, we are able to reduce dramatically the aggregation memory overhead, while keeping good performance.
Another important point is that our study is suitable for any heterogeneous parallel/distributed architecture whose performance is predictable, such as clusters of SMP nodes. In particular, we now propose a high-performance version with a low memory overhead for SMP node architectures, which fully exploits the advantages of shared memory by using a hybrid MPI-thread implementation.
However, direct methods may fail to solve very large three-dimensional problems, due to the large amount of memory needed for these cases, despite any memory optimization. Our approach now consists in symbolically computing the block structure of the factors that would have been obtained with a complete factorization, and then dropping some blocks of this structure according to relevant criteria. The incomplete factorization induced by the new sparse pattern is then used within a preconditioned GMRES or Conjugate Gradient solver.
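The solver structure described above, an incomplete factorization used as a preconditioner inside a Krylov method, can be sketched as follows. For brevity, the sketch uses a plain preconditioned conjugate gradient with a generic apply-preconditioner callback; the usage example plugs in a simple Jacobi (diagonal) preconditioner, whereas PaStiX applies its block ILU(k) factors at that point:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, maxiter=500):
    """Minimal preconditioned conjugate gradient for SPD systems.
    `M_inv(r)` applies the preconditioner, i.e. returns M^{-1} r."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                 # initial residual
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for it in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)              # preconditioner application
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, it + 1
```

The better the incomplete factors approximate the true factors, the fewer Krylov iterations are needed, which is exactly the trade-off governed by the block-dropping criteria.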
The initial purpose of Scotch was to provide an efficient software environment for statically partitioning and mapping applications modeled as valuated process graphs of arbitrary topology. The original contribution consisted in developing a ``divide and conquer'' algorithm in which processes are recursively mapped onto processors by using graph bisection algorithms applied both to the process graph and to the architecture graph. This allows the mapper to take into account the topology and heterogeneity of the valuated graph which models the interconnection network and its resources (processor speed, link bandwidth). This technique made it possible to compute high-quality mappings with low complexity.
Based on these results, new graph partitioning algorithms have been developed, which compute vertex separators instead of edge separators, using a multi-level framework. Recursive vertex separation is used to compute reorderings of the unknowns of large sparse linear systems, which both preserve sparsity when factoring the matrix and preserve concurrency for computing and solving the factored matrix in parallel. The original contribution has been to study and implement a tight coupling between the nested dissection and the approximate minimum degree methods; this work was carried out in collaboration with Patrick Amestoy, of ENSEEIHT-IRIT.
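The nested dissection principle behind these orderings can be sketched on the simplest possible graph, a path, where the median vertex is an exact separator (a real partitioner like Scotch computes separators of general graphs by multilevel bisection). The two halves are ordered first and the separator last, so that eliminating one half creates no fill involving the other:

```python
def nd_order(lo, hi):
    """Nested dissection ordering of a path graph on vertices
    lo..hi-1: the median vertex separates the path into two
    independent halves; each half is ordered recursively, and the
    separator vertex is numbered last."""
    if hi - lo <= 2:
        return list(range(lo, hi))
    mid = (lo + hi) // 2
    # halves first, separator last: concurrency + low fill-in
    return nd_order(lo, mid) + nd_order(mid + 1, hi) + [mid]
```

The resulting elimination tree is balanced, which is what exposes concurrency for the parallel factorization described above.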
Recently, new classes of methods have been added to the Scotch library, which allow it to compute efficient orderings of native meshes, resulting in the handling of larger problems than with standard graph partitioners.
The Scotch software package (http://www.labri.fr/~pelegrin/scotch/), version 4.0 of which is about to be released as LGPL'ed libre software, is quite popular and compares favorably with its best-known US competitor, MeTiS.
QC++ is a quantum chemistry software package written in C++. The current version of QC++ supports the semi-empirical quantum models MNDO, AM1 and PM3. It allows one to calculate the energy of a molecular configuration by several Self Consistent Field (SCF) algorithms: fixed point, optimal damping and level shifting. It is also possible to optimize the geometry using the L-BFGS and BFGS minimization algorithms.
The major new feature in version 2.0 is the implementation of a linear scaling ``divide and conquer'' method, both sequential and parallel, to calculate the electronic energy. Several types of subdomains and strategies to partition a molecule are available, one of them being based on Scotch.
FluidBox is a software package dedicated to the simulation of inert or reactive flows. It is also able to simulate multiphase and multimaterial flows. There exist 2D and 3D versions; the 2D version is used to test new ideas that are later implemented in the 3D one. Two classes of schemes have been implemented: classical finite volume schemes and the more recent residual distribution schemes. Several low-Mach preconditioning techniques are also implemented. The code has been parallelized with and without overlap of the domains. It is used by our industrial contacts: CEA/CESTA, EADS and SNPE.
EPSILON is the prototype of the EPSN computational steering environment. This platform allows scientists to couple their existing simulations (written in C, C++ or FORTRAN) with remote user interfaces providing on-line visualization and user interaction. Epsilon, which is based on CORBA (omniORB), provides a C API for instrumenting both simulations and client applications. It addresses sequential numerical simulations and has an extension for steering parallel simulations. Release 0.1, which is distributed among the project participants, allows the construction of multi-simulation and multi-client steering systems providing control, data exploration and data modification for iterative processes. The client examples distributed with Epsilon show the diversity of solutions permitted by the environment, from the generic Java-Swing client to AVS modules for advanced on-line visualization.
We have extended our previous works to the case of unstructured meshes in FluidBox. The proposed scheme is currently being extended to a simplified equation of state in order to better take into account the quasi incompressible nature of some constituents. This work is supported by the CEA/CESTA.
In order to improve the numerical accuracy and to reduce the computation time, some simplified models of the Baer–Nunziato multi-fluid model have been proposed. The reduced models are obtained after a scaling of the equations and a WKB expansion. These models can take into account complex phenomena such as turbulence, gravity and surface tension. However, no fully conservative form of these models exists. In order to recover the correct physical behavior, strategies are defined to approximate the non-conservative products: nonlinear projection and extended consistency. The resulting schemes are accurate and robust.
The other challenge is to develop an efficient numerical strategy for complex equations of state. Some accurate relaxation schemes have been developed and validated for the isobaric, single-velocity model. This work is in progress in the context of B. Braconnier's PhD thesis, with some 3D applications proposed by the CEA/CESTA. These computations are still expensive, and the schemes need to be implemented on parallel computers. The parallel solution strategy is based on a domain decomposition achieved by the mesh partitioner Scotch. It will integrate the high-performance parallel solvers developed in the ScAlApplix project for the solution of the linear systems associated with the implicit numerical scheme.
We are still working on the theoretical understanding of these particular schemes. A review is given in . Several results have been obtained about their stability and about the approximation of viscous terms in the Navier–Stokes case. This has been done in collaboration with the von Kármán Institute (H. Deconinck, M. Ricchiutto). A better understanding of the poor iterative convergence properties of the schemes has been obtained; a paper is being written . The remedy amounts to adding an extra stabilisation term, and this technique has been successfully used for second- to fourth-order schemes.
The unsteady version of the residual distribution scheme has also been studied, for space and time accuracy higher than two, by N. Andrianov (HYKE post-doctoral fellow) in . Results in aeroacoustics have also been obtained in .
Funded by CNES and in collaboration with ONERA, we are studying how the residual distribution schemes (second order) can be implemented in the structure of the ONERA code CEDRE. This is the topic of Cédric Tavé's PhD thesis.
In collaboration with J. Ovadia (CEA CESTA), P.H. Maire and J. Breil (UMR CELIA), we have developed a new Lagrangian scheme for the simulation of compressible flows. This scheme solves several stability problems encountered in the simulation of implosions .
Inspired by the work of A. Murrone and H. Guillard (INRIA project SMASH), we have studied, in collaboration with V. Perrier (UMR CELIA and MAB), the limit of the scheme described in a previous work when some relaxation parameters become infinite. This provides a discretisation of the non-conservative terms which is consistent with .
In recent years, we designed a scheme for Hamilton–Jacobi equations inspired by our work on residual distribution schemes in their blended version. A 3D version of the scheme has been implemented for SNPE. This has motivated a better understanding of the method: a convergence proof is now available for steady problems, as well as a version that is more efficient in terms of accuracy. The full paper is .
In the context of the management of subsoil waters, a numerical model for flows in porous media has been designed. The mathematical model, based on Darcy's law, is a 2D diffusion equation solved using a mixed hybrid finite volume method. This work has been done within INRIA's internship program with the University of Yaoundé (Cameroon).
In collaboration with M. Breuss and M.-O. Titeux, we have developed a new strategy to approximate the solutions of the pressureless gas dynamics equations. Such a system arises when simulating the large-scale structures of the universe. The numerical method is based on a relaxation strategy and overcomes several difficulties, namely the delta-shocks and the vacuum states. The scheme has been integrated in FluidBox.
At present, most numerical simulations of compressible flows are performed in the framework of the Euler equations. This system, related to velocity moments of the Boltzmann equation, is based on the assumption that the gas is in local thermodynamic equilibrium. Several recent applications, however, consider gases that depart from equilibrium; since the main assumption is not satisfied, the classical Euler equations cannot be used. The computation of extremely low-pressure rarefied gas flows arising from the reentry of space vehicles falls within this framework, where local thermodynamic equilibrium cannot be assumed. We also note several applications related to inertial confinement fusion, where an under-dense plasma is considered and the effects of anisotropic laser heating are studied. For such non-equilibrium regimes, a model investigated by Levermore and Morokoff admits 10 equations: the 10-moment Gaussian closure model. We have proposed a Suliciu relaxation scheme to approximate the solutions of this system, for which the main stability properties can be established. The multidimensional results turn out to be very attractive, since the scheme introduces little numerical diffusion and thus gives accurate results.
In the QC++ code, the divide and conquer method has been parallelized using the message passing paradigm. The main features of the parallelization are the following. First, several subdomains can be handled by each processor, which will allow us to implement load balancing methods. Second, asynchronous communications are used to overlap communications with computations, according to two different communication patterns: for data that must be shared across all subdomains, such as the diagonal blocks of the global density matrix used to build the global Fock matrix, a communication ring is used; otherwise, point-to-point communications are performed. Third, a molecule partitioner based on the Scotch library, called Kimika, has been developed. Several types of partitioning and clustering strategies (by atom or by fragment) have been evaluated, showing that, when the number of subdomains increases, load imbalance can adversely impact performance. New partitioning algorithms are being considered, which will take into account, for load balancing, both the load of the interior of the subdomains and the loads of the overlapping frontier areas.
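The ring pattern mentioned above can be sketched as follows. This is a hypothetical sequential simulation of a ring all-gather, not QC++ code: after p-1 steps, every process holds the per-subdomain data (e.g. a diagonal density-matrix block) of all p processes, while only nearest-neighbour messages are exchanged at each step.

```python
def ring_allgather(local_blocks):
    """Simulate a ring all-gather: after p-1 steps every process
    holds the blocks of all p processes."""
    p = len(local_blocks)
    # known[i] = blocks that process i has received so far
    known = [{i: local_blocks[i]} for i in range(p)]
    # index of the block each process currently forwards
    in_flight = list(range(p))
    for _ in range(p - 1):
        nxt = [None] * p
        for i in range(p):
            src = in_flight[i]        # block held for forwarding
            dst = (i + 1) % p         # right neighbour on the ring
            known[dst][src] = local_blocks[src]
            nxt[dst] = src            # neighbour forwards it next step
        in_flight = nxt
    return known

blocks = ["B0", "B1", "B2", "B3"]
gathered = ring_allgather(blocks)
# every simulated process ends up with all four blocks
```

In an MPI implementation each step would be one asynchronous send/receive pair, which is what allows the communication to be overlapped with computation.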
A study of crack propagation in silica glass, using a method coupling molecular dynamics and elasticity, began in collaboration with the CEA Ile-de-France in December 2003. A state of the art of large-scale atomistic simulations of crack propagation, and of the various methods for coupling an atomistic model with continuum elasticity, has been established. In such models, the main difficulty is to prevent wave reflections in the overlapping region.
To understand the key points of the bridging method introduced by T. Belytschko, we have analyzed the one-dimensional model in order to understand the formulation of the problem and the influence of several parameters on wave reflections in the overlapping region. Moreover, a dispersion analysis has been carried out to find which frequencies are not affected by the coupling. The propagation of waves through the materials is changed by the underlying model used.
The next step is the study of the coupling in the two- and three-dimensional cases. The parallel computing aspects will also be considered.
For many host-parasite systems modeled with matrix computations, memory storage problems and lengthy computational times occurred. To tackle this problem, a crucial observation is that the nonlinear matrix model involves sparse periodic matrices. Appropriate storage techniques and sparse matrix multiplication techniques then make it possible to simulate large spatial domains and to get robust numerical results. In the ScAlApplix project, methods using sparse matrix multiplications are applied to analyze rabies propagation and to help design prophylactic methods. This work is carried out by S. Gaucel (PhD student) in collaboration with M. Langlais.
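To illustrate the principle (this is a toy sketch, not the rabies code, and the matrices and coefficients are invented): a periodic matrix model x_{t+1} = A(t mod T) x_t whose matrices are mostly zeros can be stored in a sparse format so that both memory and multiplication cost scale with the number of nonzeros rather than with the square of the domain size.

```python
def spmv(A, x):
    """Sparse matrix-vector product; A maps (row, col) -> value."""
    y = [0.0] * len(x)
    for (i, j), v in A.items():
        y[i] += v * x[j]
    return y

# Two seasonal matrices on a tiny 1-D habitat of 4 cells:
# local survival on the diagonal plus nearest-neighbour dispersal.
n = 4
summer = {(i, i): 0.9 for i in range(n)}
winter = {(i, i): 0.5 for i in range(n)}
for i in range(n - 1):
    summer[(i, i + 1)] = summer[(i + 1, i)] = 0.05  # dispersal terms

def iterate(x, steps):
    period = [summer, winter]          # period T = 2
    for t in range(steps):
        x = spmv(period[t % 2], x)
    return x

x = iterate([1.0, 0.0, 0.0, 0.0], 4)   # infection starts in cell 0
```

On a realistic 2-D habitat with thousands of cells the dense matrices would not fit in memory, while the sparse representation grows only linearly with the domain size.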
Fish ectoparasites interact continuously with their host populations. A model describing the demographic strategies of such fish and parasite populations has been developed for the Diplectanum aequans-Sea Bass system . This model is mostly deterministic with some stochastic aspects. Aggregation of macroparasites on the hosts is not imposed in the model, but occurs or not depending on the parasite population dynamics. Previous deterministic discrete simulations led to very large computations, and some simulations had long run-times. A high-performance simulator working on parallel machines (IBM SP3, Regatta and SGI Origin 3800, parallel machines of CINES – Montpellier) and providing more accurate computations has been implemented. The algorithmic study and a performance analysis establish the efficiency and scalability (checked up to 448 processors) of the parallel algorithm. The parallel simulator provides more accurate computations than the sequential one. The parallel efficiency reaches 77 % on 448 processors for a complete simulation. A study of memory accesses and cache utilization led to an implementation reaching 60 % of peak performance in the computation kernel on the IBM SP3 and on the Origin 3800. A complete simulation of 1.45 PFLOP was achieved in only two hours on 256 processors (IBM SP3). A full analysis of the numerical simulations has allowed us to tune the model to get a realistic qualitative behavior. A thorough validation of the model is currently being performed with P. Silan, a biologist from CNRS. Results of numerical simulations show the effect of overdispersion and of parasite and host mortality on the parasite distribution, as well as host regulation (occurring through cycles).
Subsequently, an individual-based model has been developed. We used a Monte Carlo algorithm for this stochastic simulator, for which we described three different levels of parallelism. The performance of a hybrid MPI/OpenMP code was analyzed, up to 256 processors, on a cluster of SMP nodes. The qualitative results of both parallel simulators were compared. Improving the model leads to a deeper understanding of the processes occurring in the real biological system. Even though the method differs from the deterministic one, results are qualitatively similar for identical data sets.
A cooperation involving Agnès Calonnec, a biologist from INRA, and a PhD student in computer science (Gaël Tessier) began in October 2003. Using numerical methods and parallel techniques, we are interested in modeling the spread of oïdium (powdery mildew), a disease of vineyards. Correct prediction of this type of parasite epidemic requires a realistic simulator, and could have an industrial impact. An architectural model of vine stocks is used for two purposes: the study of the growth of the stocks, and of the influence of their structure on the dispersal of oïdium. In this model, we consider a large number of infectious elements and several spatially heterogeneous environmental parameters. Indeed, the dispersal of oïdium is a multiscale mechanism that takes place within vine stocks, and along and across the rows of the vineyard. An initial version of a parallel simulator has been developed . A characterization of the implemented algorithms is under way; in particular, we evaluate the communication costs and the load imbalance. First results indicate good scalability up to 24 processors.
The work carried out within the Scotch project (see section ) was again focused on the development of efficient algorithms for partitioning and reordering meshes. One of the bottlenecks that had been identified when using the PaStiX software (see section ) to solve large 3D problems was that the size of the adjacency graphs, which is linear in the number of elements but quadratic with respect to the number of nodes of each element (since the latter are turned into cliques), resulted in memory shortage and disk swapping during the pre-processing phase. Using native mesh structures instead of adjacency graphs makes memory occupation linear with respect to the number of nodes of the elements, but requires additional work on the algorithms that traverse these structures, so that the gain in space complexity does not result in a substantial loss in time complexity.
Experiments carried out on real 3D graphs with several million nodes provided by CEA/CESTA show that, when graphs reside entirely in memory, mesh algorithms are about twice as slow as their graph counterparts, due to the two-level traversal of the bipartite mesh data structures, but meshes with about 30% more nodes can be handled in main memory without incurring swapping. Partitioning quality is also slightly increased when using mesh algorithms, but decreases as mesh sizes increase, because of the currently lower quality of the mesh coarsening algorithms used by the multi-level scheme, with respect to their graph counterparts. Early results have been presented in . On-going work now focuses on the development of more efficient mesh coarsening algorithms.
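The space-complexity argument above can be made concrete with a back-of-the-envelope sketch (the element counts are illustrative, not the CEA/CESTA cases): an element with k nodes expands into a clique of k(k-1) arcs in the adjacency graph, whereas the bipartite mesh structure stores only 2k element-node links per element.

```python
def graph_arcs(num_elems, nodes_per_elem):
    """Arcs in the clique-expanded adjacency graph: each element
    contributes a clique over its nodes (arcs counted once per
    direction); sharing between elements is ignored for simplicity."""
    return num_elems * nodes_per_elem * (nodes_per_elem - 1)

def mesh_arcs(num_elems, nodes_per_elem):
    """Arcs in the bipartite mesh structure: element->node and
    node->element links only."""
    return 2 * num_elems * nodes_per_elem

# 3D hexahedral elements with 8 nodes each
g = graph_arcs(1_000_000, 8)   # quadratic in nodes per element
m = mesh_arcs(1_000_000, 8)    # linear in nodes per element
ratio = g / m                  # 3.5x fewer arcs for the mesh form
```

The gap widens with higher-order elements, which is why the mesh representation lets meshes with substantially more nodes stay in main memory.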
A joint work with Patrick Amestoy allowed us to adapt his approximate minimum fill algorithm to the mesh data structure, which removed the need to turn submeshes into subgraphs to compute the ordering of the submeshes at the end of the nested dissection process.
During his internship from February to the end of June 2004, Cédric Chevalier studied ways to couple even more tightly the nested dissection method with the minimum degree algorithm applied at the end of the nested dissection process. The purpose of the study was to investigate whether the topological structure of the subgraphs handled at the different stages of the nested dissection process could be exploited so as to extract criteria deciding whether to pursue nested dissection one step further or to switch to minimum fill. Although it was possible to find topological criteria characterizing graphs belonging to the two above classes, using these criteria in practice yielded no positive result.
Because sequential graph partitioning is more and more of a critical bottleneck, Cédric Chevalier was appointed in October 2004, as a PhD student, to work on parallel graph partitioning algorithms, which will be the starting point of the parallel Scotch library.
In order to solve linear systems arising from 3D problems with more than 10 million unknowns, which is now a reachable challenge for modern SMP supercomputers, parallel solvers must maintain good time scalability and must control the memory overhead caused by the extra structures required to handle communications.
In our current work, our direct solvers achieve rather good scalability on many test problems. In most large cases, the efficiency is better than that previously obtained on the IBM SP2, which shows that our techniques are well suited to SMP architectures. Some experiments were run on the TERA parallel computer at CEA, and the factorization times are close to those obtained on the IBM SP3 of CINES (Montpellier, France). For example, on our largest problem (26 million unknowns for a 2.5D problem), we reach 500 Gflop/s on 768 processors, that is, about 50% of the peak performance of the TERA computer.
The first major improvement consisted in taking into account heterogeneous architectures, and more particularly those based on SMP nodes, such as the IBM SP3. Our communication model, used during the static scheduling and mapping step, has been extended to manage both data exchanges through shared memory (less costly) and data exchanges through the network (more costly). We have proposed a mapping and scheduling algorithm for clusters of SMP nodes, and we have shown the benefits of such strategies on run-time performance.
In addition to run-time performance, another important aspect of the direct solution of very large sparse systems is the large amount of memory usually needed to factor the matrix in parallel. This memory is consumed either by the structures required for communication or by the matrix coefficients themselves.
The first solution we proposed to reduce the cost of these structures was to modify the communication pattern so as to decrease the number of simultaneously allocated message buffers. Our statically scheduled factorization algorithm can thus take advantage of the deterministic access (and allocation) pattern for all the data stored in memory. Indeed, the data access pattern is determined by the computational task ordering, which drives the accesses to the matrix coefficient blocks, and by the communication priorities, which drive the accesses to the communication structures. The results of this work were presented in particular at the SIAM Linear Algebra 2003 conference.
The second solution, in the context of new SMP node architectures, aims at fully exploiting the advantages of shared memory. A relevant approach is then to use a hybrid MPI-thread implementation. This approach, not yet explored in the framework of direct solvers, aims at efficiently solving 3D problems with much more than 10 million unknowns. The rationale that motivated this hybrid implementation is that communications within an SMP node can be advantageously replaced by direct accesses to shared memory between the processors of the node, using threads. In addition, the MPI communications between processes are grouped by SMP node. We have shown that this approach greatly reduces the memory required for communications .
Several factorization algorithms are now implemented, in real or complex arithmetic and in single or double precision: LLt (Cholesky), LDLt (Crout) and LU with static pivoting (for non-symmetric matrices having a symmetric pattern). The latter version is being integrated into the FluidBox software .
In this work, we consider an approach which, we hope, will bridge the gap between direct and iterative methods. The goal is to provide a method which exploits the parallel blockwise algorithms used in the framework of high-performance sparse direct solvers. We want to extend these high-performance algorithms to develop robust parallel preconditioners, based on incomplete factorization, for iterative solvers such as GMRES or the Conjugate Gradient method.
The idea is to define an adaptive blockwise incomplete factorization that is much more accurate (and numerically more robust) than the scalar incomplete factorizations commonly used to precondition iterative solvers. Such an incomplete factorization can take advantage of the latest breakthroughs in sparse direct methods, and should in particular be very competitive in CPU time (effective processor performance and good scalability) while avoiding the memory limitations encountered by direct methods. In this way, we expect to be able to solve systems with on the order of a hundred million, or even a billion, unknowns. Another goal is to analyse and justify the parameters that can be used to define the block sparsity pattern of our incomplete factorization.
The driving rationale for this study is that it is easier to incorporate incomplete factorization methods into direct solution software than to develop new incomplete factorizations. Our approach consists in computing symbolically the block structure of the factors that would have been obtained with a complete factorization, and then dropping some blocks of this structure according to relevant criteria. Our main goal at this point is to achieve a significant reduction of the memory needed to store the incomplete factors (with respect to the complete factors), while keeping enough fill-in to make the use of BLAS3 primitives profitable.
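A minimal sketch of this block-dropping step, with purely illustrative criteria (the real selection rules of the solver are more elaborate): starting from the block structure of the complete factors, keep a block if it is close to the diagonal or large enough for BLAS3 kernels to pay off.

```python
def drop_blocks(blocks, min_rows=8, max_level=2):
    """blocks: list of (block_col, block_row, nrows) describing the
    symbolic structure of the complete factors. The 'level' used here
    mimics a distance from the diagonal via |block_row - block_col|;
    both thresholds are hypothetical placeholders."""
    kept = []
    for bcol, brow, nrows in blocks:
        level = abs(brow - bcol)
        if level <= max_level or nrows >= min_rows:
            kept.append((bcol, brow, nrows))   # worth a BLAS3 update
    return kept

# toy symbolic structure: (block column, block row, block height)
symbolic = [(0, 0, 32), (0, 1, 16), (0, 5, 4), (1, 1, 32),
            (1, 3, 2), (1, 6, 12), (2, 2, 32), (2, 7, 3)]
kept = drop_blocks(symbolic)
# the small far-from-diagonal blocks (0,5,4) and (2,7,3) are dropped
```

The point of deciding everything symbolically, before any numerical computation, is that the incomplete factorization then reuses the dense block kernels of the direct solver unchanged.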
In and , we have shown the benefit of this approach over a classic scalar implementation and also over direct factorizations. Indeed, on the AUDI problem (a reference 3D test case for direct solvers, with about one million unknowns), we are able to solve the system in half the time required by the direct solver while using only one tenth of the memory (for a relative residual precision of 10^-7).
We now have to perform many more experiments on a large class of problems to validate our approach and identify its limitations. We also expect to improve the criteria used to drop blocks from the blockwise symbolic structure.
This research is included in an NSF/INRIA project and is carried out in collaboration with Yousef Saad (University of Minnesota, USA).
In recent years, a few incomplete LU factorization techniques were developed with the goal of combining some of the features of standard ILU preconditioners with the good scalability of multi-level methods. The key feature of these techniques is to reorder the system so as to extract parallelism in a natural way. Often, a number of ideas from domain decomposition are borrowed and combined to derive parallel factorizations.
Within this framework, we developed, in collaboration with Yousef Saad (University of Minnesota), algorithms that generalize the notions of ``faces'' and ``edges'' of the ``wire-basket'' decomposition. The interface decomposition algorithm is based on defining a ``hierarchical interface structure''. The hierarchy consists of classes with the property that class k nodes, with k>1, are separators for class k-1 nodes. In each class, nodes are grouped in independent sets. Class 1 nodes are simply the interior nodes of a domain in the graph partitioning of the problem. These are naturally grouped in group-independent sets, in which the blocks (groups) are the interior points of each domain. Nodes that are adjacent to more subdomains belong to the higher-level classes and are ordered last.
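The class assignment can be sketched on a toy graph (this is an illustration of the principle, not PHIDAL code, and the simple "count the subdomains a node touches" rule is an assumption): interior nodes see only their own subdomain and form class 1, while nodes adjacent to more subdomains get a higher class and are ordered last.

```python
def node_classes(adjacency, part):
    """Assign each node the number of distinct subdomains among
    itself and its neighbours: 1 = interior, >1 = interface."""
    classes = {}
    for u, neighbours in adjacency.items():
        domains = {part[u]} | {part[v] for v in neighbours}
        classes[u] = len(domains)
    return classes

# path graph 0-1-2-3-4-5 split into two subdomains {0,1,2} and {3,4,5}
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
cls = node_classes(adj, part)
order = sorted(adj, key=lambda u: cls[u])  # interiors first, interfaces last
```

In the elimination ordering produced this way, the interface nodes 2 and 3 (the separator of the two subdomains) come last, which is exactly the property the factorization by classes relies on.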
The second part of this work is a factorization that uses dropping strategies which attempt to preserve the independent set structure. The Gaussian elimination process proceeds by classes: nodes of the first class are eliminated first, followed by those of the second class, and so on. We propose two dropping rules: the first one favors numerical robustness, and the second one favors parallelism. Generally, we apply the first rule to the factorization of the first 2 or 3 classes because, at those first levels, the factorization involves communication between only a few processors and we need to capture more numerical accuracy. We use the second rule in the factorization of the higher classes, for the opposite reasons.
These algorithms have been implemented in the PHIDAL library. First experiments have been carried out on realistic 3D fluid dynamics test cases, and they show a very promising scalability of the preconditioner (number of iterations, CPU time). The latest developments of this work have been presented in .
In recent works, we enhanced the level and connector definitions and introduced new algorithms that aim at decreasing the number of connectors and levels obtained with irregular graphs. A paper describing these algorithms and some experimental results has been submitted to the SIAM Journal on Scientific Computing (SISC).
The Fast Multipole Method (FMM) is a hierarchical method which computes interactions for the N-body problem in O(N) time for any given precision. In order to compute energy and forces on large systems, we need to improve the computation speed of the method.
BLAS routines (Basic Linear Algebra Subprograms) had already been successfully used for the near-field computation. While it is straightforward to use level 2 BLAS (matrix-vector operations) for the far-field computation, the use of level 3 BLAS (matrix-matrix operations) is attractive because it is much more efficient. We have therefore rearranged the algorithm so as to use level 3 BLAS, thus greatly improving the overall runtime, at the cost of a careful data memory layout.
Other enhancements of Fast Multipole Methods, such as the use of the Fast Fourier Transform or of ``rotations'', allow a reduction of the theoretical operation count: these techniques have been implemented for comparison with our BLAS version. Tests have shown that our approach is either faster (compared to the rotation-based method) or as fast but without any numerical instabilities (compared to the FFT-based method), hence justifying our BLAS approach. An efficient parallel code of our BLAS version will be validated on real study cases.
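The level-2 to level-3 rearrangement can be sketched as follows (a toy illustration, not the actual FMM code): when many far-field translations apply the same transfer matrix to many expansion vectors, gathering the vectors column-wise turns k matrix-vector calls into one matrix-matrix call, which is exactly the pattern that a single GEMM would serve efficiently.

```python
def matvec(A, x):
    """One level-2 BLAS-style operation: A @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmat(A, X):
    """One level-3 BLAS-style operation: A applied to the columns of X
    (here X is given as a list of columns)."""
    return [matvec(A, col) for col in X]

A = [[2.0, 0.0],
     [1.0, 3.0]]                         # toy transfer matrix
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # expansions of 3 cells

one_by_one = [matvec(A, v) for v in vecs]    # level-2 pattern
batched = matmat(A, vecs)                    # level-3 pattern, same result
```

The numerical result is identical; the gain of the real implementation comes from the much better cache reuse of matrix-matrix kernels, which is why the data layout in memory has to be arranged so that the vectors to be batched are contiguous.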
As already mentioned in section , makespan minimization turns out to be very difficult, even for simple homogeneous processors and links. Our objective is to lower the ambition of makespan minimization in order to build efficient scheduling algorithms for more realistic platform models. In our work, we usually adopt the so-called ``one-port with overlap'' model, where a processor can simultaneously send one message, receive one message and process one task, and where contention over communication links is taken into account. This requires a fine knowledge of the topology of the platform, but recently some tools (like ENV and AlNEM) have been designed to build such platform models. One idea to circumvent the difficulty of makespan minimization is to lower the ambition of the scheduling objective. Instead of aiming at the absolute minimization of the execution time, why not consider asymptotic optimality? After all, the number of tasks to be executed on the computing platform is expected to be very large: otherwise, why deploy the corresponding application on computational grids? This approach was pioneered by Bertsimas and Gamarnik. The dramatic simplification of steady-state scheduling is to concentrate on steady-state operation. The scheduling problem is relaxed in many ways. Initialization and clean-up phases are neglected. The initial integer formulation is replaced by a continuous or rational formulation. The precise ordering and allocation of tasks and messages are not required, at least in a first step. The main idea is to characterize the activity of each resource during each time unit: which (rational) fraction is spent computing, and which is spent receiving from or sending to which neighbor. Such activity variables are gathered into a linear program, which includes conservation laws that characterize the global behavior of the system.
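A minimal instance of this relaxation can be sketched for master-slave tasking (a toy model with invented numbers, not one of our published formulations): worker i processes at most w[i] tasks per time unit, sending one task to worker i occupies the master's outgoing port for s[i] time units, and the rational activity variables r_i (tasks per time unit sent to worker i) replace any explicit schedule. For this particular one-constraint linear program, a greedy fractional-knapsack argument gives the optimum.

```python
def steady_state_throughput(w, s):
    """Maximise sum(r_i) subject to r_i <= w[i] (compute limit) and
    sum(r_i * s[i]) <= 1 (one-port: the master's send-time budget per
    time unit). Greedy on the cheapest send cost is optimal here,
    exactly as in the fractional knapsack problem."""
    budget = 1.0                        # master's send-time budget
    rate = 0.0
    for wi, si in sorted(zip(w, s), key=lambda p: p[1]):
        r = min(wi, budget / si)        # saturate the worker or the port
        rate += r
        budget -= r * si
    return rate

# three workers: compute rates and per-task send costs (illustrative)
throughput = steady_state_throughput(w=[2.0, 1.0, 5.0], s=[0.2, 0.1, 0.5])
```

Note how the relaxation discards everything but the conservation of the master's port time: no task ordering, no initialization phase, only rational rates, which is what makes the problem polynomial.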
This approach has been applied successfully to many scheduling problems. We first considered very simple application models, such as master-slave tasking, where a processor initially holds all the data , and studied the makespan minimization counterpart. Generalizations, in which some parallelism can be extracted within tasks, have been considered, and the general case has been proven NP-hard . The case of divisible tasks (perfectly parallel tasks that can be arbitrarily divided) has also been addressed .
We have also studied the problem of communication/computation overlap for pipelined communications. This corresponds to ``do-dopar'' loops, whereas master-slave tasking corresponds to ``dopar-do'' loops. One perspective of this work is the use of automatic parallelization techniques to obtain a canonical representation of loops, and then to derive both the mapping and the scheduling of tasks.
We have applied steady-state techniques to collective communication schemes such as scatter, broadcast, parallel prefix and multicast. We have derived polynomial algorithms for broadcasts and scatters , but multicast and parallel prefix have been shown NP-hard . Another perspective of our work is to understand the limitations of steady-state techniques in the light of these new NP-hardness results. Recently, we have also considered a more restrictive communication model (unidirectional 1-port model), where a processor can either send or receive data at a given time step, as well as issues related to memory minimization .
From the computational complexity point of view, considering steady state and throughput maximization instead of makespan minimization is very effective. Nevertheless, many problems related to throughput maximization remain open.
Concerning the modeling of the processing time, memory limitations should be taken into account. Indeed, the proposed schemes involve the communication of large messages, and memory problems may therefore occur. Another essential characteristic of grid platforms is the dynamicity of resources. It is therefore necessary to design algorithms that are (in increasing order of complexity):
stable against small variations in link and processor capabilities;
whose schedule can be computed in a distributed fashion;
able to cope with modifications in the topology of the platform.
If we are able to deal with memory constraints and small variations in resource performance, then applications such as EPSN (where data are generated, processed and then visualized, and for which a given throughput must be ensured) or video on demand (where video streams must be adapted to the capabilities of visualization terminals and to network occupation) could be considered.
EPSN is a software environment for the coupling of parallel numerical simulations with parallel visualization tools such as OpenInventor, VTK, AVS, etc. This environment mainly provides control, data exploration and parameter modification for complex simulations.
The EPSN principles can be summed up as follows; some key points were presented at the EuroPar 2004 conference (see also a former EPSN paper ).
We propose a coupling model based on source-code instrumentation of both the simulation and the visualization programs, as it allows fine-grain steering functionalities and achieves good runtime performance.
In order to closely interact with any complex parallel simulation, we introduce a high-level description model based on a hierarchical task graph representation (HTG). This abstraction allows us to ensure the coherency of EPSN operations and to provide a weak synchronization of the parallel processes.
The HTG is also useful to build steering clients independently of a given simulation. We have validated this point by developing a generic user interface, which enables the end-user to interact with any EPSN simulation through simple numerical data-sheets or by using basic visualization plug-ins for fields, particles or meshes.
Regarding the communication infrastructure, we advocate CORBA as the coupling technology, because it provides portability, interoperability and network transparency. Experiments with both sequential and parallel simulations have shown that the EPSN library achieves high performance on data transfers.
We have also demonstrated the computation/communication overlap capabilities provided by the EPSN infrastructure, even in the parallel case. Further experiments should be carried out with large parallel simulations in order to validate the scalability of our platform.
The coupling of parallel simulations with parallel visualization systems relies on a M×N redistribution library, called RedGRID.
RedGRID is a software environment dedicated to the parallel redistribution of complex data objects, such as dense or sparse fields, particles and unstructured meshes. This environment is initially intended to be used in EPSN, but can also serve more generic code-coupling purposes, such as the Parallel CORBA Objects (PaCO++) defined in the PARIS project. RedGRID is made up of two libraries: RedSYM and RedCORBA.
The symbolic redistribution library, RedSYM, provides a generic framework for data redistribution. It mainly proposes a standard description of parallel data objects and implements redistribution algorithms for these objects. This library is said to be symbolic because it is fully independent of the communication layer.
The CORBA Redistribution Library (RedCORBA) is a communication layer based on the CORBA technology and built upon RedSYM. It makes it possible to dynamically couple several parallel codes (e.g. MPI, PVM) over a heterogeneous and distributed network and to redistribute data between them. RedCORBA mainly provides a component-like framework to connect parallel codes and exchange distributed data according to a one-sided communication model (get/put).
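The core of an M×N redistribution is the computation of a message schedule: which index ranges each of the M sender processes must ship to each of the N receivers. The following Python sketch illustrates this for the simplest case of 1-D contiguous block distributions; it is a minimal illustration of the problem RedGRID addresses, with function names of our own choosing, not RedSYM's actual API.

```python
# Minimal sketch of the M×N redistribution problem: intersect the
# sender-side and receiver-side block distributions of a 1-D array to
# obtain the message schedule. Names are illustrative, not RedSYM's API.

def block_ranges(size, nprocs):
    """Contiguous block distribution of `size` items over `nprocs`
    processes; returns a list of half-open (start, end) ranges."""
    base, rem = divmod(size, nprocs)
    ranges, start = [], 0
    for p in range(nprocs):
        length = base + (1 if p < rem else 0)
        ranges.append((start, start + length))
        start += length
    return ranges

def redistribution_messages(size, m, n):
    """Each non-empty intersection of a sender block with a receiver
    block becomes one message (sender, receiver, start, end)."""
    send, recv = block_ranges(size, m), block_ranges(size, n)
    msgs = []
    for s, (s0, s1) in enumerate(send):
        for r, (r0, r1) in enumerate(recv):
            lo, hi = max(s0, r0), min(s1, r1)
            if lo < hi:  # non-empty overlap
                msgs.append((s, r, lo, hi))
    return msgs

# 10 items redistributed from 2 senders to 3 receivers.
print(redistribution_messages(10, 2, 3))
```

For the complex data objects RedSYM actually targets (sparse fields, particles, unstructured meshes), the intersection computation is considerably more involved, which is precisely why a generic, communication-independent symbolic layer is valuable.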
Finally, we have validated the use of EPSN with two real scientific simulations: the parallel fluid flow simulation software FluidBox and the molecular docking application FINGERPRINT. At this point in EPSN development, the platform allows the coupling of parallel simulations with parallel visualization systems and the efficient transfer of redistributed data.
EPSN development is now focused on supporting massively parallel simulations and distributed applications. Ongoing work also aims at developing EPSN client components to ease the integration of the platform into visualization systems, and at building fully parallel visualization prototypes (parallel rendering, tiled display).
RedGRID evolution focuses on the specification of a generic data model for unstructured meshes and graphs that would be suitable for M×N redistribution. We will also develop a new transfer layer on top of MPI, called RedMPI, that supports the parallel data objects and redistribution algorithms defined in RedSYM.
This contract consists of expertise tasks concerning high-performance aspects of several industrial CEA/CESTA parallel codes.
Parallel resolution of multifluid flows (Benjamin Braconnier, Boniface Nkonga);
Numerical simulation of compressible multifluid flows (Rémi Abgrall, Michaël Papin);
Feasibility study of new direct-iterative hybrid solvers for very large complex sparse systems (Pascal Hénon, Dimitri Lecas, François Pellegrini, Pierre Ramet, Jean Roman);
Feasibility study of the new hybrid MPI – Threads version of the PaStiX parallel direct solver on the SMP supercomputer of CEA (Arnaud Goureman, Pascal Hénon, Dimitri Lecas, Pierre Ramet, Jean Roman);
Numerical simulation of crack propagation in silica glass by coupling molecular dynamics and elasticity methods (Guillaume Anciaux, Olivier Coulaud, Jean Roman).
Grant: Conseil Régional d'Aquitaine, CNES and EADS – EXPERT project
Dates: 2004 – 2007
Overview: The objective of this work is to upgrade the numerical schemes in the aerodynamic modules of the ONERA code CEDRE using the know-how we have developed in residual distribution schemes. The main difficulty is to adapt these methods to the data structure of CEDRE: residual distribution schemes are tuned for cell-vertex data structures, while CEDRE works with cell-centered data structures. The scientific objective of this grant is to provide a bridge between residual distribution schemes and discontinuous Galerkin ones.
Grant: ACI ``Sciences et Technologies pour le Transport Supersonique'' – French Ministry of Industry
Dates: 2002 – 2004
Partners: Dassault Aviation, Ecole Centrale de Lyon (leader of the project)
Overview: The objective is to develop high-order numerical schemes for wave propagation on unstructured meshes.
Grant: ACI IMPIO (``Action Concertée Incitative Informatique, Mathématiques, Physique en Biologie Moléculaire'' – French Ministry of Research)
Dates: 2004 – 2006
Partners: CBT and MAEM (UHP Nancy 1, CNRS)
Overview: The goal of this action is to study the use of linear-scaling algorithms in order to understand the behavior of methionine synthase reductase enzymes.
Grant: INRIA New Investigation Grant
Dates: 2003 – 2004
Partners: GRAAL (INRIA Rhône-Alpes, leader of the project), PARIS (INRIA IRISA), Résédas (INRIA LORIA)
Overview: The objective of the RedGrid action is to develop, within a CORBA environment, efficient redistribution algorithms for large data sets on grids. The application domain is the computational steering of distributed numerical simulations.
Grant: ACI GRID (``Action Concertée Incitative Globalisation des Ressources Informatiques et des Données'' – French Ministry of Research)
Dates: 2002 – 2005
Partners: IPARLA (INRIA Futurs), IECB (Bordeaux 1, Bordeaux 2, CNRS, INSERM), SRSMC (CNRS), LSIIT (Strasbourg 1, CNRS)
Overview: This project aims at designing a framework enabling the steering of distributed and parallel numerical simulations through visualization or virtual reality environments. We are focusing on the design and implementation of the framework, including data redistribution and a parallel visualization client.
Grant: ACI GRID (``Action Concertée Incitative Globalisation des Ressources Informatiques et des Données'' – French Ministry of Research)
Dates: 2002 – 2005
Partners: CERFACS, ENSEEIHT – IRIT (leader of project), GRAAL (INRIA Rhône-Alpes), CEA, CNES, EADS, IFP
Overview: The main objective of TLSE is to design and develop an expertise platform for sparse linear algebra using grid technology. There has been much joint work over many years on sparse matrix software between CERFACS, ENSEEIHT-IRIT, Rutherford Appleton Laboratory, LaBRI, LIP-ENSL, Parallab, University of Florida, Berkeley, and other collaborators. This has given rise to several software packages that are available to the scientific community. The goal of the project is to design an expert site that draws on this accumulated expertise and provides a one-stop shop for potential users of sparse codes. A user may want to interrogate our databases for information or references on sparse matrix work, or may want actual statistics from runs of sparse software on his or her problem. The site will provide easy access to the tools and will allow comparative analysis of these packages on a user-submitted problem or on particular matrices from the matrix collection also available on the site.
Grant: ACI GRID (``Action Concertée Incitative Globalisation des Ressources Informatiques et des Données'' – French Ministry of Research and Conseil Régional d'Aquitaine)
Dates: 2003 – 2005
Partners: RunTime (INRIA Futurs), IPARLA (INRIA Futurs), SOD (LaBRI)
Overview: The main objective of GRID5000 is to deploy, manage and use a very large PC grid across the French territory. In this project, we focus on the algorithmic aspects and on the development of complex applications on such a heterogeneous grid. This work is complementary to the research work of the EPSN project.
Grant: NSF and INRIA
Dates: April 2001 – April 2004
Partners: Yousef Saad/Minneapolis, Randall Bramley/Indiana, Esmond Ng/Berkeley, Maria Sosonkina/Minneapolis, John Gilbert/Santa Barbara, Bernard Philippe (IRISA), Jocelyne Ehrel (IRISA), Jean-Yves L'Excellent (INRIA Rhône-Alpes)
Overview: This project focuses on algorithmic and experimental comparisons between the parallel solvers developed by the participants, and on different solutions for designing efficient parallel hybrid direct-iterative solvers. Yousef Saad has visited the ScAlApplix project in Bordeaux several times. Pascal Hénon, Pierre Ramet and Jean Roman stayed in Minneapolis for 15 days in April 2003.
Rémi Abgrall is a scientific editor of the international journals ``Mathematical Modelling and Numerical Analysis'', ``Computers and Fluids'' and ``Journal of Computational Physics''. He is a member of the scientific committee of the international conference ICCFD.
Olivier Beaumont has been a member of the scientific committees of the international conferences HeteroPar'04 and IPDPS'04.
Olivier Coulaud has been a member of the scientific committee of the international conference VECPAR'04.
Jean Roman is vice-president of the project committee of INRIA Futurs and a member of the national evaluation committee of INRIA. He has been a member of the scientific committees of the national conference RenPar'04 and of the international conference PMAA'04. He is co-editor of a special PMAA issue of the international journal ``Parallel Computing''. He has been a member of the EDF scientific committee for the evaluation of the R&D scientific program ``Simulation 2010''.
In addition to the normal teaching activities of the university and ENSEIRB members, Olivier Coulaud and Pascal Hénon teach at the ENSEIRB (computer science) and MatMéca (applied mathematics) engineering schools.