Section: Application Domains
High Performance Computing
Participants: Pierre-Nicolas Clauss, Sylvain Contassot-Vivier, Jens Gustedt, Soumeya Leila Hernane, Emmanuel Jeanvoine, Thomas Jost, Wilfried Kirschenmann, Stéphane Vialle.
Models and Algorithms for Coarse Grained Computation
With this work we aim to extend the coarse grained modeling (and the resulting algorithms) that we provided previously, see [6], to hierarchically composed machines such as clusters of clusters or clusters of multiprocessors.
To be usable in a Grid context, this modeling first has to overcome a principal constraint of the existing models: the assumption that the processors and the interconnection network are homogeneous. Even if the long-term goal is to target arbitrary architectures, it would not be realistic to try to achieve this directly. We therefore proceed in steps:

Hierarchical but homogeneous architectures: these are composed of a homogeneous set of processors (that is, processors of the same computing power) interconnected by a non-uniform, hierarchical network or bus (CC-NUMA, clusters of SMPs).

Hierarchical heterogeneous architectures: here there is no established, measurable notion of efficiency or speedup. Moreover, it is most certainly not the case that any arbitrary collection of processors will be useful for computation on the Grid. Our aim is to give a set of concrete indications of how to construct an extensible Grid.
In parallel, we have to work on the characterization of architecture-robust efficient algorithms, i.e., algorithms that are independent, up to a certain degree, of low-level components or the underlying middleware.
Asynchronous algorithms are very good candidates, as they are robust to dynamic variations in the performance of the interconnection network. Moreover, they even tolerate the loss of messages related to the computations. However, as mentioned before, they cannot be used in all cases. We will therefore study how these schemes can be modified to widen their range of applicability while preserving as much asynchronism as possible.
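The robustness to message loss described above can be illustrated with a minimal sketch. It is not one of the algorithms studied by the team: it is a textbook-style asynchronous Jacobi relaxation in which each unknown plays the role of one processor, and the names `async_jacobi`, `loss_prob` and `view` are hypothetical. The sketch assumes a strictly diagonally dominant matrix, a classical sufficient condition for such iterations to converge, and simulates lost messages by randomly skipping updates of the other processors' local views.

```python
import random

def async_jacobi(A, b, sweeps=200, loss_prob=0.3, seed=0):
    """Approximate the solution of A x = b by asynchronous Jacobi relaxation.

    Assumption: A is strictly diagonally dominant. Component i stands for
    one processor that only sees a local, possibly stale, copy of the
    other components.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n                          # latest value computed by each owner
    view = [[0.0] * n for _ in range(n)]   # view[i][j]: what i believes x[j] is
    for _ in range(sweeps):
        for i in rng.sample(range(n), n):  # arbitrary update order
            s = sum(A[i][j] * view[i][j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
            view[i][i] = x[i]
            for k in range(n):
                # Each "message" carrying x[i] is lost with probability loss_prob.
                if k != i and rng.random() >= loss_prob:
                    view[k][i] = x[i]
    return x

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]                    # exact solution is [1, 1, 1]
    print(async_jacobi(A, b))
```

No processor ever waits for another one in this sketch: a lost message only means that the next update uses an older value. This is why the iteration still approaches the solution, at the price of more sweeps.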
Finally, as the number of components grows, so does the probability of failures. Work has already been done to provide efficient fault tolerance solutions for some SPMD-with-communications and Master-Worker families of parallel applications (cf. Section 6.1.9). Being at the application level, these solutions seem suitable for the aforementioned heterogeneous architectures and may complement algorithm-based fault tolerance such as that naturally provided by asynchronous algorithms. We are currently investigating the compatibility of our fault tolerance solutions with some applications developed to run on clusters of GPGPUs (e.g., an American option pricer). We also seek to extend our solutions to support and take advantage of asynchronous algorithms.
Irregular Problems
Irregular data structures like sparse graphs and matrices are in wide use in scientific computing and discrete optimization. The importance and the variety of application domains are the main motivation for the study of efficient methods on such objects. The main approaches to obtain good results are parallel, distributed and out-of-core computation.
We follow several tracks to tackle irregular problems: automatic parallelization, design of coarse grained algorithms and the extension of these to external memory settings.
In particular, we study the management of very large graphs as they occur in practice. Here, the notion of “networks” appears in two ways: on the one hand, many of these graphs originate from networks that we use or encounter (Internet, Web, peer-to-peer, social networks); on the other hand, the handling of these graphs has to take place in a distributed Grid environment. The principal techniques to handle these large graphs will be provided by the coarse grained models. With the PRO model [6] and the parXXL library we already provide tools to better design algorithms (and implement them afterward) that are adapted to these irregular problems.
In addition we will be able to rely on certain structural properties of the relevant graphs (short diameter, small clustering coefficient, power laws). This will help to design data structures that will have good locality properties and algorithms that compute invariants of these graphs efficiently.
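To make the structural properties mentioned above concrete, the following sketch computes the diameter, the average clustering coefficient and the degree histogram of a small undirected graph stored as an adjacency dictionary. It is a sequential, in-memory illustration only and does not use parXXL or the PRO model. The function names are hypothetical, and the exact diameter computation (one breadth-first search per vertex) is only practical for small connected graphs.

```python
from collections import Counter, deque

def eccentricity(adj, source):
    """Largest BFS distance from source inside its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Exact diameter of a connected undirected graph (one BFS per vertex)."""
    return max(eccentricity(adj, u) for u in adj)

def average_clustering(adj):
    """Mean over all vertices of the fraction of linked neighbor pairs."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # such vertices contribute 0 by convention
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def degree_histogram(adj):
    """Map each degree to the number of vertices having that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

if __name__ == "__main__":
    # A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(diameter(graph), average_clustering(graph), degree_histogram(graph))
```

On graphs of the size discussed in this section, such invariants would have to be computed with distributed or approximate methods. The sketch only fixes the definitions.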
Heterogeneous Architecture Programming
Clusters of heterogeneous nodes, composed of CPUs and GPUs, require complex multi-grain parallel algorithms: coarse grain to distribute tasks across cluster nodes and fine grain to run computations on each GPU. Algorithms are implemented on these architectures using a multi-paradigm parallel development environment, typically composed of the MPI and CUDA libraries (compiling with both the gcc and NVIDIA nvcc compilers).
We investigate the design of multi-grain parallel algorithms and multi-paradigm parallel development environments for GPU clusters, in order to achieve both speedup and size-up on different kinds of algorithms and applications. Our main application targets are: financial computations, PDE solvers, and relaxation methods.
Energy
Nowadays, society is becoming more and more aware of the problem of energy supply and is therefore concerned with reducing energy consumption. Computer science is no exception, and a lot of effort has to be made in our domain to optimize the energy efficiency of our systems and algorithms.
In that context, we investigate the potential benefit of using massively parallel devices such as GPUs in addition to CPUs. Although such devices have quite high instantaneous power consumption, their energy efficiency, that is to say their flops-per-watt ratio, is often much better than that of CPUs.
We have studied the potential energy gain of GPUs in different kinds of applications (pricer, PDE solver, ...). Our experiments have shown that, in most cases, there is a complex frontier between the best solutions in terms of energy (CPU alone, CPU + GPU), depending on the problem parameters (problem size, ...) and the architecture configuration (number of nodes, network, ...). We have then designed a first set of models that predict the best combination of compute kernels for a given context of use. Further investigations will aim to enhance the models and to design a dynamic adaptive scheme that performs a bi-objective optimization (computing performance and energy consumption).
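The existence of a frontier between CPU-only and CPU + GPU solutions can be illustrated with a deliberately simple model. This is not the set of models designed by the team. The sketch assumes that the run time is a fixed overhead plus work divided by a sustained rate, and that energy is average power multiplied by run time. The `Config` class, its fields and all numerical values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    flops_per_s: float   # assumed sustained compute rate
    watts: float         # assumed average power draw while running
    overhead_s: float    # assumed fixed cost (data transfers, kernel launches)

def run_time(cfg, work_flops):
    """Predicted run time in seconds under the linear model."""
    return cfg.overhead_s + work_flops / cfg.flops_per_s

def energy_joules(cfg, work_flops):
    """Predicted energy: average power multiplied by run time."""
    return cfg.watts * run_time(cfg, work_flops)

def best_config(configs, work_flops):
    """Configuration with the lowest predicted energy for this problem size."""
    return min(configs, key=lambda c: energy_joules(c, work_flops))

def crossover_work(a, b):
    """Problem size at which a and b use the same energy, or None if none exists."""
    slope_gap = a.watts / a.flops_per_s - b.watts / b.flops_per_s
    if slope_gap == 0.0:
        return None
    work = (b.watts * b.overhead_s - a.watts * a.overhead_s) / slope_gap
    return work if work > 0.0 else None

if __name__ == "__main__":
    cpu = Config("CPU", 1e10, 100.0, 0.0)        # illustrative values only
    gpu = Config("CPU+GPU", 1e11, 300.0, 1.0)    # illustrative values only
    for work in (1e10, 1e12):
        print(work, best_config([cpu, gpu], work).name)
    print(crossover_work(cpu, gpu))
```

With these illustrative numbers, the CPU alone is predicted to be more frugal for the small problem and the CPU + GPU combination for the large one. Even this one-parameter model yields a frontier, and real frontiers also depend on the number of nodes and on the network.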
Load balancing
Although load balancing in parallel computing has been intensively studied, it is still an issue in the most recent parallel systems, whose complexity and dynamic nature steadily increase. For the grid in particular, where the nodes or the links may be intermittent, there is a growing demand for non-centralized algorithms.
In joint work with the University of Franche-Comté, we study the design and optimal tuning of a fully decentralized load balancing scheme (see 8.2.8). In particular, we study the optimal load amount to migrate between neighboring nodes. We have developed a SimGrid program to study the impact of the different strategies and we are currently adapting our load balancing scheme to a real application of neural learning (AdaBoost). This code was initially developed by S. Genaud and V. Galtier to compare JavaSpace and P2P-MPI.
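The question of how much load to migrate between neighboring nodes can be illustrated with a classical first-order diffusion scheme. This is a generic textbook technique, not necessarily the scheme studied in this joint work, and it is written as synchronous rounds for readability. The parameter `alpha` (fraction of the load difference sent to a lighter neighbor) and the function names are hypothetical. The sketch assumes an undirected topology given as symmetric neighbor lists, and a value of `alpha` small enough for the iteration to be stable.

```python
def diffusion_step(load, neighbors, alpha):
    """One round: each node sends alpha * (difference) to every lighter neighbor."""
    new_load = list(load)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if load[i] > load[j]:
                amount = alpha * (load[i] - load[j])
                new_load[i] -= amount
                new_load[j] += amount
    return new_load

def balance(load, neighbors, alpha, rounds):
    """Apply several diffusion rounds using only neighbor-to-neighbor exchanges."""
    for _ in range(rounds):
        load = diffusion_step(load, neighbors, alpha)
    return load

if __name__ == "__main__":
    ring = [[1, 3], [0, 2], [1, 3], [2, 0]]   # four nodes connected in a ring
    print(balance([100.0, 0.0, 0.0, 0.0], ring, alpha=0.25, rounds=60))
```

Each decision uses only the loads of direct neighbors, so no central coordinator is required. The value of `alpha` trades convergence speed against the volume of migrated load, which is the kind of tuning question raised above.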
Another aspect of load balancing is also addressed by our team in the context of the Neurad project. Neurad is a multidisciplinary project involving our team and some computer scientists and physicists from the University of Franche-Comté to tackle the planning of external radiotherapy against cancer. In that work, we have already proposed an original approach in which a neural network is used inside a numerical algorithm to provide radiation dose deposits in any heterogeneous environment, see [42]. The interest of the Neurad software is that it combines very short computation times (five minutes) with an accuracy close to that of the most accurate methods (Monte Carlo), whereas these accurate methods take several hours to deliver their results.
In fact, in Neurad most of the computational cost lies in the learning of the internal neural network. This is why we work on the design of a parallel learning algorithm based on domain decomposition [25]. However, as learning on the resulting subdomains may take quite different times, appropriate load balancing is required in order to get approximately the same learning times for all the subdomains. The work here is thus more focused on the decomposition strategy as well as the load estimator in the context of neural learning. We have recently proposed an efficient algorithm to perform the decomposition and the data selection of an initial learning set in order to obtain similar learning times of the induced subnetworks [20].
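The role of a load estimator in the decomposition can be sketched as follows. This is not the algorithm of [20]. It is a simple prefix-sum heuristic that assumes a one-dimensional domain whose cells each carry an estimated learning cost, and it cuts the domain into contiguous subdomains of roughly equal total estimated cost. The function name `balanced_cuts` and the cost values are hypothetical.

```python
def balanced_cuts(costs, parts):
    """Split `costs` into `parts` contiguous chunks of near-equal total cost.

    Returns a list of (start, end) index pairs with `end` exclusive.
    A cut is placed as soon as the running cost reaches the next multiple
    of total / parts, so a very expensive cell may leave a chunk empty.
    """
    total = sum(costs)
    bounds = [0]
    running = 0.0
    next_cut = 1
    for i, c in enumerate(costs):
        running += c
        while next_cut < parts and running >= total * next_cut / parts:
            bounds.append(i + 1)
            next_cut += 1
    bounds.append(len(costs))
    return list(zip(bounds[:-1], bounds[1:]))

def chunk_costs(costs, cuts):
    """Total estimated cost of each chunk, to check the balance."""
    return [sum(costs[start:end]) for start, end in cuts]

if __name__ == "__main__":
    estimated = [4, 4, 1, 1, 1, 1, 2, 2]   # hypothetical per-cell learning costs
    cuts = balanced_cuts(estimated, 4)
    print(cuts, chunk_costs(estimated, cuts))
```

The quality of such a decomposition depends entirely on how well the per-cell estimates predict the actual learning times, which is why the estimator is studied together with the decomposition strategy.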