The Hyperion system: Compiling multithreaded Java bytecode for distributed execution

RUNTIME Efficient runtime systems for parallel architectures

Distributed and High Performance Computing

Networks, Systems and Services, Distributed Computing

Laboratoire Bordelais de Recherche en Informatique (LaBRI), CNRS, Ecole nationale supérieure d'électronique, informatique et radiocommunications de Bordeaux, Université de Bordeaux

Keywords: Grid Computing, High Performance Computing, Scheduling, Grid'5000, High Performance Communication, Multithreading

Members (research centre: Bordeaux)

Raymond Namyst [French university, faculty]: University Bordeaux 1, Professor, Team Leader (HDR)

Sylvie Embolla [INRIA, assistant]

Olivier Aumage [INRIA, researcher]: Junior Researcher, INRIA

Denis Barthou [French university, faculty]: Professor, IPB (HDR)

Marie-Christine Counilh [French university, faculty]: Assistant Professor, University of Bordeaux

Alexandre Denis [INRIA, researcher]: Junior Researcher, INRIA

Brice Goglin [INRIA, researcher]: Junior Researcher, INRIA

Emmanuel Jeannot [INRIA, researcher]: Senior Researcher, INRIA (HDR)

Guillaume Mercier [French university, faculty]: Assistant Professor, IPB

Samuel Thibault [French university, faculty]: Assistant Professor, University of Bordeaux

Pierre-André Wacrenier [French university, faculty]: Assistant Professor, University of Bordeaux

Remi Sharrock [French university, postdoc]: IPB until 08/2011

Nicolas Collin [INRIA, technical staff]: Associate Engineer, INRIA, European Project grant

Nathalie Furmento [CNRS, technical staff]: Research Engineer, CNRS

Yannick Martin [INRIA, technical staff]: Associate Engineer, INRIA

Cyril Roelandt [INRIA, technical staff]: Associate Engineer, INRIA, ANR grant

Ludovic Stordeur [INRIA, technical staff]: Associate Engineer, INRIA

Ludovic Courtès [INRIA, technical staff]: Research Engineer, INRIA

François Tessier [INRIA, technical staff]: Associate Engineer, INRIA, ANR grant (until 10/2011)

Sébastien Barascou [INRIA, technical staff]: Associate Engineer, INRIA, ANR grant

Paul-Antoine Arras [French university, PhD student]: University of Bordeaux, STMicroelectronics-INRIA CIFRE

Cédric Augonnet [French university, PhD student]: University of Bordeaux, École Normale Supérieure de Lyon grant

Andres Charif-Rubial [French university, PhD student]: University of Versailles, ANR grant

Jérôme Clet-Ortega [French university, PhD student]: University of Bordeaux, MESR grant

Sylvain Henry [French university, PhD student]: University of Bordeaux, MESR grant

Andra Hugo [French university, PhD student]: University of Bordeaux, MESR grant

Julien Jaeger [French university, PhD student]: University of Versailles, ANR grant

Stéphanie Moreaud [INRIA, PhD student]: University of Bordeaux

Bertrand Putigny [INRIA, PhD student]: University of Bordeaux, INRIA grant

François Tessier [French university, PhD student]: University of Bordeaux, MESR grant (since 10/2011)

Overall Objectives

Designing Efficient Runtime Systems

Keywords: parallel, runtime, environment, heterogeneity, SMP, multicore, NUMA, HPC, high-speed networks, protocols, MPI, scheduling, thread, OpenMP, compiler optimizations

The Runtime research project takes place within the context of high-performance computing. It seeks to explore the design, implementation and evaluation of novel mechanisms needed by runtime systems for parallel computers. Runtime systems are intermediate software layers providing parallel programming environments with specific functionalities left unaddressed by the underlying operating system. Runtime systems can thus be seen as functional extensions of operating systems, but the boundary between them is rather fuzzy since runtime systems may actually contain specific extensions/enhancements to the underlying operating system (e.g. extensions to the OS thread scheduler). The increasing complexity of modern parallel hardware makes it more and more necessary to postpone essential decisions and actions (scheduling, optimizations) until run time, which emphasizes the role of runtime systems.

One of the main challenges encountered when designing modern runtime systems is to provide powerful abstractions, both at the programming interface level and at the implementation level, to deal with the increasing complexity of upcoming hardware architectures. While it is essential to understand, and somehow anticipate, the evolutions of hardware technologies (e.g. programmable network interface cards, multicore architectures, hardware accelerators), the most delicate task is to extract models and abstractions that will fit most upcoming hardware features.

The originality of the Runtime group lies in the fact that we address all these issues following a global approach, so as to propose complementary solutions to problems which may not seem to be linked at first sight. We realized, for instance, that we could greatly improve our communication optimization techniques by extending the functionalities of the underlying core thread scheduler. This illustrates why most of our research efforts have consisted in cross-studying different topics, and have led to the co-design of many pieces of software.

Our research project centers on three main directions:

Mastering large, hierarchical multiprocessor machines

  - Thread scheduling over multicore machines

  - Data management over NUMA architectures

  - Task scheduling over GPU heterogeneous machines

  - Exploring parallelism orchestration at compiler and runtime level

  - Improved interactions between optimizing compiler and runtime

Optimizing communication over high performance clusters

  - Scheduling data packets over high speed networks

  - New MPI implementations for Petascale computers

  - Optimized intra-node communication

  - Understanding network topology and application communication patterns to optimize process placement

Integrating Communications and Multithreading

  - Parallel, event-driven communication libraries

  - Communication and I/O within large multicore nodes

Besides these main research topics, we intend to work in collaboration with other research teams in order to validate our achievements by integrating our results into larger software environments (MPI, OpenMP) and to join our efforts to solve complex problems.

Among the target environments, we intend to carry on developing the successor to the PM2 software suite, which would be a kind of technological showcase to validate our new concepts on real applications through both academic and industrial collaborations (CEA/DAM, Bull, IFP, Total, Exascale Research Lab.). We also plan to port standard environments and libraries (which might be a slightly sub-optimal way of using our platform) by proposing extensions (as we already did for MPI and Pthreads) in order to disseminate our work much more widely and thus to get more substantial feedback.

Finally, as most of our work is intended to be used as a foundation for environments and programming tools exploiting large scale, high performance computing platforms, we definitely need to address the numerous scalability issues related to the huge number of cores and the deep hierarchy of memory, I/O and communication links.

Highlights

The hwloc software is used for node topology discovery and process binding by the most popular MPI implementations, including MPICH2 and Open MPI and all their derivatives such as Intel MPI.

The StarPU software is used for dynamic scheduling by the state-of-the-art dense linear algebra library Magma v1.1 (http://icl.cs.utk.edu/magma/).

Euro-Par is a major conference in parallel and distributed computing. It was held in Bordeaux from August 29 to September 2, 2011, and featured 16 topics, 25 sessions and 12 workshops. 271 papers were submitted and 81 were accepted (29.9 %). In addition, 3 invited lectures were given. 330 people registered for either the conference or the workshops. The website is http://europar2011.bordeaux.inria.fr/. The conference chairs were Emmanuel Jeannot, Raymond Namyst and Jean Roman. The institutions involved in the organization were INRIA, the LaBRI, the CNRS and others.

Scientific Foundations

Runtime Systems Evolution

This research project takes place within the context of high-performance computing. It seeks to contribute to the design and implementation of parallel runtime systems that shall serve as a basis for the implementation of high-level parallel middleware. Today, the implementation of such software (programming environments, numerical libraries, parallel language compilers, parallel virtual machines, etc.) has become so complex that the use of portable, low-level runtime systems is unavoidable.

Our research project centers on three main directions:

Mastering large, hierarchical multiprocessor machines

With the beginning of the new century, computer makers have initiated a long term move towards integrating more and more processing units, as an answer to the frequency wall hit by the technology. For scalability reasons, this integration cannot follow a basic, planar scheme beyond a couple of processing units. Instead, vendors have to resort to organizing those processing units according to some hierarchical structure. A level in the hierarchy is then materialized by small groups of units sharing some common local cache or memory bank. Memory accesses outside the locality of the group are still possible thanks to bus-level consistency mechanisms but are significantly more expensive than local accesses, which, by definition, characterizes NUMA architectures.

Thus, the task scheduler must feed an increasing number of processing units with work to execute and data to process while keeping the rate of penalized memory accesses as low as possible. False sharing, ping-pong effects, data vs task locality mismatches, and even task vs task locality mismatches between tightly synchronizing activities are examples of the numerous sources of overhead that may arise if threads and data are not distributed properly by the scheduler. To avoid these pitfalls, the scheduler needs accurate information both about the layout of the computing platform it is running on and about the structure of the application it is scheduling and the relationships between its activities.

As noted by Gao et al., we believe it is important to expose domain-specific knowledge semantics to the various software components in order to organize computation according to the application and architecture. Indeed, the whole software stack, from the application to the scheduler, should be involved in the parallelizing, scheduling and locality adaptation decisions by providing useful information to the other components. Unfortunately, most operating systems only provide a poor scheduling API that does not allow applications to transmit valuable hints to the system.

This is why we investigate new approaches in the design of thread schedulers, focusing on high-level abstractions to both model hierarchical architectures and describe the structure of applications' parallelism. In particular, we have introduced the bubble scheduling concept, which helps to structure relations between threads in a way that can be efficiently exploited by the underlying thread scheduler. Bubbles express the inherent parallel structure of multithreaded applications: they are abstractions for recursively grouping threads which "work together". We are exploring how to dynamically schedule these irregular nested sets of threads on hierarchical machines, the key challenge being to schedule related threads as closely as possible in order to benefit from cache effects and avoid NUMA penalties. We are also exploring how to improve the transfer of scheduling hints from the programming environment to the runtime system, to achieve better computation efficiency.
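As an illustration only, the bubble idea can be sketched in a few lines of Python. This is not the Marcel or BubbleSched code (which is written in C); the names `Bubble`, `Level` and `schedule`, and the simple round-robin policy, are assumptions made for the example. The sketch keeps each sub-bubble inside one subtree of the machine so that threads of the same bubble share a NUMA node.

```python
from dataclasses import dataclass, field


@dataclass
class Bubble:
    """A group of threads and nested sub-bubbles that work together."""
    name: str
    threads: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_threads(self):
        found = list(self.threads)
        for child in self.children:
            found.extend(child.all_threads())
        return found


@dataclass
class Level:
    """One node of the machine tree (machine, NUMA node, core)."""
    name: str
    children: list = field(default_factory=list)

    def cores(self):
        if not self.children:
            return [self]
        return [core for child in self.children for core in child.cores()]


def schedule(bubble, level, placement=None):
    """Hypothetical policy: map a bubble onto the subtree rooted at level."""
    placement = {} if placement is None else placement
    if not level.children:
        # Reached a core: everything left in this bubble runs here.
        for thread in bubble.all_threads():
            placement[thread] = level.name
        return placement
    # Threads owned directly by the bubble are spread over the cores below.
    cores = level.cores()
    for i, thread in enumerate(bubble.threads):
        placement[thread] = cores[i % len(cores)].name
    # Each sub-bubble is kept together inside one child subtree.
    for i, child in enumerate(bubble.children):
        schedule(child, level.children[i % len(level.children)], placement)
    return placement


machine = Level("machine", [
    Level("numa0", [Level("core0"), Level("core1")]),
    Level("numa1", [Level("core2"), Level("core3")]),
])
app = Bubble("app", children=[
    Bubble("regionA", threads=["a0", "a1"]),
    Bubble("regionB", threads=["b0", "b1"]),
])
placement = schedule(app, machine)
print(placement)
```

In this toy run the two threads of `regionA` land on the cores of `numa0` and those of `regionB` on `numa1`. A real scheduler would also react dynamically to load imbalance, which the sketch does not attempt.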

This is also the reason why we explore new languages and compiler optimizations to make better use of domain specific information. In the ANR project PetaQCD, we propose a new domain specific language, QIRAL, to generate parallel codes from high level formulations of Lattice QCD problems. QIRAL describes the formulation of the algorithms, matrices and preconditioners used in this domain, and generalizes languages such as SPIRAL, which is used in auto-tuning library generators for signal processing applications. Lattice QCD applications require a huge amount of processing power, on multinode, multicore machines with GPUs. Simulation codes require both new algorithms and efficient parallelization. So far, the difficulty of orchestrating parallelism efficiently has hindered algorithmic exploration. The objective of QIRAL is to decouple algorithm exploration from parallelism description. Compiling QIRAL uses rewriting techniques for algorithm exploration, parallelization techniques for parallel code generation and, potentially, runtime support to orchestrate this parallelism. Results of this work have been submitted for publication.

For parallel programs running on multicores, measuring reliable performance and determining performance stability is becoming a key issue: a number of hardware mechanisms may cause performance instability from one run to the next. Thread migration, memory contention (at any level of the cache hierarchy) and the scheduling policy of the runtime can introduce variation, independently of the program input. A speed-up is meaningful only if it corresponds to a performance that can be obtained through repeated executions of the application. Very few research efforts have been devoted to identifying the program optimizations, runtime policies and hardware mechanisms that may introduce performance instability. We studied performance variations on a large set of OpenMP benchmarks, identified the mechanisms causing them and showed the need for better strategies for measuring speed-ups. Following this effort, we developed, inside the MAQAO tool (Modular Assembler Quality Analyzer and Optimizer), a precise analysis of the interactions between OpenMP threads, through static analysis of binary codes and memory tracing. In particular, the influence of thread affinity is estimated and the tool proposes hints to help users improve their OpenMP codes.
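One simple way to make this point concrete is to report a speed-up together with a measure of run-to-run variation. The Python sketch below is an assumption for illustration and not the methodology used in the study or in MAQAO: it takes the median of repeated timings and flags the result as unstable when the coefficient of variation exceeds an arbitrary 5 % threshold. The function names and the sample timings are hypothetical.

```python
import statistics


def variation(times):
    """Coefficient of variation of a series of timings (stdev / mean)."""
    return statistics.stdev(times) / statistics.mean(times)


def speedup_report(seq_times, par_times, max_cv=0.05):
    """Median-based speed-up, flagged unstable if timings vary too much.

    max_cv is an arbitrary threshold chosen for this example.
    """
    cv = max(variation(seq_times), variation(par_times))
    return {
        "speedup": statistics.median(seq_times) / statistics.median(par_times),
        "cv": cv,
        "stable": cv <= max_cv,
    }


# Made-up timings in seconds: one parallel run was twice as slow,
# for instance because of thread migration.
seq_times = [10.0, 10.1, 9.9]
par_times = [2.5, 2.5, 5.0]
report = speedup_report(seq_times, par_times)
print(report)
```

With these sample values the median speed-up is 4, but the report is marked unstable, which is the situation where quoting a single best-case speed-up would be misleading.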

Aside from greedily invading all these new cores, demanding HPC applications now throw excited glances at the appealing computing power left unharvested inside graphical processing units (GPUs). A strong demand is arising from application programmers to be given means to access this power without bearing an unaffordable burden on the portability side. Efforts have already been made by the community in this respect, but the tools provided are still rather close to the hardware, if not to the metal. Hence, we decided to investigate this issue. In particular, we have designed a programming environment named StarPU that enables the programmer to offload tasks onto such heterogeneous processing units. It gives the programmer tools to fit tasks to the capabilities of the processing units and tools to efficiently manage data movements to and from the offloading hardware, and it handles the scheduling of such tasks, all in an abstracted, portable manner. The challenge here is to take into account the intricacies of all computation units: not only is the computation power heterogeneous across the machine, but data transfers themselves behave differently depending on the machine architecture and GPU capabilities, and thus have to be taken into account to get the best performance from the underlying machine. As a consequence, StarPU not only strives to fully exploit all the different computational resources at the same time, by dynamically mapping tasks according to their computation power and task behavior by means of scheduling policies, but it also provides a distributed shared-memory library that makes it possible to manipulate data across heterogeneous multicore architectures in a high-level fashion while being optimized according to the capabilities of the machine.

Optimizing communications over high performance clusters and grids

Using a large panel of mechanisms such as user-mode communications, zero-copy transactions and communication operation offload, the critical path in sending and receiving a packet over high speed networks has been drastically reduced over the years. Recent implementations of the MPI standard, which have been carefully designed to directly map basic point-to-point requests onto the underlying low-level interfaces, almost reach the same level of performance as those interfaces for very basic point-to-point messaging requests. However, more complex requests such as non-contiguous messages are left mostly unattended, and even more so are irregular and multiflow communication schemes. The intent of the work on our NewMadeleine communication engine, for instance, is to address this situation thoroughly. The NewMadeleine optimization layer delivers much better performance on complex communication schemes with negligible overhead on basic single packet point-to-point requests. Through Mad-MPI, our proof-of-concept implementation of a subset of the MPI API, we intend to show that MPI applications can also benefit from the NewMadeleine communication engine.

The increasing number of cores in cluster nodes also raises the importance of intra-node communication. Our KNem software module aims at offering optimized communication strategies for this special case and at letting the above MPI implementations benefit from dedicated models depending on process placement and hardware characteristics.

Moreover, the convergence between specialized high-speed networks and traditional Ethernet networks leads to the need to adapt former software and hardware innovations to new message-passing stacks. Our work on the Open-MX software is carried out in this context.

Regarding larger scale configurations (clusters of clusters, grids), we intend to propose new models, principles and mechanisms that should make it possible to combine communication handling, thread scheduling and I/O event monitoring on such architectures, in both a portable and efficient way. We particularly intend to study the introduction of new runtime system functionalities to ease the development of code-coupling distributed applications, while minimizing their unavoidable negative impact on the application performance.

Integrating Communications and Multithreading

Asynchronism is becoming ubiquitous in modern communication runtimes. Complex optimizations are based on online analysis of the communication schemes and on the decoupling of request submission from request processing. Flow multiplexing and transparent heterogeneous networking also imply an active role for the runtime system in submitting and processing requests. Communication overlap and reactivity are critical as well. Since the cost of a network request is in the order of several thousand CPU cycles at least, independent computations should not get blocked by an ongoing network transaction. This is even more true with the increasingly dense SMP, multicore, SMT architectures where many computing units share a few NICs. Since portability is one of the most important requirements for communication runtime systems, the usual approach to implementing asynchronous processing is to use threads (such as POSIX threads). Popular communication runtimes are indeed starting to make use of threads internally and also allow applications to be multithreaded. Low level communication libraries also make use of multithreading. Introducing threads inside communication subsystems does not come without trouble, however. The fact that multithreading is still usually optional with these runtimes is symptomatic of the difficulty of getting the benefits of multithreading in the context of networking without suffering from the potential drawbacks. We advocate the importance of cooperation between the asynchronous event management code and the thread scheduling code in order to avoid such disadvantages. We intend to propose a framework for symbiotically combining both approaches inside a new generic I/O event manager.

Application Domains

The RUNTIME group is working on the design of efficient runtime systems for parallel architectures. We are currently focusing our efforts on High Performance Computing applications that mostly implement numerical simulations in the fields of Seismology, Weather Forecasting, Energy, Mechanics or Molecular Dynamics. These time-consuming applications need so much computing power that they have to run on parallel machines composed of several thousand processors.

Because the lifetime of HPC applications often spreads over several years and because they are developed by many people, they have strong portability constraints. Thus, these applications are mostly developed on top of standard APIs (e.g. MPI for communications over distributed machines, OpenMP for shared-memory programming). That explains why we have long standing collaborations with research groups developing parallel language compilers, parallel programming environments, numerical libraries or communication software. Actually, all these “clients” are our primary target.

Although we are currently mainly working on HPC applications, many other fields may benefit from the techniques developed by our group. Since a large part of our efforts is devoted to exploiting multicore machines and GPU accelerators, many desktop applications could be parallelized using our runtime systems (e.g. 3D rendering, etc.).

Software

Common Communication Interface

Participant: Brice Goglin

The Common Communication Interface aims at offering a generic and portable programming interface for a wide range of networking technologies (Ethernet, InfiniBand, ...) and application needs (MPI, storage, low latency UDP, ...).

CCI is developed in collaboration with the Oak Ridge National Laboratory and several other academic and industrial partners.

CCI is in early development and is currently composed of 19 000 lines of C.

http://www.cci-forum.org

Hardware Locality

Participants: Brice Goglin, Samuel Thibault

Hardware Locality (hwloc) is a library and set of tools aiming at discovering and exposing the topology of machines, including processors, cores, threads, shared caches, NUMA memory nodes and I/O devices.

It builds a widely-portable abstraction of these resources and exposes it to applications so as to help them adapt their behavior to the hardware characteristics.

hwloc targets many types of high-performance computing applications, from thread scheduling to placement of MPI processes. Most existing MPI implementations, several resource managers and task schedulers already use hwloc.

hwloc is developed in collaboration with the Open MPI project. The core development is still mostly performed by Brice Goglin and Samuel Thibault from the Runtime team-project, but many outside contributors are joining the effort, especially from the Open MPI and MPICH2 communities.

hwloc is composed of 33 000 lines of C.

http://runtime.bordeaux.inria.fr/hwloc/

KNem

Participants: Brice Goglin, Stéphanie Moreaud

KNem (Kernel Nemesis) is a Linux kernel module that offers high-performance data transfer between user-space processes.

KNem offers a very simple message passing interface that may be used when transferring very large messages within point-to-point or collective MPI operations between processes on the same node.

Thanks to its kernel-based design, KNem is able to transfer messages through a single memory copy, much faster than the usual user-space two-copy model.
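The difference between the two models can be illustrated by counting the bytes moved in each case. The Python sketch below is a toy model, not KNem itself (a Linux kernel module written in C) and not its interface: `two_copy` mimics a user-space transfer through a bounded shared buffer, `single_copy` mimics a direct copy from the sender's memory to the receiver's memory. Both function names and the 4096-byte chunk size are assumptions.

```python
def two_copy(src, chunk=4096):
    """User-space model: sender copies into a shared buffer, receiver copies out."""
    shared = bytearray(chunk)
    dst = bytearray(len(src))
    copied = 0
    for off in range(0, len(src), chunk):
        n = min(chunk, len(src) - off)
        shared[:n] = src[off:off + n]    # first copy: sender to shared buffer
        dst[off:off + n] = shared[:n]    # second copy: shared buffer to receiver
        copied += 2 * n
    return dst, copied


def single_copy(src):
    """Kernel-assisted model: one direct copy between the two address spaces."""
    dst = bytearray(src)
    return dst, len(src)


message = bytes(range(256)) * 40         # 10240 bytes
dst2, bytes_moved_two = two_copy(message)
dst1, bytes_moved_one = single_copy(message)
print(bytes_moved_two, bytes_moved_one)
```

Both paths deliver the same data, but the two-copy path moves twice as many bytes, which is why the gain matters most for very large messages.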

KNem also offers the optional ability to offload memory copies onto Intel I/OAT hardware, which improves throughput and reduces CPU consumption and cache pollution.

KNem is developed in collaboration with the MPICH2 team at the Argonne National Laboratory and the Open MPI project. These partners have already released KNem support as part of their MPI implementations.

KNem is composed of 7000 lines of C. Its main contributor is Brice Goglin.

http://runtime.bordeaux.inria.fr/knem/

Marcel

Participants: Olivier Aumage, Yannick Martin, Samuel Thibault

Marcel is the two-level thread scheduler (also called N:M scheduler) of the PM2 software suite.

The architecture of Marcel was carefully designed to support a large number of threads and to efficiently exploit hierarchical architectures (e.g. multicore chips, NUMA machines).

Marcel provides a seed construct, which can be seen as a precursor of a thread. It is only when the time comes to actually run the seed that Marcel attempts to reuse the resources and the context of another, dying thread, significantly reducing management costs.

In addition to a set of original extensions, Marcel provides a POSIX-compliant interface, so that unmodified applications or parallel programming environments can take advantage of it simply by being recompiled (API compatibility), or even by running already-compiled binaries through the Linux NPTL ABI compatibility layer.

For debugging purposes, a trace of the scheduling events can be recorded and used after execution to generate an animated movie showing a replay of the execution.

The Marcel thread scheduling library is made of 80 000 lines of code.

http://runtime.bordeaux.inria.fr/marcel/

Marcel has been supported for 2 years (2009-2011) by the INRIA ADT Visimar.

ForestGOMP

Participants: Olivier Aumage, Yannick Martin, Pierre-André Wacrenier

ForestGOMP is an OpenMP environment based on both the GNU OpenMP run-time and the Marcel thread library.

It is designed to efficiently schedule nested sets of threads (derived from nested parallel regions) over hierarchical architectures so as to minimize cache misses and NUMA penalties.

The ForestGOMP runtime generates nested Marcel bubbles each time an OpenMP parallel region is encountered, thereby grouping threads sharing common data.

Topology-aware scheduling policies implemented by BubbleSched can then be used to dynamically map bubbles onto the various levels of the underlying hierarchical architecture.

ForestGOMP allowed us to validate the BubbleSched approach with highly irregular, fine grain, divide-and-conquer parallel applications.

http://runtime.bordeaux.inria.fr/forestgomp/

Open-MX

Participants: Brice Goglin, Ludovic Stordeur

The Open-MX software stack is a high-performance message passing implementation for any generic Ethernet interface.

It was developed within our collaboration with Myricom, Inc. as a part of the move towards the convergence between high-speed interconnects and generic networks.

Open-MX exposes the raw Ethernet performance at the application level through a pure message passing protocol.

While the goal is similar to that of the old GAMMA stack or the recent iWarp implementations, Open-MX relies on generic hardware and drivers and has been designed for message passing.

Open-MX is also wire-compatible with the Myricom MX protocol and interface, so that any application built for MX may run on any machine without Myricom hardware and talk to other nodes running with or without the native MX stack.

Open-MX is also an interesting framework for studying next-generation hardware features that could help Ethernet hardware become legacy in the context of high-performance computing. Some innovative message-passing-aware stateless abilities, such as multiqueue binding and interrupt coalescing, were designed and evaluated thanks to Open-MX.

Brice Goglin is the main contributor to Open-MX. The software is already composed of more than 45 000 lines of code in the Linux kernel and in user-space.

http://open-mx.org/

StarPU

Participants: Cédric Augonnet, Nicolas Collin, Nathalie Furmento, Cyril Roelandt, Samuel Thibault, Ludovic Courtès

StarPU permits high performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors.

StarPU offers a unified offloadable task abstraction named codelet. If a codelet may run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs).

StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine. A high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource.

StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time.

StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.
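The codelet abstraction and model-based scheduling can be sketched as follows. This Python example is not the StarPU API (StarPU is a C library) and does not reproduce its scheduling policies: `Codelet`, `Worker` and `submit` are hypothetical names, the per-architecture cost estimates are made up rather than auto-tuned, and the "gpu" function is a plain Python stand-in for an accelerator kernel. The policy shown simply picks the worker with the earliest expected completion time.

```python
from dataclasses import dataclass


@dataclass
class Codelet:
    """A task type with one implementation per architecture."""
    name: str
    impls: dict       # architecture name -> callable
    perf_model: dict  # architecture name -> estimated seconds per task (assumed)


@dataclass
class Worker:
    name: str
    arch: str
    ready_at: float = 0.0  # simulated time at which the worker becomes idle


def submit(codelet, workers, *args):
    """Run the codelet on the worker expected to finish it first."""
    candidates = [w for w in workers if w.arch in codelet.impls]
    best = min(candidates, key=lambda w: w.ready_at + codelet.perf_model[w.arch])
    best.ready_at += codelet.perf_model[best.arch]
    return best.name, codelet.impls[best.arch](*args)


def scale_cpu(vector, factor):
    return [x * factor for x in vector]


def scale_gpu(vector, factor):
    # Stand-in for an accelerator kernel; plain Python in this sketch.
    return list(map(lambda x: x * factor, vector))


scale = Codelet("scale",
                impls={"cpu": scale_cpu, "gpu": scale_gpu},
                perf_model={"cpu": 3.5, "gpu": 1.0})
workers = [Worker("cpu0", "cpu"), Worker("gpu0", "gpu")]
runs = [submit(scale, workers, [1, 2, 3], 2) for _ in range(5)]
chosen = [name for name, _ in runs]
print(chosen)
```

In this toy run the faster unit takes most tasks, but the slower one still receives a task once the fast unit's queue makes it the later finisher, so both resources are used at the same time.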

StarPU can also leverage existing parallel implementations, by supporting parallel tasks, which can be run concurrently over the machine.

StarPU provides a reduction mode, which makes it possible to further optimize data management when results have to be reduced.

StarPU provides integration in MPI clusters through a lightweight DSM over MPI.

StarPU comes with a plug-in for the GNU Compiler Collection (GCC), which extends languages of the C family with syntactic devices to describe StarPU's main programming concepts in a concise, high-level way.

http://runtime.bordeaux.inria.fr/StarPU/

NewMadeleine

Participants: Alexandre Denis, François Trahay, Raymond Namyst

NewMadeleine is a communication library for high performance networks, based on a modular architecture using software components.

The NewMadeleine optimizing scheduler aims at enabling the use of a much wider range of communication flow optimization techniques such as packet reordering or cross-flow packet aggregation.

NewMadeleine targets applications with irregular, multiflow communication schemes, such as those found in the increasingly common application conglomerates made of multiple programming environments and coupled pieces of code.

It is designed to be programmable through the concept of optimization strategies, which allows experimenting with multiple approaches to processing communication flows, built from basic operations on communication flows such as packet merging or reordering.
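To give an idea of what an aggregation strategy does, the following Python sketch packs pending packets from several flows into larger frames. It is a minimal illustration under stated assumptions, not the NewMadeleine strategy interface: the function name `aggregate`, the `(flow_id, payload)` packet representation and the 1500-byte frame limit are all hypothetical, and header overhead is ignored.

```python
def aggregate(pending, max_frame=1500):
    """Greedy cross-flow aggregation.

    pending: list of (flow_id, payload) in submission order.
    Returns a list of frames, each a list of packets whose payloads
    total at most max_frame bytes. A payload larger than max_frame
    is sent alone in its own frame.
    """
    frames, current, size = [], [], 0
    for flow, payload in pending:
        if current and size + len(payload) > max_frame:
            frames.append(current)       # flush the frame being built
            current, size = [], 0
        current.append((flow, payload))
        size += len(payload)
    if current:
        frames.append(current)
    return frames


pending = [
    ("A", b"x" * 400),
    ("B", b"y" * 600),
    ("A", b"z" * 800),
    ("C", b"w" * 100),
]
frames = aggregate(pending)
print([[flow for flow, _ in frame] for frame in frames])
```

Four requests from three flows are sent as two network transactions here. A reordering strategy could go further, for example by letting the small packet of flow C jump ahead to fill the first frame.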

The reference development branch of the NewMadeleine software consists of 90 000 lines of code. NewMadeleine is available on various networking technologies: Myrinet, InfiniBand, Quadrics and Ethernet. It is developed and maintained by Alexandre Denis.

http://runtime.bordeaux.inria.fr/newmadeleine/

PadicoTM

Participant: Alexandre Denis

PadicoTM is a high-performance communication framework for grids. It is designed to enable various middleware systems (such as CORBA, MPI, SOAP, JVM, DSM, etc.) to utilize the networking technologies found on grids.

PadicoTM aims at decoupling middleware systems from the various networking resources to reach transparent portability and flexibility.

The PadicoTM architecture is based on software components. Puk (the PadicoTM micro-kernel) implements a light-weight high-performance component model that is used to build communication stacks.

The PadicoTM component model is now used in NewMadeleine. It is the cornerstone for networking integration in the ANR projects "LEGO" and "COOP".

PadicoTM is composed of roughly 60 000 lines of C.

PadicoTM is registered at the APP under number IDDN.FR.001.260013.000.S.P.2002.000.10000.

http://runtime.bordeaux.inria.fr/PadicoTM/

MAQAO

Participants: Denis Barthou, Andres Charif-Rubial

MAQAO is a performance tuning tool for OpenMP parallel applications. It relies on the static analysis of binary codes and the collection of dynamic information (such as memory traces). It provides hints to the user about performance bottlenecks and possible workarounds.

MAQAO works on binary codes and inserts instrumentation probes directly inside the binary. There is no need to recompile. The combined static/dynamic approach of the MAQAO analysis is the main originality of the tool: it combines a performance model with values collected through instrumentation.

MAQAO has a static performance model for the x86 and Itanium architectures. This model analyzes the performance of the predecoder, of the decoder and of the different pipelines of the x86 architecture, in particular for SSE instructions.

The dynamic collection of data in MAQAO enables the analysis of thread interactions, such as false sharing, amount of data reuse, runtime scheduling policy, ...

MAQAO is part of the ANR project "ProHMPT". A demo of MAQAO was given in Jan. 2010 at the SME/INRIA days and in Nov. 2010 at SuperComputing (INRIA booth).

http://www.maqao.org/

QIRAL Denis Barthou

QIRAL is a high-level language (expressed through LaTeX) that is used to describe Lattice QCD problems. It describes matrix formulations, domain-specific properties of preconditionings, and algorithms.

The compiler chain for QIRAL can combine algorithms and preconditionings, checking the validity of the composition automatically. It generates OpenMP parallel code, using libraries such as BLAS.

This code is developed in collaboration with other teams participating in the ANR PetaQCD project.

TreeMatch Emmanuel Jeannot Guillaume Mercier

TreeMatch is a library for performing process placement based on the topology of the machine and the communication pattern of the application.

TreeMatch provides a permutation of the processes onto the processors/cores in order to minimize the communication cost of the application.

Important features are: the number of processors can be higher than the number of processes; it assumes that the topology is a tree and does not require a valuation of the topology (e.g. communication speed); it implements different placement algorithms that are switched according to the input size.

TreeMatch is implemented as a load balancer in Charm++ and as a tool for performing rank reordering in Open MPI and MPICH2.
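To make the placement objective concrete, the following minimal sketch searches for a process-to-core permutation on a two-level tree (cores grouped under nodes) that minimizes the communication volume crossing node boundaries. It is an exhaustive search for illustration only and is not TreeMatch's algorithm; the names `placement_cost` and `best_placement` and the sample communication matrix are hypothetical.

```python
from itertools import permutations

def placement_cost(perm, comm, cores_per_node):
    """Sum of communication volume between processes placed on different nodes.

    perm[p] is the core assigned to process p; cores c and d share a node
    when c // cores_per_node == d // cores_per_node.
    """
    cost = 0
    n = len(perm)
    for p in range(n):
        for q in range(p + 1, n):
            if perm[p] // cores_per_node != perm[q] // cores_per_node:
                cost += comm[p][q]
    return cost

def best_placement(comm, cores_per_node):
    """Exhaustive search; only practical for a handful of processes."""
    n = len(comm)
    return min(permutations(range(n)),
               key=lambda perm: placement_cost(perm, comm, cores_per_node))

# Hypothetical symmetric communication matrix for 4 processes:
# processes 0 and 2 exchange a lot of data, as do processes 1 and 3.
comm = [
    [0, 1, 9, 1],
    [1, 0, 1, 9],
    [9, 1, 0, 1],
    [1, 9, 1, 0],
]
perm = best_placement(comm, cores_per_node=2)
```

With this matrix, the search co-locates processes 0 and 2 on one node and processes 1 and 3 on the other, so only the low-volume pairs cross the node boundary.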

New Results High-Performance Intra-node Collective Operations Brice Goglin Stéphanie Moreaud

KNem is known to improve the performance of point-to-point intra-node MPI communication significantly.

We designed an extended RMA interface in KNem that suits the needs of point-to-point, collective and RMA operations.

We showed that the native use of KNem in MPI collective implementations enables further optimization by combining knowledge of the collective algorithms with fine control over KNem region management and copies.

This work was initiated in the context of our collaboration with the MPICH2 team and is now also pursued within the Open MPI project in collaboration with the University of Tennessee in Knoxville.

I/O-Affinity-aware MPI Communications Brice Goglin Stéphanie Moreaud

We demonstrated in the past that the locality of I/O devices within modern computing nodes has a significant impact on MPI communication performance (Non-Uniform I/O Access, NUIOA).

A first way to deal with such affinities would be to favor I/O-intensive processes by placing them near the network interfaces. However, determining how communication-intensive a process is may be tricky. Also, some applications have uniform communication patterns. The other way to deal with I/O affinities is to modify the implementation of communication operations given a predetermined task placement.

We demonstrated that the implementation of collective operations should take I/O affinities into account. Deciding which steps and leaders should be involved in the algorithms based on NUIOA effects led us to improve broadcast performance significantly.
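As an illustration of NUIOA-aware leader selection, the sketch below picks, among the ranks of one node, the rank whose NUMA node is closest to the network interface. This is a simplified assumption about how a leader could be chosen, not the implementation used in this work; `pick_leader`, the distance matrix and the rank-to-NUMA mapping are hypothetical.

```python
def pick_leader(local_ranks, numa_of_rank, nic_numa, distance):
    """Return the local rank whose NUMA node is closest to the NIC.

    Ties are broken by the lowest rank so that every process on the node
    computes the same leader without communicating.
    """
    return min(local_ranks,
               key=lambda r: (distance[numa_of_rank[r]][nic_numa], r))

# Hypothetical 2-socket node: the NIC is attached to NUMA node 1.
distance = [[10, 20],
            [20, 10]]
numa_of_rank = {0: 0, 1: 0, 2: 1, 3: 1}
leader = pick_leader([0, 1, 2, 3], numa_of_rank, nic_numa=1, distance=distance)
```

Here ranks 2 and 3 sit next to the network interface, and the tie-break selects rank 2 as the leader for the inter-node step of the broadcast.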

High-Performance Point-to-Point Communications Alexandre Denis Raymond Namyst

NewMadeleine is our communication library designed for high-performance networks in clusters. We have worked on optimizations of low-level protocols so as to improve point-to-point performance.

We have proposed auto-tuning mechanisms for most parameters of a communication library: rendez-vous threshold, multi-rail ratio, optimization strategies.
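As a rough illustration of tuning one such parameter, the sketch below selects a rendez-vous threshold as the smallest sampled message size for which a rendez-vous transfer is faster than an eager one. The cost models and constants are assumptions standing in for real measurements, and `tune_rendezvous_threshold` is a hypothetical helper, not part of NewMadeleine.

```python
def tune_rendezvous_threshold(sizes, time_eager, time_rdv):
    """Return the smallest sampled size at which rendez-vous beats eager.

    time_eager and time_rdv are callables returning a (measured or
    modeled) transfer time in seconds for a message size in bytes.
    Returns None if eager wins for every sampled size.
    """
    for size in sorted(sizes):
        if time_rdv(size) < time_eager(size):
            return size
    return None

# Hypothetical cost models standing in for real measurements.
LATENCY = 2e-6        # one-way latency in seconds (assumed)
NET_BW = 1.0e9        # network bandwidth in bytes/s (assumed)
COPY_BW = 4.0e9       # memory copy bandwidth in bytes/s (assumed)

def eager(size):
    # One message, plus an extra copy out of the receive buffer.
    return LATENCY + size / NET_BW + size / COPY_BW

def rendezvous(size):
    # Request and acknowledgement before the zero-copy payload transfer.
    return 3 * LATENCY + size / NET_BW

sizes = [2 ** k for k in range(6, 21)]   # 64 B to 1 MiB
threshold = tune_rendezvous_threshold(sizes, eager, rendezvous)
```

Under these assumed models the crossover lies where the extra copy costs more than the handshake, that is, for sizes above `2 * LATENCY * COPY_BW` bytes; an actual auto-tuner would replace the two model functions with benchmark timings.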

We have proposed a communication protocol for InfiniBand that completely amortizes the cost of memory registration, through the use of a superpipeline that overlaps communication and memory copies. We have modeled the behavior of the network and proposed an auto-tuning mechanism to adapt the protocol to the hardware properties.

Improve code-coupling performance in the SALOME platform Alexandre Denis Sébastien Barascou

The SALOME platform is open-source software developed by EDF, CEA, and OpenCascade. It is an open simulation platform with pre-processing, post-processing, interoperability with CAD models, and integration with computation kernels.

YACS is the workflow engine used for code coupling applications in SALOME. It leverages CORBA for communications between kernels. We have ported YACS atop PadicoTM, our communication platform for grids. It enables CORBA connections to use InfiniBand networks. Benchmarks show a significant improvement in code coupling performance.

Hardware topology-aware MPI applications Emmanuel Jeannot Guillaume Mercier François Tessier

We have expanded our previous work dealing with MPI process placement. Our previous approach relied on tools and techniques that were outside the scope of the MPI standard itself. In order to allow users to benefit from our work in a portable way, we enhanced some routines of the MPI standard. We worked mainly with the MPICH2 implementation, and we are also working on an Open MPI version.

Instead of modifying the binding of the MPI processes onto the physical cores of the underlying architecture, we chose to create a new communicator whose logical topology organization is optimized for the hardware. This work has been published and shows interesting performance improvements for some classes of MPI applications.

The problem of process placement, which can be reduced to an NP-hard graph partitioning problem, can be addressed with several well-known tools such as Scotch or ParMETIS. To compare these solutions with TreeMatch, we ran several benchmarks using the NAS Parallel Benchmarks and a real CFD application. We study, on the one hand, the quality of the process permutation (which impacts the execution time) and, on the other hand, the time needed to compute the permutation. These results will allow us to conclude which graph partitioners are relevant for binding processes to processing units or for dynamic process reordering.

Mastering Heterogeneous Platforms Cédric Augonnet Olivier Aumage Ludovic Courtès Nathalie Furmento Andra Hugo Raymond Namyst Samuel Thibault Pierre-André Wacrenier

We continued our work on extending StarPU to master the exploitation of heterogeneous platforms.

We have extended the StarPU scheduler to manage parallel tasks, which permit a better exploitation of CPUs and better load balancing with GPUs.

We have designed on top of StarPU a lightweight DSM over MPI, which makes it possible to seamlessly execute StarPU applications over an MPI cluster of GPU-enhanced nodes.

We have been developing a GCC plug-in which extends the C language with pragmas and attributes that make writing StarPU applications a lot easier.

We have added to StarPU support for automatically converting data between CPU and GPU formats (typically arrays of structures vs. structures of arrays). We are now optimizing it.
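The two layouts mentioned above can be illustrated with a minimal sketch. It only shows the data transformation itself, on plain Python containers; the function names are hypothetical and unrelated to the StarPU API.

```python
def aos_to_soa(records, fields):
    """Convert a list of per-element records into one list per field."""
    return {f: [rec[f] for rec in records] for f in fields}

def soa_to_aos(columns):
    """Inverse conversion: rebuild one record per element."""
    fields = list(columns)
    length = len(columns[fields[0]]) if fields else 0
    return [{f: columns[f][i] for f in fields} for i in range(length)]

# Array of structures: one record per point, convenient for CPU code.
points_aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
# Structure of arrays: one contiguous column per field, which typically
# suits coalesced memory accesses on GPUs.
points_soa = aos_to_soa(points_aos, ["x", "y"])
```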

We have added an OpenCL interface to StarPU, SOCL, which makes it possible to execute unmodified OpenCL applications over StarPU.

We have introduced theoretical bound support in StarPU: from a record of the set of tasks submitted by the application, StarPU uses linear programming to compute the execution time of an ideal schedule, which can then be compared with the actual results.
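For intuition, the sketch below computes such a bound in the simplest special case: independent tasks of a single kind, where the relaxed linear program has a closed-form solution. It ignores dependencies and data transfers, and it is not StarPU's implementation; `ideal_makespan` and all timings are hypothetical.

```python
def ideal_makespan(num_tasks, worker_task_times):
    """Lower bound on execution time for identical, independent tasks.

    worker_task_times[w] is the time worker w needs for one task.
    Relaxing task counts to real numbers, the LP
        minimize T  subject to  sum_w n_w = num_tasks,  n_w * t_w <= T
    is solved by n_w = T / t_w, hence T = num_tasks / sum_w (1 / t_w).
    """
    throughput = sum(1.0 / t for t in worker_task_times)
    return num_tasks / throughput

# Hypothetical machine: 3 CPU cores at 4 ms per task, 1 GPU at 1 ms per task.
bound = ideal_makespan(1000, [4e-3, 4e-3, 4e-3, 1e-3])
measured = 0.65   # hypothetical measured run, in seconds
efficiency = bound / measured
```

The ratio between the bound and the measured time gives a rough indication of how far a given scheduling policy is from the ideal for that workload. With several kinds of tasks, the bound requires solving a general linear program rather than this closed form.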

We have continued our collaboration with the University of Tennessee, Knoxville on StarPU support in the state-of-the-art dense linear algebra library Magma, in particular for LU and QR factorizations. We have also collaborated with the University of Mons and Linköping.

Cédric Augonnet defended his PhD on StarPU.

Development of a flexible heterogeneous system-on-chip platform using a mix of programmable processing elements and hardware accelerators Paul-Antoine Arras Emmanuel Jeannot Samuel Thibault

Today's embedded applications are increasingly demanding in terms of computational power, especially in real-time digital signal processing (DSP) where tight timing requirements are to be fulfilled. More specifically, when it comes to video decoding (e.g. H.264/AVC), not only has it been almost impossible for some time to run such codecs on a stand-alone embedded processor, but it is now also becoming quite impractical to execute them on homogeneous multicore platforms. In this context, STMicroelectronics is developing a scalable heterogeneous system-on-chip template called P2012, aimed at meeting the latest codecs' requirements.

This year, research focused primarily on dataflow-based models, which benefit from such strong, well-known properties as analyzability, schedulability and expressivity. Furthermore, dataflow programming has already been used extensively in DSP, yielding a number of dedicated software synthesis tools. We have proposed a first version of the programming model that will be evaluated later.

Sparse GMRES on heterogeneous platforms in oil extraction simulation Olivier Aumage Corentin Rossignon Samuel Thibault

We started a study on sparse matrix factorization and system resolution on heterogeneous platforms in collaboration with Pascal Hénon from the company Total, in the context of oil extraction simulation. Sparse matrix computations are notoriously difficult to run efficiently on heterogeneous platforms in the general case, due to the irregular memory access patterns they generate.

However, in the specific context of this study, Corentin Rossignon showed as part of his Master's thesis that the sparsity layout of matrices generated by such oil extraction simulation problems can lead to a much higher level of efficiency on heterogeneous platforms when using a suitable sparse internal representation together with carefully written operators, such as the sparse matrix-vector product, and the StarPU heterogeneous scheduler.

Corentin Rossignon is now starting a PhD thesis in partnership with Total to build on these promising results.

Programming models for heterogeneous platforms Olivier Aumage Cyril Roelandt Samuel Thibault Ludovic Courtès

As part of Project FP3C with Japan, we started a study to explore the use of StarPU as a possible target runtime system for the XcalableMP language and compiler developed by Prof. Sato's team from the University of Tsukuba. XcalableMP is a pragma-based language designed for parallelizing applications on clusters of multicore processors. The compiler is responsible for expanding XcalableMP pragmas into complex work mapping, communication and data redistribution commands.

The study of porting XcalableMP on top of StarPU was conducted by Cyril Roelandt during his Master's thesis, starting from the idea that a computing node with one or more attached accelerator expansion cards can be seen as a distributed platform. The results of the study showed that the expressive power of the XcalableMP language itself is very interesting for the goal of simplifying the port of applications to heterogeneous platforms. However, a current assumption of the XcalableMP model is that the compiler does not insert implicit commands and behaviour except at the exact location of pragma annotations, which limits the range of optimizations available to the dynamic scheduler and memory manager of StarPU. We will thus continue to collaborate with Prof. Sato's team within FP3C to see how these limitations could be reduced or lifted when using XcalableMP with StarPU.

In an effort to make it easier for C programmers to benefit from StarPU, the project-team has been working on extensions to the C language allowing important StarPU concepts to be expressed concisely. These C extensions are provided as a plug-in for the GNU Compiler Collection (GCC; see http://gcc.gnu.org/ for more information), which is now distributed as part of StarPU.

The GCC plug-in extends the syntax and semantics of C and related languages (C++, Objective-C) using attributes and pragmas. Attributes are used, for instance, to declare StarPU tasks and their implementations for the available targets (CPU, OpenCL, CUDA, etc.). Pragmas are used notably to provide programmers with a way to describe data buffers that are passed to tasks, which in turn allows the StarPU run-time support to manage data transfers between main memory and GPUs as it sees fit. Finally, tasks are invoked like regular C functions.

In addition to easing application development, the GCC plug-in, thanks to its higher-level view of the program structure, is able to report certain classes of errors at compile-time, which would otherwise lead to run-time errors.

This project has been led by Ludovic Courtès of Inria's Development and Experimentation Department (SED) at Bordeaux, as part of a joint development action with the SED.

Parallel Concha Olivier Aumage Marie-Christine Counilh

Within the ADT Ampli project, we contributed to the Concha CFD library developed by R. Becker's Inria Team Concha in Pau. Together with R. Becker, E. Bergounioux and D. Trujillo from the Concha Team, and François Rue from SED Bordeaux, we designed and experimented with the MPI parallelization and the hybrid MPI+OpenMP parallelization of the library.

The MPI parallelization is now finalized. The OpenMP level has been successfully tested on the Vanka smoother and is now being extended to the rest of the library. We will thus continue to contribute to this parallelization work, in particular with respect to the support of 3D simulation cases.

Scientific Application Analysis and Experiments Olivier Aumage Denis Barthou Andres Charif-Rubial François Tessier Ludovic Stordeur

Within the context of the ANR ProHMPT project, we contributed a thorough analysis of hot spots, data structure usages and locality issues in memory accesses of an aerodynamics application from partner CEA CESTA.

In accordance with these results, a new version of this application has been written by the CESTA Team with redesigned, locality-friendly data structures and a simplified loop scheme. This new version performs much better than the previous one on both 2D and 3D cases.

We also conducted tests on porting selected kernels of this application to accelerated heterogeneous platforms. The results of these tests were disappointing with the first version of the application, because the layout of the main data structures led to many memory transfers between the main memory and the accelerator memory.

We are now working on conducting these experiments with the redesigned version of the application whose new data structures should dramatically reduce the amount of data transfers.

Virtualization of GPUs for OpenCL Sylvain Henry Alexandre Denis Denis Barthou

We propose a new approach for OpenCL programming, using a single virtual accelerator instead of the physical accelerators. Placement on the real hardware is handled by the runtime instead of the user, improving productivity and performance scalability. This proposal relies on the OpenCL standard but changes the way its API is used.

We have shown on some simple examples how this approach, using StarPU as a runtime, enables executions with a better load balance and performance. We are working on how to generalize this to more complex benchmarks. This work has been presented at the Renpar workshop.

Automatically Adapting Task Grain for Hybrid Architectures Sylvain Henry Alexandre Denis Denis Barthou

Given a parallel task graph, a runtime such as StarPU can place each task on different hardware. However, the number of tasks and their granularity still need to be adapted to the target hardware. On architectures combining CPUs and GPUs, it is potentially interesting to have tasks of different granularities. We explore transformations that either automatically split tasks into smaller ones or, given some user knowledge about the tasks, decide how and when to split a large task into smaller ones.

This work starts from a high-level representation of the code, using an explicit data-flow graph.

Performance modeling for power consumption reduction on the SCC Bertrand Putigny Brice Goglin Denis Barthou

We build a model to predict the performance of HPC code on the SCC chip. This model can predict the runtime of regular code as well as its power consumption at different frequencies.

This allows users to choose whether to optimize power consumption, power efficiency or raw performance.

This work has been published in an Intel symposium.

Modeling cache coherence protocol overhead Bertrand Putigny Denis Barthou Brice Goglin

We are building a fine-grained cache model to understand common cache coherence issues.

This model is built on a set of micro-benchmarks and can also be used to find bottlenecks in memory-bound code. Our set of micro-benchmarks can also be used as a test bed for new architectures.

Memory Performance Analysis and Tool for OpenMP codes Andres Charif-Rubial Denis Barthou

We propose a performance analysis of OpenMP codes, based on memory accesses and cache hierarchies.

This analysis relies on memory traces for multi-threaded codes and on static analysis of binary code. Memory traces are obtained through MAQAO by static binary rewriting and are compressed online, building polyhedral iteration domains. The static analysis, mostly induction variable detection on binary code, provides the same analysis whenever possible, removing the need in some cases for dynamic instrumentation.
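The idea of compressing a trace into regular access patterns can be illustrated in one dimension: a sequence of addresses with a constant stride folds into a single (base, stride, count) triple. The sketch below is a simplified, hypothetical stand-in for the polyhedral compression described above, not MAQAO's algorithm.

```python
def compress_trace(addresses):
    """Greedily fold an address trace into (base, stride, count) runs."""
    runs = []
    i = 0
    n = len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 == n:
            # A single trailing access forms a run of length one.
            runs.append((base, 0, 1))
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while consecutive accesses keep the same stride.
        while (i + count < n
               and addresses[i + count] - addresses[i + count - 1] == stride):
            count += 1
        runs.append((base, stride, count))
        i += count
    return runs

def expand_runs(runs):
    """Rebuild the original trace, to check that compression is lossless."""
    return [base + k * stride
            for base, stride, count in runs
            for k in range(count)]

# Hypothetical trace: 100 accesses with stride 8, then 50 with stride 64.
trace = ([0x1000 + 8 * k for k in range(100)]
         + [0x8000 + 64 * k for k in range(50)])
runs = compress_trace(trace)
```

The 150 recorded addresses collapse into two triples, which is the kind of compact description that an induction-variable analysis could also derive statically for regular loops.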

The analysis focuses on a number of issues in multi-threaded executions: thread affinity issues, false sharing, cache pollution.
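As an example of one of these issues, false sharing can be flagged from a memory trace by looking for cache lines that are written by one thread and accessed by another thread at a different address. The sketch below assumes a 64-byte line and a hypothetical trace format of (thread, address, is_write) tuples; it illustrates the principle only.

```python
from collections import defaultdict

CACHE_LINE = 64  # bytes; assumed line size

def find_false_sharing(trace, line_size=CACHE_LINE):
    """Return cache lines written by one thread and touched by another
    thread at a different address.

    trace is an iterable of (thread_id, address, is_write) tuples.
    """
    per_line = defaultdict(list)
    for thread, addr, is_write in trace:
        per_line[addr // line_size].append((thread, addr, is_write))

    suspects = set()
    for line, accesses in per_line.items():
        for t1, a1, w1 in accesses:
            for t2, a2, w2 in accesses:
                if t1 != t2 and a1 != a2 and (w1 or w2):
                    suspects.add(line)
    return sorted(suspects)

trace = [
    (0, 0x1000, True),   # thread 0 updates its own counter
    (1, 0x1008, True),   # thread 1 updates a neighbour in the same line
    (0, 0x2000, True),   # line used by a single thread
    (1, 0x3000, False),  # read-only sharing of the same address
    (0, 0x3000, False),
]
lines = find_false_sharing(trace)
```

Only the first line is reported: the second is private to one thread, and the third is true, read-only sharing of a single address.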

This work is in collaboration with Exascale Computing Lab.

Data-layout Optimization for Stencil codes on multi-cores and GPUs Julien Jaeger Denis Barthou

We develop a new approach for stencil code generation, optimizing data-layout for multi-threaded, SIMD code on multicores and CUDA code on GPU. The transformation handles different stencil parameters, and memory constraints.

The generated code reaches high levels of performance, outperforming related work on multicores and achieving similar performance on GPUs. This work has been submitted for publication and was first presented in a workshop.

Contracts and Grants with Industry

Contracts with Industry

EDF R&D

We participate in a contract between INRIA and EDF R&D which was granted 6-month funding (Apr. – Sept. 2011). It aims at optimizing the communications of YACS, the workflow engine of the SALOME simulation platform, using our PadicoTM communication framework.

STMicroelectronics

STMicroelectronics is funding the CIFRE PhD thesis of Paul-Antoine Arras on the development of a flexible heterogeneous system-on-chip platform using a mix of programmable processing elements and hardware accelerators, from October 2011 to October 2014.

Total

Total funded a study (Apr. – Sept. 2011) on porting sparse matrix computations and system resolution to heterogeneous platforms for oil extraction simulations.

Partnerships and Cooperations

National Initiatives

COOP

We participate in a research project of the ANR Cosinus program called “COOP”, which was granted three-year funding (Dec. 2009 – Dec. 2012). It aims at establishing generic cooperation mechanisms between resource management, runtime systems, and application programming frameworks to simplify programming models and improve performance through adaptation to the resources. It involves academic partners and EDF R&D. (http://coop.gforge.inria.fr/)

FP3C

We participate in the joint ANR-JST project FP3C (Framework and Programming for Post Petascale Computing). The goal of this project is to contribute to establishing software technologies, languages and programming models to explore extreme performance computing beyond petascale computing, on the road to exascale computing.

ProHMPT

Cédric Augonnet Olivier Aumage Denis Barthou Andres Charif-Rubial Jérôme Clet-Ortega Nathalie Furmento Raymond Namyst Ludovic Stordeur François Tessier Samuel Thibault Pierre-André Wacrenier

We lead a research project of the ANR Cosinus program called “ProHMPT”, which was granted three-year funding (Jan. 2009 – Jun. 2012). It aims at focusing the joint research work of several teams on compilers, runtimes and libraries for programming heterogeneous platforms such as GPUs and accelerators. It involves academic partners, companies (Bull, CAPS entreprise) and CEA teams. Olivier Aumage is the head of the ANR ProHMPT project. (http://runtime.bordeaux.inria.fr/prohmpt/)

Hemera

The Runtime team is a member of the large-scale project Hémera, started in 2010, which aims at demonstrating ambitious up-scaling techniques for large-scale distributed computing by carrying out several dimensioning experiments on the Grid’5000 infrastructure, at animating the scientific community around Grid’5000, and at enlarging the Grid’5000 community by helping newcomers to make use of Grid’5000. It is not restricted to INRIA teams.

MEDIAGPU

We participate in a research project of the ANR CONTINT program called “MEDIAGPU”, which was granted 30-month funding (Jan. 2010 – Jun. 2012). It will develop a software architecture and will review and adapt a number of classical multimedia algorithms, considering the latest advances offered by the new hardware architectures, such as combinations of CPUs and GPUs (http://picoforge.int-evry.fr/projects/mediagpu/).

European Initiatives

FP7 Project PEPPHER

Title: Performance Portability and Programmability for Heterogeneous Many-core Architectures

Type: COOPERATION (ICT)

Challenge: Computing Systems

Instrument: Specific Targeted Research Project (STREP)

Duration: October 2010 - December 2012

Coordinator: Universität Wien (Austria)

Other partners: Chalmers Tekniska Högskola AB (Sweden), Codeplay Software Limited (United Kingdom), Intel GmbH (Germany), Linköpings Universitet (Sweden), Movidia Ltd. (Ireland), Universität Karlsruhe (Germany)

See also: http://www.peppher.eu/

Abstract: PEPPHER will provide a unified framework for programming architecturally diverse, heterogeneous many-core processors to ensure performance portability. PEPPHER will advance state-of-the-art in its five technical work areas:

Methods and tools for component based software

Portable compilation techniques

Data structures and adaptive, autotuned algorithms

Efficient, flexible run-time systems

Hardware support for autotuning, synchronization and scheduling

Collaborations in European Programs, except FP7

Program: COST

Project acronym: ComplexHPC

Project title: Open Network for High-Performance Computing on Complex Environments

Duration: May 2009 – May 2013

Coordinator: Emmanuel Jeannot

Other partners: 24 European countries, 2 non-European countries.

Abstract: The goal of the Action is to establish a European research network focused on high performance heterogeneous computing in order to address the whole range of challenges posed by these new platforms including models, algorithms, programming tools and applications.

International Initiatives

INRIA Associate Teams

Morse

The goal of the Matrices Over Runtime Systems at Exascale (MORSE) project is to design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale multicore systems with GPU accelerators, using all the processing power that future high-end systems can make available. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, several daunting challenges have to be overcome, both by the numerical linear algebra and the runtime system communities. By designing a research framework for describing linear algebra algorithms at a high level of abstraction, the MORSE team will enable the strong collaboration between research groups in linear algebra and runtime systems needed to develop methods and libraries that fully benefit from the potential of future large-scale machines. Our project will take a pioneering step in the effort to bridge the immense software gap that has opened up in front of the High-Performance Computing (HPC) community.

INRIA International Partners

The Runtime project is the representative of Inria within the MPI Forum, which designs and maintains the Message Passing Interface standard (http://www.mpi-forum.org).

We established a collaboration with the Open MPI project in the context of the development of the hwloc software. This collaboration was also informally extended to the development of high-performance intra-node communication in Open MPI over our KNem driver.

Runtime is a member of the CCI project together with the Oak Ridge National Laboratory and several other American academic and industrial partners (http://www.cci-forum.org).

The Runtime project is part of the joint laboratory that was set up between INRIA and the University of Illinois at Urbana-Champaign (UIUC) on Petascale Computing (http://jointlab.ncsa.illinois.edu/).

Visits of International Scientists

Jan Perhac from Trondheim University visited the Runtime team as an ERCIM Fellow from March 7 to March 11. We worked on the Thor runtime system.

Keisuke Fukuda from Tokyo Tech visited from December 12th to December 16th, for the FP3C project, to port an FMM application on top of StarPU.

Tetsuya Odajima from the University of Tsukuba, Japan, visited the Runtime team from September 2nd to September 16th, for the FP3C project, to integrate the XcalableMP language environment with StarPU.

Satoshi Ohshima from Tokyo University visited from April 4th to April 15th, for the FP3C project, to work on FEM methods.

Dissemination

Animation of the scientific community

The Runtime project organized the Euro-Par 2011 conference in August in Bordeaux.

Emmanuel Jeannot and Raymond Namyst are chairs and organizers of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011).

Raymond Namyst is vice-chair of the Research and Training Department in Mathematics and Computer Science (UFR Math-Info) of the University of Bordeaux 1. He is also a member of the Scientific Committee of the University of Bordeaux 1.

Raymond Namyst is the head of the LaBRI-CNRS “SATANAS” (Runtime systems and algorithms for high performance numerical applications) research team (about 50 people) that includes the Bacchus, Hiepacs and Runtime INRIA groups.

Raymond Namyst chairs the scientific committee of the ANR “Numerical Models” program for the 2011-2013 period.

Raymond Namyst serves as an expert for the following initiatives/institutions:

EESI (European Exascale Software Initiative, since 2010);

CEA/DAM (as a “scientific advisor” for the 2008-2010 period);

CEA-EDF-INRIA School technical committee (since 2009);

GENCI (http://www.genci.fr/?lang=en, since 2009);

ORAP (http://www.irisa.fr/ORAP/, as the INRIA representative since 2010).

In 2011, Raymond Namyst was co-chair of the topic “Architecture and Networks” for the SuperComputing (SC) conference.

Raymond Namyst has been a reviewer for the PhD dissertations of Swann Perarnau (University of Grenoble) and Christiane Pousa (University of Grenoble). He served as a member of the jury for the PhD defenses of Souad Koliaï (University of Versailles Saint Quentin) and Fabrice Dupros (BRGM, Orleans).

Raymond Namyst was a program committee member of the following international conferences: SC11, EuroMPI 2011, CASS 2011, PMEA 2011, A4MMC 2011.

Brice Goglin is a member of the following program committees: SuperComputing 2012, EuroMPI 2011 and 2012, ISPAN 2011. He was also a reviewer for the PLDI 2011 and CCGrid 2011 conferences, the CASS 2011 workshop, and the JPDC journal.

Denis Barthou served as topic chair for the Euro-Par 2011 conference, and was a member of the external review committee of IEEE/ACM PLDI 2011 and of the program committee of SMART 2011.

Denis Barthou has been a reviewer of the PhD dissertations of Guillaume Rizk (University of Rennes 1) and Samir Ammenouche (University of Versailles Saint Quentin), and a member of the qualifying exam committee for the PhD of Alexandre Duchateau (University of Illinois, Urbana-Champaign).

Denis Barthou was involved in the reviewing process of one ANR project proposal for the call “Infrastructures matérielles et logicielles pour la société numérique”. He serves as an expert in the Exascale Research Lab.

Guillaume Mercier is a member of the CCGrid 2011 program committee and was involved in the reviewing process for the Computers and Fluids journal.

Olivier Aumage was involved in the reviewing process of two ANR project proposals for the call “Infrastructures matérielles et logicielles pour la société numérique”. He was also involved in the reviewing process of the JPDC and Parallel Computing journals.

Emmanuel Jeannot is a member of the steering committee and the direction committee of the ADT Aladdin-G5K and has served as head of the Bordeaux site since October 2009.

Emmanuel Jeannot has been a reviewer of the PhD dissertations of Alexandru Dobilla (Université de Franche-Comté) and Robbert Higgins (University College Dublin, Ireland).

Emmanuel Jeannot served as a reviewer for the following journals: IEEE Trans. on Parallel and Dist. Syst., Parallel Computing, Computing, Journal of Parallel and Distributed Computing.

Emmanuel Jeannot is a member of the steering committee of the IEEE Cluster conference.

Emmanuel Jeannot is associate editor of the International Journal of Parallel, Emergent and Distributed Systems.

Emmanuel Jeannot is a member of the program committees of the IPDPS 2012, HeteroPar 2011, PPAM, Cluster 2011 and Renpar 20 conferences.

Samuel Thibault is a member of the program committee of HPCVirt. He was also involved in the reviewing process of the JPDC, SPE, and CCPE journals.

Seminars and invited talks

Raymond Namyst gave a keynote speech at the International Conference On Preconditioning Techniques For Scientific And Industrial Applications (Preconditioning 2011) about “Programming heterogeneous, accelerator-based multicore machines: current situation and main challenges”.

Raymond Namyst gave a keynote speech at the 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011) about “Programming heterogeneous, accelerator-based multicore machines: a runtime system's perspective”.

Raymond Namyst gave an invited talk at the 4th Workshop on UnConventional High Performance Computing (UCHPC 2011) about “programming heterogeneous systems”.

Raymond Namyst gave a lecture about “hybrid programming” at the 2011 CEA-EDF-INRIA school (Sophia Antipolis) on “Petaflop numerical simulation over hybrid parallel machines”.

Teaching

Members of the Runtime project gave thousands of hours of teaching at the University of Bordeaux and at the ENSEIRB-MATMECA engineering school, covering a wide range of topics from basic use of computers and C programming to advanced topics such as operating systems, parallel programming and high-performance runtime systems.

PhD & HdR:

PhD: Stéphanie Moreaud, Mouvement de données et placement des tâches pour les communications haute performance sur machines hiérarchiques, Univ. Bordeaux, 12/10/2011, Brice Goglin and Raymond Namyst

PhD: Cédric Augonnet, Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective, Univ. Bordeaux, 09/12/2011, Samuel Thibault and Raymond Namyst

PhD in progress: Bertrand Putigny, Modèles de performance pour l'ordonnancement sur architectures multicoeurs hétérogènes, 2010/11, Brice Goglin and Denis Barthou

PhD in progress: François Tessier, Placement d'applications hybrides sur machine non-uniformes multicœurs, 2011/10, Emmanuel Jeannot and Guillaume Mercier

PhD in progress: Paul-Antoine Arras, Development of a Flexible Heterogeneous System-On-Chip Platform using a mix of programmable Processing Elements and hardware accelerators, 2011/10, Emmanuel Jeannot and Samuel Thibault

PhD in progress: Jérôme Clet-Ortega, Exploitation efficace des architectures parallèles de type grappes de NUMA à l'aide de modèles hybrides de programmation, 2007/10, Raymond Namyst and Guillaume Mercier

PhD in progress: Sylvain Henry, Modèles de programmation et systèmes d'exécution pour architectures hétérogènes, 2009/10, Denis Barthou and Alexandre Denis

PhD in progress: Andres Charif-Rubial, Performance analysis and tuning of memory accesses for multi-core codes, 2008/10, Denis Barthou and William Jalby (Université de Versailles Saint Quentin en Yvelines)

PhD in progress: Julien Jaeger, Iterative compilation for irregular applications, 2007/10, Denis Barthou

PhD in progress: Andra Hugo, Composability of parallel codes over heterogeneous platforms, 2013/10, Abdou Guermouche, Pierre-André Wacrenier and Raymond Namyst

PhD in progress: Cyril Bordage, Parallélisation de la méthode multipôle sur architecture hybride, 2012/10, Raymond Namyst and David Goudin (CEA Le Barp)

PhD in progress: Corentin Rossignon, Design of an object-oriented runtime system for oil reserve simulations on heterogeneous architectures, 2013/10, Olivier Aumage, Pascal Hénon (TOTAL), Raymond Namyst and Samuel Thibault

Diffusion of the scientific culture

Brice Goglin is in charge of the diffusion of the scientific culture for the INRIA Research Center of Bordeaux. He is also a member of the National INRIA working group on Scientific Mediation.

Brice Goglin published two papers explaining multiprocessor operating systems and affinities in modern computers in Interstices.

Brice Goglin presented the team's research work to one hundred high-school students at the “Fête de la Science”.

Stéphanie Moreaud and Brice Goglin presented research careers at the Aquitec student exhibition.
