parallel,runtime,environment,heterogeneity,SMP,multicore,NUMA,HPC,high-speed networks,protocols,MPI,scheduling,thread,OpenMP,compiler optimizations
The Runtime research project takes place within the context of high-performance computing. It seeks to explore the design, the implementation and the evaluation of novel mechanisms needed by runtime systems for parallel computers. Runtime systems are intermediate software layers providing parallel programming environments with specific functionalities left unaddressed by the underlying operating system. Runtime systems can thus be seen as functional extensions of operating systems, but the boundary between them is rather fuzzy since runtime systems may actually contain specific extensions/enhancements to the underlying operating system (e.g. extensions to the OS thread scheduler). The increasing complexity of modern parallel hardware makes it more and more necessary to postpone essential decisions and actions (scheduling, optimizations) until run time, which emphasizes the role of runtime systems.
One of the main challenges encountered when designing modern runtime systems is to provide powerful abstractions, both at the programming interface level and at the implementation level, to deal with the increasing complexity of upcoming hardware architectures. While it is essential to understand, and somehow anticipate, the evolution of hardware technologies (e.g. programmable network interface cards, multicore architectures, hardware accelerators), the most delicate task is to extract models and abstractions that will fit most upcoming hardware features.
The originality of the Runtime group lies in the fact that we address all these issues following a global approach, so as to propose complementary solutions to problems which may not seem to be linked at first sight. We actually realized, for instance, that we could greatly improve our communication optimization techniques by increasing the functionalities of the underlying core thread scheduler. This illustrates why most of our research efforts have consisted in cross-studying different topics, and have led to the co-design of many software components.
Our research project centers on three main directions:
Thread scheduling over multicore machines
Data management over NUMA architectures
Task scheduling over GPU heterogeneous machines
Exploring parallelism orchestration at compiler and runtime level
Improved interactions between optimizing compiler and runtime
Scheduling data packets over high speed networks
New MPI implementations for Petascale computers
Optimized intra-node communication
Understanding network topology and application communication patterns to optimize process placement
Parallel, event-driven communication libraries
Communication and I/O within large multicore nodes
Besides those main research topics, we obviously intend to work in collaboration with other research teams in order to validate our achievements by integrating our results into larger software environments (MPI, OpenMP) and to join our efforts to solve complex problems.
Among the target environments, we intend to carry on developing the successor to the PM
Finally, as most of our work is intended to be used as a foundation for environments and programming tools exploiting large scale, high performance computing platforms, we definitely need to address the numerous scalability issues related to the huge number of cores and the deep hierarchy of memory, I/O and communication links.
The hwloc software is used for node topology discovery and process binding by the most popular MPI implementations, including MPICH2 and Open MPI, and all their derivatives such as Intel MPI.
The StarPU software is used for dynamic scheduling by the state-of-the-art dense linear algebra library, Magma v1.1 http://
Euro-Par is a major conference in parallel and distributed computing. It was organized in Bordeaux from August 29 to September 2, 2011. It featured 16 topics, 25 sessions and 12 workshops. 271 papers were submitted and 81 papers were accepted (29.9 %). Moreover, 3 invited lectures were given. 330 people registered for either the conference or the workshops. The website is http://
This research project takes place within the context of high-performance computing. It seeks to contribute to the design and implementation of parallel runtime systems that shall serve as a basis for the implementation of high-level parallel middleware. Today, the implementation of such software (programming environments, numerical libraries, parallel language compilers, parallel virtual machines, etc.) has become so complex that the use of portable, low-level runtime systems is unavoidable.
Our research project centers on three main directions:
With the beginning of the new century, computer makers have initiated a long term move of integrating more and more processing units, as an answer to the frequency wall hit by the technology. This integration cannot be made in a basic, planar scheme beyond a couple of processing units for scalability reasons. Instead, vendors have to resort to organizing those processing units following some hierarchical scheme. A level in the hierarchy is then materialized by small groups of units sharing some common local cache or memory bank. Memory accesses outside the locality of the group are still possible thanks to bus-level consistency mechanisms but are significantly more expensive than local accesses, which, by definition, characterizes NUMA architectures.
Thus, the task scheduler must feed an increasing number of processing units with work to execute and data to process while keeping the rate of penalized memory accesses as low as possible. False sharing, ping-pong effects, data vs task locality mismatches, and even task vs task locality mismatches between tightly synchronizing activities are examples of the numerous sources of overhead that may arise if threads and data are not distributed properly by the scheduler. To avoid these pitfalls, the scheduler therefore needs accurate information both about the layout of the computing platform it is running on and about the structure of the application it is scheduling, including the relationships between its activities.
As stated by Gao et al., we believe it is important to expose domain-specific knowledge semantics to the various software components in order to organize computation according to the application and architecture. Indeed, the whole software stack, from the application to the scheduler, should be involved in the parallelizing, scheduling and locality adaptation decisions by providing useful information to the other components. Unfortunately, most operating systems only provide a poor scheduling API that does not allow applications to transmit valuable hints to the system.
This is why we investigate new approaches in the design of thread schedulers, focusing on high-level abstractions to both model hierarchical architectures and describe the structure of applications' parallelism. In particular, we have introduced the bubble scheduling concept that helps to structure relations between threads in a way that can be efficiently exploited by the underlying thread scheduler. Bubbles express the inherent parallel structure of multithreaded applications: they are abstractions for grouping threads which “work together” in a recursive way. We are exploring how to dynamically schedule these irregular nested sets of threads on hierarchical machines, the key challenge being to schedule related threads as closely as possible in order to benefit from cache effects and avoid NUMA penalties. We are also exploring how to improve the transfer of scheduling hints from the programming environment to the runtime system, to achieve better computation efficiency.
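As a rough illustration of the bubble idea, the following Python sketch models bubbles as nested groups of threads and spreads them over a machine tree so that threads of the same bubble end up under the same subtree when possible. It is a toy model under stated assumptions: the `Bubble` and `Node` classes and the `distribute` function are hypothetical names invented for this example, and the greedy policy shown is not claimed to be one of the actual Marcel scheduling policies.

```python
from dataclasses import dataclass, field


@dataclass
class Bubble:
    """A group of threads and nested sub-bubbles (hypothetical model)."""
    name: str
    threads: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def size(self):
        return len(self.threads) + sum(c.size() for c in self.children)

    def all_threads(self):
        out = list(self.threads)
        for c in self.children:
            out.extend(c.all_threads())
        return out


@dataclass
class Node:
    """A level of the machine hierarchy; nodes without children are cores."""
    name: str
    children: list = field(default_factory=list)

    def cores(self):
        if not self.children:
            return [self.name]
        return [c for child in self.children for c in child.cores()]


def distribute(bubble, node, placement):
    """Place a bubble on a machine subtree, keeping related threads close."""
    cores = node.cores()
    if not node.children or not bubble.children:
        # Nothing left to split: round-robin over the cores of this subtree.
        for i, t in enumerate(bubble.all_threads()):
            placement[t] = cores[i % len(cores)]
        return
    # Open the bubble: hand each sub-bubble to the least loaded child node,
    # largest sub-bubbles first.
    load = {child.name: 0 for child in node.children}
    assigned = {child.name: [] for child in node.children}
    for sub in sorted(bubble.children, key=lambda b: -b.size()):
        target = min(load, key=load.get)
        load[target] += sub.size()
        assigned[target].append(sub)
    for child in node.children:
        subs = assigned[child.name]
        if subs:
            group = subs[0] if len(subs) == 1 else Bubble("group", children=subs)
            distribute(group, child, placement)
    # Threads held directly by this bubble may run anywhere in the subtree.
    for i, t in enumerate(bubble.threads):
        placement[t] = cores[i % len(cores)]


# Two NUMA nodes with two cores each, and an application made of two bubbles.
machine = Node("machine", [
    Node("numa0", [Node("c0"), Node("c1")]),
    Node("numa1", [Node("c2"), Node("c3")]),
])
app = Bubble("app", children=[
    Bubble("A", threads=["a0", "a1"]),
    Bubble("B", threads=["b0", "b1"]),
])
placement = {}
distribute(app, machine, placement)
```

In this example the two threads of bubble `A` share one NUMA node and the two threads of bubble `B` share the other, which is the kind of affinity the bubble structure is meant to convey to the scheduler.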
This is also the reason why we explore new languages and compiler optimizations to better use domain specific information. In the ANR project PetaQCD, we propose a new domain specific language, QIRAL, to generate parallel codes from high level formulations for Lattice QCD problems. QIRAL describes the formulation of the algorithms, of the matrices and of the preconditioners used in this domain, and generalizes languages such as SPIRAL, used in auto-tuning library generators for signal processing applications. Lattice QCD applications require a huge amount of processing power, on multi-node, multicore machines with GPUs. Simulation codes require finding new algorithms and efficient parallelizations. So far, the difficulties in orchestrating parallelism efficiently hinder algorithmic exploration. The objective of QIRAL is to decouple algorithm exploration from parallelism description. Compiling QIRAL uses rewriting techniques for algorithm exploration, parallelization techniques for parallel code generation and, potentially, runtime support to orchestrate this parallelism. Results of this work have been submitted for publication.
For parallel programs running on multicores, measuring reliable performance and determining performance stability is becoming a key issue: indeed, a number of hardware mechanisms may cause performance instability from one run to the other. Thread migration, memory contention (on any level of the cache hierarchy) and the scheduling policy of the runtime can introduce some variation, independently of the program input. A speed-up is interesting only if it corresponds to a performance that can be obtained through repeated execution of the application. Very few research efforts have been made in the identification of the program optimizations, runtime policies and hardware mechanisms that may introduce performance instability. We studied performance variations on a large set of OpenMP benchmarks, identified the mechanisms causing them and showed the need for better strategies for measuring speed-ups. Following this effort, we developed inside the tool MAQAO (Modular Assembler Quality Analyzer and Optimizer) the precise analysis of the interactions between OpenMP threads, through static analysis of binary codes and memory tracing. In particular, the influence of thread affinity is estimated and the tool proposes hints to users to improve their OpenMP codes.
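To make the point about repeatable speed-ups concrete, here is a minimal Python sketch that summarizes a speed-up over several runs instead of a single one. The function name `speedup_summary` is hypothetical and the timings are illustrative values chosen for the example, not measurements; the choice of the median and of a (max - min) / median spread is an assumption, not the methodology used in the study.

```python
import statistics


def speedup_summary(seq_times, par_times):
    """Summarize the speed-up observed over repeated runs (times in seconds)."""
    seq = statistics.median(seq_times)
    par = statistics.median(par_times)
    # Relative spread of the parallel runs: (max - min) / median.
    spread = (max(par_times) - min(par_times)) / par
    return {
        "median_speedup": seq / par,
        "worst_speedup": seq / max(par_times),
        "best_speedup": seq / min(par_times),
        "parallel_spread": spread,
    }


# Illustrative timings only (not measurements): one parallel run is an outlier.
seq_runs = [8.0, 8.0, 8.0]
par_runs = [2.0, 2.0, 4.0, 2.0, 2.0]
summary = speedup_summary(seq_runs, par_runs)
```

Reporting the worst case and the spread next to the median shows whether the best observed speed-up is actually reproducible.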
Aside from greedily invading all these new cores, demanding HPC applications now throw excited glances at the appealing computing power left unharvested inside the graphics processing units (GPUs). A strong demand is arising from the application programmers to be given means to access this power without bearing an unaffordable burden on the portability side. Efforts have already been made by the community in this respect but the tools provided still are rather close to the hardware, if not to the metal. Hence, we decided to launch some investigations on addressing this issue. In particular, we have designed a programming environment named StarPU that enables the programmer to offload tasks onto such heterogeneous processing units. It gives that programmer tools to fit tasks to the capabilities of the processing units and to efficiently manage data movements to and from the offloading hardware, and it handles the scheduling of such tasks, all in an abstracted, portable manner. The challenge here is to take into account the intricacies of all computation units: not only is the computation power heterogeneous across the machine, but data transfers themselves behave differently depending on the machine architecture and GPU capabilities, and thus have to be taken into account to get the best performance from the underlying machine. As a consequence, StarPU not only pays attention to fully exploiting each of the different computational resources at the same time, by properly mapping tasks in a dynamic manner according to their computation power and task behavior by means of scheduling policies, but it also provides a distributed shared-memory library that makes it possible to manipulate data across heterogeneous multicore architectures in a high-level fashion while being optimized according to the machine possibilities.
Using a large panel of mechanisms such as user-mode communications, zero-copy transactions and communication operation offload, the critical path in sending and receiving a packet over high speed networks has been drastically reduced over the years. Recent implementations of the MPI standard, which have been carefully designed to directly map basic point-to-point requests onto the underlying low-level interfaces, almost reach the same level of performance for very basic point-to-point messaging requests. However, more complex requests such as non-contiguous messages are left mostly unattended, and even more so are the irregular and multiflow communication schemes. The intent of the work on our NewMadeleine communication engine, for instance, is to address this situation thoroughly. The NewMadeleine optimization layer delivers much better performance on complex communication schemes with negligible overhead on basic single packet point-to-point requests. Through Mad-MPI, our proof-of-concept implementation of a subset of the MPI API, we intend to show that MPI applications can also benefit from the NewMadeleine communication engine.
The increasing number of cores in cluster nodes also raises the importance of intra-node communication. Our KNem software module aims at offering optimized communication strategies for this special case and lets the above MPI implementations benefit from dedicated models depending on process placement and hardware characteristics.
Moreover, the convergence between specialized high-speed networks and traditional Ethernet networks leads to the need to adapt former software and hardware innovations to new message-passing stacks. Our work on the Open-MX software is carried out in this context.
Regarding larger scale configurations (clusters of clusters, grids), we intend to propose new models, principles and mechanisms that should make it possible to combine communication handling, thread scheduling and I/O event monitoring on such architectures, both in a portable and efficient way. We particularly intend to study the introduction of new runtime system functionalities to ease the development of code-coupling distributed applications, while minimizing their unavoidable negative impact on the application performance.
Asynchronism is becoming ubiquitous in modern communication runtimes. This trend stems from complex optimizations based on online analysis of the communication schemes and on the decoupling of request submission from request processing. Flow multiplexing and transparent heterogeneous networking also imply an active role of the runtime system in request submission and processing. Communication overlap as well as reactiveness are critical. Since the cost of a network request is on the order of several thousand CPU cycles at least, independent computations should not get blocked by an ongoing network transaction. This is even more true with the increasingly dense SMP, multicore, SMT architectures where many computing units share a few NICs. Since portability is one of the most important requirements for communication runtime systems, the usual approach to implement asynchronous processing is to use threads (such as POSIX threads). Popular communication runtimes are indeed starting to make use of threads internally and also allow applications to be multithreaded. Low-level communication libraries also make use of multithreading. Such an introduction of threads inside communication subsystems does not come without trouble, however. The fact that multithreading is still usually optional with these runtimes is symptomatic of the difficulty of getting the benefits of multithreading in the context of networking without suffering from the potential drawbacks. We advocate the importance of the cooperation between the asynchronous event management code and the thread scheduling code in order to avoid such disadvantages. We intend to propose a framework for symbiotically combining both approaches inside a new generic I/O event manager.
The RUNTIME group is working on the design of efficient runtime systems for parallel architectures. We are currently focusing our efforts on High Performance Computing applications that mainly implement numerical simulations in the fields of Seismology, Weather Forecasting, Energy, Mechanics or Molecular Dynamics. These time-consuming applications require so much computing power that they need to run over parallel machines composed of several thousands of processors.
Because the lifetime of HPC applications often spreads over several years and because they are developed by many people, they have strong portability constraints. Thus, these applications are mostly developed on top of standard APIs (e.g. MPI for communications over distributed machines, OpenMP for shared-memory programming). That explains why we have long standing collaborations with research groups developing parallel language compilers, parallel programming environments, numerical libraries or communication software. Actually, all these “clients” are our primary target.
Although we are currently mainly working on HPC applications, many other fields may benefit from the techniques developed by our group. Since a large part of our efforts is devoted to exploiting multicore machines and GPU accelerators, many desktop applications could be parallelized using our runtime systems (e.g. 3D rendering, etc.).
The Common Communication Interface (CCI) aims at offering a generic and portable programming interface for a wide range of networking technologies (Ethernet, InfiniBand, ...) and application needs (MPI, storage, low latency UDP, ...).
CCI is developed in collaboration with the Oak Ridge National Laboratory and several other academic and industrial partners.
CCI is in early development and currently composed of 19 000 lines of C.
Hardware Locality (hwloc) is a library and set of tools aiming at discovering and exposing the topology of machines, including processors, cores, threads, shared caches, NUMA memory nodes and I/O devices.
It builds a widely-portable abstraction of these resources and exposes it to applications so as to help them adapt their behavior to the hardware characteristics.
hwloc targets many types of high-performance computing applications, from thread scheduling to placement of MPI processes. Most existing MPI implementations, several resource managers and task schedulers already use hwloc.
hwloc is developed in collaboration with the Open MPI project. The core development is still mostly performed by Brice Goglin and Samuel Thibault from the Runtime team-project, but many outside contributors are joining the effort, especially from the Open MPI and MPICH2 communities.
hwloc is composed of 33 000 lines of C.
KNem (Kernel Nemesis) is a Linux kernel module that offers high-performance data transfer between user-space processes.
KNem offers a very simple message passing interface that may be used when transferring very large messages within point-to-point or collective MPI operations between processes on the same node.
Thanks to its kernel-based design, KNem is able to transfer messages through a single memory copy, much faster than the usual user-space two-copy model.
KNem also offers the optional ability to offload memory copies to Intel I/O AT hardware, which improves throughput and reduces CPU consumption and cache pollution.
KNem is developed in collaboration with the MPICH2 team at the Argonne National Laboratory and the Open MPI project. These partners already released KNem support as part of their MPI implementations.
KNem is composed of 7000 lines of C. Its main contributor is Brice Goglin.
Marcel is the two-level thread scheduler (also called N:M scheduler) of the PM
The architecture of Marcel was carefully designed to support a large number of threads and to efficiently exploit hierarchical architectures (e.g. multicore chips, NUMA machines).
Marcel provides a seed construct, which can be seen as a precursor of a thread. It is only when the time comes to actually run the seed that Marcel attempts to reuse the resources and the context of another, dying thread, significantly saving management costs.
In addition to a set of original extensions, Marcel provides a POSIX-compliant interface, which makes it possible to take advantage of it by just recompiling unmodified applications or parallel programming environments (API compatibility), or even by running already-compiled binaries with the Linux NPTL ABI compatibility layer.
For debugging purposes, a trace of the scheduling events can be recorded and used after execution for generating an animated movie showing a replay of the execution.
The Marcel thread scheduling library is made of 80 000 lines of code.
Marcel has been supported for 2 years (2009-2011) by the INRIA ADT Visimar.
ForestGOMP is an OpenMP environment based on both the GNU OpenMP run-time and the Marcel thread library.
It is designed to schedule efficiently nested sets of threads (derived from nested parallel regions) over hierarchical architectures so as to minimize cache misses and NUMA penalties.
The ForestGOMP runtime generates nested Marcel bubbles each time an OpenMP parallel region is encountered, thereby grouping threads sharing common data.
Topology-aware scheduling policies implemented by BubbleSched can then be used to dynamically map bubbles onto the various levels of the underlying hierarchical architecture.
ForestGOMP allowed us to validate the BubbleSched approach with highly irregular, fine grain, divide-and-conquer parallel applications.
The Open-MX software stack is a high-performance message passing implementation for any generic Ethernet interface.
It was developed within our collaboration with Myricom, Inc. as a part of the move towards the convergence between high-speed interconnects and generic networks.
Open-MX exposes the raw Ethernet performance at the application level through a pure message passing protocol.
While the goal is similar to the old GAMMA stack or the recent iWarp implementations, Open-MX relies on generic hardware and drivers and has been designed for message passing.
Open-MX is also wire-compatible with the Myricom MX protocol and interface, so that any application built for MX may run on any machine without Myricom hardware and talk to other nodes running with or without the native MX stack.
Open-MX is also an interesting framework for studying next-generation hardware features that could help Ethernet hardware become legacy in the context of high-performance computing. Some innovative message-passing-aware stateless abilities, such as multiqueue binding and interrupt coalescing, were designed and evaluated thanks to Open-MX.
Brice Goglin is the main contributor to Open-MX. The software is already composed of more than 45 000 lines of code in the Linux kernel and in user-space.
StarPU permits high performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors.
StarPU offers a unified offloadable task abstraction named codelet. In case a codelet may run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs).
StarPU takes care to schedule and execute those codelets as efficiently as possible over the entire machine. A high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource.
StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time.
StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.
StarPU can also leverage existing parallel implementations, by supporting parallel tasks, which can be run concurrently over the machine.
StarPU provides a reduction mode, which makes it possible to further optimize data management when results have to be reduced.
StarPU provides integration in MPI clusters through a lightweight DSM over MPI.
StarPU comes with a plug-in for the GNU Compiler Collection (GCC), which extends languages of the C family with syntactic devices to describe StarPU's main programming concepts in a concise, high-level way.
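The codelet abstraction and the performance-model-based scheduling described above can be pictured with the following toy Python sketch. It is not the StarPU API (StarPU is a C library): `Codelet`, `Worker` and `submit` are hypothetical names, the cost figures are arbitrary, and the "earliest expected finish time" rule is a simplified stand-in for a real scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Codelet:
    """A task type with one implementation per architecture (hypothetical)."""
    name: str
    impls: dict  # architecture -> callable
    cost: dict   # architecture -> estimated run time (arbitrary units)


@dataclass
class Worker:
    name: str
    arch: str
    ready_at: float = 0.0  # time at which the worker becomes idle


def submit(codelet, args, workers):
    """Run the codelet on the worker expected to finish it first."""
    candidates = [w for w in workers if w.arch in codelet.impls]
    best = min(candidates, key=lambda w: w.ready_at + codelet.cost[w.arch])
    best.ready_at += codelet.cost[best.arch]
    return best.name, codelet.impls[best.arch](*args)


def scale_cpu(v, k):
    return [k * x for x in v]


def scale_gpu(v, k):
    # Stand-in for an accelerator kernel: same result, different cost.
    return [k * x for x in v]


scale = Codelet("scale", {"cpu": scale_cpu, "gpu": scale_gpu},
                {"cpu": 4.0, "gpu": 1.0})
workers = [Worker("cpu0", "cpu"), Worker("gpu0", "gpu")]
results = [submit(scale, ([1, 2, 3], 2), workers) for _ in range(5)]
trace = [name for name, _ in results]
```

With these assumed costs, the faster worker receives most tasks, but the slower one is still used once the queue on the fast worker makes it the better choice, so both resources contribute at the same time.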
NewMadeleine is a communication library for high performance networks, based on a modular architecture using software components.
The NewMadeleine optimizing scheduler aims at enabling the use of a much wider range of communication flow optimization techniques such as packet reordering or cross-flow packet aggregation.
NewMadeleine targets applications with irregular, multiflow communication schemes such as found in the increasingly common application conglomerates made of multiple programming environments and coupled pieces of code, for instance.
It is designed to be programmable through the concept of optimization strategies, allowing experimentation with multiple approaches or on multiple issues with regard to processing communication flows, based on basic communication flow operations such as packet merging or reordering.
The reference software development branch of the NewMadeleine software consists of 90 000 lines of code. NewMadeleine is available on various networking technologies: Myrinet, InfiniBand, Quadrics and Ethernet. It is developed and maintained by Alexandre Denis.
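Cross-flow packet aggregation, one of the optimizations mentioned above, can be illustrated with a short Python sketch. It is only a simplified model under stated assumptions: the function `aggregate`, the tuple layout and the sizes are hypothetical and do not reflect the NewMadeleine strategy interface.

```python
from collections import defaultdict


def aggregate(pending, max_payload):
    """Group pending packets by destination into frames of bounded size.

    pending: list of (flow_id, destination, payload_bytes) in submission order.
    Returns a list of (destination, [(flow_id, payload_bytes), ...]) frames.
    """
    by_dest = defaultdict(list)
    for flow, dest, size in pending:
        by_dest[dest].append((flow, size))
    frames = []
    for dest, packets in by_dest.items():
        current, used = [], 0
        for flow, size in packets:
            # Close the current frame when the next packet would not fit.
            if current and used + size > max_payload:
                frames.append((dest, current))
                current, used = [], 0
            current.append((flow, size))
            used += size
        if current:
            frames.append((dest, current))
    return frames


# Packets from three flows (A, B, C) going to two destinations.
pending = [("A", "node1", 100), ("B", "node1", 200),
           ("A", "node2", 900), ("C", "node1", 800), ("B", "node2", 50)]
frames = aggregate(pending, max_payload=1000)
```

Here five packets submitted by three independent flows leave in three frames, while the submission order is preserved for each destination.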
PadicoTM is a high-performance communication framework for grids. It is designed to enable various middleware systems (such as CORBA, MPI, SOAP, JVM, DSM, etc.) to utilize the networking technologies found on grids.
PadicoTM aims at decoupling middleware systems from the various networking resources to reach transparent portability and flexibility.
The PadicoTM architecture is based on software components. Puk (the PadicoTM micro-kernel) implements a light-weight high-performance component model that is used to build communication stacks.
The PadicoTM component model is now used in NewMadeleine. It is the cornerstone for networking integration in the projects “LEGO” and “COOP” from the ANR.
PadicoTM is composed of roughly 60 000 lines of C.
PadicoTM is registered at the APP under number IDDN.FR.001.260013.000.S.P.2002.000.10000.
MAQAO is a performance tuning tool for OpenMP parallel applications. It relies on the static analysis of binary codes and the collection of dynamic information (such as memory traces). It provides hints to the user about performance bottlenecks and possible workarounds.
MAQAO relies on binary codes and inserts probes for instrumentation directly inside the binary. There is no need to recompile. The static/dynamic approach of MAQAO analysis is the main originality of the tool, combining a performance model with values collected through instrumentation.
MAQAO has a static performance model for the x86 and Itanium architectures. This model analyzes the performance of the predecoder, of the decoder and of the different pipelines of the x86 architecture, in particular for SSE instructions.
The dynamic collection of data in MAQAO enables the analysis of thread interactions, such as false sharing, amount of data reuse, runtime scheduling policy, ...
MAQAO is part of the project “ProHMPT” from the ANR. A demo of MAQAO was given in Jan. 2010 for the SME/INRIA days and in Nov. 2010 at SuperComputing, at the INRIA booth.
QIRAL is a high level language (expressed through LaTeX) that is used to describe Lattice QCD problems. It describes matrix formulations, domain specific properties on preconditionings, and algorithms.
The compiler chain for QIRAL can combine algorithms and preconditionings, checking validity of the composition automatically. It generates OpenMP parallel code, using libraries such as BLAS.
This code is developed in collaboration with other teams participating in the ANR PetaQCD project.
TreeMatch is a library for performing process placement based on the topology of the machine and the communication pattern of the application.
TreeMatch provides a permutation of the processes to the processors/cores in order to minimize the communication cost of the application.
Important features are: the number of processors can be higher than the number of processes; it assumes that the topology is a tree and does not require valuation of the topology (e.g. communication speed); it implements different placement algorithms that are switched according to the input size.
TreeMatch is implemented as a load-balancer in Charm++ and as a tool for performing rank reordering in Open MPI and MPICH2.
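The general idea of topology-aware placement described above, namely putting processes that communicate heavily under the same subtree, can be sketched in Python as follows. This is a deliberately simple greedy heuristic for a two-level tree with two cores per node; it is not the TreeMatch algorithm, and the names `group_pairs` and `place` as well as the communication matrix are hypothetical.

```python
from itertools import combinations


def group_pairs(comm):
    """Greedily pair the processes that exchange the most data.

    comm: symmetric matrix, comm[i][j] = volume exchanged between i and j.
    Returns a list of pairs (an even number of processes is assumed).
    """
    n = len(comm)
    free = set(range(n))
    pairs = []
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: comm[e[0]][e[1]], reverse=True)
    for i, j in edges:
        if i in free and j in free:
            pairs.append((i, j))
            free -= {i, j}
    return pairs


def place(comm):
    """Map each process to a core index; cores 2k and 2k+1 share node k."""
    placement = {}
    for node, (i, j) in enumerate(group_pairs(comm)):
        placement[i] = 2 * node
        placement[j] = 2 * node + 1
    return placement


# Illustrative pattern: 0 talks mostly to 2, and 1 talks mostly to 3.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 8],
    [10, 1, 0, 1],
    [1, 8, 1, 0],
]
placement = place(comm)
```

Note that, as in the description above, only the shape of the tree is used; no link speed or other valuation of the topology is required.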
KNem is known to improve the performance of point-to-point intra-node MPI communication significantly.
We designed an extended RMA interface in KNem that suits the needs of point-to-point, collective and RMA operations.
We showed that the native use of KNem in MPI collective implementations enabled further optimization by combining the knowledge of collective algorithms with the mastering of KNem region management and copies.
This work was initiated in the context of our collaboration with the MPICH2 team and is now also pursued within the Open MPI project in collaboration with the University of Tennessee in Knoxville.
We demonstrated in the past that the locality of I/O devices within modern computing nodes has a significant impact on MPI communication performance (Non-Uniform I/O Access, NUIOA).
A first way to deal with such affinities would be to favor I/O-intensive processes by placing them near the network interfaces. However, determining the communication-intensiveness may be tricky. Also, some applications have uniform communication patterns. The other way to deal with I/O affinities is to modify the implementation of communication operations given a predetermined task placement.
We demonstrated that the implementation of collective operations should take I/O affinities into account. Deciding which steps and leaders should be involved in the algorithms based on NUIOA effects led us to improve broadcast performance significantly.
NewMadeleine is our communication library designed for high performance networks in clusters. We have worked on optimizations on low-level protocols so as to improve point-to-point performance.
We have proposed auto-tuning mechanisms for most parameters of a communication library: rendez-vous threshold, multi-rail ratio, optimization strategies.
We have proposed a communication protocol for InfiniBand that completely amortizes the cost of memory registration, through the use of a superpipeline that overlaps communication and memory copies. We have modeled the behavior of the network and proposed an auto-tuning mechanism to adapt the protocol to the hardware properties.
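As an illustration of what auto-tuning a rendez-vous threshold can mean, the Python sketch below picks the message size from which a rendez-vous protocol is never slower than an eager protocol in a set of sampled latencies. It is a minimal sketch under stated assumptions: the function name `tune_rendezvous_threshold` is hypothetical, the numbers are illustrative rather than measured, and the actual NewMadeleine tuning mechanism may work quite differently.

```python
def tune_rendezvous_threshold(sizes, eager_us, rdv_us):
    """Return the smallest sampled size from which rendez-vous is never slower.

    sizes: increasing message sizes in bytes.
    eager_us, rdv_us: latencies (microseconds) sampled for each size.
    Returns None if eager is still faster at the largest sampled size.
    """
    threshold = None
    for size, eager, rdv in zip(sizes, eager_us, rdv_us):
        if rdv <= eager:
            if threshold is None:
                threshold = size
        else:
            threshold = None  # eager wins again: drop the candidate
    return threshold


# Illustrative numbers only (not measurements).
sizes = [1024, 4096, 16384, 65536, 262144]
eager_us = [2.0, 5.0, 18.0, 80.0, 330.0]
rdv_us = [6.0, 8.0, 17.0, 60.0, 220.0]
threshold = tune_rendezvous_threshold(sizes, eager_us, rdv_us)
```

Running such a sampling pass on the target machine, rather than hard-coding the value, is what lets the threshold follow the hardware properties.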
The SALOME platform is open source software developed by EDF, CEA, and OpenCascade. It is an open simulation platform with pre-processing, post-processing, interoperability with CAD models, and integration with computation kernels.
YACS is the workflow engine used for code coupling applications in SALOME. It leverages CORBA for communications between kernels. We have ported YACS atop PadicoTM, our communication platform for grids. It enables CORBA connections to use InfiniBand networks. Benchmarks show a significant improvement in code coupling performance.
We have expanded our previous work dealing with MPI process placement. Indeed, our approach relied on tools and techniques which were outside the scope of the MPI standard itself. In order to allow users to utilize our work in a portable way, we enhanced some routines of the MPI standard. We worked mainly with the MPICH2 implementation but we are also working on an Open MPI version.
Instead of modifying the binding of the MPI processes onto the physical cores of the underlying architecture, we chose to create a new communicator for which the logical topology organization is optimized for the hardware. This work has been published and shows interesting performance improvements for some classes of MPI applications.
The problem of process placement, which can be reduced to an NP-hard graph partitioning problem, can be addressed with well-known tools such as Scotch or ParMETIS. To compare these solutions with TreeMatch, we ran several benchmarks using the NAS Parallel Benchmarks and a real CFD application. We study, on the one hand, the quality of the process permutation (which impacts the execution time) and, on the other hand, the time needed to compute the permutation. These results will allow us to conclude which graph partitioners are relevant for binding processes to processing units or for dynamically reordering processes.
We continued our work on extending StarPU to master the exploitation of heterogeneous platforms.
We have extended the StarPU scheduler to manage parallel tasks, which permit better exploitation of CPUs and better load balancing with GPUs.
On top of StarPU, we have designed a lightweight DSM over MPI, which makes it possible to seamlessly execute StarPU applications on an MPI cluster of GPU-enhanced nodes.
We have been developing a GCC plug-in which extends the C language with pragmas and attributes that make writing StarPU applications a lot easier.
We have added to StarPU support for automatically converting data between CPU and GPU formats (typically arrays of structures vs. structures of arrays). We are now optimizing it.
We have added an OpenCL interface to StarPU, SOCL, which makes it possible to execute unmodified OpenCL applications over StarPU.
We have introduced theoretical bound support in StarPU: from a record of the set of tasks submitted by the application, StarPU uses linear programming to compute the execution time of an ideal schedule, which can then be compared with the actual results.
We have continued our collaboration with the University of Tennessee, Knoxville on StarPU support in the state-of-the-art dense linear algebra library Magma, in particular for LU and QR factorizations. We have also collaborated with the University of Mons and Linköping.
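The idea behind the theoretical bound mentioned above can be illustrated on a deliberately simplified case. The Python sketch below does not use StarPU or a linear programming solver: it assumes a single kind of task, for which the linear program has a closed-form optimum (the workers' throughputs simply add up). The function name and the timings are hypothetical; the general case with several task types requires a real LP solver.

```python
def ideal_makespan(n_tasks, time_per_task):
    """Lower bound on the execution time of n_tasks identical, independent tasks.

    time_per_task lists, for each worker, the time it needs to run one task.
    In an ideal (fractional) schedule every worker stays busy until the end,
    so the aggregate throughput is the sum of the individual throughputs.
    """
    throughput = sum(1.0 / t for t in time_per_task)
    return n_tasks / throughput


# Hypothetical machine: 4 CPU cores at 10 ms per task and 1 GPU at 2 ms per task.
workers = [10.0, 10.0, 10.0, 10.0, 2.0]
bound_ms = ideal_makespan(90, workers)

# Hypothetical measured execution time, used only to show the comparison.
measured_ms = 125.0
efficiency = bound_ms / measured_ms
print(f"bound = {bound_ms:.1f} ms, efficiency = {efficiency:.0%}")
```

Comparing the measured time with such a bound indicates how far a scheduler is from the best it could possibly achieve on that set of tasks, independently of any particular scheduling policy.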
Today's embedded applications are increasingly demanding in terms of computational power, especially in real-time digital signal processing (DSP) where tight timing requirements are to be fulfilled. More specifically, when it comes to video decoding (e.g. H.264/AVC) not only has it been almost impossible for some time to run such codecs on a stand-alone embedded processor, but it now also becomes quite impractical to execute them on homogeneous multicore platforms. In this context, STMicroelectronics is developing a scalable heterogeneous system-on-chip template called P2012 and aimed at meeting the latest codecs' requirements.
This year, our research focused primarily on dataflow-based models, which benefit from strong, well-known properties such as analyzability, schedulability and expressivity. Furthermore, dataflow programming has already been used extensively in DSP, yielding a number of dedicated software synthesis tools. We have proposed a first version of the programming model that will be evaluated later.
We started a study on sparse matrix factorization and system resolution on heterogeneous platforms in collaboration with Pascal Hénon from Total, in the context of oil extraction simulation. Sparse matrix computations are notoriously difficult to run efficiently on heterogeneous platforms in the general case, due to the irregular memory access patterns they generate.
However, in the specific context of this study, Corentin Rossignon showed as part of his Master's thesis that the sparsity layout of the matrices generated by such oil extraction simulation problems can lead to a much higher level of efficiency on heterogeneous platforms when a suitable sparse internal representation is combined with carefully written operators, such as the sparse matrix-vector product, and with the StarPU heterogeneous scheduler.
Corentin Rossignon is now starting a PhD thesis in partnership with Total to build on these promising results.
As part of the FP3C project with Japan, we started a study to explore the use of StarPU as a possible target runtime system for the XcalableMP language and compiler developed by Prof. Sato's team at the University of Tsukuba. XcalableMP is a pragma-based language designed for parallelizing applications on clusters of multicore processors. The compiler is responsible for expanding XcalableMP pragmas into complex work mapping, communication and data redistribution commands.
The study of porting XcalableMP on top of StarPU was conducted by Cyril Roelandt during his Master's thesis, starting from the idea that a computing node with one or more attached accelerator cards can be seen as a distributed platform. The results of the study showed that the expressive power of the XcalableMP language is very promising for simplifying the port of applications to heterogeneous platforms. However, a current assumption of the XcalableMP model is that the compiler does not insert implicit commands and behaviour except at the exact location of pragma annotations, which limits the range of optimizations available to the dynamic scheduler and memory manager of StarPU. We will thus continue to collaborate with Prof. Sato's team within FP3C to see how these limitations could be reduced or lifted when using XcalableMP with StarPU.
In an effort to make it easier for C programmers to benefit from StarPU, the project-team has been working on extensions to the C language allowing important StarPU concepts to be expressed concisely. These C extensions are provided as a plug-in for the GNU Compiler Collection (GCC).
The GCC plug-in extends the syntax and semantics of C and related languages (C++, Objective-C) using attributes and pragmas. Attributes are used, for instance, to declare StarPU tasks and their implementations for the available targets (CPU, OpenCL, CUDA, etc.). Pragmas are used notably to provide programmers with a way to describe data buffers that are passed to tasks, which in turn allows the StarPU run-time support to manage data transfers between main memory and GPUs as it sees fit. Finally, tasks are invoked like regular C functions.
In addition to easing application development, the GCC plug-in, thanks to its higher-level view of the program structure, is able to report certain classes of errors at compile-time, which would otherwise lead to run-time errors.
This project has been led by Ludovic Courtès of Inria's Development and Experimentation Department (SED) at Bordeaux, as part of a joint development action with the SED.
Within the ADT Ampli project, we contributed to the Concha CFD library developed by R. Becker's Inria team Concha in Pau. Together with R. Becker, E. Bergounioux and D. Trujillo from the Concha team, and François Rue from SED Bordeaux, we designed and experimented with the MPI parallelization and the hybrid MPI+OpenMP parallelization of the library.
The MPI parallelization is now finalized. The OpenMP level has been successfully tested on the Vanka smoother and is now being extended to the rest of the library. We will thus continue to contribute to this parallelization work, in particular with respect to the support of 3D simulation cases.
Within the context of the ANR ProHMPT project, we contributed a thorough analysis of hot spots, data structure usages and locality issues in memory accesses of an aerodynamics application from partner CEA CESTA.
In accordance with these results, a new version of this application has been written by the CESTA team with redesigned, locality-friendly data structures and a simplified loop scheme. This new version performs much better than the previous one on both 2D and 3D cases.
We also conducted tests on porting selected kernels of this application to accelerated heterogeneous platforms. The results of these tests were disappointing with the first version of the application, because the layout of the main data structures led to a lot of memory transfers between the central memory and the accelerator memory.
We are now working on conducting these experiments with the redesigned version of the application, whose new data structures should dramatically reduce the amount of data transfers.
We propose a new approach for OpenCL programming, using a single virtual accelerator instead of the physical accelerators. Placement on the real hardware is handled by the runtime instead of the user, improving productivity and performance scalability. This proposal relies on the OpenCL standard but changes the way its API is used.
We have shown on some simple examples how this approach, using StarPU as a runtime, enables executions with better load balance and performance. We are working on generalizing this to more complex benchmarks. This work has been presented at the Renpar workshop.
Given a parallel task graph, a runtime such as StarPU can place each task on different hardware. However, the number of tasks and their granularity still need to be adapted to the target hardware. On architectures combining CPUs and GPUs, it is potentially interesting to have tasks of different granularities. We explore transformations that either automatically split tasks into smaller ones or, given some user knowledge about the tasks, decide how and when to split a large task into smaller ones.
This work starts from a high-level representation of the code, using an explicit data-flow graph.
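As a rough illustration of the splitting transformation, the Python sketch below divides a task that operates on a one-dimensional index range into near-equal sub-tasks. The `split_range` helper and the choice of granularities are hypothetical and are not part of StarPU; an actual transformation would also have to update the dependencies recorded in the data-flow graph.

```python
def split_range(start, stop, parts):
    """Split the half-open range [start, stop) into `parts` contiguous chunks
    whose sizes differ by at most one element."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    size, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        hi = lo + size + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


# Hypothetical policy: keep one coarse task when the work targets a GPU,
# and cut the same work into finer tasks when it targets CPU cores.
coarse = split_range(0, 1000, 1)
fine = split_range(0, 1000, 8)
print(coarse, fine)
```

The same index range thus yields either one large task or eight small ones, and the decision can be taken as late as the information about the target hardware allows.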
We build a model to predict the performance of HPC code on the SCC chip. This model can predict the runtime of regular code as well as its power consumption at different frequencies.
This allows users to choose whether to optimize power consumption, power efficiency or raw performance.
We are building a fine-grained cache model to understand common cache coherence issues.
This model is built on a set of micro-benchmarks and can also be used to find bottlenecks in memory-bound code. Our set of micro-benchmarks can also be used as a test bed for new architectures.
We propose a performance analysis of OpenMP codes, based on memory accesses and cache hierarchies.
This analysis relies on memory traces for multi-threaded codes and on static analysis of binary code. Memory traces are obtained through MAQAO by static binary rewriting and are compressed online, building polyhedral iteration domains. The static analysis, mostly induction variable detection on binary code, provides the same analysis whenever possible, removing the need in some cases for dynamic instrumentation.
The analysis focuses on a number of issues in multi-threaded executions: thread affinity issues, false sharing, cache pollution.
This work is carried out in collaboration with the Exascale Computing Lab.
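One of the issues listed above, false sharing, can be sketched from a memory trace as follows. This is a simplified Python illustration, not the MAQAO-based analysis: the trace format (thread, byte address, write flag), the 64-byte line size and the function name are assumptions, and access sizes and timing are ignored. A cache line is reported when several threads touch it, at least one of them writes to it, and no address is common to two threads.

```python
from collections import defaultdict


def find_false_sharing(trace, line_size=64):
    """Return the cache-line indices that look falsely shared in the trace."""
    per_line = defaultdict(lambda: defaultdict(set))  # line -> thread -> addresses
    written = set()                                   # lines with at least one write
    for thread, addr, is_write in trace:
        line = addr // line_size
        per_line[line][thread].add(addr)
        if is_write:
            written.add(line)

    suspects = []
    for line, by_thread in sorted(per_line.items()):
        if line not in written or len(by_thread) < 2:
            continue
        addr_sets = list(by_thread.values())
        # True sharing: at least two threads access the very same address.
        truly_shared = any(addr_sets[i] & addr_sets[j]
                           for i in range(len(addr_sets))
                           for j in range(i + 1, len(addr_sets)))
        if not truly_shared:
            suspects.append(line)
    return suspects


# Hypothetical trace: (thread id, byte address, is_write)
trace = [
    (0, 0, True), (1, 8, True),        # different words, same line 0: false sharing
    (0, 128, True), (1, 128, False),   # same word on line 2: true sharing
    (0, 256, True),                    # line 4 is used by a single thread
]
print(find_false_sharing(trace))
```

A typical remedy for a reported line is to pad or realign the data so that each thread's words fall into separate cache lines.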
We develop a new approach for stencil code generation, optimizing data-layout for multi-threaded, SIMD code on multicores and CUDA code on GPU. The transformation handles different stencil parameters, and memory constraints.
The generated code reaches high levels of performance, outperforming related work on multicores and achieving similar performance on GPUs. This work has been submitted for publication and was first presented in a workshop.
We participate in a contract between INRIA and EDF R&D, which was granted six months of funding (Apr. – Sept. 2011). It aims at optimizing the communications of YACS, the workflow engine of the SALOME simulation platform, using our PadicoTM communication framework.
STMicroelectronics is funding the CIFRE PhD thesis of Paul-Antoine Arras on “The development of a flexible heterogeneous system-on-chip platform using a mix of programmable processing elements and hardware accelerators” from October 2011 to October 2014.
Total funded a study (Apr. – Sept. 2011) on porting sparse matrix computations and system resolution to heterogeneous platforms for oil extraction simulations.
We participate in a research proposal to the ANR Cosinus program called “COOP”, which was granted three-year funding (Dec. 2009 – Dec. 2012). It aims at establishing generic cooperation mechanisms between resource management, runtime systems, and application programming frameworks to simplify programming models and improve performance through adaptation to the resources. It involves academic partners and EDF R&D. (http://
We participate in the joint ANR-JST project FP3C (Framework and Programming for Post Petascale Computing). The goal of this project is to contribute to establishing software technologies, languages and programming models to explore extreme performance computing beyond petascale computing, on the road to exascale computing.
We lead a research proposal to the ANR Cosinus program called “ProHMPT”, which was granted three-year funding (Jan. 2009 – Jun. 2012). It aims at focusing the joint research work of several teams working on compilers, runtimes and libraries on programming heterogeneous platforms such as GPUs and accelerators. It involves academic partners, companies (Bull, CAPS entreprise) and CEA teams. Olivier Aumage is the head of the ANR ProHMPT project. (http://
The Runtime team is a member of the large wingspan project Hémera, started in 2010, which aims at demonstrating ambitious up-scaling techniques for large-scale distributed computing by carrying out several dimensioning experiments on the Grid’5000 infrastructure, at animating the scientific community around Grid’5000, and at enlarging the Grid’5000 community by helping newcomers to make use of Grid’5000. It is not restricted to INRIA teams.
We participate in a research proposal to the ANR CONTINT program called “MEDIAGPU”, which was granted 30-month funding (Jan. 2010 – Jun. 2012). It will develop a software architecture and will review and adapt a number of classical multimedia algorithms, considering the latest advances offered by the new hardware architectures, such as combinations of CPUs and GPUs (http://
Title: Performance Portability and Programmability for Heterogeneous Many-core Architectures
Type: COOPERATION (ICT)
Challenge: Computing Systems
Instrument: Specific Targeted Research Project (STREP)
Duration: October 2010 - December 2012
Coordinator: Universität Wien (Austria)
Other partners: Chalmers Tekniska Högskola AB (Sweden), Codeplay Software Limited (United Kingdom), Intel GmbH (Germany), Linköpings Universitet (Sweden), Movidia Ltd. (Ireland), Universität Karlsruhe (Germany)
See also:
http://
Abstract: PEPPHER will provide a unified framework for programming architecturally diverse, heterogeneous many-core processors to ensure performance portability. PEPPHER will advance the state of the art in its five technical work areas:
Methods and tools for component based software
Portable compilation techniques
Data structures and adaptive, autotuned algorithms
Efficient, flexible run-time systems
Hardware support for autotuning, synchronization and scheduling
Program: COST
Project acronym: ComplexHPC
Project title: Open Network for High-Performance Computing on Complex Environments
Duration: May 2009 – May 2013
Coordinator: Emmanuel Jeannot
Other partners: 24 European countries, 2 non-European countries.
Abstract: The goal of the Action is to establish a European research network focused on high performance heterogeneous computing in order to address the whole range of challenges posed by these new platforms including models, algorithms, programming tools and applications.
The goal of the Matrices Over Runtime Systems at Exascale (MORSE) project is to design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale multicore systems with GPU accelerators, using all the processing power that future high-end systems can make available. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, several daunting challenges have to be overcome, both by the numerical linear algebra and the runtime system communities. By designing a research framework for describing linear algebra algorithms at a high level of abstraction, the MORSE team will enable the strong collaboration between research groups in linear algebra and runtime systems needed to develop methods and libraries that fully benefit from the potential of future large-scale machines. Our project will take a pioneering step in the effort to bridge the immense software gap that has opened up in front of the High-Performance Computing (HPC) community.
The Runtime project is the representative of Inria within the MPI Forum, which designs and maintains the Message Passing Interface Standard (http://
We established a collaboration with the Open MPI project in the context of the development of the hwloc software (see Section ). This collaboration was also informally extended to the development of high-performance intra-node communication with Open MPI over our KNem driver (see Section ).
Runtime is a member of the CCI project together with the Oak Ridge National Laboratory and several other American academic and industrial partners (http://
The Runtime project is part of the joint laboratory on Petascale Computing that was set up between INRIA and the University of Illinois at Urbana-Champaign (UIUC) (http://
Jan Perhac from Trondheim University visited the Runtime team as an ERCIM Fellow from March 7 to March 11. We worked on the Thor runtime system.
Keisuke Fukuda from Tokyo Tech visited from December 12th to December 16th, for the FP3C project, to port an FMM application on top of StarPU.
Tetsuya Odajima from the University of Tsukuba, Japan, visited the Runtime team from September 2nd to September 16th, for the FP3C project, to integrate the XcalableMP language environment with StarPU.
Satoshi Ohshima from Tokyo University visited from April 4th to April 15th, for the FP3C project, to work on FEM methods.
The Runtime project organized the Euro-Par 2011 conference in August in Bordeaux.
Emmanuel Jeannot and Raymond Namyst are chairs and organizers of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011).
Raymond Namyst is vice-chair of the Research and Training Department in Mathematics and Computer Science (UFR Math-Info) of the University of Bordeaux 1. He is also a member of the Scientific Committee of the University of Bordeaux 1.
Raymond Namyst is the head of the LaBRI-CNRS “SATANAS” (Runtime systems and algorithms for high performance numerical applications) research team (about 50 people), which includes the Bacchus, Hiepacs and Runtime INRIA groups.
Raymond Namyst chairs the scientific committee of the ANR “Numerical Models” program for the 2011-2013 period.
Raymond Namyst serves as an expert for the following initiatives/institutions:
EESI ( European Exascale Software Initiative, since 2010) ;
CEA/DAM (as a “scientific advisor” for the 2008-2010 period) ;
CEA-EDF-INRIA School technical committee (since 2009) ;
GENCI (http://
ORAP (http://
In 2011, Raymond Namyst was co-chair of the “Architecture and Networks” topic for the SuperComputing (SC) conference.
Raymond Namyst was a reviewer for the PhD dissertations of Swann Perarnau (University of Grenoble) and Christiane Pousa (University of Grenoble). He served as a member of the jury for the PhD defenses of Souad Koliaï (University of Versailles Saint Quentin) and Fabrice Dupros (BRGM, Orleans).
Raymond Namyst was a program committee member of the following international conferences: SC11, EuroMPI 2011, CASS 2011, PMEA 2011, A4MMC 2011.
Brice Goglin is a member of the following program committees: SuperComputing 2012, EuroMPI 2011 and 2012, ISPAN 2011. He was also a reviewer for the PLDI 2011 and CCGrid 2011 conferences, the CASS 2011 workshop, and the JPDC journal.
Denis Barthou served as topic chair for the Euro-Par 2011 conference, was a member of the external review committee of IEEE/ACM PLDI 2011, and was a member of the program committee of SMART 2011.
Denis Barthou was a reviewer of the PhD dissertations of Guillaume Rizk (University of Rennes 1) and Samir Ammenouche (University of Versailles Saint Quentin), and took part in the PhD qualifying exam of Alexandre Duchateau (University of Illinois, Urbana Champaign).
Denis Barthou was involved in the reviewing process of one ANR project proposal for the call “Infrastructures matérielles et logicielles pour la société numérique”. He serves as an expert in the Exascale Research Lab.
Guillaume Mercier is a member of the CCGrid 2011 program committee and was involved in the reviewing process for the Computers & Fluids journal.
Olivier Aumage was involved in the reviewing process of two ANR project proposals for the call “Infrastructures matérielles et logicielles pour la société numérique”. He was also involved in the reviewing process of the JPDC and Parallel Computing journals.
Emmanuel Jeannot is a member of the steering committee and the direction committee of the ADT Aladdin-G5K and has served as head of the Bordeaux site since October 2009.
Emmanuel Jeannot was a reviewer of the PhD dissertations of Alexandru Dobilla (Université de Franche-Comté) and Robbert Higgins (University College Dublin, Ireland).
Emmanuel Jeannot served as a reviewer for the following journals: IEEE Trans. on Parallel and Dist. Syst., Parallel Computing, Computing, Journal of Parallel and Distributed Computing.
Emmanuel Jeannot is a member of the steering committee of the IEEE Cluster conference.
Emmanuel Jeannot is associate editor of the International Journal of Parallel, Emergent and Distributed Systems.
Emmanuel Jeannot is a member of the program committees of the IPDPS 2012, HeteroPar 2011, PPAM, Cluster 2011 and Renpar 20 conferences.
Samuel Thibault is a member of the program committee of HPCVirt. He was also involved in the reviewing process of the JPDC, SPE, and CCPE journals.
Raymond Namyst gave a keynote speech at the International Conference On Preconditioning Techniques For Scientific And Industrial Applications (Preconditioning 2011) about “Programming heterogeneous, accelerator-based multicore machines: current situation and main challenges”.
Raymond Namyst gave a keynote speech at the 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011) about “Programming heterogeneous, accelerator-based multicore machines: a runtime system's perspective”.
Raymond Namyst gave an invited talk at the 4th Workshop on UnConventional High Performance Computing (UCHPC 2011) about “programming heterogeneous systems”.
Raymond Namyst gave a lecture about “hybrid programming” at the 2011 CEA-EDF-INRIA school (Sophia Antipolis) on “Petaflop numerical simulation over hybrid parallel machines”.
Members of the Runtime project gave thousands of hours of teaching at the University of Bordeaux and at the ENSEIRB-MATMECA engineering school, covering a wide range of topics from basic use of computers and C programming to advanced topics such as operating systems, parallel programming and high-performance runtime systems.
PhD & HdR:
PhD: Stéphanie Moreaud, Mouvement de données et placement des tâches pour les communications haute performance sur machines hiérarchiques, Univ. Bordeaux, 12/10/2011, Brice Goglin and Raymond Namyst
PhD: Cédric Augonnet, Scheduling Tasks over Multicore machines enhanced with Accelerators: a Runtime System's Perspective, Univ. Bordeaux, 09/12/2011, Samuel Thibault and Raymond Namyst
PhD in progress: Bertrand Putigny, Modèles de performance pour l'ordonnancement sur architectures multicoeurs hétérogènes, 2010/11, Brice Goglin and Denis Barthou
PhD in progress: François Tessier, Placement d'applications hybrides sur machine non-uniformes multicœurs, 2011/10, Emmanuel Jeannot and Guillaume Mercier
PhD in progress: Paul-Antoine Arras, Development of a Flexible Heterogeneous System-On-Chip Platform using a mix of programmable Processing Elements and hardware accelerators, 2011/10, Emmanuel Jeannot and Samuel Thibault
PhD in progress: Jérôme Clet-Ortega, Exploitation efficace des architectures parallèles de type grappes de NUMA à l'aide de modèles hybrides de programmation, 2007/10, Raymond Namyst and Guillaume Mercier
PhD in progress: Sylvain Henry, Modèles de programmation et systèmes d'exécution pour architectures hétérogènes, 2009/10, Denis Barthou and Alexandre Denis
PhD in progress: Andres Charif-Rubial, Performance analysis and tuning of memory accesses for multi-core codes, 2008/10, Denis Barthou and William Jalby (Université de Versailles Saint Quentin en Yvelines)
PhD in progress: Julien Jaeger, Iterative compilation for irregular applications, 2007/10, Denis Barthou
PhD in progress: Andra Hugo, Composability of parallel codes over heterogeneous platforms, 2013/10, Abdou Guermouche, Pierre-André Wacrenier and Raymond Namyst
PhD in progress: Cyril Bordage, Parallélisation de la méthode multipôle sur architecture hybride, 2012/10, Raymond Namyst and David Goudin (CEA Le Barp)
PhD in progress: Corentin Rossignon, Design of an object-oriented runtime system for oil reserve simulations on heterogeneous architectures, 2013/10, Olivier Aumage, Pascal Hénon (TOTAL), Raymond Namyst and Samuel Thibault
Brice Goglin is in charge of the diffusion of scientific culture for the INRIA Research Center of Bordeaux. He is also a member of the National INRIA working group on Scientific Mediation.
Brice Goglin published two papers explaining multiprocessor operating systems and affinities in modern computers in Interstices.
Brice Goglin presented the team's research work to one hundred high-school students at the “Fête de la Science”.
Stéphanie Moreaud and Brice Goglin presented research careers at the Aquitec student exhibition.