Performance driven data cache prefetching in a dynamic software optimization system

CAMUS Compilation pour les Architectures MUlti-coeurS

Architecture and Compiling

Algorithmics, Programming, Software and Architecture

Laboratoire des sciences de l'image,de l'informatique et de la télédétection (L.S.I.I.T) CNRS Université de Strasbourg Compiling Embedded Systems Hardware Accelerators Proofs Of Programs Formal Methods Processors Philippe Clauss UnivFr Enseignant

Nancy

Team leader, Professor, Université de Strasbourg oui Éric Violard UnivFr Enseignant

Nancy

Associate Professor, Université de Strasbourg oui Vincent Loechner UnivFr Enseignant

Nancy

Associate Professor, Université de Strasbourg Alain Ketterlin UnivFr Enseignant

Nancy

Associate Professor, Université de Strasbourg Julien Narboux UnivFr Enseignant

Nancy

Associate Professor, Université de Strasbourg Nicolas Magaud UnivFr Enseignant

Nancy

Associate Professor, Université de Strasbourg Benoît Pradelle UnivFr PhD

Nancy

Université de Strasbourg Alexandra Jimborean UnivFr PhD

Nancy

Université de Strasbourg Jean-François Dollinger UnivFr PhD

Nancy

Université de Strasbourg Overall Objectives Overall Objectives

The CAMUS team is focusing on developping, adapting and extending automatic parallelizing and optimizing techniques, as well as proof and certification methods, for the efficient use of current and future multicore processors.

The team's research activities are organized into five main issues that are closely related to reach the following objectives: performance, correction and productivity. These issues are: static parallelization and optimization of programs (where all statically detected parallelisms are expressed as well as all “hypothetical” parallelisms which would be eventually taken advantage of at runtime), profiling and execution behavior modeling (where expressive representation models of the program execution behavior will be used as engines for dynamic parallelizing processes), dynamic parallelization and optimization of programs (such transformation processes running inside a virtual machine), object-oriented programming and compiling for multicores (where object parallelism, expressed or detected, has to result in efficient runs), and finally program transformations proof (where the correction of many static and dynamic program transformations has to be ensured).

Highlights

The Associate Team Ancome, regrouping Camus and their Argentinian partners from the LaFhis team of the University of Buenos Aires, has started officially its activities in January 2011.

Philippe Clauss, Vincent Loechner and Diego Garbervetsky (Associate Team Ancome), published a chapter in the book Energy-Aware Memory Management for Embedded Systems, CRC Press 2012, on the paramatric estimation of memory consumption of programs.

Benoît Pradelle defended his PhD thesis December the 20th 2011 at the University of Strasbourg. He presented his results on dynamic selection of parallel code versions, automatic parallelization of binary codes and polyhedral dynamic and speculative parallelization. His jury was composed by Albert Cohen (reviewer), from INRIA, Sanjay Rajopadhye (reviewer), Colorado State University, USA, John Cavazos (examiner), University of Delaware, USA, Philippe Clauss, Alain Ketterlin and Vincent Loechner (advisors).

Alexandra Jimborean received the best poster award at the international conference Code Generation and Optimization (CGO) that was held in Chamonix, France, in April 2011.

Scientific Foundations Research directions

The various objectives we are expecting to reach are directly related to the search of adequacy between the sofware and the new multicore processors evolution. They also correspond to the main research directions suggested by Hall, Padua and Pingali in . Performance, correction and productivity must be the users' perceived effects. They will be the consequences of research works dealing with the following issues:

Issue 1: Static parallelization and optimization

Issue 2: Profiling and execution behavior modeling

Issue 3: Dynamic program parallelization and optimization, virtual machine

Issue 4: Object-oriented programming and compiling for multicores

Issue 5: Proof of program transformations for multicores

Efficient and correct applications development for multicore processors needs stepping in every application development phase, from the initial conception to the final run.

Upstream, all potential parallelism of the application has to be exhibited. Here static analysis and transformation approaches (issue 1) must be processed, resulting in a multi-parallelintermediate code advising the running virtual machine about all the parallelism that can be taken advantage of. However the compiler does not have much knowledge about the execution environment. It obviously knows the instruction set, it can be aware of the number of available cores, but it does not know the effective available resources at any time during the execution (memory, number of free cores, etc.).

That is the reason why a “virtual machine” mechanism will have to adapt the application to the resources (issue 3). Moreover the compiler will be able to take advantage only of a part of the parallelism induced by the application. Indeed some program information (variables values, accessed memory adresses, etc.) being available only at runtime, another part of the available parallelism will have to be generated on-the-fly during the execution, here also, thanks to a dynamic mechanism.

This on-the-fly parallelism extraction will be performed using speculative behavior models (issue 2), such models allowing to generate speculative parallel code (issue 3). Between our behavior modeling objectives, we can add the behavior monitoring, or profiling, of a program version. Indeed current and future architectures complexity avoids assuming an optimal behavior regarding a given program version. A monitoring process will allow to select on-the-fly the best parallelization.

These different parallelizing steps are schematized on figure .

The more and more widespread usage of object-oriented approaches and languages emphasizes the need for specific multicore programming tools. The object and method formalism implies specific execution schemes that translate in the final binary by quite distant elementary schemes. Hence the execution behavior control is far more difficult. Analysis and optimization, either static or dynamic, must take into account from the outset this distortion between object-oriented specification and final binary code: how can object or method parallelization be translated (issue 4).

Our project lies on the conception of a production chain for efficient execution of an application on a multicore architecture. Each link of this chain has to be formally verified in order to ensure correction as well as efficiency. More precisely, it has to be ensured that the compiler produces a correct intermediate code, and that the virtual machine actually performs the parallel execution semantically equivalent to the source code: every transformation applied to the application, either statically by the compiler or dynamically by the virtual machine, must preserve the initial semantics. They must be proved formally (issue 5).

In the following, those different issues are detailed while forming our global and long term vision of what has to be done.

Static parallelization and optimization Vincent Loechner Philippe Clauss Éric Violard Alexandra Jimborean

Static optimizations, from source code at compile time, benefit from two decades of research in automatic parallelization: many works address the parallelization of loop nests accessing multi-dimensional arrays, and these works are now mature enough to generate efficient parallel code . Low-level optimizations, in the assembly code generated by the compiler, have also been extensively dealt for single-core and require few adaptations to support multicore architectures. Concerning multicore specific parallelization, we propose to explore two research directions to take full advantage of these architectures. They are described below.

State of the art

Upstream, an easy interprocedural dependence analysis allows to handle complete programs (but recursivity: recursive functions must be transformed into iterative functions). Concerning iterative control we will use the polyhedral model, a formalism developped these last two decades, which allows to represent the execution of a loop nest by scanning a polytope.

When compiling an application, if it contains loop nests with affine bounds accessing scalars or arrays accessed using affine functions, the polyhedral model allows to:

compute the dependence graph, which describes the order in which the dependent instructions must be executed ;

generate a schedule, which extracts some parallelism from the dependence graph , ;

generate an allocation, which assigns a processor (or a core) to a set of iterations of the loop nest to be scanned.

This last allocation step needs a thorough knowledge of the target architecture, as many crucial choices will result in performance hazards: for example, the volume and flow of inter-processor communications and synchronization; the data locality and the effects of the TLB (Translation Lookaside Buffer) and the various cache levels and distributions; or the register allocation optimizations. There are many techniques to control these parameters, and each architecture needs specific choices, of a valid schedule, of a parallel loop iterations distribution (bloc-, cyclic-, or tiled), of a loop-unrolling factor, as well as a memory data layout and a prefetch strategy (when available). They require powerfull mathematical tools, such as counting the number of integer points contained in a parametric polytope.

Our own contributions in this area are significant. Concerning schedule and data placement, we proposed new advances in minimizing the number of communications for parallel architectures and in cache access optimizations . We also proposed essential advances in parametric polytope manipulation , , developped the first algorithm to count integer points in a parametric polytope as an Ehrhart polynomial , and proposed successive improvements of this algorithm . We implemented these results in the free software PolyLib, utilized by many researchers around the world.

Adapting parallelization to multicore architecture

The first research direction to be explored is multicore specific efficient optimizations. Indeed, multicore architectures need specific optimizations, or we will get underlinear accelerations, or even decelerations. Multicore architectures may have the following properties: specific memory hierarchy, with distributed low-level cache and (possibly semi-) shared high level caches; software-controlled memory hierarchies (memory hints, local stores or scratchpads for example); optimized access to contiguous memory addresses or to separate memory banks; SIMD or vectorial execution in groups of cores, and synchronous execution; higher register allocation pressure when several threads use the same hardware (as in GPGPUs for example); etc.

A schedule and an allocation must be chosen wisely in order to obtain good performances. On NVIDIA GPGPUs, using the CUDA language, Baskaran et al. obtained interesting results that have been implemented in their PLuTo compiler framework. However, they are based on many empirical and imprecise techniques, and require simulations to fine-tune the optimizations: they can be improved. Memory hierarchy efficient control is a cornerstone of tomorrow's multicore architectures performance. Compiler-optimizers have to evolve to meet this requirement.

Simulation and (partial-) profiling may however remain necessary in some cases, when static analysis reaches its intrinsic limits: when the execution of a program depends on dynamic parameters, when it uses complex pointer arithmetic, or when it performs indirect array accesses for example (as is often the case in whileloops, out of the scope of the classical polyhedral model). In these cases, the compiler should rely on the profiler, and generate a code that interacts with the dynamic optimizer. This is the link with issues 2 and 3 of this research project.

Expressing many potential parallelisms

The dynamic optimizer (issue 3) must be able to exploit various parallel codes to compare them and the best one to choose, possibly swapping from a code to another during execution. The compiler must therefore generate different potentially efficient versions of a code, depending on fixed parameters such as the schedule or the data layout, and dynamic parameters such as the tile size or the unrolling factor.

The compiler then generates many variants of effectiveparallelism, formally proved by the static analyzer. It may also generate variants of code that have not been formally validated, due to the analyzer limits, and that have to be checked during execution by the dynamic optimizer: hypotheticalparallelism. Hypothetical parallelism could be expressed as a piece of code, valid under certain conditions. Effective and hypothetical parallelisms are called potential parallelism. The variants of potential parallelism will be expressed in an intermediate language that has to be discovered.

Using compiler directives is an interesting way to define this intermediate language. Among the usual directives, we distinguish schedule directives for shared memory architectures (such as the OpenMP http:// www. openmp. org paralleldirective), and placement directives for distributed memory architectures (for example the HPF http:// hpff. rice. edu ALIGNdirective). These two types of directives are conjointly necessary to take full profit of multicore architectures. However, we have to study their complementarity and solve the interdependence or conflict that may arise between them. Moreover, new directives should allow to control data transfers between different levels of the memory hierarchy.

We are convinced that the definition of such a language is required in the next advances in compilation for multicore architectures, and there does not exist such an ambitious project to our knowledge. The OpenCL project http:// www. khronos. org/ opencl, presented as an general-purpose and efficient multicore programming environment, is too low-level to be exploitable. We propose to define a new high level language based on compilation directives, that could be used by the skilled programmer or automatically generated by a compiler-optimizer (like OpenMP, recently integrated in the gcccompiler suite).

Profiling and execution behavior modeling Alain Ketterlin Philippe Clauss Benoît Pradelle

The increasing complexity of programs and hardware architectures makes it ever harder to characterize beforehand a given program's run time behavior. The sophistication of current compilers and the variety of transformations they are able to apply cannot hide their intrinsic limitations. As new abstractions like transactional memories appear, the dynamic behavior of a program strongly conditions its observed performance. All these reasons explain why empirical studies of sequential and parallel program executions have been considered increasingly relevant. Such studies aim at characterizing various facets of one or several program runs, e.g., memory behavior, execution phases, etc. In some cases, such studies characterize more the compiler than the program itself. These works are of tremendous importance to highlight all aspects that escape static analysis, even though their results may have a narrow scope, due to the possible incompleteness of their input data sets.

Selective profiling and interaction with the compiler

In its simplest form, studying a given program's run time behavior consists in collecting and aggregating statistics, e.g., counting how many times routines or basic blocks are executed, or counting the number of cache misses during a certain portion of the execution. In some cases, data can be collected about more abstract events, like the garbage-collector frequency or the number and sizes of sent and received messages. Such measures are relatively easy to obtain, are frequently used to quantify the benefits of some optimization, and may suggest some way to improve performance. These techniques are now well-known, but mostly for sequential programs.

These global studies have often been complemented by local, targeted techniques focused on some program portions, e.g., where static techniques remain inconclusive for some fixed duration. These usages of profiling are usually strongly related to the optimization they complement, and are set up either by the compiler or by the execution environment. Their results may be used immediately at run time, in which case they are considered a form of run time optimization . They can also be used offline to provide hints to a subsequent compilation cycle, in which case they constitute a form of profile-guided compilation, a strategy that is common in general purpose compilers.

For instance, in the context where a set of possible parallelizations have been provided by the compiler (see issue 1), a profiling component can easily be made responsible for testing some relevant condition at run time ( e.g., that depends on input data) and for selecting the best between various versions of the code. Beyond such simple tasks, we expect that profiling will, at the beginning of the execution, have enough resources to conduct more elaborate analyzes. We believe that combining an “open” static analysis with an integrated profiling component is a promising approach, first because it may relieve the programmer of a large part of the tedious task of implementing the distribution of computations, and second to free the compiler of the obligation to choose between several optimizations in the absence of enough relevant data. The main open question here is to define precisely the respective roles of the compiler and the profiler, and also the amount and nature of information the former can transmit to the latter.

Profiling and dynamic optimization

In the context of dynamic optimization, that is, when the compiler's abilities have been exhausted, a profiler can still do useful work, provided some additional capabilities . If it is able to instrument the code the way, e.g., a PIN-tool does , it has access to the whole program, including libraries (or, for example, the code of a low-level library called from a scripting language). This means that it has access to portions of the program that were not under the compiler's control. The profiler can then perform dynamic inter-procedural analyzes, for instance to compute dependencies to detect parallelism that wasn't apparent at compile time because of a function call in the body of a loop. More generally, if the profiler is able to reconstruct at run time some representation of the whole program, as in for example, it is possible to let it search for any construct that can be optimized and/or parallelized in the context of the current execution. Several virtual machines, e.g., for Java or Microsoft CLR, have opened this way of optimizing programs, probably because virtual machines need to maintain an intermediate, structured representation of the running program.

The possibility of running programs on architectures that include a large number of computing cores has given rise to new abstractions , , . Transactional memories, for instance, aim at simplifying the management of conflicting concurrent accesses to a shared memory, a notoriously difficult problem . However, the performance of a transaction-based application heavily depends on its dynamic behavior, and too many conflicting accesses and rollbacks, severely affect performance. We bet that the need for multicore specific programming tools will lead to other abstractions based on speculative execution. Because of the very nature of speculation, all these abstractions will require run time evaluation, and maybe correction, to avoid pathological cases. The profiler has a central role here, because it can be made responsible for diagnosing inefficient use of speculative execution, and for taking corrective action, which means that it has to be integrated to the execution environment. We also think that the large scope and almost infinite potential uses of a profiling component may well suggest new parallel program abstractions, specially targeted at run time evaluation and adaptation.

Run time program modeling

When profiling goes beyond simple aggregation of counts, it can, for example, sample a program's behavior and split its execution into phases. These phases may help target a subsequent evaluation on a new architecture . When profiling instruments the whole program to obtain a trace, e.g., of memory accesses, it is possible to use this trace for:

simulation, e.g., by varying the parameters of the memory hierarchy,

for modeling, e.g., to reconstruct some specific model of the program , or to extract dynamic dependencies that help identifying parallel sections .

Handling such large execution traces, and especially compressing them, is a research topic by itself , . Our contribution to this topic is unusual in that the result of compression is a sequence of loop nests where memory accesses and loop bounds are affine functions of the enclosing loop indices. Modeling a trace this way leads to slightly better average compression rates compared to other, less expressive techniques. But more importantly, it has the advantage to provide a result in symbolic form, and this result can be further analyzed with techniques usually restricted to the static analysis of source code. We plan to apply, in the short term, similar techniques to the modeling of dynamic dependencies, so as to be able to automatically extract parallelism from program traces.

This kind of analysis is representative of a new kind of tools than could be named “parallelization assistants” , . Properties that can't be detected by the compiler but that appear to hold in one or several executions of a program can be submitted to the programmer, maybe along a suitable reformulation of its program using some class of abstraction, e.g., compiler directives. The goal is to provide help and guidance in adapting source code, in the same way a classical profiling tool helps pinpoint performance bottlenecks. Control and data dependencies are fundamental to such a tool. An execution trace provides an observed reality; for example a trace of memory addresses. If the observed dynamic dependencies provide a set of constraints, they also suggest a complete family of potential correct executions, be they parallel or sequential, and all these executions are equivalent to the reference execution. Being able to handle large traces, and representing them in some manageable way, means being able to highlight medium to large grain parallelism, which is especially interesting on multicore architectures and often difficult for compilers to discover, for example because of the use of pointers and the difficulty of eliminating potential aliasing. This can be seen as a machine learning problem, where the goal is to recover a hidden structure from a large sequence of events. This general problem has various incarnations, depending on how much the learner knows about the original program, on the kind of data obtained by profiling, on the class of structures sought, and on the objectives of the analysis. We are convinced that such studies will enrich our understanding of the behavior of programs, and of the programming concepts that are really useful. It will also lead to useful tools, and will open up new directions for dynamic optimization.

Dynamic parallelization and optimization, virtual machine Philippe Clauss Alain Ketterlin Vincent Loechner Benoît Pradelle Alexandra Jimborean

This link in the programming chain has become essential with the advent of the new multicore architectures. Still being considered as secondary with mono-core architectures, dynamic analysis and optimization are now one of the keys for controling those new mechanisms complexity. From now on, performed instructions are not only dedicated to the application functionalities, but also to its control and its transformation, and so in its own interest. Behaving like a computer virus, such a process should rather be qualified as a “vitamin”. It perfectly knows the current characteristics of the execution environment and owns some qualitative information thanks to a behavior modeling process (issue 2). It appends a significant part of optimizing ability compared to a static compiler, while observing live resources availability evolution.

State of the art

Dynamicanalysis and optimization, that is to say simultaneous to the program execution, have motivated a growing interest during the last decade, mainly because of the hardware architectures and applications growing complexity. Indeed, it has become more and more difficult to anticipate any program run simply from its source code, either because its control structures introduce some unknown objects before run (dynamic memory allocation, pointers, ...), or because the interaction between the target architecture and the program generates unpredictable behaviors. This is notably due to the appearance of more optimizing hardware units (prefetching units, speculative processing, code cache, branch prediction, etc.). With multicore architectures, this interest is growing even more. Works achieved in this area for mono-core processors have permitted to establish some classification of the so-called dynamic approaches, either based on the used methodologies or on the objectives.

The first objective for any dynamic approach is to extract some live information at runtime relying on a profiling process. This essential step is the main objective of issue 2 (see sub-section ).

Identifying some “hotspots” thanks to profiling is then used for performance improvement optimizations. Two main approaches can be distinguished:

the profile-guidedapproach, where analysis and optimization of profile information are performed off-line, that is to say statically. A first run is only performed to extract information for driving a re-compilation. Related to this approach, iterative compilationconsists in running a code that has been transformed following different optimization possibilities (nature and sequencing of the applied optimizations), and then in re-compiling the transformed code guided by the collected performance information, and so on until obtaining a “best” program version. In order to promote a rapid convergence towards a better solution, some heuristics or some machine learning mechanisms are used , , . The main drawback of such approaches relates to the quality of the generated code which depends on the reference profiled execution, and more precisely on the used input data set, but also on the used hardware.

the on-the-flyapproach consists in performing all steps at each run (profiling, analysis and transformation). The main constraint of this approach is that the time overhead has to be widely compensated by the benefits it generates. Several works propose such approaches dedicated to specific optimizations. We personally successfully implemented a dynamic data prefetching system for the Itanium processor .

Although all these works provided some efficient dynamic mechanisms, their adaptation to multicore architectures yields difficult issues, and even challenges them. It is indeed necessary to control interactions between simultaneous tasks, imposing an additional complexity level which can be fateful for a dynamic system, while becoming too costly in time and space.

Some dynamic parallelizing techniques have been proposed in the last years. They are mainly focusing on parallelizing loop-nests, as programs generally spend most of their execution time in iterative structures.

The LRPD test is certainly one of the foundation strategies. This method consists in speculatively parallelizing loops. Privatization and reduction transformations are applied to promote a successful application of the strategy. During execution, some tests are performed to verify the speculation validity. In case of invalid speculation, the targeted loop is re-executed sequentially. However, the application range is limited to loops accessing arrays; pointers cannot be handled. Moreover the method is not fully dynamic since an initial static analysis is needed.

In , Cintra and Llanos present a speculative parallel execution mechanism for loops, where iteration chunks are executed in sliding windows of $n$ threads. The loops are not transformed and the sequential schedule remains as a reference to define a total order on the speculative threads. In order to verify whether some dependencies are violated during the program run, all data structures qualified as speculative, that is to say those being accessed in read-write mode by the threads, are duplicated for each thread and tagged following those states: not accessed, modified, exposed loadedor exposed loaded and later modified. For example, a read-after-writedependency has been violated if a thread owns a data tagged as exposed loadedor exposed loaded and modified, and if a predecessor thread, following the sequential total order, owns the same data but tagged as modifiedor exposed loaded and modified, while this data has not yet been committed in main memory. Such an approach can be memory-costly as each shared data structure is duplicated. It can be tricky to adjust verification frequencies to minimize time overhead. Some other methods based on the same principle of verifying speculation relatively to the sequential schedule have been proposed recently as in , where each iteration of a loop is decomposed into a prologue, a speculative body and an epilogue. The speculative bodies are performed in parallel and each body completion induces a verification. This approach seems to be only well suited for loops which bodies represent significant computation time.

Another recent work is the development of SPICE which is a speculative parallelizing system where an entire first run of a loop is initially observed. This observation serves in determining the values reached by some variables during the run. During a next run of the loop, several speculative threads are launched. They consider as initial values of some variables the values that have been observed at the previous run. If a thread reaches the starting value of another thread, it stops. Thus each thread performs a different portion of the loop. But if the loop behavior changes and if another thread starting value is never reached, the run goes on sequentially until completion.

The main limits of these propositions are:

they do not alter the initial sequential schedule since always contiguous instruction blocks are speculatively parallelized;

their underlying parallelism is out of control: the characteristics of the generated parallel schedule are completely unknown since they randomly depend on the program instructions, their dependencies and the target machine. If bad performance is encountered, no other parallelization solution can be proposed. Moreover, the effective instruction schedule occurring at program run can significantly vary from one run to another, hence leading to a confusing performance inconsistency.

A strategy that would uniquely be based on a transactional memory mechanism, with rollbacks in the case of data races, yields a totally uncontrolable parallelism where performance can not be ensured and not even strongly expected.

While being based on efficient prediction mechanisms, a better control over parallelization will permit to provide solutions that are well suited to a varying execution context and to parallelize portions of code that can be parallelized only in some particular context. It is indeed crucial to maximize the potential parallelism of the applications to take advantage of the forthcoming processors comprising several tens of cores.

General objective: building a virtual machine

As it has already been mentioned, dynamic parallelization and optimization can take place inside a virtual machine. All the research objectives that are presented in the following are related to its construction.

Notice that the term of “virtual machine” is employed to group a set of dynamic analysis and optimization mechanisms taking as input a binary code, eventually enriched with specific instructions. We refer to a process virtual machine which main role is dynamic binary optimization from one instruction set to the same instruction set. The taxonomy given in includes this kind of virtual machine.

Notice that this virtual machine can run in parallel on the processor cores during the four initial phases (see figure ), but also simultaneously to the target application, either by sharing some cores with light processes, or by using cores that are useless for the target application. It will also support a transactional memory mechanism, if available. However the foreseen parallelizing strategies do not depend on such a mechanism since our speculative executions are supposed to be as reliable as possible thanks to efficient prediction models, and since they are supported by a specific and higher level rollback mechanism. Anyway if available, a transactional memory mechanism would allow to take advantage of “nearly perfect” prediction models.

The virtual machine takes as input an intermediate code expressing several kinds of parallelism on several code extracts. Those kinds of parallelism are either effective, that is to say that the corresponding parallel execution is obviously semantically correct, or hypothetical, that is to say that there is still some uncertainty on the parallelism correctness. In this case, this uncertainty will have to be resolved at run time. This intermediate “multi-parallel” code is generated by the static parallelization described subsection . It also contains generic descriptions of parallelizing or optimizing transformations which parameters will have to be instanciated by the virtual machine, thanks to its knowledge about the target architecture and the program run-time behavior.

Adaptation of the intermediate code to the target architecture

The virtual machine first phase is to adapt this intermediate code to the target multicore architecture. It consists in answering the following questions:

What is the suitable kind of parallelism?

What is the suitable parallel task granularity?

What is the suitable number of parallel tasks?

Can we take advantage of a specialized instruction set for some operations?

What are the parameter values for some parallelization or optimization?

The multi-parallel intermediate code exhibits different parameters allowing to adapt some parallelizing and optimizing transformations to the target architecture. For example, a loop unrolling will be parametrized by the number of iterations to be unrolled. This number will depend, for example, on the number of available registers and the size of the instruction cache. A parallelizing transformation will depend on several possible parallel instruction schedules. One or several schedules will be selected, for example, depending on the kind of memory hierarchy and the cache sharing among cores.

Concerning hypothetical parallelism, this first phase will reduce the number of these propositions to solutions that are well suited to the target architecture. This phase also instruments the intermediate code in order to install the dynamic mechanisms related to profiling and speculative parallel execution.

High level parallelization and native code creation

From these target architecture related adaptations, a parallel intermediate code is generated. It contains instructions that are specific to the dynamic optimizing and parallelizing mechanisms, i.e., instrumentation instructions to feed the profiling process as well as calls to speculative execution management procedures. A translation into native code executable by the target processor follows. This translation also allows to keep trace of the code extracts that have to be modified during the run.

Low level parallelization

The binary version of the code exhibits new parallelism and optimization sources that are specific to the instruction set and to the target architecture capabilities. Moreover, some dynamic optimizations are dedicated to specific instructions, or instruction blocks, as for example the memory reads which time performances can be dynamically improved by data prefetching . Thus the binary code can be transformed and instrumented as well.

Distribution, execution and profiling

The so built executable code is then distributed among the processor cores to be run. During the run, the instrumentation instructions feed the profiler with information for execution monitoring and for behavior models construction (see subsection ). An accurate knowledge of the binary code, thanks to the control of its generation, also permits at this step to dynamically control the insertion or deletion of some instrumentation instructions. Indeed it is important to manage execution monitoring through sampling based instrumentations in varying frequencies, following the changing behavior frequency (see in and a description of this kind of mechanism), as such instrumentations necessarily induce overheads that have to be minimized.

Re-parallelization, thread mutation or rollback

Depending on the information collected from instrumentation, and depending on the built prediction models, the profiling phase causes a re-transformation of some code parts, thus causing the mutation of the concerned threads. Such re-transformation is done either on the binary code whether it consists in low level and small modifications, as for example the adjustement of a data prefetching distance, or on the intermediate code if it consists in a complete modification of the parallelizing strategy. For example, such a processing will follow the observation of a bad performance, or of a change in the computing resources availability, or will be caused by the completion of a dependency prediction model allowing the generation of a speculative parallelization. From such a speculative execution, a re-transformation can consist in rolling back to a sequential execution version when the considered hypothetical parallelism, and thus the associated prediction model, has been evaluated wrong.

Proof of program transformations for multicores Éric Violard Julien Narboux Nicolas Magaud Vincent Loechner Alexandra Jimborean State of the art Certification of low-level codes.

Among the languages allowing to exploit the power of multicore architectures, some of them supply the programmer a library of functions that corresponds more or less to the features of the target architecture : for example, CUDA http:// www. nvidia. com/ object/ cuda_what_is. htmlfor the architectures of type GPGPU and more recently the standard OpenCL http:// www. khronos. org/ openclthat offers a unifying programming interface allowing the use of most of the existing multicore architectures or a use of heterogeneous aggregate of such architectures. The main advantage of OpenCL is that it allows the programmer to write a code that is portable on a large set of architectures (in the same spirit as the MPI library for multi-processor architectures). However, at this low level, the programming model is very close to the executing model, the control of parallelism is explicit. Proof of program correctness has to take into account low-level mechanisms such as hardware interruptions or thread preemption, which is difficult.

In , Feng et al.propose a logic inspired from the Hoare logic in order to certify such low-level programs with hardware interrupts and preempted threads. The authors specify this logic by using the meta-logic implemented in the Coq proof assistant .

Certification of a compiler.

The problem here is to prove that transformations or optimizations preserve the operational behaviour of the compiled programs.

Xavier Leroy in , formalizes the analyses and optimizations performed by a C compiler: a big part of this compiler is written in the specification language of Coq and the executable (Caml) code of this compiler is obtained by automatic extraction from the specification.

Optimizing compilers are complex softwares, particularly in the case of multi-threaded programs. They apply some subtle code transformations. Therefore some errors in the compiler may occur and the compiler may produce incorrect executable codes. Work is to be done to remedy this problem. The technique of validation a posteriori , is an interesting alternative to full verification of a compiler.

Semantics of directives.

As it was mentioned in subsection , the use of directives is an interesting approach to adapt languages to multicore architectures. It is a syntactic means to tackle the increasing need of enriching the operational semantics of programs.

Ideally, these directives are only comments: they do not alter the correction of programs and they are a good means to improve their performance. They allow the separation of concerns: correctionand efficiency.

However, using directives in that sense and in the context of automatic parallelization, raises some questions: for example, assuming that directives are not mandatory, how to ensure that directives are really taken into account? How to know if a directive is better than another? What is the impact of a directive on performance?

In his thesis , that was supervised by Éric Violard, Philippe Gerner addresses similar questionings and states a formal framework in which the semantics of compilation directives can be defined. In this framework, any directive is encoded into one equation which is added to an algebraic specification. The semantics of the directives can be precisely defined via an order relation (called relation of preference) on the models of this specification.

Definition of a parallel programming model.

Classically, the good definition of a programming model is based on a semantic domain and on the definition of a “toy” language associated with a proof system, which allows to prove the correctness of the programs written in that language. Examples of such “toy” languages are CSP for control parallelism and $ℒ$ for data parallelism. The proof systems associated with these two languages, are extensions of the Hoare logic.

We have done some significant works on the definition of data parallelism . In particular, a crucial problem for the good definition of this programming model, is the semantics of the various syntactic constructs for data locality. We proposed a semantic domain which unifies two concepts: alignment(in a data-parallel language like HPF) and shape(in the data-parallel extensions of C).

We defined a “toy” language, called PEI, that is made of a small number of syntactic constructs. One of them, called change of basis, allows the programmer to exhibit parallelism in the same way as a placement or a scheduling directive .

Programming models for multicore architectures.

The multicore emergence questions the existing parallel programming models.

For example, with the programming model supported by OpenMP, it is difficult to master both correctness and efficiency of programs. Indeed, this model does not allow programmers to take optimal advantage of the memory hierarchy and some OpenMP directives may induce unpredictable performances or incorrect results.

Nowadays, some new programming models are experienced to help at designing both efficient and correct programs for multicores. Because memory is shared by the cores and its hierarchy has some distributed parts, some works aim at defining a hybrid model, between task parallelism and data parallelism. For example, languages like UPC (Unified Parallel C) http:// upc. gwu. eduor Chapel http:// chapel. cs. washington. educombine the advantages of several programming paradigms.

In particular, the model of memory transactions (or transactional memory ) retains much attention since it offers the programmer a simple operational semantics including a mutual exclusion mechanism which simplifies program design. However, much work remains to define the precise operational meaning of transactions and the interaction with the other languages features . Moreover, this model leaves the compiler a lot of work to reach a safe and efficient execution on the target architecture. In particular, it is necessary to control the atomicity of transactions and to prove that code transformations preserve the operational semantics.

Refinement of programs.

Refinement , is a classical approach for gradually building correct programs: it consists in transforming an initial specification by successive steps, by verifying that each transformation preserves the correctness of the previous specification. Its basic principle is to derive simultaneously a program and its own proof. It defines a formal framework in which some rules and strategies can be elaborated to transform specifications written by using the same formalism. Such a set of rules is called a refinement calculus.

Unity and Gamma are classical examples of such formalisms, but they are not especially designed for refining programs for multicore architectures. Each of these formalisms is associated with a computing model and thus each specification can be viewed as a program. Starting with an initial specification, a proof logic allows a user to derive a specification which is more suited to the target architecture.

Refinement applies for the programming of a large range of problems and architectures. It allows to pass the limitations of the polyhedral model and of automatic parallelization. We designed a refinement calculus to build data parallel programs .

Main objective: formal proof of analyses and transformations

Our main objective consists in certifying the critical modules of our optimization tools (the compiler and the virtual machine). First we will prove the main loop transformation algorithms which constitute the core of our system.

The optimization process can be separated into two stages: the transformations consisting in optimizing the sequential code and in exhibiting parallelism, and those consisting in optimizing the parallel code itself. The first category of optimizations can be proved within a sequential semantics. For the other optimizations, we need to work within a concurrent semantics. We expect the first stage of optimizations to produce data-race free code. For the second stage of optimizations, we will first assume that the input code is data-race free. We will prove those transformations using Appel's concurrent separation logic . Proving transformations involving program which are not data-race free will constitute a longer term research goal.

Proof of transformations in the polyhedral model

The main code transformations used in the compiler and the virtual machine are those carried out in the polyhedral model , . We will use the Coq proof assistant to formalize proofs of analyses and transformations based on the polyhedral model. In , Cachera and Pichardie formalized nested loops in Coq and showed how to prove propertiesof those loops. Our aim is slightly different as we plan to prove transformationsof nested loops in the polyhedral model. We will first prove the simplest unimodular transformations, and later we will focus on more complex transformations which are specific to multicore architectures. We will first study scheduling optimizations and then optimizations improving data locality.

Validation under hypothesis

In order to prove the correction of a code transformation $T$ it is possible to:

prove that $T$ is correct in general, i.e., prove that for all $x$ , $T (x)$ is equivalent to $x$ .

prove a posteriorithat the applied transformation has been correct in the particular case of a code $c$ .

The second approach relies on the definition of a program called validatorwhich verifies if two pieces of program are equivalent. This program can be modeled as a function $V$ such that, given two programs $c_{1}$ and $c_{2}$ , $V (c_{1}, c_{2}) = t r u e$ only if $c_{1}$ has the same semantics as $c_{2}$ . This approach has been used in the field of optimizations certification , . If the validator itself contains a bug then the certification process is broken. But if the validator is proved formally (as it was achieved by Tristan and Leroy for the Compcert compiler , ) then we get a transformed program which can be trusted in the same way as if the transformation is proved formally.

This second approach can be used only for the effective parallelism, when the static analysis provides enough information to parallelize the code. For the hypothetical parallelism, the necessary hypotheses have to be verified at run time.

For instance, the absence of aliases in a piece of code is difficult to decide statically but can be more easily decided at run time.

In this framework, we plan to build a validator under hypotheses: a function $V^{'}$ such that, given two programs $c_{1}$ and $c_{2}$ and an hypothesis $H$ , if $V^{'} (c_{1}, c_{2}, H) = t r u e$ , then $H$ implies that $c_{1}$ has the same semantics as $c_{2}$ . The validity of the hypothesis $H$ will be verified dynamically by the virtual machine. This verification process, which is part of the virtual machine, will have to be proved as correct as well.

Rejecting incorrect parallelizations

The goal of the project is to exhibit potential parallelism. The source code can contain many sub-routines which could be parallelized under some hypothesis that the static analysis fails to decide. For those optimizations, the virtual machine will have to verify the hypotheses dynamically. Dynamically dealing with the potential parallelism can be complex and costly (profiling, speculative execution with rollbacks). To reduce the overhead of the virtual machine, we will have to provide efficient methods to rule out quickly incorrect parallelism. In this context, we will provide hypotheses which are easy to check dynamically and which can tell when a transformation cannot be applied, i.e., hypotheses which are sufficient conditions for the non-validity of an optimization.

Application Domains Application Domains

Performance being our main objective, our developments' target applications are characterized by intensive computation phases. Such applications are numerous in the domains of scientific computations, optimization, data mining and multimedia.

Applications involving intensive computations are necessarily high energy consumers. However this consumption can be significantly reduced thanks to optimization and parallelization. Although this issue is not our prior objective, we can expect some positive effects for the following reasons:

Program parallelization tries to distribute the workload equally among the cores. Thus an equivalent performance, or even a better performance, to a sequential higher frequency execution on one single core, can be obtained.

Memory and memory accesses are high energy consumers. Lowering the memory consumption, lowering the number of memory accesses and maximizing the number of accesses in the low levels of the memory hierarchy (registers, cache memories) have a positive consequence on execution speed, but also on energy consumption.

Software PolyLib

PolyLib http:// icps. u-strasbg. fr/ PolyLibis a C library of polyhedral functions, that can manipulate unions of rational polyhedra of any dimension, through the following operations: intersection, difference, union, convex hull, simplify, image and preimage. It was the first to provide an implementation of the computation of parametric vertices of a parametric polyhedron, and the computation of an Ehrhart polynomial (expressing the number of integer points contained in a parametric polytope) based on an interpolation method.

It is used by an important community of researchers (in France and the rest of the world) in the area of compilation and optimization using the polyhedral model. Vincent Loechner is the maintainer of this software. It is distributed under GNU General Public License version 3 or later, and it has a Debian package maintained by Serge Guelton (Symbiose Projet, IRISA).

ZPolyTrans

ZPolyTrans http:// ZPolyTrans. gforge. inria. fris a C library and a set of executables, that permits to compute the integer transformation of a union of parametric $ℤ$ -polyhedra (the intersection between lattices and parametric polyhedra), as a union of parametric $ℤ$ -polyhedra. The number of integer points of the result can also be computed. It is build upon PolyLib and Barvinok library. This work is based on some theoretical results obtained by Rachid Seghir and Vincent Loechner, that will be published in ACM TACO in 2011 .

It allows for example to compute the number of solutions of a Presburger formula by eliminating existancial integer variables, or to compute the number of different data accessed by some array accesses contained in an affine parametric loop nest.

The authors of this software are Rachid Seghir (Univ. Batna, Algeria) and Vincent Loechner. It is distributed under GNU General Public License version 3 or later.

NLR

We have developed a program implementing our loop-nest recognition algorithm, detailed in . This standalone, filter-like application takes as input a raw trace and builds a sequence of loop nests that, when executed, reproduce the trace. It is also able to predict forthcoming values at an arbitrary distance in the future. Its simple, text-based input format makes it applicable to all kinds of data. These data can take the form of simple numeric values, or have more elaborate structure, and can include symbols. The program is written is standard ANSI C. The code can also be used as a library.

We have used this code to evaluate the compression potential of loop nest recognition on memory address traces, with very good results. We have also shown that the predictive power of our model is competitive with other models on average. The software is available upon request to anybody interested in trying to apply loop nest recognition. It has been distributed to a dozen of colleagues around the world.

We plan on using this software as the base for a new tool we currently design, for the analysis of parallel traces.

Dynamic version selector

We are developing a toolchain to automatically select between different versions of parallel loop nests, as described in subsection . It generates the profiling code and selection code from a loop nest source code and different schedules, expressed in the CLooG format.

Benoît Pradelle (PhD) wrote this toolchain, based on python scripts. It is not yet distributed.

Binary files decompiler

Our research on efficient memory profiling has lead us to develop a sophisticated decompiler. This tool analyzes x86-64 binary programs and libraries, and extracts various structured representations of the code. It works on a routine per routine basis, and first builds a loop hierarchy to characterize the overall structure of the algorithm. It then puts the code into Static Single Assignment (SSA) form to highlight the fine-grain data-flow between registers and memory. Building on these, it performs the following analyzes:

All memory addresses are expressed as symbolic expressions involving specific versions of register contents, as well as loop counters. Loop counter definitions are recovered by resolving linearly incremented registers and memory cells, i.e., registers that act as induction variables.

Most conditional branches are also expressed symbolically (with registers, memory contents, and loop counters). This captures the control-flow of the program, but also helps in defining what amounts to loop “trip-counts”, even though our model is slightly more general, because it can represent any kind of iterative structure.

This tool embodies several passes that, as far as we know, do not exist in any existing similar tool. For instance, it is able to track data-flow through stack slots in most cases. It has been specially designed to extract a representation that can be useful in looking for parallel (or parallelizable) loops . It is the basis of several of our studies.

Because binary program decompilation is especially useful to reduce the cost of memory profiling, our current implementation is based on the Pin binary instrumenter. It uses Pin's API to analyze binary code, and directly interfaces with the upper layers we have developed (e.g., program skeletonization, or minimal profiling). However, we have been careful to clearly decouple the various layers, and to not use any specific mechanism in designing the binary analysis component. Therefore, we believe that it could be ported with minimal effort, by using a binary file format extractor and a suitable binary code parser. It is also designed to abstract away the detailed instruction set, and should be easy to port (even though we have no practical experience in doing so).

We feel that such a tool could be useful to other researchers, because it makes binary code available under abstractions that have been traditionally available for source code only. If sufficient interest emerges, e.g., from the embedded systems community, or from researchers working on WCET, or from teams working on software security, we are willing to distribute and/or to help make it available under other environments.

Dynamic dependency analyser

We have recently started developing a dynamic dependence analyzer. Such a tool consumes the trace of memory (or object) accesses, and uses the program structure to list all the data dependences appearing during execution. Data dependences in turn are central to the search for parallel sections of code, with the search for parallel loops being only a particular case of the general problem. Most current works of these questions are either specific to a particular analysis (e.g., computing dependence densities to select code portions for thread-level speculation), or restricted to particular forms of parallelism (e.g., typically to fully parallel loops). Our tool tries to generalize existing approaches, and focuses on the program structures to provide helpful feedback either to a user (as some kind of “smart profiler”), or to a compiler (for feedback-directed compilation). For example, the tool is able to produce a dependence schema for a complete loop nest (instead of just a loop). It also targets irregular parallelism, for example analyzing a loop execution to estimate the expected gain of parallelization strategies like inspector-executor.

We have developed this tool in relation to our minimal profiling research project. However, the tool itself has been kept independent of our profiling infrastructure, getting data from it via a well-defined trace format. This intentional design decision has been motivated by our work on distinct execution environments: first on our usual x86-64 benchmark programs, and second on less regular, more often written in Java, real-world applications. The latter type of applications is likely the one that will most benefit from such tools, because their intrinsic execution environment does not offer enough structure to allow effective static analysis techniques. Parallelization efforts in this context will most likely rely on code annotations, or specific programming language constructs. Programmers will therefore need tools to help them choose between various constructs. Our tool has this ambition. We already have a working tool-chain for C/C++/Fortran programs (or any binary program). We are in the process of developing the necessary infrastructure to connect the dynamic dependence profiler to instrumented Java programs. Other managed execution environments could be targeted as well, e.g., Microsoft's .Net architecture, but we have no time and/or workforce to devote to such time-consuming engineering efforts.

VMAD software and LLVM

For dynamic analysis and optimization of programs, we are developing a virtual machine called VMAD, and specific passes to the LLVM compiler suite, plus a modified Clang frontend. It is fully described in subsection .

We implemented for now a memory access predictor in loop nests, based on the computation of linear interpolation functions. The profiling is very fast compared to other existing tools, as it samples only the first few iterations of each loop in the nest, then it is deactivated to return to the original, faster version. Other tools like PIN or PEBIL do not support such activation/deactivation mechanism.

New annotations for the final user, taken as input by LLVM, and new VMAD modules will be developed, as these tools have been designed to be very evolving.

Alexandra Jimborean (PhD), Matthieu Herrmann (Master student) and Luis Mastrangelo (Master student) are the main contributors of this software. It is not yet distributed.

Polyhedral prover Nicolas Magaud Julien Narboux Éric Violard correspondant

We are currently developing a formal proof of program transformations based on the polyhedral model. We use the CompCert verified compiler as a framework. This tool is written in the specification language of Coq. It is connected to the activity described in section .

New Results VMAD and LLVM

The goal is to provide a set of annotations (pragmas) that the user can insert in the source code to perform low level analyses (profiling) or optimizations (dynamic parallelization for example).

We are developing a virtual machine handling advanced dynamic analyses and transformations of programs. VMAD is organized as a sequence of basic operations, where external modules associated to specific strategies are dynamically loaded when required. The program binary files handled by VMAD are previously instrumented at compile time to include necessary data, instrumentation instructions and callbacks to the virtual machine. Dynamic information, such as memory locations of launched modules, are patched at startup in the binary file. The LLVM compiler has been extended to automatically instrument programs to meet the requirements both of VMAD and of the handled/chosen analysis and transformation strategies.

VMAD uses sampling and multi-versioning to limit the instrumentations time overhead. At runtime, targeted codes are launched by successive chunks that can be either original, instrumented or optimized/parallelized versions. After each chunk execution, decisions can be taken relatively to the current optimization strategy. At this time, VMAD is handling advanced memory access profiling through linear interpolation of the addresses, dynamic dependency analysis and version selction. The last developments are focusing on speculative polyhedral parallelization.

The profiling strategy interpolating the memory addresses accessed in a loop nest has been run on some of the SPEC2006 and Pointer Intensive benchmark suites, showing a very low time overhead, in most cases. More details are available in the research reports and , and publications and .

Dynamic version selector

Adaptive version selection between different parallel versions of code is necessary when the execution context of a program is not known. The execution contexts includes all or some of these possibly variable parameters: the target architecture, the load of the computer at execution time, and the input data.

We have developed a framework handling loops in the polyhedral model, that is able to take a runtime decision about which version to execute. It is based on :

the generation of different code versions of a loop nest;

an install-time profiling to take into account the architecture parameters, that builds a parametric ranking table between the versions;

a runtime selection, predicting the load balance and the execution time of each code version, before executing the best one.

We showed that different versions of a code are required on several polyhedral loop nest benchmarks, depending on both the target architecture and the input data. And we showed speedups compared to any statically chosen version in all execution contexts. More details are available in publication .

Binary parallelization

Our work on parallelizing binary programs has continued in 2011, with several new results. The general principle is to analyze the binary code and extract a model of the most intense loops. The model has to include everything that is related to memory access, and also some part of the computations done in registers. Once a suitable model is extracted, it can be used to derive a new scheduling for each targeted loops, optimizing various criteria: this is where polyhedral techniques are used, providing algorithms to optimize locality, parallelism, or both at the same time. After a new scheduling is computed, the transformed code is generated by a polyhedron-scanning algorithm. Our approach relies on an intermediate representation whose emphasis is on memory accesses, hiding, i.e., outliningall low-level details and retaining only what is needed by the parallelization component: we use raw C constructs, and macros to denote outlined code. Starting with the executable program, a first phase raises the code into our intermediate representation. The second phase uses a stock parallelizing component, producing a transformed C programs. The last phase lowers this intermediate representation into a new binary executable. The system then uses a run-time monitoring component (generated automatically at the same time as the parallel version) that redirects execution to the transformed loops whenever appropriate.

This year's activity on this topic in our team has started by finalizing and presenting a paper at the IMPACT workshop , held during CGO'2011, in Chamonix (France). This workshop focuses on tools and techniques based on the polyhedral model. It has been interesting to hear the various reactions and remarks of researchers attending our presentation: the general position is that our work opens new perspectives on the use of the polyhedral model, and avoids to need to have a complete polyhedral tool-chain. Another major aspect is the fact that our 3-phase strategy clearly separates the polyhedral part of the whole process, in essence providing a basis for a polyhedral programming language that is slightly more general that what was considered before.

The work on this topic has continued on two directions. The first was to effectively abstract the parallelizing phase. This has been done by demonstrating the use of two distinct parallelizers: the first is PLUTO, a polyhedral locality optimizer and parallelizer, and the second is CETUS, a “simple” parallelizer. The second new direction was directed by the complexity of typical “real-world” executable programs. It has consisted in developing new dependence analysis and parallelization techniques, handling more general classes of programs. This current is currently submitted for publication.

These research results will be presented at the forthcoming HiPEACconference, to be held in Paris in January 2012. A full-length paper has been accepted for publication in ACM Transactions on Architecture and Code Optimizationsome time in 2012.

Modeling the dynamic behavior of executable programs

Modeling the dynamic behavior of a given program is useful for several reasons. First, it is a particular way of profiling the program, targeting non trivial characteristics of the execution. As such, it helps programmers understand the behavior of the program, and hopefully helps them optimizing it. Second, the results can be used in a compiler, using run-time information to drive static optimizations. This path has seen considerable development when it comes, for example, to sequence basic blocks so as to leverage branch predictors and/or optimize the usage of instruction caches. We ambition to take it one step further, using run-time information to help a compiler in the task of auto-parallelization. Third, modeling can be used on-line, during the execution of the program, to drive the use of dynamic optimizations.

Our first achievement in 2011 has been the presentation of our paper at the International Symposium on Performance Analysis of Software and Systems . The paper describes an approach that e have called program skeletonization. The basic idea is to perform a static analysis of the code under scrutiny, and locate a small number of register assignments that completely determine the set of memory addresses the program will access. By instrumenting these elementary value assignments and extracting the ensuing computations of addresses, the amount of instrumentation can be dramatically reduced, at the cost of offloading some computations to the profiler. On average, this provides a significant gain in the time needed to obtain a memory trace, and is independent on the particular application, e.g., cache simulation, data race detection, and so on.

Our second main research direction on this topic has been the characterization of semi-regular memory accesses. Semi-regular accesses are caused by the traversal of a data structure linking successive memory cells in no particular memory order, i.e., a linked list or a tree. We have developed a modeling algorithm that is able to detect that a set of instructions perform irregular accesses that are actually highly correlated, differing only by an affine function of the enclosing loop indices. This has several implications in terms of potential optimizations. First it exhibits a kind of abstract iterator, that can later be handled, e.g., by inspector/executor techniques. Second, it reduces the number of potential dependencies that have to be tested.

The third, most recent, research project that we have started this year is the analysis of traces of parallel programs, currently MPI programs. Parallel traces offer new research challenges, because they contain events that are only partially ordered. We have extended our loop nest recognition algorithm to handle parallel traces. Our preliminary results show that this algorithm is highly effective in extracting communication patterns. This has a number of potential applications that we plan to study in the coming months.

Dynamic dependence analysis

We have started a research project on dynamic dependence analysis. The principle is to observe an execution of a given program and collect dependence information. This form of profiling is similar to memory profiling, except that to goal is to directly produce data dependencies. During collection, data dependencies are abstracted into dependence graphs which, in turn, give enough information to decide whether a given code portion is parallel and, if the answer is positive, precisely constrain the set of applicable program transformations. Our implementation currently uses our own profiling infrastructure to obtain data from a running program, and models the sequence of run-time data dependencies in a unique framework. The system is of course sensitive to the fact that the input data is representative of typical inputs: the conclusion it draws cannot be applied without resorting to speculation. Currently, we restrict the system to a parallelization assistant, whose result is a set of suggestion that the programmer is free to follow or ignore.

A fundamental characteristic of a parallelization assistant is the class of parallel constructions that it is able to extract from run-time data. We have designed the profiling component and the analysis algorithm such as to be able to represent the dependencies with various abstractions, from simple boolean dependencies to full dependence polyhedra. However, even the simplest dependence model provides useful hints to parallelize a given program. In this case, our tool flags every loop in the program as either intrinsically sequential, or potentially parallel. To make this widely usable, we have decided to target common parallel programing constructions, namely OpenMP directives. In turns out that our design is flexible enough to let the system also target loops that become parallel after simple privatization transformations. The result of profiling is thus a set of complete OpenMP directives, making the loop parallel but also instructing the compiler to take special measure for local data that (falsely) renders loop iterations dependent.

Because the general design of our dynamic dependence analysis system is generic and modular, we plan to develop and distribute software implementing our approach. This project, called Parwiz, has received support from INRIA under the form of an ADT ( Action de Développement Technologique). The funding should let us hire an engineer for two years. Unfortunately, we have not been able to find a suitable candidate in 2011, and plan to continue the recruitment process in 2012. In the meantime, we have supervised en Argentinian student, José Cacherosky, during 4 months (from July to October 2011), as an intern with funding from the INRIA International Internshipsprogram. The goal of the internship was to extend our framework (developed mainly to work from binary programs) to another execution environment, namely the Java Virtual Machine. Most of the time has been spent on developing an instrumentation infrastructure, but José Cacherosky has also started the implementation of some of the parallelism detection algorithms. We plan to continue this work soon, and provide a tool that could work in several, very distinct environments. Finally, Fabrice Rastello, research scientist at the École Normale Supérieure de Lyon(with the INRIA team COMPSYS), has expressed interest in collaborating with us on this research project.

Dealing with arithmetic overflows in the polyhedral model Nicolas Magaud Julien Narboux Éric Violard

Our goal in collaboration with Alexandre Pilkiewicz, PhD student, and François Pottier, senior researcher at INRIA, is to prove formally the correctness of a compiler based on the polyhedral model and to integrate it in the Compcert compiler.

But as the polyhedral transformations apply to affine loop nests in a mathematical framework where each loop variable is considered to be a mathematical integer, and not a machine integer, we must therefore warrant that no arithmetic overflow occurs when the considered loop nests are executed.

We proposed a solution to produce a compiler which does not ignore the problem of overflows. Our solution consists in generating a formula which captures the presence of overflows in the program, then asking to an external tool (i.e. the iscc calculator), for a sufficient condition about the parameters which implies the absence of overflows. Finally we check this condition dynamically. If the condition holds we can use the optimized version of the program. If it does not, in order to preserve the semantics of the program we keep the original version.

Figure illustrates our solution for overcoming the problem of arithmetic overflows and for ensuring the correctness of polyhedral transformations. In addition to the polyhedral optimizer, our compiler uses an oracle and a validator.

The oracle returns a boolean expression ( $b$ ) which denotes a sufficient condition to ensure that both the original program ( $o r g$ ) and the optimized program ( $o p t$ ) do not produce any overflow. Our transformation then builds a program that we call the resulting program, of the shape $If b then o p t else o r g$ . It dynamically evaluates the boolean expression $b$ and executes $o r g$ , i.e. the original program, if the condition is not fulfilled or $o p t$ , i.e. the optimized program, if the condition is true. The resulting program is then transmitted to the validator.

The validator is a function which takes the original program ( $o r g$ ), the optimized program ( $o p t$ ) and the resulting program, and returns a boolean: if it returns true, then the resulting program is equivalent to the original one and our compiler therefore produces the resulting program.

We now have to formally prove the validator using the Coq proof assistant.

Partnerships and Cooperations National Initiatives

Philippe Clauss, Alain Ketterlin and Vincent Loechner are involved in the proposition of an INRIA Large Wingspan Project ( Action d'Envergure Nationale) entitled “Software for multicores and hardware accelerators” and regrouping several french teams doing researches in compilers, parallel computing and program optimization. Philippe Clauss shares the head of the project with Gilles Muller of the INRIA REGAL team. A new version of the project will be submitted to INRIA at the end of 2011.

International Initiatives INRIA Associate Teams ANCOME

Title: Memory and applications memory behavior

INRIA principal investigator: Philippe Clauss

International Partner:

Institution: Universidad de Buenos Aires (Argentina)

Laboratory: Departamento de Computacion, Facultad de Ciencias Exactas y Naturales

Duration: 2011 - 2013

See also: http:// lafhis. dc. uba. ar/ wiki/ index. php/ EA-Ancome

This associate team focuses on developing original methods for the analysis of programs memory behavior, in particular in the context of applications using dynamic memory allocation. The proposed approaches consists of analyzing and modeling the runtime behavior, where extracted properties are then verified thanks to static analysis processes. Thus pure static approaches limits are overpassed. Further, the case of multi-threaded applications run on multi-core architectures is studied in order to elaborate and extend our analysis techniques and to extract properties specific to this context. The issues are mainly concerned with the conception of real-time applications using dynamic memory allocation.

INRIA International Partners

Rachid Seghir, researcher and teacher at the University of Batna, Algeria, works in close collaboration with Vincent Loechner. He is the co-author of the ZPolyTrans software and of a forthcoming paper in the journal ACM Transactions on Architecture and Code Optimization .

Benoît Meister, Managing Engineer at Reservoir Labs, New York, USA, has collaborated with Vincent Loechner and Rachid Seghir on a forthcoming publication .

Jean Christophe Beyler, Senior HPC Engineer at Intel and in the International Exascale project, University of Versailles, France, is experimenting energy saving strategies using the VMAD framework of the CAMUS team.

Sven Verdoolaege, researcher at the Leiden Institute of Advanced Computer Science, Leiden, The Netherlands, has collaborated with Philippe Clauss, Vincent Loechner and Diego Garbervetsky in the writing of a chapter of the book entitled Energy-Aware Memory Management for Embedded Multimedia Systems: A Computer-Aided Approach .

Visits of International Scientists Visits

Sergio Yovine (from Jun 6 2011 until Jun 12 2011)

Institution: Universidad de Buenos Aires (Argentina), EA INRIA Ancome

Diego Garbervetsky (from Apr 1 2011 until Apr 10 2011 and from Aug 1 2011 until Sept 2 2011)

Institution: Universidad de Buenos Aires (Argentina), EA INRIA Ancome

Internships

Luis Mastrangelo (from Mar 2011 until Aug 2011)

Subject: A Virtual Machine for Automatic Program Parallelization

Institution: Universidad de Buenos Aires (Argentina)

Bruno Cuervo Parrino (from May 2011 until Oct 2011)

Subject: Formalizing a new validation mechasnism under assumptions for speculative parallelism

Institution: Universidad de Buenos Aires (Argentina)

Jose Cacherosky (from Jul 2011 until Dec 2011)

Subject: Dynamic dependence profiling for Java

Institution: Universidad de Buenos Aires (Argentina)

Participation In International Programs

The CAMUS team is associated to the CNRS-CONICET Associated International Laboratory France-Argentina INFINIS https:// dri-dae. cnrs-dir. fr/ spip. php?article3009(INformatique Fondamentale, logIque, laNgages, vérIfication et Systèmes) inaugurated in December 2011.

Dissemination Animation of the scientific community

Vincent Loechner is one of the organizers of the Second International Workshop on Polyhedral Compilation Techniques, IMPACT 2012 http:// impact. gforge. inria. fr/ impact2012, that will be held in January 2012 in Paris, in conjunction with the international conference HiPEAC. He also participates, with Philippe Clauss, to its program committee.

Alexandra Jimborean presented her work about code tracking and cloning in the LLVM compiler at the First LLVM European User Group Meeting http:// llvm. org/ devmtg/ 2011-09-16in London, organized and sponsored by ARM Ltd., Sept. 16 2011.

Philippe Clauss has reviewed papers for the international conference DATE 2012, for the international workshop IMPACT 2011 and the french journal TSI. He has also reviewed projects for the ANR, and the Ile-de-France and Aquitaine Regions.

Philippe Clauss participated to the following PhD jurys in 2011:


Date	Candidate	Place	Role
Feb. 14	Khamar SAJJAD	Univ. Versailles	Rewiever
Jul. 4	Konrad TRIFUNOVIC	Univ. Paris-Sud	Rewiever
Sept. 30	Antoniu POP	ENSM Paris	Rewiever
Nov. 10	Khaled HAMIDOUCHE	Univ. Paris-Sud	Rewiever

Julien Narboux participated to the PhD jury of Tuan Minh Pham (INRIA Sophia-Antipolis) in 2011 as an examiner.

Teaching

$•$ Philippe Clauss, Université de Strasbourg, France


Intitulé du cours	Nombre d'heures	Niveau
Architecture des ordinateurs	27	L2
Compilation	64	M1
Compilation avancée	30	M1
Système et programmation temps-réel	25	M1
OS embarqués et réseaux industriels et de terrain	35	M2
Parallélisme	15	M2
1 Encadrement TER	10	M1
1 Encadrement stage Master	50	M2
2 Encadrements doctorat	200	Doctorat

$•$ Vincent Loechner, Université de Strasbourg, France


Intitulé du cours	Nombre d'heures (éq.TD)	Niveau
Structures de données et algorithmes	15	L2
Programmation système et réseau	52	L2
Fondements des systèmes d'exploitation	38	L3
Langage interprété	40	M1
Compilation avancée	8	M1
Parallélisme	15	M2
Parallélisme	30	école d'ingénieur
Encadrement TER (1)	10	M1
Encadrement stage Master (1)	50	M2
Encadrement doctorat (1)	100	Doctorat

$•$ Alain Ketterlin, Université de Strasbourg, France


Intitulé du cours	Nombre d'heures	Niveau
Bases de données avancées	40	M2
Ingénierie de la preuve	18	M1
Programmation fonctionnelle	22	L2
Encadrement d'apprentis	20	M1

$•$ Julien Narboux, Université de Strasbourg, France


Intitulé du cours	Nombre d'heures (éq. TD)	Niveau
Logique et Programmation Logique	61	L2
Logique	34	1a
Modèles de Calcul	40	L1
Techniques de Programmation	45	L2
Ingénierie de la preuve	27	M1
Certification du Logiciel	36	M2
Constructions et Preuves en Géométrie	6	M2
1 Encadrement Projet 150h	10	M1
Encadrement stage ENSIIE	30	1a
Encadrement stage Master	50	M2

$•$ Éric Violard, Université de Strasbourg, France


Intitulé du cours	Nombre d'heures (éq. TD)	Niveau
Programmation Fonctionnelle	61	L2
Programmation Système & Réseau	36	L2
Compilation	42	M1
Sémantique	45	M1

$•$ Nicolas Magaud, Université de Strasbourg, France


192 heures statutaires en L2 informatique, L3 math, Master Informatique Intitulé du cours	Nombre d'heures (éq. TD)	Niveau
Projet Professionnel Personalisé	20	L2
Structures de Données - Algorithmes	33	L2
Ingénierie de la preuve	12	M1
Preuve et Certification du Logiciel	36	M2
Algorithmique et Programmation	46	ENSIIE-Strasbourg 1A
Algorithmique et Structures de données	30	L3 Math
Génie Logiciel	30	M1 Chemo-informatique

PhD:

PhD : Benoît Pradelle, Static and Dynamic Polyhedral Compilation Methods for the Execution in Multi-core Environments , Université de Strasbourg, December the 20th 2011, Philippe Clauss

PhD in progress : Alexandra Jimborean, Advanced parallelization tools for multicore programming, October 2009, Philippe Clauss and Vincent Loechner

PhD in progress : Jean-François Dollinger, Dynamic optimizations for GPUs, September 2011, Philippe Clauss and Vincent Loechner

Performance driven data cache prefetching in a dynamic software optimization system Jean Christophe Beyler J. C. Philippe Clauss P. ICS '07: Proceedings of the 21st annual international conference on Supercomputing New York, NY, USA ACM 2007 202–209 http:// doi. acm. org/ 10. 1145/ 1274971. 1275000 A meta-predictor framework for prefetching in object-based DSMs Jean Christophe Beyler J. C. Michael Klemm M. Philippe Clauss P. Michael Philippsen M. Concurr. Comput. : Pract. Exper. 21 September 2009 1789–1803 http:// dx. doi. org/ 10. 1002/ cpe. v21:14 Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs Philippe Clauss P. ICS '96: Proceedings of the 10th international conference on Supercomputing New York, NY, USA ACM 1996 278–285 http:// doi. acm. org/ 10. 1145/ 237578. 237617 Symbolic polynomial maximization over convex sets and its application to memory requirement estimation Philippe Clauss P. Federico Javier Fernández F. J. Diego Garbervetsky D. Sven Verdoolaege S. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 8 Aug 2009 983-996 http:// hal. inria. fr/ inria-00504617 AR BE Parametric Analysis of Polyhedral Iteration Spaces Philippe Clauss P. Vincent Loechner V. J. VLSI Signal Process. Syst. 19 2 1998 179–194 http:// dx. doi. org/ 10. 1023/ A:1008069920230 Philippe Clauss P. Irina Tchoupaeva I. A Symbolic Approach to Bernstein Expansion for Program Analysis and Optimization LNCS 2985 Springer April 2004 120-133 Prediction and trace compression of data access addresses through nested loop recognition Alain Ketterlin A. Philippe Clauss P. 6th annual IEEE/ACM international symposium on Code generation and optimization États-Unis Boston ACM April 2008 94-103 http:// dx. doi. org/ 10. 1145/ 1356058. 1356071 http:// hal. inria. fr/ inria-00504597/ en Precise data locality optimization of nested loops Vincent Loechner V. Benoît Meister B. Philippe Clauss P. Journal of Supercomputing 21 1 January 2002 37–76 Kluwer Academic Pub. Parameterized Polyhedra and their Vertices Vincent Loechner V. Doran K. Wilde D. K. International Journal of Parallel Programming 25 6 December 1997 Counting Integer Points in Parametric Polytopes Using Barvinok's Rational Functions Sven Verdoolaege S. Rachid Seghir R. Kristof Beyls K. Vincent Loechner V. Maurice Bruynooghe M. Algorithmica 48 1 2007 37–66 http:// dx. doi. org/ 10. 1007/ s00453-006-1231-0 A Semantic Framework to Address Data Locality in Data Parallel Languages Éric Violard É. Parallel Computing 30 1 2004 139-161 Static and Dynamic Polyhedral Compilation Methods for the Execution in Multi-core Environments Benoît Pradelle B. Université de Strasbourg December 2011 PhD thesis Philippe Clauss P. Diego Garbervetsky D. Vincent Loechner V. Sven Verdoolaege S. Polyhedral Techniques for Parametric Memory Requirement Estimation Energy-Aware Memory Management for Embedded Multimedia Systems: A Computer-Aided Approach Florin Balasa F. Dhiraj Pradhan D. Taylor & Francis 2011 Integer Affine Transformations of Parametric Z-polytopes and Applications to Loop Nest Optimization Rachid Seghir R. Vincent Loechner V. Benoît Meister B. 1544-3566 ACM Transactions on Architecture and Code Optimization 2011 http:// hal. inria. fr/ inria-00582388/ en to appear in 2011. DZ US Formalization of Wu's simple method in Coq Jean-David Genevaux J.-D. Julien Narboux J. Pascal Schreck P. Jean-Pierre Jouannaud J.-P. Zhong Shao Z. CPP 2011 First International Conference on Certified Programs and Proofs Kenting, Taiwan, Province Of China Lecture Notes in Computer Science Springer-Verlag December 2011 http:// hal. inria. fr/ inria-00618745/ en International Conference on Certified Programs and Proofs 1 CPP VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs Alexandra Jimborean A. Matthieu Herrmann M. Vincent Loechner V. Philippe Clauss P. IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS Austin, United States IEEE April 2011 http:// hal. inria. fr/ inria-00544501/ en International Symposium on Performance Analysis of Systems and Software 2011 ISPASS Handling Multi-Versioning in LLVM: Code Tracking and Cloning Alexandra Jimborean A. Vincent Loechner V. Philippe Clauss P. WIR 2011: Workshop on Intermediate Representations, in conjunction with CGO 2011 Chamonix, France Florent Bouchez, Sebastian Hack, Eelco Visser April 2011 http:// hal. inria. fr/ inria-00572785/ en Workshop on Intermediate Representations 2011 WIR Efficient Memory Tracing by Program Skeletonization Alain Ketterlin A. Philippe Clauss P. IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS Austin, United States IEEE April 2011 http:// hal. inria. fr/ inria-00544497/ en International Symposium on Performance Analysis of Systems and Software 2011 ISPASS A Coq-based Library for Interactive and Automated Theorem Proving in Plane Geometry Tuan Minh Pham T. M. Yves Bertot Y. Julien Narboux J. The 11th International Conference on Computational Science and Its Applications (ICCSA 2011) Santander, Spain Lecture Notes in Computer Science 6785 Springer-Verlag 2011 368-383 http:// hal. inria. fr/ inria-00584918/ en International Conference on Computational Science and Its Applications 11 ICCSA Adaptive Runtime Selection of Parallel Schedules in the Polytope Model Benoît Pradelle B. Philippe Clauss P. Vincent Loechner V. High Performance Computing Symposium Boston, États-Unis ACM/SIGSIM April 2011 http:// hal. inria. fr/ inria-00564311/ en/ High Performance Computing Symposium 19 HPCS Transparent Parallelization of Binary Code Benoît Pradelle B. Alain Ketterlin A. Philippe Clauss P. First International Workshop on Polyhedral Compilation Techniques, IMPACT 2011, in conjunction with CGO 2011 Chamonix, France Christophe Alias, Cédric Bastoul April 2011 http:// hal. inria. fr/ inria-00572797/ en International Workshop on Polyhedral Compilation Techniques 1 IMPACT Using Machine Learning to Focus Iterative Optimization F. Agakov F. E. Bonilla E. J. Cavazos J. B. Franke B. G. Fursin G. M. F. P. O'Boyle M. F. P. J. Thomson J. M. Toussaint M. C. K. I. Williams C. K. I. CGO '06: Proceedings of the International Symposium on Code Generation and Optimization Washington, DC, USA IEEE Computer Society 2006 295–305 http:// dx. doi. org/ 10. 1109/ CGO. 2006. 37 On the Correctness of Refinement Steps in Program Development R.J.R. Back R. University of Helsinki 1978 Ph. D. Thesis The <span class="smallcap" align="left">Gamma</span>Model and its Discipline of Programming J.-P. Banâtre J.-P. D. Le Métayer D. Science of Computer Programming 15 1 1990 55-79 The Coq Proof Assistant Reference Manual : Version 6.1 Bruno Barras B. Samuel Boutin S. Cristina Cornes C. Judicael Courant J. Jean-Christophe Filliatre J.-C. Eduardo Gimenez E. Hugo Herbelin H. Gerard Huet G. Cesar Munoz C. Chetan Murthy C. Catherine Parent C. Christine Paulin-Mohring C. Amokrane Saibi A. Benjamin Werner B. 1997 A compiler framework for optimization of affine loop nests for GPGPUs Muthu Manikandan Baskaran M. M. Uday Bondhugula U. Sriram Krishnamoorthy S. J. Ramanujam J. Atanas Rountev A. P. Sadayappan P. ICS '08: Proceedings of the 22nd annual international conference on Supercomputing New York, NY, USA ACM 2008 225–234 http:// doi. acm. org/ 10. 1145/ 1375527. 1375562 Code Generation in the Polyhedral Model Is Easier Than You Think Cédric Bastoul C. PACT'13 IEEE International Conference on Parallel Architecture and Compilation Techniques Juan-les-Pins, France 2004 7–16 http:// hal. ccsd. cnrs. fr/ ccsd-00017260 A Structured Approach to Proving Compiler Optimizations Based on Dataflow Analysis Yves Bertot Y. Benjamin Grégoire B. Xavier Leroy X. TYPES 2004 2004 66-81 A Proof System for a Simple Data-Parallel Programming Language L. Bougé L. Y. Le Guyadec Y. G. Utard G. B. Virot B. IFIP WG 10.3, Applications in Parallel and Distributed Computing Caracas (Venezuela) North-Holland April 1994 Revisiting the Sequential Programming Model for Multi-Core Matthew Bridges M. Neil Vachharajani N. Yun Zhang Y. Thomas Jablin T. David I. August D. I. MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture Washington, DC, USA IEEE Computer Society 2007 69–84 http:// dx. doi. org/ 10. 1109/ MICRO. 2007. 35 The VPC Trace-Compression Algorithms Martin Burtscher M. Ilya Ganusov I. Sandra J. Jackson S. J. Jian Ke J. Paruj Ratanaworabhan P. Nana B. Sam N. B. IEEE Trans. Comput. 54 11 2005 1329–1344 Embedding of Systems of Affine Recurrence Equations in Coq David Cachera D. David Pichardie D. Proc. of 16th International Conference on Theorem Proving in Higher Order Logics (TPHOLs'03) Lecture Notes in Computer Science 2758 Springer-Verlag 2003 155–170 Parallel Program Design: <i>A Foundation</i> K.M. Chandy K. J. Misra J. Addison Wesley 1988 Design Space Exploration of a Software Speculative Parallelization Scheme Marcelo Cintra M. Diego R. Llanos D. R. IEEE Trans. Parallel Distrib. Syst. 16 6 2005 562–576 http:// dx. doi. org/ 10. 1109/ tpds. 2005. 69 Dataflow analysis of scalar and array references Paul Feautrier P. International Journal of Parallel Programming 20 1 1991 23–53 Some efficient solutions to the affine scheduling problem, Part 1 : one dimensional time Paul Feautrier P. International Journal of Parallel Programming 21 5 1992 313–348 Some efficient solutions to the affine scheduling problem, Part 2 : multidimensional time Paul Feautrier P. International Journal of Parallel Programming 21 6 1992 Automatic Parallelization in the Polytope Model Paul Feautrier P. The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications Springer-Verlag 1996 79–103 Certifying low-level programs with hardware interrupts and preemptive threads Xinyu Feng X. Zhong Shao Z. Yuan Dong Y. Yu Guo Y. SIGPLAN Not. 43 6 2008 170–182 http:// dx. doi. org/ 10. 1145/ 1379022. 1375603 Velodrome: a sound and complete dynamic atomicity checker for multithreaded programs Cormac Flanagan C. Stephen N. Freund S. N. Jaeheon Yi J. PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation New York, NY, USA ACM 2008 293–303 http:// dx. doi. org/ 10. 1145/ 1375581. 1375618 La sémantique des directives au compilateur : application au parallélisme de données Philippe Gerner P. Université Louis Pasteur 2002 Ph. D. Thesis A Theoretical Framework of Data Parallelism and Its Operational Semantics Philippe Gerner P. Éric Violard É. Euro-Par 2000 LNCS 1900 Springer 2001 668–677 Stepwise refinement and concurrency: the finite-state case E. P. Gribomont E. P. Sci. Comput. Program. 14 2-3 1990 185–228 http:// dx. doi. org/ 10. 1016/ 0167-6423(90)90020-E Compiler research: the next 50 years Mary Hall M. David Padua D. Keshav Pingali K. Commun. ACM 52 2 2009 60–67 http:// doi. acm. org/ 10. 1145/ 1461928. 1461946 Oracle Semantics for Concurrent Separation Logic Aquinas Hobor A. Andrew W. Appel A. W. Francesco Zappa Nardelli F. Z. ESOP 2008 353-367 Recovering the Memory Behavior of Executable Programs Alain Ketterlin A. Philippe Clauss P. 10th IEEE Working Conference on Source Code Analysis and Manipulation, SCAM Roumanie Timisoara IEEE Computer Society Press Sep 2010 http:// hal. inria. fr/ inria-00502813 Optimistic parallelism requires abstractions Milind Kulkarni M. Keshav Pingali K. Bruce Walter B. Ganesh Ramanarayanan G. Kavita Bala K. L. Paul Chew L. P. PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation New York, NY, USA ACM 2007 211–222 http:// doi. acm. org/ 10. 1145/ 1250734. 1250759 Transactional memory James Larus J. Christos Kozyrakis C. Commun. ACM 51 7 2008 80–88 The Problem with Threads Edward A. Lee E. A. Computer 39 5 2006 33–42 http:// dx. doi. org/ 10. 1109/ MC. 2006. 180 Loop Parallelization in the Polytope Model C. Lengauer C. Parallel Processing Letters 4 3 1994 Formal verification of a realistic compiler Xavier Leroy X. Communications of the ACM July 2009 To appear The Compcert verified compiler, software and commented proof Xavier Leroy X. January 2010 http:// compcert. inria. fr SUIF Explorer: an interactive and interprocedural parallelizer Shih-Wei Liao S.-W. Amer Diwan A. Robert P. Bosch R. P. Anwar Ghuloum A. Monica S. Lam M. S. PPoPP '99: Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming New York, NY, USA ACM 1999 37–48 http:// doi. acm. org/ 10. 1145/ 301104. 301108 Data Sequence Locality: a Generalization of Temporal Locality Vincent Loechner V. Benoît Meister B. Philippe Clauss P. Euro-Par 2001 Manchester, UK Springer 2001 Communication Optimization for Affine Recurrence Equations using Broadcast and Locality Vincent Loechner V. Catherine Mongenet C. International Journal of Parallel Programming 28 1 2000 Pin: building customized program analysis tools with dynamic instrumentation Chi-Keung Luk C.-K. Robert Cohn R. Robert Muth R. Harish Patil H. Artur Klauser A. Geoff Lowney G. Steven Wallace S. Vijay Janapa Reddi V. J. Kim Hazelwood K. PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation New York, NY, USA ACM 2005 190–200 http:// doi. acm. org/ 10. 1145/ 1065010. 1065034 High-level small-step operational semantics for transactions Katherine F. Moore K. F. Dan Grossman D. POPL '08: Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages New York, NY, USA ACM 2008 51–62 http:// dx. doi. org/ 10. 1145/ 1328438. 1328448 Identifying potential parallelism via loop-centric profiling T. Moseley T. D. A. Connors D. A. D. Grunwald D. R. Peri R. CF '07: Proceedings of the 4th international conference on Computing frontiers ACM 2007 143–152 Translation validation for an optimizing compiler George C. Necula G. C. SIGPLAN Not. 35 5 2000 83–94 http:// doi. acm. org/ 10. 1145/ 358438. 349314 The Code Validation Tool (CVT) - Automatic verification of code generated from synchronous languages A. Pnueli A. O. Shtrichman O. M. Siegel M. Software Tools for Technology Transfer 2 1998 Iterative optimization in the polyhedral model: part II, multidimensional time Louis-Noël Pouchet L.-N. Cédric Bastoul C. Albert Cohen A. J. Cavazos J. PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation New York, NY, USA ACM 2008 90–100 http:// doi. acm. org/ 10. 1145/ 1375581. 1375594 Iterative Optimization in the Polyhedral Model: part I, One-Dimensional Time Louis-Noël Pouchet L.-N. Cédric Bastoul C. Albert Cohen A. Nicolas Vasilache N. CGO '07: Proceedings of the International Symposium on Code Generation and Optimization Washington, DC, USA IEEE Computer Society 2007 144–156 http:// dx. doi. org/ 10. 1109/ CGO. 2007. 21 Visualizing potential parallelism in sequential programs Graham D. Price G. D. John Giacomoni J. Manish Vachharajani M. PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques New York, NY, USA ACM 2008 82–90 http:// doi. acm. org/ 10. 1145/ 1454115. 1454129 Spice: speculative parallel iteration chunk execution Easwaran Raman E. Neil Vachharajani N. Ram Rangan R. David I. August D. I. CGO '08: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization New York, NY, USA ACM 2008 175–184 http:// doi. acm. org/ 10. 1145/ 1356058. 1356082 The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization Lawrence Rauchwerger L. David Padua D. IEEE Trans. Parallel Distrib. Syst. 10 2 1999 160–180 http:// dx. doi. org/ 10. 1109/ 71. 752782 Méthodes de dénombrement de points entiers de polyèdres et applications à l'optimisation de programmes Rachid Seghir R. Université de Strasbourg December 2006 Ph. D. Thesis Automatically characterizing large scale program behavior Timothy Sherwood T. Erez Perelman E. Greg Hamerly G. Brad Calder B. ASPLOS-X: Proceedings of the 10th international conference on Architectural support for programming languages and operating systems New York, NY, USA ACM 2002 45–57 http:// doi. acm. org/ 10. 1145/ 605397. 605403 Virtual Machines: Versatile Platforms for Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design) Jim Smith J. Ravi Nair R. Morgan Kaufmann Publishers Inc.

San Francisco, CA, USA

2005 Copy or Discard Execution Model For Speculative Parallelization On Multicores C Tian C. M Feng M. V Nagarajan V. R Gupta R. IEEE/ACM 41st International Symposium on Microarchitecture, MICRO 41 November 2008 330-341 Formal verification of translation validators: a case study on instruction scheduling optimizations Jean-Baptiste Tristan J.-B. Xavier Leroy X. SIGPLAN Not. 43 1 2008 17–27 http:// dx. doi. org/ 10. 1145/ 1328897. 1328444 Verified Validation of Lazy Code Motion Jean-Baptiste Tristan J.-B. Xavier Leroy X. Programming Language Design and Implementation 2009 ACM Press 2009 To appear Refinement of Data Parallel Programs in PEI Éric Violard É. Stéphane Genaud S. Guy-René Perrin G.-R. Proceedings of the IFIP TC 2 WG 2.1 international workshop on Algorithmic languages and calculi London, UK, UK Chapman & Hall, Ltd. 1997 107–131 Safe futures for Java Adam Welc A. Suresh Jagannathan S. Antony Hosking A. OOPSLA '05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications New York, NY, USA ACM 2005 439–453 http:// doi. acm. org/ 10. 1145/ 1094811. 1094845 Evaluating Sampling Based Hotspot Detection Qiang Wu Q. Oskar Mencer O. International Conference on Architecture of Computing Systems, ARCS March 2009 Efficient program execution indexing Bin Xin B. William N. Sumner W. N. Xiangyu Zhang X. PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation New York, NY, USA ACM 2008 238–248 http:// doi. acm. org/ 10. 1145/ 1375581. 1375611 A Static-Dynamic Collaborative Framework for Nested Loops Instrumentation and Profiling Alexandra Jimborean A. Matthieu Herrmann M. Vincent Loechner V. Philippe Clauss P. Université de Strasbourg 05 2010 http:// hal. inria. fr/ inria-00534745/ en/ Research Report VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs Alexandra Jimborean A. Matthieu Herrmann M. Vincent Loechner V. Philippe Clauss P. Université de Strasbourg 09 2010 http:// hal. inria. fr/ inria-00534748/ en/ Research Report