A3 has been an INRIA pre-project since 1996 and was created as a
project-team in December 1998. A3 is a joint project with the PRiSM
laboratory of the Université de Versailles-Saint-Quentin. It was accepted by CNRS in July 1999. The A3 project-team
is ending at the end of 2003 and will become the new INRIA Futurs Alchemy team.
A3 research focuses on program analysis together with its applications to code performance optimization on high performance processors, especially the optimization of memory hierarchy management and instruction-level parallelism. A3 designs methods and tools, used by compilers or users for code analysis and optimization, that exploit the architectural features of processors as fully as possible.
The objectives of the A3 project-team are to:
develop new methods for program data-flow analysis,
generalize traditional methods of static analysis to code optimization,
take into account architectural features in code analysis,
develop new methods of code optimization,
develop new methods and tools for dynamic code analysis, as well as new optimization methods that exploit results of this analysis.
Applications targeted by A3 are:
optimization of programs known as ``computation intensive'', on high performance processors available in personal computers as well as workstations,
analysis and optimization of programs for specialized and/or embedded processors,
program parallelization on servers or workstations with a limited number of processors.
Software or hardware? Computer programming has always been a trade-off between the two. With specific processors at one extreme and general-purpose microprocessors at the other, this question is exacerbated when performance is at stake. High performance processor architectures are constantly evolving, and programming them efficiently requires more and more specialized expertise. While programmers could easily vectorize or parallelize the source program on the supercomputers of the 1980s, today they have to take into account the memory hierarchy and instruction-level parallelism, which are typically managed at the assembly code level.
Semantic analysis of programs (data access patterns, anticipation of
dynamic execution behaviours) is a prerequisite of any
optimization. In traditional compilers, code analysis relies on very
strong theories of program semantics and applies to any general
code, but the resulting information is not very precise, especially in
the case of regular data types. At the opposite extreme, automatic
parallelizers provide a very fine-grained analysis of
array data accesses, but this kind of analysis is restricted to
programs with very regular control patterns as well as regular
memory access patterns.
Likewise, the optimizations performed in traditional compilers basically aim at reducing the instruction count; for instance, removing invariants from loops avoids wasting time in useless computations. These methods apply to any kind of program. At the opposite extreme, optimizing for high performance architectures usually requires changing the schedule of instructions. This is easy and efficient in the very restricted case of straight-line code or loops without branches (for instruction-level parallelism) and regular data access patterns (for memory hierarchy management); how to extend the scope of these optimizations is still an open problem.
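The invariant-removal optimization mentioned above can be illustrated on a toy example (a hypothetical sketch, not code from any A3 tool): both functions below compute the same result, but the second hoists the loop-invariant expression out of the loop, reducing the instruction count.

```python
# Sketch of loop-invariant code motion: `before` recomputes the
# invariant expression on every iteration, `after` hoists it out
# of the loop; both return the same values.

def before(xs, a, b):
    out = []
    for x in xs:
        inv = a * a + b        # recomputed on every iteration
        out.append(x * inv)
    return out

def after(xs, a, b):
    inv = a * a + b            # hoisted: computed once
    out = []
    for x in xs:
        out.append(x * inv)
    return out
```

A compiler performs the same rewriting on the intermediate representation, after proving that `a` and `b` are not modified inside the loop.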
Hence, in both code analysis and code optimization, two families of
methods are emerging. The first is general purpose but
ignores architectural features; the second is efficient
but restricted to specific code constructs. How can generality
and efficiency be combined? Research done in the A3 project-team hinges on this
issue.
The applications targeted by A3 are basically code optimization for high
performance architectures. This includes general-purpose microprocessors
and specialized embedded processors or DSPs (Digital
Signal Processors), as well as
servers with a small number of processors or supercomputers with a
shared-memory programming model.
As for programs, A3 targets performance-critical applications. This
includes scientific computations (European MHAOTEU project,
1998-2001), multimedia embedded applications (European OCEANS project,
1996-1999), and videographics applications (Sandra project described
in section ).
The objective of performance can be understood in various ways: maximization of execution speed, minimization of data transfers or memory accesses, minimization of power consumption or power dissipation, minimization of the size of the generated code, minimization of the cost of designing the processor...
Methods and algorithms developed in the A3 team apply at any level of the programming process:
programming environment (code reengineering, interactive optimization tools, code transformation toolbox). The MHAOTEU ESPRIT project (1998-2001) focused on this kind of application for memory hierarchy optimization.
source-to-source preprocessor: for instance PAF (Paralléliseur
Automatique de FORTRAN) or TOPS (source-to-source software
pipelining, section ).
compiler;
postprocessor for assembly code optimization, as in the
Salto framework of the IRISA Caps team, into which the PiLo and
LoRA software were integrated;
Moreover, A3 research work makes it possible to characterize the difficulty of using each feature of specialized processors, and thus to influence their architecture, as in the case of the Sandra project.
The European OCEANS ESPRIT project, which ended in 1999, hinged on the interaction between the two phases of pre- and post-processing for VLIW architectures.
distributed simulation of microprocessors with dynamic warm-up and trace partitioning.
extensive environment for writing, running and debugging OCaml programs in Emacs/XEmacs. Thousands of installations worldwide, distributed as part of Debian GNU/Linux 3.0.
gate-level simulator
for the LC-2 processor (Little Computer 2 from Yale Patt and
Sanjay Patel)
CLooG (Chunky LOOp Generator) is a tool and a library that generate the
loop code for scanning the integer points of polyhedra.
PIP is a now well-known solver for parametric integer linear
programming problems.
PiLo is a software pipelining package. It was implemented by Antoine
Sawaya for his PhD. PiLo is based on the DESP method (Decomposed Software
Pipelining) and was integrated into Salto (OCEANS project).
LoRA is based on the meeting graph and trades off register
pressure against the loop unrolling degree.
LoRA is connected to PiLo (see above) and was also integrated into
MOST (Modulo Scheduling Testbed), developed at McGill University (Montreal).
TOPS is a tool for loop software pipelining in FORTRAN
programs. The user or the compiler specifies which loops should be software
pipelined by inserting directives. One can also specify, based on
analysis or expertise, which data should be prefetched to avoid
processor stalls in the case of cache misses. TOPS was designed by Min
Dai for his PhD.
A number of interface tools have been designed by François Thomasset; these tools connect Maple, MuPAD and the Polylib and Piplib libraries.
The SAREQ software was developed by Xavier Redon (LIFL,
Université de Lille I) in a collaborative work with Denis Barthou and
P. Feautrier on the recognition of algorithms. It makes it possible
to test the equivalence of two closely connected systems of
recurrence equations. It is written in OCaml and uses the
Polylib and Omega polyhedral
libraries.
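SAREQ performs an exact symbolic test on systems of recurrence equations; by contrast, a much weaker numerical sanity check can be sketched as follows (hypothetical illustration, unrelated to the actual SAREQ algorithm): evaluate two recurrences over a bounded range and compare the values, which can disprove equivalence but never prove it.

```python
# Hypothetical sketch: compare two recurrence definitions numerically
# on a bounded range. A disagreement disproves equivalence; agreement
# on a finite range proves nothing, unlike SAREQ's symbolic test.

def eval_recurrence(init, step, n):
    """Evaluate u(0) = init, u(k) = step(k, u(k-1)) up to u(n)."""
    u = init
    values = [u]
    for k in range(1, n + 1):
        u = step(k, u)
        values.append(u)
    return values

# u(k) = u(k-1) + 2k - 1 and v(k) = k*k both define the squares.
us = eval_recurrence(0, lambda k, u: u + 2 * k - 1, 5)
vs = [k * k for k in range(6)]
assert us == vs    # they agree on 0..5
```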
We achieved the formalization and extension of instancewise
analysis for recursive programs, following our preliminary results.
Execution tracing is a convenient way of understanding program
behaviour while going beyond the limits of static analysis. But it can
be difficult to express traces in a comprehensible way, or to represent
them through a model usable for simulation, prediction and
optimization. In this article, we propose to model program traces
by loop nests, by identifying
repetitive and periodic patterns. The generated loops can then be used
for all the objectives quoted above, while benefitting from the
developments carried out within the framework of the polyhedral model.
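The compression of a trace into loops can be sketched in a deliberately simplified (and hypothetical) one-dimensional form: fold maximal arithmetic progressions of the trace into (base, stride, count) triples, each triple playing the role of a one-dimensional loop.

```python
# Simplified sketch of trace compression: maximal arithmetic
# progressions become (base, stride, count) triples, i.e.,
# one-dimensional loops that regenerate the trace exactly.

def compress(trace):
    loops = []
    i = 0
    while i < len(trace):
        if i + 1 < len(trace):
            stride = trace[i + 1] - trace[i]
            count = 2
            while (i + count < len(trace)
                   and trace[i + count] - trace[i + count - 1] == stride):
                count += 1
            loops.append((trace[i], stride, count))
            i += count
        else:
            loops.append((trace[i], 0, 1))
            i += 1
    return loops

def expand(loops):
    return [base + k * stride for base, stride, count in loops
            for k in range(count)]

trace = [0, 4, 8, 12, 100, 104, 108, 7]
loops = compress(trace)
assert expand(loops) == trace      # lossless round trip
```

The real model handles nested and periodic patterns, producing multi-dimensional loop nests rather than a flat list of triples.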
Several mathematical tools for the static analysis of programs have been
developed over the last decades. Although these tools are particularly
useful, they naturally have limits. In particular, integer
multivariate polynomials appear in many analyses,
and most methods are unable to handle them.
Although specific methods have already been proposed, they consider only
subsets of such expressions. This article presents a general and
original approach based on a symbolic version of the Bernstein expansion.
This approach makes it possible to bound the
values taken by a multivariate polynomial on a ``box'' (a product of intervals), and
is in general more precise than the traditional methods of
approximation by intervals.
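The principle can be sketched on a univariate polynomial (an illustrative toy version; the actual work handles symbolic, multivariate polynomials on arbitrary boxes): the Bernstein coefficients of a polynomial on [0, 1] enclose its range, often more tightly than naive interval evaluation.

```python
from math import comb

# Toy illustration of Bernstein-based bounding on [0, 1].
# For p(x) = sum_i a[i] * x**i of degree n, the Bernstein
# coefficients are b[k] = sum_{i<=k} (C(k,i)/C(n,i)) * a[i],
# and min(b) <= p(x) <= max(b) for all x in [0, 1].

def bernstein_bounds(a):
    n = len(a) - 1
    b = [sum(comb(k, i) / comb(n, i) * a[i] for i in range(k + 1))
         for k in range(n + 1)]
    return min(b), max(b)

# p(x) = x**2 - x, whose true range on [0, 1] is [-0.25, 0]:
lo, hi = bernstein_bounds([0, -1, 1])
assert (lo, hi) == (-0.5, 0.0)   # tighter than the naive interval [-1, 1]
```

Naive interval arithmetic on x**2 - x over [0, 1] yields [-1, 1]; the Bernstein enclosure [-0.5, 0] is much closer to the true range.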
Static cost models have a hard time coping with hardware components
exhibiting complex run-time behaviours, calling for alternative
solutions. Iterative optimization is emerging as a promising research
direction, but currently it is mostly limited to finding the
parameters of program transformations, rather than compositions of program transformations.
This work is one of the cornerstones of our Center for Program
Tuning (RNTL project, 2004–2006) described in section . The main goal is to facilitate
the expression and search of compositions of program transformations.
This framework relies on a unified polyhedral representation of loops
and statements. The key to our framework is to clearly separate the
impact of each program transformation on the following three
components: the iteration domain, the statement schedule and the
memory access functions. Within this framework, composing a long
sequence of program transformations induces no code explosion. As a
result, searching for compositions of transformations is not hampered
by the multiplicity of compositions and, ultimately, in many cases it is
equivalent to testing different values of the matrix parameters.
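The absence of code explosion can be sketched with a deliberately minimal (and hypothetical) encoding: if the schedule of a statement is a matrix, composing transformations amounts to multiplying schedule matrices, so an arbitrarily long composition is still one matrix of fixed size.

```python
# Minimal hypothetical encoding of the idea: loop transformations act
# on a statement's schedule matrix, and composing them is a matrix
# product, so the representation never grows with the sequence length.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

identity = [[1, 0], [0, 1]]
interchange = [[0, 1], [1, 0]]     # swap loops i and j
skew = [[1, 0], [1, 1]]            # j' = i + j

# Compose interchange then skewing: still a single 2x2 schedule matrix.
schedule = matmul(skew, matmul(interchange, identity))
assert schedule == [[0, 1], [1, 1]]
```

Code is generated only once, at the end of the whole composition, rather than after every transformation step.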
Our techniques have been implemented on top of the Open64/ORC
compiler. In addition, we are beginning the design of a robust
infrastructure for iterative optimization, based on machine learning
and operations research techniques (e.g., genetic algorithms). This
infrastructure distributes simulations, dynamic profiles,
compilations and transformations, while interacting with a
machine-learning component or with an expert user. Validation of these
concepts and application of the tools will be a critical issue in the
Center for Program Tuning.
To achieve the best performance on single processors, optimizations need to target most components of the architecture simultaneously, focusing on the memory hierarchy (including registers), branch prediction and instruction-level parallelism. Typical examples of good candidates for aggressive optimization technologies include regular and numerical computations from scientific, signal processing or multimedia applications.
More irregular programs can also be data and compute intensive, but less architecture-aware optimizations have been proposed for such programs. Still, speculative and very complex transformations are available for such codes in the context of massively parallel computers. We investigated the applicability and extension/adaptation of some of these techniques for the optimization on uniprocessors, and our results were extremely promising in the case of two approximate string-matching codes (for computational biology). Hybrid static-dynamic optimizations for such programs are also being considered, driving the selection of optimization parameters at run-time through the fine-grain tracking of the behaviour of the application (performance counters). Other codes will be considered in the future, including additional bioinformatics examples (matching for sequencing, protein and RNA folding), as well as shape recognition algorithms (classification), data-mining and irregular numerical codes, e.g., meshes.
Cache designers have relied on probabilistic arguments: each
program is assumed to address main memory randomly. While this model is
appropriate for the usual irregular programs, it fails for the regular
programs stemming from scientific computation or digital signal
processing. Moreover, embedded applications require that execution
times be predictable, and thus that cache misses be kept under
control. In our approach, we propose to bypass the replacement
mechanism, which is hard to plan in advance. The program is divided into
pieces (chunks). At the beginning of each chunk, the cache is
empty. The size of each chunk is adjusted so that its data exactly
fit in the cache; hence the replacement strategy has no effect. At
the end of each chunk, the cache is flushed before starting the next
one. In this model it is possible to give an asymptotic estimate of
the data traffic, and we use this information to compute an optimal
chunking, which may induce transformations of the source program.
In the context of his PhD, C. Bastoul defined the chunking algorithms
for the most significant cases: group reuse, self reuse and multiple
reuse, and implemented the Chunky prototype. The program is then
rebuilt by the CLooG code generator (section ). The
experiments are very encouraging for data locality
improvement.
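The chunking principle can be sketched on a one-dimensional example (a hypothetical illustration, not the actual Chunky algorithms): choose the chunk size so that the chunk's data footprint fits in the cache, so that the replacement policy never fires and every miss is compulsory.

```python
# Hypothetical one-dimensional illustration of chunking: split the
# iteration space into chunks whose data footprint fits in the cache,
# so the replacement mechanism is never exercised.

def chunking(n_iterations, bytes_per_iteration, cache_bytes):
    """Return the chunk size (in iterations) and the number of chunks
    for a loop whose iterations each touch fresh data."""
    chunk = cache_bytes // bytes_per_iteration   # footprint fits exactly
    n_chunks = -(-n_iterations // chunk)         # ceiling division
    return chunk, n_chunks

# 10000 iterations touching 8 fresh bytes each, with a 32 KB cache:
chunk, n_chunks = chunking(10000, 8, 32 * 1024)
assert chunk == 4096 and n_chunks == 3
```

In the real setting the footprint computation involves the reuse pattern (group, self or multiple reuse) rather than a constant number of bytes per iteration.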
As a continuation of the Sandra project (section )
on high performance architectures for video processing, we consider
the problem of specifying, controlling and verifying physical-time
properties directly in programs. This year we have worked with Marc
Pouzet of the Université de Paris 6 and have started to specify
the HPN (Hierarchical Process Network) model in the Lucid Synchrone synchronous and functional
programming language.
Two main issues had to be clarified in order to use
Lucid Synchrone. The first was to be able to handle the data-flow sequences
by blocks. This implies keeping track of their recent
history, possibly with a parameterized length. We succeeded in specifying block computations in pure
Lucid Synchrone; additionally, the size of the blocks can itself be a flow.
The second issue was to specify the burstiness property of HPN,
meaning that the arrival of data can be delayed, but only within a time
window. We could specify the burstiness as well, but at the price of
specifying the arrival times by a specific clock.
Therefore, in the latter case, the clock cannot be synthesized.
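The block computation described above can be mimicked in ordinary code (a rough sketch in Python, with none of Lucid Synchrone's clock semantics): group a data flow into blocks whose sizes are themselves taken from a second flow.

```python
# Rough sketch (ordinary Python, not Lucid Synchrone) of block
# computations over a flow: block sizes are themselves a flow.

def blocks(data_flow, size_flow):
    it = iter(data_flow)
    for size in size_flow:
        block = [x for _, x in zip(range(size), it)]
        if not block:
            return            # data flow exhausted
        yield block

out = list(blocks(range(10), [3, 2, 5]))
assert out == [[0, 1, 2], [3, 4], [5, 6, 7, 8, 9]]
```

In Lucid Synchrone the same grouping is expressed declaratively, and the compiler's clock calculus checks that producers and consumers agree on when blocks are full.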
The follow-up of this work will be to consider whether the mechanism used
in Lucid Synchrone for clock computation can be adapted to account for other
implementation parameters such as resource usage.
This work was done by Meriem Belguidoum for her DESS degree.
The Sandra project is a collaborative work between INRIA A3 and Philips
Research (Eindhoven). It started with PRF (Philips Research
France) in 2003.
The goal is to design a specification and programming environment for
video-graphics processors, which are of growing importance today with
the advent of digital and high definition television. In this context,
flows of images come from different sources at possibly
different rates. They have to be combined on a single screen, or sent to
other devices such as video recorders, after processing steps ranging from
simple scaling up to more complex 2D or 3D transformations. In past
years we have designed a programming language, schedulers and the HPN
environment for the verification of real-time properties, as well as the use of the Lucid Synchrone language for clock
verification (see ). We have also worked with the INRIA Compose project-team
in order to set up a collaboration on domain-specific languages for
videographics processing.
We submitted an exploratory RNTL project (long-term
academic-industrial research project, funding from the ministry of
research) called "Centre d'Optimisation de Programmes" (Cop) or
"Center for Program Tuning" (CPT), in January 2004. This submission
was successful and the project was ranked 1st (ex aequo with 5 other
projects). The project is coordinated by Olivier Temam from LRI,
Paris-South University, with partners from IRIT (Toulouse), CEA
Saclay, STMicroelectronics Crolles and HP France. Funding starts in
January 2004.
Véronique Donzeau-Gouge, professor at CNAM, is the official supervisor of Pierre Amiranoff.
A3 organizes a joint seminar with CRI (Centre de Recherches en Informatique, Ecole des Mines de Paris) and LRI (Laboratoire de Recherches en Informatique, Université Paris-Sud). Talks of 2003 are given below.
January 7th, 2003: Simulating a $2M Commercial Server on a $2K PC, Mark D. Hill
(University of Wisconsin-Madison);
July 15th, 2003: Optimisation de code de Pattern Matching sur l'architecture EPIC :
transformation, expériences, automatisation, Patrick Carribault, INRIA/A3;
July 21st, 2003: Reconnaissance de templates d'algorithmes, Christophe Alias,
PRISM, Université de Versailles-Saint-Quentin;
August 28th, 2003: Analysis of Induction Variables Using Chains of Recurrences:
Extensions, Sebastian Pop, ICPS (Université Louis Pasteur,
Strasbourg) and CRI;
November 4th, 2003: Validation de Codes Assembleurs Relogeables par Interprétation
Abstraite,
Matthieu Martel, CEA, Laboratoire de Sûreté des Logiciels (LSL);
September 11th, 2003: Multivariate techniques for benchmark selection,
Koen De Bosschere, Ghent University, Belgium.
Olivier Temam is on the steering committee of the proposal for the European HiPEAC (High Performance architectures and compilers) Network of Excellence. Scientists of the A3 project-team are members of this network as well. The aim is to federate European research on current and future processor architectures and compilers.
A3 has participated in the elaboration of the SpotLight proposal for a
European STREP.
PAI Procope with Prof. Christian Lengauer at the University of Passau, Germany: Christoph Herrmann, associate researcher, visited INRIA Rocquencourt in September (two weeks); Peter Faber, PhD student, visited INRIA Futurs in December (one week).
A new Procope contract has been agreed for two years (2004–2006), between Albert Cohen and Christoph Herrmann, to collaborate and promote common research on metaprogramming and domain-specific program optimization.
In the context of the Sandra project (see ), Marc
Duranton (Philips Research, Eindhoven)
visited INRIA on a regular basis.
We collaborate with Jean-Luc Gaudiot (University of California, Irvine) and Guang Gao (University of Delaware) on ``I-structures'' and their use in program optimization. J.-L. Gaudiot visited INRIA on April 22nd and G. Gao on April 24th.
Albert Cohen is a member of the program committee of the DATE'04 conference, topic B11 ``design aspects of emerging technologies and applications''.
Albert Cohen was in charge of the interaction with research groups at INRIA Rocquencourt until December, for the preparation of the 6th Framework Program of the European Union.
Albert Cohen is a member of the department committee of the PRiSM laboratory, University of Versailles.
Christine Eisenbeis served on the program committees of SCOPES 2003 (7th International Workshop on
Software and Compilers for Embedded Systems,
Vienna, Austria, September 2003), CGO
2004 (International Symposium on Code Generation and Optimization, with
Special Emphasis on Feedback-Directed and Runtime Optimization, Palo
Alto, March 2004) and CC 2004 (International Conference on Compiler Construction,
Barcelona, March 2004).
In February 2003, Christine Eisenbeis organized a one-week workshop in Dagstuhl,
Germany, with Mary-Lou Soffa and Tom Conte: ``Emerging Technologies:
can optimization technology meet their demands?''
Ph. Clauss, A. Cohen, Ch. Eisenbeis and P. Feautrier are on the steering committee of the CNRS ``réseau thématique pluridisciplinaire'' (RTP) on ``Architecture and compilers'', initiated in September 2002, with Ph. Clauss as co-leader together with Pascal Sainrat (IRIT, Toulouse). Ph. Clauss and Ch. Eisenbeis are the co-leaders of the CNRS ``action spécifique'' (AS) on ``Compilers for embedded systems''; they organized three one- or two-day meetings in 2003. A. Cohen and Ch. Eisenbeis are also members of the CNRS ``action spécifique'' on ``Future technologies and future paradigms for processor architectures'', whose leader is Nathalie Drach-Temam (LRI).
Pierre Amiranoff was an ATER at CNAM (algorithmics, programming, Ada) and then a PRAG (initiation to computer tools, initiation to programming, Pascal).
C. Bastoul taught as a ``moniteur'' and then as an ``ATER'' at the Université de Versailles-Saint-Quentin (``programming'' and ``algorithmics'').
Ph. Clauss teaches at the DEA of the Université Louis Pasteur in Strasbourg (``Parallelism'' and ``High performance compilers'').
Albert Cohen teaches at the Master of Computer Science of the Paris South University (compilation and optimization for high performance and embedded systems) and at École Polytechnique (third year computer architecture and first year Java programming labs).
The project-team members have given the following talks and attended the following conferences:
Pierre Amiranoff, Cédric Bastoul, Albert Cohen, Christine Eisenbeis: participation in the CPC'03 workshop, Amsterdam, June 2003 (talk by Pierre Amiranoff, ``Instancewise Array Dependence Test for Recursive Programs''; talk by Cédric Bastoul, ``Reordering methods for data locality improvement'').
Sylvain Girbal: article and presentation at the ACM SIGMetrics'03 conference, San Diego, June 2003, ``DiST: A Simple, Reliable and Scalable Method to Significantly Reduce Processor Architecture Simulation Time''.
Albert Cohen, Sylvain Girbal and Olivier Temam: participation in the joint FCRC'03 conference meeting in San Diego, June 2003, and more specifically in the PLDI, ISCA, SIGMetrics, PPoPP, SAS and LCTES conferences and the IVME, WCAE, WCED and NSC workshops.
Albert Cohen: article and presentation at the LCPC'03 conference (Languages and Compilers for Parallel Computers), College Station, Texas, October 2003, ``Putting Polyhedral Loop Transformations to Work''.
Cédric Bastoul and Albert Cohen: participation in the LCPC'03 conference (Languages and Compilers for Parallel Computers), College Station, Texas, October 2003.
Albert Cohen and Sid-Ahmed-Ali Touati: invited participation and presentations at the Dagstuhl seminar on future challenges in compilation and optimization (organized by Christine Eisenbeis, Mary-Lou Soffa and Tom Conte), February 2003.
Albert Cohen and Paul Feautrier: invited participation and presentations at the Dagstuhl seminar on metaprogramming, skeletons, generative programming and domain-specific languages (organized by Don Batory, Christian Lengauer and Charles Consel), March 2003.
Albert Cohen: invited participation and presentation at the Dagstuhl seminar on memory consistency models (organized by Sam Midkiff, Jens Knoop, David Padua and Jae-Jin Lee), October 2003.
Albert Cohen: presentations at HP Labs Palo Alto (June 2003), the University of Illinois at Urbana-Champaign (June 2003), Philips National Labs Eindhoven (December 2003) and the University of Passau (December 2003).
Cédric Bastoul: article and presentation at the IEEE ISPDC'03 conference (Parallel and Distributed Computing), Ljubljana, October 2003, ``Efficient Code Generation for Automatic Parallelization and Optimization''.
Cédric Bastoul: article and presentation at the CC'03 conference (Compiler Construction), Warsaw, April 2003, ``Improving data locality by chunking''.
Cédric Bastoul: seminar presentation at École Normale Supérieure de Lyon, September 2003.