The Compsys team was created in January 2002 as part of the Laboratoire de l'Informatique du Parallélisme (Lip, umr 5668, cnrs–ens-Lyon–Inria), and is located at ens-Lyon. Compsys is also an Inria pre-project and hopes to become a full project in 2004. The objective of Compsys is to adapt and extend optimization techniques, primarily designed for high-performance computing, to the special case of embedded computing systems.
An embedded computer is a digital system, part of a larger system (appliances like phones, TV sets, washing machines, game platforms, or larger systems like radars and sonars), which is not directly accessible to the user. In particular, this computer is not programmable in the usual way. Its program, if it exists, has been loaded as part of the fabrication process, and is seldom (or never) modified.
The objective of Compsys is to adapt and extend optimization techniques, primarily designed for high-performance computing, to the special case of embedded computing systems. Compsys has four research directions, centered on compilation methods for simple or nested loops. The first three directions are:
code optimization for specific processors (mainly dsp and vliw processors);
high-level code transformations (including loop transformations for memory optimization);
silicon compilation (with a link to micro-electronics).
This research is supported by a marked investment in polyhedra manipulation tools, with the aim of constructing operational software tools, not just theoretical results. Hence the fourth research theme is centered on the development of these tools.
We expect that Compsys's experience with key problems in the design of parallel programs (scheduling, loop transformations) and the support of our respective parent organizations (Inria, cnrs, Ministry of Education) will allow us to contribute significantly to European research on embedded computing systems.
The term embedded system has been used to name a wide variety of objects. More precisely, there are two categories of so-called embedded systems: (1) control-oriented and hard real-time embedded systems (automotive, nuclear plants, airplanes, etc.) and (2) compute-intensive embedded systems (signal processing, multimedia, stream processing). The Compsys team is primarily concerned with the second type, now referred to as embedded computing systems. The design and compilation methods proposed by the team will be efficient for compute-intensive processing, with large sets of data processed in a pipelined way.
Today, the industry sells many more embedded processors than general-purpose processors; the field of embedded systems is one of the few segments of the computer market where the European industry still has a substantial share. Hence the importance of embedded system research in the European research initiatives.
Compsys aims to develop new compilation and optimization techniques for embedded systems. The field of embedded computing system design is large, and Compsys does not intend to cover it in its entirety. We are mostly interested in the automatic design of accelerators, for example optimizing a piece of (regular) code for a dsp or designing a vlsi chip for a digital filter.
Compsys's specificity is the study of code transformations intended for the optimization of features that are specific to embedded systems, like performance, power consumption, and die size. Our project is related to code optimization (as in the Inria project A3/Alchemy) and to high-level architectural synthesis (as in the Inria project r2d2). It belongs to the more general theme of silicon compilation, which is today one of the challenges that we have to meet if Europe is to become ``the leader for system integration on silicon chips''.
Recent decisions of the French government clearly show the emergence of the research field ``tools for the design of embedded systems''. The Ministry Call for Proposals rntl 2003 explicitly cites embedded software as a priority.
The embedded system market is expanding. Among many factors, one can cite pervasive digitization, low-cost products, appliances, etc.
Software engineering for embedded systems is not well developed in
France, especially if one considers the importance of actors like
Alcatel, ST-Microelectronics, Matra, Thalès, and others.
As the complexity of embedded systems increases, new problems are emerging: computer-aided design, shorter time-to-market, better reliability, modular design, and component reuse.
In 2001, the second working group of the rntl
More recently, several tools for high-level synthesis have appeared.
These tools are mostly based on C or C++ (SystemC, vcc, and others).
The support for parallelism in these tools is minimal, but academic projects are more concerned with it: examples are Flex (mit), Piperench, and the tools developed at HP-Labs and at Synfora.
The basic problem that these projects have to face is that the
definition of performance is more complex than in classical
systems. In fact, the problem is a multi-criteria optimization and one
has to take into account the execution time, the size of the program,
the size of the data structures, the power consumption, the
fabrication cost, etc. The influence of the compiler on these costs is difficult to assess and control. Success will be the consequence of a
detailed knowledge of all steps of the design process, from a
high-level specification to the chip layout. A strong cooperation
between the compilation and chip design communities is needed.
The recent creation of the cnrs pluridisciplinary research initiative ``Architecture and Compilation'' (including a ``Compilation for embedded systems'' subfield) is clear evidence of the increasing interest in the French research community. In Europe, the work plan for the years 2003-2004 explicitly quotes ``distributed and embedded systems''.
Computer-aided design of silicon systems is a wide field. The expertise of the Compsys team members is in the parallelization and optimization of regular computations. Hence, we will target applications with a large potential parallelism, but we will attempt to integrate our solutions into the big picture of CAD environments for embedded systems. This is an essential part of Compsys activities and will be a test of its success.
The Compsys team has almost doubled its size since it became an
Inria pre-project in January 2002. The new members are Paul Feautrier
(ens-Lyon professor) and Fabrice Rastello (Inria research scientist).
As a consequence, we had to slightly reshuffle our research
subjects, without any modification to the central theme.
Compsys will go on cooperating with Insa-Lyon (Anne Mignotte and
Antoine Fraboulet are external collaborators). One of our priorities is
the recruitment of several PhD students. The recruitment of
Antoine Scherrer, whose co-advisors are Antoine Fraboulet and Tanguy Risset,
is a first step in this direction.
Twenty years ago, the subject of compilation was considered to be mature enough to become an industry, using tools like Lex and Yacc for syntax analysis, and Graham-Glanville code-generator generators. The subject was reactivated by the emergence of parallel systems and the need for automatic parallelizers. The hot topic is now the intermediate phase between syntax analysis and code generation, where one can apply optimizations, particularly those that exploit parallelism, whether in an autonomous way or with the help of the programmer. In fact, there is parallelism in all types of digital systems, from supercomputers to PCs to embedded systems.
Compilation consists of a succession of code transformations. These transformations are applied to an intermediate representation that may be very similar to the source code (high-level optimization) or very similar to machine code (assembly code, and even Register Transfer Level for circuit specification). Almost always, the main constraint is that the meaning (or semantics) of the source program must not be altered. Depending on the context, one may have to express the fact that the degree of parallelism must not exceed the number of available resources (processors, functional units, registers, memories). Finally, the specification of the system may enforce other constraints, like latency, bandwidth, and others. In the case of a complex transformation, one tries to express it as a constrained optimization problem.
For instance, in automatic parallelization, the French community has mainly targeted loop optimization. If the source program obeys a few regularity constraints, one can obtain linear formulations for many of the constraints. In this way, the optimization problem is reduced to a linear program to be solved either in rationals or, in a few cases, in integers. These are well-known techniques, which are based on the theory of convex polyhedra – hence the name polyhedral model which is often affixed to the method. Based on this theory, efficient software tools have been implemented; mono- and multi-dimensional scheduling techniques are typical examples.
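As a generic illustration (not tied to any particular tool), a one-dimensional affine schedule assigns to each iteration $\vec{\imath}$ of a statement $S$ a date $\theta_S(\vec{\imath}) = \vec{a}_S \cdot \vec{\imath} + b_S$; a dependence from iteration $\vec{\imath}$ of $S$ to iteration $\vec{\jmath}$ of $T$ then imposes
\[
\theta_T(\vec{\jmath}) \ge \theta_S(\vec{\imath}) + 1
\quad \text{for all } (\vec{\imath},\vec{\jmath}) \text{ in the dependence polyhedron.}
\]
The affine form of Farkas' lemma turns this universally quantified condition into finitely many linear constraints on the unknown coefficients $\vec{a}_S$, $b_S$, $\vec{a}_T$, $b_T$, which is precisely how the optimization problem becomes a linear program.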
Extending these methods to embedded systems is difficult because the objective function is complex to express. Performance, for instance, is no longer an objective but a constraint, the goal being to minimize the ``cost'' of the system, which may be a complex mixture of the design, the fabrication, and the operation costs. For instance, minimizing the silicon area improves the yield and hence decreases the fabrication cost. Power consumption is an important factor for mobile systems. Computer scientists are used to a paradigm in which the architecture is fixed and the only free variable is the program. The critical problem is thus to extend our optimization methods to handle many more free variables, mostly of a discrete nature.
In parallel with compiler research, the circuit design community
has developed its own design procedures. These techniques have as
input a structural specification of the target architecture, and
use many heavy-weight tools for synthesis, placement, and routing.
These tools mainly use sophisticated techniques for boolean
optimization and do not consider loops. When trying to raise the
level of abstraction, circuit designers have introduced the terms
architectural synthesis and behavioral synthesis, but
the tools did not follow, due to the above mentioned problems
(increased complexity of the constraints, increasing number of
free variables).
Technological advances in digital electronics have motivated the
emergence of standards for design specifications and design
methodologies. Languages like vhdl, Verilog, and SystemC
have been widely accepted. The concepts of off-the-shelf
components (intellectual property or ip) and of platform-based
design are gaining importance. However, the problem remains the
same: how to transform a manual design process into a compilation
process?
The first proposal was to use several tools together. For
instance, the hardware-software partitioning problem is handled by
architecture explorations, which rely on rough performance
estimates, and the degree of automation is low. But since the
complexity of systems on chip still increases according to Moore's
law, there is a pressing need to improve the design process, and
to target other architectures, like dsp, or reconfigurable
fpga platforms. The next generation of systems on chip will probably mix all the basic blocks of today's technology (dsp, Asic, fpga, a network, and a memory hierarchy with many levels). We intend to participate in the design and programming of such platforms.
Our vision of the challenges raised by these new possibilities is
the following: one needs to understand the technological
constraints and the existing tools in order to propose
innovative, efficient, and realistic compilation techniques for
such systems. Our approach consists in modeling the optimization process as precisely as possible, and then in finding powerful techniques to approach the optimal solution. Past experience has shown that taking all aspects of a problem into account simultaneously is nearly impossible.
Compsys has four research directions, each of which is a strong point in the project. These directions are clearly not independent. Their interactions are as follows: ``High-level Code Transformations'' (Section ) is on top of ``Optimization for Special Purpose Processors'' (Section ) and ``Compilation of Parallel Embedded Architectures'' (Section ), since its aim is to propose architecture-independent transformations. ``Federating Polyhedral Tools'' (Section ) is transversal because these tools may be useful in all other actions.
Applications for embedded computing systems generate complex
programs and need more and more processing power. This evolution
is driven, among others, by the increasing impact of digital
television, the first instances of umts networks, and the
increasing size of digital media, like recordable dvds.
Furthermore, standards are evolving very rapidly (see for instance the successive versions of mpeg). As a consequence, the industry has rediscovered the advantages of programmable structures, whose flexibility more than compensates for their larger size and power consumption. The appliance provider has a choice between hard-wired structures (Asic), special-purpose processors (Asip), and quasi-general-purpose processors (dsp for multimedia applications). Our cooperation with ST-Microelectronics leads us to investigate the last solution, as implemented in the ST100 dsp and the ST200 Very Long Instruction Word (vliw) processor.
Embedded applications have special program profiles and dataflows.
The power consumption is more than proportional to the clock
frequency. Since the program is loaded in permanent memory (rom, Flash, etc.), its compilation time is not significant. Under these conditions, it is interesting to use aggressive and costly compilation techniques, including the use of exact solutions to NP-hard problems. Our aim is thus to find exact or heuristic solutions to combinatorial problems that arise in compilation for vliw and dsp processors, and to integrate these methods into industrial compilers for dsp processors (mainly the ST100 and ST200). These combinatorial problems arise mainly in the removal of the multiplexer functions of the ssa (``Static Single Assignment'') form, in software pipelining with register allocation, and in code placement for the instruction cache.
One of the challenging features of today's processors is predication; how to exploit it fully on vliw and dsp processors is still an open question.
Compilation for embedded processors is difficult because the architecture
and operations are specially tailored to the task at hand, because the amount
of resources is strictly limited, and lastly, because one would like
to take the time to implement costly solutions. For instance, predication, the potential for instruction-level parallelism (simd, mmx), the limited number of registers and the small size of the memory, the use of direct-mapped instruction caches, but also the special form of the applications and of intermediate representations such as the ssa form, all have to be taken into account.
The degree of parallelism of an application and the degree of
parallelism of the target architecture do not usually coincide.
Furthermore, most applications have several levels of parallelism:
coarse-grained parallelism as expressed, for instance, in a Kahn process network (see Sect. ); loop-level parallelism, which can be expressed by vector statements or parallel loops; and instruction-level parallelism as in ``bundles'' for Epic or vliw processors. One of the tasks of the compiler is to match the degree of parallelism of the application to that of the architecture, in order to get maximum efficiency. This is equivalent to finding a schedule which respects dependences and meets resource constraints.
This problem has several variants, depending on the level of
parallelism and the target architecture.
For instruction-level parallelism, the classical solution, which is found in many industrial-strength compilers, is to do software pipelining using a technique known as modulo scheduling. This can
be applied to the innermost loop of a nest and, typically,
generates code for an Epic, vliw, or super-scalar processor.
The problem of optimal software pipelining can be exactly
formulated as an integer linear program, and recent research has
allowed many constraints to be taken into account, as for instance
register constraints. However the codes amenable to these
techniques are not fully general (at most one loop) and the
complexity of the algorithm is still quite high. Several phenomena
are still not perfectly taken into account. Some examples are
register spilling, and loops with a small number of iterations.
One of our aims is to improve these techniques, and to adapt them
to the ST-Microelectronics processors.
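As a concrete illustration, here is a minimal sketch of modulo scheduling (hypothetical operations and latencies, and a greedy heuristic with a modulo reservation table, rather than the exact integer linear programming formulation mentioned above):

def modulo_schedule(ops, deps, n_units, ii):
    """ops: operation names in dependence order; deps: op -> list of (pred, latency);
    n_units: identical functional units; ii: candidate initiation interval.
    Returns {op: cycle} or None if no schedule is found at this ii."""
    mrt = {slot: 0 for slot in range(ii)}      # modulo reservation table
    sched = {}
    for op in ops:
        earliest = max([sched[p] + lat for p, lat in deps.get(op, [])], default=0)
        t = earliest
        while t < earliest + ii and mrt[t % ii] >= n_units:
            t += 1                             # probe at most ii consecutive slots
        if mrt[t % ii] >= n_units:
            return None                        # resource conflict: ii is too small
        mrt[t % ii] += 1
        sched[op] = t
    return sched

# Toy loop body on a single functional unit: load -> mul -> add -> store.
ops = ["load", "mul", "add", "store"]
deps = {"mul": [("load", 2)], "add": [("mul", 2)], "store": [("add", 1)]}
ii = 1
while (schedule := modulo_schedule(ops, deps, n_units=1, ii=ii)) is None:
    ii += 1                                    # smallest feasible initiation interval
print("II =", ii, "schedule =", schedule)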
It seems to be difficult to extend the software pipelining method to loop nests. However, embedded computing systems, especially those concerned with image processing, are two-dimensional or more. Parallelization methods for loop nests are well known, especially in tools for automatic parallelization, but these do not take resource constraints into account. The usual method consists in finding totally parallel loops, for which the degree of parallelism is equal to the number of iterations. The iterations of these loops are then distributed among the available processors, either statically or dynamically. Most of the time, this distribution is the responsibility of the underlying runtime system (consider for instance the ``directives'' of the OpenMP library). This method is efficient only because the processors in a supercomputer are identical. It is difficult to adapt it to heterogeneous processors executing programs with variable execution times. One of today's challenges is to extend and merge these techniques into some kind of multi-dimensional software pipelining, or resource-constrained scheduling. In the Yaka software prototype (see Section ), we are exploring a heuristic in which resource constraints are simulated by data constraints. This method is not completely satisfactory (the results may be far from the optimum), but we hope that the experience we have acquired in this way will help us to find a direct solution.
Embedded systems generate new problems in high-level code optimization, especially in the case of loop optimization. During the last 20 years, with the advent of parallelism in supercomputers, the bulk of research in code transformation was mainly concerned with parallelism extraction from loop nests. This resulted in automatic or semi-automatic parallelization. It was clear to all concerned that there were other factors governing performance, such as the optimization of locality or a better use of registers, but these factors were considered, wrongly, to be less important than parallelism extraction. Today, we have realized that performance is the result of many factors and, especially in embedded systems, everything that has to do with data storage is of prime importance, as it impacts power consumption and chip size.
In this respect, embedded systems have two main characteristics. Firstly, they are mass-produced. This means that the balance between design costs and production costs has shifted, giving more importance to production costs. For instance, each transformation that reduces the physical size of the chip has the side effect of increasing the yield, hence reducing the fabrication cost. Similarly, if the power consumption is high, one has to include a fan, which is costly, noisy, and unreliable. Another point is that many embedded systems are powered from batteries with bounded capacity. Architects have proposed purely hardware solutions, in which unused parts of the circuit are put to sleep, either by gating the clock or by cutting off the power. It seems that the efficient use of these new features needs help from the compiler. However, power reduction can also be obtained when compiling, e.g., by making better use of the processors or of the caches. For these optimizations, loop transformations are the most efficient tool.
As the size of the needed working memory
may change by orders of magnitude, high-level code optimization
also has much influence on the size of the resulting circuit. If
the system includes high performance blocks like dsp or ASICs,
the memory bandwidth must match the requirements of these blocks.
The classical solution is to provide a cache, but this is adverse to the predictability of latencies, and the resulting throughput may not be sufficient. In that case, one resorts to scratch-pad memories, which are simpler than a cache but require help from the programmer and/or the compiler to work efficiently. The compiler is a natural choice for this task. One then has to solve
a scheduling problem under the constraint that the memory size is
severely limited. This is a generalization of the classical
problem of register allocation. Loop transformations reorder the
computations, hence change the lifetime of intermediate values,
and have an influence on the size of the scratch-pad memories.
The theory of scheduling is mature for cases where the objective function is the make-span or is related to the make-span. For other, non-local objective functions (i.e., when the cost cannot be directly allocated to a task), there are still many interesting open problems. This is especially true for memory-linked problems.
Many local memory optimization problems have already been solved
theoretically.
Some examples are loop fusion and loop alignment for array contraction and for minimizing the length of the reuse vector.
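As a minimal illustration (hypothetical code, not taken from a real benchmark), fusing a producer loop with its consumer allows the temporary array to be contracted to a scalar:

n = 8
a = list(range(n))

# Before fusion: the temporary t must hold n intermediate values.
t = [0] * n
b = [0] * n
for i in range(n):
    t[i] = 2 * a[i]
for i in range(n):
    b[i] = t[i] + 1

# After fusion and contraction: the same result with a single scalar.
b_fused = [0] * n
for i in range(n):
    s = 2 * a[i]             # the former t[i], now a scalar
    b_fused[i] = s + 1

assert b == b_fused          # working memory shrinks from n elements to one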
Theoretical studies here search for new scheduling techniques, with objective functions which are no longer linear. These techniques may find both high-level applications (for source-to-source transformations) and low-level applications (e.g., in the design of a hardware accelerator). Both cases share the same computation model, but the objective functions may differ in their details.
One should keep in mind that theory will not be sufficient to solve these problems. Experiments are required to check the adequacy of the various models (computation model, memory model, power consumption model) and to select the most important factors according to the architecture. Besides, optimizations do interact: for instance, reducing memory size and increasing parallelism are often antagonistic. Experiments will be needed to find a global compromise between local optimizations.
In the framework of a cooperation with Cadence, Antoine Fraboulet has the opportunity to evaluate these methods with the help of codesign tools like vcc. We also work, with the r2d2 project, on loop compilation as a tool for the design of specialized hardware coprocessors. This will be a way to validate our theoretical models.
Alain Darte, who is cooperating on a regular basis with the PiCo project at HPLabs, has already proposed some solutions to the memory minimization problem. These ideas will be implemented in the PiCo compiler in order to find their strengths and weaknesses.
Embedded systems have a very wide range of power and complexity. A circuit for a game gadget or a pocket calculator is very simple. On the other hand, a processor for digital TV needs a lot of computing power and bandwidth. Such performances can only be obtained by aggressive use of parallelism.
The designer of an embedded system must meet two challenges:
he has to specify the architecture of his system; this must provide the required power, but no more than that;
when this is done, he has to write the required software.
These two activities are clearly dependent, and the problem is how to handle their interactions.
The members of Compsys have long experience in compilation for parallel systems, high-performance computers, and systolic arrays. In the design of embedded computing systems, one has to optimize new objective functions, but most of the work done in the polyhedral model can be reinvested. Our first aim is thus to adapt the polyhedral model to embedded computing systems, but this is not a routine effort. As we will see below, a typical change is to transform an objective function into a constraint or vice-versa. This operation may transform a linear program into a non-linear one, or a continuous program into an integer program. The models of an embedded accelerator and of a compute-intensive program may be similar, but one may have to use very different solution methods because the unknowns are no longer the same; this is the scientific challenge of the subject.
The advent of high-level synthesis techniques allows one to create specific designs for reconfigurable architectures, for instance with MMAlpha targeting fpgas. Mastering this design flow will allow designers to use it with a full knowledge of its possibilities. To reach this goal, one has first to firm up the underlying methodology and then to create an interface toward tools for control-intensive applications.
Toward this goal, the team will use the know-how that Tanguy Risset acquired during his participation in the Cosi project (before 2001) and also the knowledge of some members of the Arénaire project (Lip). This work is a natural extension of the ``high-level synthesis'' action in the Inria project Cosi. We want to show that, for some applications, we can propose, in less than 10 minutes, a correct and flexible design (including the interfaces) from a high-level specification (in C, Matlab, or Alpha). We also hope to demonstrate an interface between our tool, which is oriented toward regular applications, and synchronous language compilers (Esterel, Syndex), which are more control-oriented.
Kahn process networks (KPN) were invented thirty years ago.
The problem with KPNs is that they have an asynchronous execution model, while vliw processors and Asics are synchronous or partially synchronous. Thus, there is a need for a tool for
synchronizing KPNs. This is best done by computing a schedule
which has to satisfy data dependences within each process, a
causality condition for each channel (a message cannot be received
before being sent), and real time constraints. A prototype of such
a scheduler, Yaka (Yet Another KPN Analyzer) is being
developed. This tool extends the scheduling techniques we
developed for high-performance computers. Handling real-time
constraints in this model is especially easy. For instance, if the
constraint is in the form of an upper bound on the latency, one
has to write that all values of the schedule are less than a
maximum. This gives constraints which are similar to the data
dependence constraints and can be solved by the same tools. It is
even possible to keep the clock period as an unknown, and to
select the maximum value for which the problem is still feasible.
Since power consumption is a decreasing function of the clock
period, this is a way of reducing dissipation. We expect to test
these ideas in the Yaka prototype.
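As a generic illustration (a simplified formulation, not necessarily the exact one used in Yaka), if $\theta_P(k)$ and $\theta_Q(k)$ denote the dates at which process $P$ writes and process $Q$ reads the $k$-th message of a channel, the scheduler must enforce
\[
\theta_Q(k) \ge \theta_P(k) + 1 \quad\text{(causality, for every } k\text{)}, \qquad
\theta(\mathrm{op}) \le T_{\max} \quad\text{(latency bound)},
\]
and both families of inequalities are affine in the unknown schedule coefficients, so the latency bound is handled by the same linear programming machinery as the data dependences.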
The scheduling techniques of MMAlpha and Yaka are complex
and need powerful solvers using methods from operational research.
One may argue that compilation for embedded systems can tolerate
much longer compilation times than ordinary programming, and also
that Moore's law will help in tackling more complex problems.
However, these arguments are invalidated by the empirical fact
that the size and complexity of embedded applications increase at
a higher rate than Moore's law. Hence, an industrial use of our
techniques requires a better scalability, and in particular,
techniques for modular scheduling. Some preliminary results have
been obtained by Triolet and Irigoin at Ecole des Mines de Paris
(especially in the framework of inter-procedural analysis), and in
MMAlpha (definition of structured schedules). This work must be
continued; one of the crucial points is the handling of
off-the-shelf components (ip) in the design of embedded systems.
Off-the-shelf components pose another problem: one has to design
interfaces between them and the rest of the system. This is
compounded by the fact that a design may be the result of
cooperation between different tools; one has to design interfaces,
this time between elements of different design flows. Part of this
work has been done inside MMAlpha; it takes the form of a
generic interface for all linear systolic arrays. Our intention is
to continue in that direction, but also to consider other
solutions, like Networks on Chip and standard wrapping protocols like vci from vsia.
Present day tools for embedded system design have trouble handling loops.
This is particularly true for logic synthesis systems, where loops
are systematically unrolled (or considered as sequential) before
synthesis. An efficient treatment of loops needs the polyhedral model.
This is where past results from the automatic parallelization community
are useful. The French community is a leader in this field, mainly as a long-term result of earlier work on automatic parallelization.
The polyhedral model is now widely accepted (Inria projects Cosi and A3, PIPS at Ecole des Mines de Paris, Suif from Stanford University, Compaan at Berkeley and Leiden, PiCo from HP-Labs, the dtse methodology at Imec, etc.). Most of these are research projects, but the increased involvement of industry (Hewlett-Packard, Philips) is a favorable factor. Polyhedra are also used in test and certification projects (Verimag, Lande, Vertecs).
Two basic tools that have emerged from this period are Pip and the Polylib.
In the following, we distinguish the development of existing tools, and the conception and implementation of new tools. These tasks are nevertheless strongly related. We anticipate that most of the new techniques will be evolutions of the present day tools rather than revolutionary developments.
In the last two years, we have greatly increased the availability
of Pip and the Polylib. Both tools can now use exact
arithmetic. A cvs archive has been created for cooperative
development. The availability for one year of an odl software
engineer has greatly improved the Polylib code. A bridge for
combined use of the two tools has been created by Cédric Bastoul
(UPMC). These tools have been the core of new code generation tools.
For Pip: algorithmic techniques for a better control of the size of intermediate values; comparison with commercial tools like Cplex for the non-parametric part of the tool.
For the Polylib a better handling of
For higher-level tools, Antoine Fraboulet is working on the integration
of his thesis results into the Suif platform, and aims at
participating in the implementation of the compiler for the
Dart reconfigurable platform in cooperation with the Inria
project r2d2.
For all these tools, we want to strengthen the user community by participating in the Polylib forum and organizing meetings for all interested parties.
Industry is now conscious of the need for special programming
models for embedded systems. Scholars from the University of
Berkeley have proposed new models (process networks, sdl,
etc.). This has culminated in the use of Kahn process networks,
for which a complete overhaul of parallelization techniques is
necessary. Paul Feautrier is working in this direction.
Besides, our community has focused its attention on linear programming tools. For embedded systems, the multi-criteria aspect is pervasive, and this might necessitate the use of more sophisticated optimization techniques (non-linear methods, constraint satisfaction techniques; ``pareto-optimal'' solutions).
Here again, our contributions in these areas will be facilitated by our leadership in polyhedral tools. We nevertheless expect that, as in the past, the methods we need have already been invented in other fields like operational research, combinatorial optimization or constraint satisfaction programming, and that our contribution will be in the selection and adaptation (and possibly the implementation) of the relevant tools.
The Polylib, our library for polyhedra manipulation, is freely available. An odl software engineer has firmed up the basic infrastructure of the library. The development
is now shared between Compsys, the Inria project
r2d2 in Rennes, the icps team in Strasbourg and the
University of Leyden. This tool is in use by many groups all over
the world.
Paul Feautrier has been the main developer of Pip (Parametric Integer Programming) since its inception in 1988. Basically, Pip is an ``all integer'' implementation of the Simplex algorithm, augmented for solving integer programming problems (the Gomory cuts method), which also accepts parameters in the non-homogeneous term. Most of the recent
work on Pip has been devoted to solving integer overflow problems by using better algorithms. This has culminated in the implementation of an exact arithmetic version over the gmp library. Pip is freely available under the gpl.
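As a small illustration of parametric integer programming (a toy example, not taken from Pip's documentation), for a parameter $n \ge 0$ the problem
\[
\min\ \{\, x \in \mathbb{Z} \mid x \ge 0,\ 2x \ge n \,\} \;=\; \lceil n/2 \rceil
\]
has a solution that is not a single affine form but a decision tree (quast) over the parity of $n$; this closed-form, parameter-dependent answer is exactly the kind of result Pip produces.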
Tanguy Risset has been the main developer of MMAlpha since 1994; the development is now shared with the r2d2 project in Rennes. MMAlpha is being evaluated by ST-Microelectronics and has been a basis for Compsys participation in European and rntl Calls for Proposals.
Antoine Fraboulet is participating in the design of the Dart compiler, whose target is reconfigurable architectures, in cooperation with the r2d2 project at Rennes and Lannion. This compiler is built over the SUIF1 infrastructure from Stanford University.
Yaka is a Kahn process network scheduler. Its development has benefited from the help of François Thomasset (Inria-Rocquencourt). The main improvements this year were the automation of message counting (via the Polylib) and of the code generation module (using Cloog).
Compsys has bought two fpga cards for rapid prototyping. The cards are WildCard II cards (from Annapolis Inc), using the Xilinx XCV3000 fpga circuit. These cards can be plugged into the PCMCIA slot of any laptop. They will be used as demonstrators for the design tools from Compsys. We hope they will contribute to the visibility of the project.
Results: One of the most appealing uses of a high-level synthesis tool is design space exploration. In its present version, MMAlpha builds only simple
architectures (the latency is a linear function of the size
parameter). In this case, the architecture exploration is quite
intuitive and can be left to the designer. When a non-linear or
multi-dimensional schedule is needed, we need much more complex
architectures (for instance, memory banks are needed for storing
intermediate values). Design space exploration becomes more
difficult, and the use of a tool similar to MMAlpha is
necessary. We have defined a new design class, which is less
constrained than systolic arrays, but for which the same synthesis
principles can be used. This result has been presented at the ASAP 2003 conference. The next step is memory generation besides register generation.
future prospects: Multi-dimensional scheduling in
MMAlpha is only partially implemented. The usual systolic
synthesis is based on affine schedules, and generates simple
arrays, in which the memory elements are shift registers.
Multi-dimensional schedules need real addressable memories and
control signals. Hence, before tools can be constructed, the real
challenge is to understand how to synthesize complex circuits
including memories and implementing nested loops.
Results: As part of the participation of Compsys in the SocLib project, we have defined the constructors that are needed to express the semantics of the AlpHard subset of Alpha, while obeying the SocLib design rules. The translation of the hardware design as given by MMAlpha is nearly operational. One still has to express the virtual component interface (vci) as a SystemC template.
future prospects: Thanks to our active cooperation
with SocLib, we can easily relate to the ip community. We plan
to integrate the design rules of SocLib into the designs that are
produced by MMAlpha. The resulting ips will use the vci (Virtual component interface, a simple protocol which
is becoming a standard for ip interfacing).
The challenge here is to standardize our work on interfacing regular applications, and to demonstrate it on the fpga platform we have bought.
Recent progress in cmos technology allows the integration of a complete parallel machine on a single chip. Connecting the various components of this machine is a challenge, and the most likely solution is the use of an on-chip network. In cooperation with SocLib, Compsys has access to the ground-breaking work of Lip6 (the Spin network). With the help of a master student, we have studied the adequacy of the usual network protocols (the network layer) for specific applications for systems on chip (mostly multimedia applications). The performance of the system has been found to be very sensitive to the value of some parameters like the size of the FIFO buffers. This work has been presented at SympAAA'03; it benefited from the support of cnrs and of ST-Microelectronics.
In cooperation with one of our past PhD students, Guillaume Huard, we completed our work on the instruction-shift algorithms which were the subject of his thesis. Our initial aim was to build firm theoretical foundations for this transformation, especially with regard to parallelism detection. Other scholars had proposed a polynomial algorithm, but there was a flaw in it, and we proved that the problem is NP-complete. We also considered the problem of maximizing local references. There existed an NP-completeness result for this task, but there was an error in the proof.
We established the NP-completeness of the instruction-shift problem.
This result was presented at STACS 2002.
Our expertise in instruction-shift transformations allowed us to
look into an older problem from a new perspective. The problem is
that of reducing the memory footprint of a sequence of loops, which is
important for embedded systems. The transformation consists in contracting arrays into scalars when this is possible. This problem had exact solutions but no complexity estimate. Again, we proved, using non-trivial methods, that this problem is NP-complete. We also designed an integer programming method for its solution; this result was presented at ASAP 2002.
During a visit to Rice University (October 2000 to January 2001), we developed, in cooperation with John Mellor-Crummey et al., a new strategy for data distribution: multi-partitioning. For scientific applications that apply ``waves'' of computation to multi-dimensional data, this technique guarantees perfect load balancing. It has been implemented in the dHPF compiler from Rice. Up to now, this technique could be applied in 3 dimensions only if the number of processors was a perfect square (with similar restrictions in higher dimensions).
It is interesting to note that the techniques we have developed, notably the use of Hermite normal forms, make explicit links between systolic array partitioning and data distribution for parallel compilers.
Results: When designing hardware accelerators, one
has to solve both scheduling problems (when is a computation
done?) and memory allocation problems (where is the result
stored?). This is especially important because most of these
designs use pipelines between functional units (this appears in the code as successive loop nests), or between the external memory and the
hardware accelerator. To decrease the amount of memory, the
compiler must be able to reuse it (without using a costly cache).
An example is image processing, for which we might want to store
only a few lines and not the entire frame.
Reusing array memory is a recent development. The first study was by Lefebvre and Feautrier. In cooperation with HP-Labs and with Gilles Villard (Lip, Arénaire project), we have reviewed all past proposals to solve this problem. We found that all of them were using some form
of modulo allocation, without clearly saying so. We developed the
necessary mathematical formalism, using the concept of the
critical lattice, and we found that all preceding methods were
heuristics for finding a critical lattice. We were able to
quantify the qualities of these techniques, finding cases where
the results are good, and others in which they can be arbitrarily
far from the optimum. We proved that, in practice, one can
construct approximations which do not differ from the optimum more
than by a multiplicative constant which depends only on the
dimension of the problem. We also have shown that the problem is
connected to basis reduction methods (LLL), and with the
successive minima of a lattice. Lastly, we showed that the optimum
can be found by a clever exhaustive enumeration. These results
have been presented at CASES 2003.
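A minimal sketch of the underlying idea (a hypothetical one-dimensional example, not the lattice-based algorithm itself): when a value a[i] is last read two iterations after it is produced, a circular buffer indexed modulo 3 can replace the full array.

n = 10
m = 3                       # buffer size >= maximal lifetime span (2 iterations + 1)
buf = [0] * m
full = [0] * n              # reference version using the full array
out = []

for i in range(n):
    full[i] = i * i
    buf[i % m] = i * i      # modulo allocation of a[i]
    if i >= 2:              # the consumer reads a[i] and a[i-2]
        out.append(full[i] + full[i - 2])
        assert full[i] + full[i - 2] == buf[i % m] + buf[(i - 2) % m]

print(out)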
future prospects: In this preliminary work, we have identified the basic mathematical tools we need to solve the memory reuse problem. However, many questions are still unanswered. Can the existing heuristics be improved or combined?
Answering these questions would have an impact both on the practical problem of designing hardware accelerators and on the mathematical problem of finding the critical lattice of an object. This is a fundamental algorithmic problem, which has not attracted much attention from mathematicians. We intend to build a code that searches for the critical lattice on top of our present tools, Pip and the Polylib. In practice, heuristics for this problem are already in use in tools like PiCo. We hope this research will renew the interest of our cooperation with Synfora.
Results: The ssa form (Static Single Assignment) is an intermediate representation in which each variable is assigned only once; multiplexers (called φ-functions) are inserted where several definitions of a variable reach the same use.
In cooperation with François de Ferrière and Christophe Guillon from the MCDT team at ST-Microelectronics, we worked on the optimal removal of the ssa form when translating into machine language. Previous approaches introduce an interesting technique, pinning, for representing renaming constraints. Our idea was to use this pinning technique to constrain the renaming of variables during ssa removal.
We have implemented our algorithm in the LAO assembly code
optimizer from ST-Microelectronics. Comparison with other approaches gave
interesting results, which we explained by comparing to several
hand-coded examples.
All these results are presented in an internal research report.
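For illustration, here is a minimal sketch (hypothetical instruction strings) of the textbook way of removing a φ-function by inserting copies in the predecessor blocks; it ignores the lost-copy and swap problems that a real implementation, and in particular the pinning-based approach, must handle.

# x3 = phi(x1 from B1, x2 from B2) at the head of block B3.
blocks = {
    "B1": ["x1 = a + 1"],
    "B2": ["x2 = b * 2"],
    "B3": ["x3 = phi(B1: x1, B2: x2)", "y = x3 + c"],
}

def remove_phis(blocks):
    out = {name: list(code) for name, code in blocks.items()}
    for name, code in blocks.items():
        for instr in code:
            if "= phi(" not in instr:
                continue
            dest, args = instr.split(" = phi(")
            out[name].remove(instr)
            for arg in args.rstrip(")").split(","):
                pred, var = [s.strip() for s in arg.split(":")]
                out[pred].append(f"{dest.strip()} = {var}")  # copy in the predecessor
    return out

for name, code in remove_phis(blocks).items():
    print(name, code)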
future prospects: The problem of ssa removal may be modeled as a classical problem of register coalescing as found
in the register allocation phase. There, one does ``conservative''
coalescing, which means that it does not change the chromatic
number of the interference graph. For aggressive coalescing, we
had to prove new NP-completeness results.
On the other hand, our study of aggressive coalescing and its
interaction with Chaitin coloring techniques has shown that we
were not solving the real problem (except in the case where the
number of physical registers is unlimited). This led us to consider a more general register allocation problem, still keeping in mind that the source code we are translating is in ssa form. This is especially true in our collaboration with MCDT at ST-Microelectronics. The goal is thus to remove the ssa form while taking register allocation into account.
These results are still incomplete and must be firmed up.
Nevertheless, we have introduced new concepts which will allow us
to design new heuristics for fixed register allocation, both in
the general case and for the ST220 dsp. The next step
will be to experiment with these new heuristics. LAO2, an
optimizer for assembly code which is under development at
ST-Microelectronics, seems to be the ideal vehicle for these experiments.
Results: A schedule for a Kahn process Network
(KPN) has to meet the following constraints:
the classical data dependences within each process and
message dependences (a message cannot be received before being sent),
size constraints, which bound the number of messages which have been sent but not received.
The last two types of constraints can be expressed by a virtual numbering of the messages. We have given explicit formulas for message counting. These formulas include sums, for which closed expressions can be found using the theory of Ehrhart polynomials as implemented in the Polylib. This counting code has been implemented in Yaka with the help of François Thomasset from Inria-Rocquencourt.
In image processing, especially for digital TV, these counting functions are linear, and scheduling can proceed using classical methods. In some cases, the counting functions are low-degree polynomials, for instance when painting triangular shapes. In that case, we can resort to a heuristic method, in which the triangular loop is enlarged to a rectangular one, a schedule is computed, and the code is corrected to avoid superfluous messages and computations.
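As a small illustration (a hypothetical triangular producer loop), the number of messages sent strictly before iteration (i, j) is the degree-2 polynomial i(i+1)/2 + j, the kind of closed form that the Polylib computes symbolically:

def rank(i, j):
    return i * (i + 1) // 2 + j     # messages sent before iteration (i, j)

count = 0
for i in range(6):
    for j in range(i + 1):          # triangular domain: 0 <= j <= i
        assert rank(i, j) == count  # the closed form matches the running count
        count += 1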
future prospects: The problem here is how to handle the message constraints when the counting functions are polynomial. As said before, there is a heuristic method to handle this situation, but it entails a loss of processing power. On the other hand, it seems that in this case the schedule must be polynomial. This is linked to the problem of multidimensional time (see Sect. ). We are looking for more direct solutions to the problem. A first attempt to use the test-point method (V. Weispfenning, Passau) did not give significant results.
Results: The problem here is to optimize locality
of memory references for reducing the traffic between the cache
and the main memory. Our first step has been to reformulate the
problem, using asymptotic estimates of the traffic instead of the exact formulas, which are too complex to be tractable. This can be done by emulating the way a scratch-pad memory is used. The program is divided into chunks. Before starting the execution of a chunk, the necessary data are copied into the scratch-pad memory. In this way, all misses are concentrated at the beginning of a chunk, and counting them is easy. We can handle in this way temporal locality optimization, spatial locality optimization, self reuse, and group reuse, while ensuring that the data dependences are always satisfied. These techniques are implemented in the Chunky
optimizer and the Cloog code generator. Chunky is probably the
only locality optimizer that tolerates arbitrary dependences in
the source code.
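A minimal sketch of this chunking scheme (hypothetical sizes, with the scratch-pad modeled as a Python list): all transfers happen in the copy-in phase that precedes each chunk, so they are easy to count.

n, chunk = 16, 4
a = list(range(n))
result = []
transfers = 0

for start in range(0, n, chunk):
    scratch = a[start:start + chunk]    # copy-in to the scratch-pad
    transfers += len(scratch)           # all misses are concentrated here
    for x in scratch:                   # the chunk itself touches only the scratch-pad
        result.append(2 * x + 1)

assert result == [2 * x + 1 for x in a]
print("elements transferred:", transfers)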
future prospects: Chunky currently handles spatial and temporal locality, and self reuse as well as group reuse. We have not attempted to modify array layouts, as this is a non-local transformation that may benefit some parts of the code but degrade the performance of other parts.
There is, however, an important technique that is not handled by the present code: tiling. Intuitively, tiling the iteration space seems just another way of building chunks. Nevertheless, we have not yet found a seamless way of unifying chunks and tiles, and this is the most important question we have to answer now.
Our software for code generation, Cloog, has reached maturity and
is widely used. However, it still needs experimentation; in code generation, there are many degrees of freedom which can be exploited for better efficiency. For instance, we discovered experimentally that two loops at the same level can sometimes be arbitrarily displaced with respect to each other. Separating them
completely is best for a parallel machine. However, coalescing
them to get the maximum overlap is best for vliw-like machines.
It gives speedups in excess of 2 on the Itanium, while the
improvement is marginal on a Pentium.
In the long term, the design of chunking was guided by the way a local or scratch-pad memory should be used. While the basic techniques are now well understood, a lot of questions still need answers. What is the best programming interface for a scratch-pad memory? How does one compute the layout of array fragments in the scratch-pad? How does one rewrite array accesses? What is the interplay between scratch-pad memories and parallelism (one should remember that the Cray II had a scratch-pad memory which was useless because the designers had not considered this point)?
Positive answers to these questions will increase the usefulness of scratch-pad memories in SoCs; their use will be an important step toward solving the problems of predictability for real-time systems.
The instruction cache is a small memory with fast access. All binary
instructions of a program are executed from it. In the ST220
processor from ST-Microelectronics, the instruction cache is direct mapped.
In a direct-mapped cache, each memory address maps to exactly one cache line.
When a program starts executing a block which is not in the cache, one must load it from main memory; this is called a cache miss. This happens either at the first use of a function (call miss), or after a conflict (conflict miss). There is a conflict when two functions share the same cache lines; each of them removes the other from the cache when their executions are interleaved. The cost of a cache miss is of the order of 150 cycles for the ST220. Hence the interest of minimizing the number of conflicts by avoiding line sharing between two functions that are executed in the same time slot (a small sketch of conflict counting is given after the two objectives below). This problem has in fact two objective functions:
Minimizing the number of conflicts for a given execution trace (problem COL). This is equivalent to the ``C-coloring'', ``SHIP-BUILDING'', and ``edge deletion'' problems.
Minimizing the size of the code (problem EXP). This is equivalent to a traveling salesman problem (building a Hamiltonian circuit) on a very special graph.
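To make the conflict objective concrete, here is a minimal sketch (hypothetical trace, placements, and cache sizes) of counting misses in a direct-mapped instruction cache, assuming for simplicity that each function fits in one cache line:

line_size = 32
n_lines = 8                             # a direct-mapped cache of 8 lines

def misses(trace, placement):
    """trace: sequence of function names; placement: function -> start address."""
    cache = [None] * n_lines
    count = 0
    for f in trace:
        line = (placement[f] // line_size) % n_lines
        if cache[line] != f:
            count += 1                  # call miss or conflict miss
            cache[line] = f
    return count

trace = ["f", "g", "f", "g", "f", "g"]
conflicting = {"f": 0, "g": 256}        # both map to line 0: 6 misses
separated = {"f": 0, "g": 32}           # distinct lines: only 2 call misses
print(misses(trace, conflicting), misses(trace, separated))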
Classically, COL is solved first and EXP next. With Florent Bouchez (a third-year student of ens-Lyon), we have solved EXP for a given solution of COL. This work is part of our cooperation with ST-Microelectronics. We have shown that the problem can be solved in polynomial time. We have also proposed a heuristic for COL.
These optimizations have been implemented in the loader of the ST220 processor. Experimentation on several representative benchmarks shows that code expansion is significantly reduced without any increase in the number of cache misses. A technical report collecting these results is in preparation.
The ProCD (Programmable Consumer Devices) contract is funded by an stsi convention. Its objective is the design of programs for multimedia signal processing on a vliw architecture. Compsys's contribution is to give exact solutions to combinatorial optimization problems which arise when compiling programs for vliw processors (see Sect. ). These methods are then integrated into industrial-strength compilers for the ST220. We have worked on the following problems:
Software pipeline synthesis using modulo scheduling, including register allocation.
ssa removal.
Code placement for instruction cache optimization.
There is another stsi contract between Compsys and ST-Microelectronics
(Crolles plant) on experimentation with MMAlpha for synthesis of
ST-Microelectronics special purpose circuits. Its duration is 8 months. Due
to administrative problems, the funding is not yet available, but the
work has nevertheless begun (overhauling the Alpha-vhdl
translator). The contract is to be renewed next year.
Network of excellence: Compsys is involved in two ``expressions of interest'' for the 6th PCRD of the European Union: HIPEACS and EuroSoc. HIPEACS has successfully passed the first round of evaluation.
STREP Project
Compsys is one of the partners in the Spotlight STREP project proposal on rapid prototyping of embedded computing system performance in the field of signal processing and multimedia applications.
Compsys is a partner in two rntl submissions. The first one,
on extensions to the KPN formalism, has been rejected. The second one,
which is currently being evaluated, deals with a connection between
MMAlpha and the B method.
Tanguy Risset and Antoine Fraboulet are members of the SocLib project (a library of ip cores for Systems on Chip).
Compsys participates in the following RTPs: ``Architecture and Compilation'' and ``System on Chip (soc)''.
A convention between UIUC and cnrs supports visits and cooperation between David Padua's team and Compsys.
Paul Feautrier's cooperation with the University of Passau is supported by a Procope convention which has been recently renewed.
Tanguy Risset is in regular contact with the University of Québec at Trois-Rivières (Canada), where MMAlpha is in use.
Compsys is in regular contact with Sanjay Rajopadhye's team at Colorado State University (USA).
Compsys is in regular contact with Franky Catthoor's team in Leuven (Belgium) and with Ed Deprettere's team at Leiden University (the Netherlands).
Alain Darte has fruitful relations with the PiCo team,
notably with Rob Schreiber and Bob Rau (HP-Labs, Palo Alto, USA); see the two patents.
Paul Feautrier has regular contact with Zaher Mahjoub's team in the Faculté des Sciences de Tunis, notably as the co-advisor of a PhD student.
Compsys is in regular contact with Christine Eisenbeis's team (Inria project A3/Alchemy), with François Charot and Patrice Quinton (Inria project r2d2), and with Alain Greiner and Frédéric Pétrot (Asim, LIP6).
Compsys participates in work meetings with LETI (CEA Grenoble), preparatory to a cooperation on power minimization.
Alain Darte was the program chair of ASAP 2003, with Lothar Thiele and Joe Cavallaro, and the chair of track 4 (Compilers for High Performance) of Europar 2002, with Martin Griebl.
Alain Darte was a member of the program committee of CASES 2003 (ACM Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems).
Alain Darte was a member of the steering committee of CPC 2003 and CPC 2004 (Compilers for Parallel Computers). He has recently been co-opted onto the editorial board of the international journal ACM TECS (Transactions on Embedded Computing Systems).
Paul Feautrier is Associate Editor of the journals Parallel Computing and International Journal of Parallel Programming.
Tanguy Risset is a member of the editorial board of the review Integration: the VLSI Journal.
In 2003-2004, Alain Darte is sharing this course with Yves Robert (Remap).
Paul Feautrier is thesis advisor for Cédric Bastoul (UPMC-Paris VI), Christophe Alias (UVSQ, Denis Barthou is co-advisor) and Yosr Slama (Faculté des Sciences de Tunis, Zaher Mahjoub is co-advisor).
Alain Darte is in charge of the ``Computer Science'' division of
the entrance exam to ens-Lyon.
Paul Feautrier teaches the following subjects for first and second year students:
A guided tour of Unix (MIM1).
Operational Research (MIM2).
Compilation project (MIM2).
Paul Feautrier is the coordinator of the ``Architecture and Compiler''
track of the new Master of ens-Lyon.
In 2003-2004, Tanguy Risset is teaching the Compilation course to second year
students at ens-Lyon (MIM2).
Alain Darte is an elected member of the hiring committee of
ens-Lyon, for the Computer Science section.
Alain Darte is a member, since 2000, of the PhD committee of AFIT (Association Française de l'Informatique Théorique) which gives awards to outstanding PhD dissertations.
Alain Darte and Paul Feautrier were members of the steering committee of RTP ``Compilation and Architecture''.
Paul Feautrier is a member of the Governing Board and of the PhD committee
of ens-Lyon.
Paul Feautrier is a member of the Audit Committee of LSIIT (UMR 7005, Louis Pasteur University, Strasbourg).
Tanguy Risset is in charge of the Polylib mailing list. This list includes most of the actors in the polyhedral model community.
Tanguy Risset is a member of the defense committee for Anne-Claire Guillou (December 19, 2003, IRISA).
Alain Darte was a reviewer of Sid-Ahmed-Ali Touati's PhD dissertation ``On the influence of instruction-level parallelism on register pressure'' (June 2003, advisor: Christine Eisenbeis). He was also a member of the defense committee for Youcef Bouchebaba (``Data Transfer Organization for Signal Processing: Tiling, Loop Fusion and Array Reallocation'', advisor: François Irigoin).
Paul Feautrier is a reviewer for the HDR (Professorial Thesis) of Frank Delaplace, to be defended on November 28, 2003, at the University of Evry Val-d'Essonne.
Alain Darte gave the following seminars:
Frédéric Pétrot and Alain Greiner's team (Paris), February 2003: MMAlpha and PiCo synthesis methods.
Palo Alto, USA, October 27, 2003: memory reuse by modulo allocation; the same talk was given at Synfora (Mountain View, USA) on October 28, 2003.
Tanguy Risset gave the following seminars:
Paris, October 2003: translation of Alpha to the CABA (cycle accurate, bit accurate) model of SocLib.
Grenoble, November 2003: using MMAlpha for the synthesis of CEA architectures.
Tanguy Risset has presented the following paper: ``Hardware
Synthesis for Multi-Dimensional Time'' at asap 2003.
Alain Darte attended the following conferences:
Amsterdam, the Netherlands, January 8 to January 10, 2003: Tenth Workshop on Compilers for Parallel Computers.
Germany, February 9 to February 15: ``Emerging Technologies: Can Optimization Technology meet their Demands?'', T. Conte, C. Eisenbeis, and M.-L. Soffa, organizers. Talk: ``PiCo, MMAlpha, etc. Towards the Compilation of Hardware Accelerators''.
Greece, July 21 to July 23, 2003: ``Systems, Architectures, Modeling, and Simulation'', Shuvra Bhattacharyya, Stamatis Vassiliadis, and Ed F. Deprettere, organizers. Talk: ``Lattice-Based Memory Allocation''.
Paul Feautrier will give a tutorial, ``Automatic Memory Allocation'', at JP'03 (Journées du Parallélisme 2003) at the Faculté des Sciences de Tunis.