High-performance microprocessors are used in information technology applications ranging from supercomputers and high-end multiprocessor servers to PCs and workstations, as well as in high-end embedded applications (avionics and networks, but also consumer products such as automotive systems, set-top boxes or cell phones). The theoretical performance of these processors has been increasing continuously for the past two decades. This trend continues at the cost of rising hardware complexity (transistor count, power consumption, design cost). At the same time, extracting a significant part of this theoretical performance is becoming more and more difficult for the end user, even with the assistance of a compiler.
Research in the CAPS project-team ranges from processor architecture to software platforms for performance tuning, including compiler/architecture interactions, and processor simulation techniques. Our objective is to enable the end user to exploit a significant fraction of this theoretical performance while still masking the underlying hardware complexity.
Our research in computer architecture covers memory hierarchy, branch
prediction, superscalar implementation, as well as SMT and
multicore processors.
In the recent past, we have proposed several new complexity-effective structures for caches and branch predictors.
The Instruction Set Architecture (ISA) is where the software meets the hardware. On the one hand, the compiler must optimize the code to take advantage of the micro-architecture. On the other hand, the ISA must be designed in such a way that the compiler can ``understand'' performance issues.
We are exploring the suitability of dynamic execution for embedded applications (cf. ). In light of current microarchitecture knowledge, we are trying to define the characteristics that a general-purpose ISA should feature to allow efficient and cost-effective implementation (cf. ). Research in architecture and code optimization requires experimenting with and evaluating new architectural ideas. We are defining several simulation frameworks to validate both embedded statically scheduled processor architectures (ABSCISS, cf. ) and out-of-order execution for general-purpose architectures (IAOO, cf. ).
Most scientific
applications make use of vector-like
computations, while the behavior of the memory hierarchies on such
computations is implementation dependent. We are developing an original approach to code generation for vector-like kernels on IA64 platforms (cf. ).
Most ISAs feature multimedia instructions. These instructions are
generally not well handled by compilers. We have designed a C source
code optimizer that automatically makes use of these instructions.
Some performance issues must be handled at a higher level than the direct interface between the hardware and the instruction set. For heterogeneous SoCs (Systems on a Chip) featuring special-purpose hardware and one or more execution cores, we are exploring thread extraction for the different hardware components (cf. ). Code size is often an issue with embedded systems. We are exploring tradeoffs leveraging code compression and interpretation (cf. ).
Compiler optimizations for embedded systems require advanced software platforms that can support a large family of low-end (assembly-level) optimizations and that can be easily retargeted to new ISAs as well as new microarchitectures (ALISE, cf. ). For the end user, understanding the effective performance of an application on a hardware platform is a major issue, since it is necessary to identify the limiting component (hardware, compiler or application implementation). We are developing a performance debugging tool for this purpose (ATLLAS, cf. ).
Finally, we use our knowledge of modern microarchitecture to participate in the definition of an unpredictable random number generator (HAVEGE, cf. ).
Our research is partially supported by industry (Intel, STMicroelectronics, Thomson MMD). We also participate in several institutionally funded projects (OPPIDUM PACCMAN, MEDEA+ MESA, Ministry of Industry EPICEA, ACI Sécurité UNIHAVEGE). Among our main partners in these institutionally funded projects are Bull, EADS and STMicroelectronics.
Some of the research prototypes developed by the project during the past few years are currently being transferred to industry through the CAPS entreprise start-up (cf. ).
Research activities by the CAPS team range from highly focused studies on specific processor architecture components to software environments for performance tuning on embedded systems. In this context, the compiler/architecture interaction is at the heart of the team's research.
In this section, we briefly present the remaining challenges in uniprocessor architecture, the new challenges and opportunities for architects created by single-chip hardware thread parallelism, and the challenges for compilers on embedded processors.
Advances in integration technology permit the design of wide-issue superscalar processors with a high clock frequency. The gap between the main memory access time and the clock cycle is an ever increasing bottleneck. Moreover, the product (pipeline depth) × (issue width) is also increasing. Effective performance is therefore more and more dependent on the management of control and data dependencies. As it stands, the two main challenges for uniprocessor architects to achieve ultimate performance remain (1) hiding the memory hierarchy latency and (2) hiding (breaking) control and data dependencies. However, computer architects must also address the new challenges associated with power consumption and design complexity. Finally, performance predictability will also become a major issue in the near future for both computer architects and software designers.
The gap between processor cycle time and main memory access time is increasing at a tremendous rate and is reaching up to 1000 instruction slots. At the same time, the instruction pipeline depth is also increasing (20 cycles on the Intel Pentium 4) and several instructions can be executed within a single cycle. A branch misprediction will soon lead to a penalty of about 100 instruction slots (e.g., a 20-stage pipeline on a 5-wide machine wastes roughly 20 × 5 = 100 slots per misprediction).
Over the past 10 years, research results have made it possible to limit the performance loss due to these two phenomena: the average effective performance of processors has remained in the range of one instruction per cycle, while these two gaps were increasing by an order of magnitude.
The use of a complex memory hierarchy has become generalized over the past decade. On modern microprocessors, both software and hardware prefetching are now widely used to ensure the on-time presence of data and instructions in the memory hierarchy. Highly efficient but complex hardware data prefetch mechanisms have been proposed to hide several hundreds of instruction slots.
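As an illustration, software prefetching inserts prefetch instructions a few iterations ahead of the actual use so that the memory latency overlaps with computation. A minimal sketch in C, using the GCC-specific __builtin_prefetch intrinsic (the prefetch distance PF_DIST is an illustrative tuning parameter, not a recommended value):

/* Software prefetching sketch: request a[i + PF_DIST] while summing a[i],
   so the memory latency overlaps with computation. __builtin_prefetch is
   a GCC intrinsic; PF_DIST is an illustrative tuning parameter. */
#define PF_DIST 16

double sum_array(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);  /* read, high locality */
        s += a[i];
    }
    return s;
}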
Over the past decade, efficient branch prediction mechanisms have been proposed and implemented.
The complexity of many components of the processor (in terms of silicon area, power consumption and response time) increases superlinearly (and often quadratically) with the issue width, e.g., register renaming, instruction scheduling, the bypass network and register file access. These components are becoming the bottlenecks that limit the issue width and the cycle time.
While the complexity of processors is steadily increasing, predicting, understanding and explaining the effective behavior of the architecture is becoming a major issue, in particular for embedded systems. Unfortunately, high performance is often synonymous with high unpredictability and variability in performance. Designing architectures with predictable and high performance will become a major challenge for computer architects as well as compiler designers in the next few years.
It has now become possible to implement hardware thread parallelism on a single chip. The advantage of single-chip hardware parallelism over a conventional parallel machine is that the performance cost of communicating between the different tasks is reduced. Multicores or CMPs (chip multiprocessors) as well as SMT (Simultaneous Multithreading) processors are emerging on the server and workstation market. On a multicore, tasks execute on distinct processing units; resource sharing concerns only one or several on-chip cache levels, and the chip pins. This is to be contrasted with SMT processors, on which resource sharing concerns most resources. Key issues concerning SMT / multicore processors are the performance on sequential applications and the design complexity. These will determine the extent to which they can be used as universal computing components.
It becomes more and more difficult to exploit higher degrees of instruction-level parallelism on superscalar processors. Thus it has been proposed to exploit task parallelism. Two different approaches exist, namely the multicore approach and the simultaneous multithreading (SMT) approach. Task parallelism is actually a simple way to increase the execution throughput in certain contexts: embedded applications, servers, multi-programmed systems, scientific computing, etc.
The straightforward way to implement task parallelism is to have multiple distinct processors. Current technology can put several hundred million transistors on a single die. This makes it possible to integrate several high-performance computing cores on the same chip, and presents several advantages, not the least of which are a reduced communication latency between cores and a potentially higher communication bandwidth.
Multicore processors are already available for some embedded applications, and IBM introduced the dual-core POWER4 last year for workstations and servers.
On a multicore, the tasks execute on distinct processing units; resource sharing concerns only one or several on-chip cache levels, and the chip pins. This is to be contrasted with SMT processors, on which all resources are shared apart from a few buffers. A key issue concerning SMT / multicore processors is whether they can improve sequential execution.
Among possible improvements, one may seek to obtain a more reliable execution.
Embedded processors pose many new challenges to the hardware and compiler research community. Code optimization does not just mean performance optimization: it may mean the best performance/code size tradeoff, guaranteeing real-time constraints, or getting the best power consumption/performance tradeoff. New software environments, with new demands, are needed; for instance, performance/power consumption/code size tuning and debugging tools. Since a wide spectrum of hardware platforms has to be explored, retargetable compiler infrastructures and retargetable ISA simulators are also needed.
Embedded processors range from very small, very low-power systems (for instance telemetry counter sensors, which must run on one battery for 10 years) to power-hungry high-end processors used in radars or set-top boxes. The spectrum of software ranges from very small code kernels (a few thousand instructions) to millions of lines of code including a real-time operating system. The constraints on code quality vary from ``just no bugs'' to safety-critical with hard real-time requirements, but may also be a fixed performance level at the smallest possible cost or the smallest power consumption.
Therefore embedded processors present many new challenges.
Code optimization for embedded processors does not directly fit the traditional ``best speed effort at any price'' assumption used for supercomputers and workstations. First, the ``common case'' paradigm based on a set of representative benchmarks (e.g. SPEC2000), used for general-purpose processor systems, is not relevant to the design of compiler optimizations for an embedded processor: one must concentrate on the few optimizations that will bring performance on the few relevant target applications. Second, execution time is not always the only and ultimate criterion: in many cases, execution time may be less important than memory size or power consumption. Third, binary compatibility, while often important, does not overcome the matching between system cost, application functionalities and time-to-market constraints.
Many challenges have to be addressed at the compiler/optimizer level. These include compiling under constraints and mastering the optimization interactions.
Finding a tradeoff between binary code size and execution time is a first such constraint.
In the context of real-time systems, average performance is often not a critical issue, but the worst case execution time (WCET) may be critical. WCET estimations can be obtained either by measurement or by static program analysis. However, these techniques are challenged by new hardware whose behavior is fundamentally difficult to predict.
Power consumption is becoming a major issue on most processors. For a given processor, power consumption is highly related to performance: in most cases, a compiler optimization reducing execution time also reduces power consumption.
While many optimizations and code transformations have been proposed over the past two decades, the interactions between these optimizations are not really understood. The many optimizations used in modern compilers sometimes annihilate each other.
Time-to-market is a major challenge for embedded processor designers. The wide spectrum of possible derived hardware platforms (configurations, co-processors, etc.) is also a major issue for embedded system designers. Defining or dimensioning an embedded system (hardware, compiler and application) requires exploring a large solution space for the best cost/performance/application match. Retargetable compiler infrastructures are needed for such an exploration.
The CAPS team is working on foundation technologies for computer science: processor architecture and performance-oriented compilation. The research results have an impact on any application domain that requires high-performance execution (telecommunication, multimedia, biology, health, engineering, environment, ...). Our research activity implies the development of software prototypes (cf. , ).
The CAPS team is developing several software prototypes for research purposes: compilers, architectural simulators, programming environments, etc.
Among the many prototypes developed in the project, we present here ABSCISS and HAVEGE, two software packages developed by the team. ABSCISS is currently being transferred to the CAPS entreprise start-up and HAVEGE is freely distributed for non-commercial use.
Absciss (Assembly-Based System for Compiled Instruction-Set Simulation)
is a retargetable system for high-speed instruction-set simulation (cf. ).
This tool automatically generates compiled simulators from an assembly program
and a description of the target processor instruction set architecture.
As of today, it targets various statically scheduled RISC and VLIW
processors. For this kind of architecture, the simulators generated by
Absciss are cycle-accurate. Caches can
also be simulated by interfacing to an external module. Other architectures
can be simulated at a functional level, that is, only the behavior of the program
will be simulated.
Absciss is optimized both for high speed simulation and fast simulator
generation (also providing flexibility through an API for extension ``plug-ins'' written in C++ or Python).
Status: Registered with APP under number IDDN.FR.001.190016.000.S.P.2002.000.10600. Absciss is industrialized by CAPS entreprise (cf. ).
Contact:
André Seznec
Status: Registered with APP under number IDDN.FR.001.500017.001.S.P.2001.000.10000. Available for tests and use in non-commercial software.
An unpredictable random number generator is a practical approximation of a truly random number generator. Such unpredictable random number generators are needed for cryptography.
Modern superscalar processors feature a large number of hardware mechanisms that target performance improvements: caches, branch predictors, TLBs, long pipelines, instruction-level parallelism, etc. The state of these components is not architectural (i.e., the result of an ordinary application does not depend on it); it is also volatile and cannot be directly monitored by the user. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) is a user-level software unpredictable random number generator for general-purpose computers that exploits these modifications of the internal volatile hardware states as a source of uncertainty. HAVEGE combines on-the-fly hardware volatile entropy gathering with pseudo-random number generation.
The internal state of HAVEGE includes thousands of internal volatile hardware states and is essentially impossible to monitor. HAVEGE reaches an unprecedented throughput for a software unpredictable random number generator: several hundred megabits per second on current workstations and PCs.
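The following C sketch illustrates only the general entropy-gathering idea, not the actual HAVEGE algorithm: the duration of a short, data-dependent code sequence varies with the volatile microarchitectural state (caches, TLBs, branch predictors, interrupts), and the low-order bits of a fine-grained timer are folded into an internal pool. All sizes and mixing operations below are invented for illustration:

/* Entropy-gathering sketch (illustrative, NOT the real HAVEGE algorithm). */
#include <stdint.h>
#include <time.h>

#define POOL_SIZE 256
#define TBL_SIZE  4096

static uint32_t pool[POOL_SIZE];     /* entropy pool, sizes are illustrative */
static uint8_t  table[TBL_SIZE];     /* walked to perturb cache/TLB state */

static uint64_t read_timer(void)     /* stands for a cycle-accurate counter */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void gather(int rounds)
{
    uint32_t x = 0;
    for (int r = 0; r < rounds; r++)
        for (int i = 0; i < POOL_SIZE; i++) {
            uint64_t t0 = read_timer();
            /* data-dependent walk: its duration depends on volatile
               hardware state (caches, branch predictors, interrupts) */
            x = x * 31u + table[(x ^ (uint32_t)t0) & (TBL_SIZE - 1)];
            uint64_t t1 = read_timer();
            /* fold the hard-to-predict low-order timing bits into the pool */
            pool[i] ^= (uint32_t)(t1 - t0) ^ (x << 7) ^ (x >> 5);
        }
}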
The throughput of HAVEGE favorably competes with usual pseudo-random
number generators such as rand() or random(). While HAVEGE was initially designed for cryptology-like applications, this high throughput makes HAVEGE usable for all application domains demanding high performance and high quality random number generators, e.g. Monte Carlo simulations.
Last but not least, more and more modern appliances such as PDAs or cell phones are built around low-power superscalar processors (e.g. StrongARM, Intel Xscale) and feature complex operating systems. HAVEGE can also be implemented on these platforms. A HAVEGE demonstrator for such a PDA, featuring the PocketPC2002 OS and an Xscale processor, is available.
Visit the HAVEGE web site for more information.
Our research in computer architecture covers memory hierarchy, branch
prediction, superscalar implementation, as well as SMT and
multicore issues.
In the recent past, we have proposed several new complexity-effective cache and branch predictor structures.
We have pursued our past studies on skewed-associativity.
Some architecture definitions (e.g. Alpha) allow the use of multiple virtual page sizes, even for a single process. Unfortunately, on current set-associative TLBs (Translation Lookaside Buffers), pages with different sizes cannot coexist. Thus, processors supporting multiple page sizes implement fully-associative TLBs.
We have shown that the skewed-associative TLB can accommodate the concurrent use of multiple page sizes within a single process.
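The key mechanism is that each way of a skewed-associative structure is indexed with a different hash of the address, so that two entries conflicting in one way are unlikely to conflict in the others. A minimal C sketch of XOR-based skewing functions for a 2-way structure (the particular hash functions are illustrative, not the published ones); for the multiple-page-size TLB, the indexing functions would additionally depend on the page size:

#include <stdint.h>

#define SET_BITS 8                         /* 256 sets per way (illustrative) */
#define SET_MASK ((1u << SET_BITS) - 1)

/* Each way gets its own XOR-based hash of the block address, so two
   addresses that collide in way 0 rarely collide in way 1 as well. */
static uint32_t index_way0(uint64_t addr)
{
    uint32_t a = (uint32_t)(addr >> 6);    /* strip the block offset */
    return (a ^ (a >> SET_BITS)) & SET_MASK;
}

static uint32_t index_way1(uint64_t addr)
{
    uint32_t a = (uint32_t)(addr >> 6);
    return (a ^ (a >> (2 * SET_BITS)) ^ (a << 1)) & SET_MASK;
}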
On an N-way issue superscalar processor, the front-end instruction fetch engine must deliver instructions to the execution core at a sustained rate higher than N instructions per cycle. The instruction address generator/predictor (IAG) has to predict the instruction flow at an even higher rate.
Very complex IAGs featuring different predictors for jumps, returns, conditional and unconditional branches, together with complex logic, are used. Usually, the IAG uses information (branch histories, fetch addresses, ...) available in a given cycle to predict the next fetch address(es). Unfortunately, a complex IAG cannot deliver a prediction within a short cycle. Therefore, processors rely on a hierarchy of IAGs with increasing accuracies but also increasing latencies: the accurate but slow IAG is used to correct the fast but less accurate IAG.
As an alternative to the use of a hierarchy of IAGs, it is possible to initiate the instruction address generation several cycles ahead of its use.
In the past few years, the CAPS team has been studying branch predictors built around the skewed predictor model, such as the 2bcgskew predictor; simulators of these predictors are distributed. We have also initiated research on the promising concept of perceptron predictors.
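For concreteness, a perceptron predictor associates with each branch a vector of small signed weights over the global branch history; the prediction is the sign of the dot product, and the weights are trained on a misprediction or when the sum is below a threshold. A minimal C sketch (table size, history length and training threshold are illustrative):

#include <stdint.h>
#include <stdlib.h>

#define HIST  16          /* global history length */
#define ROWS  1024        /* number of perceptrons */
#define THETA 30          /* training threshold */

static int8_t w[ROWS][HIST + 1];   /* weights, w[.][0] is the bias */
static int    hist[HIST];          /* global history: +1 taken, -1 not taken */

static int8_t sat(int v)           /* keep weights in a small signed range */
{
    return (int8_t)(v > 63 ? 63 : (v < -64 ? -64 : v));
}

int predict(uint64_t pc, int *sum_out)
{
    const int8_t *p = w[pc % ROWS];
    int sum = p[0];                 /* bias weight */
    for (int i = 0; i < HIST; i++)
        sum += p[i + 1] * hist[i];
    *sum_out = sum;
    return sum >= 0;                /* nonnegative sum: predict taken */
}

void train(uint64_t pc, int taken, int sum)
{
    int8_t *p = w[pc % ROWS];
    int t = taken ? 1 : -1;
    if ((sum >= 0) != taken || abs(sum) <= THETA) {
        p[0] = sat(p[0] + t);
        for (int i = 0; i < HIST; i++)
            p[i + 1] = sat(p[i + 1] + t * hist[i]);
    }
    for (int i = HIST - 1; i > 0; i--)   /* shift in the outcome */
        hist[i] = hist[i - 1];
    hist[0] = t;
}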
With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. In wide issue superscalar processors, register file read time, wake up and selection logic traversal delay, bypass network transit delay, and their respective power consumption constitute such major difficulties.
The general-purpose ISAs currently in use feature a single
logical register file. This central view has also been adopted for the
hardware implementation of dynamically scheduled superscalar processors.
Until now, the following unwritten rule has always been applied:
every general-purpose physical register can be the source or
the result of any instruction executed on any integer functional unit.
By applying write specialization and read specialization, the number of access ports needed on each individual physical register is reduced, and the overall complexity of the physical register file, the bypass network and the wake-up logic is decreased. This proposed hybrid approach is referred to as WSRS (Write Specialization Read Specialization).
We are currently exploring instruction allocation policies on functional unit clusters. We also explore the benefits of integrating the simultaneous multithreading paradigm into our WSRS architecture.
As increases in the issue width of superscalar processors bring diminishing returns, thread parallelism within a single chip is becoming a reality. In the past few years, both SMT (Simultaneous MultiThreading) and chip multiprocessor approaches have been proposed.
It has now become possible to put several hundred million transistors on a single chip. This is large enough to put several high-performance computing cores on the same chip.
Multicores or CMPs (Chip MultiProcessors) should provide large performance gains on multi-programmed workloads and parallel applications. However, CMPs are not intended to speed up sequential applications, and the first CMPs will be confined to the server market and certain embedded applications. We are searching for ways to speed up sequential execution on a CMP, in order to make CMPs ``universal'' computing devices. Indeed, integration on a single chip decreases communication latency and permits a high communication bandwidth. This offers new possibilities.
Our first proposition is based on ``execution migration''.
The characteristics of the applications, the instruction set architecture and the compiler/architecture interaction have a significant impact on the effective performance that can be achieved by an application and/or on the complexity of the hardware needed to achieve a predetermined performance level. We are exploring the suitability of dynamic execution for embedded applications (cf. ).
In the light of current microarchitecture knowledge, we are trying to define the characteristics that a general-purpose ISA should feature to allow efficient and cost-effective implementation (cf. ). Research in architecture and code optimization requires the ability to experiment with and evaluate new architectural ideas.
We are defining simulation frameworks to validate both embedded statically
scheduled processor
architectures (ABSCISS, cf. ) and out-of-order execution general-purpose
architectures (IAOO, cf. ).
Most scientific
applications make use of vector-like
computations, while the behavior of the memory hierarchies on such
computations is very implementation dependent. We are developing an original approach to code generation for vector-like kernels on IA64 platforms (cf. ).
Most ISAs feature multimedia instructions. These instructions are
generally not well handled by compilers. We have designed a C source
code optimizer that automatically makes use of these instructions.
The need for performance in embedded applications will lead to the use of dynamic execution on embedded processors in the next few years. However, complete out-of-order superscalar cores remain expensive in terms of silicon area and power dissipation. We have studied the suitability of a more limited form of dynamic execution, namely decoupled architectures, for embedded applications.
A decoupled architecture is known to work very efficiently as long as the execution does not suffer from inter-processor dependencies causing some loss of decoupling, called LOD events. We have studied the regularity of codes in terms of the LOD events that may occur, addressing three aspects of regularity: control regularity, control/memory dependencies, and memory data reference patterns. Most of the kernels in MiBench, a set of benchmarks for embedded systems, turn out to be amenable to efficient execution on a decoupled architecture.
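To illustrate decoupling, the access and execute streams of a regular kernel can be separated, with loaded values flowing through a queue; a LOD event corresponds to the access stream having to wait for a value produced by the execute stream (e.g., a data-dependent address). A schematic C rendition, where the explicit buffer stands for the hardware FIFO between the two processors:

#define QSIZE 64

static double q[QSIZE];             /* stands for the hardware FIFO */

void kernel(const double *a, double *y, double s, int n)
{
    for (int base = 0; base < n; base += QSIZE) {
        int chunk = (n - base < QSIZE) ? n - base : QSIZE;
        /* access stream: address generation and loads, running ahead */
        for (int i = 0; i < chunk; i++)
            q[i] = a[base + i];
        /* execute stream: pure computation on queued operands */
        for (int i = 0; i < chunk; i++)
            y[base + i] = s * q[i];
        /* an access like a[b[i]] with b[i] computed by the execute stream
           would stall the access stream: a loss-of-decoupling (LOD) event */
    }
}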
Considering the advances achieved in microarchitecture and integration technology, a number of criteria suggest that current general-purpose ISAs (Instruction Set Architectures) are no longer well suited to modern microprocessors. For instance, on RISC ISAs, instruction source operands are read from the register file and results are written into it. The pressure exerted on this structure implies that the register file must provide enough access ports so that performance is not impaired. Therefore, constraining instructions to use a small fixed number of read and write operands may lead to a design simplification of the register file. Furthermore, in pipelined architectures, some operations need less than one processor cycle to execute; combining such operations into single instructions therefore appears worthwhile.
Instruction-set simulation can be used to evaluate different instruction-set architectures (ISAs) in the context of architecture exploration, to validate a compiler back-end, or to test, tune and debug programs on a user-friendly PC or workstation rather than on actual silicon, which might not even exist yet.
The increasing size and complexity of embedded software require extremely fast instruction-set simulation. Compiled instruction-set simulation is an approach that is potentially much faster than interpretation, but it has a startup cost due to the generation and compilation of the simulator. This startup cost is often seen as a major drawback and has limited the adoption of compiled instruction set simulation.
We have designed a new approach to compiled instruction-set simulation that aims at reconciling flexibility, retargetability, high simulation speed, and small startup costs. This approach was implemented in Absciss.
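The difference between interpreted and compiled simulation can be sketched as follows; the toy target ISA, opcodes and register file below are invented for illustration and do not reflect the Absciss internals:

/* Interpreted vs compiled instruction-set simulation (toy target ISA).
   Opcodes and registers are invented for illustration. */
#include <stdint.h>

enum { OP_ADD, OP_SUB };
typedef struct { int op, rd, rs1, rs2; } insn_t;

int32_t reg[32];

/* Interpretation: decode every instruction at simulation time. */
void interpret(const insn_t *prog, int n)
{
    for (int pc = 0; pc < n; pc++) {
        const insn_t *i = &prog[pc];
        switch (i->op) {
        case OP_ADD: reg[i->rd] = reg[i->rs1] + reg[i->rs2]; break;
        case OP_SUB: reg[i->rd] = reg[i->rs1] - reg[i->rs2]; break;
        }
    }
}

/* Compiled simulation: the simulator generator emits this C function
   once, at simulator-generation time, for the same two-instruction
   program; decoding has been done statically and the host compiler
   can optimize across instructions. */
void compiled_block_0(void)
{
    reg[3] = reg[1] + reg[2];   /* add r3, r1, r2 */
    reg[4] = reg[3] - reg[1];   /* sub r4, r3, r1 */
}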
The ISA (Instruction Set Architecture) has an important impact on the effective implementation of processors. Recently HP and Intel have introduced a new 64-bit ISA for general-purpose systems, called the IA64 or IPF.
Unlike previous-generation general-purpose ISAs, the IA64 makes extensive use of predication and provides support for speculative execution (advance loads, for instance). This new instruction set is quite resistant to out-of-order implementation, because of the required resource sizes as well as the complexity of executing predicated instructions. Although the research community has started to study the execution of predicated instructions within an out-of-order core, no definitive solution has been adopted yet.
We are developing a software simulation platform called IAOO, designed to provide the research community with a framework for investigating the out-of-order execution of IA64-like instruction sets. The framework is built around a set of libraries and application programs that allow detailed analysis of the structure of binary programs as well as emulation or simulation of full binary executables.
Additionally, we have begun to investigate a novel register management policy designed to operate smoothly with a fully predicated ISA. This new system is based on an intermediate representation called the Translation Register Buffer (TRB). The TRB mechanism, which translates a logical register into a physical register, is shown to be effective when an instruction is canceled by a predicate.
In the framework of the EPICEA project (cf. ), CAPS collaborates with PRiSM (University of Versailles). The PRiSM research group is in charge of benchmarking and analyzing the specific behavior of each implementation of the IA64 architecture (Itanium 1, Itanium 2, Madison, ...). A significant effort is devoted to the analysis of the memory subsystem, which exhibits many pathological behaviors. Using this knowledge of the IA64 implementations, the CAPS team develops specific techniques that exploit implementation features to optimize codes. The main achievement of this collaboration is the development, by the CAPS research group, of a highly optimized code generator for computation-intensive loops. Such portions of code (called computation-intensive kernels) are very common in scientific applications and require very specific optimizations; they usually represent vector computations. The generator can exploit all the IA64 architecture features, such as software pipelining (implemented in hardware with the rotating register mechanism) and prefetch instructions. We have shown that the current implementations of the IA64 architecture can essentially be used like a vector machine on vector-like codes.
A methodology for optimizing software was developed around this generator. This method includes the detection of the portions of code to optimize, the generation of the corresponding optimized kernels and their integration into a library, the modification of the source code to make use of the generated kernels, and the linking of the modified source code with the library containing the generated kernels.
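The kernels targeted by the generator are typically simple, regular vector loops whose performance is bounded by the memory hierarchy; a representative example (plain C source, before kernel generation):

/* A typical computation-intensive kernel (DAXPY): a regular vector loop
   whose performance on IA64 is bounded by the memory hierarchy. The
   generator would software-pipeline it using rotating registers and
   insert prefetch instructions. */
void daxpy(int n, double alpha, const double *restrict x,
           double *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}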
Managing the power consumption/performance tradeoff on high-performance embedded processors has become a major challenge. The cache hierarchy is a typical example of such a power/performance design point. On the one hand, a cache makes it possible to maintain an important fraction of the embedded code and data workload on-chip, thus reducing the amount of memory traffic and thereby improving both performance and power consumption. On the other hand, on some embedded processors, the cache memory accounts for a large share of the total power consumption. The processor design community reacted to this problem by making the cache subsystem reconfigurable.
It is well known that programs usually execute as a series of phases.
Many of today's processors provide extensions to their instruction set specifically designed for computation-intensive multimedia applications (PowerPC AltiVec, TriMedia, Pentium MMX, ...). These extensions are based on SIMD instructions that operate on data subwords packed into registers. They are usually provided as intrinsics that can be inserted in C code, but they can also be inserted in assembly code.
Direct insertion in assembly code requires good knowledge of both the processor architecture and compilation techniques; moreover, such an approach does not lead to portable code. Using intrinsics in C source code still requires code transformations (such as vectorization) to highlight code regions where data parallelism can be exploited.
In the context of the Medea+ MESA project, CAPS has developed a C-to-C retargetable preprocessor prototype called SWARP.
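As an example of the kind of C-to-C transformation involved, a scalar loop can be rewritten with SIMD intrinsics; the SSE flavour is used below for concreteness (SWARP itself is retargetable), and alignment and remainder handling are elided:

/* Scalar loop and its vectorized SSE equivalent, illustrating the kind
   of output a C-to-C SIMD preprocessor produces. Assumes n is a
   multiple of 4 and the arrays are 16-byte aligned. */
#include <xmmintrin.h>

void add_scalar(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void add_simd(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {         /* four floats per instruction */
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }
}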
The PACCMAN project (PlAteforme de Composants Cryptographiques pour Multi-Applications Nationales, cf. ) aims at designing and producing a ``French'' cryptographic platform. This platform will be composed of a cryptography ASIP (Application Specific Integrated Processor) and a complete compiler toolchain (including debugger and profiler).
The goal of the PACCMAN platform is to improve the technological independence and competitiveness of the French cryptography industry. The two main industrial application fields of this project are the implementation of secure PMR (Professional Mobile Radio) communications and high-bandwidth VPNs (Virtual Private Networks).
The CAPS research group is in charge of Inria's contribution to PACCMAN. This contribution is four-fold:
CAPS collaborates on the specification of the instruction set architecture (ISA) of the PACCMAN processor.
CAPS designs, implements and validates the compiler toolchain
for the PACCMAN processor. This compiler toolchain is made of:
an ANSI-C compiler including both state-of-the-art and dedicated optimizing techniques,
a software debugger,
a software profiler.
CAPS designs, implements and validates a cycle-accurate
simulator of the PACCMAN processor for:
functional validation of the compiler toolchain itself,
fine tuning dedicated optimizations implemented in the compiler,
pre-silicon validations of software applications.
CAPS produces an ISA reference manual and technical documentation for the compiler toolchain and the cycle-accurate simulator.
The PACCMAN compiler toolchain and the PACCMAN simulator integrate several technologies developed by CAPS, including a retargetable system for assembly language transformation and optimization (SALTO).
The industrialization and maintenance of the PACCMAN development suite will be transferred to CAPS entreprise (cf. ).
Some performance issues must be handled at a higher level than the direct interface between the hardware and the instruction set. For heterogeneous SoCs featuring special-purpose hardware and one or more execution cores, we are exploring thread extraction for the different hardware components. Code size is often an issue on embedded systems. We are exploring tradeoffs based on code compression and interpretation (cf. ). Compiler optimizations for embedded systems necessitate software platforms which can support a large family of low-end (assembly-level) optimizations and can be easily retargeted to new ISAs as well as new microarchitectures (ALISE, cf. ). For the end user, understanding the effective performance of a code on a hardware platform is a major issue, since one wants to identify the faulty component (hardware, compiler or application implementation). We are developing a performance debugging tool for this purpose (ATLLAS, cf. ).
Systems on Chip (SoCs) are highly integrated architectures combining, on the same chip, at least a programmable processor, some memory, and other computing units. They are well suited to embedded applications with intensive computations.
Synthesizing for an application implies choosing the hardware components fitting the application and then mapping the application onto the chosen architecture. Some code sections may require a dedicated component for specific types of computations while others can be parallelized.
To achieve thread extraction, our approach focuses on two criteria: computational density and potential parallelism granularity. Compute-intensive areas are determined through profiling. Dependencies between parts of the application are first statically computed from the call graph. We then refine this analysis through an instrumented execution, focusing particularly on accesses to shared data.
Designing an embedded system is essentially a tradeoff between hardware and software. Developers must achieve a fast design while taking into account various constraints such as memory space, power consumption and application speed. Memory space is a strong constraint as it directly impacts the cost and the functionalities of the system.
One would like to minimize the amount of memory space allocated to programs to allow more applications to fit in the device or to reduce the number of memory chips. One may also want to optimize the amount of memory for an embedded system design, i.e., to find the exact quantity needed to allow good performance.
Two major techniques impact code size. On the one hand, optimizations improve performance (especially on architectures featuring instruction-level parallelism) while increasing code size. On the other hand, code compression reduces code size while degrading performance.
Finding a global tradeoff between code size and performance consists in allowing large code size on critical sections, where it provides important performance returns, while saving code space on seldom-executed code sections. This tradeoff concept is crucial in the compilation of embedded applications and is dependent on the target system and its applications. Enabling both optimizations and compression on a single code may make it possible to reach a near-optimal code size versus performance tradeoff.
We have already investigated strategies for optimization under code size constraints. We have defined a compiler-driven software compression scheme for this infrastructure. Our scheme achieves a high compression ratio. Our recent work deals with finding a tradeoff strategy using this compression scheme.
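As a generic illustration of compiler-driven code compression (this is not the specific scheme developed by the team), frequently repeated instruction sequences can be replaced by an escape byte plus an index into a dictionary built offline by the compiler:

#include <stdint.h>
#include <string.h>

#define ESCAPE    0xFF   /* marks a dictionary reference */
#define SEQ_LEN   4      /* dictionary entry length, in bytes */
#define DICT_SIZE 255

/* Dictionary of frequent instruction byte sequences, built offline by
   a compiler pass over the seldom-executed regions of the program. */
static uint8_t dict[DICT_SIZE][SEQ_LEN];
static int dict_entries;

static int dict_lookup(const uint8_t *code)
{
    for (int d = 0; d < dict_entries; d++)
        if (memcmp(code, dict[d], SEQ_LEN) == 0)
            return d;
    return -1;
}

/* Returns the compressed size. A real scheme must also escape literal
   0xFF bytes; this is omitted to keep the sketch short. */
int compress(const uint8_t *code, int n, uint8_t *out)
{
    int i = 0, o = 0;
    while (i < n) {
        int d = (i + SEQ_LEN <= n) ? dict_lookup(&code[i]) : -1;
        if (d >= 0) {                 /* known sequence: 2 bytes instead of 4 */
            out[o++] = ESCAPE;
            out[o++] = (uint8_t)d;
            i += SEQ_LEN;
        } else {
            out[o++] = code[i++];     /* literal byte */
        }
    }
    return o;
}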
Optimization infrastructures are a major asset to produce efficient code for embedded systems. In particular, optimization infrastructures are needed for testing and validating alternative code optimizations.
alisé is a new retargetable infrastructure for low-level code optimizations. alisé relies on a target machine description that provides structural information (instruction timing, resource usage, etc.) and instruction semantic information. alisé provides many functionalities, such as code region cloning, that allow alternative code optimizations.
alisé is especially adapted to complex DSP processors featuring non-orthogonal data paths and multiple register files. Using alisé, we have defined a new combined scheduling/code generation technique: thanks to semantic equivalence, code sequences can be replaced by semantically equivalent ones yielding shorter schedules. This has been experimented on the MMDSP architecture from STMicroelectronics.
This work is supported by STMicroelectronics (Central R&D - Crolles).
Multimedia processors, and mainly VLIW processors, heavily rely on compilers for the production of high-performance code. Our objective, supported by Thomson R&D, is the design of a retargetable tuning tool called ATLLAS (Analysis and Tuning tool for Low Level Assembly and Source code). ATLLAS performs the static extraction of code quality information from the optimized assembly level (number and types of instructions, functional unit occupation, WCET, critical path) and reports it at the source code level. It also allows the user to cross-check C source code and optimized assembly code through a graphical and interactive interface.
Support for multiple target architectures is enabled through the use of an abstraction of the code checking coupled with a generic control flow analysis.
Retargetability is achieved through the use of two software components. First, the processor hardware description is handled through the SALTO infrastructure.
ATLLAS has been fully implemented and tested. We have validated the principles of iterative, interactive performance debugging with ATLLAS. We currently fully support the Philips TriMedia TM1000 and the Equator MAP1000, and to a lesser extent the Sparc and MIPS architectures.
ATLLAS also features an in-depth analysis of the C source code (satisfying the C ISO/IEC 9899 standard) and an analysis of the relative performance through worst-case execution time analysis.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) is a user-level software unpredictable random number generator for general-purpose computers that exploits the modifications of the internal volatile hardware states of a processor as a source of uncertainty.
An unpredictable random number generator is a practical approximation of a truly random number generator. Such unpredictable random number generators are needed for cryptography.
Modern superscalar processors feature a large number of hardware mechanisms which aim at improving performance: caches, branch predictors, TLBs, long pipelines, instruction-level parallelism, etc. The state of these components is not architectural (i.e., the result of an ordinary application does not depend on it); it is also volatile and cannot be directly monitored by the user. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) combines on-the-fly hardware volatile entropy gathering with pseudo-random number generation.
More and more modern appliances, such as PDAs and cell phones, are built around low-power superscalar processors (e.g. StrongARM, Intel Xscale) and feature complex operating systems. This year, a demonstrator of HAVEGE for such a PDA, featuring PocketPC2002 and an Xscale processor, has been developed and is now distributed.
Showing that HAVEGE-like software can be a source of unpredictable random numbers on most modern computing appliances is the objective of the UNIHAVEGE project (cf. ).
This research is done in cooperation with the Inria Rocquencourt CODES team (Nicolas Sendrier).
To meet both the huge computational capabilities of the 100 nm IC generation and the exploding market needs, efficient design methods for multi-processor embedded system architectures are needed. The MESA project aims at conceiving a flexible multi-processor design platform that offers tools for application domain analysis, reconfigurable IP block usage, communication protocol design and validation techniques. MESA involves industrial partners (Philips, Bull, EADS Telecom, STMicroelectronics and Alcatel), SMEs (ARM, CoWare, MetaSymbiose, PolySpace, CAPS entreprise) and public/academic institutions (EPUN/MCSE, IMEC, Inria, KU-Leuven, LIP6, TIMA).
The EPICEA (Explicit Parallelism Instruction Computer Compiler Environment and Architecture) project is a collaboration between the Inria team CAPS, the University of Versailles Saint-Quentin, CEA and Bull, funded by the Ministry of Industry. The overall goal is to develop a software environment for scientific applications on architectures using the IA64 ISA. The contribution from CAPS is described in and .
The PACCMAN project (PlAteforme de Composants Cryptographiques pour Multi-Applications Nationales) aims at designing and producing a ``French'' cryptographic platform. This platform will be composed of a cryptography ASIP (Application Specific Integrated Processor) and a complete compiler toolchain (including debugger and profiler). The goal of the PACCMAN platform is to improve the technological independence and competitiveness of the French cryptography industry. The two main industrial application fields of this project are the implementation of secure PMR (Professional Mobile Radio) communications and high-bandwidth VPNs (Virtual Private Networks).
The PACCMAN project is supported by the OPPIDUM research network and involves several industrial partners. These partners are: EADS-DSN (the world leader in digital PMR), MATRAnet (software for network security), Bull TrustWay (software and hardware components for network security), and Inria.
The research on instruction fetch front-ends (cf. ) and register file structures (cf. ) is supported by Intel through a research grant (Convention 4 01 C 0677 00 31308 06 1) and an equipment donation.
The doctoral studies of Laurent Morin are supported by a CIFRE convention with Thomson MMD (cf. ).
The doctoral studies of Laurent Bertaux are supported by a CIFRE convention with STMicroelectronics (cf. ).
The doctoral studies of Pascal Terjan are supported by a CIFRE convention with STMicroelectronics (cf. ).
The doctoral studies of Gilles Pokam are supported by a CIFRE convention with STMicroelectronics (cf. ).
In 2003, a large part of the research team was involved in setting up the start-up company CAPS entreprise (http://www.caps-entreprise.com/). CAPS entreprise aims at bringing innovative software tools, solutions and services to the market of high-performance embedded systems. The company aims at becoming a reliable partner for system builders, platform designers and developers seeking the best system performance, by helping them match their software to the specificities of the underlying hardware platform.
CAPS entreprise offers standalone tools that are specialized for a given task (code transformation, simulation, worst-case execution time analysis...). These tools can act as building blocks in a software tool chain and are designed for seamless integration into common development environments.
The company proposes global compilation solutions, tailored to the customer's needs. After a detailed exploration of the user requirements against the existing process, specific additions and enhancements to the previous code generation infrastructure are proposed and implemented.
CAPS entreprise finally offers custom consulting services, such as performance analyses or instruction-set evaluations. Through these services, customers benefit from the company's in-house expertise and tools, helping them make strategic decisions on complex technology issues.
The company was recognized in 2003 as an innovative company by the French Ministry of Industry. CAPS entreprise is incubated by Emergys and Inria-Transfert.
Industrial transfer of ABSCISS (cf. ) to CAPS entreprise is supported by Inria through the postdoctoral fellowship of Ronan Amicel.
Research on unpredictable random number generation is funded through the ACI Sécurité project UNIHAVEGE. The main partners are the CAPS team and the CODES team from Inria Rocquencourt.
CAPS members participate in the RTP CNRS Architecture et Compilation: André Seznec is a member of the steering committee. Pierre Michaud participates in the specific action ``Nouvelles technologies et nouveaux paradigmes d'architecture''. François Bodin participates in the specific action ``Compilation pour les systèmes embarqués''.
A. Seznec has been a member of the program committee of HiPC'2003 (High Performance Computing). He is a member of the program committees of the 31st International Symposium on Computer Architecture (ISCA'31) and of CARI'2004. He is the program co-chair of ICS 2004 (International Conference on Supercomputing).
F. Bodin is a member of the editorial board of TSI and has been a member of the program committee of MEMOCODE'2003.
F. Bodin and A. Seznec teach computer architecture and compilation at the DEA of computer science, and in the DIIC and DESS ISA curricula at IFSIC, University of Rennes I.
A. Seznec presented a seminar entitled ``HAVEGE: Hardware Volatile Entropy Gathering and Expansion, generating unpredictable random numbers at user-level'' at UPC, Barcelona, in April 2003.
A. Seznec presented a seminar entitled ``Ahead Pipelining the Instruction Address Generator'' at UPC, Barcelona, in April 2003 and at Intel, Hillsboro, Oregon, in June 2003.
Yanos Sazeides, University of Cyprus, has been visiting the project for a month in May 2003.
A. Seznec is an elected member of the evaluation committee of Inria.
F. Bodin is an elected member of the IFSIC Committee.
F. Bodin is responsible for the DEA of computer science at the University of Rennes I and chairman for doctoral studies at Irisa.
F. Bodin is vice-chairman of the Ecole doctorale Matisse.
F. Bodin, with V. Verdon and the help of Y. Sost, is at the origin of the creation of the M. Métivier foundation.
Jacques Lenfant is the chairman of the professor hiring committee of University of Rennes I.
Jacques Lenfant is a member of the ``Académie des sciences et des technologies''.