High-performance microprocessors are used in information technology applications ranging from supercomputers and high-end multiprocessor servers to PCs and workstations, as well as in high-end embedded applications (avionics and networks, but also consumer products such as automotive systems, set-top boxes or cell phones). The theoretical performance of these processors has been increasing continuously for the past two decades. This trend continues at the cost of rising hardware complexity (transistor count, power consumption, design cost). At the same time, extracting a significant part of this theoretical performance is becoming more and more difficult for the end user, even with the assistance of a compiler.
Research in the CAPS project-team ranges from processor architecture to software platforms for performance tuning, including compiler/architecture interactions, and processor simulation techniques. Our objective is to enable the end user to exploit a significant fraction of this theoretical performance while still masking the underlying hardware complexity.
Our research in computer architecture covers memory hierarchy, branch
prediction, superscalar implementation, as well as SMT and
multicore processors.
In the recent past, we have proposed several new complexity-effective structures for caches and branch predictors.
The Instruction Set Architecture (ISA) is where the software meets the hardware. On the one hand, the compiler must optimize the code to take advantage of the micro-architecture. On the other hand, the ISA must be designed in such a way that the compiler can ``understand'' performance issues.
We are exploring the suitability of dynamic execution for embedded applications (cf. ). In light of current microarchitecture knowledge, we are trying to define the characteristics that a general-purpose ISA should feature to allow efficient and cost-effective implementation (cf. ). Research in architecture and code optimization requires experimenting with and evaluating new architectural ideas. We are defining several simulation frameworks to validate both embedded statically scheduled processor architectures (ABSCISS, cf. ) and out-of-order execution for general-purpose architectures (IAOO, cf. ).
Most scientific
applications make use of vector-like
computations, while the behavior of the memory hierarchies on such
computations is implementation dependent. We are developing an original approach to code generation for vector-like kernels on IA64 platforms (cf. ).
Most ISAs feature multimedia instructions. These instructions are
generally not well handled by compilers. We have designed a C source
code optimizer that automatically makes use of these instructions.
Some performance issues must be handled at a higher level than the direct interface between the hardware and the instruction set. For heterogeneous SoCs (Systems on a Chip) featuring special-purpose hardware and one or more execution cores, we are exploring thread extraction for the different hardware components (cf. ). Code size is often an issue with embedded systems. We are exploring tradeoffs leveraging code compression and interpretation (cf. ).
Compiler optimizations for embedded systems require advanced software platforms that can support a large family of low-end (assembly-level) optimizations and that can be easily retargeted to new ISAs as well as new microarchitectures (ALISE, cf. ). For the end user, understanding the effective performance of an application on a hardware platform is a major issue, since it is necessary to identify the limiting component (hardware, compiler or application implementation). We are developing a performance debugging tool for this purpose (ATLLAS, cf. ).
Finally, we use our knowledge of modern microarchitecture to participate in the definition of an unpredictable random number generator (HAVEGE, cf. ).
Our research is partially supported by industry (Intel, STMicroelectronics, Thomson MMD). We also participate in several institutionally funded projects (OPPIDUM PACCMAN, MEDEA+ MESA, Ministry of Industry EPICEA, ACI Sécurité UNIHAVEGE). Among our main partners in these institutionally funded projects are Bull, EADS and STMicroelectronics.
Some of the research prototypes developed by the project during the past few years are currently being transferred to industry through the CAPS entreprise start-up (cf. ).
Research activities by the CAPS team range from highly focused studies on specific processor architecture components to software environments for performance tuning on embedded systems. In this context, the compiler/architecture interaction is at the heart of the team's research.
In this section, we briefly present the remaining challenges in uniprocessor architecture, the new challenges and opportunities for architects created by single-chip hardware thread parallelism, and the challenges for compilers on embedded processors.
Advances in integration technology permit the design of wide-issue superscalar processors with a high clock frequency. The gap between the main memory access time and the clock cycle is an ever increasing bottleneck. Moreover, the product (pipeline depth) × (issue width) is also increasing. Effective performance is therefore more and more dependent on the management of control and data dependencies. As it stands, the two main challenges for uniprocessor architects to achieve ultimate performance remain (1) hiding the memory hierarchy latency and (2) hiding (breaking) control and data dependencies. However, computer architects must also address the new challenges associated with power consumption and design complexity. Finally, performance predictability will also become a major issue in the near future for both computer architects and software designers.
The gap between processor cycle time and main memory access time is increasing at a tremendous rate and is reaching up to 1000 instruction slots. At the same time, the instruction pipeline depth is also increasing (20 cycles on the Intel Pentium 4) and several instructions can be executed within a single cycle. A branch misprediction will soon lead to a penalty of about 100 instruction slots (e.g., a 20-stage pipeline on a 5-wide machine wastes roughly 20 × 5 = 100 slots per misprediction).
Over the past 10 years, research results have made it possible to limit the performance loss due to these two phenomena: the average effective performance of processors has remained in the range of one instruction per cycle, while these two gaps were increasing by an order of magnitude.
The use of a complex memory hierarchy has become generalized over the past decade. On modern microprocessors, both software and hardware prefetching are now widely used to ensure the on-time presence of data and instructions in the memory hierarchy. Highly efficient but complex hardware data prefetch mechanisms have been proposed to hide several hundreds of instruction slots.
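As an illustration, software prefetching inserts prefetch instructions a few iterations ahead of the actual use so that the memory latency overlaps with computation. A minimal sketch in C, using the GCC-specific __builtin_prefetch intrinsic (the prefetch distance PF_DIST is an illustrative tuning parameter, not a recommended value):

/* Software prefetching sketch: request a[i + PF_DIST] while summing a[i],
   so the memory latency overlaps with computation. __builtin_prefetch is
   a GCC intrinsic; PF_DIST is an illustrative tuning parameter. */
#define PF_DIST 16

double sum_array(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);  /* read, high locality */
        s += a[i];
    }
    return s;
}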
Over the past decade, efficient branch prediction mechanisms have been proposed and implemented.
The complexity of many components of the processor (in terms of silicon area, power consumption and response time) increases superlinearly (and often quadratically) with the issue width, e.g., register renaming, instruction scheduling, the bypass network and register file access. These components are becoming the bottlenecks that limit the issue width and the cycle time.
While the complexity of processors is steadily increasing, predicting, understanding and explaining the effective behavior of the architecture is becoming a major issue, in particular for embedded systems. Unfortunately, high performance is often synonymous with high unpredictability and variability in performance. Designing architectures with predictable and high performance will become a major challenge for computer architects as well as compiler designers in the next few years.
It has now become possible to implement hardware thread parallelism on a single chip. The advantage of single-chip hardware parallelism over a conventional parallel machine is that the performance cost of communicating between the different tasks is reduced. Multicores or CMPs (chip multiprocessors) as well as SMT (Simultaneous Multithreading) processors are emerging on the server and workstation market. On a multicore, tasks execute on distinct processing units; resource sharing concerns only one or several on-chip cache levels, and the chip pins. This is to be contrasted with SMT processors, on which resource sharing concerns most resources. Key issues concerning SMT / multicore processors are the performance on sequential applications and the design complexity. These will determine the extent to which they can be used as universal computing components.
It becomes more and more difficult to exploit higher degrees of instruction-level parallelism on superscalar processors. Thus it has been proposed to exploit task parallelism. Two different approaches exist, namely the multicore approach and the simultaneous multithreading (SMT) approach. Task parallelism is actually a simple way to increase the execution throughput in certain contexts: embedded applications, servers, multi-programmed systems, scientific computing, etc.
The straightforward way to implement task parallelism is to have multiple distinct processors. Current technology can put several hundred million transistors on a single die. This makes it possible to integrate several high-performance computing cores on the same chip, and presents several advantages, not the least of which are a reduced communication latency between cores and a potentially higher communication bandwidth.
Multicore processors are already available for some embedded applications, and IBM introduced the dual-core POWER4 last year for workstations and servers.
On a multicore, the tasks execute on distinct processing units; resource sharing concerns only one or several on-chip cache levels, and the chip pins. This is to be contrasted with SMT processors, on which all resources are shared apart from a few buffers. A key issue concerning SMT / multicore processors is whether they can improve sequential execution.
Among possible improvements, one may seek to obtain a more reliable execution.
Embedded processors pose many new challenges to the hardware and compiler research community. Code optimization does not just mean performance optimization: it may mean the best performance/code size tradeoff, guaranteeing real-time constraints, or getting the best power consumption/performance tradeoff. New software environments, with new demands, are needed; for instance, performance/power consumption/code size tuning and debugging tools. Since a wide spectrum of hardware platforms has to be explored, retargetable compiler infrastructures and retargetable ISA simulators are also needed.
Embedded processors range from very small, very low-power systems (for instance telemetry counter sensors, which must run on one battery for 10 years) to power-hungry high-end processors used in radars or set-top boxes. The spectrum of software ranges from very small code kernels (a few thousand instructions) to millions of lines of code including a real-time operating system. The constraints on code quality vary from ``just no bugs'' to safety-critical with hard real-time requirements, but may also be a fixed performance level at the smallest possible cost or the smallest power consumption.
Therefore embedded processors present many new challenges.
Code optimization for embedded processors does not directly fit the traditional ``best speed effort at any price'' assumption used for supercomputers and workstations. First, the ``common case'' paradigm based on a set of representative benchmarks (e.g. SPEC2000), used for general-purpose processor systems, is not relevant to the design of compiler optimizations for an embedded processor: one must concentrate on the few optimizations that will bring performance on the few relevant target applications. Second, execution time is not always the only and ultimate criterion: in many cases, execution time may be less important than memory size or power consumption. Third, binary compatibility, while often important, does not overcome the matching between system cost, application functionalities and time-to-market constraints.
Many challenges have to be addressed at the compiler/optimizer level. These include compiling under constraints and mastering the optimization interactions.
Finding a tradeoff between binary code size and execution time is a first such constraint.
In the context of real-time systems, average performance is often not a critical issue, but the worst case execution time (WCET) may be critical. WCET estimations can be obtained either by measurement or by static program analysis. However, these techniques are challenged by new hardware whose behavior is fundamentally difficult to predict.
Power consumption is becoming a major issue on most processors. For a given processor, power consumption is highly related to performance: in most cases, a compiler optimization reducing execution time also reduces power consumption.
While many optimizations and code transformations have been proposed over the past two decades, the interactions between these optimizations are not really understood. The many optimizations used in modern compilers sometimes annihilate each other.
Time-to-market is a major challenge for embedded processor designers. The wide spectrum of possible derived hardware platforms (configurations, co-processors, etc.) is also a major issue for embedded system designers. Defining or dimensioning an embedded system (hardware, compiler and application) requires exploring a large solution space for the best cost/performance/application match. Retargetable compiler infrastructures are needed for such an exploration.
The CAPS team is working on foundation technologies for computer science: processor architecture and performance-oriented compilation. The research results have an impact on any application domain that requires high-performance execution (telecommunication, multimedia, biology, health, engineering, environment, ...). Our research activity implies the development of software prototypes (cf. , ).
The CAPS team is developing several software prototypes for research purposes: compilers, architectural simulators, programming environments, etc.
Among the many prototypes developed in the project, we present here ABSCISS and HAVEGE, two software packages developed by the team. ABSCISS is currently being transferred to the CAPS entreprise start-up and HAVEGE is freely distributed for non-commercial use.
Absciss (Assembly-Based System for Compiled Instruction-Set Simulation)
is a retargetable system for high-speed instruction-set simulation (cf. ).
This tool automatically generates compiled simulators from an assembly program
and a description of the target processor instruction set architecture.
As of today, it targets various statically scheduled RISC and VLIW
processors. For this kind of architecture, the simulators generated by
Absciss are cycle-accurate. Caches can
also be simulated by interfacing to an external module. Other architectures
can be simulated at a functional level, that is, only the behavior of the program
will be simulated.
Absciss is optimized both for high speed simulation and fast simulator
generation (also providing flexibility through an API for extension ``plug-ins'' written in C++ or Python).
Status: Registered with APP under number IDDN.FR.001.190016.000.S.P.2002.000.10600. Absciss is industrialized by CAPS entreprise (cf. ).
Contact:
André Seznec
Status: Registered with APP under number IDDN.FR.001.500017.001.S.P.2001.000.10000. Available for tests and use in non-commercial software.
An unpredictable random number generator is a practical approximation of a truly random number generator. Such unpredictable random number generators are needed for cryptography.
Modern superscalar processors feature a large number of hardware mechanisms that target performance improvements: caches, branch predictors, TLBs, long pipelines, instruction-level parallelism, etc. The state of these components is not architectural (i.e., the result of an ordinary application does not depend on it); it is also volatile and cannot be directly monitored by the user. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) is a user-level software unpredictable random number generator for general-purpose computers that exploits these modifications of the internal volatile hardware states as a source of uncertainty. HAVEGE combines on-the-fly hardware volatile entropy gathering with pseudo-random number generation.
The internal state of HAVEGE includes thousands of internal volatile hardware states and is essentially impossible to monitor. HAVEGE reaches an unprecedented throughput for a software unpredictable random number generator: several hundred megabits per second on current workstations and PCs.
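The following C sketch illustrates only the general entropy-gathering idea, not the actual HAVEGE algorithm: the duration of a short, data-dependent code sequence varies with the volatile microarchitectural state (caches, TLBs, branch predictors, interrupts), and the low-order bits of a fine-grained timer are folded into an internal pool. All sizes and mixing operations below are invented for illustration:

/* Entropy-gathering sketch (illustrative, NOT the real HAVEGE algorithm). */
#include <stdint.h>
#include <time.h>

#define POOL_SIZE 256
#define TBL_SIZE  4096

static uint32_t pool[POOL_SIZE];     /* entropy pool, sizes are illustrative */
static uint8_t  table[TBL_SIZE];     /* walked to perturb cache/TLB state */

static uint64_t read_timer(void)     /* stands for a cycle-accurate counter */
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void gather(int rounds)
{
    uint32_t x = 0;
    for (int r = 0; r < rounds; r++)
        for (int i = 0; i < POOL_SIZE; i++) {
            uint64_t t0 = read_timer();
            /* data-dependent walk: its duration depends on volatile
               hardware state (caches, branch predictors, interrupts) */
            x = x * 31u + table[(x ^ (uint32_t)t0) & (TBL_SIZE - 1)];
            uint64_t t1 = read_timer();
            /* fold the hard-to-predict low-order timing bits into the pool */
            pool[i] ^= (uint32_t)(t1 - t0) ^ (x << 7) ^ (x >> 5);
        }
}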
The throughput of HAVEGE favorably competes with usual pseudo-random
number generators such as rand() or random(). While HAVEGE was initially designed for cryptology-like applications, this high throughput makes HAVEGE usable for all application domains demanding high performance and high quality random number generators, e.g. Monte Carlo simulations.
Last but not least, more and more modern appliances such as PDAs or cell phones are built around low-power superscalar processors (e.g. StrongARM, Intel Xscale) and feature complex operating systems. HAVEGE can also be implemented on these platforms. A HAVEGE demonstrator for such a PDA, featuring the PocketPC2002 OS and an Xscale processor, is available.
Visit the HAVEGE web site for more information.
Our research in computer architecture covers memory hierarchy, branch
prediction, superscalar implementation, as well as SMT and
multicore issues.
In the recent past, we have proposed several new complexity-effective cache and branch predictor structures.
We have pursued our past studies on skewed-associativity.
Some architecture definitions (e.g. Alpha) allow the use of multiple virtual page sizes, even for a single process. Unfortunately, on current set-associative TLBs (Translation Lookaside Buffers), pages with different sizes cannot coexist. Thus, processors supporting multiple page sizes implement fully-associative TLBs.
We have shown that the skewed-associative TLB can accommodate the concurrent use of multiple page sizes within a single process.
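The key mechanism is that each way of a skewed-associative structure is indexed with a different hash of the address, so that two entries conflicting in one way are unlikely to conflict in the others. A minimal C sketch of XOR-based skewing functions for a 2-way structure (the particular hash functions are illustrative, not the published ones); for the multiple-page-size TLB, the indexing functions would additionally depend on the page size:

#include <stdint.h>

#define SET_BITS 8                         /* 256 sets per way (illustrative) */
#define SET_MASK ((1u << SET_BITS) - 1)

/* Each way gets its own XOR-based hash of the block address, so two
   addresses that collide in way 0 rarely collide in way 1 as well. */
static uint32_t index_way0(uint64_t addr)
{
    uint32_t a = (uint32_t)(addr >> 6);    /* strip the block offset */
    return (a ^ (a >> SET_BITS)) & SET_MASK;
}

static uint32_t index_way1(uint64_t addr)
{
    uint32_t a = (uint32_t)(addr >> 6);
    return (a ^ (a >> (2 * SET_BITS)) ^ (a << 1)) & SET_MASK;
}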
On an N-way issue superscalar processor, the front-end instruction fetch engine must deliver instructions to the execution core at a sustained rate higher than N instructions per cycle. The instruction address generator/predictor (IAG) has to predict the instruction flow at an even higher rate.
Very complex IAGs featuring different predictors for jumps, returns, conditional and unconditional branches, together with complex logic, are used. Usually, the IAG uses information (branch histories, fetch addresses, ...) available in a given cycle to predict the next fetch address(es). Unfortunately, a complex IAG cannot deliver a prediction within a short cycle. Therefore, processors rely on a hierarchy of IAGs with increasing accuracies but also increasing latencies: the accurate but slow IAG is used to correct the fast but less accurate IAG.
As an alternative to the use of a hierarchy of IAGs, it is possible to initiate the instruction address generation several cycles ahead of its use.
In the past few years, the CAPS team has been studying branch predictors built around the skewed predictor model, such as the 2bcgskew predictor; simulators of these predictors are distributed. We have also initiated research on the promising concept of perceptron predictors.
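For concreteness, a perceptron predictor associates with each branch a vector of small signed weights over the global branch history; the prediction is the sign of the dot product, and the weights are trained on a misprediction or when the sum is below a threshold. A minimal C sketch (table size, history length and training threshold are illustrative):

#include <stdint.h>
#include <stdlib.h>

#define HIST  16          /* global history length */
#define ROWS  1024        /* number of perceptrons */
#define THETA 30          /* training threshold */

static int8_t w[ROWS][HIST + 1];   /* weights, w[.][0] is the bias */
static int    hist[HIST];          /* global history: +1 taken, -1 not taken */

static int8_t sat(int v)           /* keep weights in a small signed range */
{
    return (int8_t)(v > 63 ? 63 : (v < -64 ? -64 : v));
}

int predict(uint64_t pc, int *sum_out)
{
    const int8_t *p = w[pc % ROWS];
    int sum = p[0];                 /* bias weight */
    for (int i = 0; i < HIST; i++)
        sum += p[i + 1] * hist[i];
    *sum_out = sum;
    return sum >= 0;                /* nonnegative sum: predict taken */
}

void train(uint64_t pc, int taken, int sum)
{
    int8_t *p = w[pc % ROWS];
    int t = taken ? 1 : -1;
    if ((sum >= 0) != taken || abs(sum) <= THETA) {
        p[0] = sat(p[0] + t);
        for (int i = 0; i < HIST; i++)
            p[i + 1] = sat(p[i + 1] + t * hist[i]);
    }
    for (int i = HIST - 1; i > 0; i--)   /* shift in the outcome */
        hist[i] = hist[i - 1];
    hist[0] = t;
}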
With the continuous shrinking of transistor size, processor designers are facing new difficulties to achieve high clock frequency. In wide issue superscalar processors, register file read time, wake up and selection logic traversal delay, bypass network transit delay, and their respective power consumption constitute such major difficulties.
The general-purpose ISAs currently in use feature a single
logical register file. This central view has also been adopted for the
hardware implementation of dynamically scheduled superscalar processors.
Until now, the following unwritten rule has always been applied:
every general-purpose physical register can be the source or
the result of any instruction executed on any integer functional unit.
By applying write specialization and read specialization, the number of access ports needed on each individual physical register is reduced, and the overall complexity of the physical register file, the bypass network and the wake-up logic is decreased. This proposed hybrid approach is referred to as WSRS (Write Specialization Read Specialization).
We are currently exploring instruction allocation policies on functional unit clusters. We also explore the benefits of integrating the simultaneous multithreading paradigm into our WSRS architecture.
As increases in the issue width of superscalar processors bring diminishing returns, thread parallelism within a single chip is becoming a reality. In the past few years, both SMT (Simultaneous MultiThreading) and chip multiprocessor approaches have been proposed.
It has now become possible to put several hundred million transistors on a single chip. This is large enough to put several high-performance computing cores on the same chip.
Multicores or CMPs (Chip MultiProcessors) should provide large performance gains on multi-programmed workloads and parallel applications. However, CMPs are not intended to speed up sequential applications, and the first CMPs will be confined to the server market and certain embedded applications. We are searching for ways to speed up sequential execution on a CMP, in order to make CMPs ``universal'' computing devices. Indeed, integration on a single chip decreases communication latency and permits a high communication bandwidth. This offers new possibilities.
Our first proposition is based on ``execution migration''.
The characteristics of the applications, the instruction set architecture and the compiler/architecture interaction have a significant impact on the effective performance that can be achieved by an application and/or on the complexity of the hardware needed to achieve a predetermined performance level. We are exploring the suitability of dynamic execution for embedded applications (cf. ).
In the light of current microarchitecture knowledge, we are trying to define the characteristics that a general-purpose ISA should feature to allow efficient and cost-effective implementation (cf. ). Research in architecture and code optimization requires the ability to experiment with and evaluate new architectural ideas.
We are defining simulation frameworks to validate both embedded statically
scheduled processor
architectures (ABSCISS, cf. ) and out-of-order execution general-purpose
architectures (IAOO, cf. ).
Most scientific
applications make use of vector-like
computations, while the behavior of the memory hierarchies on such
computations is very implementation dependent. We are developing an original approach to code generation for vector-like kernels on IA64 platforms (cf. ).
Most ISAs feature multimedia instructions. These instructions are
generally not well handled by compilers. We have designed a C source
code optimizer that automatically makes use of these instructions.
The need for performance in embedded applications will lead to the use of dynamic execution on embedded processors in the next few years. However, complete out-of-order superscalar cores remain expensive in terms of silicon area and power dissipation. We have studied the suitability of a more limited form of dynamic execution, namely decoupled architectures, for embedded applications.
A decoupled architecture is known to work very efficiently as long as the execution does not suffer from inter-processor dependencies causing some loss of decoupling, called LOD events. We have studied the regularity of codes in terms of the LOD events that may occur, addressing three aspects of regularity: control regularity, control/memory dependencies, and memory data reference patterns. Most of the kernels in MiBench, a set of benchmarks for embedded systems, turn out to be amenable to efficient execution on a decoupled architecture.
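To illustrate decoupling, the access and execute streams of a regular kernel can be separated, with loaded values flowing through a queue; a LOD event corresponds to the access stream having to wait for a value produced by the execute stream (e.g., a data-dependent address). A schematic C rendition, where the explicit buffer stands for the hardware FIFO between the two processors:

#define QSIZE 64

static double q[QSIZE];             /* stands for the hardware FIFO */

void kernel(const double *a, double *y, double s, int n)
{
    for (int base = 0; base < n; base += QSIZE) {
        int chunk = (n - base < QSIZE) ? n - base : QSIZE;
        /* access stream: address generation and loads, running ahead */
        for (int i = 0; i < chunk; i++)
            q[i] = a[base + i];
        /* execute stream: pure computation on queued operands */
        for (int i = 0; i < chunk; i++)
            y[base + i] = s * q[i];
        /* an access like a[b[i]] with b[i] computed by the execute stream
           would stall the access stream: a loss-of-decoupling (LOD) event */
    }
}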
Considering the advances achieved in microarchitecture and integration technology, a number of criteria suggest that current general-purpose ISAs (Instruction Set Architectures) are no longer well suited to modern microprocessors. For instance, on RISC ISAs, instruction source operands are read from the register file and results are written into it. The pressure exerted on this structure implies that the register file must provide enough access ports so that performance is not impaired. Therefore, constraining instructions to use a small fixed number of read and write operands may lead to a design simplification of the register file. Furthermore, in pipelined architectures, some operations need less than one processor cycle to execute; combining such operations into single instructions therefore appears worthwhile.
Instruction-set simulation can be used to evaluate different instruction-set architectures (ISAs) in the context of architecture exploration, to validate a compiler back-end, or to test, tune and debug programs on a user-friendly PC or workstation rather than on actual silicon, which might not even exist yet.
The increasing size and complexity of embedded software require extremely fast instruction-set simulation. Compiled instruction-set simulation is an approach that is potentially much faster than interpretation, but it has a startup cost due to the generation and compilation of the simulator. This startup cost is often seen as a major drawback and has limited the adoption of compiled instruction set simulation.
We have designed a new approach to compiled instruction-set simulation that aims at reconciling flexibility, retargetability, high simulation speed, and small startup costs. This approach was implemented in Absciss.
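The difference between interpreted and compiled simulation can be sketched as follows; the toy target ISA, opcodes and register file below are invented for illustration and do not reflect the Absciss internals:

/* Interpreted vs compiled instruction-set simulation (toy target ISA).
   Opcodes and registers are invented for illustration. */
#include <stdint.h>

enum { OP_ADD, OP_SUB };
typedef struct { int op, rd, rs1, rs2; } insn_t;

int32_t reg[32];

/* Interpretation: decode every instruction at simulation time. */
void interpret(const insn_t *prog, int n)
{
    for (int pc = 0; pc < n; pc++) {
        const insn_t *i = &prog[pc];
        switch (i->op) {
        case OP_ADD: reg[i->rd] = reg[i->rs1] + reg[i->rs2]; break;
        case OP_SUB: reg[i->rd] = reg[i->rs1] - reg[i->rs2]; break;
        }
    }
}

/* Compiled simulation: the simulator generator emits this C function
   once, at simulator-generation time, for the same two-instruction
   program; decoding has been done statically and the host compiler
   can optimize across instructions. */
void compiled_block_0(void)
{
    reg[3] = reg[1] + reg[2];   /* add r3, r1, r2 */
    reg[4] = reg[3] - reg[1];   /* sub r4, r3, r1 */
}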
The ISA (Instruction Set Architecture) has an important impact on the effective implementation of processors. Recently HP and Intel have introduced a new 64-bit ISA for general-purpose systems, called the IA64 or IPF.
Unlike previous-generation general-purpose ISAs, the IA64 makes extensive use of predication and provides support for speculative execution (advance loads, for instance). This new instruction set is quite resistant to out-of-order implementation, because of the required resource sizes as well as the complexity of executing predicated instructions. Although the research community has started to study the execution of predicated instructions within an out-of-order core, no definitive solution has been adopted yet.
We are developing a software simulation platform called IAOO, designed to provide the research community with a framework for investigating the out-of-order execution of IA64-like instruction sets. The framework is built around a set of libraries and application programs that allow detailed analysis of the structure of binary programs as well as emulation or simulation of full binary executables.
Additionally, we have begun to investigate a novel register management policy designed to operate smoothly with a fully predicated ISA. This new system is based on an intermediate representation called the Translation Register Buffer (TRB). The TRB mechanism, which translates a logical register into a physical register, is shown to be effective when an instruction is canceled by a predicate.
In the framework of the EPICEA project (cf. ), CAPS collaborates with PRiSM (University of Versailles). The PRiSM research group is in charge of benchmarking and analyzing the specific behavior of each implementation of the IA64 architecture (Itanium 1, Itanium 2, Madison, ...). A significant effort is devoted to the analysis of the memory subsystem, which exhibits many pathological behaviors. Using this knowledge of the IA64 implementations, the CAPS team develops specific techniques that exploit implementation features to optimize codes. The main achievement of this collaboration is the development, by the CAPS research group, of a highly optimized code generator for computation-intensive loops. Such portions of code (called computation-intensive kernels) are very common in scientific applications and require very specific optimizations; they usually represent vector computations. The generator can exploit all the IA64 architecture features, such as software pipelining (implemented in hardware with the rotating register mechanism) and prefetch instructions. We have shown that the current implementations of the IA64 architecture can essentially be used like a vector machine on vector-like codes.
A methodology for optimizing software was developed around this generator. This method includes the detection of the portions of code to optimize, the generation of the corresponding optimized kernels and their integration into a library, the modification of the source code to make use of the generated kernels, and the linking of the modified source code with the library containing the generated kernels.
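The kernels targeted by the generator are typically simple, regular vector loops whose performance is bounded by the memory hierarchy; a representative example (plain C source, before kernel generation):

/* A typical computation-intensive kernel (DAXPY): a regular vector loop
   whose performance on IA64 is bounded by the memory hierarchy. The
   generator would software-pipeline it using rotating registers and
   insert prefetch instructions. */
void daxpy(int n, double alpha, const double *restrict x,
           double *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}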
Managing the power consumption/performance tradeoff on high-performance embedded processors has become a major challenge. The cache hierarchy is a typical example of such a power/performance design point. On the one hand, a cache makes it possible to maintain an important fraction of the embedded code and data workload on-chip, thus reducing the amount of memory traffic and thereby improving both performance and power consumption. On the other hand, on some embedded processors, the cache memory accounts for a large share of the total power consumption. The processor design community reacted to this problem by making the cache subsystem reconfigurable.
It is well known that programs usually execute as a series of phases.
Many of today's processors provide extensions to their instruction set specifically designed for computation-intensive multimedia applications (PowerPC AltiVec, TriMedia, Pentium MMX, ...). These extensions are based on SIMD instructions that operate on data subwords packed into registers. They are usually provided as intrinsics that can be inserted in C code, but they can also be inserted in assembly code.
Direct insertion in assembly code requires good knowledge of both the processor architecture and compilation techniques; moreover, such an approach does not lead to portable code. Using intrinsics in C source code still requires code transformations (such as vectorization) to highlight code regions where data parallelism can be exploited.
In the context of the Medea+ MESA project, CAPS has developed a C-to-C retargetable preprocessor prototype called SWARP.
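As an example of the kind of C-to-C transformation involved, a scalar loop can be rewritten with SIMD intrinsics; the SSE flavour is used below for concreteness (SWARP itself is retargetable), and alignment and remainder handling are elided:

/* Scalar loop and its vectorized SSE equivalent, illustrating the kind
   of output a C-to-C SIMD preprocessor produces. Assumes n is a
   multiple of 4 and the arrays are 16-byte aligned. */
#include <xmmintrin.h>

void add_scalar(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void add_simd(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {         /* four floats per instruction */
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));
    }
}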
The PACCMAN project (PlAteforme de Composants Cryptographiques pour Multi-Applications Nationales, cf. ) aims at designing and producing a ``French'' cryptographic platform. This platform will be composed of a cryptography ASIP (Application Specific Integrated Processor) and a complete compiler toolchain (including debugger and profiler).
The goal of the PACCMAN platform is to improve the technological independence and competitiveness of the French cryptography industry. The two main industrial application fields of this project are the implementation of secure PMR (Professional Mobile Radio) communications and high-bandwidth VPNs (Virtual Private Networks).
The CAPS research group is in charge of Inria's contribution to PACCMAN. This contribution is four-fold:
CAPS collaborates on the specification of the instruction set architecture (ISA) of the PACCMAN processor.
CAPS designs, implements and validates the compiler toolchain
for the PACCMAN processor. This compiler toolchain is made of:
an ANSI-C compiler including both state-of-the-art and dedicated optimizing techniques,
a software debugger,
a software profiler.
CAPS designs, implements and validates a cycle-accurate
simulator of the PACCMAN processor for:
functional validation of the compiler toolchain itself,
fine tuning dedicated optimizations implemented in the compiler,
pre-silicon validations of software applications.
CAPS produces an ISA reference manual and technical documentation for the compiler toolchain and the cycle-accurate simulator.
The PACCMAN compiler toolchain and the PACCMAN simulator integrate several technologies developed by CAPS, including a retargetable system for assembly language transformation and optimization (SALTO).
The industrialization and maintenance of the PACCMAN development suite will be transferred to CAPS entreprise (cf. ).
Some performance issues must be handled at a higher level than the direct interface between the hardware and the instruction set. For heterogeneous SoCs featuring special-purpose hardware and one or more execution cores, we are exploring thread extraction for the different hardware components. Code size is often an issue on embedded systems. We are exploring tradeoffs based on code compression and interpretation (cf. ). Compiler optimizations for embedded systems necessitate software platforms which can support a large family of low-end (assembly-level) optimizations and can be easily retargeted to new ISAs as well as new microarchitectures (ALISE, cf. ). For the end user, understanding the effective performance of a code on a hardware platform is a major issue, since one wants to identify the faulty component (hardware, compiler or application implementation). We are developing a performance debugging tool for this purpose (ATLLAS, cf. ).
Systems on Chip (SoCs) are highly integrated architectures combining, on the same chip, at least a programmable processor, some memory, and other computing units. They are well suited to embedded applications with intensive computations.
Synthesizing for an application implies choosing the hardware components fitting the application and then mapping the application onto the chosen architecture. Some code sections may require a dedicated component for specific types of computations while others can be parallelized.
To achieve thread extraction, our approach focuses on two criteria: computational density and potential parallelism granularity. Compute-intensive areas are determined through profiling. Dependencies between parts of the application are first statically computed from the call graph. We then refine this analysis through an instrumented execution, focusing particularly on accesses to shared data.
Designing an embedded system is essentially a tradeoff between hardware and software. Developers must achieve a fast design while taking into account various constraints such as memory space, power consumption and application speed. Memory space is a strong constraint as it directly impacts the cost and the functionalities of the system.
One would like to minimize the amount of memory space allocated to programs to allow more applications to fit in the device or to reduce the number of memory chips. One may also want to optimize the amount of memory for an embedded system design, i.e., to find the exact quantity needed to allow good performance.
Two major techniques impact code size. On the one hand, optimizations improve performance (especially on architectures featuring instruction-level parallelism) while increasing code size. On the other hand, code compression reduces code size while degrading performance.
Finding a global tradeoff between code size and performance consists in allowing large code size on critical sections, where it provides important performance returns, while saving code space on seldom-executed code sections. This tradeoff concept is crucial in the compilation of embedded applications and is dependent on the target system and its applications. Enabling both optimizations and compression on a single code may make it possible to reach a near-optimal code size versus performance tradeoff.
We have already investigated strategies for optimization under code size constraints. We have defined a compiler-driven software compression scheme for this infrastructure. Our scheme achieves a high compression ratio. Our recent work deals with finding a tradeoff strategy using this compression scheme.
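As a generic illustration of compiler-driven code compression (this is not the specific scheme developed by the team), frequently repeated instruction sequences can be replaced by an escape byte plus an index into a dictionary built offline by the compiler:

#include <stdint.h>
#include <string.h>

#define ESCAPE    0xFF   /* marks a dictionary reference */
#define SEQ_LEN   4      /* dictionary entry length, in bytes */
#define DICT_SIZE 255

/* Dictionary of frequent instruction byte sequences, built offline by
   a compiler pass over the seldom-executed regions of the program. */
static uint8_t dict[DICT_SIZE][SEQ_LEN];
static int dict_entries;

static int dict_lookup(const uint8_t *code)
{
    for (int d = 0; d < dict_entries; d++)
        if (memcmp(code, dict[d], SEQ_LEN) == 0)
            return d;
    return -1;
}

/* Returns the compressed size. A real scheme must also escape literal
   0xFF bytes; this is omitted to keep the sketch short. */
int compress(const uint8_t *code, int n, uint8_t *out)
{
    int i = 0, o = 0;
    while (i < n) {
        int d = (i + SEQ_LEN <= n) ? dict_lookup(&code[i]) : -1;
        if (d >= 0) {                 /* known sequence: 2 bytes instead of 4 */
            out[o++] = ESCAPE;
            out[o++] = (uint8_t)d;
            i += SEQ_LEN;
        } else {
            out[o++] = code[i++];     /* literal byte */
        }
    }
    return o;
}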
Optimization infrastructures are a major asset to produce efficient code for embedded systems. In particular, optimization infrastructures are needed for testing and validating alternative code optimizations.
alisé is a new retargetable infrastructure for low-level code optimizations. alisé relies on a target machine description that provides structural information (instruction timing, resource usage, etc.) and instruction semantic information. alisé provides many functionalities, such as code region cloning, that allow alternative code optimizations.
alisé is especially adapted to complex DSP processors featuring non-orthogonal data paths and multiple register files. Using alisé, we have defined a new combined scheduling/code generation technique: thanks to semantic equivalence, code sequences can be replaced by semantically equivalent ones yielding shorter schedules. This has been experimented on the MMDSP architecture from STMicroelectronics.
This work is supported by STMicroelectronics (Central R&D - Crolles).
Multimedia processors, and mainly VLIW processors, heavily rely on compilers for the production of high-performance code. Our objective, supported by Thomson R&D, is the design of a retargetable tuning tool called ATLLAS (Analysis and Tuning tool for Low Level Assembly and Source code). ATLLAS performs the static extraction of code quality information from the optimized assembly level (number and types of instructions, functional unit occupation, WCET, critical path) and reports it at the source code level. It also allows the user to cross-check C source code and optimized assembly code through a graphical and interactive interface.
Support for multiple target architectures is enabled through the use of an abstraction of the code checking coupled with a generic control flow analysis.
Retargetability is achieved through the use of two software components. First, the processor hardware description is handled through the SALTO infrastructure.
ATLLAS has been fully implemented and tested. We have validated the principles of iterative, interactive performance debugging with ATLLAS. We currently fully support the Philips TriMedia TM1000 and the Equator MAP1000, and to a lesser extent the Sparc and MIPS architectures.
ATLLAS also features an in-depth analysis of the C source code (satisfying the C ISO/IEC 9899 standard) and an analysis of the relative performance through worst-case execution time analysis.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) is a user-level software unpredictable random number generator for general-purpose computers that exploits the modifications of the internal volatile hardware states of a processor as a source of uncertainty.
An unpredictable random number generator is a practical approximation of a truly random number generator. Such unpredictable random number generators are needed for cryptography.
Modern superscalar processors feature a large number of hardware mechanisms which aim at improving performance: caches, branch predictors, TLBs, long pipelines, instruction-level parallelism, etc. The state of these components is not architectural (i.e., the result of an ordinary application does not depend on it); it is also volatile and cannot be directly monitored by the user. On the other hand, every invocation of the operating system modifies thousands of these binary volatile states.
HAVEGE (HArdware Volatile Entropy Gathering and Expansion) combines on-the-fly hardware volatile entropy gathering with pseudo-random number generation.
More and more modern appliances, such as PDAs and cell phones, are built around low-power superscalar processors (e.g. StrongARM, Intel Xscale) and feature complex operating systems. This year, a demonstrator of HAVEGE for such a PDA, featuring PocketPC2002 and an Xscale processor, has been developed and is now distributed.
Showing that HAVEGE-like software can be a source of unpredictable random numbers on most modern computing appliances is the objective of the UNIHAVEGE project (cf. ).
This research is done in cooperation with the Inria Rocquencourt CODES team (Nicolas Sendrier).
To meet both the huge computational capabilities of the 100 nm IC generation and the exploding market needs, efficient design methods for multi-processor embedded system architectures are needed. The MESA project aims at conceiving a flexible multi-processor design platform that offers tools for application domain analysis, reconfigurable IP block usage, communication protocol design and validation techniques. MESA involves industrial partners (Philips, Bull, EADS Telecom, STMicroelectronics and Alcatel), SMEs (ARM, CoWare, MetaSymbiose, PolySpace, CAPS entreprise) and public/academic institutions (EPUN/MCSE, IMEC, Inria, KU-Leuven, LIP6, TIMA).
The EPICEA (Explicit Parallelism Instruction Computer Compiler Environment and Architecture) project is a collaboration between the Inria team CAPS, the University of Versailles Saint-Quentin, CEA and Bull, funded by the Ministry of Industry. The overall goal is to develop a software environment for scientific applications on architectures using the IA64 ISA. The contribution from CAPS is described in and .
The PACCMAN project (PlAteforme de Composants Cryptographiques pour Multi-Applications Nationales) aims at designing and producing a ``French'' cryptographic platform. This platform will be composed of a cryptography ASIP (Application Specific Integrated Processor) and a complete compiler toolchain (including debugger and profiler). The goal of the PACCMAN platform is to improve the technological independence and competitiveness of the French cryptography industry. The two main industrial application fields of this project are the implementation of secure PMR (Professional Mobile Radio) communications and high-bandwidth VPNs (Virtual Private Networks).
The PACCMAN project is supported by the OPPIDUM research network and involves several industrial partners. These partners are: EADS-DSN (the world leader in digital PMR), MATRAnet (software for network security), Bull TrustWay (software and hardware components for network security), and Inria.
The research on instruction fetch front-ends (cf. ) and register file structures (cf. ) is supported by Intel through a research grant (Convention 4 01 C 0677 00 31308 06 1) and an equipment donation.
The doctoral studies of Laurent Morin are supported by a CIFRE convention with Thomson MMD (cf. ).
The doctoral studies of Laurent Bertaux are supported by a CIFRE convention with STMicroelectronics (cf. ).
The doctoral studies of Pascal Terjan are supported by a CIFRE convention with STMicroelectronics (cf. ).
The doctoral studies of Gilles Pokam are supported by a CIFRE convention with STMicroelectronics (cf. ).
In 2003, a large part of the research team was involved in setting up the start-up company CAPS entreprise (http://www.caps-entreprise.com/). CAPS entreprise aims at bringing innovative software tools, solutions and services to the market of high-performance embedded systems. The company aims at becoming a reliable partner for system builders, platform designers and developers seeking the best system performance, by helping them match their software to the specificities of the underlying hardware platform.
CAPS entreprise offers standalone tools that are specialized for a given task (code transformation, simulation, worst-case execution time analysis...). These tools can act as building blocks in a software tool chain and are designed for seamless integration into common development environments.
The company proposes global compilation solutions, tailored to the customer's needs. After a detailed exploration of the user requirements against the existing process, specific additions and enhancements to the previous code generation infrastructure are proposed and implemented.
CAPS entreprise finally offers custom consulting services, such as performance analyses or instruction-set evaluations. Through these services, customers benefit from the company's in-house expertise and tools, helping them make strategic decisions on complex technology issues.
The company was recognized in 2003 as an innovative company by the French Ministry of Industry. CAPS entreprise is incubated by Emergys and Inria-Transfert.
Industrial transfer of ABSCISS (cf. ) to CAPS entreprise is supported by Inria through the postdoctoral fellowship of Ronan Amicel.
Research on unpredictable random number generation is funded through the ACI Sécurité project UNIHAVEGE. The main partners are the CAPS team and the CODES team from Inria Rocquencourt.
CAPS members participate in the RTP CNRS Architecture et Compilation: André Seznec is a member of the steering committee. Pierre Michaud participates in the specific action ``Nouvelles technologies et nouveaux paradigmes d'architecture''. François Bodin participates in the specific action ``Compilation pour les systèmes embarqués''.
A. Seznec has been a member of the program committee of HiPC'2003 (High Performance Computing). He is a member of the program committees of the 31st International Symposium on Computer Architecture (ISCA'31) and of CARI'2004. He is the program co-chair of ICS 2004 (International Conference on Supercomputing).
F. Bodin is a member of the editorial board of TSI and has been a member of the program committee of MEMOCODE'2003.
F. Bodin and A. Seznec teach computer architecture and compilation at the DEA of computer science, and in the DIIC and DESS ISA curricula at IFSIC, University of Rennes I.
A. Seznec presented a seminar entitled ``HAVEGE: Hardware Volatile Entropy Gathering and Expansion, generating unpredictable random numbers at user-level'' at UPC, Barcelona, in April 2003.
A. Seznec presented a seminar entitled ``Ahead Pipelining the Instruction Address Generator'' at UPC, Barcelona, in April 2003 and at Intel, Hillsboro, Oregon, in June 2003.
Yanos Sazeides, University of Cyprus, has been visiting the project for a month in May 2003.
A. Seznec is an elected member of the evaluation committee of Inria.
F. Bodin is an elected member of the IFSIC Committee.
F. Bodin is responsible for the DEA of computer science at the University of Rennes I and chairman for doctoral studies at Irisa.
F. Bodin is vice-chairman of the Ecole doctorale Matisse.
F. Bodin, with V. Verdon and the help of Y. Sost, is at the origin of the creation of the M. Métivier foundation.
Jacques Lenfant is the chairman of the professor hiring committee of University of Rennes I.
Jacques Lenfant is a member of the ``Académie des sciences et des technologies''.