Using Machine Learning to Focus Iterative Optimization

Alchemy Architectures, Languages and Compilers to Harness the End of Moore Years

Algorithmics, Programming, Software and Architecture

Architecture and Compiling

Olivier Temam (INRIA, Researcher, Saclay): Research Director (DR) Inria, Team Leader; HDR

Valérie Berthou (INRIA, Assistant, Saclay): TR Inria

Denis Barthou (INRIA, Faculty member, Saclay): Assistant professor, University of Versailles-Saint-Quentin, delegation INRIA until September 2009

Sid-Ahmed-Ali Touati (French university, Faculty member, Saclay): Assistant professor, University of Versailles-Saint-Quentin, delegation INRIA, since September 2009

Hugues Berry (INRIA, Researcher, Saclay): Research Associate (CR) Inria

Albert Cohen (INRIA, Researcher, Saclay): Research Director (DR) Inria; HDR

Christine Eisenbeis (INRIA, Researcher, Saclay): Research Director (DR) Inria

Grigori Fursin (INRIA, Researcher, Saclay): Research Associate (CR) Inria

Cédric Bastoul (French university, Faculty member, Saclay): Assistant Professor

Frédéric Gruau (French university, Faculty member, Saclay): Assistant Professor

Frédéric Brault (INRIA, PhD, Saclay): Digiteo scholarship, until February 2009

Anna Beletksa (INRIA, PostDoc, Saclay): Inria postdoc, since March 2009

Philippe Dumont (INRIA, PostDoc, Saclay): Inria postdoc, FP6 IST grant

Armin Größlinger (INRIA, PostDoc, Saclay): Inria postdoc, FP6 IST grant, from April 2009 until December 2009

Zbigniew Chamski (INRIA, Technical staff, Saclay): Inria expert engineer, FP6 IST grant, from June 2009 until September 2009

Joern Rennecke (INRIA, Technical staff, Saclay): Inria expert engineer, FP7 IST grant

Walid Benabderrhamane (INRIA, PhD, Saclay): MENRT scholarship, University of Paris-Sud

Mouad Bahi (INRIA, PhD, Saclay): Inria scholarship, University of Paris-Sud

Cupertino Miranda (INRIA, PhD, Saclay): Portuguese grant, University of Paris-Sud

Konrad Trifunovic (INRIA, PhD, Saclay): Inria scholarship, University of Paris-Sud

Boubacar Diouf (INRIA, PhD, Saclay): MENRT scholarship, University of Paris-Sud

Mounira Bachir (INRIA, PhD, Saclay): ATER, University of Versailles-Saint-Quentin, since September 2009

Olivier Certner (French university, PhD, Saclay): STMicroelectronics fellowship (CIFRE), University of Paris-Sud

Mohammed Fellahi (INRIA, PhD, Saclay): ATER (half), University of Paris-Sud

Riyad Baghdadi (INRIA, Other, Saclay): Internship, since October 2009

Soufiane Baghdadi (INRIA, Other, Saclay): Internship, since October 2009

Abramo Bagnara (INRIA, Other, Saclay): ACOTES project, from January 2009 until May 2009

François Galea (INRIA, Other, Saclay): from September 2009 until October 2009

Anne-Sophie Coquel (INRIA, PhD, Saclay): Large Scale Initiative ColAge

Yuriy Kashnikoff (INRIA, Other, Saclay)

Nicolas Nash (Foreign university, Other, Saclay): Summer intern, PhD student from Trinity College Dublin, April-June 2009

Ozcan Ozturk (Foreign university, Visitor, Saclay): Visiting Professor from Bilkent University, Ankara, July-August 2009

Joern Rennecke (Foreign university, Visitor, Saclay): Expert engineer, November 2009 until January 2010

Nicolas Zermati (French university, Other, Saclay): Master 1 intern, from March 2009 until July 2009

Shaoshan Liu (Foreign university, PhD, Saclay): Internship, from February 1st, 2009 until April 30th, 2009, collaboration with UCI, Los Angeles

Abdelfetteh Louati (INRIA, Other, Saclay): ADT expert engineer

Fei Jiang (INRIA, PhD, Saclay): Inria scholarship, with the Tao Inria team, University of Paris-Sud, until December 2009

Taj Muhammad Khan (INRIA, PhD, Saclay): Inria scholarship, University of Paris-Sud

Zheng Li (INRIA, PhD, Saclay): Inria scholarship, University of Paris-Sud

Luidnel Maignan (French university, PhD, Saclay): MENRT scholarship, University of Paris-Sud

Louis-Noël Pouchet (French university, PhD, Saclay): MENRT scholarship, University of Paris-Sud

Adrien Eliche (French university, PhD, Saclay): MENRT scholarship, University of Versailles-Saint-Quentin

Sean Halle (Foreign university, PhD, Saclay): Inria expert engineer, U. of California Santa Cruz, from May 2009 until November 2009

Pierre Amiranoff (French university, External collaborator, Saclay): PRAG, IUT d'Orsay

Marouane Belaoucha (French university, PhD, Saclay): MESR scholarship, co-supervised with S. Touati, University of Versailles-Saint-Quentin

Nathalie Drach (French university, External collaborator, Saclay): Professor, Paris-6 University

Andres Charif-Rubial (French university, PhD, Saclay): ANR ProHMPT, co-supervised with W. Jalby (PRiSM, UVSQ), University of Versailles-Saint-Quentin

Julien Jaeger (French university, PhD, Saclay): ANR Para, University of Versailles-Saint-Quentin

Pablo Oliveira (French university, PhD, Saclay): CEA scholarship, co-supervised with S. Louise (CEA LIST), University of Versailles-Saint-Quentin

Overall Objectives

Alchemy is a joint Inria/University of Paris Sud research group.

The general research topics of the Alchemy group are architectures, languages and compilers for high-performance embedded and general-purpose processors. Alchemy investigates scalable architecture and compiler/programming solutions for high-performance general-purpose and embedded processors. Alchemy stands for Architectures, Languages and Compilers to Harness the End of Moore Years, referring to both the traditional processor architectures implemented using the current photo-lithographic processes, and novel architecture/language paradigms compatible with future and alternative technologies. The current emphasis of Alchemy is on the former part, and we are progressively increasing our efforts on the latter part.

The research goals of Alchemy span from short term to long term. The short-term goals target existing complex processor architectures, and thus focus on improving program performance on these architectures (software-only techniques). The medium-term goals target the upcoming CMPs (Chip Multi-Processors) with a large number of cores, which will result from the now slower progression of core clock frequency due to technological limitations. The main challenge is to take advantage of the large number of cores for a wide range of applications, considering that automatic parallelization techniques have not yet proved an adequate solution. In Alchemy, we explore joint architecture/programming paradigms as a pragmatic alternative solution. Finally, even longer-term research is conducted with the goal of harnessing the properties of future and alternative technologies for processing purposes.

Most of the research in Alchemy attempts to jointly consider the hardware and software aspects, based on the premise that many of the limitations of existing architecture and compiler techniques stem from the lack of cooperation between architects and compiler designers. However, Alchemy addresses the aforementioned research goals through two different, though sometimes complementary, approaches. One approach considers that, in spite of their complexity, architectures and programs can still be accurately and efficiently modeled (and optimized) using analytical methods. The second approach considers that the architecture/program pair already has reached, or will reach, a complexity level that evades analytical methods, and explores a complex systems approach; the principle is to accept that the architecture/program pair is more easily understood (and thus optimized) based on its observed behavior rather than inferred from its known design.

Scientific Foundations

In the sections below, the different research activities of Alchemy are described, from short-term to long-term goals. For most of the goals, both analytical and complex systems approaches are conducted.

A practical approach to program optimizations for complex architectures

This part of our research work is more targeted to single-core architectures but also applies to multi-cores. The rationale for this research activity is that compilers rely on architecture models embedded in heuristics to drive compiler optimizations and strategy. As architecture complexity increases, such models tend to be too simplistic, often resulting in inefficient steering of compiler optimizations.

Iterative optimization

Our general approach consists in acknowledging that architectures are too complex to embed reliable architecture models in compilers, and to explore the behavior of the architecture/program pair through repeated executions. Then, using machine-learning techniques, a model of this behavior is inferred from the observations. This approach is usually called iterative optimization.

In recent years, iterative optimization has emerged as a major research trend, both in traditional compilation contexts and in application-specific library generators (like ATLAS or SPIRAL). The topic has matured significantly since the pioneering works of Mike O'Boyle at University of Edinburgh, UK or Keith Cooper at Rice University. While these research works successfully demonstrated the performance potential of the approach, they also highlighted that iterative optimization cannot become a practical technique unless a number of issues are resolved. Some of the key issues are: the size and structure of the search space, the sensitivity to data sets, and the necessity to build long transformation sequences.

Scanning a large search space. Transformation parameters, the order in which transformations are applied, and even which transformations must be applied and how many times, all form a huge transformation space. One of the main challenges of iterative optimization is to rapidly converge towards an efficient, if not optimal, point of the transformation space. Machine-learning techniques can help build an empirical model of the transformation space in a simple and systematic way, based only on the observed behavior of transformations, and then rapidly deduce the most profitable points of the space. We are investigating how to correlate static and dynamic program features with transformation efficiency. This approach can speed up the convergence of the search process by one or two orders of magnitude compared to random search.
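The idea of focusing the search with what earlier runs revealed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-parameter transformation space, the synthetic `measure` function that stands in for compiling and running the program, and the very simple "model" (resample each parameter from the best configurations seen so far), which is far cruder than the feature-based models mentioned above.

```python
import random

# Hypothetical transformation space: each point picks an unroll factor,
# a tile size, and whether the two loops are interchanged.
SPACE = {
    "unroll": [1, 2, 4, 8],
    "tile": [8, 16, 32, 64],
    "interchange": [False, True],
}

def measure(config):
    """Stand-in for compiling and running the program (synthetic time)."""
    time = 10.0
    time -= {1: 0.0, 2: 1.0, 4: 2.0, 8: 1.5}[config["unroll"]]
    time -= {8: 0.5, 16: 1.5, 32: 3.0, 64: 1.0}[config["tile"]]
    time -= 2.0 if config["interchange"] else 0.0
    return time

def random_config(rng):
    """Uniform sampling of the space: the baseline random search."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def focused_config(rng, good_configs):
    """Sample each parameter from the values used by the best runs so far."""
    return {name: rng.choice([c[name] for c in good_configs]) for name in SPACE}

def iterative_search(budget, rng, focus=True, warmup=8, keep=3):
    """Run the program `budget` times and return the best (time, config)."""
    history = []  # one (time, config) entry per program run
    for run in range(budget):
        if focus and run >= warmup:
            ranked = sorted(history, key=lambda entry: entry[0])
            config = focused_config(rng, [c for _, c in ranked[:keep]])
        else:
            config = random_config(rng)
        history.append((measure(config), config))
    return min(history, key=lambda entry: entry[0])

if __name__ == "__main__":
    best_time, best_config = iterative_search(40, random.Random(0))
    print(best_time, best_config)
```

In a real setting `measure` would be by far the dominant cost, which is why reducing the number of runs needed to reach a good point matters more than the cost of the model itself.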

We have also shown that by representing the impact of loop transformations using a structured encoding derived from the polyhedral program representation, it is possible to reduce the complexity of the search by several orders of magnitude. This encoding is further described in the section on polyhedral program representation.

Finally, we have found that it is possible to further speed up transformation space exploration by exploring several transformations during a single run. Currently, one program transformation is explored for each loop nest, while performance often reaches a stable state soon after the start of the execution. We have shown that, assuming we properly identify the phase behavior of programs, it is possible to explore multiple transformations in each run.

Data set sensitivity. Iterative optimization is based on the notion that the compiler will discover the best way to optimize a program by repeatedly running the same program on the same data set, trying one or a few different optimizations on each run. In reality, however, a user rarely needs to process the same data set twice. Iterative optimization therefore rests on the implicit assumption that the best optimization configuration found will work well for all data sets of a program. To the best of our knowledge, this assumption has never been thoroughly investigated. Most studies on iterative optimization repeatedly execute the same program/data set pair; only recently have some studies focused on the impact of data sets on iterative optimization.

In order to explore the issue of data set sensitivity, we have assembled a suite of 20 data sets per benchmark for most of the MiBench embedded benchmarks. We have found that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise configuration across data sets which is often within 5% of the best possible optimization configuration for most data sets, and that the iterative process can converge in fewer than 20 iterations (for a population of 200 optimization configurations). Overall, the preliminary conclusion, at least for the MiBench benchmarks, is that iterative optimization is a fairly robust technique across data sets, which brings it one step closer to practical usage.
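One way to make the notion of a compromise configuration concrete is sketched below. The selection rule (minimize the worst slowdown relative to the best configuration of each data set) is an assumption of this sketch, not necessarily the criterion used in the study, and the timing table is invented for illustration.

```python
# Synthetic execution times in seconds: one row per optimization
# configuration, one column per data set. The numbers are invented.
TIMES = {
    "cfg_a": [1.00, 2.10, 3.00],
    "cfg_b": [1.20, 2.00, 3.60],
    "cfg_c": [1.02, 2.04, 3.06],
}

def best_per_data_set(times):
    """Best achievable time for each data set, over all configurations."""
    return [min(column) for column in zip(*times.values())]

def worst_slowdown(config_times, best):
    """Largest ratio to the per-data-set best for one configuration."""
    return max(t / b for t, b in zip(config_times, best))

def compromise(times):
    """Configuration whose worst-case slowdown across data sets is smallest."""
    best = best_per_data_set(times)
    return min(times, key=lambda name: worst_slowdown(times[name], best))

if __name__ == "__main__":
    choice = compromise(TIMES)
    print(choice, worst_slowdown(TIMES[choice], best_per_data_set(TIMES)))
```

In this toy table no single configuration is best everywhere, yet `cfg_c` stays within a few percent of the best on every data set, which is the kind of outcome the paragraph above describes.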

Compositions of program transformations. Compilers impose a certain set of program transformations, an order of application, and the number of times each transformation is applied. In order to explore the possible gains beyond these strict constraints, we have manually optimized kernels and benchmarks, trying to achieve the best possible performance assuming no constraint on transformation order, count or selection. The study helped us clarify which transformations bring the best performance improvements in general. But the main conclusion of that study is that surprisingly long compositions of transformations are sometimes needed (in one case, up to 26 composed loop transformations) in order to achieve good performance, either because multiple issues must be tackled simultaneously or because some transformations act as enabling operations for other transformations.

As a result, we have started developing a framework facilitating the composition of long sequences of transformations. This framework is based on the polyhedral representation of program transformations. It also enables a more analytical approach to program optimization and parallelization, beyond the simple composition of transformations. This latter part is further developed in the section on polyhedral program representation.
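Why an algebraic representation makes long compositions manageable can be seen on a deliberately simplified case. The sketch below assumes a two-deep loop nest and restricts itself to transformations expressible as 2x2 integer matrices acting on the iteration vector (i, j); the actual polyhedral framework is considerably more general (per-statement affine schedules, domains, access functions), and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two small integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transform(matrix, point):
    """Image of an iteration vector under a transformation matrix."""
    return tuple(sum(row[k] * point[k] for k in range(len(point)))
                 for row in matrix)

IDENTITY = [[1, 0], [0, 1]]
INTERCHANGE = [[0, 1], [1, 0]]   # (i, j) -> (j, i)
SKEW = [[1, 0], [1, 1]]          # (i, j) -> (i, i + j)

def compose(*steps):
    """Fold a sequence of transformations, applied left to right, into one matrix."""
    result = IDENTITY
    for step in steps:
        result = matmul(step, result)
    return result

if __name__ == "__main__":
    combined = compose(SKEW, INTERCHANGE)
    print(combined)                      # a single matrix for the whole sequence
    print(transform(combined, (2, 3)))   # same as applying the steps one by one
```

However long the sequence, the result is still one matrix of the same size, so a sequence of 26 steps costs no more to represent or to apply than a single step.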

Putting it all together: continuous optimization. Increasingly, we are moving toward automating the whole iterative optimization process. Our goal is to bring together, within a single software environment, the different aforementioned observations and techniques (search space techniques, data set sensitivity properties, long compositions of transformations, etc.). We are currently in the process of plugging these different techniques into GCC in order to create a tool capable of continuous, whole-program optimization, and even collaborative optimization across different users.

Hardware-oriented applications of iterative optimization. Because iterative optimization can successfully capture complex dynamic/run-time phenomena, we have shown that the approach can act as a replacement for costly hardware structures designed to improve the run-time behavior of programs, such as out-of-order execution in superscalar processors. An iterative optimization-like strategy applied to an embedded VLIW processor was shown to achieve almost the same performance as if the processor were fitted with dynamic instruction reordering support. We are also investigating applications of this approach to the specialization/idiomization of general-purpose and embedded processors. Currently, we are exploring similar approaches for providing thread scheduling and placement information for CMPs without requiring costly run-time environment overhead or hardware support. This latter study is related to the work presented in Section .

Polyhedral program representation: facilitating the analysis and transformation of programs

Loop transformations are crucial for performance, yet they are among the hardest to drive predictably through static cost models, and their current support in compilers is disappointing. After decades of experience and theoretical advances, the best compilers can miss some of the most important loop transformations in simple numerical codes from linear algebra or signal processing. Performance hits of more than an order of magnitude are not uncommon on single-threaded code, and the situation worsens when automatically parallelizing or optimizing parallel code.

Our previous work on sequences of loop transformations has led to the design of a theoretical framework based on the polyhedral model, and a set of tools based on the advanced Open64 compiler. We have shown that this framework simplifies the problem of building complex transformation sequences, that it scales to real-world benchmarks, and that it makes it possible to significantly reduce the size of the search space and better understand its structure. The latter work, for example, is the first attempt at directly characterizing all legal and distinct ways to reschedule a loop nest.
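What "legal" means for a rescheduling can be illustrated on a toy case. The sketch below uses the textbook criterion for linear transformations of a perfectly nested loop with constant dependence distance vectors: a transformation is legal if every transformed distance vector remains lexicographically positive. The brute-force enumeration over small matrices is an assumption made for illustration only; it is not the characterization used in the work cited above, which operates on the polyhedral representation directly.

```python
from itertools import product

def transform(matrix, vector):
    """Image of a dependence distance vector under a transformation matrix."""
    return tuple(sum(r * v for r, v in zip(row, vector)) for row in matrix)

def lex_positive(vector):
    """True if the first non-zero component is positive."""
    for x in vector:
        if x != 0:
            return x > 0
    return False

def is_legal(matrix, dependences):
    """A transformation is legal if it keeps every dependence lex-positive."""
    return all(lex_positive(transform(matrix, d)) for d in dependences)

def legal_unimodular(dependences, entries=(-1, 0, 1)):
    """Enumerate legal 2x2 matrices with small entries and determinant +1 or -1."""
    found = []
    for a, b, c, d in product(entries, repeat=4):
        matrix = ((a, b), (c, d))
        if abs(a * d - b * c) == 1 and is_legal(matrix, dependences):
            found.append(matrix)
    return found

if __name__ == "__main__":
    deps = [(1, -1), (0, 1)]   # hypothetical dependence distance vectors
    for matrix in legal_unimodular(deps):
        print(matrix)
```

With these two dependences, plain loop interchange is rejected because it maps (1, -1) to (-1, 1), whereas skewing followed by interchange is accepted, a small instance of one transformation enabling another.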

After two decades of academic research, the polyhedral model is finally evolving into a mature, production-ready approach to the challenges of maximizing the scalability and efficiency of statically-controlled, loop-based computations on a variety of high-performance and embedded targets. After Open64, we are now porting these techniques to the GCC compiler, applying them to several multi-level parallelization and optimization problems, including vectorization, and the extraction and exploitation of thread-level parallelism on distributed-memory CMPs like the Cell broadband engine from IBM, NXP's CAT-DI scalable signal-processing accelerator and the emerging xStream architecture from STMicroelectronics.

Project-team positioning

Note: The goal of this section and others like it is not to act as a traditional and exhaustive “related work” section as found in research articles, but rather to provide references to a few research works which are the closest to our own.

While iterative optimization is based on simple principles which were proposed a long time ago, this approach has been significantly developed by Mike O'Boyle at University of Edinburgh since 1997, and more recently by Keith Cooper at Rice University. Since then, many research groups have shown example cases where an iterative approach might be profitable (various application targets, various steps of the compilation process, various architecture components). These researchers have shown that iterative optimization has significant potential. Other research groups (Polaris group at University of Illinois, CAPS at INRIA) have since successfully demonstrated that iterative optimization can be used in practice for the design of libraries, or even that it can be integrated in production compilers to assist existing optimizations. As mentioned before, Alchemy is now focusing on the issues which hinder its practical application.

Joint architecture/programming approaches

While the section on program optimizations for complex architectures is only concerned with transforming programs for a more efficient exploitation of existing architectures, in the longer term, researchers can assume that modifications of architectures and/or programs are possible. These relaxed constraints make it possible to target the root causes of poor architecture/program performance.

The current architecture/program model partly fails because the burden is placed excessively either on the architecture (superscalar processors) or on the compiler (VLIW and now CMPs). Both compiler and architecture optimizations often amount to program reverse-engineering: compilers attempt to dig up program properties (locality, parallelism) from the static program, while architectures attempt to retrieve them from program run-time behavior. Yet in many cases, the user is not only aware of these properties but could pass them effortlessly to the architecture and the compiler, provided she had the appropriate programming support, provided the compiler would pass this information to the architecture, and provided the architecture were fitted with the appropriate support to take advantage of it. For instance, simply knowing that a C structure denotes a tree rather than a graph can provide significant information for parallel execution. Such approaches, while not fully automatic, are practical and would relieve the complexity burden of the architecture and the compiler, while extracting significant amounts of task-level parallelism.

In the paragraphs below we apply this approach of passing more program semantics to the compiler and the architecture, first for domain-specific stream-oriented programs, and then for the parallelization of more general programs.

A targeted domain: Passing program semantics using a synchronous language for high-performance video processing

While we are currently investigating the aforementioned approach for general-purpose applications, we have started with the investigation of the specific domain of high-end video processing. In this domain, assessing that real-time properties will be satisfied is as important as reaching uncommon levels of compute density on a chip. 150 giga-operations per second per Watt (on pixel components) is the norm for current high-definition TVs, and cannot be achieved with programmable cores at present. The future standards will need an 8-fold increase (e.g., for 3D displays or super-high-definition). Predictability and efficiency are the keywords in this domain, in terms of both architecture and compiler behavior.

Our approach combines the aforementioned iterative optimization and polyhedral modeling research with a predictability- and efficiency-oriented parallel programming language. We focus on warrantable (as opposed to best-effort) usage of hardware resources with respect to real-time constraints. Therefore, this parallel programming language must allow overhead-free generation of tightly coupled parallel threads, interacting through dedicated registers rather than caches, streaming data through high-bandwidth, statically managed interconnect structures, with frequent synchronizations (once every few cycles), and very limited memory resources immediately available. This language also needs to support advanced loop transformations, and its representation of concurrency must be compatible with the expression of multi-level partitioning and mapping decisions. All these conditions point to a language closer to hardware synthesis languages than to general-purpose, von Neumann oriented imperative ones.

The synchronous data-flow paradigm is a natural candidate, because of its ability to combine high-productivity in programming complex concurrent applications (due to the determinism and compositionality of the underlying model, a rare feature of a concurrent semantics), direct modeling of computation/communication time, and static checking of non-functional properties (time and resource constraints). Yet generating low-level, tightly fused loops with maximal exposition of fine-grain parallelism from such languages is a difficult problem, as soon as the target processor is not the one being described by the synchronous data-flow program, but a pre-existing target on which we are folding an application program. The two tasks are totally different: although the most difficult decisions are pushed back to the programmer in the hardware synthesis case, application programmers usually rely on the compiler to abstract away the folding of their code in a reasonably portable fashion across a variety of targets. This aspect of synchronous language compilation has largely been overlooked and constitutes the main direction of our work. Another direction lies in the description of hardware resources, at the same level as the application being mapped and scheduled onto them; this unified representation would allow the expression of the search space of program transformations, and would be a necessary step to apply incremental refinement methods (expert-driven, very popular in this domain).

Technically, we extend the classical clock calculus (a type system) of the Lucid Synchrone language, making explicit significantly more information about the program behavior, especially when tasks must be started and will be completed, how information flows among tasks, etc. Our main contribution is the integration of relaxed synchronous operators like jittering and bursty streams within synchronous bounds. This research consists in revisiting the semantics of synchronous Kahn networks in the domain of media streaming applications and reconfigurable parallel architectures, in collaboration with Marc Duranton from Philips Research Eindhoven (now NXP Semiconductors) and with Marc Pouzet from LRI and the Proval INRIA project team.

A more general approach: Passing program semantics using software components

Beyond domain-specific and regular applications (loops and arrays), automatic compiler-based parallelization has achieved only mixed results on programs with complex control and data structures. Writing, and especially debugging, large parallel programs is a notoriously difficult task, and one may wonder whether the vast majority of programmers will be able to cope with it. Currently, transactional memory is a popular approach for reducing the programmer burden using intuitive transaction declarations instead of more complex concurrency control constructs. However, it does not depart from the classic approach of parallelizing standard C/C++/Fortran programs, where parallelism can be difficult to extract or manipulate. Parallel languages, such as HPF, require more ambitious evolutions of programming habits, but they also let programmers pass more semantics about the control and data characteristics of programs to the compiler for easier and more efficient parallelization. However, one can only observe that, for the moment, few such languages have become popular in practice.

A solution would have a better chance of being adopted by the community of programmers at large if it integrated well with popular practices in software engineering, and this aspect of the parallelization problem may have been overlooked. Interestingly, software engineering has recently evolved towards programming models that can blend well with multi-core architectures and parallelization. Programming has consistently evolved towards more encapsulation (procedures, then objects, then components), essentially for two reasons: programmers have difficulty grasping large programs and need to think locally, and encapsulation enables reuse of programming efforts. Component-based programming, as proposed in Java Beans, .Net or more ad-hoc component frameworks, is the step beyond C++ or Java objects: programs are decomposed into modules which fully encapsulate code and data (no global variable) and which communicate among themselves through explicit interfaces/links.

Components have many assets for the task of developing parallel programs. (1) Components provide a pragmatic approach for bringing parallelization to the community at large thanks to component reuse. (2) Components provide an implicit and intuitive programming model: the programmer views the program as a "virtual space" (rather than a sequence of tasks) where components reside; two components residing together in the space and not linked or not communicating through an existing link implicitly operate in parallel; this virtual space can be mapped to the physical space of a multi-threaded/multi-core architecture. (3) Provided the architecture is somehow aware of the program decomposition into components, and can manipulate individual components, the compiler (and the user) would be also relieved of the issue of mapping programs to architectures.

In order to use software components for large-scale and fine-grain parallelization, the key notion is to augment them with the ability to split or replicate. For instance, a component walking a binary tree could spawn two components to scan two child nodes and the corresponding sub-trees in parallel.
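The tree-walking example can be sketched as follows. This is only an illustration of the replication idea, not the CAPSULE implementation: the class and function names are hypothetical, ordinary Python threads stand in for components, and the rule "replicate only while a worker slot is free" is an assumption of the sketch.

```python
import threading

class Node:
    """Binary tree node holding an integer value."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

class TreeSum:
    """A component that sums a subtree and may replicate itself on child subtrees."""

    def __init__(self, max_workers=4):
        # Hypothetical resource limit: replication is attempted only
        # while a worker slot is free (an assumption of this sketch).
        self._slots = threading.Semaphore(max_workers)

    def run(self, node):
        if node is None:
            return 0
        if node.left is not None and self._slots.acquire(blocking=False):
            # Replicate: a new component handles the left subtree in parallel
            # while the current one continues with the right subtree.
            result = {}

            def replica():
                try:
                    result["left"] = self.run(node.left)
                finally:
                    self._slots.release()

            worker = threading.Thread(target=replica)
            worker.start()
            right = self.run(node.right)
            worker.join()
            return node.value + result["left"] + right
        # No free slot (or no left child): walk both subtrees sequentially.
        return node.value + self.run(node.left) + self.run(node.right)

def build(depth, value=1):
    """Build a complete binary tree of the given depth with constant values."""
    if depth == 0:
        return None
    return Node(value, build(depth - 1, value), build(depth - 1, value))

if __name__ == "__main__":
    print(TreeSum(max_workers=4).run(build(4)))
```

The program text is the same whether or not a replica is created, so the decision to split can be left to whatever manages the resources. In CPython the global interpreter lock prevents a real speedup for this workload; the sketch shows the control structure only.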

We are investigating a low-overhead component-based approach for fine-grain parallelism, called CAPSULE, where components have the ability to replicate. We investigate both a hardware-supported and a software-only approach to component division. We show that a low-overhead component framework, possibly paired with component hardware support, can provide both an intuitive programming model for writing fine-grain parallel programs with complex control flow and data structures, and an efficient platform for parallel component execution.

Project-team positioning

As explained before, both approaches pursued rely on the same philosophy, namely passing more program semantics to the compiler and the architecture, though the techniques differ significantly. Naturally, there is a huge body of literature on parallelization, and here, we can only hint at some of the main research directions. One current approach relies on automatic parallelization of standard programs, but the automatic parallelization of “complex” applications (complex control flow and data structures) has registered mixed results. Another approach is software/hardware thread-level speculation, but one may question its cost and scalability. As mentioned before, transactional memory has become a popular approach for reducing the burden of parallelizing applications. Other approaches include parallel languages, such as HPF, or parallel directives such as OpenMP.

Synchronous data-flow languages. The synchronous data-flow approach to the design and optimization of massively parallel, highly compute-efficient and predictable systems is quite unique. It is a long-term, largely fundamental effort motivated by well-established practices in the industry, mostly in the domain of high-definition language programming for hardware synthesis, and combines these practices with the best semantic properties of high-level programming languages. It is a holistic approach to combining productivity and scalability and compute-efficiency in a unified design, targeting the domain of real-time, predictable, stream-oriented parallel systems.

The closest work is the StreamIt language and compiler from MIT and, to a lesser extent, the Sequoia project from Stanford; these two mature projects made important contributions to the exposition and exploitation of thread-level parallelism on coarse-grain, distributed-memory, stream-oriented architectures. StreamIt is also much more limited in expressiveness, and Sequoia is more an incremental progress on how to compile and optimize a parallel program than a productivity-oriented design of a new concurrent programming paradigm. We are currently working on a shorter-term, intermediate milestone much closer to these two projects, but which makes it possible to expose and exploit multi-level parallelism at all stages of the design-space exploration and in all passes of the compiler.

Software components. Software components, as provided in the .Net or Java Beans frameworks, have little support for parallelism. Several years ago, a few frameworks proposed a component-like approach for parallelizing complex applications on large-scale multiprocessors, notably Cilk and Charm++. However, Cilk does not promote encapsulation; it is essentially a mechanism for spawning C functions. Charm++ provides both encapsulation and spawning, but it targets large-scale multiprocessors, even grid computing, and its overhead is rather large for fine-grain parallelism as required by multi-threaded/multi-core architectures.

Probably the closest work to our hardware support for components is the Network-Driven Processor proposed by Chen et al., which aims at implementing CMP hardware support for Cilk programs. In that design, thread creation decisions are not taken directly by the architecture: the hardware enacts the thread spawning decisions taken by the Cilk environment, but it provides sophisticated support for communications and work stealing between processors.

Alternative computing models/Spatial computing

The last research direction stems from possible evolutions of technology. While this research direction may seem very long term, processor manufacturers cannot always afford to investigate many risky alternatives way ahead in time. At the same time, for them to accept and adopt radical changes, they have to be anticipated long in advance. Thus, we believe prospective research is a core role for academic researchers, which may be less immediately useful to companies, but which can bring a real addition to their internal research activities, and which also carries the potential of bringing disruptive technology.

Prospective information on the future of CMOS technology suggests that, though the density of transistors will keep increasing, the switching speed of transistors will not increase as fast, and transistors may be more faulty (through either fabrication defects or execution faults). Possible replacement/alternative technologies, such as nanotubes, which have received a lot of attention lately, share many of these properties: high density, but slow components (possibly even slower than current components), a high rate of defects/faults, and greater difficulty in placing them other than in fairly regular structures.

In short, several potential upcoming technologies seem to bring a very large number of possibly faulty and not so fast components with layout issues. For architectures to take advantage of such technology, they would have to rely on space much more than time/speed to achieve high performance. Large spatial architectures bring a set of new architecture issues, such as controlling the execution of a program in a totally decentralized way, efficiently managing the placement of program tasks on the space, and managing the relative movement of these different tasks so as to minimize communications. Furthermore, beyond a certain number of processing elements, it is not even clear whether many applications will embed enough traditional task-level parallelism to take advantage of such large spaces, so applications may have to be expressed (programmed) differently in order to leverage that space. These two research issues are addressed in the two research activities described below.

Blob computing. Blob computing is both a spatial programming and architecture model which aims at investigating the utilization of a vast number of processing elements. The key originality of the model is to acknowledge that the chip space becomes too large for anything other than purely local actions. As a result, all architecture control becomes local. Similarly, the program itself is decomposed into a set of purely local actions/tasks, called Blobs, connected together through links; the program can create/destroy these links during its lifetime.

With respect to architecture control, for instance, the requirement that two tasks frequently communicating through a link must get close together in space, so that their communication latency is low, is expressed locally through a simple physical law emulating spring tension: the more communications, the higher the tension. Similarly, expressing that tasks should move away because too many tasks are grouped in the same physical spot is achieved through a law similar to pressure: as the number of tasks increases, the local pressure on neighbor tasks increases, inducing them to move away. Overall, many of these local control rules derive from physical or biological laws which achieve the same goal: controlling a large space through simple local interactions.
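The two rules above can be illustrated with a small simulation. This is only a sketch under stated assumptions: tasks live on a one-dimensional line, and the names `step`, `k_spring`, `k_pressure` and `radius` are hypothetical, not part of the Blob model itself. The loop is written centrally for readability, although each force only involves linked or nearby tasks.

```python
def step(positions, links, k_spring=0.1, k_pressure=0.05, radius=1.0):
    """One update of 1-D task positions driven by spring and pressure forces.

    positions: dict mapping task name -> coordinate (float)
    links: dict mapping (task_a, task_b) -> communication count
    """
    force = {task: 0.0 for task in positions}

    # Spring-like tension: linked tasks attract, proportionally to traffic.
    for (a, b), traffic in links.items():
        pull = k_spring * traffic * (positions[b] - positions[a])
        force[a] += pull
        force[b] -= pull

    # Pressure-like repulsion: tasks closer than `radius` push each other away.
    tasks = list(positions)
    for i, a in enumerate(tasks):
        for b in tasks[i + 1:]:
            gap = positions[b] - positions[a]
            if abs(gap) < radius:
                push = k_pressure * (radius - abs(gap))
                direction = 1.0 if gap >= 0 else -1.0
                force[a] -= direction * push
                force[b] += direction * push

    return {task: positions[task] + force[task] for task in positions}
```

Calling `step` repeatedly moves heavily communicating tasks together while spreading out crowded ones; the constants would need tuning in any real setting.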

With respect to programming, the user essentially has to decompose the program into a set of nodes and links. The program can create a static node/link topology that is later used for computations, or it can dynamically change that topology during execution. But the key concept is that the user is not in charge of placing tasks on the physical space, only of expressing the potential parallelism through task division. As can be observed, several of the intuitions of the CAPSULE environment of Section stem from this Blob model.

Bio-Inspired computing. As mentioned above, beyond a certain number of individual components, it is not even clear whether it will be possible to decompose tasks in such a way that they can take advantage of a large space. Searching for pieces of a solution to this problem has progressively led us to biological neural networks. Indeed, biological neural networks (as opposed to artificial neural networks, ANNs) are well-known examples of systems capable of complex information processing tasks using a large number of self-organized, but slow and unreliable components. And the complexity of the tasks typically processed by biological neurons is well beyond what is classically implemented with ANNs.

Emulating the workings of biological neural networks may at first seem far-fetched. However, the SIA (Semiconductor Industry Association) in its 2005 roadmap addresses for the first time “biologically inspired architecture implementations” as emerging research architectures, and focuses on biological neural networks as interesting scalable designs for information processing. More importantly, the computer science community is beginning to realize that biologists have made tremendous progress in the understanding of how certain complex information processing tasks are implemented using biological neural networks.

One of the key emerging features of biological neural networks is that they process information by abstracting it, and then only manipulate such higher abstractions. As a result, each new input (e.g., for image processing) can be analyzed using these learned abstractions directly, thus avoiding the need to rerun a lengthy set of elementary computations. More precisely, Poggio et al. at MIT have shown how combinations of neurons implementing simple operations such as MAX or SUM can automatically create such abstractions for image processing, and some computer science researchers in the image processing domain have started to take advantage of these findings.
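As a rough illustration of how SUM and MAX operations can combine into an abstraction, the sketch below applies a weighted-sum stage at every position of a one-dimensional signal, then a MAX stage that keeps only the strongest response. It is a toy loosely inspired by the description above, not the model of Poggio et al.; the function names and the 1-D setting are assumptions.

```python
def sum_units(signal, template):
    """SUM stage: weighted sum of the template at every position of the signal."""
    n = len(template)
    return [
        sum(w * x for w, x in zip(template, signal[i:i + n]))
        for i in range(len(signal) - n + 1)
    ]


def max_unit(responses):
    """MAX stage: keep the strongest response and discard where it occurred."""
    return max(responses)


def abstraction(signal, template):
    """Position-tolerant feature: strongest template match anywhere in the signal."""
    return max_unit(sum_units(signal, template))
```

Because the MAX stage discards position, the same pattern yields the same value wherever it appears in the input, which is the sense in which the output is a higher-level abstraction of the raw signal.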

We are starting to investigate the information processing capabilities of this abstraction programming method. While image processing is also our first application, we plan to later look at a more diverse set of example applications.

A complex systems approach to computing systems. More generally, the increased complexity of the computing systems at stake, whether due to a large number of individual components, a large number of cores or simply complex architecture/program pairs, suggests that novel design and evaluation methodologies should be investigated that rely less on known design information than on the observed behavior of the resulting global system. The main problem here is to be able to extract general characteristics of the architecture on the basis of measurements of its global behavior. For that purpose, we are using tools provided by the physics of complex systems (nonlinear time series analysis, phase transitions, multi-fractal analysis...).

We have already applied such tools to better understand the performance behavior of complex but traditional computing systems such as superscalar processors. And we are starting to apply them to sampling techniques for performance evaluation. We will be progressively expanding the reach of these techniques in our research studies in the future.

Project-team positioning

While spatial computing is an expression used for many purposes, the Blob computing work in our research group refers more to unconventional spatial programming paradigms such as MGS and Gamma.

There has recently been a surge of research works targeting novel technologies in computer architecture, but they have mostly focused on quantum computing, and, to our knowledge, few have focused on bio-inspired computing.

Furthermore, several researchers in the computer science community have recently started applying ideas from complex systems approaches. But their focus is usually on the software or algorithm part. Our use of complex systems approaches in the field of architecture is thus less investigated, although other groups have very recently expressed similar interests.

Transversal research activities: simulation and compilation

Since our research group has been involved in both compiler and architecture research for several years, we have progressively given increased attention to tools, partly because we found a lot of productivity was lost in inefficient or hard to reuse tools. Since then, both simulation and compilation platforms have morphed into research activities of their own. Our group is now coordinating the development of the simulation platform of the European HiPEAC network, and it is co-coordinating the development of the compiler research platform of HiPEAC together with University of Edinburgh.

Simulation platform

As processor architecture and program complexity increase, so does the development and execution time of simulators. Therefore, we have investigated simulation methodologies capable of increasing our research productivity. The key point is to improve the reuse, sharing, comparison and speed capabilities of simulators. For the first three properties, we are investigating the development of a modular simulation platform, and for the fourth property, we are investigating sampling techniques and more abstract modeling techniques. Our simulation platform is called UNISIM.

What is UNISIM? UNISIM is a structural simulation environment which provides an intuitive mapping from the hardware block diagram to the simulator; each hardware block corresponds to a simulation module. UNISIM is also a library of modules where researchers will be able to download and upload (contribute) modules.

What are the assets of UNISIM over other simulation platforms? UNISIM allows simulator parts (and architecture ideas) to be reused, exchanged and compared, something that is badly needed in academic research, and between academia and industry. Recently, we compared 10 different cache mechanisms proposed over the course of 15 years, and suggested that the progress of research has been anything but regular because of the lack of a common ground for comparison, and because simulation results are easily skewed by small differences in the simulator setup.

Other simulation environments or simulators also advocate modular simulation for sharing and comparison, such as the SystemC environment or the M5 simulator. While they do improve the modularity of simulators, in practice, reuse is still quite difficult because most simulation environments overlook the difficulty and importance of reusing control. For instance, SystemC focuses on reusing hardware blocks such as ALUs, caches, and so on. However, while hardware blocks correspond to the greatest share of transistors in the actual design, they often correspond to the smallest share of simulator lines. For instance, the cache data and instruction banks often correspond to a sizable number of transistors, but they merely correspond to array declarations in the simulator; conversely, cache control corresponds to few transistors but to most of the source lines of any cache simulator function/module. As a result, it is difficult to achieve reuse in practice, because control code is often not implemented in a way that lends itself to reuse.

In contrast, UNISIM focuses on the reuse of control code, and provides a standardized module communication protocol and a control abstraction for that purpose. Moreover, UNISIM will later come with an open library in order to better structure the set of available simulators and simulator components.
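To make the point about control code concrete, here is a minimal sketch of modules that talk only through a uniform port. It is not the UNISIM API: `Port`, `on_data`, `Memory` and `DirectMappedCache` are hypothetical names chosen for illustration. Note how the data bank is a single list, while almost all the code is control.

```python
class Port:
    """Hypothetical point-to-point link with a simple data/accept handshake."""

    def __init__(self):
        self.receiver = None

    def connect(self, receiver):
        self.receiver = receiver

    def send(self, message):
        # The receiving module's control logic decides whether to accept.
        return self.receiver.on_data(message)


class Memory:
    """Terminal module: records every request it receives."""

    def __init__(self):
        self.requests = []

    def on_data(self, address):
        self.requests.append(address)
        return True


class DirectMappedCache:
    """The storage is one list; lookup and miss handling make up the rest."""

    def __init__(self, lines):
        self.tags = [None] * lines
        self.to_memory = Port()
        self.hits = 0

    def on_data(self, address):
        index = address % len(self.tags)
        if self.tags[index] == address:
            self.hits += 1
        else:
            self.to_memory.send(address)  # miss: forward the request downstream
            self.tags[index] = address
        return True
```

With a shared protocol of this kind, a different cache policy could be swapped in by replacing only `DirectMappedCache`, leaving the processor side and the memory module untouched.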

Taking a realistic approach to simulator usage. Obviously, many research groups will not readily drop years of investment in their simulation platforms and switch to a new environment. We take a pragmatic approach: UNISIM is designed from the ground up to be interoperable with existing simulators, from industry and academia. We achieve interoperability by wrapping full simulators or simulator parts within UNISIM modules. As an example, we have a full SimpleScalar simulator stripped of its memory, wrapped into a UNISIM module, and plugged into a UNISIM SDRAM module.

Moreover, we are in the process of developing a number of APIs (for power, GUI, functional simulators, sampling,...) which will allow third-party tools to be plugged into the UNISIM engine. We call these APIs simulator capabilities or services.

With CMPs, communications become more important than the cycle-level behavior of cores. While the current version of UNISIM is focused on cycle-level simulators, we are developing a more abstract view of simulators called Transaction-Level Models (TLM). Later on, we will also allow hybrid simulators, using TLM for prototyping, and then zooming in on some components of a complex system.

Because CMPs also require operating system support for a large part, and because existing alternatives such as SIMICS are not open enough, we are also developing full-system support in our new simulators jointly with CEA. Currently, UNISIM has a functional simulator of a PowerPC750 capable of booting Linux.

Compilation platform

The free GNU Compiler Collection (GCC) is the leading tool suite for portable developments on open platforms. It supports more than 6 input languages and 30 target processor architectures and instruction sets, with state-of-the-art support for debugging, profiling and cross-compilation. It has long been supported by the general-purpose and high-performance hardware vendors. The last couple of years have seen GCC gaining momentum in the embedded system industry, and also as a platform for advanced research in program analysis, transformation and optimization.

GCC 4.4 features about 200 compilation passes, two thirds of them playing a direct role in program optimization. These passes are selected, scheduled, and parametrized through a versatile pass manager. The main families of passes can be classified as:

inter-procedural analyses and optimizations;

profiling, coverage analysis and instrumentation;

induction variable analysis, canonicalization and strength-reduction;

loop optimization, automatic vectorization and parallelization;

data layout optimization.

More advanced developments involving GCC are in progress in the Alchemy group:

global, whole program optimization (towards link-time and just-in-time compilation), with emphasis on scalability;

transactional memory extensions independent from yet compatible with OpenMP, and a recent intrusion into data-flow synchronous programming;

polyhedral loop nest optimization, with support for automatic vectorization in the Graphite branch of GCC; this branch has merged with GCC 4.4; it was initiated by the Alchemy group and a former student now at AMD (Sebastian Pop);

automatic parallelization, including the extraction and adaptation of loop and pipeline parallelism, with extensions towards speculative forms of parallelism.

The HiPEAC network supports GCC as a platform for research and development in compilation for high-performance and embedded systems. The network's activities on the compiler platform are coordinated by Albert Cohen.

Project-team positioning

Simulation (UNISIM). The rationale for the simulation effort, and the current situation in the community (dominance of monolithic simulators like SimpleScalar), have been described as part of the presentation of this research activity in Section . While several companies have internal modular simulation environments (ASIM at Intel, TSS at Philips, MaxSim at ARM, ...), they are neither standard nor disseminated. Only SystemC is gaining wide acceptance as a modular simulation environment with companies, less so with high-performance academic research groups. The academic research group with the most similar approach is the Liberty group at Princeton University, which has similarly been advocating modular simulation in the past few years. Due to the growing importance of CMP architectures, several research groups have since proposed CMP simulation platforms, some of them with modularity properties, such as M5, Flexus, GEMS or Vasa.

Finally, UNISIM is also participating in a French simulation platform called SoCLib through a recent contract (SoCLib). The technical goals of UNISIM are rather different, as we initially targeted processor decomposition into modules while SoCLib targeted systems-on-chip. As architectures are moving to multi-cores, the collaboration could become fruitful. UNISIM is also more focused on trying to gather, from the start, groups from different countries in order to increase the chances of adoption.

Compilation (GCC). We are also deeply committed to the enhancement and popularization of GCC as a common compilation research platform. The details of this investment are listed in Section . GCC is of course an interesting option for the industry, as development costs surge and returns in performance gains quickly diminish with the complexity of modern architectures. But GCC is also, and for the first time, a serious candidate to help researchers pool development efforts, to experiment with their contributions in a complete tool chain with production codes, to enable the sharing and comparison of these contributions in an open licensing model (a necessary condition for assessing the quality of experimental results), and to facilitate the transfer of these contributions to production environments (with an immediate impact on billions of embedded devices, general-purpose computers and servers). Learning from the failures of a well-known attempt at building a common compiler infrastructure (SUIF-NCI in the late 90s), we follow a pragmatic approach based on joint industry-academia research projects, training (tutorials, courses, see Section ), and direct contributions to the enhancement of the platform (e.g., for iterative optimization research and automatic parallelization).

Software

Main software developments

Veerle Desmet Sylvain Girbal Zheng Li Olivier Temam

Compilers & program optimization:

Polyhedral transformations in Open64

The WRaP-IT tool (WHIRL Represented as Polyhedra – Interface Tool) is a program analysis and transformation tool implemented on top of the Open64 compiler and of the CLooG code generator. The formal basis of this tool is the polyhedral model for reasoning about loop nests. We introduced a specific polyhedral representation that guarantees strong transformation compositionality properties. This new representation is used to generalize classical loop transformations, to lift the constraints of classical compiler frameworks and to enable more advanced iterative optimization and machine learning schemes. WRaP-IT, together with its loop nest transformation kernel called URUK (Unified Representation Universal Kernel), is designed to support a wide range of transformations on industrial codes, starting from the SPEC CPU2000 benchmarks, and recently considering a variety of media and signal processing codes (vision, radar, software radio, video encoding, and DNA-mining in particular, as part of the IST STREP ACOTES, ANR CIGC PARA, and a collaboration with Thales).

Based on this framework, we are also planning an extension of the polyhedral model to handle speculative code generation and transformation of programs with data-dependent control, and a direct search and transformation algorithm based on the Farkas lemma. These developments will take place in the GRAPHITE project: a migration/rewrite of our Open64-based software to the GCC suite. This project is motivated by the maturity of GCC 4.x, in terms of both performance and infrastructure, and by the massive industrial investment in GCC in recent years, especially in the embedded world. We are heavily involved in fostering research projects around GCC as a common compilation platform, and GRAPHITE is one of those projects.

Grigori Fursin developed the first prototype of an iterative optimization API for GCC, and started using this infrastructure for continuous and adaptive optimization research, in collaboration with the University of Edinburgh.

Candl

Cédric Bastoul Louis-Noël Pouchet

Candl is a free software tool and library devoted to data dependence computation. It has been developed as a basic building block of our optimizing compilation tool chain in the polyhedral model. From a polyhedral representation of a static control part of a program, it is able to compute exactly the set of statement instances in dependence relation. Hence, its output is useful for building program transformations that respect the original program semantics. This tool has been designed to be robust and precise. It implements some usual techniques for data dependence removal, such as array privatization or array expansion.
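To illustrate what "the set of statement instances in dependence relation" means, the sketch below enumerates flow dependences for a single-statement loop with a fixed trip count. This is a brute-force toy, not Candl's interface or algorithm (Candl works symbolically on polyhedra, for parametric bounds); the function `dependences` and its arguments are hypothetical.

```python
def dependences(n, write_index, read_index):
    """Enumerate flow dependences between instances of one loop statement.

    The loop is: for i in range(n): A[write_index(i)] = f(A[read_index(i)])
    Returns pairs (source, target) where iteration `source` writes a cell
    that a later iteration `target` reads.
    """
    pairs = []
    for source in range(n):
        for target in range(source + 1, n):
            if write_index(source) == read_index(target):
                pairs.append((source, target))
    return pairs
```

For the loop `A[i + 1] = A[i] + 1`, each iteration depends on the previous one, so the loop cannot be run in parallel without a transformation that respects these pairs.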

Clan

Cédric Bastoul Louis-Noël Pouchet Walid Benabderrahmane

Clan is a free software tool and library that translates particular parts of high-level programs written in C, C++, C# or Java into a polyhedral representation (strict or extended to irregular control flow). This representation may be manipulated by other tools to, e.g., achieve complex program restructuring (for optimization, parallelization or any other kind of manipulation). It has been created to avoid tedious and error-prone input file writing for polyhedral tools (such as CLooG, LeTSeE, Candl, etc.). Using Clan, the user only has to deal with source code based on the C grammar (such as C, C++, C# or Java).

CLooG

Cédric Bastoul Walid Benabderrahmane Louis-Noël Pouchet

CLooG is a free software tool and library that generates code for scanning Z-polyhedra. That is, it finds code (e.g. in C, FORTRAN...) that reaches each integral point of one or more parameterized polyhedra. CLooG was originally written to solve the code generation problem for optimizing compilers based on the polytope model. Nevertheless, it is now used in various areas, e.g. to build control automata for high-level synthesis or to find the best polynomial approximation of a function. CLooG may help in any situation where scanning polyhedra matters. While the user has full control over generated code quality, CLooG is designed to avoid control overhead and to produce very efficient code. Irregular extensions were integrated during 2009 in the irCLooG prototype.
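As an illustration of what scanning a parameterized polyhedron means, the sketch below hand-writes the kind of loop nest a code generator could produce for the triangle {(i, j) : 0 <= j <= i <= n}, and checks it against a naive enumeration. The code is written by hand for this example and is not CLooG output; both function names are hypothetical.

```python
def scan_triangle(n):
    """Loop nest visiting each integer point of {(i, j) : 0 <= j <= i <= n} once.

    The bounds of the inner loop depend on the outer index, so no per-point
    test (control overhead) is needed inside the loops.
    """
    for i in range(0, n + 1):
        for j in range(0, i + 1):
            yield (i, j)


def brute_force(n):
    """Reference: test every point of the bounding box against the constraints."""
    return [
        (i, j)
        for i in range(n + 1)
        for j in range(n + 1)
        if 0 <= j <= i <= n
    ]
```

The first version performs no wasted iterations, whereas the second visits the whole bounding box and filters; avoiding that overhead is the purpose of dedicated code generation.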

FADAlib

(http://www.prism.uvsq.fr/users/bem/fadalib/home.html). Dataflow dependence analysis for irregular programs (not static control programs). The library is developed by M. Belaoucha, funded by the Teraops (Systematic competitiveness cluster) and PARMA (ITEA2) projects.

MAQAO

(Modular Assembly Quality Analyzer and Optimizer, http://maqao.prism.uvsq.fr/). MAQAO analyzes static assembly code and dynamic application performance. Its objective is to help developers focus on code fragments that require performance tuning; it analyzes compiler optimizations and proposes tuning hints. MAQAO works on Itanium and Pentium architectures.

CAPSULE.

Olivier Certner Yves Lhuillier Zheng Li Pierre Palatin Olivier Temam

CAPSULE is our component-like parallelization environment. It consists of a run-time system which enacts task divisions. The environment is publicly disseminated at alchemy.futurs.inria.fr/capsule, along with several CAPSULE-parallelized benchmarks. CAPSULE was developed through several

Processor simulation:

archexplorer.org

The project can be summarized as an open and continuous exploration of the architecture design space, and takes the form of a service and web site we have just opened, www.archexplorer.org, hosting the software at the server side.

The goal of this project is twofold: to enable a more rigorous methodology in our domain by enabling the comparison of architecture ideas, and to propose a novel architecture design approach which relies on automatic design-space exploration, as an alternative, or at least a complement, to the current design process essentially driven by intuition and experience.

The server-side software is mostly based on UNISIM ( www.unisim.org), one of our large developments in software simulation: it corresponds to an environment on top of SystemC for truly enabling sharing, reuse and comparison, by offering a rigorous communication protocol between modules, architecture interfaces, and a set of simulators.

The archexplorer.org project is a joint project with Ghent University, Belgium (Veerle Desmet), and Thales TRT (Sylvain Girbal). I have started the project and I am coordinating the research, though Veerle and Sylvain are doing most of the implementation work; Veerle also has taken an active role in the project and can be considered as co-leading it.

UNISIM

The UNISIM platform has been described in Section . As of now, besides the simulation engine, the developments include a shared-memory CMP based on the PowerPC 405, functional simulators for the PowerPC 405 (and cycle-level) and the PowerPC 750, a functional system simulator of the PowerPC 750 capable of booting Linux, and 10 different cache modules corresponding to various research works. The following simulators or tools are currently under development: a functional and cycle-level version of the ARM 9 with full-system capability, a distributed-memory CMP based on the PowerPC 405 core, and an ST231 VLIW functional and, later on, cycle-level simulator. During his internship, Taj Khan integrated the CACTI (http://www.hpl.hp.com/personal/Norman_Jouppi/cacti4.html) power estimation model developed at HP Labs into UNISIM.

New Results

Program optimizations

Practical Approach

Grigori Fursin Albert Cohen Cédric Bastoul Louis-Noël Pouchet Walid Benabderrahmane

Here are the most recent key scientific achievements.

Empirically demonstrating that significant performance gains can be achieved with program optimizations, provided architecture phenomena are better factored in during the optimization process. Observing though that long compositions of program transformations are required.

Releasing the first machine-learning based research compiler (MILEPOST GCC), which combines the Interactive Compilation Interface and a static program feature extractor to automatically predict good program optimizations that reduce execution time, code size and compilation time for a given program on a given architecture, using predictive modeling and statistical techniques. This compiler opens many research opportunities and is used in the EU HiPEAC network of excellence as a default compilation platform. The development of MILEPOST GCC has been coordinated by Grigori Fursin (project coordinator: Michael O'Boyle). IBM issued two press releases about this work, in June 2008 and May 2009.

Showing that it is possible to capture the complex interplays between architecture and program behavior using machine-learning techniques, using that knowledge to drive program optimizations.


Developing multiversioning applications to make static programs adaptable at run-time.

Enabling predictive run-time code scheduling on heterogeneous (CPU-GPU) architectures.

Developing collective optimization approaches leveraging the knowledge of multiple users to transparently and continuously optimize programs or improve default compiler optimization heuristics.

Developing a polyhedral program representation that facilitates the composition of complex transformation sequences.

Addressing the code generation performance issues associated with polyhedral program representation.

Further leveraging polyhedral program representation to propose novel methods for scanning the space of program transformations.


Extending the polyhedral model to irregular control flow (thus significantly increasing its application domain) and demonstrating that the extension allows existing optimization techniques to apply successfully to relevant benchmarks (this work was submitted and accepted in 2009 for publication at Compiler Construction 2010).
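The idea of predicting optimizations from static program features, mentioned in the list above, can be sketched with a nearest-neighbor predictor. This is a toy under stated assumptions, not the MILEPOST GCC implementation: the two features, the training entries and the function `predict_flags` are invented for illustration, and only the GCC flag names themselves are real.

```python
import math


def predict_flags(features, training_set):
    """Return the flags recorded for the training program with the closest features."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, best_flags = min(training_set, key=lambda entry: distance(entry[0], features))
    return best_flags


# Toy training set, invented for illustration only.
# Hypothetical features: (number of loops, fraction of memory instructions).
# Each entry pairs a feature vector with flags assumed to have been found
# by an earlier iterative search on that program.
TRAINING = [
    ((12, 0.10), ["-O3", "-funroll-loops"]),
    ((2, 0.60), ["-O2"]),
    ((1, 0.20), ["-Os"]),
]
```

A new program is compiled with the flags of its most similar known program, which replaces a long iterative search by a single lookup; in practice the features would need to be normalized and the model validated.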

Collective Tuning Center

Grigori Fursin Olivier Temam

We created an open, community-driven, collaborative wiki-based portal, http://cTuning.org, that brings together academia, industry and end-users to develop intelligent collective tuning technology that automates and simplifies compiler, program and architecture design and optimization. This technology minimizes repetitive, time-consuming tasks and human intervention using collective optimization, run-time adaptation, statistical and machine learning techniques. It can already help end users and researchers to automatically improve execution time, code size, power consumption, reliability and other important characteristics of the available computing systems (ranging from supercomputers to embedded systems), and should eventually enable the development of emerging intelligent self-tuning adaptive computing systems. The Collective Optimization Database is intended to improve the quality of academic research by avoiding costly duplicate experiments and providing reproducible results.

Transitive Closure of Union of Affine Relations

Denis Barthou Anna Beletska Albert Cohen Konrad Trifunovic

We studied a method to compute the transitive closure of a union of affine relations on integer tuples. Within Presburger arithmetic, complete algorithms to compute the transitive closure exist for convex polyhedra only. In the presence of non-convex relations, little exists beyond special cases and incomplete heuristics. We introduce novel necessary and sufficient conditions defining a class of relations for which an exact computation is possible. These conditions can be relaxed to define larger classes where conservative approximations and/or more complex closed forms can be obtained. Our method is immediately applicable to a wide range of symbolic computation problems. It is illustrated on representative examples and compared with state-of-the-art approaches.
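To make the problem concrete, the sketch below computes the transitive closure of a union of two simple affine relations, {i -> i + 2} and {i -> i + 3}, restricted to a bounded integer domain. It uses plain fixed-point iteration on finite sets, which is not the symbolic method described above; it only shows what an exact closed form has to reproduce. Both function names are hypothetical.

```python
def transitive_closure(relation):
    """Least fixed point R+ = R union (R+ composed with R), on a finite relation."""
    closure = set(relation)
    while True:
        new_pairs = {
            (a, d)
            for (a, b) in closure
            for (c, d) in relation
            if b == c
        }
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def union_of_translations(n, steps):
    """Union of the relations {i -> i + s} for s in steps, with 0 <= i, i + s <= n."""
    return {(i, i + s) for s in steps for i in range(n + 1) if i + s <= n}
```

On this example the closure has the simple closed form {i -> j : j - i >= 2}, because every integer greater than or equal to 2 is a sum of 2s and 3s. The iteration only works for bounded domains, which is why symbolic, parametric methods are needed in a compiler.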

Optimizing code through iterative specialization

Minhaj Khan Henri-Pierre Charles Denis Barthou

Code specialization is a way to obtain significant improvement in the performance of an application. It works by exposing values of different parameters in source code. The availability of these specialized values enables the compilers to generate better optimized code. Although most of the efficient source code implementations contain specialized code to benefit from these optimizations, the real impact of specialization may however vary depending upon the value of the specializing parameter.

We have studied an iterative approach to code specialization. Starting from some specialized code, we search for a better version by re-specializing the code, followed by a low-level code analysis. The specialized versions fulfilling the required criteria are then transformed to generate another equivalent version of the original specialized code. The approach, tested on the Itanium 2 architecture using the gcc and icc compilers, shows significant performance improvements on different benchmarks.
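The basic step, exposing the value of a parameter in the source code, can be sketched as follows. The example generates a version of a power function specialized for a known exponent, so that the loop over the exponent disappears from the generated source. It is a toy illustration only: the helper `make_power` is hypothetical, and the iterative re-specialization and low-level analysis described above are not shown.

```python
def make_power(exponent):
    """Generate a function computing x**exponent for one fixed, known exponent.

    Returns the specialized function together with its generated source text.
    """
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    # With the exponent known, the multiplication chain is fully unrolled.
    body = " * ".join(["x"] * exponent) if exponent > 0 else "1"
    source = f"def power_{exponent}(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)
    return namespace[f"power_{exponent}"], source
```

For instance, `make_power(3)` produces a function whose body is `x * x * x`; a compiler given such code no longer needs to reason about the value of the exponent.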

Simulation of the Lattice QCD and Technological Trends in Computation

Mouad Bahi Denis Barthou Cédric Bastoul Walid Benabderrahmane Christine Eisenbeis Julien Jaeger Louis-Noël Pouchet

This is a joint ANR project “PetaQCD” with Lal (Orsay), Irisa Rennes (Caps/Alf), IRFU (CEA Saclay), LPT (Orsay), Caps Entreprise (Rennes), Kerlabs (Rennes), LPSC (Grenoble).

Simulation of the Lattice QCD is a challenging computational problem. Currently, technological trends in computation show multiple divergent models of computation. We are witnessing homogeneous multicore architectures and the use of on-chip or off-chip accelerators, in addition to the traditional architectural models.

On the verge of this technological abundance, assessing the performance tradeoffs of computing nodes based on these technologies is of crucial importance to many scientific computing applications.

In this study, we focus on assessing the efficiency and the performance expected for the Lattice QCD problem on representative architectures, and we project the expected improvements of these architectures and their impact on performance for the Lattice QCD. We additionally try to pinpoint the limiting factors for performance on these architectures. This work takes place in ANR PARA and ANR QCDNEXT (both 2005-2008) and has led to the project ANR PetaQCD (2009-2011).

Loop Optimization using Adaptive Compilation and Kernel Decomposition

J. Jaeger P. Oliveira S. Louise D. Barthou

We study a new hierarchical compilation approach for the generation of high-performance applications, relying on the use of state-of-the-art compilers. This approach is not application dependent and does not require any assembly hand-coding. It relies on the decomposition of the loop nests of the hottest functions in the application into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to dense linear algebra in 2005, reaching the performance of vendor libraries. The advantage of the generated kernels is that their performance no longer depends on the input data, but only on its location in the memory hierarchy. Using a performance model of the memory hierarchy, it is possible to find the best composition of kernels to use.

For larger applications, the code is no longer regular and data accesses are in particular irregular (use of indirections). Working with applications of project ANR PARA (MPEG4, QCD, oil simulation and BLAST), we study how to adapt the previous approach to these cases. When control is irregular (involving different execution path), we study the the WCET, in particular in the context of embedded applications for MPSOC architectures. This is the subject of an on-going collaboration with CEA/Lastre.

Dataflow Analysis for Irregular Programs and its applications M. Belaoucha S. Touati D. Barthou

Instance-wise dataflow analysis identifies, for each value read at some point during a program execution, the exact execution (instance) of the statement that defined it. This analysis generates more precise information than traditional dependence analyses and can therefore validate more optimizing transformations. An implementation of this analysis, as a standalone library, has been written by M. Belaoucha (funded by the Teraops and PARMA contracts), and its integration in gcc/Graphite is in progress.
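To make the notion of an instance-wise source concrete, here is a minimal sketch in Python. It is not the library mentioned above: it obtains the sources by brute-force tracing of one execution, whereas a static analysis would compute them symbolically. The loop being traced, the function name `instance_sources` and the index array `idx` are hypothetical and chosen only for illustration.

```python
def instance_sources(idx):
    """Trace the (hypothetical) irregular loop

        for i in range(len(idx)):
            S(i):  a[idx[i]] = a[i] + 1

    and return, for each iteration i, the iteration whose write produced
    the value of a[i] read by S(i), or None if that value is live-in
    (never written before the read).
    """
    last_write = {}   # array cell -> iteration that wrote it most recently
    sources = {}      # reading iteration -> producing iteration (or None)
    for i, target in enumerate(idx):
        # S(i) reads a[i] before it writes a[idx[i]].
        sources[i] = last_write.get(i)
        last_write[target] = i
    return sources


if __name__ == "__main__":
    # Iterations 0 and 1 both write a[2]; only iteration 1 reaches the
    # read of a[2] in iteration 2, because it overwrites iteration 0.
    print(instance_sources([2, 2, 0, 1]))
```

In this example a memory-based dependence analysis would report that iteration 2 depends on both iterations 0 and 1, since both write the cell it reads. The instance-wise result keeps only iteration 1, which is the kind of extra precision that can validate additional transformations.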

Joint architecture/programming approaches

Here are the most recent key scientific achievements.

A joint programming/architecture approach for streaming applications which is successfully used at NXP (formerly Philips Semiconductors). An extension of the synchronous Kahn process networks using a relaxed notion of synchrony, called N-synchrony, applied to the efficient and scalable parallelization of media streaming applications.

CAPSULE: division-based parallelization Olivier Certner Zheng Li Olivier Temam

We have decided to build on a popular trend in software engineering, software components, which blends well with multi-cores: it proposes to decompose a large program into smaller, fully independent parts, just as multi-cores decompose large monolithic architectures into a set of smaller cores. In itself, componentization does not yield much parallelism; our contribution is to augment components with the ability to divide, yielding as much parallelism as resources allow. The programmer is only exposed to this very simple notion of parallelization, and the role of the architecture and/or the run-time system is to manage parallel tasks. We have shown that this approach performs well on programs with irregular control flow and complex data structures, which are typically difficult to parallelize efficiently. We first demonstrated the approach on multi-threaded single-cores, then on shared-memory multi-cores, and have recently implemented the hardware support for distributed-memory multi-cores.
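The idea of a component that divides only when resources allow can be sketched as follows. This is an illustrative Python analogue using threads, not the CAPSULE implementation or its API: the `Resources` class stands in for whatever run-time or hardware mechanism grants or denies a division request, and `divisible_sum`, `probe`, `release` and `grain` are hypothetical names.

```python
import threading


class Resources:
    """Hypothetical stand-in for the run-time system or hardware that
    grants or denies division requests."""

    def __init__(self, max_extra_workers):
        self._slots = threading.Semaphore(max_extra_workers)

    def probe(self):
        # Non-blocking request: True if a worker slot was granted.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def divisible_sum(data, lo, hi, res, grain=4):
    """Sum data[lo:hi]; the task divides in two whenever a probe succeeds."""
    if hi - lo <= grain:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    if res.probe():
        # Division granted: the right half runs in a new worker.
        out = {}

        def child():
            out["right"] = divisible_sum(data, mid, hi, res, grain)

        worker = threading.Thread(target=child)
        worker.start()
        left = divisible_sum(data, lo, mid, res, grain)
        worker.join()
        res.release()
        return left + out["right"]
    # Division denied: continue sequentially in the current worker.
    return (divisible_sum(data, lo, mid, res, grain)
            + divisible_sum(data, mid, hi, res, grain))


if __name__ == "__main__":
    values = list(range(100))
    print(divisible_sum(values, 0, len(values), Resources(3)))
```

The code that expresses the work contains no thread count or scheduling policy. The same function runs sequentially with `Resources(0)` and spreads over more workers when more slots are available, which is the property the approach relies on for irregular programs.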

Alternative computing models/Spatial computing Compound circuits Hugues Berry Sylvain Girbal Olivier Temam Sami Yehia

Besides parallelization, the other "spatial" scalability path is customization. Customization, which is very popular in embedded systems, has many assets: custom circuits are cheaper, faster and more power efficient than processors. They can also speed up tasks which are by nature sequential (not parallel), so that they are complementary, not an alternative, to parallelism. Their main limitation is flexibility. As a result, we have investigated techniques which can improve the flexibility of custom circuits while achieving the best possible performance, area and power properties. The first technique, which relied on collapsing processor instructions into circuits, was developed as part of the PhD of Sami Yehia, who went on to work at ARM research to apply such approaches to embedded processors, and later to Thales TRT. More recently, we developed together a novel bottom-up approach where we show how to efficiently combine any number of custom circuits to create a far more flexible compound circuit, without sacrificing the performance, area and power benefits of custom circuits. That approach was recently patented jointly with Thales.

ANNs as accelerators Olivier Temam

We make the case for considering a hardware ANN as a flexible yet energy-efficient, high-performance and defect-resilient accelerator, ideally positioned to tackle upcoming technology, application and programming challenges. For now, we focus this study on one type of algorithm, classifiers, which are nonetheless commonly used in many RM applications. We present a hardware accelerator design for ANNs, geared towards robustness and high performance. We show that transistor density has reached a level where it is now possible to spatially expand in hardware an ANN capable of handling medium-sized applications. Spatial expansion has multiple benefits in terms of robustness, energy efficiency, performance and scalability over previous time-multiplexed designs.

We synthesized our design at 90nm and showed that such a spatially expanded ANN accelerator achieves orders of magnitude reductions in energy, and similar improvements in performance, with respect to the same task executed on a modern processor at the same technology node, at a fraction of the on-chip area. This justifies scaling down just one core in order to reap the energy and performance benefits.

Bio-Inspired Computing Systems biology of the role of glial cells in brain cell communications Hugues Berry Eshel Ben Jacob Maurizio DePitta Vladislav Volman Mati Goldberg

The 20th century witnessed the crystallization of the neuron as the fundamental building block responsible for higher brain functions. Yet, neurons are not the most numerous cells in the brain. In fact, up to 90% [...]

This work is a long-term collaboration with Eshel Ben Jacob, The Maguy-Glass Chair in Physics of Complex Systems, School of Physics and Astronomy, Tel Aviv University, Israel. As a first step, we derived and investigated a concise mathematical model for glutamate-induced astrocytic intracellular Ca2+ dynamics that captures the essential biochemical features of the regulatory pathway of inositol 1,4,5-trisphosphate (IP3). Compared with previous similar models, our three-variable model includes a more realistic description of IP3 production and degradation pathways, lumping together their essential nonlinearities within a concise formulation. Using bifurcation analysis and time simulations, we demonstrate the existence of new putative dynamical features. The cross-couplings between IP3 and Ca2+ pathways endow the system with self-consistent oscillatory properties and favor mixed frequency-amplitude encoding modes over pure amplitude-modulation ones.

This article has been selected for the Faculty of 1000 Biology: http://www.f1000biology.com/article/id/1163674/evaluation. Our ongoing work investigates the biophysical mechanisms of calcium wave propagation in astrocyte populations and the astrocyte regulation of synaptic transmission between neurons.

AMYBIA : Aggregating MYriads of Bio-Inspired Agents Hugues Berry Nazim Fates Bernard Girau

In the framework of the ARC Amybia, we are searching for innovative schemes of decentralised and massively distributed computing. We mainly aim at contributing to this at three levels. At the modelling level, we think that biology provides us with complex and efficient models of such massively distributed behaviours. We start our study by addressing the decentralised gathering problem with the help of an original model of aggregation based on the behaviour of social amoebae. At the simulation level, our research mainly relies on achieving large-scale simulations and on obtaining large statistical samples. Mastering these simulations is a major scientific issue, especially considering the imposed constraints: distributed computations, parsimonious computing time and memory requirements. Furthermore, it raises further problems, such as how to handle asynchronism, randomness and statistical analysis. At the hardware level, the challenge is to constantly confront our models with the actual constraints of distributed computing in practice. The main idea is to consider the hardware as a kind of sanity check. Hence, we intend to implement and validate our distributed models on massively parallel computing devices. In return, we expect that the analysis of the scientific issues raised by these implementations will influence the definition of the models themselves.

As a first step, we have recently proposed a bio-inspired system based on the so-called Greenberg-Hastings cellular automaton (GHCA), to achieve decentralized and robust gathering of mobile agents scattered on a surface or of computing tasks scattered on a massively distributed computing medium. As usual with such models, GHCA has mainly been studied on a homogeneous and regular lattice. However, in the context of massively distributed computing, one also needs to consider unreliable elements and defect-based noise.

A first analysis showed that in this case, phase transitions could govern the behaviour of the system. Our next goal was to broaden the knowledge on stochastic reaction-diffusion media by investigating how such systems behave when various types of noise are introduced. Hence, we study GHCA with noise and topological irregularities of the grid taken into account. Decreasing the probability of excitation qualitatively changes the behaviour of the system from an active to an extinct steady state. Simulations show that this change occurs near a critical threshold; it is identified as a nonequilibrium phase transition which belongs to the directed percolation universality class. We test the robustness of the phenomenon by introducing persistent defects in the topology: directed percolation behaviour is conserved. Using experimental and analytical tools, we suggest that the critical threshold varies as the inverse of the average number of neighbours per cell. The inverse proportionality law we presented paves the way for obtaining generic laws (even approximate ones) to predict the position of the critical threshold in various simulation conditions.
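A noisy Greenberg-Hastings automaton of the kind discussed above can be simulated in a few lines. The following Python sketch rests on several assumptions that are not taken from the text: three cell states, a square grid with periodic boundaries, a four-neighbour (von Neumann) neighbourhood, and noise modelled as a probability `p_excite` that a resting cell with at least one excited neighbour actually becomes excited. All function and parameter names are hypothetical.

```python
import random

REST, EXCITED, REFRACTORY = 0, 1, 2


def ghca_step(grid, p_excite, rng):
    """One synchronous update of a noisy Greenberg-Hastings automaton."""
    n = len(grid)
    new = [[REST] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            state = grid[r][c]
            if state == EXCITED:
                new[r][c] = REFRACTORY
            elif state == REFRACTORY:
                new[r][c] = REST
            else:
                # Resting cell: look at the four nearest neighbours
                # (periodic boundaries, an assumption of this sketch).
                neighbours = (
                    grid[(r - 1) % n][c], grid[(r + 1) % n][c],
                    grid[r][(c - 1) % n], grid[r][(c + 1) % n],
                )
                if EXCITED in neighbours and rng.random() < p_excite:
                    new[r][c] = EXCITED
    return new


def excited_density(grid):
    """Fraction of cells currently in the excited state."""
    n = len(grid)
    return sum(row.count(EXCITED) for row in grid) / (n * n)


def run(n, p_excite, steps, seed=0):
    """Start from a random grid and return the excited density after `steps`."""
    rng = random.Random(seed)
    grid = [[rng.choice((REST, EXCITED, REFRACTORY)) for _ in range(n)]
            for _ in range(n)]
    for _ in range(steps):
        grid = ghca_step(grid, p_excite, rng)
    return excited_density(grid)


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(p, run(n=32, p_excite=p, steps=200))
```

Sweeping `p_excite` and recording the long-time excited density is one simple way to look for the active-to-extinct change described above. Locating the threshold reliably would require larger grids and many random seeds, which this sketch does not attempt.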

The Impact of Network Topology on Self-Organizing Maps Hugues Berry Fei Jiang Marc Schoenauer

The connectivity structure of complex networks (i.e. their topology) is a crucial determinant of their information transfer properties. Hence, the computation made by complex neural networks, i.e. neural networks with complex connectivity structure, could also depend on their topology. For instance, recent studies have shown that introducing a small-world topology in a multilayer perceptron increases its performance. However, other studies have inspected the performance of Hopfield or Echo state networks with small-world or scale-free topologies and reported more mixed results.

We study instances of complex neural networks, i.e. neural networks with complex topologies. We use Self-Organizing Map neural networks whose neighborhood relationships are defined by a complex network to classify handwritten digits. We show that topology has a small impact on performance and on robustness to neuron failures, at least at long learning times. Performance may however be increased (by almost 10%) by evolutionary optimization of the network topology. In our experimental conditions, the evolved networks are more random than their parents, but display a more heterogeneous degree distribution. On the limited experiments presented here, it thus seems that the performance of the network is only weakly controlled by its topology. Interestingly, though, these slight differences can nevertheless be exploited by evolutionary algorithms: after evolution, the networks are more random than the initial small-world topology population. Their more heterogeneous connectivity distribution may indicate a tendency to evolve toward scale-free topologies. Unfortunately, this assumption can only be tested with large networks, for which the shape of the connectivity distribution can be determined unambiguously, but whose artificial evolution could not be carried out for reasons of computation cost.

Similarly, future work will have to address other classical computation problems for neural networks before we are able to draw any general conclusion.

Cortical Microarchitecture: Computing by Abstractions Hugues Berry Olivier Temam Mikko Lipasti Atif Hashmi

Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically from traditional Von Neumann machines. These brain-like architectures, which are premised on our understanding of how the human neocortex computes, are highly fault-tolerant, averaging results over large numbers of potentially faulty components, yet manage to solve very difficult problems more reliably than traditional algorithms. A key principle of operation for these architectures is that of automatic abstraction: independent features are extracted from highly disordered inputs and are used to create abstract invariant representations for external entities expressed in the inputs. This feature extraction is applied hierarchically, leading to increasing levels of abstraction at higher layers in the hierarchy.

In collaboration with Mikko Lipasti, University of Wisconsin at Madison, WI, USA, we introduce a behavioral model for this process, using biologically plausible neuron-level behavior and structure, and illustrate it with an image recognition task. We also introduce a computationally effective higher-order model, one that represents the behavior of hundreds of neurons in a cortical column using just two perceptrons, and show that it is capable of this same task. These models are a first step towards developing a comprehensive and biologically plausible understanding of the computational algorithms and microarchitecture of computing systems that mimic the human neocortex.

Biological neural networks as bio-inspiration sources for future architectures Hugues Berry Olivier Temam

Beyond a certain number of individual components, it is not even clear whether it will be possible to decompose tasks in such a way that they can take advantage of such a large number of computing resources. Searching for a solution to this problem has progressively led us to biological neural networks. Indeed, biological neural networks (as opposed to artificial neural networks, ANNs) are well-known examples of systems capable of complex information processing tasks using a large number of self-organized, but slow and unreliable components. And the complexity of the tasks typically processed by biological neurons is well beyond what is classically implemented with ANNs.

We are starting to investigate the information processing capabilities of this abstraction programming method. While image processing is our first application, we plan to later look at a more diverse set of example applications.

Spatial complexity of reversible computing Mouad Bahi Christine Eisenbeis

Especially since the work of Bennett on the reversibility of computation and on how to make a computation reversible, the relationship between reversibility, energy, computation and space complexity has gained interest in many domains of computer science. This direction could help us understand the physical limitations of processor performance. We have chosen to start by studying the space complexity of a DAG computation, defined as the maximum number of registers needed for performing the computation in both directions. This criterion is closely related to our more classical criterion of “register saturation”. We have defined heuristics for computing this number and have performed systematic experiments on all possible graphs of a given size. The first experiments tend to show that for a graph of size n, no more than n/2 registers are needed to perform the computations in both directions, compared to the forward direction. This latter number can be considered as the “garbage” of the computation. More work is needed to prove or disprove this result more formally and to understand the hypotheses under which it is valid. In this work, all operations in the DAG are assumed to be reversible.
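As a small illustration of the forward-direction criterion only, the Python sketch below computes the register need of one schedule of a DAG and, by exhaustive enumeration, the maximum over all schedules, which is how register saturation is read here. It does not model the reverse direction or the heuristics mentioned above. The liveness convention is an assumption of the sketch: a value occupies a register from the step that computes it up to and including the step of its last consumer, and values with no consumer stay live until the end. The names `register_need` and `register_saturation` and the example DAG are hypothetical.

```python
from itertools import permutations


def is_topological(order, preds):
    """True if every node appears after all of its operands."""
    pos = {v: i for i, v in enumerate(order)}
    return all(pos[p] < pos[v] for v in order for p in preds[v])


def register_need(order, preds):
    """Maximum number of simultaneously live values for one schedule."""
    pos = {v: i for i, v in enumerate(order)}
    end = len(order) - 1
    last_use = {}
    for v in order:
        for p in preds[v]:
            last_use[p] = max(last_use.get(p, 0), pos[v])
    need = 0
    for t in range(len(order)):
        # Values without consumers (outputs) are kept until the last step.
        live = sum(1 for v in order if pos[v] <= t <= last_use.get(v, end))
        need = max(need, live)
    return need


def register_saturation(preds):
    """Brute force over all schedules; only practical for tiny DAGs."""
    return max(register_need(order, preds)
               for order in permutations(preds)
               if is_topological(order, preds))


if __name__ == "__main__":
    # d = a op b ; e = d op c
    dag = {"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["d", "c"]}
    print(register_need(("a", "b", "d", "c", "e"), dag))
    print(register_saturation(dag))
```

Because the enumeration is factorial in the number of nodes, it is only meant to make the definition tangible on graphs of a handful of nodes.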

Contracts and Grants with Industry Collaborations involving industry Thales TRT

Collaboration with Thales TRT, and the CNRS-Thales lab on several topics: customization, simulation, design-space exploration, heterogeneous systems programming, memristors. As mentioned before, the research work on customization recently led to a joint patent application. Main contact: Sami Yehia.

STMicroelectronics

Collaboration with STMicroelectronics on program parallelization and architecture support for parallelization.

Philips Semiconductors, now NXP

We have had regular collaborations with Philips for almost 10 years now, including direct contracts. Currently, we are involved in several grants with Philips (IP SARC, Marie-Curie fellowships, ACOTES). Philips Semiconductors has recently become NXP.

National and international collaborative grants

“PAGDEG” (Causes and consequences of protein aggregation in cellular degeneration): an ANR-funded project (call Piribio) on modeling and simulation of cellular degeneration in bacteria (2010-2012). Supervisor: A. Lindner. Total amount funded: 450 keuros.

Large-Scale initiative “ColAge” (Natural and engineering solutions to the control of bacterial growth and aging: A systems and synthetic biology approach): an INRIA-INSERM joint grant on modeling and simulation of systems biology (2008-2011). Supervisor: H. Berry. Total amount funded: 430 keuros.

Arch²Neu

(150kEuros) This project aims at designing a novel type of hardware for digital signal processing (sounds, images,...) based on analog neural networks. This design shall be significantly more defect and fault tolerant than previous designs, while achieving very low power. This project is a joint INRIA Alchemy/CEA LETI ANR project as part of the “Return of PostDoc”: we have attracted a young French postdoc at University of California, originally from Supelec, to come back to France and set up this new project (2009-2012).

ARC MACACC

(20 keuros): “Modeling Cortical Activity and Analysing the Brain Neural Code”, Supervision: B. Cessac (Institut Non Linéaire de Nice). Other partners: Cortex (INRIA Nancy), Institut des Neurosciences cognitives de la Méditerranée (Marseille), Lab. Jean-Alexandre Dieudonné (Nice), Odyssee (INRIA Sophia).

ARC AMYBIA

(20 keuros): “Aggregating MYriads of Biologically-Inspired Agents”, Supervision: N. Fates (Maia, INRIA Nancy). Other Participants: B. Girau (Cortex, INRIA Nancy).

PEPS-STI CNRS MARTINE

(5 keuros): “Multifractal Analysis to Resolve information Transfer In NEural networks”, Supervision: M. Quoy (ETIS, ENSEA, U. Cergy-Pontoise). Other Participants: F. Germinet (AGP, U. Cergy-Pontoise).

Appel à Idées 2008 de l'ISC-PIF

(4 keuros): “Organization of a conference on spatial/amorphous computing”, Supervision: H. Berry. Other Participants: F. Gruau (Alchemy), O. Michel, J.L. Giavitto (Ibisc, U. Evry).

“Action d'Envergure” ColAge

An INRIA-INSERM joint grant (3 years) on modeling and simulation of systems biology (official start Feb. 2009). Supervisor: H. Berry. Total amount funded (for 2008): 41 keuros.

GGCC: EU, MEDEA+ program

ITEA Call 8 project on global analysis and optimization in GCC. Our involvement lies in the compiler infrastructure, static analysis in the polyhedral model, and feature extraction for global and continuous optimization. With CEA (dpt. of energy), UPM (Spain), SICS (Sweden), major industrial partners (Airbus, Telefonica, Bertin) and SMEs (Mandriva, MySQL, and others). 04/2006–04/2009.

ACOTES: EU, IST program

FP6 STREP on language and compiler support for high-performance streaming applications. We are one of the largest contractors in the project, with major involvement in interprocedural optimization and loop transformations for concurrent distributed streaming applications; it is both a programming model and compiler project. With Philips Research (Eindhoven), IBM Research (Haifa), STMicroelectronics (AST Lugano), Nokia (Helsinki), and UPC (Barcelona). 05/2006–05/2009.

MilePost: EU, IST program

FP6 STREP on machine-learning compilation. This project matches one of the core directions of the project: iterative optimization research, with an emphasis on making iterative compilation methods practical in real development environments. With IBM Research (Haifa), ARC (London), CAPS Entreprise (Rennes), IRISA (Rennes), and University of Edinburgh. 05/2006–05/2009.

PetaQCD:

ANR project on the design of architecture, software tools and algorithms for Lattice Quantum Chromodynamics. With Lal (Orsay), Irisa Rennes (Caps/Alf), IRFU (CEA Saclay), LPT (Orsay), Caps Entreprise (Rennes), Kerlabs (Rennes), LPSC (Grenoble).

PARA: French Ministry of Research

ANR CIGC project on multi-level parallel programming and automatic parallelization. We are involved in automatic code generation approaches for domain-specific and target-specific optimizations; iterative and polyhedral compilation methods are explored in an application-specific context. With Bull, University of Versailles, LaBRI (University of Bordeaux), INT (Evry), CAPS Entreprise (Rennes). 01/2006–01/2009.

APE: French Ministry of Research

ANR RNTL project on parallel real-time applications for embedded systems. We are developing a component-based environment called CAPSULE for distributed-memory processors. It will be applied to a novel processor of STMicroelectronics and tested on applications from Thales. With STMicroelectronics, Thales, University of Paris 6, CEA. 01/2006–01/2009.

PSYCHES: EU, IST program

Marie Curie ToK IAP (Transfer of Knowledge, Industry-Academia Partnership); long-term exchange of personnel and 2 years of post-doc; with Philips Research (Eindhoven) and UPC (Barcelona). 03/2006–03/2009.

SARC: EU, IST program

FP6 FET Proactive IP on advanced computer architecture. The goal is to address all the aspects of a scalable processor architecture based on multi-cores. It includes programming paradigms, compiler optimization, hardware support and simulation issues. CAPSULE is being used as the component-based programming approach, and UNISIM as the simulation platform. 01/2006–01/2010.

Embedded TeraOps

A SYSTEMATIC “Pôle de Compétitivité” regional funding for the development of large-scale embedded multi-core architectures, coordinated by Thales. It will initially focus on streaming applications but will later target programs with more complex control flow. Thales, Dassault, Thomson, CEA, INRIA. 01/2006–01/2010.

MODSIM

MODSIM is an INRIA grant for a joint international team between INRIA and Princeton University. The goal is the development of the UNISIM simulation platform. With Princeton University. 01/2006–12/2009.

ACI ASTICO Grant

French Ministry of Research grant to explore biological neuron networks as possible sources of inspiration for future computing systems, with a focus on the complex structure of these networks. Our aim is both to investigate bio-inspired computing systems and to develop original approaches for the modeling and understanding of biological neural networks. With University of Cergy-Pontoise, University of Nice-Sophia-Antipolis and University of Paris 6. 01/2005–01/2008.

NoE HiPEAC and HiPEAC2

HiPEAC is a network of excellence on High-Performance Embedded Architectures and Compilers. It involves more than 70 European researchers from 10 countries and 6 companies, including ST, Infineon and ARM. The goal of HiPEAC is to steer European research on future processor architectures and compilers to key issues, relevant to the European embedded industry.

The HiPEAC consortium has submitted a second edition of the network, which officially started in November 2007 for a further four years. Olivier Temam is a member of the steering committee. 09/2004–11/2011. Mounira Bachir spent a 3-month internship (January 14th, 2009 to April 14th, 2009) at Trinity College Dublin under the direction of David Gregg.

FET OMP

OpenMediaPlatform (OMP) aims at overcoming the cost and time-to-market risks that affect the development of media-rich evolving services for the growing range of networked consumer devices. It will provide an open architecture, combining two main streams of modern software engineering: (1) open application programming interfaces (APIs) for media components, to be enhanced over standards like Khronos OpenMAX, and (2) new resource-aware system design tools and standards-compliant static/dynamic compilers that ease the design, implementation and efficient execution of media services on a range of consumer platforms. 01/2008–12/2009.

ACI Nanosys

French Ministry of Research grant to study the impact of alternative technologies, particularly nanotubes, on future computing circuits and architectures. With a large array of French laboratories in VLSI and architecture design.

Hugues Berry is a member of GdR Dycoec: “Dynamique et contrôle des ensembles complexes” (http://www.coria.fr/dycoec/)

Other Grants and Activities Informal collaborations

Cédric Bastoul collaborates with Sébastien Salva from Clermont 1 University and Clément Delamare from Direction Générale des Impôts on web service client parallelization. He collaborates with various people at Reservoir Labs Inc. (New York) on high-level compilation for multicore architectures.

Denis Barthou collaborates with these people.

W. Jalby, Univ. of Versailles St Quentin, PRISM lab.

S. Louise, CEA/Lastre.

S. Rajopadhye, U. of Colorado, USA.

Hugues Berry collaborates with these people.

Eshel Ben-Jacob (School of Physics and Astronomy, Tel Aviv University, Israel)

Bruno Cessac (Lab. J.A. Dieudonné, Université Nice-Sophia Antipolis; Team-Project NeuroMathComp, INRIA Sophia)

Bruno Delord, Stéphane Genet (ISIR, CNRS UMR 7222 / Université Pierre et Marie Curie, Paris)

Nazim Fates (MAIA, INRIA Lorraine, Nancy), Bernard Girau (Cortex, INRIA Lorraine, Nancy)

Annick Lesne (LPTMC - UMR CNRS 7600, Université Pierre et Marie Curie, Jussieu, Paris)

Ariel Lindner (INSERM U571, Faculté de Médecine Necker-Enfants Malades, Paris)

Mikko Lipasti (Dept Electrical & Computer Engineering, University of Wisconsin, Madison, USA)

Olivier Michel, A. Spicher (LACL, U. Paris 12 Créteil)

Marc Schoenauer (TAO, INRIA Saclay)

Grigori Fursin collaborates with the following researchers:

Michael O'Boyle, University of Edinburgh, UK

Chengyong Wu, ICT, China

Nacho Navarro and Marisa Gil, UPC, Spain

Mircea Namolaru, Ayal Zaks, Bilha Mendelson, IBM Haifa, Israel

Francois Bodin, CAPS Entreprise/IRISA, France

Olivier Temam collaborates with these people.

Mikko Lipasti (University of Wisconsin).

Kathryn McKinley (University of Texas).

Veerle Desmet, Lieven Eeckhout (Ghent University).

Chengyong Wu (ICT, Beijing, China)

Daniel Gracia-Perez, Gilles Mouchard (CEA LIST).

Sylvain Girbal, Sami Yehia (Thales TRT).

Bruno Jego (ST).

ICT

Collaboration with Prof. Chengyong Wu at ICT, China, on machine-learning techniques for compilers and data centers.

University of Wisconsin

Collaboration with Mikko Lipasti, University of Wisconsin, on bio-inspired architectures.

University of Texas

Collaboration with Kathryn McKinley at University of Texas, Austin, on novel component-based programming approaches for heterogeneous and homogeneous computing systems.

Ghent University and Thales

Collaboration with Veerle Desmet at Ghent University, Belgium, on design-space exploration. As part of this collaboration, we recently set up the www.archexplorer.org web site and related project.

University of California Santa Cruz

Thanks to a 2006-2007 France-Berkeley travel grant, we are starting a collaboration with the group of Jose Renau. The topics are close to the infrastructure work of Alchemy: fast and accurate simulation of multi-core processors, and support for a modern parallelisation infrastructure in GCC. Jose Renau is a member of the OpenSparc consortium and contributed to major advances in architecture and compiler support for thread-level speculation.

University of Edinburgh

For the past 3 years, we have had a very active cooperation with the University of Edinburgh on iterative optimization; Grigori Fursin got his PhD from the University of Edinburgh. This collaboration has resulted in a series of joint articles.

University of Illinois

We have a regular collaboration with the group of David Padua, Urbana-Champaign, Illinois, which started 6 years ago, with multiple joint publications and travel grants (CNRS-UIUC). Research focused on high-performance Java, dependence and alias analysis, processors in memory, and currently on adaptive program generation and machine learning compilers.

Texas A&M University

We started a regular exchange of ideas and personnel with the Parasol laboratory, led by Lawrence Rauchwerger, a reference in parallel language compilation and architecture support. Prof. Rauchwerger visited Alchemy for a total of 5 months in the last 3 years, and many of us visited TAMU for shorter periods. The collaboration led to numerous advances in the understanding of the main challenges and pitfalls in scalable parallel processing, and also facilitates the organization of multiple academic events (e.g., the upcoming PACT'07).

Ohio State University

We have a regular collaboration with the group of Prof. Sadayappan, Columbus, Ohio. Recently, we also started to publish together. We invited Uday Bondhugula, a PhD student from Ohio, for two months, and Louis-Noël Pouchet will start a postdoc in Ohio in January 2010. The collaboration focuses on polyhedral compilation and new approaches to loop tiling for automatic parallelization.

Louisiana State University

We have a regular collaboration with the group of Prof. Ramanujam, Baton Rouge, Louisiana. Recently, we also started to publish together. Mohammed Fellahi was scheduled to spend a 3-month internship in Baton Rouge in 2009, but our plans were cancelled because of difficulties in obtaining a US visa. The collaboration focuses on code generation for polyhedral transformations, and automatic parallelization for GPUs.

UPC

We have a regular collaboration with UPC, Barcelona, which started 7 years ago, with several groups on topics ranging from program optimization to micro-architecture, resulting in several publications and joint contracts.

University of Passau

We have a regular collaboration with the group of Christian Lengauer and Martin Griebl, Passau, Germany, which started 10 years ago, with multiple joint publications and travel grants (Procope, Ministry of Foreign Affairs). Our collaboration focused on polyhedral compilation techniques and recently headed towards domain-specific program generation and metaprogramming.

Lal-LPT, University of Paris Sud

We have started a collaboration with physicists working on LQCD (Lattice Quantum Chromodynamics). We focus on the next generation of computers that would gain an order of magnitude speedup over their current APE-next processor (sustained 300 GFlops).

Paris 6 University

The properties of biological neural networks that are of direct interest to architecture research are in part due to the intrinsic properties of the individual neurons. We are collaborating with the neuroscience research lab ANIM (INSERM U742) to develop simulation and modelling studies of specific properties of individual biological neurons such as time handling or plasticity and memory properties.

Project-Team TAO, INRIA Futurs

We started a collaboration with Marc Schoenauer on evolutionary algorithms for optimization of complex systems. More precisely, we study evolutionary methods to optimize the complex structure of large neural networks. The aim is to find whether there exist optimal organizations for the interconnect network of such large systems. This collaboration grounds F. Jiang's Ph.D. work, which is co-supervised and co-funded by the two groups.

CEA List

For the past 6 years, we have had a regular collaboration with the Laboratoire Sûreté du Logiciel (Software Safety Lab) at CEA LIST on two topics: processor simulation and program optimization. Simulation of complex processor architectures is necessary for developing software tests of the complex systems investigated at CEA. Program optimization is more a way to factor in the CEA expertise in static analysis and develop new applications. CEA funded two scholarships in our group until 2004 and 2005 respectively.

Others

We also have regular contacts with several foreign research groups: the CAPSL group at University of Delaware; and the PASCAL group at University of California Irvine (NSF-INRIA grant).

Hugues Berry collaborates with Bruno Cessac (Institut Non Linéaire de Nice, UMR 6618 CNRS / Université Nice-Sophia Antipolis), Bruno Delord (ANIM, UMR 742 Inserm / Université Pierre et Marie Curie, Paris), Stéphane Genet (ANIM, UMR 742 Inserm / Université Pierre et Marie Curie, Paris), Mathias Quoy (ETIS, UMR 8051 CNRS / Université de Cergy-Pontoise / ENSEA), Olivier Michel (Ibisc, Université d'Evry), Marc Schoenauer (TAO, INRIA Futurs, Orsay), and Nazim Fates (MAIA, INRIA Lorraine, Nancy).

Seminar and invited scientists

1 week visit of Dr. Petros Panayi from U. Cyprus.

2 weeks visit of Razya Ladelsky from IBM Research Haifa, Israel.

2 month visit of Prof. Özcan Özturk from Bilkent University, Ankara, Turkey.

Dr. Marc Duranton (Philips NXP, Eindhoven, Netherlands) visits Alchemy regularly.

Dr. Benoît Dupont de Dinechin (STMicroelectronics, then Kalray, Grenoble) visits Alchemy regularly.

Several visits by Prof. Sadayappan (Ohio State University) and Prof. Ramanujam (Louisiana State University).

Seminar by Prof. Colin Bundwell (University of Pennsylvania) on memory consistency.

Seminar by Prof. Babak Falsafi (EPFL) on cache prefetching.

Seminar by Dr. Sven Verdoolaege (K. U. Leuven) on process networks in the polyhedral model.

Seminar by Prof. Walid Taha (RICE University) on hybrid continuous-discrete systems.

Seminar and tutorial by Prof. François Irigoin, Prof. Fabien Coelho (École des Mines), Prof. Ronan Keryell (Telecom Bretagne and HPC-project) and Prof. Frédérique Chaussumier-Silber (Telecom SudParis), on the PIPS source-to-source compiler.

Dissemination Leadership within scientific community Cédric Bastoul

Visiting Professor at Reservoir Labs Inc. Jan 09 to Dec 09.

Member of the LRI department committee at the University of Paris-Sud since 2006.

Member of the Orsay Technology Institute (IUT D'Orsay) Computer Science department committee since 2006.

Director of the Licence Professionnelle Sécurité des Systèmes et Réseaux Informatiques (Bachelor on System and Network Security) at Orsay Institute of Technology since 2007.

Hugues Berry

Hugues Berry is a member of the “Scientific Commission” (commission scientifique) of the INRIA Saclay-Ile-de-France research centre.

Albert Cohen

HiPEAC'06 Summer School course on GCC (55-65 attendees). The support material for the courses and tutorials is freely available (public domain or GPL license) and has been contributed to the main GCC site (http://gcc.gnu.org, Wiki section; see also http://www.hipeac.net/gcc-tutorial).

Founding member of IFIP WG 2.11.

President of the recruiting committee (admissibilité) for INRIA Saclay research scientists, 2007, 2008 and 2009.

Christine Eisenbeis

Member of IFIP WG 10.3.

Member of the “comité de programmes” of Digiteo.

Elected member of the “conseil d'administration” of Inria [2006-].

Elected member of the “comité technique paritaire” of Inria [2006-2009].

Elected member of the “conseil scientifique” of University of Paris-Sud [2008-].

Chair [2008-] of the “commission des utilisateurs des moyens informatiques - recherche” of the Saclay Inria Research Center.

Olivier Temam

HiPEAC2 Steering Committee, Research workpackage leader, leader of the Research Cluster on simulation.

Program Co-Chair of the 2011 International Conference on High-Performance and Embedded Systems (HiPEAC).

General Chair of the 2011 ACM/IEEE International Symposium on Code Generation and Optimization (CGO), to be organized in Chamonix, France. It is the first time that CGO will be held outside the US.

Leader of the INRIA Alchemy group.

Program Committees:

Denis Barthou

Program committee member of IEEE HPCC 2009, The 11th IEEE International Conference on High Performance Computing and Communications (HPCC-09), June 2009 Korea University, Seoul, Korea

Cédric Bastoul

Program committee member of DATE 2009, the 12th International Conference on Design, Automation & Test in Europe.

Program committee member of SSS 2009 The 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems.

Hugues Berry

Member of the INSERM commission for systems biology (Institut genetique et developement)

Defended PhD of supervised students: Fei Jiang “Evolution and optimization of large neural networks” (co-Supervised with M. Schoenauer, TAO, INRIA Saclay). PhD in Computer Science, Univ. Orsay-Paris-XI, Dec. 16, 2009

PhD Jury Duty :

M. Valvassori (Dir. A. Ali Cherif), 10 July 2009, University Paris 8 (Rapporteur)

M. Ambard (Dir. D. Martinez & F. Alexandre), 06 June 2009, University Nancy (Rapporteur)

Selection committee for Assistant Professor positions :

Position MCF 744, University Joseph Fourier, Grenoble, Sections 26-27, Biomathematics and Bioinformatics, May 2009

Position MCF 283, University of Evry, Sections 65-27, Cell biology and Bioinformatics, April-May 2009

Reviewer for the ANR Calls “SysComm”

Review editor for the journal “Frontiers in Neurorobotics” (http://frontiersin.org/neuroscience/user.do?actionType=JournalIssues&displayJournalPage=13&journalId=13)

Albert Cohen

Co-organization (with Marc Shapiro from INRIA Rocquencourt, Jean Roman from INRIA Bordeaux, and David Devour from CNRS and U. Perpignan) of the INRIA Massively Multicore and Manycore (IMMM) Days, February 4 and 5, 2009. The event drew 130 attendees and covered core research and technology as well as application domains impacted by manycore processors.

Editor of the special issue of the Transactions on High Performance and Embedded Architectures and Compilers (HiPEAC Journal) for the best papers of the SHCMP'08 workshop, to appear in 2010.

Program committee member of IEEE conf. on Parallel Architectures and Compilation Techniques (PACT'08 and PACT'09).

Program committee member of the ACM LCTES'10 conference.

Program committee member of the HiPEAC'09 and HiPEAC'10 conferences.

Program committee member of the 2PARMA'10 workshop on Parallel Programming and Run-time Management Techniques for Many-core Architectures.

Program committee member of the GROW'10 GCC Research Opportunities Workshop.

External program committee member of ISCA'10.

Financial chair and local organization committee of CGO'11.

Thesis proposal committee (external reviewer) of Fréderic De Mesmay, Carnegie Mellon University, USA, February 2009.

PhD thesis committee (external reviewer) of Armin Größlinger, University of Passau, DE, December 2009.

PhD thesis committee (examiner) of Nicolas Geoffray, Paris 6 University (and INRIA Rocquencourt), September 2009.

PhD thesis committee (president) of Matthieu Lemerre, Paris-Sud University (and CEA List), October 2009.

PhD thesis committee (examiner) of Jean-Baptiste Tristan, Paris 6 University (and INRIA Rocquencourt), November 2009.

PhD thesis committee (president) of Lamia Djoudi, University of Versailles, December 2009.

Christine Eisenbeis

Software and Compilers for Embedded Systems, SCOPES' 2009, April, 2009, Nice, France.

PhD thesis committee (reviewer) of Florent Bouchez, ENS Lyon, April 30th, 2009.

PhD thesis committee (member) of Rémi Baron, LPT, Orsay et CEA, September 18th, 2009.

Grigori Fursin

Program Committee Member of ICPADS'09 (International Conference on Parallel and Distributed Systems), multi-core architectures track

Program Committee Member of iWAPT'09 (International Workshop on Automatic Performance Tuning)

Workshop chair and organizer of GROW'09 (2nd International Workshop on GCC Research Opportunities)

Workshop organizer of SMART'09 (3rd Workshop on Statistical and Machine Learning Approaches applied to Architectures and Compilation)

Program Committee Member of Open64 Workshop at CGO'09

Olivier Temam

Program Committee of International Conference on Architecture of Computing Systems, 2010.

Program Committee of Workshop on New Directions in Computer Architecture, 2009.

Steering committee member and co-organizer of the Rapido workshop at the HiPEAC Conference, 2009, 2010.

Program committee of Workshop on Statistical and Machine learning approaches applied to ARchitectures and compilaTion (SMART) in 2010.

Program committee of ACM/IEEE International Symposium on Computer Architecture (ISCA) in 2009, 2010.

Program committee of ACM/IEEE International Symposium on Micro-Architecture (MICRO) in 2009.

Program committee of IEEE International Symposium on High-Performance Computer Architecture (HPCA) in 2009.

Associate editor of the HiPEAC Transactions.

Teaching at university

Denis Barthou gave these courses:

15h in Master2, UVSQ on vectorization/parallelization,

Summer School INRIA/CEA/EDF on High Performance Computing (June 2008).

Cédric Bastoul gives Java, System, Network and Security lectures and labs at the Orsay Institute of Technology to first, second and third year students (L1 to L3). He also teaches an Object-Oriented Programming course at Paris-Sud University to second year students (L2). Lastly, he teaches computer architecture at École Polytechnique to third year students (M1).

Anna Beletska gave 9 hours of lectures in the Master 2 “Recherche” of Computer Science of University of Paris-Sud.

Mohamed-Walid Benabderrahmane: teaching assistantship (monitorat), 64 hours at IFIPS, University Paris-Sud 11. Courses: C/C++/C#, Web Services, Security. Level: 5th-year engineering students.

Philippe Dumont: Components of a Computing System, Introduction to Computer Architecture and Operating Systems, École Polytechnique - Licence 3 - 36h

Christine Eisenbeis gave a 3-hour lecture about “Reversible computing” in the Master 2 “Recherche” of Computer Science of University of Paris-Sud.

Olivier Temam teaches a computer architecture course at École Polytechnique to 3rd-year students (approx. 35 hours). He also co-teaches a course on novel processor architectures at University of Paris-Sud to Master's students.

Albert Cohen teaches an introductory computing systems course (computer architecture, operating systems, distributed systems) at École Polytechnique to 2nd-year students (approx. 35 hours, 120 students). It was the first course to use the Google Android development kit as a virtual platform for lab sessions, and an e-book published with Eyrolles came out of this first experiment in 2009. He also teaches an advanced operating systems course to 3rd-year students at École Polytechnique and co-chairs the Electrical Engineering curriculum there.

Workshops, seminars, invitations

The project-team members have given the following talks and attended the following conferences:

Mounira Bachir

LCPC 09, University of Delaware, USA, October 8-10, 2009, “Using The Meeting Graph Framework to Minimise Kernel Loop Unrolling for Scheduled Loops”

Cédric Bastoul

Paper presentation at PMEA 2009 (September, Raleigh, North Carolina) Workshop on Programming Models for Emerging Architectures in conjunction with PACT 2009.

Poster presentation at PACT 2009 (September, Raleigh, North Carolina) Intl. Conf. on Parallel Architectures and Compilation Techniques.

Participation in SPC 2009 Fault-Tolerant Spaceborne Computing Employing New Technologies Workshop (May 26-29, Albuquerque, New Mexico).

Anna Beletska

Cocoa' 2009, talk “Computing the transitive closure of a union of affine integer tuple relations”

ISPDC 2009, talk “Coarse-Grained Loop Parallelization: Iteration Space Slicing vs Affine Transformations”

Mohamed-Walid Benabderrahmane

Poster Pact 2009, “A Conservative Approach to Manipulate Data-Dependent Control Flow in the Polyhedral Model”, with Louis-Noël Pouchet

Summer school : Acaces 2009, Fifth International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems July 12 to July 18, 2009 Terrassa (near Barcelona), Spain

Hugues Berry

“The Effects of Hebbian Learning on the Structure and Dynamics of Chaotic Neural Networks”, given at the Dept. Electrical and Computer Engineering, Univ. Wisconsin at Madison, WI, USA, Jan. 13, 2009 (invited by M. Lipasti).

“Estimating the effects of intrinsic plasticity on neural network dynamics using a realistic model”, at the “Journees Mathematiques du Vivant”, Laboratory J.A. Dieudonnee, Nice, France, March 25, 2009 (invited by B. Cessac)

“ColAge: A systems and synthetic biology approach to the control of bacterial growth and aging”, the 2nd NIH-INRIA workshop on Biomedical Computing, INRIA Rocquencourt, France, June 3, 2009.

“Cell biochemistry in cytoplasms with large molecular crowding : anomalous diffusion and bacterial aging”, at the 2nd Paris Workshop on Multi-Agent Systems in Biology at the Meso or Macroscopic Scales, Univ. Pierre et Marie Curie, Paris, France, June 23, 2009 (invited by M. Beurton-Aimar)

Christine Eisenbeis

Compilers for Parallel Computers (CPC'09), Zürich, Switzerland, January 7-9, 2009.

International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09, Grenoble, November 11-15th, 2009.

Grigori Fursin

Participation in MICRO'09 (42nd IEEE/ACM International Symposium on Microarchitecture), New York, USA, December 2009

invited talk, "Collective Tuning Initiative", presented at the University of Versailles, France, May 2009; presented at the HiPEAC industrial workshop and HiPEAC clusters, Infineon, Munich, Germany, June 2009;

paper presentation "Collective Tuning" at the GCC Summit'09, Montreal, Canada, June 2009;

invited talk, "Collective Tuning Initiative: collective optimization, run-time adaptation and machine learning", presented at University of Illinois at Urbana Champaign, USA, April 2009

paper presentation "Collective Optimization" at HiPEAC'09, Cyprus, January 2009

paper presentation "Finding representative sets of optimizations for adaptive multiversioning applications" at SMART'09, Cyprus, January 2009

invited talk (by EU FP7 commission), "MILEPOST project - using machine learning to automate and speed up program optimization for reconfigurable processors", presented at the Information and Brokerage Conference on Information and Communication Technologies in the EU's 7th Framework, Moscow, Russia, October 2008

invited talk, "Enabling Dynamic Optimization and Adaptation for Statically Compiled Programs Using Function Multi-Versioning", presented at ScalPerf'08 (Scalable Approaches to High Performance and High Productivity Computing), Bertinoro, Italy, September 2008

invited talk, "Continuous adaptive program optimizations", presented at Reservoir Labs and IBM TJ Watson Research Center, New York, USA, August 2008;

presented at Imperial College (Software Performance Engineering Laboratory), London, UK, February 2008;

presented at the Institute of Computing Technology (Chinese Academy of Sciences), Beijing, China, January 2008;

invited talk, "Program iterative continuous optimizations, run-time adaptation and machine learning", presented at IBM Toronto Lab (compiler group), Canada, July 2007;

invited talk, "Machine learning techniques for iterative program optimizations and run-time adaptation", presented for the TAO group (machine learning group), LRI, Paris-Sud XI University, INRIA and CNRS, France, June 2007;

invited talk, "Overview of current activities: Interactive Compilation Interface for fine-grain program optimizations, dataset sensitivity, machine learning to speed up optimizations and DSE, run-time program adaptation, optimizations for heterogeneous computing systems, continuous collective optimizations, HiPEAC activities", presented at Intel (compiler group), Moscow, Russia, February 2007 and at the ISP RAS (Institute for System Programming, Russian Academy of Sciences), Moscow, Russia, February 2007

"Continuous run-time adaptation and optimization of statically compiled programs", presented at the UPC, Barcelona, Spain, January 2007.

Albert Cohen

Seminar at the U. of Delaware, February 2009, Newark DE: “state of the art in polyhedral compilation for production compilers”.

Visit of Reservoir Labs, February 2009, New York.

Visit of the group of Markus Püschel and Franz Franchetti, Carnegie Mellon University, February 2009, Pittsburgh, Pennsylvania.

Visit of the group of Kathryn O'Brien, of Kenneth Zadeck and David Edelsohn at IBM Research Watson, June 2009, Yorktown Heights, New York.

Invited presentation and contribution to a planning meeting for a future European call for research proposals on system- and process-level virtualization, September 2009, Bruxelles.

Presentation at the second STMicroelectronics-INRIA Plateform2012 meeting, October 2009, Grenoble.

Invited panelist at the LCPC'09 Panel on the future of compilation research and technology, October 2009, Newark, Delaware.

Co-organizer (with Joseph Sifakis, Ahmed Jerraya and BenoÃ®t Dupont de Dinechin) of the ESWeek'09 Industrial Panel on compilers for embedded multicore architectures, October 2009, Grenoble.

Presentation at Dagstuhl Seminar 09481 (SYNCHRON'09), December 2009: “A data-flow synchronous perspective to performance portability”.

Seminar at U. Saarbrücken, December 2009: “Languages and compilers for Volkscomputing”.

Seminar at U. Passau, December 2009: “Language and compilers for Volkscomputing”.

Presentation at the second Bull-INRIA-CEA partnership meeting, December 2009, Rocquencourt.

Philippe Dumont

Workshop on PetaScale Computing, First workshop of INRIA-Illinois Petascale Computing Joint Lab, June 10 to June 12, 2009, Paris, France

Acaces Summer School, July 12 to July 18, 2009, Terrassa, Spain

“ERBIUM: A Deterministic, Low-Level Concurrent Representation for Portability and Scalable Performance”, Synchronics day, December 17 2009, Paris, France

Sean Halle

Poster at ACACES 2009 International Conference “Bidirectional Libraries for Portable High Performance Parallelism”

Using Machine Learning to Focus Iterative Optimization Felix Agakov F. Edwin Bonilla E. John Cavazos J. Bjoern Franke B. Grigori Fursin G. Mike O'Boyle M. J. Thomson J. M. Toussaint M. C. Williams C. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO) 2006 Chaos in computer performance Hugues Berry H. Daniel Gracia Pérez D. Olivier Temam O. Chaos 16 2006 013110 http:// hal. inria. fr/ inria-00000109/ en/ N-Sychronous Kahn Networks Albert Cohen A. Marc Duranton M. Christine Eisenbeis C. Claire Pagetti C. Florence Plateau F. Marc Pouzet M. 33th ACM Symp. on Principles of Programming Languages (PoPL'06), Charleston, South Carolina January 2006 180–193 http:// www-rocq. inria. fr/ ~acohen/ publications/ CDEPPP06. ps. gz A Polyhedral Approach to Ease the Composition of Program Transformations Albert Cohen A. Sylvain Girbal S. Olivier Temam O. Euro-Par'04, Pisa, Italy LNCS 3149 Springer-Verlag August 2004 292–303 http:// www-rocq. inria. fr/ ~acohen/ publications/ CGT04. ps. gz Blob Computing Frederic Gruau F. Yves Lhuillier Y. Philippe Reitz P. Olivier Temam O. Computing Frontiers 2004 ACM SIGMicro. 2004 http:// blob. lri. fr/ publication/ 2004-model-blob-machine. pdf Capsule : Hardware-Assisted Parallel Execution of Component-Based Programs Pierre Palatin P. Yves Lhuillier Y. Olivier Temam O. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, Orlando, Florida december 2006 On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance : Matrix-Multiply revisited David Parello D. Olivier Temam O. Jean-Marie Verdun J.-M. Supercomputing IEEE Nov 2002 Induction Variable Analysis with Delayed Abstractions Sebastian Pop S. Albert Cohen A. G.-A. Silber G.-A. Intl. Conf. on High Performance Embedded Architectures and Compilers (HiPEAC'05), Barcelona, Spain LNCS 3793 Springer-Verlag November 2005 218–232 http:// www-rocq. inria. 
fr/ ~acohen/ publications/ PCS05. ps. gz Violated dependence analysis Nicolas Vasilache N. Cédric Bastoul C. Sylvain Girbal S. Albert Cohen A. Proceedings of the ACM International Conference on Supercomputing (ICS'06), Cairns, Australia ACM June 2006 MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms Daniel Gracia Pérez D. Gilles Mouchard G. Olivier Temam O. MICRO-37: Proceedings of the 37th International Symposium on Microarchitecture IEEE Computer Society Dec 2004 43–54 Du chaos dans les neurones Hugues Berry H. Bruno Cessac B. 0153-4092 Pour La Science 385 November 2009 108-115 Extracting representative loop statement instances of synchronization-free slices Wlodzimierz Bielecki W. Marek Palkowski M. Anna Beletska A. 0032-4140 Measurement Automation and Monitoring 10 2009 807–811 Glutamate regulation of calcium and IP3 oscillating and pulsating dynamics in astrocytes M. De Pittà M. M. Goldberg M. V. Volman V. H. Berry H. E. Ben-Jacob E. 0092-0606 Journal of Biological Physics 35 4 2009 383–411 Glutamate regulation of calcium and IP3 oscillating and pulsating dynamics in astrocytes M. De Pittà M. M. Goldberg M. V. Volman V. H. Berry H. E. Ben-Jacob E. 0092-0606 Journal of Biological Physics 35 2009 383-411 Collective Optimization Grigori Fursin G. Olivier Temam O. 1544-3566 ACM, Transactions on Architecture and Code Optimization (TACO) 2010 Array-OL with delays, a domain specific specification language for multidimensional intensive signal processing. Calin Glitia C. Philippe Dumont P. Pierre Boulet P. 0923-6082 Multidimensional Systems and Signal Processing 2009 http:// springerlink. com/ content/ w3821760381l4432/ ?p=fc0a4428f2f4468a9d630d2a434a6f69&pi=0 Systematic search within an optimisation space based on Unified Transformation Framework S. Long S. Grigori Fursin G. 
1742-7185 International Journal of Computational Science and Engineering (IJCSE) 4 2 2009 102-111 Using The Meeting Graph Framework to Minimise Kernel Loop Unrolling for Scheduled Loops Mounira BACHIR M. David Gregg D. Sid-Ahmed-Ali Touati S.-A.-A. The 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC09), Delaware, USA, October 8-10, 2009 2009 International Workshop on Languages and Compilers for Parallel Computing 22 LCPC Spatial complexity of reversibly computable DAG Mouad Bahi M. Christine Eisenbeis C. CASES, International Conference on Compilers, Architecture, and Synthesis for embedded systems 2009 47-56 International Conference on Compilers, Architecture and Synthesis for Embedded Systems 2009 CASES Extended Static Control Programs as a Programming Model for Accelerators, A Case Study: Targetting ClearSpeed CSX700 With the R-Stream Compiler Cédric Bastoul C. Nicolas Vasilache N. Allen Leung A. Benoît Meister B. David Wohlford D. Richard Lethin R. PMEA'09 Workshop on Programming Models for Emerging Architectures, Raleigh, North Carolina September 2009 45-52 Workshop on Programming Models for Emerging Architectures 2009 PMEA Computing the Transitive Closure of a Union of Affine Integer Tuple Relations Anna Beletska A. Denis Barthou D. Wlodzimierz Bielecki W. Albert Cohen A. International Conference on Combinatorial Optimization and Applications COCOA'09 June 2009 98-109 International Conference on Combinatorial Optimization and Application 3 COCOA Synchronization-free automatic parallelization: Beyond affine iteration-space slicing Anna Beletska A. Wlodzimierz Bielecki W. Albert Cohen A. M. Palkowski M. 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC'09) October 2009 International Workshop on Languages and Compilers for Parallel Computing 22 LCPC Coarse-grained loop parallelization: Iteration space slicing vs affine transformations A. Beletska A. Wlodzimierz Bielecki W. Albert Cohen A. M. 
Palkowski M. K. Siedlecki K. IEEE International Symposium on Parallel and Distributed Computing (ISPDC'09) July 2009 International Symposium on Parallel and Distributed Computing 8 ISPDC Relaxing synchronous composition with clock abstraction Albert Cohen A. Louis Mandel L. Florence Plateau F. Marc Pouzet M. Hardware Design and Functional Languages Workshop (HFL'09) March 2009 Workshop Hardware Design using Functional languages 2009 HFL ArchExplorer.org: Joint Compiler/Hardware Exploration for Fair Comparison of Architectures Veerle Desmet V. Sylvain Girbal S. Olivier Temam O. International Workshop on Interaction between Compilers and Computer Architecture (INTERACT) February 2009 IFIP TC13 International Conference on Human-Computer Interaction 12 INTERACT Opening Up Automatic Structural Design-Space Exploration by Fixing Modular Simulation Veerle Desmet V. Sylvain Girbal S. Olivier Temam O. HiPEAC Industrial Workshop November 2009 HiPEAC Industrial Workshop 2009 A Methodology for Facilitating a Fair Comparison of Research Ideas Veerle Desmet V. Sylvain Girbal S. Olivier Temam O. IEEE, International Symposium on Performance Analysis of Systems and Software (ISPASS), White Plains, NY IEEE Computer Society Press March 2010 International Symposium on Performance Analysis of Systems and Software 2009 ISPASS Optimizing local memory allocation and assignment through a decoupled approach B. Diouf B. Ö. Öztürk Ö. Albert Cohen A. 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC'09) October 2009 International Workshop on Languages and Compilers for Parallel Computing 22 LCPC Portable Compiler Optimization Across Embedded Programs and Microarchitectures using Machine Learning Christophe Dubach C. Timothy Jones T. Edwin Bonilla E. Grigori Fursin G. Michael O'Boyle M. 
42nd IEEE/ACM International Symposium on Microarchitecture (MICRO) December 2009 IEEE/ACM International Symposium on Microarchitecture 42 MICRO Critical phenomena in a discrete stochastic reaction-diffusion medium N.A. Fates N. Hugues Berry H. Fourth International Workshop on Natural Computing, IWNC 2009 September 2009 International Workshop on Natural Computing 4 IWNC Software Pipelining in Nested Loops with Prolog-Epilog Merging Mohammed Fellahi M. Albert Cohen A. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009) January 2009 80-94 International Conference on High Performance and Embedded Architectures and Compilers 4 HiPEAC Collective Tuning Initiative: automating and accelerating development and optimization of computing systems Grigori Fursin G. Proceedings of the GCC Developers' Summit June 2009 GCC Developper's Summit 6 GCC Collective Optimization Grigori Fursin G. Olivier Temam O. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009) January 2009 International Conference on High Performance and Embedded Architectures and Compilers 4 HiPEAC Parametric Multi-Level Tiling of Imperfectly Nested Loops Albert Hartono A. Muthu Baskaran M. Cédric Bastoul C. Albert Cohen A. Sriram Krishnamoorthy S. Boyana Norris B. J. Ramanujam J. P. Sadayappan P. Proceedings of the ACM International Conference on Supercomputing (ICS'09), Yorktown Heights, New York June 2009 147-157 ACM International Conference on Supercomputing 23 ICS Leveraging Progress in Neurobiology for Computing Systems Atif Hashmi A. Hugues Berry H. Mikko Lipasti M. Olivier Temam O. New Directions in Computer Architecture (NDCA), in conjunction with MICRO, New York December 2009 Workshop on New Directions in Computer Architecture 1 NDCA Leveraging progress in neurobiology for computing systems A. Hashmi A. Hugues Berry H. Olivier Temam O. M. Lipasti M. 
1st Workshop on New Directions in Computer Architecture (NDCA-1) December 2009 Workshop on New Directions in Computer Architecture 1 NDCA Simulation of the Lattice QCD and Technological Trends in Computation Khaled Ibrahim K. Julien Jaeger J. Zhaofeng Liu Z. Louis-Noel Pouchet L.-N. Piotr Lesnicki P. Lamia Djoudi L. Denis Barthou D. Francois Bodin F. Christine Eisenbeis C. Gilbert Grosdidier G. Olivier Pene O. Patrick Roudeau P. Workshop on Compilers for Parallel Computing, Zurich, Switzerland January 2009 Workshop on Compilers for Parallel Computers 14 CPC arXiv:0808.0391v3 The Impact of Network Topology on Self-Organizing Maps F. Jiang F. H. Berry H. M. Schoenauer M. World Summit on Genetic and Evolutionary Computation, GECS-2009 June 2009 World Summit on Genetic and Evolutionary Computation 2009 GEC Summit The Impact of Network Topology on Self-Organizing Maps F. Jiang F. H. Berry H. M. Schoenauer M. World Summit on Genetic and Evolutionary Computation, GECS-2009, Shangai, China June 2009 World Summit on Genetic and Evolutionary Computation 2009 GEC Summit Predictive runtime code scheduling for heterogeneous architectures Victor Jimenez V. Isaac Gelado I. Lluis Vilanova L. Marisa Gil M. Grigori Fursin G. Nacho Navarro N. Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009) January 2009 International Conference on High Performance and Embedded Architectures and Compilers 4 HiPEAC Finding representative sets of optimizations for adaptive multiversioning applications Lianjie Luo L. Yang Chen Y. Chengyong Wu C. Shun Long S. Grigori Fursin G. 
3rd Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'09), colocated with HiPEAC'09 conference January 2009 Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation 3 SMART Productivity via Automatic Code Generation for PGAS Platforms with the R-Stream Compiler Benoît Meister B. Allen Leung A. Nicolas Vasilache N. David Wohlford D. Cédric Bastoul C. Richard Lethin R. APGAS'09 Workshop on Asynchrony in the PGAS Programming Model, Yorktown Heights, New York June 2009 Workshop on Asynchrony in the PGAS Programming Model 2009 APGAS Towards transactional memory support for GCC M. Schindewolf M. Albert Cohen A. W. Karl W. A. Marongiu A. L. Benini L. GCC Research Opportunities Workshop (GROW'09, associated with HiPEAC) January 2009 International Workshop on GCC Research Opportunities 1 GROW ANNs as Efficient and Robust Accelerators for Emerging Applications Olivier Temam O. New Directions in Computer Architecture (NDCA), in conjunction with MICRO, New York December 2009 Workshop on New Directions in Computer Architecture 1 NDCA Reducing Training Time and Calculating Confidence in a Machine Learning-based Compiler John Thomson J. Michael O'Boyle M. Grigori Fursin G. Bjoern Franke B. 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC'09) October 2009 International Workshop on Languages and Compilers for Parallel Computing 22 LCPC Polyhedral-model guided loop-nest auto-vectorization K. Trifunovic K. D. Nuzman D. Albert Cohen A. A. Zaks A. I. Rosen I. In Parallel Architectures and Compilation Techniques (PACT'09) September 2009 International Conference on Parallel Architectures and Compilation Techniques 18 PACT Reconciling Specialization and Flexibility Through Compound Circuits Sami Yehia S. Sylvain Girbal S. Hugues Berry H. Olivier Temam O. 
15th International Symposium on High-Performance Computer Architecture, HPCA, Raleigh, North Carolina February 2009 International Conference on High-Performance Computer Architecture 15 HPCA The PetaQCD project Gilbert Grosdidier G. Christine Eisenbeis C. François Bodin F. André Seznec A. R. Bilhaut R. G. Le Meur G. Patrick Roudeau P. F. Touze F. Jean-Christian Angles D'Auriac J.-C. J. Carbonell J. D. Becirevic D. P. Boucaud P. Olivier Brand-Foissac O. Olivier Pene O. Denis Barthou D. Pierre. Guichon P. P.F. Honore P. P. Gallard P. L. Rilling L. 17th International Conference on Computing in High Energy and Nuclear Physics (CHEP09), Prague Tchèque, République 03 2009 http:// hal. in2p3. fr/ in2p3-00380246/ en/ International Conference on Computing in High Energy and Nuclear Physics 17 CHEP The proceedings of the International Conference on Computing in High Energy and Nuclear Physics (CHEP 2009) will be published in the open access Journal of Physics: Conference Series (JPCS), published by IOP Publishing. All papers will be free to read and download immediately upon publication. LAL 09-58 A Conservative Approach to Handle Full Functions in the Polyhedral Model Mohamed-Walid Benabderrahmane M.-W. Cédric Bastoul C. Louis-Noël Pouchet L.-N. Albert Cohen A. 6814 INRIA Research Report January 2009 Technical report Schedule-Sensitive Register Pressure Reduction in Innermost Loops, Basic Blocks and Super-Blocks Sébastien Briais S. Sid Touati S. Inria 2009 http:// hal. archives-ouvertes. fr/ inria-00436348/ PDF/ main_siralina_report. pdf Technical report Experimental Study of Register Saturation in Basic Blocks and Super-Blocks: Optimality and heuristics Sébastien Briais S. Sid-Ahmed-Ali Touati S.-A.-A. Inria 2009 http:// hal. archives-ouvertes. fr/ inria-00431103/ PDF/ main_RS_report. 
pdf experimental data and free software are included (made public) Technical report Cyclic Task Scheduling with Storage Requirement Minimization under Specific Architectural Constraints: Case of Buffers and Rotating Storage Facilities Sid-Ahmed-Ali TOUATI S.-A.-A. Inria 2009 http:// hal. archives-ouvertes. fr/ inria-00440446/ PDF/ PSSR. pdf This is a continuation work to SIRA (Sid-Ahmed-Ali Touati and Christine Eisenbeis. Early Periodic Register Allocation on ILP Processors. Parallel Processing Letters, Vol. 14, No. 2, June 2004. World Scientific.). We extend that work with new heuristics and experimental results. Technical report Using Machine Learning to Focus Iterative Optimization Felix Agakov F. Edwin Bonilla E. John Cavazos J. Bjoern Franke B. Grigori Fursin G. Mike F. P. O'Boyle M. F. P. J. Thomson J. M. Toussaint M. C. Williams C. Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO) 2006 Using Machine Learning to Focus Iterative Optimization Felix Agakov F. Edwin Bonilla E. John Cavazos J. Bjoern Franke B. Grigori Fursin G. Mike F. P. O'Boyle M. F. P. John Thomson J. Marc Toussaint M. Chris Williams C. CGO-4: The Fourth Annual International Symposium on Code Generation and Optimization 2006 Automatic decomposition of scientific programs for parallel execution R. Allen R. D. Callahan D. K. Kennedy K. Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages ACM Press 1987 63–76 http:// doi. acm. org/ 10. 1145/ 41625. 41631 Spatial complexity of reversible computing Mouad Bahi M. Christine Eisenbeis C. Benjamin Dauvergne B. Albert Cohen A. Third International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems (ACACES'08), L'Aquila, Italy July 2008 Gamma and the Chemical Reaction Model : Ten Years After J-P. Banâtre J.-P. D. Le Métayer D. L. J.-M. Andreoli J.-M. H. Gallaire H. D. Le Métayer D. L. 
Coordination Programming: Mechanisms, Models and Semantics 1996 1–39 Code Generation in the Polyhedral Model Is Easier Than You Think Cédric Bastoul C. PACT'13 IEEE International Conference on Parallel Architecture and Compilation Techniques, Juan-les-Pins September 2004 7–16 http:// hal. ccsd. cnrs. fr/ ccsd-00017260 Putting Polyhedral Loop Transformations to Work Cédric Bastoul C. Albert Cohen A. Sylvain Girbal S. Saurabh Sharma S. Olivier Temam O. Workshop on Languages and Compilers for Parallel Computing (LCPC'03), College Station, Texas LNCS Springer-Verlag October 2003 23–30 Chaos in computer performance Hugues Berry H. Daniel Gracia Pérez D. Olivier Temam O. Chaos 16 2006 013110 http:// hal. inria. fr/ inria-00000109/ en/ Complex dynamics of microprocessor performances during program execution: Regularity, chaos, and others Hugues Berry H. Daniel Gracia Pérez D. Olivier Temam O. NKS2006 Wolfram Science Conference, Washington D.C., USA June 2006 Structure and dynamics of random recurrent neural networks Hugues Berry H. Mathias Quoy M. Adaptive Behavior 14 2006 129-137 Modeling Self-Developing Biological Neural Network Hugues Berry H. Olivier Temam O. Neurocomputing 70 16-18 2007 2723–2734 Aestimo: a feedback-directed optimization evaluation tool P. Berube P. J.N. Amaral J. Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS) 2006 The M5 Simulator: Modeling Networked Systems Nathan L. Binkert N. L. Ronald G. Dreslinski R. G. Lisa R. Hsu L. R. Kevin T. Lim K. T. Ali G. Saidi A. G. Steven K. Reinhardt S. K. IEEE Micro 26 4 2006 52–60 http:// dx. doi. org/ 10. 1109/ MM. 2006. 82 Cilk: An Efficient Multithreaded Runtime System R. Blumofe R. C. Joerg C. B. Kuszmaul B. C. Leiserson C. K. Randall K. Y. Zhou Y. Proceedings of the 5th Symposium on Principles and Practice of Parallel Programming 1995 http:// citeseer. ist. psu. edu/ blumofe95cilk. html The SimpleScalar tool set, version 2.0 Doug Burger D. Todd M. Austin T. 
M. SIGARCH Comput. Archit. News 25 3 1997 13–25 http:// doi. acm. org/ 10. 1145/ 268806. 268810 Automatic Performance Model Construction for the Fast Software Exploration of New Hardware Designs John Cavazos J. Christophe Dubach C. Felix Agakov F. Edwin Bonilla E. Mike F. P. O'Boyle M. F. P. Grigori Fursin G. Olivier Temam O. International Conference on Compilers, Architecture, And Synthesis For Embedded Systems (CASES 2006) October 2006 To appear A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel Programs Olivier Certner O. Zheng Li Z. Pierre Palatin P. Olivier Temam O. Frederic Arzel F. Nathalie Drach N. DATE 2008, Munich, Germany March 2008 740–745 Zbigniew Chamski Z. Marc Duranton M. Albert Cohen A. Christine Eisenbeis C. Paul Feautrier P. Daniela Genius D. Application Domain-Driven System Design for Pervasive Video Processing Ambient Intelligence: Impact on Embedded-System Design Kluwer Academic Press 2003 Hardware-modulated parallelism in chip multiprocessors Julia Chen J. Philo Juang P. Kevin Ko K. Gilberto Contreras G. David Penry D. Ram Rangan R. Adam Stoler A. Li-Shiuan Peh L.-S. Margaret Martonosi M. SIGARCH Comput. Archit. News, Special Issue: Proc. of the dasCMP'05 Workshop 33 4 2005 54–63 http:// doi. acm. org/ 10. 1145/ 1105734. 1105742 Synchronization of Periodic Clocks Albert Cohen A. Marc Duranton M. Christine Eisenbeis C. Claire Pagetti C. Florence Plateau F. Marc Pouzet M. ACM Conf. on Embedded Software (EMSOFT'05), Jersey City, New York September 2005 http:// www-rocq. inria. fr/ ~acohen/ publications/ CDEPPP05. ps. gz N-Synchronous Kahn Networks Albert Cohen A. Marc Duranton M. Christine Eisenbeis C. Claire Pagetti C. Florence Plateau F. Marc Pouzet M. 33rd ACM Symp. on Principles of Programming Languages (PoPL'06), Charleston, South Carolina January 2006 180–193 http:// www-rocq. inria. fr/ ~acohen/ publications/ CDEPPP06. ps. 
gz Multi-Periodic Process Networks: Prototyping and Verifying Stream-Processing Systems Albert Cohen A. Daniela Genius D. A. Kortebi A. Zbigniew Chamski Z. Marc Duranton M. Paul Feautrier P. Euro-Par'02, Paderborn, Germany LNCS 2400 Springer-Verlag August 2002 http:// www-rocq. inria. fr/ ~acohen/ publications/ CGKCDF02. ps. gz Facilitating the Search for Compositions of Program Transformations Albert Cohen A. Sylvain Girbal S. David Parello D. Marc Sigler M. Olivier Temam O. Nicolas Vasilache N. ACM Intl. Conf. on Supercomputing (ICS'05), Boston, Massachusetts June 2005 151–160 http:// www-rocq. inria. fr/ ~acohen/ publications/ CGPSTV05. ps. gz A Polyhedral Approach to Ease the Composition of Program Transformations Albert Cohen A. Sylvain Girbal S. Olivier Temam O. Euro-Par'04, Pisa, Italy LNCS 3149 Springer-Verlag August 2004 292–303 http:// www-rocq. inria. fr/ ~acohen/ publications/ CGT04. ps. gz ACME: adaptive compilation made efficient Keith D. Cooper K. D. Alexander Grosul A. Timothy J. Harvey T. J. Steven Reeves S. Devika Subramanian D. Linda Torczon L. Todd Waterman T. Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) 2005 69–77 Adaptive Optimizing Compilers for the 21st Century Keith D. Cooper K. D. Devika Subramanian D. Linda Torczon L. J. Supercomput. 23 1 2002 7–22 http:// dx. doi. org/ 10. 1023/ A:1015729001611 OpenMP: An Industry- Standard API for Shared- Memory Programming Leonardo Dagum L. Ramesh Menon R. IEEE COMPUTATIONAL SCIENCE & ENGINEERING 1998 46-55 Quickly building an optimizer for complex embedded architectures Michaël Dupré M. Nathalie Drach N. Olivier Temam O. International Symposium on Code Generation and Optimization ACM/IEEE Mar 2004 Asim: A Performance Model Framework. Joel S. Emer J. S. Pritpal Ahuja P. Eric Borch E. Artur Klauser A. Chi-Keung Luk C.-K. Srilatha Manne S. Shubhendu S. Mukherjee S. S. Harish Patil H. Steven Wallace S. Nathan L. Binkert N. L. Roger Espasa R. Toni Juan T. 
IEEE Computer 35 2 2002 68-76 Sequoia: Programming the Memory Hierarchy K. Fatahlian K. T. J. Knight T. J. M. Houston M. M. Erez M. D. R. Horn D. R. L. Leem L. J. Y. Park J. Y. M. Ren M. A. Aiken A. W. J. Dally W. J. P. Hanrahan P. Supercomputing 2006, Tampa, Florida November 2006 Dataflow Analysis of Array and scalar references Paul Feautrier P. Int. J. of Parallel Programming 20 1 1991 23-53 Some efficient solutions to the affine scheduling problem I. One-dimensional time Paul Feautrier P. Int. J. of Parallel Programming 21 5 1992 313-347 Some efficient solutions to the affine scheduling problem II. Multi-dimensional time Paul Feautrier P. Int. J. of Parallel Programming 21 6 1992 389-420 FLEXUS http:// www. ece. cmu. edu/ ~simflex/ flexus. html Probabilistic Source-Level Optimisation of Embedded Programs Bjoern Franke B. Mike F. P. O'Boyle M. F. P. J. Thomson J. Grigori Fursin G. Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES) 2005 Building a Practical Iterative Interactive Compiler Grigori Fursin G. Albert Cohen A. 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'07), colocated with HiPEAC 2007 conference, Ghent, Belgium January 2007 A Practical Method For Quickly Evaluating Program Optimizations Grigori Fursin G. Albert Cohen A. M. O'Boyle M. Olivier Temam O. Intl. Conf. on High Performance Embedded Architectures and Compilers (HiPEAC'05), Barcelona, Spain LNCS 3793 Springer-Verlag November 2005 29–46 http:// hal. inria. fr/ inria-00001054/ en/ Quick and practical run-time evaluation of multiple program optimizations Grigori Fursin G. Albert Cohen A. Michael O'Boyle M. Olivier Temam O. Trans. on High Performance Embedded Architectures and Compilers 1 1 2006 13-31 Quick and practical run-time evaluation of multiple program optimizations Grigori Fursin G. Albert Cohen A. Mike F. P. O'Boyle M. F. P. Olivier Temam O. Trans. 
on High Performance Embedded Architectures and Compilers 1 1 January 2007 13-31 MILEPOST GCC: machine learning based research compiler Grigori Fursin G. Cupertino Miranda C. Olivier Temam O. Mircea Namolaru M. Elad Yom-Tov E. Ayal Zaks A. Bilha Mendelson B. Phil Barnard P. Elton Ashton E. Eric Courtois E. François Bodin F. Edwin Bonilla E. John Thomson J. Hugh Leather H. Chris Williams C. Michael O'Boyle M. Proceedings of the GCC Developers' Summit June 2008 Evaluating Iterative Compilation Grigori Fursin G. Mike F. P. O'Boyle M. F. P. P.M.W. Knijnenburg P. Proc. Languages and Compilers for Parallel Computers (LCPC) 2002 305-315 GCC ICI: Interactive Compilation Interface http:// gcc-ici. sourceforge. net On the propagation of Ca-dependent plateau and valley potentials in cerebellar Purkinje cells and how they drive the cell output Stéphane Genet S. Bruno Delord B. Loïc Sabarly L. Emmanuel Guigon E. Hugues Berry H. Proceedings of NeuroComp'06, Pont-à-Mousson, France 23-24 October 2006 167–170 MGS: a Rule-Based Programming Language for Complex Objects and Collections Jean-Louis Giavitto J.-L. Olivier Michel O. Electronic Notes in Theoretical Computer Science 59 4 2001 Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies Sylvain Girbal S. Nicolas Vasilache N. Cédric Bastoul C. Albert Cohen A. David Parello D. Marc Sigler M. Olivier Temam O. Intl. J. of Parallel Programming 2006 Accepted with minor revisions NanoFabrics: spatial computing using molecular electronics S. C. Goldstein S. C. M. Budiu M. Proceedings of the 28th annual international symposium on Computer architecture, Göteborg, Sweden ACM Press 2001 178–191 IDDCA: A New Clustering Approach For Sampling Daniel Gracia Pérez D. Hugues Berry H. Olivier Temam O. MoBS: Workshop on Modeling, Benchmarking, and Simulation MoBS: Workshop on Modeling, Benchmarking, and Simulation, Madison, Wisconsin 2005 http:// hal. inria. 
fr/ inria-00001062/ en/ Budgeted Region Sampling (BeeRS): Do Not Separate Sampling From Warm-Up, And Then Spend Wisely Your Simulation Budget Daniel Gracia Pérez D. Hugues Berry H. Olivier Temam O. 5th IEEE International Symposium on Signal Processing and Information Technology 5th IEEE International Symposium on Signal Processing and Information Technology, Athens, Greece 2006 http:// hal. inria. fr/ inria-00001061/ en/ Blob Computing Frederic Gruau F. Yves Lhuillier Y. Philippe Reitz P. Olivier Temam O. Computing Frontiers 2004 ACM SIGMicro. 2004 http:// blob. lri. fr/ publication/ 2004-model-blob-machine. pdf MiBench: A free, commercially representative embedded benchmark suite. Matthew R. Guthaus M. R. Jeffrey S. Ringenberg J. S. Dan Ernst D. Todd M. Austin T. M. Trevor Mudge T. Richard B. Brown R. B. IEEE 4th Annual Workshop on Workload Characterization, Austin, TX December 2001 Detecting coarse-grain parallelism using an interprocedural parallelizing compiler Mary H. Hall M. H. Saman P. Amarasinghe S. P. Brian R. Murphy B. R. Shih-Wei Liao S.-W. Monica S. Lam M. S. Supercomputing '95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), New York, NY, USA ACM Press 1995 49 http:// doi. acm. org/ 10. 1145/ 224170. 224337 Transactional Memory Coherence and Consistency Lance Hammond L. Vicky Wong V. Mike Chen M. Brian D. Carlstrom B. D. John D. Davis J. D. Ben Hertzberg B. Manohar K. Prabhu M. K. Honggo Wijaya H. Christos Kozyrakis C. Kunle Olukotun K. Proceedings of the 31st Annual International Symposium on Computer Architecture IEEE Computer Society June 2004 102 http:// tcc. stanford. edu/ publications/ tcc_isca2004. pdf On the Impact of Data Input Sets on Statistical Compiler Tuning M. Haneda M. P.M.W. Knijnenburg P. H.A.G. Wijshoff H. Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL) 2006 European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC) http:// www. hipeac. 
net Effective Adaptive Computing Environment Management via Dynamic Optimization Shiwen Hu S. Madhavi Valluri M. Lizy Kurian John L. K. IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2005) 2005 Debugging Parallel Systems: A State of the Art Report Joel Huselius J. 63 Mälardalen University, Department of Computer Science and Engineering September 2002 http:// citeseer. ist. psu. edu/ huselius02debugging. html Technical report Simulation of the Lattice QCD and Technological Trends in Computation Khaled Ibrahim K. Julien Jaeger J. Zhen Liu Z. Louis-Noël Pouchet L.-N. Piotr Lesnicki P. Lamia Djoudi L. Denis Barthou D. François Bodin F. Christine Eisenbeis C. Gilbert Grosdidier G. Olivier Péne O. Patrick Roudeau P. arXiv:0808.0391 Aug 2008 submitted to the 14th International Workshop on Compilers for Parallel Computers Technical report CHARM++ : A Portable Concurrent Object-Oriented System Based on C++ L. V. Kale L. V. Sanjeev Krishnan S. Andreas Paepcke A. Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA) ACM Press September 1993 91-108 http:// citeseer. ist. psu. edu/ 95307. html Optimizing Code through Iterative Specialization Minhaj Ahmad Khan M. A. Henri-Pierre Charles H.-P. Denis Barthou D. ACM Symposium on Applied Computing, New York 2008 206–210 Using Message-Driven Objects to Mask Latency in Grid Computing Applications Gregory A. Koenig G. A. Laxmikant V. Kale L. V. 19th IEEE International Parallel and Distributed Processing Symposium April 2005 Fast searches for effective optimization phase sequence P. Kulkarni P. S. Hines S. J. Hiser J. D. Whalley D. J. Davidson J. D. Jones D. Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) 2004 Adaptive Programming of Unconventional Nano-Architectures John W. Lawson J. W. David H. Wolpert D. H. J. Comput. Theor. Nanosci. 3 1986 272-279 AP+SOMT: AgentProgramming SelfOrganized Yves Lhuillier Y. 
Olivier Temam O. International Workshop on Complexity-Effective Design, Munich, Germany ISCA May 2004 A dynamically tuned sorting library X. Li X. M. Garzaran M. D. Padua D. In ACM Conference on Code Generation and Optimization (CGO'04), Palo Alto, California March 2004 High Performance Fortran David B. Loveman D. B. IEEE Parallel Distrib. Technol. 1 1 1993 25–42 http:// dx. doi. org/ 10. 1109/ 88. 219857 Simics: A Full System Simulation Platform Peter S. Magnusson P. S. Magnus Christensson M. Jesper Eskilson J. Daniel Forsgren D. Gustav Hallberg G. Johan Hogberg J. Fredrik Larsson F. Andreas Moestedt A. Bengt Werner B. Computer 35 2 2002 50-58 http:// doi. ieeecomputersociety. org/ 10. 1109/ 2. 982916 Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset Milo M. K. Martin M. M. K. Daniel J. Sorin D. J. Bradford M. Beckmann B. M. Michael R. Marty M. R. Min Xu M. Alaa R. Alameldeen A. R. Kevin E. Moore K. E. Mark D. Hill M. D. David A. Wood D. A. SIGARCH Comput. Archit. News 33 4 2005 92–99 http:// doi. acm. org/ 10. 1145/ 1105734. 1105747 Efficient and Exact Data Dependency Analysis Dror E. Maydan D. E. John L. Hennessy J. L. Monica S. Lam M. S. Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation June 1991 1-14 MILEPOST project media coverage http:// www. milepost. eu/ media. html EU Milepost project (MachIne Learning for Embedded PrOgramS opTimization) A machine learning approach to automatic production of compiler heuristics A. Monsifrot A. François Bodin F. R. Quiniou R. Proc. AIMSA LNCS 2443 2002 41-50 Feedback Assisted Iterative Compilation M. O'Boyle M. P. Knijnenburg P. Grigori Fursin G. Parallel Architectures and Compilation Techniques (PACT'01) IEEE Computer Society Press October 2001 Capsule : Hardware-Assisted Parallel Execution of Component-Based Programs Pierre Palatin P. Yves Lhuillier Y. Olivier Temam O. 
The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, Orlando, Florida December 2006 Towards a Systematic, Pragmatic and Architecture-Aware Program Optimization Process for Complex Processors David Parello D. Olivier Temam O. Albert Cohen A. J.-M. Verdun J.-M. ACM Supercomputing'04, Pittsburgh, Pennsylvania November 2004 15 http:// www-rocq. inria. fr/ ~acohen/ publications/ PTCV04. ps. gz On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance: matrix-multiply revisited. David Parello D. Olivier Temam O. J.-M. Verdun J.-M. SC 2002 1-11 http:// gala. univ-perp. fr/ ~dparello/ publis/ on_increasing_architecture_awareness. pdf Machine Learning, Machine Vision, and the Brain Thomas Poggio T. Christian R. Shelton C. R. The AI Magazine 20 3 1999 37–55 http:// citeseer. ist. psu. edu/ poggio99machine. html GRAPHITE: Loop optimizations based on the polyhedral model for GCC Sébastian Pop S. Albert Cohen A. Cédric Bastoul C. Sylvain Girbal S. P. Jouvelot P. G.-A. Silber G.-A. Nicolas Vasilache N. Proc. of the 4th GCC Developers' Summit, Ottawa, Canada June 2006 A Note on the Performance Distribution of Affine Schedules Louis-Noël Pouchet L.-N. Cédric Bastoul C. John Cavazos J. Albert Cohen A. 2nd Workshop on Statistical and Machine learning approaches to ARchitectures and compilaTion (SMART'08), Göteborg, Sweden January 2008 Iterative optimization in the polyhedral model: Part II, multidimensional time Louis-Noël Pouchet L.-N. Cédric Bastoul C. Albert Cohen A. John Cavazos J. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), Tucson, Arizona June 2008 Iterative optimization in the polyhedral model: Part I, one-dimensional time Louis-Noël Pouchet L.-N. Cédric Bastoul C. Albert Cohen A. Nicolas Vasilache N. 
ACM International Conference on Code Generation and Optimization (CGO'07), San Jose, California March 2007 144–156 The Omega test: A fast and practical integer programming algorithm for dependence analysis W. Pugh W. Comm. of the ACM 8 1992 102-114 Mitosis Compiler: An Infrastructure for Speculative Threading Based on Pre-Computation Slices C. G. Quiñones C. G. C. Madriles C. J. Sánchez J. P. Marcuello P. A. González A. D. M. Tullsen D. M. PLDI '05: Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation ACM Press 2005 Semiconductor Industry Association 2005 roadmap, section on Emerging Research Devices SIA 2005 http:// www. sia-online. org/ Topological and dynamical structures induced by Hebbian learning in random neural networks Benoît Siri B. Hugues Berry H. Bruno Cessac B. Bruno Delord B. Mathias Quoy M. International Conference on Complex Systems, ICCS 2006, Boston, MA, USA June 2006 Learning-induced topological effects on dynamics in neural networks Benoît Siri B. Hugues Berry H. Bruno Cessac B. Bruno Delord B. M. Quoy M. Olivier Temam O. Proceedings of NeuroComp'06, Pont-à-Mousson, France 23-24 October 2006 206–209 Overcoming the challenges to feedback-directed optimization M.D. Smith M. Proc. ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization (Dynamo'00) 2000 SystemC v2.0.1 Language Reference Manual 2003 http:// www. systemc. org/ Component Software: Beyond Object-Oriented Programming Clemens Szyperski C. Addison-Wesley Longman Publishing Co., Inc.

Boston, MA, USA
2002 Small-World Power-Law Interconnects for Nanoscale Computing Architectures Christof Teuscher C. Proceedings of the 6th IEEE Conference on Nanotechnology, IEEE Nano 2006 July 2006 StreamIt: A Compiler for Streaming Applications W. Thies W. M. Karczmarek M. M. Gordon M. D. Maze D. J. Wong J. H. Ho H. M. Brown M. S. Amarasinghe S. December 2001 http:// citeseer. ist. psu. edu/ article/ thies01streamit. html MIT-LCS Technical Memo TM-622, Cambridge, MA Compiler optimization-space exploration S. Triantafyllis S. M. Vachharajani M. N. Vachharajani N. David I. August D. I. Proc. International Symposium on Code Generation and Optimization 2003 204–215 UNISIM: UNIted SIMulation environment http:// unisim. org Microarchitectural Exploration with Liberty Manish Vachharajani M. Neil Vachharajani N. David A. Penry D. A. Jason A. Blome J. A. David I. August D. I. the 34th Annual International Symposium on Microarchitecture, Austin, Texas, USA. December 2001 Polyhedral Code Generation in the Real World Nicolas Vasilache N. Cédric Bastoul C. Albert Cohen A. Proceedings of the International Conference on Compiler Construction (ETAPS CC'06), Vienna, Austria LNCS Springer-Verlag March 2006 185–201 http:// www-rocq. inria. fr/ ~acohen/ publications/ VBC06. ps. gz Violated dependence analysis Nicolas Vasilache N. Cédric Bastoul C. Sylvain Girbal S. Albert Cohen A. Proceedings of the ACM International Conference on Supercomputing (ICS'06), Cairns, Australia ACM June 2006 ADAPT: Automated de-coupled adaptive program transformation M.J. Voss M. R. Eigenmann R. Proc. ICPP 2000 Statistical Modeling of Feedback Data in an Automatic Tuning System R. Vuduc R. J. Bilmes J. J. Demmel J. Proc. 3rd ACM Workshop on Feedback-Directed and Dynamic Optimization 2000 41-50 Vasa: A Simulator Infrastructure with Adjustable Fidelity Dan Wallin D. Håkan Zeffer H. Martin Karlsson M. Erik Hagersten E. 
Proceedings of the 17th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2005), Phoenix, Arizona, USA November 2005 A loop transformation theory and an algorithm to maximize parallelism M.E. Wolf M. M.S. Lam M. IEEE Transactions on Parallel and Distributed Systems 2 4 1991 430-439 From Sequences of Dependent Instructions to Functions: a Complexity-Effective Approach for Improving Performance without ILP or Speculation Sami Yehia S. Olivier Temam O. International Workshop on Complexity-Effective Design ISCA Jun 2003 From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation Sami Yehia S. Olivier Temam O. International Symposium on Computer Architecture May 2004 MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms Daniel Gracia Pérez D. Gilles Mouchard G. Olivier Temam O. MICRO-37: Proceedings of the 37th International Symposium on Microarchitecture IEEE Computer Society Dec 2004 43–54 http:// dx. doi. org/ 10. 1109/ MICRO. 2004. 25