Alchemy is a joint Inria/University of Paris Sud research group.
The general research topics of the Alchemy group are architectures, languages and compilers for high-performance embedded and general-purpose processors, with an emphasis on scalable architecture and compiler/programming solutions. Alchemy stands for Architectures, Languages and Compilers to Harness the End of Moore Years, referring to both the traditional processor architectures implemented using the current photo-lithographic processes, and novel architecture/language paradigms compatible with future and alternative technologies. The current emphasis of Alchemy is on the former part, and we are progressively increasing our efforts on the latter part.
The research goals of Alchemy span from short term to long term. The short-term goals target existing complex processor architectures, and thus focus on improving program performance on these architectures (software-only techniques). The medium-term goals target the upcoming CMPs (Chip Multi-Processors) with a large number of cores, which will result from the now slower progression of core clock frequency due to technological limitations. The main challenge is to take advantage of the large number of cores for a wide range of applications, considering that automatic parallelization techniques have not yet proved an adequate solution. In Alchemy, we explore joint architecture/programming paradigms as a pragmatic alternative solution. Finally, even longer term research is conducted with the goal of harnessing the properties of future and alternative technologies for processing purposes.
Most of the research in Alchemy attempts to jointly consider the hardware and software aspects, based on the premise that many of the limitations of existing architecture and compiler techniques stem from the lack of cooperation between architects and compiler designers. However, Alchemy addresses the aforementioned research goals through two different, though sometimes complementary, approaches. One approach considers that, in spite of their complexity, architectures and programs can still be accurately and efficiently modeled (and optimized) using analytical methods. The second approach considers that the architecture/program pair has already reached, or will reach, a complexity level that evades analytical methods, and explores a complex systems approach; the principle is to accept that the architecture/program pair is more easily understood (and thus optimized) based on its observed behavior rather than inferred from its known design.
In the sections below, the different research activities of Alchemy are described, from short-term to long-term goals. For most of the goals, both analytical and complex systems approaches are conducted.
This part of our research work is more targeted to single-core architectures but also applies to multi-cores. The rationale for this research activity is that compilers rely on architecture models embedded in heuristics to drive compiler optimizations and strategy. As architecture complexity increases, such models tend to be too simplistic, often resulting in inefficient steering of compiler optimizations.
Our general approach consists in acknowledging that architectures are too complex to embed reliable architecture models in compilers, and in exploring the behavior of the architecture/program pair through repeated executions. Then, using machine-learning techniques, a model of this behavior is inferred from the observations. This approach is usually called iterative optimization.
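The search loop at the heart of this approach can be sketched as follows. This is only a minimal illustration under stated assumptions: the search space, the knob names (`unroll`, `tile`, `vectorize`) and the synthetic `measure` function are hypothetical stand-ins for compiling and timing a real program, and plain random sampling replaces the machine-learning model mentioned above.

```python
import random

# Hypothetical search space: each knob is one transformation parameter.
SEARCH_SPACE = {
    "unroll": [1, 2, 4, 8],
    "tile": [8, 16, 32, 64],
    "vectorize": [False, True],
}


def measure(config):
    """Stand-in for building and timing the program (lower is better).

    A real system would compile with `config` and time an execution;
    this synthetic cost only makes the sketch executable.
    """
    cost = 100.0
    cost -= 5.0 * SEARCH_SPACE["unroll"].index(config["unroll"])
    cost += abs(config["tile"] - 32) / 4.0
    cost -= 20.0 if config["vectorize"] else 0.0
    return cost


def random_config(rng):
    """Draw one point of the transformation space uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def iterative_search(budget, seed=0):
    """Run `budget` executions and keep the best configuration observed."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(budget):
        config = random_config(rng)
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

Calling `iterative_search(200)` returns the best configuration found and its cost. A model-guided search would replace `random_config` with a predictor trained on earlier observations, which is where the convergence speed-ups would come from.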
In recent years, iterative optimization has emerged as a major research trend, both in traditional compilation contexts and in application-specific library generators (like ATLAS or SPIRAL). The topic has matured significantly since the pioneering works of Mike O'Boyle at University of Edinburgh, UK or Keith Cooper at Rice University. While these research works successfully demonstrated the performance potential of the approach, they also highlighted that iterative optimization cannot become a practical technique unless a number of issues are resolved. Some of the key issues are: the size and structure of the search space, the sensitivity to data sets, and the necessity to build long transformation sequences.
Scanning a large search space. Transformation parameters, the order in which transformations are applied, and even which transformations must be applied and how many times, all form a huge transformation space. One of the main challenges of iterative optimization is to rapidly converge towards an efficient, if not optimal, point of the transformation space. Machine-learning techniques can help build an empirical model of the transformation space in a simple and systematic way, based only on the observed behavior of transformations, and then rapidly deduce the most profitable points of the space. We are investigating how to correlate static and dynamic program features with transformation efficiency. This approach can speed up the convergence of the search process by one or two orders of magnitude compared to random search.
We have also shown that by representing the impact of loop transformations using a structured encoding derived from the polyhedral program representation, it is possible to reduce the complexity of the search by several orders of magnitude. This encoding is further described in Section .
Finally, we have found that it is possible to further speed up transformation space exploration by exploring several transformations during a single run. Currently, one program transformation is explored for each loop nest, while performance often reaches a stable state soon after the start of the execution. We have shown that, assuming we properly identify the phase behavior of programs, it is possible to explore multiple transformations in each run.
Data set sensitivity. Iterative optimization is based on the notion that the compiler will discover the best way to optimize a program by repeatedly running the same program on the same data set, trying one or a few different optimizations upon each run. However, in reality, a user rarely needs to execute the same data set twice. Therefore, iterative optimization is based on the implicit assumption that the best optimization configuration found will work well for all data sets of a program. To the best of our knowledge, this assumption has never been thoroughly investigated. Most studies on iterative optimization repeatedly execute the same program/data set pair; only recently have some studies focused on the impact of data sets on iterative optimization.
In order to explore the issue of data set sensitivity, we have assembled a data set suite, of 20 data sets per benchmark, for most of the MiBench embedded benchmarks. We have found that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise configuration across data sets which is often within 5% of the best possible optimization configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). Overall, the preliminary conclusion, at least for the MiBench benchmarks, is that iterative optimization is a fairly robust technique across data sets, which brings it one step closer to practical usage.
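One way to picture the search for a compromise configuration is sketched below. It is an illustration only: the selection criterion (minimizing the worst slowdown relative to the per-data-set best) and the function name `compromise_config` are assumptions made for this sketch, not the procedure used in the study.

```python
def compromise_config(times):
    """Pick the configuration with the smallest worst-case slowdown.

    times[c][d] is the execution time of configuration c on data set d.
    Returns (index of the chosen configuration, its worst slowdown),
    where a slowdown of 1.05 means 5% slower than the best
    configuration for that particular data set.
    """
    n_sets = len(times[0])
    # Best achievable time on each data set, over all configurations.
    best_per_set = [min(row[d] for row in times) for d in range(n_sets)]

    def worst_slowdown(row):
        return max(row[d] / best_per_set[d] for d in range(n_sets))

    scores = [worst_slowdown(row) for row in times]
    chosen = min(range(len(times)), key=scores.__getitem__)
    return chosen, scores[chosen]
```

With made-up timings such as `[[10.0, 20.0, 30.0], [9.0, 26.0, 31.0], [10.2, 20.5, 30.0]]`, the first configuration is selected: it is never the fastest on the first data set, but it is never far from the best on any of them.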
Compositions of program transformations. Compilers impose a certain set of program transformations, an ordering of application, and the number of times each transformation is applied. In order to explore the possible gains beyond these strict constraints, we have manually optimized kernels and benchmarks, trying to achieve the best possible performance assuming no constraint on transformation order, count or selection. The study helped us clarify which transformations bring the best performance improvements in general. But the main conclusion of that study is that surprisingly long compositions of transformations are sometimes needed (in one case, up to 26 composed loop transformations) in order to achieve good performance, either because multiple issues must be tackled simultaneously or because some transformations act as enabling operations for other transformations.
As a result, we have started developing a framework facilitating the composition of long transformations. This framework is based on the polyhedral representation of program transformations. This framework also enables a more analytical approach to program optimization and parallelization, beyond the simple composition of transformations. This latter part is further developed in Section .
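The idea of composing transformations algebraically, and of one transformation enabling another, can be illustrated in a much simpler setting than the polyhedral framework itself. The sketch below is an assumption-laden simplification: it only handles unimodular matrices acting on the iteration vector of a two-deep loop nest, and checks legality on constant dependence distance vectors. All names are hypothetical.

```python
def matmul(a, b):
    """Multiply two small integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def compose(*transforms):
    """Compose transformations; compose(T1, T2) applies T1 first, then T2."""
    result = transforms[0]
    for t in transforms[1:]:
        result = matmul(t, result)
    return result


def apply(t, vector):
    """Apply transformation matrix t to an iteration or distance vector."""
    return [sum(t[i][k] * vector[k] for k in range(len(vector)))
            for i in range(len(t))]


def lex_positive(v):
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not a loop-carried distance


def is_legal(t, dependences):
    """A transformation is legal if every transformed loop-carried
    dependence distance stays lexicographically positive."""
    return all(lex_positive(apply(t, d)) for d in dependences)


INTERCHANGE = [[0, 1], [1, 0]]  # (i, j) -> (j, i)
SKEW = [[1, 0], [1, 1]]         # (i, j) -> (i, i + j)
```

For a loop nest with a single dependence of distance `[1, -1]`, `is_legal(INTERCHANGE, [[1, -1]])` is false, whereas `is_legal(compose(SKEW, INTERCHANGE), [[1, -1]])` is true: skewing acts as an enabling transformation for the interchange, and the whole sequence collapses into the single matrix `[[1, 1], [1, 0]]`.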
Putting it all together: continuous optimization. Increasingly, we are moving toward automating the whole iterative optimization process. Our goal is to bring together, within a single software environment, the different aforementioned observations and techniques (search space techniques, data set sensitivity properties, long compositions of transformations, etc.). We are currently in the process of plugging these different techniques into GCC in order to create a tool capable of doing continuous, whole-program optimization, and even collaborative optimization across different users.
Hardware-oriented applications of iterative optimization. Because iterative optimization can successfully capture complex dynamic/run-time phenomena, we have shown that the approach can act as a replacement for costly hardware structures designed to improve the run-time behavior of programs, such as out-of-order execution in superscalar processors. An iterative optimization-like strategy applied to an embedded VLIW processor was shown to achieve almost the same performance as if the processor were fitted with dynamic instruction reordering support. We are also investigating applications of this approach to the specialization/idiomization of general-purpose and embedded processors. Currently, we are exploring similar approaches for providing thread scheduling and placement information for CMPs without requiring costly run-time environment overhead or hardware support. This latter study is related to the work presented in Section .
Loop transformations are critical for performance and among the hardest optimizations to drive predictably through static cost models, yet their current support in compilers is disappointing. After decades of experience and theoretical advances, the best compilers can miss some of the most important loop transformations in simple numerical codes from linear algebra or signal processing. Performance hits of more than an order of magnitude are not uncommon on single-threaded code, and the situation worsens when automatically parallelizing or optimizing parallel code.
Our previous work on sequences of loop transformations has led to the design of a theoretical framework, based on the polyhedral model, and a set of tools based on the advanced Open64 compiler. We have shown that this framework not only simplifies the problem of building complex transformation sequences, but also scales to real-world benchmarks, significantly reduces the size of the search space, and helps to better understand its structure. The latter work, for example, is the first attempt at directly characterizing all legal and distinct ways to reschedule a loop nest.
After two decades of academic research, the polyhedral model is finally evolving into a mature, production-ready approach to solve the challenges of maximizing the scalability and efficiency of statically-controlled, loop-based computations on a variety of high-performance and embedded targets. After Open64, we are now porting these techniques to the GCC compiler, applying them to several multi-level parallelization and optimization problems, including vectorization and the extraction and exploitation of thread-level parallelism on distributed-memory CMPs like the Cell Broadband Engine from IBM, NXP's CAT-DI scalable signal-processing accelerator and the emerging xStream architecture from STMicroelectronics.
Note: The goal of this section, and of others like it, is not to act as a traditional and exhaustive “related work” section as found in research articles, but rather to provide references to a few research works which are the closest to our own.
While iterative optimization is based on simple principles which were proposed a long time ago, this approach has been significantly developed by Mike O'Boyle at University of Edinburgh since 1997, and more recently by Keith Cooper at Rice University. Since then, many research groups have shown example cases where an iterative approach might be profitable (various application targets, various steps of the compilation process, various architecture components). These researchers have shown that iterative optimization has significant potential. Other research groups (Polaris group at University of Illinois, CAPS at INRIA) have since successfully demonstrated that iterative optimization can be used in practice for the design of libraries, or even that it can be integrated in production compilers to assist existing optimizations. As mentioned before, Alchemy is now focusing on the issues which hinder its practical application.
While Section is only concerned with transforming programs for a more efficient exploitation of existing architectures, in the longer term, researchers can assume modifications of architectures and/or programs are possible. These relaxed constraints make it possible to target the root causes of poor architecture/program performance.
The current architecture/program model partly fails because the burden is either excessively on the architecture (superscalar processors), or the compiler (VLIW and now CMPs). And both compiler and architecture optimizations often aim at program reverse-engineering: compilers attempt to dig up program properties (locality, parallelism) from the static program, while architectures attempt to retrieve them from program run-time behavior. Now, in many cases, the user is not only aware of these properties but may pass them effortlessly to the architecture and the compiler provided she had the appropriate programming support, provided the compiler would pass this information to the architecture, and the architecture would be fitted with the appropriate support to take advantage of them. For instance, simply knowing that a C structure denotes a tree rather than a graph can provide significant information for parallel execution. Such approaches, while not fully automatic, are practical and would relieve the complexity burden of the architecture and the compiler, while extracting significant amounts of task-level parallelism.
In the paragraphs below we apply this approach of passing more program semantics to the compiler and the architecture, first for domain-specific stream-oriented programs, and then for the parallelization of more general programs.
While we are currently investigating the aforementioned approach for general-purpose applications, we have started with the investigation of the specific domain of high-end video processing. In this domain, assessing that real-time properties will be satisfied is as important as reaching uncommon levels of compute density on a chip. 150 giga-operations per second per Watt (on pixel components) is the norm for current high-definition TVs, and cannot be achieved with programmable cores at present. The future standards will need an 8-fold increase (e.g., for 3D displays or super-high-definition). Predictability and efficiency are the keywords in this domain, in terms of both architecture and compiler behavior.
Our approach combines the aforementioned iterative optimization and polyhedral modeling research with a predictability- and efficiency-oriented parallel programming language. We focus on warrantable (as opposed to best-effort) usage of hardware resources with respect to real-time constraints. Therefore, this parallel programming language must allow overhead-free generation of tightly coupled parallel threads, interacting through dedicated registers rather than caches, streaming data through high-bandwidth, statically managed interconnect structures, with frequent synchronizations (once every few cycles), and very limited memory resources immediately available. This language also needs to support advanced loop transformations, and its representation of concurrency must be compatible with the expression of multi-level partitioning and mapping decisions. All these conditions point toward a language closer to hardware synthesis languages than to general-purpose, von Neumann oriented imperative ones.
The synchronous data-flow paradigm is a natural candidate, because of its ability to combine high-productivity in programming complex concurrent applications (due to the determinism and compositionality of the underlying model, a rare feature of a concurrent semantics), direct modeling of computation/communication time, and static checking of non-functional properties (time and resource constraints). Yet generating low-level, tightly fused loops with maximal exposition of fine-grain parallelism from such languages is a difficult problem, as soon as the target processor is not the one being described by the synchronous data-flow program, but a pre-existing target on which we are folding an application program. The two tasks are totally different: although the most difficult decisions are pushed back to the programmer in the hardware synthesis case, application programmers usually rely on the compiler to abstract away the folding of their code in a reasonably portable fashion across a variety of targets. This aspect of synchronous language compilation has largely been overlooked and constitutes the main direction of our work. Another direction lies in the description of hardware resources, at the same level as the application being mapped and scheduled onto them; this unified representation would allow the expression of the search space of program transformations, and would be a necessary step to apply incremental refinement methods (expert-driven, very popular in this domain).
Technically, we extend the classical clock calculus (a type system) of the Lucid Synchrone language, making explicit significantly more information about program behavior, especially when tasks must be started and will be completed, how information flows among tasks, etc. Our main contribution is the integration of relaxed synchronous operators like jittering and bursty streams within synchronous bounds. This research consists in revisiting the semantics of synchronous Kahn networks in the domain of media streaming applications and reconfigurable parallel architectures, in collaboration with Marc Duranton from Philips Research Eindhoven (now NXP Semiconductors) and with Marc Pouzet from LRI and the Proval INRIA project team.
Beyond domain-specific and regular applications (loops and arrays), automatic compiler-based parallelization has achieved only mixed results on programs with complex control and data structures. Writing, and especially debugging, large parallel programs is a notoriously difficult task, and one may wonder whether the vast majority of programmers will be able to cope with it. Currently, transactional memory is a popular approach for reducing the programmer burden using intuitive transaction declarations instead of more complex concurrency control constructs. However, it does not depart from the classic approach of parallelizing standard C/C++/Fortran programs, where parallelism can be difficult to extract or manipulate. Parallel languages, such as HPF, require more ambitious evolutions of programming habits, but they also let programmers pass more semantics about the control and data characteristics of programs to the compiler for easier and more efficient parallelization. However, one can only observe that, for the moment, few such languages have become popular in practice.
A solution would have a better chance of being adopted by the community of programmers at large if it integrated well with popular practices in software engineering, and this aspect of the parallelization problem may have been overlooked. Interestingly, software engineering has recently evolved towards programming models that can blend well with multi-core architectures and parallelization. Programming has consistently evolved towards more encapsulation: procedures, then objects, then components. There are essentially two reasons for this: programmers have difficulties grasping large programs and need to think locally, and encapsulation enables reuse of programming efforts. Component-based programming, as proposed in Java Beans, .Net or more ad-hoc component frameworks, is the step beyond C++ or Java objects: programs are decomposed into modules which fully encapsulate code and data (no global variable) and which communicate among themselves through explicit interfaces/links.
Components have many assets for the task of developing parallel programs. (1) Components provide a pragmatic approach for bringing parallelization to the community at large thanks to component reuse. (2) Components provide an implicit and intuitive programming model: the programmer views the program as a "virtual space" (rather than a sequence of tasks) where components reside; two components residing together in the space and not linked or not communicating through an existing link implicitly operate in parallel; this virtual space can be mapped to the physical space of a multi-threaded/multi-core architecture. (3) Provided the architecture is somehow aware of the program decomposition into components, and can manipulate individual components, the compiler (and the user) would be also relieved of the issue of mapping programs to architectures.
In order to use software components for large-scale and fine-grain parallelization, the key notion is to augment them with the ability to split or replicate. For instance, a component walking a binary tree could spawn two components to scan two child nodes and the corresponding sub-trees in parallel.
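The tree-walking example above can be sketched in software as follows. This is a hedged illustration, not the CAPSULE implementation: the class names, the use of Python threads, and the semaphore that models free execution contexts are all assumptions made for this sketch. The component asks whether a context is free before dividing, and simply continues sequentially when none is.

```python
import threading


class Node:
    """Binary tree node carrying an integer value."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


class TreeSummer:
    """Component that sums a tree and may replicate itself at each node."""

    def __init__(self, max_extra_workers):
        # Tokens model free execution contexts (a hypothetical knob).
        self._free = threading.Semaphore(max_extra_workers)
        self._lock = threading.Lock()
        self.divisions = 0  # number of successful replications

    def walk(self, node):
        if node is None:
            return 0
        can_split = node.left is not None and node.right is not None
        # Non-blocking probe: divide only if a context is available now.
        if can_split and self._free.acquire(blocking=False):
            with self._lock:
                self.divisions += 1
            result = {}

            def replica():
                try:
                    result["left"] = self.walk(node.left)
                finally:
                    self._free.release()

            worker = threading.Thread(target=replica)
            worker.start()
            right = self.walk(node.right)  # the parent keeps one sub-tree
            worker.join()
            return node.value + result["left"] + right
        # No free context: fall back to a plain sequential walk.
        return node.value + self.walk(node.left) + self.walk(node.right)


def build(depth):
    """Build a complete binary tree of the given depth with all values 1."""
    if depth == 0:
        return None
    return Node(1, build(depth - 1), build(depth - 1))
```

`TreeSummer(4).walk(build(6))` and `TreeSummer(0).walk(build(6))` return the same sum; only the amount of replication differs. Because the probe never blocks, a denied division costs little and the walk cannot deadlock waiting for a context.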
We are investigating a low-overhead component-based approach for fine-grain parallelism, called CAPSULE, where components have the ability to replicate. We investigate both a hardware-supported and a software-only approach to component division. We show that a low-overhead component framework, possibly paired with component hardware support, can provide both an intuitive programming model for writing fine-grain parallel programs with complex control flow and data structures, and an efficient platform for parallel components execution.
As explained before, both approaches pursued rely on the same philosophy: pass more program semantics to the compiler and the architecture, though the techniques differ significantly. Naturally, there is a huge body of literature on parallelization, and here, we can only hint at some of the main research directions. One family of approaches relies on automatic parallelization of standard programs, but the automatic parallelization of “complex” applications (complex control flow and data structures) has registered mixed results. Another approach is software/hardware thread-level speculation, but one may question its cost and scalability. As mentioned before, transactional memory has become a popular approach for reducing the burden of parallelizing applications. Other approaches include parallel languages, such as HPF, or parallel directives, such as OpenMP.
Synchronous data-flow languages. The synchronous data-flow approach to the design and optimization of massively parallel, highly compute-efficient and predictable systems is quite unique. It is a long-term, largely fundamental effort motivated by well-established practices in the industry, mostly in the domain of high-definition language programming for hardware synthesis, and combines these practices with the best semantic properties of high-level programming languages. It is a holistic approach to combining productivity and scalability and compute-efficiency in a unified design, targeting the domain of real-time, predictable, stream-oriented parallel systems.
The closest work is the StreamIt language and compiler from MIT, and to a lesser extent, the Sequoia project from Stanford; these two mature projects achieved important contributions in the exposition and exploitation of thread-level parallelism on a coarse-grain distributed-memory, stream-oriented architecture. StreamIt is also much more limited in expressiveness, and Sequoia is more an incremental progress on how to compile and optimize a parallel program than a productivity-oriented design of a new concurrent programming paradigm. We are currently working on a shorter-term, intermediate milestone much closer to these two projects, but one that exposes and exploits multi-level parallelism at all stages of the design-space exploration and in all passes of the compiler.
Software components. Software components, as provided in the .Net or Java Beans frameworks, have little support for parallelism. Several years ago, a few frameworks proposed a component-like approach for parallelizing complex applications on large-scale multiprocessors, especially the Cilk and Charm++ frameworks. However, Cilk does not promote encapsulation; it is essentially a mechanism for spawning C functions. Charm++ provides both encapsulation and spawning, but it targets large-scale multiprocessors, even grid computing, and its overhead is rather large for fine-grain parallelism as required by multi-threaded/multi-core architectures.
Probably the closest work to our hardware support for components is the Network-Driven Processor proposed by Chen et al. which aims at implementing CMP hardware support for Cilk programs. Thread creation decisions are not taken directly by the architecture, they enact any thread spawning decision taken by the Cilk environment, but they provide a sophisticated support for communications and work stealing between processors.
The last research direction stems from possible evolutions of technology. While this research direction may seem very long term, processor manufacturers cannot always afford to investigate many risky alternatives way ahead in time. At the same time, for them to accept and adopt radical changes, they have to be anticipated long in advance. Thus, we believe prospective research is a core role for academic researchers, which may be less immediately useful to companies, but which can bring a real addition to their internal research activities, and which also carries the potential of bringing disruptive technology.
Prospective information on the future of CMOS technology suggests that, though the density of transistors will keep increasing, the switching speed of transistors will not increase as fast, and transistors may be more faulty (either fabrication defects or execution faults). Possible replacement/alternative technologies, such as nanotubes, which have received a lot of attention lately, share many of these properties: high density, but slow components (possibly even slower than current components), a large rate of defects/faults, and components that are difficult to place except in fairly regular structures.
In short, several potential upcoming technologies seem to bring a very large number of possibly faulty and not so fast components with layout issues. For architectures to take advantage of such technology, they would have to rely on space much more than time/speed to achieve high performance. Large spatial architectures bring a set of new architecture issues, such as controlling the execution of a program in a totally decentralized way, efficiently managing the placement of program tasks on the space, and managing the relative movement of these different tasks so as to minimize communications. Furthermore, beyond a certain number of processing elements, it is not even clear whether many applications will embed enough traditional task-level parallelism to take advantage of such large spaces, so applications may have to be expressed (programmed) differently in order to leverage that space. These two research issues are addressed in the two research activities described below.
Blob computing. Blob computing is both a spatial programming and architecture model which aims at investigating the utilization of a vast amount of processing elements. The key originality of the model is to acknowledge that the chip space becomes too large for anything else than purely local actions. As a result, all architecture control becomes local. Similarly, the program itself is decomposed into a set of purely local actions/tasks, called Blobs, connected together through links; the program can create/destroy these links during its lifetime.
With respect to architecture control, for instance, the requirement that two tasks frequently communicating through a link move close together in space, so that their communication latency is low, is expressed through a simple physical law emulating spring tension: the more communications, the higher the tension. Similarly, expressing that tasks should move away because too many tasks are grouped in the same physical spot is achieved through a law similar to pressure: as the number of tasks increases, the local pressure on neighbor tasks increases, inducing them to move away. Overall, many of these local control rules derive from physical or biological laws which achieve the same goals: controlling a large space through simple local interactions.
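The two laws described above can be sketched as a one-dimensional toy simulation. Everything here is an assumption made for illustration: the linear spring force scaled by link traffic, the linear pressure term acting only within a fixed `radius`, the parameter values, and the explicit time step are not taken from the Blob model itself.

```python
def step(positions, links, dt=0.1, spring=1.0, pressure=0.5, radius=1.0):
    """Advance task positions by one time step on a 1-D space.

    positions: dict mapping task name -> coordinate
    links:     dict mapping (task_a, task_b) -> communication traffic
    """
    forces = {name: 0.0 for name in positions}

    # Spring tension: linked tasks attract, more traffic pulls harder.
    for (a, b), traffic in links.items():
        pull = spring * traffic * (positions[b] - positions[a])
        forces[a] += pull
        forces[b] -= pull

    # Pressure: tasks closer than `radius` push each other apart.
    names = list(positions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = positions[b] - positions[a]
            if abs(gap) < radius:
                push = pressure * (radius - abs(gap))
                direction = 1.0 if gap >= 0 else -1.0
                forces[a] -= direction * push
                forces[b] += direction * push

    return {name: positions[name] + dt * forces[name] for name in positions}
```

Starting from `{"a": 0.0, "b": 10.0, "c": 10.2}` with a single link `("a", "b")` of traffic 2.0, one step moves `a` and `b` toward each other, while `c`, which crowds `b`, is pushed away. The simulation loops over all pairs for simplicity, but each force depends only on a link or on immediate neighbors, which is the local character the model relies on.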
With respect to programming, the user essentially has to decompose the program into a set of nodes and links. The program can create a static node/link topology that is later used for computations, or it can dynamically change that topology during execution. But the key concept is that the user is not in charge of placing tasks on the physical space, but only of expressing the potential parallelism through task division. As can be observed, several of the intuitions of the CAPSULE environment of Section stem from this Blob model.
Bio-inspired computing. As mentioned above, beyond a certain number of individual components, it is not even clear whether it will be possible to decompose tasks in such a way that they can take advantage of a large space. Searching for pieces of solution to this problem has progressively led us to biological neural networks. Indeed, biological neural networks (as opposed to artificial neural networks, ANNs) are well-known examples of systems capable of complex information processing tasks using a large number of self-organized, but slow and unreliable components. And the complexity of the tasks typically processed by biological neurons is well beyond what is classically implemented with ANNs.
Emulating the workings of biological neural networks may at first seem far-fetched. However, the SIA (Semiconductor Industry Association) in its 2005 roadmap addresses for the first time “biologically inspired architecture implementations” as emerging research architectures, and focuses on biological neural networks as interesting scalable designs for information processing. More importantly, the computer science community is beginning to realize that biologists have made tremendous progress in the understanding of how certain complex information processing tasks are implemented using biological neural networks.
One of the key emerging features of biological neural networks is that they process information by abstracting it, and then only manipulate such higher abstractions. As a result, each new input (e.g., for image processing) can be analyzed using these learned abstractions directly, thus avoiding the need to rerun a lengthy set of elementary computations. More precisely, Poggio et al. at MIT have shown how combinations of neurons implementing simple operations such as MAX or SUM can automatically create such abstractions for image processing, and some computer science researchers in the image processing domain have started to take advantage of these findings.
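A toy version of such a SUM-then-MAX combination is sketched below on one-dimensional signals. It is only loosely inspired by the description above and is not the model of Poggio et al.: the function names, the fixed (rather than learned) template, and the use of a single global MAX are simplifying assumptions.

```python
def simple_units(signal, template):
    """SUM stage: each unit computes a weighted sum of one input window."""
    width = len(template)
    return [
        sum(w * x for w, x in zip(template, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]


def complex_unit(responses):
    """MAX stage: keep the strongest response, discarding its position."""
    return max(responses)


def abstraction(signal, template):
    """Position-tolerant response of the template over the whole signal."""
    return complex_unit(simple_units(signal, template))
```

With the template `[1, 2, 1]`, the signals `[0, 1, 2, 1, 0, 0, 0]` and `[0, 0, 0, 0, 1, 2, 1]` produce the same output: the MAX stage abstracts away where the pattern occurs, so later stages can work on "pattern present" rather than on raw values.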
We are starting to investigate the information processing capabilities of this abstraction programming method. While image processing is also our first application, we plan to later look at a more diverse set of example applications.
A complex systems approach to computing systems. More generally, the increased complexity of the computing systems at stake, whether due to a large number of individual components, a large number of cores or simply complex architecture/program pairs, suggests that novel design and evaluation methodologies should be investigated that rely less on known design information than on the observed behavior of the resulting global system. The main problem here is to be able to extract general characteristics of the architecture on the basis of measurements of its global behavior. For that purpose, we are using tools provided by the physics of complex systems (nonlinear time series analysis, phase transitions, multi-fractal analysis...).
We have already applied such tools to better understand the performance behavior of complex but traditional computing systems such as superscalar processors, and we are starting to apply them to sampling techniques for performance evaluation. We will progressively expand the reach of these techniques in our future research studies.
While spatial computing is an expression used for many purposes, the Blob computing work in our research group relates more to unconventional spatial programming paradigms such as MGS and Gamma.
There has recently been a surge of research works targeting novel technologies in computer architecture, but they have mostly focused on quantum computing, and, to our knowledge, few have focused on bio-inspired computing.
Furthermore, several researchers in the computer science community have recently started applying ideas from complex systems approaches, but their focus is usually on the software or algorithm part. The use of complex systems approaches in the field of architecture is thus less investigated, although other groups have very recently expressed similar interests.
Since our research group has been involved in both compiler and architecture research for several years, we have progressively given increased attention to tools, partly because we found a lot of productivity was lost in inefficient or hard to reuse tools. Since then, both simulation and compilation platforms have morphed into research activities of their own. Our group is now coordinating the development of the simulation platform of the European HiPEAC network, and it is co-coordinating the development of the compiler research platform of HiPEAC together with University of Edinburgh.
As processor architecture and program complexity increase, so do the development and execution times of simulators. Therefore, we have investigated simulation methodologies capable of increasing our research productivity. The key point is to improve the reuse, sharing, comparison and speed capabilities of simulators. For the first three properties, we are investigating the development of a modular simulation platform, and for the fourth, we are investigating sampling techniques and more abstract modeling techniques. Our simulation platform is called UNISIM.
What is UNISIM? UNISIM is a structural simulation environment which provides an intuitive mapping from the hardware block diagram to the simulator; each hardware block corresponds to a simulation module. UNISIM is also a library of modules where researchers will be able to download and upload (contribute) modules.
What are the assets of UNISIM over other simulation platforms? UNISIM makes it possible to reuse, exchange and compare simulator parts (and architecture ideas), something that is badly needed in academic research, and between academia and industry. Recently, we compared 10 different cache mechanisms proposed over the course of 15 years, and suggested that the progress of research has been anything but regular because of the lack of a common ground for comparison, and because simulation results are easily skewed by small differences in the simulator setup.
Other simulation environments or simulators, such as the SystemC environment or the M5 simulator, also advocate modular simulation for sharing and comparison. While they do improve the modularity of simulators, in practice, reuse is still quite difficult because most simulation environments overlook the difficulty and importance of reusing control. For instance, SystemC focuses on reusing hardware blocks such as ALUs, caches, and so on. However, while hardware blocks correspond to the greatest share of transistors in the actual design, they often correspond to the smallest share of simulator lines. For instance, the cache data and instruction banks often correspond to a sizable number of transistors, but they merely correspond to array declarations in the simulator; conversely, cache control corresponds to few transistors but to most of the source lines of any cache simulator function/module. As a result, it is difficult to achieve reuse in practice, because control code is often not implemented in a way that lends itself to reuse.
In contrast, UNISIM focuses on the reuse of control code, and provides a standardized module communication protocol and a control abstraction for that purpose. Moreover, UNISIM will later on come with an open library in order to better structure the set of available simulators and simulator components.
Taking a realistic approach to simulator usage. Obviously, many research groups will not readily abandon years of investment in their simulation platforms and switch to a new environment. We take a pragmatic approach: UNISIM is designed from the ground up to be interoperable with existing simulators, from industry and academia. We achieve interoperability by wrapping full simulators or simulator parts within UNISIM modules. As an example, we have a full SimpleScalar simulator stripped of its memory, wrapped into a UNISIM module, and plugged into a UNISIM SDRAM module.
Moreover, we are in the process of developing a number of APIs (for power, GUI, functional simulators, sampling,...) which will allow third-party tools to be plugged into the UNISIM engine. We call these APIs simulator capabilities or services.
With CMPs, communications become more important than the cycle-level behavior of cores. While the current version of UNISIM is focused on cycle-level simulators, we are developing a more abstract view of simulators called Transaction-Level Models (TLM). Later on, we will also allow hybrid simulators, using TLM for prototyping, and then zooming in on some components of a complex system.
Because CMPs also require operating system support for a large part, and because existing alternatives such as SIMICS are not open enough, we are also developing full-system support in our new simulators jointly with CEA. Currently, UNISIM has a functional simulator of a PowerPC750 capable of booting Linux.
The free GNU Compiler Collection (GCC) is the leading tool suite for portable developments on open platforms. It supports more than 6 input languages and 30 target processor architectures and instruction sets, with state-of-the-art support for debugging, profiling and cross-compilation. It has long been supported by the general-purpose and high-performance hardware vendors. The last couple of years have seen GCC gaining momentum in the embedded system industry, and also as a platform for advanced research in program analysis, transformation and optimization.
GCC 4.4 features about 200 compilation passes, two thirds of them playing a direct role in program optimization. These passes are selected, scheduled, and parametrized through a versatile pass manager. The main families of passes can be classified as:
inter-procedural analyses and optimizations;
profiling, coverage analysis and instrumentation;
induction variable analysis, canonicalization and strength-reduction;
loop optimization, automatic vectorization and parallelization;
data layout optimization.
More advanced developments involving GCC are in progress in the Alchemy group:
global, whole program optimization (towards link-time and just-in-time compilation), with emphasis on scalability;
transactional memory extensions independent from yet compatible with OpenMP, and a recent foray into data-flow synchronous programming;
polyhedral loop nest optimization, with support for automatic vectorization in the Graphite branch of GCC; this branch was merged into GCC 4.4; it was initiated by the Alchemy group and a former student now at AMD (Sebastian Pop);
automatic parallelization, including the extraction and adaptation of loop and pipeline parallelism, with extensions towards speculative forms of parallelism.
The HiPEAC network supports GCC as a platform for research and development in compilation for high-performance and embedded systems. The network's activities on the compiler platform are coordinated by Albert Cohen.
Simulation (UNISIM). The rationale for the simulation effort and the current situation in the community (dominance of monolithic simulators like SimpleScalar) have been described as part of the earlier presentation of this research activity. While several companies have internal modular simulation environments (ASIM at Intel, TSS at Philips, MaxSim at ARM,...), these are neither standard nor disseminated. Only SystemC is gaining wide acceptance as a modular simulation environment with companies, less so with high-performance academic research groups. The academic research group with the most similar approach is the Liberty group at Princeton University, which has been similarly advocating modular simulation in the past few years. Due to the growing importance of CMP architectures, several research groups have since proposed CMP simulation platforms, some of them with modularity properties, such as M5, Flexus, GEMS or Vasa.
Finally, UNISIM is also participating in a French simulation platform called SoCLib through a recent contract. The technical goals of UNISIM are rather different, as we initially targeted processor decomposition into modules while SoCLib targeted systems-on-chip. As architectures move to multi-cores, the collaboration could become fruitful. UNISIM is also more focused on trying to gather, from the start, groups from different countries in order to increase the chances of adoption.
Compilation (GCC). We are also deeply committed to the enhancement and popularization of GCC as a common compilation research platform. The details of this investment are listed elsewhere in this report. GCC is of course an interesting option for the industry, as development costs surge and returns in performance gains quickly diminish with the complexity of modern architectures. But GCC is also, and for the first time, a serious candidate to help researchers pool development efforts, to experiment with their contributions in a complete tool chain with production codes, to enable the sharing and comparison of these contributions in an open licensing model (a necessary condition for assessing the quality of experimental results), and to facilitate the transfer of these contributions to production environments (with an immediate impact on billions of embedded devices, general-purpose computers and servers). Learning from the failures of a well-known attempt at building a common compiler infrastructure (SUIF-NCI in the late 90s), we follow a pragmatic approach based on joint industry-academia research projects, training (tutorials, courses), and direct contributions to the enhancement of the platform (e.g., for iterative optimization research and automatic parallelization).
Compilers & program optimization:
The WRaP-IT tool (WHIRL Represented as Polyhedra – Interface Tool) is a program analysis and transformation tool implemented on top of the Open64 compiler and of the CLooG code generator. The formal basis of this tool is the polyhedral model for reasoning about loop nests. We introduced a specific polyhedral representation that guarantees strong transformation compositionality properties. This new representation is used to generalize classical loop transformations, to lift the constraints of classical compiler frameworks and to enable more advanced iterative optimization and machine learning schemes. WRaP-IT, together with its loop nest transformation kernel URUK (Unified Representation Universal Kernel), is designed to support a wide range of transformations on industrial codes, starting from the SPEC CPU2000 benchmarks, and recently considering a variety of media and signal processing codes (vision, radar, software radio, video encoding, and DNA-mining in particular, as part of the IST STREP ACOTES, ANR CIGC PARA, and a collaboration with Thales).
Based on this framework, we are also planning an extension of the polyhedral model to handle speculative code generation and transformation of programs with data-dependent control, and a direct search and transformation algorithm based on the Farkas lemma. These developments will take place in the GRAPHITE project: a migration/rewrite of our Open64-based software to the GCC suite. This project is motivated by the maturity of GCC 4.x, in terms of both performance and infrastructure, and by the massive industrial investment in GCC in recent years, especially in the embedded world. We are heavily involved in fostering research projects around GCC as a common compilation platform, and GRAPHITE is one of those projects.
Grigori Fursin developed the first prototype of an iterative optimization API for GCC, and started using this infrastructure for continuous and adaptive optimization research, in collaboration with the University of Edinburgh.
Candl is free software and a library devoted to data dependence computation. It has been developed as a basic building block of our optimizing compilation tool chain in the polyhedral model. From a polyhedral representation of a static control part of a program, it is able to compute exactly the set of statement instances in dependence relation. Hence, its output is useful to build program transformations respecting the original program semantics. This tool has been designed to be robust and precise. It implements some usual techniques for data dependence removal, such as array privatization or array expansion.
Clan is free software and a library that translates some particular parts of high-level programs written in C, C++, C# or Java into a polyhedral representation (strict or extended to irregular control flow). This representation may be manipulated by other tools to, e.g., achieve complex program restructuring (for optimization, parallelization or any other kind of manipulation). It has been created to avoid tedious and error-prone input file writing for polyhedral tools (such as CLooG, LeTSeE, Candl etc.). Using Clan, the user only has to deal with source codes based on the C grammar (such as C, C++, C# or Java).
CLooG is free software and a library to generate code for scanning Z-polyhedra. That is, it finds code (e.g. in C, FORTRAN...) that reaches each integral point of one or more parameterized polyhedra. CLooG was originally written to solve the code generation problem for optimizing compilers based on the polytope model. Nevertheless, it is now used in various areas, e.g. to build control automata for high-level synthesis or to find the best polynomial approximation of a function. CLooG may help in any situation where scanning polyhedra matters. While the user has full control over the quality of the generated code, CLooG is designed to avoid control overhead and to produce very efficient code. Irregular extensions were integrated during 2009 in the irCLooG prototype.
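As an illustration of what "scanning a polyhedron" means, the following minimal sketch enumerates the integral points of the parameterized triangle {(i, j) : 0 <= i <= n, 0 <= j <= i}. The loop nest is written by hand in Python and is not CLooG output; the polyhedron, the function names and the brute-force check are assumptions made for this example only.

```python
def scan_triangle(n):
    """Hand-written loop nest visiting every integral point of
    {(i, j) : 0 <= i <= n, 0 <= j <= i}, in lexicographic order.
    The loop bounds are read directly off the constraints, so no
    conditional test is needed inside the loops."""
    for i in range(0, n + 1):
        for j in range(0, i + 1):
            yield (i, j)


def in_triangle(i, j, n):
    """Membership test written straight from the constraints."""
    return 0 <= i <= n and 0 <= j <= i


def brute_force(n):
    """Reference enumeration: test every point of a bounding box."""
    return [(i, j)
            for i in range(-1, n + 2)
            for j in range(-1, n + 2)
            if in_triangle(i, j, n)]


points = list(scan_triangle(3))
print(len(points))                 # number of integral points for n = 3
print(points == brute_force(3))    # same points, same order
```

The brute-force version pays for a test at every point of the bounding box, whereas the scanning loop nest only visits points that belong to the polyhedron, which is the kind of control overhead a code generator aims to avoid.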
(modular assembly quality analyzer and optimizer,
http://
CAPSULE is our component-like parallelization environment. It consists of a run-time system which enacts task divisions. The environment is publicly disseminated at alchemy.futurs.inria.fr/capsule, along with several CAPSULE-parallelized benchmarks. CAPSULE was developed through several
Processor simulation:
The project can be summarized as an open and continuous exploration of the architecture design space, and takes the form of a service and web site we have just opened, www.archexplorer.org, hosting the software at the server side.
The goal of this project is twofold: to enable a more rigorous methodology in our domain by enabling the comparison of architecture ideas, and to propose a novel architecture design approach which relies on automatic design-space exploration, as an alternative, or at least a complement, to the current design process essentially driven by intuition and experience.
The server-side software is mostly based on UNISIM (www.unisim.org), one of our large developments in software simulation: it corresponds to an environment on top of SystemC for truly enabling sharing, reuse and comparison, by offering a rigorous communication protocol between modules, architecture interfaces, and a set of simulators.
The archexplorer.org project is a joint project with Ghent University, Belgium (Veerle Desmet), and Thales TRT (Sylvain Girbal). I have started the project and I am coordinating the research, though Veerle and Sylvain are doing most of the implementation work; Veerle also has taken an active role in the project and can be considered as co-leading it.
The UNISIM platform has been described in an earlier section. As of now, besides the simulation engine, the developments include a shared-memory CMP based on the PowerPC 405, functional simulators for the PowerPC 405 (and cycle-level), PowerPC 750, a functional system simulator of the PowerPC 750 capable of booting Linux, 10 different cache modules corresponding to various research works. The following simulators or tools are currently under development: a functional and cycle-level version of the ARM 9 with full-system capability, a distributed-memory CMP based on the Power 405 core, an ST231 VLIW functional and later on cycle-level simulator.
During his internship, Taj Khan integrated the CACTI (
http://
Here are the most recent key scientific achievements.
Empirically demonstrating that significant performance gains can be achieved with program optimizations, provided architecture phenomena are better factored in during the optimization process. Observing though that long compositions of program transformations are required.
Releasing the first machine-learning based research compiler (MILEPOST GCC), which combines the Interactive Compilation Interface and a static program feature extractor to automatically predict good program optimizations that reduce execution time, code size and compilation time for a given program on a given architecture, using predictive modeling and statistical techniques. This compiler opens many research opportunities and is used in the EU HiPEAC network of excellence as a default compilation platform. The development of MILEPOST GCC has been coordinated by Grigori Fursin (project coordinator: Michael O'Boyle). IBM issued two press releases about this work, in June 2008 and May 2009.
Showing that it is possible to capture the complex interplays between architecture and program behavior using machine-learning techniques, using that knowledge to drive program optimizations.
Developing multiversioning applications to make static programs adaptable at run-time.
Enabling predictive run-time code scheduling on heterogeneous (CPU-GPU) architectures.
Developing collective optimization approaches leveraging the knowledge of multiple users to transparently and continuously optimize programs or improve default compiler optimization heuristics.
Developing a polyhedral program representation that facilitates the composition of complex transformation sequences.
Addressing the code generation performance issues associated with polyhedral program representation.
Further leveraging polyhedral program representation to propose novel methods for scanning the space of program transformations.
Extending the polyhedral model to irregular control flow (thus significantly increasing its application domain) and demonstrating that the extension allows existing optimization techniques to apply successfully to relevant benchmarks (this work was submitted in 2009 and accepted for publication at Compiler Construction 2010).
We created an open community-driven collaborative wiki-based portal
http://
We studied a method to compute the transitive closure of a union of affine relations on integer tuples. Within Presburger arithmetic, complete algorithms to compute the transitive closure exist for convex polyhedra only. In the presence of non-convex relations, little exists beyond special cases and incomplete heuristics. We introduce novel sufficient and necessary conditions defining a class of relations for which an exact computation is possible. These conditions can be relaxed to define larger classes where conservative approximations and/or more complex closed forms can be obtained. Our method is immediately applicable to a wide range of symbolic computation problems. It is illustrated on representative examples and compared with state-of-the-art approaches.
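To illustrate what an exact closed form for such a closure looks like, here is a small sketch on a bounded domain. It does not implement the method described above: it computes the closure by brute-force fixed-point iteration and compares it with a closed form derived by hand. The example relation (the union of i -> i + 2 and i -> i + 3 on the integers 0..n) and all function names are assumptions chosen for illustration.

```python
def relation(n):
    """Union of two affine relations on 0..n: i -> i + 2 and i -> i + 3."""
    pairs = set()
    for i in range(n + 1):
        for step in (2, 3):
            if i + step <= n:
                pairs.add((i, i + step))
    return pairs


def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`, obtained by
    composing the relation with itself until nothing new appears."""
    closure = set(pairs)
    while True:
        composed = {(a, d)
                    for (a, b) in closure
                    for (c, d) in closure
                    if b == c}
        if composed <= closure:
            return closure
        closure |= composed


def closed_form(n):
    """Hand-derived closed form: every distance k >= 2 can be written as
    2a + 3b with a + b >= 1, so the closure is {i -> j : j - i >= 2}."""
    return {(i, j)
            for i in range(n + 1)
            for j in range(n + 1)
            if j - i >= 2}


print(transitive_closure(relation(12)) == closed_form(12))
```

The brute-force computation only works for a fixed bound, whereas a symbolic closed form such as {i -> j : j - i >= 2} holds for any value of the parameter n.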
Code specialization is a way to obtain significant improvement in the performance of an application. It works by exposing values of different parameters in source code. The availability of these specialized values enables the compilers to generate better optimized code. Although most of the efficient source code implementations contain specialized code to benefit from these optimizations, the real impact of specialization may however vary depending upon the value of the specializing parameter.
We have studied an iterative approach for code specialization. From some specialized code, we search for a better version of the code by re-specializing it, followed by a low-level code analysis. The specialized versions fulfilling the required criteria are then transformed to generate another equivalent version of the original specialized code. The approach, tested on the Itanium2 architecture using the gcc/icc compilers, shows significant improvement in the performance of different benchmarks.
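The basic idea of specialization, exposing parameter values so that simpler code can be produced, can be sketched as follows. This is only an illustration in Python, not the iterative approach or the low-level analysis described above; the function names and the choice of specialized values (stride 1, factor 2) are hypothetical.

```python
def scale_generic(values, stride, factor):
    """Generic version: stride and factor are only known at run time."""
    return [values[i] * factor for i in range(0, len(values), stride)]


def scale_stride1_factor2(values):
    """Specialized version for stride == 1 and factor == 2: the strided
    indexing disappears and the multiplication becomes an addition."""
    return [v + v for v in values]


def scale(values, stride, factor):
    """Dispatcher: use the specialized version when its assumptions hold,
    and fall back to the generic code otherwise."""
    if stride == 1 and factor == 2:
        return scale_stride1_factor2(values)
    return scale_generic(values, stride, factor)


data = [1, 2, 3, 4, 5, 6]
print(scale(data, 1, 2) == scale_generic(data, 1, 2))  # same result
print(scale(data, 2, 3))                               # generic path
```

Whether the specialized version is actually faster depends on the value of the specializing parameters and on the target, which is why the benefit has to be measured rather than assumed.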
This is a joint ANR project “PetaQCD” with Lal (Orsay), Irisa Rennes (Caps/Alf), IRFU (CEA Saclay), LPT (Orsay), Caps Entreprise (Rennes), Kerlabs (Rennes), LPSC (Grenoble).
Simulation of Lattice QCD is a challenging computational problem. Currently, technological trends in computation show multiple divergent models of computation. We are witnessing homogeneous multicore architectures and the use of on-chip or off-chip accelerators, in addition to the traditional architectural models.
On the verge of this technological abundance, assessing the performance tradeoffs of computing nodes based on these technologies is of crucial importance to many scientific computing applications.
In this study, we focus on assessing the efficiency and the performance expected for the Lattice QCD problem on representative architectures, and we project the expected improvement on these architectures and their impact on performance for Lattice QCD. We additionally try to pinpoint the limiting factors for performance on these architectures. This work takes place in ANR PARA and ANR QCDNEXT (both 2005-2008) and has led to the project ANR PetaQCD (2009-2011).
We study a new hierarchical compilation approach for the generation of high-performance applications, relying on the use of state-of-the-art compilers. This approach is not application dependent and does not require any assembly hand-coding. It relies on the decomposition of the loop nests of the hottest functions in the application into simpler kernels, typically 1D to 2D loops, which are much simpler to optimize. We successfully applied this approach to dense linear algebra in 2005, reaching the performance of vendor libraries. The advantage of the generated kernels is that their performance no longer depends on the input data, but only on its location in the memory hierarchy. Using a performance model for the memory hierarchy, it is possible to find the best composition of kernels to use.
For larger applications, the code is no longer regular and data accesses in particular are irregular (use of indirections). Working with applications of the ANR PARA project (MPEG4, QCD, oil simulation and BLAST), we study how to adapt the previous approach to these cases. When control is irregular (involving different execution paths), we study the WCET, in particular in the context of embedded applications for MPSOC architectures. This is the subject of an on-going collaboration with CEA/Lastre.
Instance-wise dataflow analysis provides the exact execution of a statement defining a value that is read at some other point during a program execution. This analysis generates more precise information than traditional dependence analyses and can therefore validate more optimizing transformations. An implementation of this analysis, as a standalone library, has been written by M. Belaoucha (funded by the Teraops and PARMA contracts) and its integration in gcc/Graphite is in progress.
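The difference between instance-wise dataflow information and memory-based dependences can be seen on a two-statement loop that reuses a scalar. The sketch below is a run-time trace, not the static analysis or the library mentioned above; the example loop, the statement names S1 and S2, and the function names are assumptions made for illustration.

```python
# Example loop (assumed for this sketch):
#   for i in 0..n-1:
#       t = B[i]      # statement S1, instance (S1, i), writes t
#       C[i] = t      # statement S2, instance (S2, i), reads t


def exact_sources(n):
    """For each read of t, record the single statement instance that
    produced the value actually read (the last writer of t)."""
    last_writer = {}
    sources = {}
    for i in range(n):
        last_writer["t"] = ("S1", i)             # S1(i) writes t
        sources[("S2", i)] = last_writer["t"]    # S2(i) reads t
    return sources


def memory_based_dependences(n):
    """Memory-based view: S2(r) reads a location written by every
    earlier or current instance S1(w), w <= r."""
    return {(("S1", w), ("S2", r))
            for r in range(n)
            for w in range(r + 1)}


print(exact_sources(3))
print(len(memory_based_dependences(3)))
```

With the exact sources, each instance of S2 depends only on the instance of S1 in the same iteration, so iterations could run in parallel once t is privatized; the memory-based view alone would forbid this.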
Here are the most recent key scientific achievements.
A joint programming/architecture approach for streaming applications which is successfully used at NXP (formerly Philips Semiconductors). An extension of the synchronous Kahn process networks using a relaxed notion of synchrony, called N-synchrony, applied to the efficient and scalable parallelization of media streaming applications.
We have decided to ride a popular trend in software engineering, software components, which blends well with multi-cores: it proposes to decompose a large program into smaller fully independent parts, just like multi-cores consist in decomposing large monolithic architectures into a set of smaller cores. In itself, componentization does not yield much parallelism; our contribution is to augment components with the ability to divide, yielding as much parallelism as resources allow. The programmer is only exposed to this very simple notion of parallelization, and the role of the architecture and/or the run-time system is to manage parallel tasks. We have shown that this approach performs well on programs with irregular control flow behavior and complex data structures, which are typically difficult to parallelize efficiently. We first demonstrated the approach on multi-threaded single-cores, then on shared-memory multi-cores, and have recently implemented the hardware support for distributed-memory multi-cores.
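The notion of a component that divides only when resources allow can be sketched as follows. This is not the CAPSULE API: the `Resources` class, its `probe` method and the `component_sum` function are hypothetical names, and the two halves are executed one after the other here, whereas a real run-time system would run them in parallel.

```python
class Resources:
    """Hypothetical stand-in for the run-time system or hardware that
    decides whether a division request can be granted."""

    def __init__(self, free_workers):
        self.free = free_workers
        self.divisions = 0

    def probe(self):
        """Grant a division only while a worker is still free."""
        if self.free > 0:
            self.free -= 1
            self.divisions += 1
            return True
        return False


def component_sum(values, resources):
    """Component summing a list: it asks to divide, and falls back to
    sequential execution when the request is denied."""
    if len(values) <= 1:
        return sum(values)
    mid = len(values) // 2
    if resources.probe():
        # Division granted: the two halves are independent tasks
        # (run in turn in this sketch).
        left = component_sum(values[:mid], resources)
        right = component_sum(values[mid:], resources)
        return left + right
    # Division denied: process the whole range sequentially.
    return sum(values)


resources = Resources(free_workers=3)
print(component_sum(list(range(10)), resources))
print(resources.divisions)
```

The programmer only states where a division is possible; how many divisions actually happen is decided at run time from the available resources.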
Besides parallelization, the other "spatial" scalability path is customization. Customization, which is very popular in embedded systems, has many assets: custom circuits are cheaper, faster and more power efficient than processors. They can also speed up tasks which are by nature sequential (not parallel), so that they are complementary, not an alternative, to parallelism. Their main limitation is flexibility. As a result, we have investigated techniques which can improve the flexibility of custom circuits while achieving the best possible performance, area and power properties. The first technique, which relied on collapsing processor instructions into circuits, was developed as part of the PhD of Sami Yehia, who went on to work at ARM research to apply such approaches to embedded processors, and later to Thales TRT. More recently, we developed together a novel bottom-up approach where we show how to efficiently combine any number of custom circuits to create a far more flexible compound circuit, without sacrificing the performance, area and power benefits of custom circuits. That approach was recently patented jointly with Thales.
We make the case for considering a hardware ANN as a flexible yet energy efficient, high-performance and defect-resilient accelerator, ideally positioned to tackle upcoming technology, applications and programming challenges. For now, we focus this study on one type of algorithm, classifiers, which are commonly used in many RM applications. We present a hardware accelerator design for ANNs, geared towards robustness and high performance. We show that transistor density has reached a level where it is now possible to spatially expand in hardware an ANN capable of handling medium-sized applications. Spatial expansion has multiple benefits in terms of robustness, energy efficiency, performance and scalability, over previous time-multiplexed designs.
We synthesized our design at 90nm and showed that such a spatially expanded ANN accelerator achieves orders of magnitude reductions in energy, and similar improvements in performance with respect to the same task executed on a modern processor at the same technology node, at a fraction of the on-chip area, justifying scaling down just one core in order to reap the energy and performance benefits.
The 20th century witnessed the crystallization of the neuron as the fundamental building block responsible for higher brain functions. Yet, neurons are not the most numerous cells in the brain. In fact, up to 90% ...
This work is a long-term collaboration with Eshel Ben Jacob, The Maguy-Glass Chair in Physics of Complex Systems, School of Physics and Astronomy, Tel Aviv University, Israel. As a first step, we derived and investigated a concise mathematical model for glutamate-induced astrocytic intracellular Ca2+ dynamics that captures the essential biochemical features of the regulatory pathway of inositol 1,4,5-trisphosphate (IP3). Compared with previous similar models, our three-variable models include a more realistic description of IP3 production and degradation pathways, lumping together their essential nonlinearities within a concise formulation. Using bifurcation analysis and time simulations, we demonstrate the existence of new putative dynamical features. The cross-couplings between IP3 and Ca2+ pathways endow the system with self-consistent oscillatory properties and favor mixed frequency-amplitude encoding modes over pure amplitude-modulation ones.
This article has been selected for the Faculty of 1000 Biology:
http://
In the framework of the ARC Amybia, we are searching for innovative schemes of decentralised and massively distributed computing. We mainly aim at contributing to this at three levels. At the modelling level, we think that biology provides us with complex and efficient models of such massively distributed behaviours. We start our study by addressing the decentralised gathering problem with the help of an original model of aggregation based on the behaviour of social amoebae. At the simulation level, our research mainly relies on achieving large scale simulations and on obtaining large statistical samples. Mastering these simulations is a major scientific issue, especially considering the imposed constraints: distributed computations, parsimonious computing time and memory requirements. Furthermore, it raises further problems, such as: how to handle asynchronism, randomness and statistical analysis? At the hardware level, the challenge is to constantly confront our models with the actual constraints of a true practice of distributed computing. The main idea is to consider the hardware as a kind of sanity check. Hence, we intend to implement and validate our distributed models on massively parallel computing devices. In return, we expect that the analysis of the scientific issues raised by these implementations will influence the definition of the models themselves.
As a first step, we have recently proposed a bio-inspired system based on the so-called Greenberg-Hastings cellular automaton (GHCA), to achieve decentralized and robust gathering of mobile agents scattered on a surface or computing tasks scattered on a massively-distributed computing medium. As usual with such models, GHCA has mainly been studied using a homogeneous and regular lattice. However, in the context of massively distributed computing, one also needs to consider unreliable elements and defect-based noise.
A first analysis showed that in this case, phase transitions could govern the behaviour of the system. Our next goal was to broaden the knowledge on stochastic reaction-diffusion media by investigating how such systems behave when various types of noise are introduced. Hence, we study GHCA where noise and topological irregularities of the grid are taken into account. Decreasing the probability of excitation qualitatively changes the behaviour of the system from an active to an extinct steady state. Simulations show that this change occurs near a critical threshold; it is identified as a nonequilibrium phase transition which belongs to the directed percolation universality class. We test the robustness of the phenomenon by introducing persistent defects in the topology: directed percolation behaviour is conserved. Using experimental and analytical tools, we suggest that the critical threshold varies as the inverse of the average number of neighbours per cell. The inverse proportionality law we presented paves the way for obtaining generic laws (even approximate ones) to predict the position of the critical threshold in various simulation conditions.
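To make the role of the excitation probability concrete, the following is a minimal sketch of a stochastic Greenberg-Hastings automaton. It is not the model studied in the work above: the three-state rule, the four-neighbour torus grid, the random initial condition and all names (`ghca_step`, `simulate`, `p_excite`) are assumptions chosen for illustration.

```python
import random

# Assumed 3-state rule: resting -> excited -> refractory -> resting.
RESTING, EXCITED, REFRACTORY = 0, 1, 2


def ghca_step(grid, p_excite, rng):
    """One synchronous update of the automaton on an n x n torus."""
    n = len(grid)
    new = [[RESTING] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            state = grid[i][j]
            if state == EXCITED:
                new[i][j] = REFRACTORY
            elif state == REFRACTORY:
                new[i][j] = RESTING
            else:
                # Four nearest neighbours with wrap-around (assumption).
                neighbours = (
                    grid[(i - 1) % n][j],
                    grid[(i + 1) % n][j],
                    grid[i][(j - 1) % n],
                    grid[i][(j + 1) % n],
                )
                # A resting cell next to an excited one fires with
                # probability p_excite; this is the noise parameter.
                if EXCITED in neighbours and rng.random() < p_excite:
                    new[i][j] = EXCITED
    return new


def excited_density(grid):
    """Fraction of cells currently excited."""
    n = len(grid)
    return sum(row.count(EXCITED) for row in grid) / (n * n)


def simulate(n, p_excite, steps, seed=0):
    """Run from a random initial grid and record the excited density."""
    rng = random.Random(seed)
    grid = [[rng.choice((RESTING, EXCITED, REFRACTORY)) for _ in range(n)]
            for _ in range(n)]
    densities = []
    for _ in range(steps):
        grid = ghca_step(grid, p_excite, rng)
        densities.append(excited_density(grid))
    return densities
```

Sweeping `p_excite` between 0 and 1 and plotting the long-run density is one way to look for the active-to-extinct change described above. Locating the threshold precisely would need much larger grids and many more samples than this sketch suggests.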
The connectivity structure of complex networks (i.e. their topology) is a crucial determinant of their information transfer properties. Hence, the computation performed by complex neural networks, i.e. neural networks with complex connectivity structure, could also depend on their topology. For instance, recent studies have shown that introducing a small-world topology in a multilayer perceptron increases its performance. However, other studies have inspected the performance of Hopfield or Echo state networks with small-world or scale-free topologies and reported more mixed results.
We study instances of complex neural networks, i.e. neural networks with complex topologies. We use Self-Organizing Map neural networks whose neighborhood relationships are defined by a complex network, to classify handwritten digits. We show that topology has a small impact on performance and robustness to neuron failures, at least at long learning times. Performance may however be increased (by almost 10%) by evolutionary optimization of the network topology. In our experimental conditions, the evolved networks are more random than their parents, but display a more heterogeneous degree distribution. On the limited experiments presented here, it thus seems that the performance of the network is only weakly controlled by its topology. Interestingly, though, these slight differences can nevertheless be exploited by evolutionary algorithms: after evolution, the networks are more random than the initial small-world topology population. Their more heterogeneous connectivity distribution may indicate a tendency to evolve toward scale-free topologies. Unfortunately, this assumption can only be tested with large-size networks, for which the shape of the connectivity distribution can unambiguously be determined, but whose artificial evolution, for computation cost reasons, could not be carried out.
Similarly, future work will have to address other classical computation problems for neural networks before we are able to draw any general conclusion.
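As an illustration of a Self-Organizing Map whose neighborhood is given by a network rather than a regular grid, here is a minimal sketch. It assumes that the neighborhood function decays with hop distance on the graph, and it uses a plain ring as a stand-in topology. The function names, the Gaussian kernel and the parameter values are illustrative assumptions, not the setup used in the study.

```python
import math
import random
from collections import deque


def ring_graph(n, k=1):
    """Adjacency lists of a ring: each node links to its k nearest nodes per side."""
    return {i: sorted({(i + d) % n for d in range(-k, k + 1) if (i + d) % n != i})
            for i in range(n)}


def graph_distances(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def train_som(data, adj, epochs=20, lr=0.5, radius=2.0, seed=0):
    """Train a SOM whose units are the nodes of the graph `adj`."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {u: [rng.random() for _ in range(dim)] for u in adj}
    hops = {u: graph_distances(adj, u) for u in adj}
    for _ in range(epochs):
        for x in data:
            # Best-matching unit: the node whose weights are closest to x.
            bmu = min(adj, key=lambda u: sum((w - xi) ** 2
                                             for w, xi in zip(weights[u], x)))
            # Pull the BMU and its graph neighbours towards x, with a
            # strength that decays with hop distance (assumed Gaussian).
            for u, d in hops[bmu].items():
                h = math.exp(-(d * d) / (2.0 * radius * radius))
                weights[u] = [w + lr * h * (xi - w)
                              for w, xi in zip(weights[u], x)]
    return weights
```

Replacing `ring_graph` with a small-world or scale-free generator changes only the `adj` argument, which is what makes topology a parameter that an evolutionary search could act on.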
Recent advances in the neuroscientific understanding of the brain are bringing about a tantalizing opportunity for building synthetic machines that perform computation in ways that differ radically from traditional Von Neumann machines. These brain-like architectures, which are premised on our understanding of how the human neocortex computes, are highly fault-tolerant, averaging results over large numbers of potentially faulty components, yet manage to solve very difficult problems more reliably than traditional algorithms. A key principle of operation for these architectures is that of automatic abstraction: independent features are extracted from highly disordered inputs and are used to create abstract invariant representations for external entities expressed in the inputs. This feature extraction is applied hierarchically, leading to increasing levels of abstraction at higher layers in the hierarchy.
In collaboration with Mikko Lipasti, University of Wisconsin at Madison, WI, USA, we introduce a behavioral model for this process, using biologically plausible neuron-level behavior and structure, and illustrate it with an image recognition task. We also introduce a computationally effective higher-order model, one that represents the behavior of hundreds of neurons in a cortical column using just two perceptrons, and show that it is capable of the same task. These models are a first step towards developing a comprehensive and biologically plausible understanding of the computational algorithms and microarchitecture of computing systems that mimic the human neocortex.
Beyond a certain number of individual components, it is not even clear whether it will be possible to decompose tasks in such a way that they can take advantage of such a large number of computing resources. Searching for a solution to this problem has progressively led us to biological neural networks. Indeed, biological neural networks (as opposed to artificial neural networks, ANNs) are well-known examples of systems capable of complex information processing tasks using a large number of self-organized, but slow and unreliable components. Moreover, the complexity of the tasks typically processed by biological neurons is well beyond what is classically implemented with ANNs.
Emulating the workings of biological neural networks may at first seem far-fetched. However, the SIA (Semiconductor Industry Association) in its 2005 roadmap addresses for the first time “biologically inspired architecture implementations” as emerging research architectures, and focuses on biological neural networks as interesting scalable designs for information processing. More importantly, the computer science community is beginning to realize that biologists have made tremendous progress in the understanding of how certain complex information processing tasks are implemented using biological neural networks.
One of the key emerging features of biological neural networks is that they process information by abstracting it, and then only manipulate such higher abstractions. As a result, each new input (e.g., for image processing) can be analyzed using these learned abstractions directly, thus avoiding the need to rerun a lengthy set of elementary computations. More precisely, Poggio et al. at MIT have shown how combinations of neurons implementing simple operations such as MAX or SUM can automatically create such abstractions for image processing, and some computer science researchers in the image processing domain have started to take advantage of these findings.
We are starting to investigate the information processing capabilities of this abstraction programming method. While image processing is also our first application, we plan to later look at a more diverse set of example applications.
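The way SUM and MAX units can combine into a position-independent abstraction can be shown with a toy one-dimensional example. This is only a sketch under simplifying assumptions: the template, the signals and the function names are hypothetical, and it is not the model of Poggio et al.

```python
def sum_layer(signal, template):
    """Weighted SUM units: one unit per position, each a dot product with the template."""
    width = len(template)
    return [sum(t * s for t, s in zip(template, signal[i:i + width]))
            for i in range(len(signal) - width + 1)]


def max_layer(responses):
    """MAX unit: keeps the strongest response, wherever it occurred."""
    return max(responses)


def detect(signal, template):
    """SUM layer followed by a MAX unit."""
    return max_layer(sum_layer(signal, template))


pattern = [1, 2, 1]
left = [1, 2, 1, 0, 0, 0, 0]    # pattern at the start of the signal
right = [0, 0, 0, 0, 1, 2, 1]   # same pattern, shifted to the end
print(detect(left, pattern), detect(right, pattern))  # 6 6
```

The SUM units respond to the pattern at one specific position each, while the MAX unit discards the position. The output is therefore the same for both inputs, which is a simple form of the invariant representation discussed above.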
Especially since the work of Bennett on the reversibility of computation and on how to make a computation reversible, the relationship between reversibility, energy, computation and space complexity has gained interest in many areas of computer science. This direction could help us understand the physical limitations of processor performance. We have chosen to start by studying the space complexity of a DAG computation, defined as the maximum number of registers needed for performing the computation in both directions. This criterion is closely related to our more classical criterion of “register saturation”. We have defined heuristics for computing this number and have performed systematic experiments on all possible graphs of a given size. The first experiments tend to show that for a graph of size n, no more than n/2 registers are needed to perform the computations in both directions compared to the forward direction. This latter number can be considered as the “garbage” of the computation. More work is needed to prove or disprove this result more formally and to understand the hypotheses under which it is valid. In this work, all operations in the DAG are assumed to be reversible.
Collaboration with Thales TRT, and the CNRS-Thales lab on several topics: customization, simulation, design-space exploration, heterogeneous systems programming, memristors. As mentioned before, the research work on customization recently led to a joint patent application. Main contact: Sami Yehia.
Collaboration with STMicroelectronics on program parallelization and architecture support for parallelization.
We have had regular collaborations with Philips for almost 10 years now, including direct contracts. Currently, we are involved in several grants with Philips (IP SARC, Marie-Curie fellowships, ACOTES). Philips Semiconductors has recently become NXP.
“PAGDEG” (Causes and consequences of protein aggregation in cellular degeneration): an ANR-funded project (call Piribio) on modeling and simulation of cellular degeneration in bacteria (2010-2012). Supervisor: A. Lindner. Total amount funded: 450 keuros.
Large-Scale initiative “ColAge” (Natural and engineering solutions to the control of bacterial growth and aging: A systems and synthetic biology approach): an INRIA-INSERM joint grant on modeling and simulation of systems biology (2008-2011). Supervisor: H. Berry. Total amount funded: 430 keuros.
(150kEuros) This project aims at designing a novel type of hardware for digital signal processing (sounds, images,...) based on analog neural networks. This design shall be significantly more defect and fault tolerant than previous designs, while achieving very low power. This project is a joint INRIA Alchemy/CEA LETI ANR project as part of the “Return of PostDoc”: we have attracted a young French postdoc at University of California, originally from Supelec, to come back to France and set up this new project (2009-2012).
(20 keuros): “Modeling Cortical Activity and Analysing the Brain Neural Code”, Supervision: B. Cessac (Institut Non Linéaire de Nice). Other partners: Cortex (INRIA Nancy), Institut des Neurosciences cognitives de la Méditerranée (Marseille), Lab. Jean-Alexandre Dieudonné (Nice), Odyssee (INRIA Sophia).
(20 keuros): “Aggregating MYriads of Biologically-Inspired Agents”, Supervision: N. Fates (Maia, INRIA Nancy). Other Participants: B. Girau (Cortex, INRIA Nancy).
(5 keuros): “Multifractal Analysis to Resolve information Transfer In NEural networks”, Supervision: M. Quoy (ETIS, ENSEA, U. Cergy-Pontoise). Other Participants: F. Germinet (AGP, U. Cergy-Pontoise).
(4 keuros): “Organization of a conference on spatial/amorphous computing”, Supervision: H. Berry. Other Participants: F. Gruau (Alchemy), O. Michel, J.L. Giavitto (Ibisc, U. Evry).
: an INRIA-INSERM joint grant (3 years) on modeling and simulation of systems biology (official start Feb. 2009). Supervisor: H. Berry. Total amount funded (for 2008): 41 keuros.
ITEA Call 8 project on global analysis and optimization in GCC. Our involvement lies in the compiler infrastructure, static analysis in the polyhedral model, and feature extraction for global and continuous optimization. With CEA (dpt. of energy), UPM (Spain), SICS (Sweden), major industrial partners (Airbus, Telefonica, Bertin) and SMEs (Mandriva, MySQL, and others). 04/2006–04/2009.
FP6 STREP on language and compiler support for high-performance streaming applications. We are one of the largest contractors in the project, with major involvement in interprocedural optimization and loop transformations for concurrent distributed streaming applications; it is both a programming model and compiler project. With Philips Research (Eindhoven), IBM Research (Haifa), STMicroelectronics (AST Lugano), Nokia (Helsinki), and UPC (Barcelona). 05/2006–05/2009.
FP6 STREP on machine-learning compilation. This project matches one of the core directions of the project: iterative optimization research, with an emphasis on making iterative compilation methods practical in real development environments. With IBM Research (Haifa), ARC (London), CAPS Entreprise (Rennes), IRISA (Rennes), and University of Edinburgh. 05/2006–05/2009.
ANR project on the design of architecture, software tools and algorithms for Lattice Quantum Chromodynamics. With Lal (Orsay), Irisa Rennes (Caps/Alf), IRFU (CEA Saclay), LPT (Orsay), Caps Entreprise (Rennes), Kerlabs (Rennes), LPSC (Grenoble).
ANR CIGC project on multi-level parallel programming and automatic parallelization. We are involved in automatic code generation approaches for domain-specific and target-specific optimizations; iterative and polyhedral compilation methods are explored in an application-specific context. With Bull, University of Versailles, LaBRI (University of Bordeaux), INT (Evry), CAPS Entreprise (Rennes). 01/2006–01/2009.
ANR RNTL project on parallel real-time applications for embedded systems. We are developing a component-based environment called CAPSULE for distributed-memory processors. It will be applied to a novel processor of STMicroelectronics and tested on applications from Thales. With STMicroelectronics, Thales, University of Paris 6, CEA. 01/2006–01/2009.
Marie Curie ToK IAP (Transfer of Knowledge, Industry-Academia Partnership); long-term exchange of personnel and 2 years of post-doc; with Philips Research (Eindhoven) and UPC (Barcelona). 03/2006–03/2009.
FP6 FET Proactive IP on advanced computer architecture. The goal is to address all the aspects of a scalable processor architecture based on multi-cores. It includes programming paradigms, compiler optimization, hardware support and simulation issues. CAPSULE is being used as the component-based programming approach, and UNISIM as the simulation platform. 01/2006–01/2010.
A SYSTEMATIC “Pôle de Compétitivité” regional funding for the development of large-scale embedded multi-core architectures, coordinated by Thales. It will initially focus on streaming applications but it will later target programs with more complex control flow. Thales, Dassault, Thomson, CEA, INRIA. 01/2006–01/2010.
MODSIM is an INRIA grant for a joint international team between INRIA and Princeton University. The goal is the development of the UNISIM simulation platform. With Princeton University. 01/2006–12/2009.
French Ministry of Research grant to explore biological neuron networks as possible sources of inspiration for future computing systems, with a focus on the complex structure of these networks. Our aim is at the same time to investigate bio-inspired computing systems, and original approaches for the modeling and understanding of biological neural networks. With University of Cergy-Pontoise, University of Nice-Sophia-Antipolis and University of Paris 6. 01/2005–01/2008.
HiPEAC is a network of excellence on High-Performance Embedded Architectures and Compilers. It involves more than 70 European researchers from 10 countries and 6 companies, including ST, Infineon and ARM. The goal of HiPEAC is to steer European research on future processor architectures and compilers to key issues, relevant to the European embedded industry.
The HiPEAC consortium has submitted a second edition of the network, which officially started in November 2007 for another four years. Olivier Temam is a member of the steering committee. 09/2004–11/2011.
Mounira Bachir spent a 3-month internship (Jan 14th to April 14th, 2009) at Trinity College Dublin under the direction of David Gregg.
FET OMP: OpenMediaPlatform (OMP) aims at overcoming the cost and time-to-market risks that affect the development of media-rich evolving services for the growing range of networked consumer devices. It will provide an open architecture, combining two main streams of modern software engineering: (1) open application programmers interfaces (API) for media components, to be enhanced over standards like Khronos OpenMAX, and (2) new resource-aware system design tools and standards-complying static/dynamic compilers that ease the design, implementation and efficient execution of media services on a range of consumer platforms. 01/2008–12/2009.
French Ministry of Research grant to study the impact of alternative technologies, particularly nanotubes, on future computing circuits and architectures. With a large array of French laboratories in VLSI and architecture design.
Hugues Berry is a member of GdR Dycoec: “Dynamique et contrôle des ensembles complexes” (
http://
Cédric Bastoul collaborates with Sébastien Salva from Clermont 1 University and Clément Delamare from Direction Générale des Impôts on web service client parallelization. He collaborates with various people at Reservoir Labs Inc. (New York) on high-level compilation for multicore architectures.
Denis Barthou collaborates with these people.
W. Jalby, Univ. of Versailles St Quentin, PRISM lab.
S. Louise, CEA/Lastre.
S. Rajopadhye, U. of Colorado, USA.
Hugues Berry collaborates with these people.
Eshel Ben-Jacob (School of Physics and Astronomy, Tel Aviv University, Israel)
Bruno Cessac (Lab. J.A. Dieudonné, Université Nice-Sophia Antipolis; Team-Project NeuroMathComp, INRIA Sophia)
Bruno Delord, Stéphane Genet (ISIR, CNRS UMR 72222 / Université Pierre et Marie Curie, Paris)
Nazim Fates (MAIA, INRIA Lorraine, Nancy), Bernard Girau (Cortex, INRIA Lorraine, Nancy)
Annick Lesne (LPTMC - UMR CNRS 7600U, Université Pierre et Marie Curie, Jussieu, Paris)
Ariel Lindner (INSERM U571, Faculté de Médecine Necker-Enfants Malades, Paris)
Mikko Lipasti (Dept Electrical & Computer Engineering, University of Wisconsin, Madison, USA)
Olivier Michel, A. Spicher (LACL, U. Paris 12 Créteil)
Marc Schoenauer (TAO, INRIA Saclay)
Grigori Fursin collaborates with the following researchers:
Michael O'Boyle, University of Edinburgh, UK
Chengyong Wu, ICT, China
Nacho Navarro and Marisa Gil, UPC, Spain
Mircea Namolaru, Ayal Zaks, Bilha Mendelson, IBM Haifa, Israel
Francois Bodin, CAPS Entreprise/IRISA, France
Olivier Temam collaborates with these people.
Mikko Lipasti (University of Wisconsin).
Kathryn McKinley (University of Texas).
Veerle Desmet, Lieven Eeckhout (Ghent University).
Chengyong Wu (ICT, Beijing, China)
Daniel Gracia-Perez, Gilles Mouchard (CEA LIST).
Sylvain Girbal, Sami Yehia (Thales TRT).
Bruno Jego (ST).
Collaboration with Prof. Chengyong Wu at ICT, China, on machine-learning techniques for compilers and data centers.
Collaboration with Mikko Lipasti, University of Wisconsin, on bio-inspired architectures.
Collaboration with Kathryn McKinley at University of Texas, Austin, on novel component-based programming approaches for heterogeneous and homogeneous computing systems.
Collaboration with Veerle Desmet at Ghent University, Belgium, on design-space exploration. As part of this collaboration, we recently set up the www.archexplorer.org web site and related project.
Thanks to a 2006-2007 France-Berkeley travel grant, we are starting a collaboration with the group of Jose Renau. The topics are close to the infrastructure work of Alchemy: fast and accurate simulation of multi-core processors, and support for a modern parallelisation infrastructure in GCC. Jose Renau is a member of the OpenSparc consortium and contributed to major advances in architecture and compiler support for thread-level speculation.
For the past 3 years, we have had a very active cooperation with the University of Edinburgh on iterative optimization; Grigori Fursin got his PhD from the University of Edinburgh. This collaboration has resulted in a series of joint articles.
We have a regular collaboration with the group of David Padua, Urbana-Champaign, Illinois, which started 6 years ago, with multiple joint publications and travel grants (CNRS-UIUC). Research focused on high-performance Java, dependence and alias analysis, processors in memory, and currently on adaptive program generation and machine learning compilers.
We started a regular exchange of ideas and personnel with the Parasol laboratory, led by Lawrence Rauchwerger, a reference in parallel language compilation and architecture support. Prof. Rauchwerger visited Alchemy for a total of 5 months in the last 3 years, and many of us visited TAMU for shorter periods. The collaboration led to numerous advances in the understanding of the main challenges and pitfalls in scalable parallel processing, and also facilitated the organization of multiple academic events (e.g., the upcoming PACT'07).
We have a regular collaboration with the group of Prof. Sadayappan, Columbus, Ohio. Recently, we also started to publish together. We invited Uday Bondhugula, PhD student from Ohio, for two months, and Louis-Noël Pouchet will start a postdoc in Ohio in January 2010. The collaboration focuses on polyhedral compilation and new approaches to loop tiling for automatic parallelization.
We have a regular collaboration with the group of Prof. Ramanujam, Baton Rouge, Louisiana. Recently, we also started to publish together. Mohammed Fellahi was scheduled to spend a 3-month internship in Baton Rouge in 2009, but our plans were cancelled because of difficulties in obtaining a US visa. The collaboration focuses on code generation for polyhedral transformations, and automatic parallelization for GPUs.
We have a regular collaboration with UPC, Barcelona, which started 7 years ago, with several groups on topics ranging from program optimization to micro-architecture, resulting in several publications and joint contracts.
We have a regular collaboration with the group of Christian Lengauer and Martin Griebl, Passau, Germany, which started 10 years ago, with multiple joint publications and travel grants (Procope, Ministry of Foreign Affairs). Our collaboration focused on polyhedral compilation techniques and recently headed towards domain-specific program generation and metaprogramming.
We have started a collaboration with physicists working on LQCD (Lattice Quantum Chromodynamics). We focus on the next generation of computers that would gain an order of magnitude speedup over their current APE-next processor (sustained 300 GFlops).
The properties of biological neural networks that are of direct interest to architecture research are in part due to the intrinsic properties of the individual neurons. We are collaborating with the neuroscience research lab ANIM (INSERM U742) to develop simulation and modelling studies of specific properties of individual biological neurons such as time handling or plasticity and memory properties.
We started a collaboration with Marc Schoenauer on evolutionary algorithms for optimization of complex systems. More precisely, we study evolutionary methods to optimize the complex structure of large-size neural networks. The aim is to find whether there exist optimal organizations for the interconnect network of such large systems. This collaboration grounds F. Jiang's Ph.D. work, which is co-supervised and co-funded by the two groups.
For the past 6 years, we have had a regular collaboration with the Laboratoire Sûreté du Logiciel (Software Safety Lab) at CEA LIST on two topics: processor simulation and program optimization. Simulation of complex processor architectures is necessary for the development of software test of complex systems investigated at CEA. Program optimization is more a way to factor in the CEA expertise in static analysis and develop new applications. CEA has funded two scholarships in our group until 2004 and 2005 respectively.
We also have regular contacts with several foreign research groups: the CAPSL group at University of Delaware; and the PASCAL group at University of California Irvine (NSF-INRIA grant).
Hugues Berry collaborates with Bruno Cessac (Institut Non Linéaire de Nice, UMR 6618 CNRS / Université Nice-Sophia Antipolis), Bruno Delord (ANIM, UMR 742 Inserm / Université Pierre et Marie Curie, Paris), Stéphane Genet (ANIM, UMR 742 Inserm / Université Pierre et Marie Curie, Paris), Mathias Quoy (ETIS, UMR 8051 CNRS / Université de Cergy-Pontoise / ENSEA), Olivier Michel (Ibisc, Université d'Evry), Marc Schoenauer (TAO, INRIA Futurs, Orsay), Nazim Fates (MAIA, INRIA Lorraine, Nancy).
1 week visit of Dr. Petros Panayi from U. Cyprus.
2 weeks visit of Razya Ladelsky from IBM Research Haifa, Israel.
2 month visit of Prof. Özcan Özturk from Bilkent University, Ankara, Turkey.
Dr. Marc Duranton (Philips NXP, Eindhoven, Netherlands) visits Alchemy regularly.
Dr. Benoît Dupont de Dinechin (STMicroelectronics, then Kalray, Grenoble) visits Alchemy regularly.
Several visits by Prof. Sadayappan (Ohio State University) and Prof. Ramanujam (Louisiana State University).
Seminar by Prof. Colin Bundwell (University of Pennsylvania) on memory consistency.
Seminar by Prof. Babak Falsafi (EPFL) on cache prefetching.
Seminar by Dr. Sven Verdoolaege (K. U. Leuven) on process networks in the polyhedral model.
Seminar by Prof. Walid Taha (RICE University) on hybrid continuous-discrete systems.
Seminar and tutorial by Prof. François Irigoin, Prof. Fabien Coelho (École des Mines), Prof. Ronan Keryell (Telecom Bretagne and HPC-project) and Prof. Frédérique Chaussumier-Silber (Telecom SudParis), on the PIPS source-to-source compiler.
Visiting Professor at Reservoir Labs Inc. Jan 09 to Dec 09.
Member of the LRI department committee at the University of Paris-Sud since 2006.
Member of the Orsay Technology Institute (IUT d'Orsay) Computer Science department committee since 2006.
Director of the Licence Professionnelle Sécurité des Systèmes et Réseaux Informatiques (Bachelor on System and Network Security) at Orsay Institute of Technology since 2007.
Hugues Berry is a member of the “Scientific Commission” (commission scientifique) of the INRIA Saclay-Ile-de-France research centre.
HiPEAC'06 Summer School course on GCC (55-65 attendees). The support material for the courses and tutorials is freely available (public domain or GPL license) and has been contributed to the main GCC site (
http://
Founding member of IFIP WG 2.11.
President of the recruiting committee (admissibilité) for INRIA Saclay research scientists, 2007, 2008 and 2009.
Member of IFIP WG 10.3.
Member of the “comité de programmes” of Digiteo.
Elected member of the “conseil d'administration” of Inria [2006-].
Elected member of the “comité technique paritaire” of Inria [2006-2009].
Elected member of the “conseil scientifique” of University of Paris-Sud [2008-].
Chair [2008-] of the “commission des utilisateurs des moyens informatiques - recherche” of the Saclay Inria Research Center.
HiPEAC2 Steering Committee, Research workpackage leader, leader of the Research Cluster on simulation.
Program Co-Chair of the 2011 International Conference on High-Performance and Embedded Systems (HiPEAC).
General Chair of the 2011 ACM/IEEE International Symposium on Code Generation and Optimization (CGO), to be organized in Chamonix, France. It is the first time that CGO will be held outside the US.
Leader of the INRIA Alchemy group.
Program Committees:
Program committee member of IEEE HPCC 2009, the 11th IEEE International Conference on High Performance Computing and Communications (HPCC-09), June 2009, Korea University, Seoul, Korea.
Program committee member of DATE 2009, the 12th International Conference on Design, Automation & Test in Europe.
Program committee member of SSS 2009, the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems.
Member of the INSERM commission for systems biology (Institut genetique et developement)
Defended PhD of supervised students: Fei Jiang “Evolution and optimization of large neural networks” (co-Supervised with M. Schoenauer, TAO, INRIA Saclay). PhD in Computer Science, Univ. Orsay-Paris-XI, Dec. 16, 2009
PhD Jury Duty :
M. Valvassori (Dir. A. Ali Cherif), 10 July 2009, University Paris 8 (Rapporteur)
M. Ambard (Dir. D. Martinez & F. Alexandre), 06 June 2009, University Nancy (Rapporteur)
Selection committee for Assistant Professor positions :
Position MCF 744, University Joseph Fourier, Grenoble, Sections 26-27, Biomathematics and Bioinformatics, May 2009
Position MCF 283, University of Evry, Sections 65-27, Cell biology and Bioinformatics, April-May 2009
Reviewer for the ANR Calls “SysComm”
Review editor for the journal “Frontiers in Neurorobotics” (
http://
Co-organization (with Marc Shapiro from INRIA Rocquencourt, Jean Roman from INRIA Bordeaux, and David Devour from CNRS and U. Perpignan) of the INRIA Massively Multicore and Manycore (IMMM) Days, February 4 and 5, 2009. 130 attendees, covering core research and technology as well as application domains impacted by manycore processors.
Editor of the special issue of the Transactions on High Performance and Embedded Architectures and Compilers (HiPEAC Journal) for the best papers of the SHCMP'08 workshop, to appear in 2010.
Program committee member of IEEE conf. on Parallel Architectures and Compilation Techniques (PACT'08 and PACT'09).
Program committee member of the ACM LCTES'10 conference.
Program committee member of the HiPEAC'09 and HiPEAC'10 conference
Program committee member of the 2PARMA'10 workshop on Parallel Programming and Run-time Management Techniques for Many-core Architectures.
Program committee member of the GROW'10 GCC Research Opportunities Workshop.
External program committee member of ISCA'10.
Financial chair and local organization committee of CGO'11.
Thesis proposal committee (external reviewer) of Fréderic De Mesmay, Carnegie Mellon University, USA, February 2009.
PhD thesis committee (external reviewer) of Armin Größlinger, University of Passau, DE, December 2009.
PhD thesis committee (examiner) of Nicolas Geoffray, Paris 6 University (and INRIA Rocquencourt), September 2009.
PhD thesis committee (president) of Matthieu Lemerre, Paris-Sud University (and CEA List), October 2009.
PhD thesis committee (examiner) of Jean-Baptiste Tristan, Paris 6 University (and INRIA Rocquencourt), November 2009.
PhD thesis committee (president) of Lamia Djoudi, University of Versailles, December 2009.
Software and Compilers for Embedded Systems, SCOPES' 2009, April, 2009, Nice, France.
PhD thesis committee (reviewer) of Florent Bouchez, ENS Lyon, April 30th, 2009.
PhD thesis committee (member) of Rémi Baron, LPT, Orsay and CEA, September 18th, 2009.
Program Committee Member of ICPADS'09 (International Conference on Parallel and Distributed Systems), multi-core architectures track
Program Committee Member of iWAPT'09 (International Workshop on Automatic Performance Tuning)
Workshop chair and organizer of GROW'09 (2nd International Workshop on GCC Research Opportunities)
Workshop organizer of SMART'09 (3rd Workshop on Statistical and Machine Learning Approaches applied to Architectures and Compilation)
Program Committee Member of Open64 Workshop at CGO'09
Program Committee of International Conference on Architecture of Computing Systems, 2010.
Program Committee of Workshop on New Directions in Computer Architecture, 2009.
Steering committee member and co-organizer of the Rapido workshop at the HiPEAC Conference, 2009, 2010.
Program committee of Workshop on Statistical and Machine learning approaches applied to ARchitectures and compilaTion (SMART) in 2010.
Program committee of ACM/IEEE International Symposium on Computer Architecture (ISCA) in 2009, 2010.
Program committee of ACM/IEEE International Symposium on Micro-Architecture (MICRO) in 2009.
Program committee of IEEE International Symposium on High-Performance Computer Architecture (HPCA) in 2009.
Associate editor of the HiPEAC Transactions.
Denis Barthou gave these courses:
15h in Master2, UVSQ on vectorization/parallelization,
Summer School INRIA/CEA/EDF on High Performance Computing (June 2008).
Cédric Bastoul gives Java, System, Network and Security lectures and labs at the Orsay Institute of Technology to first, second and third year students (L1 to L3). He also teaches an Object Oriented Programming course at Paris-Sud University to second year students (L2). Lastly, he is teaching computer architecture at École Polytechnique, for third year students (M1).
Anna Beletska gave 9 hours of lectures in the Master 2 “Recherche” of Computer Science of University of Paris-Sud.
Mohamed-Walid Benabderrahmane: teaching assistantship (monitorat), 64 hours at IFIPS, University Paris-Sud 11. Courses: C/C++/C#, Web Services, Security. Level: 5th-year engineering students.
Philippe Dumont: Components of a Computing System, Introduction to Computer Architecture and Operating Systems, École Polytechnique - Licence 3 - 36h
Christine Eisenbeis gave a 3-hour lecture about “Reversible computing” in the Master 2 “Recherche” of Computer Science of University of Paris-Sud.
Olivier Temam teaches a computer architecture course at École Polytechnique to 3rd-year students (approx. 35 hours). He also co-teaches a course on novel processor architectures at University of Paris Sud to Master's students.
Albert Cohen teaches an introductory computing systems course (computer architecture, operating systems, distributed systems) at École Polytechnique to 2nd-year students (approx. 35 hours, 120 students); it was the first course using the Google Android development kit as a virtual platform for lab sessions; an e-book published with Eyrolles came out of this first experiment in 2009. He also teaches an advanced operating systems course to 3rd-year students at École Polytechnique and co-chairs the Electrical Engineering curriculum at École Polytechnique.
The project-team members have given the following talks and attended the following conferences:
LCPC 09, University of Delaware, USA, October 8-10, 2009, “Using The Meeting Graph Framework to Minimise Kernel Loop Unrolling for Scheduled Loops”
Paper presentation at PMEA 2009 (September, Raleigh, North Carolina) Workshop on Programming Models for Emerging Architectures in conjunction with PACT 2009.
Poster presentation at PACT 2009 (September, Raleigh, North Carolina) Intl. Conf. on Parallel Architectures and Compilation Techniques.
Participation in SPC 2009 Fault-Tolerant Spaceborne Computing Employing New Technologies Workshop (May 26-29, Albuquerque, New Mexico).
Cocoa' 2009, talk “Computing the transitive closure of a union of affine integer tuple relations”
ISPDC 2009, talk “Coarse-Grained Loop Parallelization: Iteration Space Slicing vs Affine Transformations”
Poster at PACT 2009, “A Conservative Approach to Manipulate Data-Dependent Control Flow in the Polyhedral Model”, with Louis-Noël Pouchet
Summer school: ACACES 2009, Fifth International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems, July 12 to July 18, 2009, Terrassa (near Barcelona), Spain
“The Effects of Hebbian Learning on the Structure and Dynamics of Chaotic Neural Networks”, given at the Dept. Electrical and Computer Engineering, Univ. Wisconsin at Madison, WI, USA, Jan. 13, 2009 (invited by M. Lipasti).
“Estimating the effects of intrinsic plasticity on neural network dynamics using a realistic model”, at the “Journées Mathématiques du Vivant”, Laboratoire J.A. Dieudonné, Nice, France, March 25, 2009 (invited by B. Cessac)
“ColAge: A systems and synthetic biology approach to the control of bacterial growth and aging”, the 2nd NIH-INRIA workshop on Biomedical Computing, INRIA Rocquencourt, France, June 3, 2009.
“Cell biochemistry in cytoplasms with large molecular crowding: anomalous diffusion and bacterial aging”, at the 2nd Paris Workshop on Multi-Agent Systems in Biology at the Meso or Macroscopic Scales, Univ. Pierre et Marie Curie, Paris, France, June 23, 2009 (invited by M. Beurton-Aimar)
Compilers for Parallel Computers (CPC'09), Zürich, Switzerland, January 7-9, 2009.
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES'09, Grenoble, November 11-15th, 2009.
Participation in MICRO'09 (42nd IEEE/ACM International Symposium on Microarchitecture), New York, USA, December 2009
invited talk, "Collective Tuning Initiative", presented at the University of Versailles, France, May 2009; presented at the HiPEAC industrial workshop and HiPEAC clusters, Infineon, Munich, Germany, June 2009;
paper presentation "Collective Tuning" at the GCC Summit'09, Montreal, Canada, June 2009;
invited talk, "Collective Tuning Initiative: collective optimization, run-time adaptation and machine learning", presented at University of Illinois at Urbana Champaign, USA, April 2009
paper presentation "Collective Optimization" at HiPEAC'09, Cyprus, January 2009
paper presentation "Finding representative sets of optimizations for adaptive multiversioning applications" at SMART'09, Cyprus, January 2009
invited talk (by EU FP7 commission), "MILEPOST project - using machine learning to automate and speed up program optimization for reconfigurable processors", presented at the Information and Brokerage Conference on Information and Communication Technologies in the EU's 7th Framework, Moscow, Russia, October 2008
invited talk, "Enabling Dynamic Optimization and Adaptation for Statically Compiled Programs Using Function Multi-Versioning", presented at ScalPerf'08 (Scalable Approaches to High Performance and High Productivity Computing), Bertinoro, Italy, September 2008
invited talk, "Continuous adaptive program optimizations", presented at Reservoir Labs and IBM TJ Watson Research Center, New York, USA, August 2008;
presented at Imperial College (Software Performance Engineering Laboratory), London, UK, February 2008;
presented at the Institute of Computing Technology (Chinese Academy of Sciences), Beijing, China, January 2008;
invited talk, "Program iterative continuous optimizations, run-time adaptation and machine learning", presented at IBM Toronto Lab (compiler group), Canada, July 2007;
invited talk, "Machine learning techniques for iterative program optimizations and run-time adaptation", presented for the TAO group (machine learning group), LRI, Paris-Sud XI University, INRIA and CNRS, France, June 2007;
invited talk, "Overview of current activities: Interactive Compilation Interface for fine-grain program optimizations, dataset sensitivity, machine learning to speed up optimizations and DSE, run-time program adaptation, optimizations for heterogeneous computing systems, continuous collective optimizations, HiPEAC activities", presented at Intel (compiler group), Moscow, Russia, February 2007 and at the ISP RAS (Institute for System Programming, Russian Academy of Sciences), Moscow, Russia, February 2007
"Continuous run-time adaptation and optimization of statically compiled programs", presented at the UPC, Barcelona, Spain, January 2007.
Seminar at the U. of Delaware, February 2009, Newark DE: “state of the art in polyhedral compilation for production compilers”.
Visit of Reservoir Labs, February 2009, New York.
Visit of the group of Markus Püschel and Franz Franchetti, Carnegie Mellon University, February 2009, Pittsburgh, Pennsylvania.
Visit of the group of Kathryn O'Brien, of Kenneth Zadeck and David Edelsohn at IBM Research Watson, June 2009, Yorktown Heights, New York.
Invited presentation and contribution to a planning meeting for a future European call for research proposals on system- and process-level virtualization, September 2009, Brussels.
Presentation at the second STMicroelectronics-INRIA Platform 2012 meeting, October 2009, Grenoble.
Invited panelist at the LCPC'09 Panel on the future of compilation research and technology, October 2009, Newark, Delaware.
Co-organizer (with Joseph Sifakis, Ahmed Jerraya and Benoît Dupont de Dinechin) of the ESWeek'09 Industrial Panel on compilers for embedded multicore architectures, October 2009, Grenoble.
Presentation at Dagstuhl Seminar 09481 (SYNCHRON'09), December 2009: “A data-flow synchronous perspective to performance portability”.
Seminar at U. Saarbrücken, December 2009: “Languages and compilers for Volkscomputing”.
Seminar at U. Passau, December 2009: “Language and compilers for Volkscomputing”.
Presentation at the second Bull-INRIA-CEA partnership meeting, December 2009, Rocquencourt.
Workshop on PetaScale Computing, First workshop of INRIA-Illinois Petascale Computing Joint Lab, June 10 to June 12, 2009, Paris, France
ACACES Summer School, July 12 to July 18, 2009, Terrassa, Spain
“ERBIUM: A Deterministic, Low-Level Concurrent Representation for Portability and Scalable Performance”, Synchronics day, December 17 2009, Paris, France
Poster at ACACES 2009 International Conference “Bidirectional Libraries for Portable High Performance Parallelism”