Section: Research Program
Compilers, code optimization and high-level synthesis for software and hardware
Participants: Christophe Alias, Laure Gonnord, Matthieu Moy, Maroua Maalej [2014-2017].
Christophe Alias and Laure Gonnord asked to join the ROMA team temporarily, starting from September 2015. Matthieu Moy (formerly Grenoble INP) joined them in September 2017 to create a new team called CASH, for “Compilation and Analysis, Software and Hardware” (see https://matthieu-moy.fr/spip/?CASH-team-proposal). The proposal was accepted by the LIP laboratory and by Inria's “comité des équipes projet”, and is waiting for final approval from Inria to officially become an “équipe centre”. The text below describes their research domain. The results that they have achieved in 2017 are included in this report.
The advent of parallelism in supercomputers, in embedded systems (smartphones, plane controllers), and in more classical end-user computers increases the need for high-level code optimization and improved compilers. Being able to deal with the complexity of the upcoming software and hardware while keeping energy consumption at a reasonable level is one of the main challenges cited in the HiPEAC Roadmap, which highlights, among others, the following two major issues:
- Enhance the efficiency of the design of embedded systems, and especially the design of optimized specialized hardware.
- Invent techniques to “expose data movement in applications and optimize them at runtime and compile time and to investigate communication-optimized algorithms”.
In particular, the rise of embedded systems and high performance computers in the last decade has generated new problems in code optimization, with strong consequences on the research area. The main challenge is to take advantage of the characteristics of the specific hardware (generic hardware, or hardware accelerators). The long-term objective is to provide solutions that let end-user developers make the best use of the huge opportunities offered by these emerging platforms.
Dataflow models for HPC applications
In the last decades, several frameworks have emerged to design efficient compiler algorithms. The efficiency of all the optimizations performed in compilers strongly relies on performant static analyses and intermediate representations.
The contemporary computer is constantly evolving. New architectures such as multi-core processors, Graphics Processing Units (GPUs) or many-core coprocessors are introduced, resulting in complex heterogeneous platforms.
A consequence of this diversity and heterogeneity is that a given computation can be implemented in many different ways, with different performance characteristics. As an obvious example, changing the degree of parallelism can trade execution time for number of cores. However, many choices are less obvious: for example, increasing the degree of parallelism of a memory-bound application will not improve performance. Most architectures involve a complex memory hierarchy, hence memory access patterns have a considerable impact on performance. The design space to explore to find the best performance is much wider than it used to be with older architectures, and new tools are needed to help the programmer explore it. We believe that the dataflow formalism is a good basis to build such tools as it allows expressing different forms of parallelism.
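The memory-bound example above can be illustrated with a simplified roofline-style estimate. This is only a sketch under stated assumptions: the function name, its parameters and the numbers used below are hypothetical, and the model ignores caches, latency and synchronization.

```python
def attainable_gflops(cores, gflops_per_core, mem_bandwidth_gbs, flops_per_byte):
    """Simplified roofline estimate (assumed model, not from the text).

    Performance is capped either by the aggregate compute capacity of the
    cores or by what the shared memory bandwidth can feed, whichever is lower.
    """
    compute_roof = cores * gflops_per_core           # grows with parallelism
    memory_roof = mem_bandwidth_gbs * flops_per_byte  # independent of core count
    return min(compute_roof, memory_roof)


if __name__ == "__main__":
    # Hypothetical machine: 10 Gflops per core, 100 GB/s of memory bandwidth.
    for cores in (4, 8, 16):
        low = attainable_gflops(cores, 10.0, 100.0, 0.25)   # memory-bound kernel
        high = attainable_gflops(cores, 10.0, 100.0, 10.0)  # compute-bound kernel
        print(cores, low, high)
```

With these assumed numbers, the kernel performing 0.25 flops per byte stays at the memory roof whatever the core count, while the kernel performing 10 flops per byte scales with the number of cores.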
The transverse theme of the CASH proposal is the study of the dataflow model for parallel programs: the dataflow formalism expresses a computation on an infinite stream of values, which can be viewed as the successive values of a variable over time. A dataflow program is structured as a set of communicating processes that exchange values through communication buffers.
Examples of dataflow languages include the synchronous languages Lustre and Signal, as well as SigmaC; the DPN representation [67] (data-aware process network) is an example of a dataflow intermediate representation for a parallelizing compiler.
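The structure described above, processes exchanging values through buffers, can be sketched in a few lines. The following is a minimal illustration and not an implementation of Lustre, Signal, SigmaC or DPN: the process names, the bounded FIFO capacity and the end-of-stream sentinel (needed only because the demo stream is finite) are assumptions made for the example.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker for this finite demo


def producer(out_buf, n):
    """Source process: emits the stream 0, 1, ..., n-1."""
    for i in range(n):
        out_buf.put(i)
    out_buf.put(END)


def scale(in_buf, out_buf, factor):
    """Stateless process: multiplies each incoming value by a constant."""
    while True:
        value = in_buf.get()
        if value is END:
            out_buf.put(END)
            return
        out_buf.put(value * factor)


def accumulate(in_buf, results):
    """Stateful sink process: records the running sum of its input stream."""
    total = 0
    while True:
        value = in_buf.get()
        if value is END:
            return
        total += value
        results.append(total)


def run_network(n=5, factor=2, capacity=2):
    """Wire three processes with two bounded FIFO buffers and run them."""
    a = queue.Queue(maxsize=capacity)
    b = queue.Queue(maxsize=capacity)
    results = []
    processes = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=accumulate, args=(b, results)),
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    print(run_network())
```

Because each process only reads from and writes to its own buffers, the result does not depend on how the threads are interleaved; the three stages form a pipeline (task parallelism), and a stateless stage such as `scale` could additionally be replicated (data parallelism).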
The dataflow model, which expresses at the same time data parallelism and task parallelism, is in our opinion one of the best models for analysis, verification and synthesis of parallel systems. This model will be our favorite representation for our programs. Indeed, it shares the “equational” description of computation and data with the polyhedral model, and the static single assignment representation inside compilers. The dataflow formalism can be used both as a programming language and as an intermediate representation within compilers.
This topic is transverse to the proposal. While we will not a priori restrict ourselves to dataflow applications (we also consider approaches to optimize CUDA and OpenCL code for example), it will be a good starting point and a convergence point for all the members of the team.
Compiler algorithms for irregular applications
In the last decades, several frameworks have emerged to design efficient compiler algorithms. The efficiency of all the optimizations performed in compilers strongly relies on performant static analyses and intermediate representations. Among these representations, the polyhedral model [81] focuses on regular programs, whose execution trace is predictable statically. The program and the data accessed are represented with a single mathematical object endowed with powerful algorithmic techniques for reasoning about it. Unfortunately, most of the algorithms used in scientific computing do not fit entirely in this category.
We plan to explore extensions of these techniques to handle irregular programs with while loops and complex data structures (such as trees and lists). This raises many issues. We cannot finitely represent all the possible execution traces. Which approximation/representation should we choose? Then, how can existing techniques be adapted to approximated traces while preserving correctness? To address these issues, we plan to incorporate new ideas coming from the abstract interpretation community: control flow, approximations, and also shape analysis; and from the termination community: rewriting is one of the major techniques that are able to handle complex data structures and also recursive programs.
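The contrast between regular and irregular programs can be made concrete with a small sketch. The loop nest, the triangular domain and the Collatz-style while loop below are illustrative choices, not examples taken from the text. For the regular loop nest, the set of executed iterations is described exactly by affine constraints on the loop counters and a parameter n; for the while loop, the trip count depends on the data and has no such closed description.

```python
def iteration_domain(n):
    """Static description of a regular loop nest.

    The executed iterations are exactly the integer points of the
    triangle {(i, j) | 0 <= i < n, 0 <= j <= i}.
    """
    return [(i, j) for i in range(n) for j in range(i + 1)]


def executed_trace(n):
    """Runs the loop nest and records each execution of its statement."""
    trace = []
    for i in range(n):
        for j in range(i + 1):
            trace.append((i, j))  # one execution of the loop body
    return trace


def irregular_trip_count(x):
    """Irregular program: the number of iterations depends on the data.

    The loop follows the Collatz rule; no affine formula in x is known
    for its trip count, so the trace can only be approximated statically.
    """
    steps = 0
    while x != 1:
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        steps += 1
    return steps


if __name__ == "__main__":
    assert iteration_domain(4) == executed_trace(4)
    print(len(iteration_domain(4)), irregular_trip_count(6))
```

In the regular case, the domain is known before execution, so the number of iterations (n(n+1)/2) and the order of execution can be reasoned about symbolically. In the irregular case, an analysis has to fall back on an approximation of the trace.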
High-level synthesis for FPGA
Energy consumption has bounded the performance of supercomputers since the end of Dennard scaling. Hence, reducing the electrical energy spent in a computation is the major challenge raised by Exaflop computing. Novel hardware, software, compilers and operating systems must be designed to increase the energy efficiency (in flops/watt) of data manipulation and computation itself. In the last decade, many specialized hardware accelerators (Xeon Phi, GPGPU) have emerged to overcome the limitations of mainstream processors, by trading genericity for energy efficiency. However, the best supercomputers can only reach 8 Gflops/watt [66], which is far less than the 50 Gflops/watt required by an Exaflop supercomputer. An extreme solution would be to give up genericity entirely by using specialized circuits. However, such circuits (application-specific integrated circuits, ASICs) are usually too expensive for the HPC market and lack flexibility. Once manufactured, an ASIC cannot be modified. Any algorithm update (or bug fix) would be impossible, which is clearly not realistic.
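The gap between these two efficiency figures can be made explicit with a short calculation. The sketch below only divides a sustained rate of 1 Exaflop/s (10^18 flops per second) by the energy efficiency. The resulting power figures are derived from the numbers above and are not stated in the text, and the function name is hypothetical.

```python
EXAFLOP = 1e18  # floating-point operations per second
GIGA = 1e9


def power_megawatts(flops_per_second, gflops_per_watt):
    """Electrical power (in MW) needed to sustain a given compute rate
    at a given energy efficiency: P = rate / efficiency."""
    watts = flops_per_second / (gflops_per_watt * GIGA)
    return watts / 1e6


if __name__ == "__main__":
    for efficiency in (8.0, 50.0):
        print(efficiency, "Gflops/watt ->",
              power_megawatts(EXAFLOP, efficiency), "MW")
```

Under these assumptions, an Exaflop machine built at 8 Gflops/watt would draw 125 MW, against 20 MW at 50 Gflops/watt, that is, a factor of 6.25 in energy efficiency remains to be gained.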
Recently, reconfigurable circuits (Field Programmable Gate Arrays, FPGAs) have appeared as a credible alternative for Exaflop computing. Major companies (including Intel, Google, Facebook and Microsoft) show a growing interest in FPGAs, and promising results have been obtained. For instance, in 2015, Microsoft reached 40 Gflops/watt on a data-center deep learning algorithm mapped on Intel/Altera Arria 10 FPGAs. We believe that FPGAs will become the new building block for HPC and Big Data systems. Unfortunately, programming an FPGA is still a big challenge: the application must be defined at circuit level and must make proper use of the logic cells. Hence, there is a strong need for a compiler technology able to map complex applications specified in a high-level language. This compiler technology is usually referred to as high-level synthesis (HLS).
We plan to investigate how to extend the models and the algorithms developed by the HPC community to automatically map a complex application to an FPGA. This raises many issues. How to schedule/allocate the computations and the data on the FPGA in order to reduce the data transfers while keeping a high throughput? How to make optimal use of the resources of the FPGA while keeping a short critical path? To address these issues, we plan to develop novel execution models based on process networks and to extend/cross-fertilize the algorithms developed in both the HPC and high-level synthesis communities. The purpose of the XtremLogic start-up company, co-founded by Christophe Alias and Alexandru Plesco, is to transfer the results of this research to an industrial-level compiler.
Simulation of Systems on a Chip
One of the bottlenecks in the design flow of complex Systems on a Chip (SoCs) is the simulation speed: it is necessary to be able to simulate the behavior of a complete system, including software, before the actual chip is available. Raising the level of abstraction from Register Transfer Level to more abstract simulations like Transaction Level Modeling (TLM) [63] in SystemC [62] has made simulation faster by several orders of magnitude. We are particularly interested in the loosely timed coding style, where the timing of the platform is not modeled precisely, and which allows the fastest simulations. However, SystemC implementations used in production are still sequential, and one more order of magnitude in simulation speed could be obtained with proper parallelization techniques.
Matthieu Moy is a renowned expert in the domain. He has worked on SystemC/TLM simulation with STMicroelectronics for 15 years. He presented part of his work on the subject at the Collège de France with Laurent Maillet-Contoz from STMicroelectronics (http://www.college-de-france.fr/site/gerard-berry/seminar-2014-01-29-17h30.htm).
Matthieu Moy is the co-advisor (with Tanguy Sassolas) of the PhD of Gabriel Busnot, started in November 2017 in collaboration with CEA-LIST. The research will focus on parallelizing SystemC simulation, and the first ideas include compilation techniques to discover process dependencies.
Work on SystemC/TLM parallel execution is both an application of other work on parallelism in the team and a tool complementary to the HLS work presented above. Indeed, some of the parallelization techniques we develop in CASH could apply to SystemC/TLM programs. Conversely, a complete design flow based on HLS often needs fast system-level simulation: the full system usually contains parts designed using HLS, handwritten hardware components, and software.