Section: New Results
Compilation and Synthesis for Reconfigurable Platform
Participants : Steven Derrien, Emmanuel Casseau, Daniel Menard, François Charot, Christophe Wolinski, Olivier Sentieys, Patrice Quinton.
Polyhedral-Based Loop Transformations for High-Level Synthesis
Participants : Steven Derrien, Antoine Morvan, Patrice Quinton.
After almost two decades of research effort, there is now a wide choice of robust and mature C-to-hardware tools, which are used in production by world-class chip vendors. Although these tools dramatically reduce design time, their ability to generate efficient accelerators is still limited: they rely on the designer to expose parallelism and to use an appropriate data layout in the source program. We believe this limitation can be overcome by tackling the problem directly at the source level, using source-to-source optimizing compilers. More precisely, our aim is to study how polyhedral-based program analysis and transformation can be used to address this problem.
In the context of the PhD of Antoine Morvan, we have studied how to improve the efficiency and applicability of nested loop pipelining (also known as nested software pipelining) in C-to-hardware tools. Loop pipelining is a key transformation in high-level synthesis tools, as it helps maximize both computational throughput and hardware utilization. Nevertheless, it loses much of its benefit on inner loops with small trip counts, because the pipeline latency overhead is paid again at every execution of the inner loop and quickly dominates the useful work.
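The effect of the latency overhead can be illustrated with the usual first-order cycle-count model of a pipelined loop. The sketch below is only an illustration under stated assumptions: the function names, the rectangular loop nest and the numeric values are hypothetical and are not taken from the work described here.

```python
def inner_pipelined_cycles(outer_trips, inner_trips, latency, ii=1):
    """Cycles when only the inner loop is pipelined.

    The pipeline is filled and flushed once per outer iteration, so the
    latency is paid outer_trips times.
    """
    per_outer = (inner_trips - 1) * ii + latency
    return outer_trips * per_outer


def nest_pipelined_cycles(outer_trips, inner_trips, latency, ii=1):
    """Cycles when the whole nest is pipelined as one flattened loop.

    The latency is paid only once for the complete nest.
    """
    total_iterations = outer_trips * inner_trips
    return (total_iterations - 1) * ii + latency


def utilization(useful_iterations, cycles, ii=1):
    """Fraction of cycles in which a new iteration is issued."""
    return useful_iterations * ii / cycles


if __name__ == "__main__":
    # Hypothetical values: 64 x 4 iterations, 10-stage pipeline, II = 1.
    outer, inner, depth = 64, 4, 10
    for name, model in (("inner loop only", inner_pipelined_cycles),
                        ("whole nest", nest_pipelined_cycles)):
        cycles = model(outer, inner, depth)
        print(name, cycles, round(utilization(outer * inner, cycles), 3))
```

With these assumed values, pipelining only the inner loop takes 832 cycles for 256 iterations, whereas pipelining the flattened nest takes 265 cycles.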
Although this limitation can be overcome by pipelining the execution of a whole loop nest, the applicability of nested loop pipelining has so far been limited to a very narrow subset of loops, namely perfectly nested loops with constant bounds. In this work, we have extended the applicability of nested loop pipelining to imperfectly nested loops with affine dependencies. We have shown how such loop nests can be analyzed and how, under certain conditions, the source code can be modified so that nested loop pipelining can be applied, using a method called polyhedral bubble insertion. Our approach has shown encouraging results and led to a publication at the IEEE International Conference on Field Programmable Technology [48] in December 2011.
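The intuition behind inserting bubbles can be sketched on a small example. The code below is not the method of [48], which is not detailed in this section: it is a brute-force illustration that enumerates iterations explicitly, with a hypothetical function name and a hypothetical dependence pattern. It assumes that the flattened nest issues one iteration per cycle, that every dependence goes from an earlier outer iteration to a later one, and that idle slots may only be appended at the end of an inner loop execution.

```python
def insert_bubbles(row_lengths, deps, latency):
    """Return the number of idle slots (bubbles) to append after each row.

    row_lengths[i] : trip count of the inner loop for outer iteration i
    deps           : list of ((pi, pj), (ci, cj)) producer/consumer pairs,
                     where pj and cj are positions inside their rows and
                     pi < ci (dependences carried by the outer loop only)
    latency        : minimum number of cycles between the issue of a
                     producer and the issue of its consumer
    """
    for (pi, _), (ci, _) in deps:
        if pi >= ci:
            raise ValueError("only outer-loop-carried dependences are handled")

    bubbles = [0] * len(row_lengths)
    start = [0] * len(row_lengths)  # issue cycle of the first point of each row
    for i in range(1, len(row_lengths)):
        row_end = start[i - 1] + row_lengths[i - 1]
        earliest = row_end  # start of row i if no bubble is inserted
        for (pi, pj), (ci, cj) in deps:
            if ci == i:
                # The consumer issues at start[i] + cj and must come at
                # least `latency` cycles after its producer.
                earliest = max(earliest, start[pi] + pj + latency - cj)
        bubbles[i - 1] = earliest - row_end
        start[i] = earliest
    return bubbles


if __name__ == "__main__":
    # Hypothetical triangular nest: the inner trip count shrinks from 5 to 1,
    # and the first point of each row feeds the first point of the next row.
    rows = [5, 4, 3, 2, 1]
    deps = [((i, 0), (i + 1, 0)) for i in range(len(rows) - 1)]
    print(insert_bubbles(rows, deps, latency=4))
```

In this assumed example no bubble is needed while the rows are at least as long as the pipeline latency; idle slots appear only once the inner trip count drops below it.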
Reconfigurable Processor Extensions Generation
Participants : Christophe Wolinski, François Charot, Erwan Raffin, Kevin Martin, Antoine Floch.
This year, we have continued our work on the generation of reconfigurable processor extensions using the constraint programming approach. Previously, we showed how all the problems, ranging from instruction identification, scheduling and binding to optimized architecture synthesis, can be defined and solved using constraint programming. This year, a new pattern scheduling approach has been defined. It enables concurrent match selection and parallel match scheduling on the processor and the extension, assuming that execution on an extension is not atomic. This means that the data produced by an extension need not be sent to the processor immediately after the end of processing, which allows better schedules to be obtained [71]. The efficient FPGA implementation of processing units requires the optimization of hardware resources such as registers and multiplexers, so the extension synthesis defined previously has been revisited. For applications from the MediaBench, MiBench and MiCrypt benchmark sets, an improvement of 35% is observed after placement and routing on the Altera Stratix II FPGA.
Run-time Reconfigurable Architecture Modeling
Participants : Christophe Wolinski, François Charot, Emmanuel Casseau, Daniel Menard, Antoine Floch, Erwan Raffin, Steven Derrien.
We have continued to work on the modeling of the run-time reconfigurable, operator-based ROMA multimedia architecture. The ROMA processor is composed of a set of coarse-grained reconfigurable operators, data memories, a configuration memory, two interconnection networks (one between operators, one between operators and memories), and dedicated controllers designed for each module of the datapath. A centralized controller manages the configuration and execution steps. The ROMA processor has three interfaces: a data interface connected to the operator network, and a control interface and a debug interface, both connected to the main controller. The number of operators, the number of memories and their size can be chosen according to application requirements. The compilation flow of our framework relies on an abstract architecture model of the targeted ROMA architecture.
This year, we have focused on the definition of the constraint model used to deploy an application graph on the pipelined architecture model. The goal here is to minimize the latency of the pipeline. The main changes with respect to the non-pipelined model are at the operator and memory levels: the operators are pipelined and the dual-port memories behave like circular buffers. Recall that in the case of the non-pipelined model, the goal is to optimize the execution time of the application under resource constraints. We have carried out experiments to evaluate the quality of our method using different pattern libraries (patterns supported by the ROMA SWP coarse-grained reconfigurable operator, and patterns extracted from the MediaBench set) [47]. In these experiments, the model places no limit on the number of operators or the number of memories. The optimality of the solutions was proven in 93% of cases. More details can be found in [29] and in the Ph.D. thesis of Erwan Raffin [17].
In the context of the RecMotifs project, we have continued to work on a specific design flow integrating STMicroelectronics' compiler flow. This project also allowed us to make significant improvements to our pattern analysis software tools. The RecMotifs flow is a pattern analysis flow for the graphml files generated by the ST compiler. It supports pattern description (description of graphml patterns that can be used in the covering pass), type extraction, pattern generation (generation of patterns from a graphml file), and covering (covering of a graphml file so as to minimize the parallel execution time, without any resource constraints). Once the pattern analysis has been applied to the graphml files, C code can be regenerated using GeCos.
Floating-Point to Fixed-Point Conversion
Participants : Daniel Menard, Karthick Parashar, Olivier Sentieys, Romuald Rocher, Hai-Nam Nguyen.
For the fixed-point conversion process, different optimization algorithms have been tested. The aim is to minimize the implementation cost under an accuracy constraint. In [54], two new algorithms for the word-length optimization procedure, based on the Greedy Randomized Adaptive Search Procedure (GRASP), are proposed. Compared to existing methods, our proposal yields better results, with a complexity between that of deterministic methods and that of stochastic methods.
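The general structure of a GRASP applied to word-length optimization can be sketched as follows. This is a generic GRASP skeleton and not one of the two algorithms of [54], which are not described in this section. The noise model, the linear cost model, the function names and all parameter values are assumptions made for the illustration only.

```python
import random


def noise_power(word_lengths, gains):
    """Toy accuracy model (assumption): each signal adds quantization noise
    proportional to 2^(-2w), weighted by its gain towards the output."""
    return sum(g * 2.0 ** (-2 * w) for w, g in zip(word_lengths, gains))


def cost(word_lengths, unit_costs):
    """Toy implementation cost (assumption): linear in each word length."""
    return sum(c * w for w, c in zip(word_lengths, unit_costs))


def grasp_wordlength(gains, unit_costs, max_noise, w_max=32, w_min=2,
                     iterations=50, rcl_size=2, seed=0):
    """Minimize cost subject to noise_power <= max_noise."""
    rng = random.Random(seed)
    n = len(gains)
    best = [w_max] * n
    if noise_power(best, gains) > max_noise:
        raise ValueError("accuracy constraint infeasible even at w_max")

    for _ in range(iterations):
        # Greedy randomized construction: repeatedly remove one bit from a
        # signal drawn at random from a restricted candidate list (RCL).
        w = [w_max] * n
        while True:
            candidates = []
            for k in range(n):
                if w[k] > w_min:
                    trial = w[:k] + [w[k] - 1] + w[k + 1:]
                    if noise_power(trial, gains) <= max_noise:
                        candidates.append((unit_costs[k], k))
            if not candidates:
                break
            candidates.sort(reverse=True)  # largest cost saving first
            _, k = rng.choice(candidates[:rcl_size])
            w[k] -= 1

        # Local search: give one bit to a cheap signal if this lets a more
        # expensive signal lose one bit without violating the constraint.
        improved = True
        while improved:
            improved = False
            for a in range(n):
                for b in range(n):
                    if a == b or w[a] >= w_max or w[b] <= w_min:
                        continue
                    if unit_costs[b] <= unit_costs[a]:
                        continue
                    trial = list(w)
                    trial[a] += 1
                    trial[b] -= 1
                    if noise_power(trial, gains) <= max_noise:
                        w = trial
                        improved = True

        if cost(w, unit_costs) < cost(best, unit_costs):
            best = w
    return best


if __name__ == "__main__":
    # Hypothetical three-signal design.
    gains, unit_costs = [1.0, 4.0, 0.25], [1.0, 3.0, 2.0]
    w = grasp_wordlength(gains, unit_costs, max_noise=1e-6)
    print(w, cost(w, unit_costs), noise_power(w, gains))
```

The size of the restricted candidate list controls the trade-off mentioned above: with a list of size one the construction is purely greedy and deterministic, while a larger list makes the search behave more like a stochastic method.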