Section: New Results

Software Radio Programming Model

Dataflow programming models

Parallel computers have become ubiquitous and current processors contain several execution cores. A variety of low-level tools exist to program these chips efficiently, but the resulting programs are considered hard to write, to maintain, and to debug, because they may exhibit non-deterministic behaviors. A solution is to use the higher-level formalism of dataflow programming to specify only the operations to perform and their dependencies. This paradigm may then be combined with the Polyhedral Model, which allows automatic parallelization and optimization of loop nests. This makes programming easier by delegating the low-level work to compilers and static analyzers [41].
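As an illustration of specifying only operations and their dependencies, the following minimal sketch lets a generic scheduler derive an execution order from a dependency graph. The function `run_dataflow` and the three-node example graph are hypothetical and are not taken from [41]; the sketch only relies on the standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_dataflow(nodes, inputs):
    """Evaluate a dataflow graph.

    nodes:  name -> (function, [names of the values it consumes])
    inputs: name -> externally provided value
    No execution order is given by the programmer; it is derived
    from the dependencies alone.
    """
    graph = {name: deps for name, (_, deps) in nodes.items()}
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in values:          # external input: nothing to compute
            continue
        func, deps = nodes[name]
        values[name] = func(*(values[d] for d in deps))
    return values


# Hypothetical example graph: "scaled" and "shifted" are independent of
# each other, so a parallel runtime would be free to run them concurrently.
example_nodes = {
    "scaled": (lambda x: 2 * x, ["x"]),
    "shifted": (lambda x: x + 1, ["x"]),
    "total": (lambda a, b: a + b, ["scaled", "shifted"]),
}
example_result = run_dataflow(example_nodes, {"x": 3})
```

Because the result depends only on the graph and the inputs, any schedule that respects the dependencies yields the same values, which is the determinism argument made above.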

Existing dataflow runtime systems focus either on the efficient execution of a single dataflow application, or on scenarios where the applications are known a priori. CalMAR is a Multi-Application Dataflow Runtime built on top of the RVC-CAL environment that addresses the problem of executing an a priori unknown number of dataflow applications concurrently on the same multi-core system. Its efficiency has been validated against the traditional RVC-CAL approach [27].

Environments for transiently powered devices

An important research initiative has recently been started in Socrate: the study of the new NVRAM technology and its use in ultra-low-power contexts. NVRAM stands for Non-Volatile Random Access Memory. Non-volatile memory has existed for a while (NAND Flash, for instance) but was not fast enough to be used as main memory. Many emerging technologies are foreseen for non-volatile RAM to replace current RAM [58].

Socrate has started work on the applicability of NVRAM to transiently powered systems, i.e. systems that may undergo a power outage at any time. This study resulted in the Sytare software, presented in a research report and at the IoENT conference [39], [37], [17], and in the launch of an Inria Project Lab: ZEP.

The Sytare software introduces a checkpointing system that takes into account the peripherals (ADC, LEDs, timer, radio communication, etc.) present on all embedded systems. Checkpointing is the natural solution to power outages: regularly save the state of the system in NVRAM so as to restore it when power returns. However, no previous work on checkpointing took into account the restoration of the states of peripherals; Sytare provides this possibility.
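The principle of peripheral-aware checkpointing can be sketched with a small simulation. This is not Sytare's actual implementation or API: the classes `Peripheral` and `Device`, their methods, and the register names are hypothetical, and the model simply assumes that both RAM and peripheral registers are lost on a power outage while NVRAM survives.

```python
import copy


class Peripheral:
    """Hypothetical peripheral whose registers are lost on power failure."""

    def __init__(self, name):
        self.name = name
        self.registers = {}              # volatile hardware configuration

    def configure(self, **settings):
        self.registers.update(settings)

    def power_loss(self):
        self.registers = {}              # hardware returns to its reset state


class Device:
    """Hypothetical device with volatile RAM and persistent NVRAM."""

    def __init__(self, peripherals):
        self.peripherals = {p.name: p for p in peripherals}
        self.ram = {}                    # volatile program state
        self.nvram = {}                  # survives power outages

    def checkpoint(self):
        # Save the program state AND the configuration of every peripheral.
        self.nvram["image"] = {
            "ram": copy.deepcopy(self.ram),
            "peripherals": {
                name: dict(p.registers) for name, p in self.peripherals.items()
            },
        }

    def power_loss(self):
        self.ram = {}
        for p in self.peripherals.values():
            p.power_loss()

    def restore(self):
        image = self.nvram.get("image")
        if image is None:
            return False                 # first boot: nothing to restore
        self.ram = copy.deepcopy(image["ram"])
        for name, regs in image["peripherals"].items():
            self.peripherals[name].configure(**regs)   # replay configuration
        return True


# Usage: progress made after the last checkpoint is lost, but the ADC
# configuration comes back together with the program state.
dev = Device([Peripheral("adc"), Peripheral("timer")])
dev.peripherals["adc"].configure(resolution_bits=12)
dev.ram["samples_taken"] = 7
dev.checkpoint()
dev.ram["samples_taken"] = 8
dev.power_loss()
restored = dev.restore()
```

A checkpoint that saved only `ram` would resume the program with an unconfigured ADC, which is the gap described above.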

Filter synthesis

[46] presents an open-source tool for the automatic design of reliable finite impulse response (FIR) filters, targeting FPGAs. It shows that user intervention can be limited to a very small number of relevant input parameters: a high-level frequency-domain specification, and input/output formats. All the other design parameters are computed automatically, using novel approaches to filter coefficient quantization and direct-form architecture implementation. Our tool guarantees a priori that the resulting architecture respects the specification, while attempting to minimize its cost. Our approach is evaluated on a range of examples and shown to produce designs that are very competitive with the state of the art, with very little design effort.

Linear Time Invariant (LTI) filters are often specified and simulated using high-precision software, before being implemented in low-precision fixed-point hardware. A problem is that the hardware does not behave exactly as the simulation due to quantization and rounding issues. The article [53] advocates the construction of LTI architectures that behave as if the computation was performed with infinite accuracy, then rounded only once to the low-precision output format. From this minimalist specification, it is possible to deduce the optimal values of many architectural parameters, including all the internal data formats. This requires a detailed error analysis that captures not only the rounding errors but also their infinite accumulation in recursive filters. This error analysis then guides the design of hardware satisfying the accuracy specification at the minimal hardware cost. This generic methodology is detailed for the case of low-precision LTI filters in the Direct Form I implemented in FPGA logic. The approach is demonstrated by a fully automated and open-source architecture generator tool, and validated on a range of Infinite Impulse Response filters.

Hardware computer arithmetic

In collaboration with researchers from Istanbul, Turkey, operators have been developed for division by a small positive constant [8]. The first problem studied is the Euclidean division of an unsigned integer by a constant, computing a quotient and a remainder. Several new solutions are proposed and compared against the state of the art. As the proposed solutions use small look-up tables, they match the hardware resources of an FPGA well. The article then studies whether the division by the product of two constants is better implemented as two successive dividers or as one atomic divider. It also considers the case when only a quotient or only a remainder is needed. Finally, it addresses the correct rounding of the division of a floating-point number by a small integer constant. All these solutions, and the previous state of the art, are compared in terms of timing, area, and area-timing product. In general, the relevance domains of the various techniques are very different on FPGA and on ASIC.
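To illustrate why small look-up tables suit this problem, the following sketch implements a generic table-based recurrence for Euclidean division by a constant: the dividend is consumed in chunks of `alpha` bits, most significant first, and each step is a single table look-up. It is given as an illustration of table-based constant division in general and is not claimed to reproduce the architectures of [8]; the function names and parameters are hypothetical.

```python
def make_table(d, alpha):
    """Table: (previous remainder, alpha-bit chunk) -> (quotient digit, new remainder).

    It has d * 2**alpha entries, which stays small when d and alpha are small.
    """
    return {
        (r, c): divmod((r << alpha) + c, d)
        for r in range(d)
        for c in range(1 << alpha)
    }


def divide_by_constant(x, d, alpha, n_chunks):
    """Return (x // d, x % d) for 0 <= x < 2**(alpha * n_chunks)."""
    table = make_table(d, alpha)     # in hardware, these would be fixed LUT contents
    mask = (1 << alpha) - 1
    q, r = 0, 0
    for i in reversed(range(n_chunks)):          # most significant chunk first
        chunk = (x >> (i * alpha)) & mask
        digit, r = table[(r, chunk)]
        # Since r < d, the digit is always < 2**alpha and fits in alpha bits.
        q = (q << alpha) | digit
    return q, r
```

For instance, `divide_by_constant(x, 3, 4, 3)` divides a 12-bit integer by 3 in three look-ups into a 48-entry table. A larger `alpha` means fewer steps but a larger table, which is one of the trade-offs such designs have to balance.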

[23] presents the new framework for semi-automatic circuit pipelining that will be used in future releases of the FloPoCo generator. From a single description of an operator or datapath, optimized implementations are obtained automatically for a wide range of FPGA targets and a wide range of frequency/latency trade-offs. Compared to previous versions of FloPoCo, the level of abstraction has been raised, enabling easier development, shorter generator code, and better pipeline optimization. The proposed approach is also more flexible than fully automatic pipelining approaches based on retiming: in the proposed technique, the incremental construction of the pipeline along with the circuit graph enables architectural design decisions that depend on the pipeline.

FPGAs are well known for their ability to perform non-standard computations not supported by classical microprocessors. Many libraries of highly customizable application-specific IPs have exploited this capability. However, using such IPs usually requires handcrafted HDL, hence significant design effort. High Level Synthesis (HLS) lowers the design effort thanks to the use of C/C++ dialects for programming FPGAs. However, the high-level C language becomes a hindrance when one wants to express non-standard computations: this language was designed for programming microprocessors and carries with it many restrictions due to this paradigm. This is especially true when computing with floating-point, whose data-types and evaluation semantics are defined by the IEEE-754 and C11 standards. If the high-level specification was a computation on the reals, then HLS imposes a very restricted implementation space. [32] attempts to bridge FPGA application-specific efficiency and HLS ease of use. It specifically targets the ubiquitous floating-point summation-reduction pattern. A source-to-source compiler transforms selected floating-point additions into sequences of simpler operators using non-standard arithmetic formats. This improves performance and accuracy for several benchmarks, while keeping the ease of use of a high-level C description.

The previous work uses a variation of Kulisch's proposal to use an internal accumulator large enough to cover the full exponent range of floating-point. With it, sums and dot products become exact operations with one single rounding at the end. This idea failed to materialize in general-purpose processors, as it was considered too slow and/or too expensive in terms of resources. It may however be an interesting option in reconfigurable computing, where a designer may use smaller, more resource-efficient floating-point formats, knowing that sums and dot products will be exact. Another motivation of this work is that these exact operations, contrary to classical floating-point ones, are associative, which enables better compiler optimizations in a High-Level Synthesis context. Kulisch proposed several architectures for the large accumulator, all using a sign/magnitude representation: the internal accumulator always represents a positive significand. [52] introduces an architecture using a 2's complement representation instead, and demonstrates improvements over Kulisch's proposal in both area and speed.
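The behavior of such an accumulator can be mimicked in software. The sketch below is an illustration only, not the architecture of [52]: it assumes finite binary64 inputs, uses Python's unbounded signed integers in place of the wide hardware accumulator, and the names `SCALE_BITS`, `to_fixed` and `exact_sum` are hypothetical.

```python
SCALE_BITS = 1074   # the smallest positive binary64 value is 2**-1074


def to_fixed(x):
    """Convert a finite float to an integer multiple of 2**-1074, exactly."""
    n, d = x.as_integer_ratio()          # d is a power of two, at most 2**1074
    return n * ((1 << SCALE_BITS) // d)


def exact_sum(values):
    """Sum floats with no intermediate rounding and one final rounding."""
    acc = 0                              # stands in for the wide accumulator
    for v in values:
        acc += to_fixed(v)               # exact integer addition
    # Integer true division rounds once, to the nearest float.
    return acc / (1 << SCALE_BITS)


# The same three values summed in two different orders.
data = [1e16, 1.0, -1e16]
reordered = [1.0, -1e16, 1e16]
```

With ordinary floating-point addition, `(1e16 + 1.0) - 1e16` evaluates to `0.0` because the `1.0` is absorbed by the first rounding, whereas `exact_sum(data)` and `exact_sum(reordered)` both return `1.0`. This order independence is the associativity property mentioned above.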

Another alternative to floating point is the UNUM, a variable-length floating-point format conceived to replace the formats defined in the IEEE 754 standard. [18] discusses the implementation of UNUM arithmetic and reports hardware implementation results for the main UNUM operators.