Section: Scientific Foundations

Model-based optimization and compilation techniques


Optimization for parallelism

We study optimization techniques to produce “good” schedules and mappings of a given application onto a hardware SoC architecture. These heuristic techniques aim at fulfilling the requirements of the application, whether they be real time, memory usage or power consumption constraints. These techniques are thus multi-objective and target heterogeneous architectures.

We aim at taking advantage of the parallelism (both data-parallelism and task parallelism) expressed in the application models in order to build efficient heuristics.

Our application model has several properties that the compiler can exploit: it expresses all the potential parallelism of the application; it directly expresses data dependencies, so no dependence analysis is needed; it is in single assignment form; and it unifies the temporal and spatial dimensions of the arrays. The optimizing compiler thus gets all the information it needs in a readily usable form.

Transformation and traceability

Model-to-model transformations are at the heart of the MDE approach. Anyone wishing to use MDE in their projects sooner or later faces the question: how should the model transformations be performed? The standardization process of Query/View/Transformation (QVT) [111] was an opportunity for the development of transformation engines such as Viatra, Moflon or Sitra. However, since the standard was published, only a few of the investigated tools, such as ATL (http://www.eclipse.org/m2m/atl), a dedicated transformation tool, or Kermeta (http://www.kermeta.org), a general-purpose tool with facilities for manipulating models, are powerful enough to execute large and complex transformations such as those in the Gaspard2 framework. None of these engines is fully compliant with the QVT standard. To address this issue, new engines relying on a subset of the standard have recently emerged, such as QVTO (http://www.eclipse.org/m2m/qvto/doc) and smartQVT. These engines implement the QVT Operational language.

Traceability may be used for different purposes such as understanding, capturing, tracking and verifying software artifacts during the development life cycle [98]. A main principle of MDE is that everything is a model, so trace information is mainly stored as models. Some solutions keep the trace information in the initial source or target models [125]. The major drawbacks of these solutions are that they pollute the models with additional information and require the metamodels to be adapted to take traceability into account. Using a separate trace model with specific semantics has the advantage of keeping trace information independent of the initial models [102].

Contributions of the team

Data-parallel code transformations

We have studied Array-OL to Array-OL code transformations [83], [122], [93], [92], [94], [101]. Array-OL allows a powerful expression of the data access patterns in such applications and a complete expression of parallelism. It is at the heart of our metamodel of application, hardware architecture and association.

The code transformations that have been proposed are related to loop fusion, loop distribution and tiling, but they take into account the particularities of the application domain, such as the presence of modulo operators to deal with cyclic frequency domains or cyclic space dimensions (for example, hydrophones placed around a submarine).
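As an illustration of why the modulo operator matters for such transformations, here is a minimal sketch in plain Python. It is not Array-OL and not the team's refactoring toolbox: the function names, the sliding-window computation and the sample data are all assumptions chosen for the example. It shows a loop over a cyclic dimension and a tiled version of the same loop, which must keep the modulo in the access pattern to stay equivalent.

```python
def window_sums(signal, width):
    """Reference loop: each output sums `width` consecutive inputs, wrapping around."""
    n = len(signal)
    return [sum(signal[(i + k) % n] for k in range(width)) for i in range(n)]


def window_sums_tiled(signal, width, tile):
    """Same computation with the outer loop split into tiles of `tile` outputs."""
    n = len(signal)
    out = [0] * n
    for start in range(0, n, tile):                    # loop over tiles
        for i in range(start, min(start + tile, n)):   # loop inside one tile
            # The modulo keeps the access cyclic even for the last tile.
            out[i] = sum(signal[(i + k) % n] for k in range(width))
    return out


# Hypothetical samples from sensors arranged in a ring.
hydrophones = [3, 1, 4, 1, 5, 9, 2, 6]
reference = window_sums(hydrophones, width=3)
tiled = window_sums_tiled(hydrophones, width=3, tile=3)
print(reference == tiled)
```

The tile size does not need to divide the length of the cyclic dimension, since the inner loop is clamped to the array bounds while the reads wrap around.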

We pursue the study of such transformations with two objectives:

  • Propose strategies for using such transformations in order to optimize criteria such as memory usage, minimization of redundant computations or adaptation to a target hardware architecture.

  • Extend their application domain to our more general application model (instead of just Array-OL).

In 2009, the study of the interaction between the high-level data-parallel transformations and the inter-repetition dependencies (which allow the specification of uniform dependencies) was completed. The ODT formalism behind the Array-OL transformations cannot express dependencies between elements of the same multidimensional space. To take the uniform dependencies into account, we therefore proposed and proved an algorithm that, starting from the hierarchical distribution of repetitions before and after a transformation, computes the new uniform dependencies that express exactly the same dependencies as before the transformation. It all comes down to solving a system of (in)equations, interpreting the solutions and translating them into new uniform dependencies.

The algorithm was implemented and integrated into the refactoring toolbox and enables the use of the transformations on models containing inter-repetition dependencies.

To validate the theoretical work on high-level Array-OL refactoring based on the data-parallel transformations, we carried out a study of optimization techniques in the context of an industrial radar application, together with Eric Lenormand and Michel Barreteau from THALES Research & Technology. We proposed a strategy for using the refactoring toolbox to help explore the design space, illustrated on the radar application modeled with the Modeling and Analysis of Real-time and Embedded systems (MARTE) UML profile.

Multi-objective hierarchical scheduling heuristics

When dealing with complex heterogeneous hardware architectures, scheduling heuristics usually take a task dependence graph as input. Both our application and hardware architecture models are hierarchical and allow repetitive expressions. We propose a Globally Irregular, Locally Regular (GILR) combination of heuristics to take advantage of both task and data parallelism [105], and have started evaluating multi-objective evolutionary meta-heuristics in this context. These evolutionary meta-heuristics deal with the irregular (task parallelism) part of the design [80], while we have proposed a heuristic to deal with the regular (data parallelism) part [106].

Furthermore, local optimizations (contained inside a hierarchical level) decrease the communication overhead and allow for a more efficient usage of the memory hierarchy. We aim at combining the data-parallel code transformations presented before and the GILR heuristics in order to deal efficiently with the data-parallelism of the application by using repetitive parts of the hardware architecture.

The introduction of uniform inter-repetition dependencies in the data-parallel tasks of Gaspard2 has had several consequences. Aside from the modification of the refactoring (see section ), we have studied the compilation of such tasks. This compilation involves scheduling such repetitions on repetitive grids of processors and generating code. This scheduling problem is NP-complete, and we have proposed a heuristic based on automatic parallelization techniques to compute a good schedule (efficient in both time and code size) when all loop bounds and processor array shapes are known.

Transformation techniques

In the previous version of Gaspard2, model transformations were complex and monolithic. They were thus hard to evolve, reuse and maintain. We therefore proposed to decompose complex transformations into smaller ones that work jointly to build a single output model [96]. These transformations involve different parts of the same input metamodel (e.g. the MARTE metamodel); their application field is localized. The localization of a transformation is ensured by defining the intermediary metamodels as deltas. A delta metamodel contains only the few concepts involved in the transformation (i.e. modified or read), and the specification of the transformation uses only the concepts of these deltas. We defined the Extend operator to build the complete metamodel from the delta and transposed the corresponding transformations. The complete metamodel corresponds to the merge of the delta with the MARTE metamodel or with an intermediary metamodel. The transformation then becomes the chaining of metamodel shifts and the localized transformation. This way of defining model transformations has been used in the Gaspard2 environment. It has improved modularity and thus also reusability between the various chains.


Our traceability solution relies on two metamodels: the Local Trace and the Global Trace metamodels. The former is used to capture the traces between the inputs and the outputs of one transformation. The latter is used to link Local Traces according to the transformation chain. The local trace also offers an alternative “view” to the common traceability mechanism, one that does not refer to the execution trace of the transformation engine. It can be used regardless of the transformation language and can easily complement an existing traceability mechanism by providing finer-grained traceability [75].

Furthermore, based on our trace metamodels, we developed algorithms to ease the debugging of model transformations. Using the trace, an error is easier to locate because the search field is reduced to the sequence of transformation rule calls involved [76].
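The following Python sketch illustrates the idea of local traces chained by a global trace, and of walking them backwards to narrow down the rule calls behind a faulty output element. It is only an illustration under stated assumptions: the class names, fields, transformation names and rule names are hypothetical and do not reproduce the actual trace metamodels or the algorithms of [75], [76].

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One local trace link: a rule call mapping source elements to target elements."""
    rule: str
    sources: tuple
    targets: tuple


@dataclass
class LocalTrace:
    """Trace of a single transformation of the chain."""
    transformation: str
    links: list = field(default_factory=list)


@dataclass
class GlobalTrace:
    """Orders the local traces according to the transformation chain."""
    chain: list = field(default_factory=list)

    def rules_leading_to(self, element):
        """Walk the chain backwards and collect the rule calls that produced `element`."""
        wanted, calls = {element}, []
        for local in reversed(self.chain):
            next_wanted, here = set(wanted), []
            for link in local.links:
                hit = wanted & set(link.targets)
                if hit:
                    here.append((local.transformation, link.rule))
                    next_wanted -= hit                  # explained by this rule call
                    next_wanted |= set(link.sources)    # now look for its inputs
            calls = here + calls                        # keep chain order
            wanted = next_wanted
        return calls


# Hypothetical two-step chain.
step1 = LocalTrace("step1", [
    Link("Task2Module", ("FilterTask",), ("filter_module",)),
    Link("Port2Signal", ("in_port",), ("in_sig",)),
])
step2 = LocalTrace("step2", [
    Link("Module2File", ("filter_module",), ("filter.c",)),
])
trace = GlobalTrace([step1, step2])
suspects = trace.rules_leading_to("filter.c")
print(suspects)
```

In this example the rule `Port2Signal` never appears in the result, which is the sense in which the trace reduces the search field when an error is observed in one output element.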

Verifying conformance and semantics-preserving model transformations

We give formal executable semantics to the notions of conformance and of semantics-preserving model transformations in the model-driven engineering framework [119]. Our approach consists in translating models and meta-models (possibly enriched with OCL invariants) into specifications in Membership Equational Logic, an expressive logic implemented in the Maude tool. Conformance between a model and a meta-model is represented by the validity of a certain theory interpretation of the specification representing the meta-model in the specification representing the model. Model transformations between origin and destination meta-models are mappings between the sets of models that conform to those meta-models, respectively, and can be represented by rewrite rules in Rewriting Logic, a superset of Membership Equational Logic also implemented in Maude. When the meta-models involved in a transformation are endowed with dynamic semantics, the transformations between them are typically also required to preserve those semantic aspects. We propose to represent the notion of dynamic semantics preservation by means of algebraic simulations expressed in Membership Equational Logic. Maude can then be used to automatically verify conformance, and to automatically verify dynamic semantics preservation up to a bounded number of steps of the dynamic semantics. This work leads to better understood meta-models and models, and to model transformations containing fewer errors.

Modeling for GPU

The model, described in UML with the MARTE profile, goes through a chain of several inout transformations that add and/or transform elements in the model. To add memory allocation concepts to the model, a QVT transformation based on the «Memory Allocation Metamodel» provides information that facilitates and optimizes code generation. A model-to-text transformation then generates the C code for the GPU architecture. Until the standard is released, Acceleo is appropriate for extracting many aspects of the application and architecture models and transforming them into CUDA (.cu, .cpp, .c, .h, Makefile) and OpenCL (.cl, .cpp, .c, .h, Makefile) files. Code generation must take into account intrinsic characteristics of GPUs such as data distribution, contiguous memory allocation, kernels and host programs, blocks of threads, barriers and atomic functions.

Clock-based design space exploration for SoCs

We have previously proposed an abstract clock-based modeling of the behavior of data-intensive SoCs within the Gaspard2 framework [70], [69]. Both application functionality and hardware architecture are characterized in terms of clocks. Their allocation is then expressed as a projection of functional clock properties onto physical clock properties, according to a mapping choice. The result of such an allocation is a new set of clocks reflecting the simulation of the temporal behavior of the system during execution.

This year, this approach has been applied to the design of an H.264 encoder on a multiprocessor hardware architecture using the standard MARTE profile [71]. The resulting model has been analyzed by considering abstract clocks. In particular, it has been shown that such clocks help to tackle design space exploration issues via a relevant modeling of different hardware/software mappings. The trade-off between processor frequency scaling, system functional properties and energy consumption has been addressed via different hardware IP choices. This has been achieved through qualitative reasoning on traces resulting from the scheduling of logical clocks, which capture functional properties, on physical clocks derived from processor frequencies.

Optimized code generation from UML/MARTE models

Some semantics, and thus some optimization possibilities, are lost when generating code in a programming language from a UML/MARTE model. Starting from this observation, the contribution of a thesis co-directed with the CEA LIST is an optimization at the model level, followed by a translation to the GENERIC intermediate representation of the GCC compilation framework in order to allow further optimization. For the moment, the focus is on code size optimization.

Architecture exploration based on meta-heuristics

Some progress has been made on the use of meta-heuristics for multi-objective mapping and scheduling. In collaboration with the Dolphin project-team of INRIA Lille - Nord Europe and LIFL, we have modeled the association process of Gaspard2 as an optimization problem in order to solve it with a heuristic based on a genetic algorithm, which has been implemented in the ParadisEO optimization framework. This new heuristic is currently being integrated into the Gaspard2 tool. Another work, comparing heuristics based on the particle swarm and genetic algorithm meta-heuristics, has been proposed in collaboration with the computer science laboratory of Oran, Algeria, in continuation of our collaboration.
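Multi-objective heuristics of this kind compare candidate mappings on several costs at once. The Python sketch below shows only the Pareto-dominance selection step that such methods commonly rely on; it is not the ParadisEO API and not the heuristic described above. The candidate names and their (execution time, energy) cost pairs are made-up values, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if cost vector `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates whose cost vectors are not dominated by any other candidate."""
    return {
        name: cost
        for name, cost in candidates.items()
        if not any(dominates(other, cost) for other in candidates.values())
    }


# Hypothetical mappings with (execution time, energy) costs, both to be minimized.
candidates = {
    "mapping_a": (10, 5),
    "mapping_b": (8, 7),
    "mapping_c": (12, 6),
    "mapping_d": (9, 6),
}
front = pareto_front(candidates)
print(sorted(front))
```

Here `mapping_c` is discarded because `mapping_a` is both faster and less energy-hungry, while the three remaining mappings are incomparable trade-offs.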

Architecture exploration for efficient data transfer and storage

A major point in embedded system design today is the optimization of communication structures, memory hierarchy and global synchronizations. Such an optimization is a time-consuming and error-prone process that requires a suitable automatic approach. We proposed an electronic system level framework to explore the data transfer and storage micro-architecture and the synchronization of iterative data-parallel applications [88]. The aim is to define a methodology that can be a front-end for loop-based high-level synthesis or for interconnecting hardware IPs in order to build memory-centric MPSoCs. In Gaspard2, this will make it possible to assess various mappings of Array-OL models onto different kinds of target architectures.

Our solution starts from a canonical Array-OL representation and applies a set of transformations in order to infer an application-specific architecture that hides the data transfer time behind the computation time. A customizable model of the target architecture, including FIFO queues and a double buffering mechanism, is proposed. The mapping of a given image processing application onto this architecture is performed through a flow of Array-OL transformations aimed at improving the level of parallelism and reducing the size of the internal memories used. A method based on integer partitions is considered to reduce the space of explored transformations.
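To see how double buffering can hide transfer time behind computation time, consider the following simplified timing model in Python. It is an assumption made for illustration, not the architecture model of [88]: every block is assumed to take the same transfer time and the same computation time, write-back of results is ignored, and the function and parameter names are hypothetical.

```python
def sequential_time(blocks, transfer, compute):
    """Single buffer: each block is transferred, then processed, one after the other."""
    return blocks * (transfer + compute)


def double_buffered_time(blocks, transfer, compute):
    """Two buffers: while block i is processed, block i+1 is being transferred."""
    if blocks == 0:
        return 0
    # First transfer and last computation cannot overlap with anything;
    # in between, each step costs the slower of the two activities.
    return transfer + (blocks - 1) * max(transfer, compute) + compute


# Hypothetical figures: 8 blocks, 3 time units per transfer, 5 per computation.
plain = sequential_time(8, transfer=3, compute=5)
overlapped = double_buffered_time(8, transfer=3, compute=5)
print(plain, overlapped)
```

Under this model the transfers are fully hidden only when the computation of a block takes at least as long as its transfer; otherwise the transfers become the bottleneck.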

Multi-objective mapping and scheduling heuristics

The master's work of Mohamed Akli Redjedal (univ. Lille 1) was co-directed with Laetitia Jourdan from the Dolphin project-team of INRIA Lille - Nord Europe and LIFL. It consisted in modeling the association process of Gaspard2 as an optimization problem in order to solve it with a heuristic based on a genetic algorithm. He modeled this multi-objective mapping and scheduling problem, and proposed a heuristic and its implementation in the ParadisEO optimization framework. A first-year master's student from the univ. of Brussels worked for 6 weeks on the model-driven export from Gaspard2 to the optimization heuristics proposed by Mohamed Redjedal.

GPGPU code production

The solution of large, sparse systems of linear equations « Ax=b » is a bottleneck in sequential code executing on a CPU. To solve a system arising from Maxwell's equations with the Finite Element Method (FEM), a version of the conjugate gradient iterative method was implemented in both CUDA and OpenCL. The aim is to accelerate and verify the parallel code on GPUs. The first results showed a speedup of about 6 over the sequential code on a CPU. Another approach uses an algorithm that explores the sparse matrix storage format (by rows and by columns). This one did not increase the speedup, but it makes it possible to evaluate the impact of memory accesses.
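For reference, the textbook conjugate gradient method on a matrix stored by rows (CSR format) can be sketched in a few lines of sequential Python. This is not the CUDA or OpenCL implementation mentioned above; the small symmetric positive-definite matrix, the tolerance and the function names are assumptions chosen for the example. On a GPU, the sparse matrix-vector product and the dot products are the parts typically parallelized.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix (stored by rows) by a dense vector."""
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for a symmetric positive-definite A given in CSR format."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - Ax for the initial guess x = 0
    p = list(r)                      # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough: converged
            break
        ap = csr_matvec(values, col_idx, row_ptr, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]] in CSR form, with b = A @ [1, 2, 3].
values = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [6.0, 10.0, 8.0]
solution = conjugate_gradient(values, col_idx, row_ptr, b)
print([round(v, 6) for v in solution])
```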

From MARTE to OpenCL

We have proposed an MDE approach to generate OpenCL code. From an abstract model defined using UML/MARTE, we generate compilable OpenCL code and then a functional executable application. As an MDE approach, it additionally provides a tool for project reuse and fast development that does not require expert users. This approach is an effective operational code generator for the newly released OpenCL standard. Further, although the experimental examples use a single device (one GPU), the approach provides the means to model applications running on multiple, homogeneously configured devices. Moreover, we provide two main contributions for modeling with the MARTE UML profile: on the one hand, an approach to model simple aspects of distributed memory, i.e. communication and memory allocation; on the other hand, an approach to model the platform and execution models of OpenCL. During the development of the transformation chain, a hybrid metamodel was proposed for specifying CPU and GPU programming models. This makes it possible to generate other target languages that conform to the same memory, platform and execution models as OpenCL, such as CUDA. Future work will exploit this multi-language aspect, based on additional model-to-text templates. Additionally, intelligent transformations can determine optimization levels in data communication and data access. Several studies show that these optimizations markedly increase application performance.

Formal techniques for construction, compilation and analysis of domain-specific languages

The increasing complexity of software development requires rigorously defined domain-specific modelling languages (DSML). Model-driven engineering (MDE) allows users to define their language's syntax in terms of metamodels. Several approaches for defining the operational semantics of DSML have also been proposed [123], [89], [73], [84], [115]. We have also proposed one such approach, based on representing models and metamodels as algebraic specifications, and operational semantics as rewrite rules over those specifications [95], [120]. These approaches allow, in principle, for model execution and for formal analyses of the DSML. However, most of the time, the executions and analyses are performed via transformations to other languages: code generation and translation to the input language of a model checker, respectively. The consequence is that the results (e.g., a program crash log, or a counterexample returned by a model checker) may not be straightforward for the users of a DSML to interpret. We have proposed in [118] a formal and operational framework for tracing such results back to the original DSML's syntax and operational semantics, and have illustrated it on SPEM, a language for timed process management.

Electromagnetic modeling

The Finite Integration Technique (FIT) is used to compute the electromagnetic phenomena. This technique is efficient if the mesh is made of regular hexahedra. Moreover, the matrix system obtained from a regular mesh can be exploited by the parallel direct solver. Indeed, by reordering the unknowns with the nested dissection method, it is possible to construct the lower triangular matrix directly on many processors without assembling the matrix system. This year, we have used our parallel direct solver, with good efficiency, as a preconditioner for a sparse linear system coming from a FEM problem.