## Section: Scientific Foundations

### Adaptive Parallel and Distributed Algorithms Design

Participants : François Broquedis, Pierre-François Dutot, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram, Frédéric Wagner.

*This theme deals with the analysis and the design of algorithmic schemes that
control (statically or dynamically) the grain of interactive applications.
*

The classical approach consists in setting in advance the number of processors for an application, the execution being limited to the use of these processors. This approach is restricted to a constant number of identical resources and for regular computations. To deal with irregularity (data and/or computations on the one hand; heterogeneous and/or dynamical resources on the other hand), an alternate approach consists in adapting the potential parallelism degree to the one suited to the resources. Two cases are distinguished:

in the classical bottom-up approach, the application provides fine grain tasks; then those tasks are clustered to obtain a minimal parallel degree.

the top-down approach (Cilk, Cilk+, TBB, Hood, Athapascan) is based on a work-stealing scheduling driven by idle resources. A local sequential depth-first execution of tasks is favored when recursive parallelism is available.

Ideally, a good parallel execution can be viewed as a flow of computations flowing through resources with no control overhead. To minimize control overhead, the application has to be adapted: a parallel algorithm on $p$ resources is not efficient on $q<p$ resources. On one processor, the scheduler should execute a sequential algorithm instead of emulating a parallel one. Then, the scheduler should adapt to resource availability by changing its underlying algorithm. This first way of adapting granularity is implemented by Kaapi (default work-stealing schedule based on work-first principle).

However, this adaptation is restrictive. More generally, the algorithm should adapt itself at runtime to improve its performance by decreasing the overheads induced by parallelism, namely the arithmetic operations and communications. This motivates the development of new parallel algorithmic schemes that enable the scheduler to control the distribution between computation and communication (grain) in the application to find the good balance between parallelism and synchronizations. MOAIS has exhibited several techniques to manage adaptivity from an algorithmic point of view:

amortization of the number of global synchronizations required in an iteration (for the evaluation of a stopping criterion);

adaptive deployment of an application based on on-line discovery and performance measurements of communication links;

generic recursive cascading of two kind of algorithms: a sequential one, to provide efficient executions on the local resource, and a parallel one that enables an idle resource to extract parallelism to dynamically suit the degree of parallelism to the available resources.

The generic underlying approach consists in finding a good mix of various
algorithms, what is often called a "poly-algorithm".
Particular instances of this approach are
Atlas library (performance benchmark are used to decide at compile
time the best block size and instruction interleaving for sequential
matrix product) and FFTW library (at run time, the best recursive splitting
of the FFT butterfly scheme is precomputed by dynamic programming).
Both cases rely on pre-benchmarking of the algorithms.
Our approach is more general in the sense that it also enables to
tune the granularity at any time during execution.
The objective is to develop processor oblivious algorithms:
similarly to cache oblivious algorithms,
we define a parallel algorithm as *processor-oblivious* if no program
variable that depends on architecture parameters, such as the number or processors
or their respective speeds, needs to be tuned to minimize the algorithm runtime.

We have applied this technique to develop processor
oblivious algorithms for several applications
with provable performance:
iterated and prefix sum (partial sums) computations, stream computations (cipher and
hd-video transformation),
3D image reconstruction (based on the concurrent usage of multi-core and GPU),
loop computations with early termination.
Finally, to validate these novel parallel computation schemes, we developed a
tool named **KRASH**. This tool is able to generate dynamic CPU load in a
reproducible way on many-cores machines. Thus, by providing the same
experimental conditions to several parallel applications, it enables users to
evaluate the efficiency of resource uses for each approach.

By optimizing the work-stealing to our adaptive algorithm scheme, the non-blocking (wait-free) implementation of Kaapi has been designed and leads to the C library X-kaapi.

Extensions concern the development of algorithms that are both cache and processor oblivious on heterogeneous processors. The processor algorithms proposed for prefix sums and segmentation of an array are cache oblivious too.