Section: New Results
Scheduling and Placement for HPC
With the growing complexity of HPC node architectures (multicore processors, non-uniform memory access, GPUs and other accelerators), a recent trend in application development is to explicitly express the computations as a task graph, and to rely on a specialized middleware stack to make scheduling decisions and implement them. The algorithms traditionally used in this community are dynamic heuristics, which cope with the unpredictability of execution times. In [12], we analyze the performance of static and hybrid strategies, obtained by adding more static (resp. dynamic) features into dynamic (resp. static) strategies. Our conclusions are somewhat unexpected: we show that static-based strategies are very efficient, even in a context where performance estimations are not very good. We also present and generalize HeteroPrio, a semi-static resource-centric strategy based on the acceleration factors of tasks. In [19], we generalize this strategy to platforms with more than two types of resources. This makes it possible to exploit intra-task parallelism by grouping several CPU cores together. In [27], we prove tight approximation ratios for HeteroPrio in the context of independent tasks, providing theoretical insight into its good practical performance.
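To make the idea of a resource-centric strategy driven by acceleration factors more concrete, here is a minimal Python sketch for independent tasks on two resource types. It is an illustration under assumptions, not the algorithm as published in [12], [19] or [27]: the function name `affinity_list_schedule`, the task representation as `(cpu_time, gpu_time)` pairs, and the tie-breaking rule are all hypothetical, and refinements of the actual strategy (such as correcting a bad decision after the fact) are deliberately left out. The acceleration factor of a task is taken to be the ratio of its CPU time to its GPU time; an idle GPU picks the remaining task with the highest factor, and an idle CPU picks the one with the lowest.

```python
import heapq
from collections import deque


def affinity_list_schedule(tasks, n_cpu, n_gpu):
    """Greedy list scheduling of independent tasks by acceleration factor.

    tasks: list of (cpu_time, gpu_time) pairs (hypothetical representation).
    Returns (makespan, assignment), where assignment maps a task index to
    (resource_kind, worker_id, start_time, end_time).
    """
    # Sort task indices by decreasing acceleration factor cpu_time / gpu_time.
    order = deque(
        sorted(range(len(tasks)),
               key=lambda i: tasks[i][0] / tasks[i][1],
               reverse=True)
    )
    # Heap of (time_when_free, rank, kind, worker_id).
    # rank 0 for GPUs so that a GPU chooses first when several workers are idle.
    workers = [(0.0, 0, "gpu", w) for w in range(n_gpu)]
    workers += [(0.0, 1, "cpu", w) for w in range(n_cpu)]
    heapq.heapify(workers)

    assignment = {}
    makespan = 0.0
    while order:
        free_at, rank, kind, w = heapq.heappop(workers)
        if kind == "gpu":
            i = order.popleft()        # most accelerated remaining task
            duration = tasks[i][1]
        else:
            i = order.pop()            # least accelerated remaining task
            duration = tasks[i][0]
        end = free_at + duration
        assignment[i] = (kind, w, free_at, end)
        makespan = max(makespan, end)
        heapq.heappush(workers, (end, rank, kind, w))
    return makespan, assignment


if __name__ == "__main__":
    # Three tasks with acceleration factors 10, 2 and 1.
    example = [(10, 1), (4, 2), (3, 3)]
    span, plan = affinity_list_schedule(example, n_cpu=1, n_gpu=1)
    for i in sorted(plan):
        print(i, plan[i])
    print("makespan:", span)
```

In this sketch the decision is taken by the resource that becomes idle rather than by the task, which is what "resource-centric" is meant to convey here; the sorted order is computed once, up front, which corresponds to the static part of the strategy.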
Another study [26] focuses on the memory-constrained case, where tasks may produce large data. A task can only be executed if all of its input and output data fit into memory, and a piece of data can only be removed from memory after the completion of the task that uses it as input. There is a known polynomial-time algorithm [55] that minimizes the peak memory used on one machine when the input graph is a rooted tree. In [26], we generalize this result to the variant where the input graph is a directed series-parallel graph, and propose a polynomial-time algorithm. This makes it possible to solve this practical problem for two important classes of applications.
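The memory model described above can be illustrated by a short Python sketch that evaluates the peak memory of a given sequential execution order. This is not the optimization algorithm of [55] or [26]; it only computes the objective those algorithms minimize, under assumptions made for illustration: data sizes are attached to edges, the function name `peak_memory` and the edge dictionary format are hypothetical, and tasks without outgoing edges are assumed to produce no data that must be kept.

```python
def peak_memory(order, edges):
    """Peak memory used when tasks run one at a time in the given order.

    order: list of task names, assumed to be a valid topological order.
    edges: dict mapping (producer, consumer) to the size of the data item
           produced by `producer` and used as input by `consumer`.
    """
    position = {task: k for k, task in enumerate(order)}
    for producer, consumer in edges:
        if position[producer] >= position[consumer]:
            raise ValueError("order violates a dependency")

    resident = 0   # data produced so far and not yet consumed
    peak = 0
    for task in order:
        inputs = sum(s for (u, v), s in edges.items() if v == task)
        outputs = sum(s for (u, v), s in edges.items() if u == task)
        # Inputs are already resident; outputs must fit as well while
        # the task is running.
        peak = max(peak, resident + outputs)
        # Inputs can be discarded only once the task has completed.
        resident += outputs - inputs
    return peak


if __name__ == "__main__":
    # A small rooted tree: a and b feed c; c and d feed the root r.
    tree = {("a", "c"): 3, ("b", "c"): 2, ("c", "r"): 1, ("d", "r"): 4}
    print(peak_memory(["a", "b", "c", "d", "r"], tree))
    print(peak_memory(["d", "a", "b", "c", "r"], tree))
```

On this small tree the two orders give different peaks, because running `d` first keeps its large output resident while the subtree rooted at `c` is processed. Choosing the order is therefore the whole problem.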
In [13], we consider the static problem of
data placement for matrix multiplication in heterogeneous machines, so
as to optimize both load balancing and communication volume. This is
modeled as a partitioning of a square into a set of zones of
prescribed areas, while minimizing the overall size of their
projections onto horizontal and vertical axes. We combine two ideas
from the literature (recursive partitioning, and the structure of optimal
solutions for a low number of processors) to obtain a non-rectangular
recursive partitioning (NRRP), whose approximation ratio is