Section: New Software and Platforms
StarPU
The StarPU Runtime System
Keywords: Multicore - GPU - Scheduling - HPC - Performance
Scientific Description: Traditional processors have reached architectural limits which heterogeneous multicore designs and hardware specialization (e.g. coprocessors, accelerators, etc.) intend to address. However, exploiting such machines introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is a critical issue. StarPU typically makes it much easier for high-performance libraries or compiler environments to exploit heterogeneous multicore machines possibly equipped with GPGPUs or Cell processors: rather than handling low-level issues, programmers may concentrate on algorithmic concerns.
Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named "codelet". Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. If a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine. In order to relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are transparently made available on the compute resource.
Given its expressive interface and portable scheduling policies, StarPU achieves portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models.
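As an illustration of the codelet abstraction, here is a minimal sketch (not taken verbatim from the StarPU distribution; function and variable names are illustrative) of a codelet that scales a vector on a CPU. A CUDA implementation could be listed alongside the CPU one, and StarPU would choose at run time which unit executes each task.

#include <starpu.h>

/* CPU implementation: scales the vector handed over by StarPU in place. */
static void scal_cpu_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *vec = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vec);
    float *data = (float *)STARPU_VECTOR_GET_PTR(vec);
    float factor = *(float *)cl_arg;
    for (unsigned i = 0; i < n; i++)
        data[i] *= factor;
}

/* The codelet gathers the available implementations and declares how the
 * task accesses its single buffer (read-write). */
static struct starpu_codelet scal_cl =
{
    .cpu_funcs = { scal_cpu_func },
    /* .cuda_funcs = { scal_cuda_func },  optional GPU implementation */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};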
StarPU is a task programming library for hybrid architectures
The application provides algorithms and constraints:
- CPU/GPU implementations of tasks
- A graph of tasks, using either StarPU's high-level GCC plugin pragmas or StarPU's rich C API (a C API sketch follows below)
StarPU handles run-time concerns:
- Task dependencies
- Optimized heterogeneous scheduling
- Optimized data transfers and replication between main memory and discrete memories
- Optimized cluster communications
Rather than handling low-level scheduling and optimization issues, programmers can concentrate on algorithmic concerns!
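To make the task-graph side concrete, the following sketch (illustrative names and sizes, assuming the scal_cl codelet sketched above) registers a vector with StarPU's data management library and submits a task through the C API; dependencies with other tasks touching the same data, as well as any transfers to an accelerator, are then handled by the runtime.

#include <starpu.h>

int main(void)
{
    float vector[1024];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    for (unsigned i = 0; i < 1024; i++)
        vector[i] = 1.0f;

    if (starpu_init(NULL) != 0)
        return 1;

    /* Hand the vector over to StarPU's data management library. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, 1024, sizeof(vector[0]));

    /* Submit a task; the RW access mode lets StarPU infer dependencies with
     * any other task touching the same data and trigger the required
     * transfers if the task runs on an accelerator. */
    starpu_task_insert(&scal_cl,
                       STARPU_RW, handle,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);

    starpu_task_wait_for_all();      /* wait for the whole task graph */
    starpu_data_unregister(handle);  /* bring the data back to main memory */
    starpu_shutdown();
    return 0;
}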
Functional Description: StarPU is a runtime system that offers support for heterogeneous multicore machines. While many efforts are devoted to designing efficient computation kernels for those architectures (e.g. to implement BLAS kernels on GPUs), StarPU not only takes care of offloading such kernels (and implementing data coherency across the machine), but it also makes sure the kernels are executed as efficiently as possible.
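As a hint of how that efficiency is obtained in practice, the sketch below attaches a history-based, auto-tuned performance model to the codelet sketched earlier; scheduling policies such as StarPU's dmda can then use the recorded timings to place each task on the unit expected to complete it soonest. The symbol name is illustrative.

/* History-based performance model: StarPU records measured execution times
 * of this kernel per processing unit and data size, and auto-tunes the
 * model across runs. */
static struct starpu_perfmodel scal_model =
{
    .type   = STARPU_HISTORY_BASED,
    .symbol = "scal_kernel",
};

/* Adding  .model = &scal_model  to the codelet definition above lets
 * performance-model-aware schedulers exploit these measurements. */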
Participants: Corentin Salingue, Andra Hugo, Benoît Lize, Cédric Augonnet, Cyril Roelandt, François Tessier, Jérôme Clet-Ortega, Ludovic Courtès, Ludovic Stordeur, Marc Sergent, Mehdi Juhoor, Nathalie Furmento, Nicolas Collin, Olivier Aumage, Pierre-André Wacrenier, Raymond Namyst, Samuel Thibault, Simon Archipoff, Xavier Lacoste, Terry Cojean, Yanis Khorsi, Philippe Virouleau, Loïc Jouans and Leo Villeveygoux
Publications:
- Asynchronous Task-Based Execution of the Reverse Time Migration for the Oil and Gas Industry
- A Compiler Algorithm to Guide Runtime Scheduling
- Achieving high-performance with a sparse direct solver on Intel KNL
- Modeling Irregular Kernels of Task-based codes: Illustration with the Fast Multipole Method
- Scheduling of Dense Linear Algebra Kernels on Heterogeneous Resources
- Critical resources management and scheduling under StarPU
- Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model
- Programmation of heterogeneous architectures using moldable tasks
- The StarPU Runtime System at Exascale?: Scheduling and Programming over Upcoming Machines
- A Visual Performance Analysis Framework for Task-based Parallel Applications running on Hybrid Clusters
- Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach
- Detecção de Anomalias de Desempenho em Aplicações de Alto Desempenho baseadas em Tarefas em Clusters Híbridos
- Resource aggregation for task-based Cholesky Factorization on top of heterogeneous machines
- On Runtime Systems for Task-based Programming on Heterogeneous Platforms
- Resource aggregation in task-based applications over accelerator-based multicore machines
- Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System
- Exploiting Two-Level Parallelism by Aggregating Computing Resources in Task-Based Applications Over Accelerator-Based Machines
- Bridging the gap between OpenMP 4.0 and native runtime systems for the fast multipole method
- Scalability of a task-based runtime system for dense linear algebra applications
- Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures
- Towards seismic wave modeling on heterogeneous many-core architectures using task-based runtime system
- Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms
- Composing multiple StarPU applications over heterogeneous machines: A supervised approach
- Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite
- A runtime approach to dynamic resource allocation for sparse direct solvers
- Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures
- Toward OpenCL Automatic Multi-Device Support
- Harnessing clusters of hybrid nodes with a sequential task-based programming model
- Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
- Modulariser les ordonnanceurs de tâches : une approche structurelle
- Overview of Distributed Linear Algebra on Hybrid Nodes over the StarPU Runtime
- StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators
- Adaptive Task Size Control on High Level Programming for GPU/CPU Work Sharing
- Composing multiple StarPU applications over heterogeneous machines: a supervised approach
- Implementation of FEM Application on GPU with StarPU
- Le problème de la composition parallèle : une approche supervisée
- Support exécutif scalable pour les architectures hybrides distribuées
- SOCL: An OpenCL Implementation with Automatic Multi-Device Adaptation Support
- C Language Extensions for Hybrid CPU/GPU Programming with StarPU
- Programming Models and Runtime Systems for Heterogeneous Architectures
- Programmation unifiée multi-accélérateur OpenCL
- Parallelization on Heterogeneous Multicore and Multi-GPU Systems of the Fast Multipole Method for the Helmholtz Equation Using a Runtime System
- High-Level Support for Pipeline Parallelism on Many-Core Architectures
- Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems
- Programmation des architectures hétérogènes à l’aide de tâches divisibles
- StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
- PEPPHER: Efficient and Productive Usage of Hybrid Computing Systems
- The PEPPHER Approach to Programmability and Performance Portability for Heterogeneous many-core Architectures
- Flexible runtime support for efficient skeleton programming on hybrid systems
- LU Factorization for Accelerator-based Systems
- QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
- Programmation multi-accélérateurs unifiée en OpenCL
- Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicœurs hétérogènes
- Association de modèles de programmation pour l'exploitation de clusters de GPUs dans le calcul intensif
- Programming heterogeneous, accelerator-based multicore machines: current situation and main challenges
- Scheduling Tasks over Multicore machines enhanced with accelerators: a Runtime System's Perspective
- Composabilité de codes parallèles sur architectures hétérogènes
- Data-Aware Task Scheduling on Multi-Accelerator based Platforms
- Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators
- StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines
- StarPU : un support exécutif unifié pour les architectures multicoeurs hétérogènes
- Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures
- StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
- Exploiting the Cell/BE architecture with the StarPU unified runtime system
- Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method
- Composability of parallel codes on heterogeneous architectures
- Are Static Schedules so Bad? A Case Study on Cholesky Factorization
- Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources
- Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs
- Resource aggregation for task-based Cholesky Factorization on top of modern architectures
- Visual Performance Analysis of Memory Behavior in a Task-Based Runtime on Hybrid Platforms
- Tolérance aux pannes dans l'exécution distribuée de graphes de tâches