The Roma project aims at designing models, algorithms, and scheduling
strategies to optimize the execution of scientific applications.

Scientists now have access to tremendous computing power. For instance, the top supercomputers contain more than 100,000 cores, and volunteer computing grids gather millions of processors. Furthermore, it has never been so easy for scientists to access parallel computing resources, either through the multitude of local clusters or through distant cloud computing platforms.

Because parallel computing resources are ubiquitous, and because the available computing power is so huge, one could believe that scientists no longer need to worry about finding computing resources, let alone about optimizing their usage. Nothing is farther from the truth. Institutions and government agencies keep building larger and more powerful computing platforms with a clear goal: these platforms must make it possible to solve, within reasonable timescales, problems that were so far out of reach; they must also make it possible to solve problems more precisely where the existing solutions are not deemed sufficiently accurate. For those platforms to fulfill their purposes, their computing power must therefore be carefully exploited and not wasted. This often requires an efficient management of all types of platform resources: computation, communication, memory, storage, energy, etc. It is often hard to achieve because of the characteristics of new and emerging platforms. Moreover, technological evolutions raise new problems, and fully tried and tested solutions must be thoroughly overhauled or simply discarded and replaced. Numerous difficulties have, or will have, to be overcome.

We are convinced that dramatic breakthroughs in algorithms and
scheduling strategies are required for the scientific computing
community to overcome all the challenges posed by new and emerging
computing platforms. This is required for applications to be
successfully deployed at very large scale, and hence for enabling the
scientific computing community to push the frontiers of knowledge as
far as possible. The Roma project-team aims at providing fundamental
algorithms, scheduling strategies, protocols, and software packages to
fulfill the needs of a wide class of scientific computing
applications, in domains as diverse as geophysics, structural
mechanics, chemistry, electromagnetism, numerical optimization, and
computational fluid dynamics, to name but a few. To this end,
the Roma project-team takes a special interest in dense and sparse
linear algebra.

The work in the Roma team is organized along three research themes.

For HPC applications, scale is a major opportunity. The largest supercomputers contain tens of thousands of nodes, and future platforms will certainly have to enroll even more computing resources to enter the Exascale era. Unfortunately, scale is also a major threat. Indeed, even if each node has an individual MTBF (Mean Time Between Failures) of, say, one century, a machine with 100,000 nodes will encounter a failure every 9 hours on average, which is shorter than the execution time of many HPC applications.
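The 9-hour figure follows from the standard approximation that, with independent node failures, the MTBF of the whole platform is the node MTBF divided by the number of nodes. A quick check in Python:

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def platform_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """MTBF of a platform with num_nodes independent nodes, in hours."""
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes

# One-century node MTBF, 100,000 nodes: a failure roughly every 9 hours.
mtbf = platform_mtbf_hours(100, 100_000)
```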

To further darken the picture, several types of errors must be considered when computing at scale. In addition to classical fail-stop errors (such as hardware failures), silent errors (a.k.a. silent data corruptions) must be taken into account. Silent errors may be caused, for instance, by soft errors in the L1 cache or by bit flips due to cosmic radiation. The problem is that the detection of a silent error is not immediate: the error only manifests later, once the corrupted data has propagated and impacted the result.

Our work investigates new models and algorithms for resilience at extreme-scale. Its main objective is to cope with both fail-stop and silent errors, and to design new approaches that dramatically improve the efficiency of state-of-the-art methods. Application resilience currently involves a broad range of techniques, including fault prediction, error detection, error containment, error correction, checkpointing, replication, migration, recovery, etc. Extending these techniques, and developing new ones, to achieve efficient execution at extreme-scale is a difficult challenge, but it is the key to a successful deployment and usage of future computing platforms.

In this theme, we focus on the design of scheduling strategies that finely take into account some platform characteristics beyond the most classical ones, namely the computing speed of processors and accelerators, and the communication bandwidth of network links. Our work mainly considers the following two platform characteristics:

In this theme, we work on various aspects of sparse direct solvers for
linear systems. Target applications lead to sparse systems made of
millions of unknowns. In the scope of the PaStiX solver, co-developed
with the Inria HiePACS team, there are two main objectives: reducing
memory requirements as much as possible, and exploiting modern parallel
architectures through the use of runtime systems.

A first research challenge is to exploit the parallelism of modern
computers, made of heterogeneous (CPUs+GPUs) nodes. The approach
consists of using dynamic runtime systems (in the context of the
PaStiX solver, PaRSEC or StarPU) to schedule tasks.

Another important direction of research is the exploitation of low-rank representations. Low-rank approximations are commonly used to compress the representation of data structures; the induced loss of information is often negligible and can be controlled. In the context of sparse direct solvers, we exploit low-rank properties to reduce the demand for floating-point operations and memory. To enhance sparse direct solvers using low-rank compression, two orthogonal approaches are followed: (i) integrating new strategies for better scalability, and (ii) using preprocessing steps to better identify how to cluster unknowns, when to perform compression, and which blocks not to compress.
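As an illustration of the underlying idea (a generic truncated-SVD sketch, not the block low-rank kernels actually used inside sparse solvers such as PaStiX), a dense block can be replaced by two thin factors, with an error controlled by the truncation tolerance:

```python
import numpy as np

def compress(block: np.ndarray, tol: float):
    """Keep the smallest rank r such that dropped singular values are
    below tol * ||block||_2; return thin factors X, Y with block ~ X @ Y."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))       # numerical rank at tolerance tol
    return U[:, :r] * s[:r], Vt[:r, :]    # storage: 2*n*r entries instead of n*n

rng = np.random.default_rng(0)
n, true_rank = 200, 12
block = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
X, Y = compress(block, 1e-10)
rel_err = np.linalg.norm(block - X @ Y) / np.linalg.norm(block)
```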

CSC (Combinatorial Scientific Computing) is a term (coined circa 2002) for interdisciplinary research at the intersection of discrete mathematics, computer science, and scientific computing. In particular, it refers to the development, application, and analysis of combinatorial algorithms to enable scientific computing applications. CSC's deepest roots are in the realm of direct methods for solving sparse linear systems of equations, where graph-theoretical models have been central to the exploitation of sparsity since the 1960s. The general approach is to identify performance issues in a scientific computing problem, such as memory use, parallel speedup, and/or the rate of convergence of a method, and to develop combinatorial algorithms and models to tackle those issues. Most of the time, the research output includes experiments with real-life data to validate and fine-tune the developed combinatorial algorithms.

In this context, our work targets (i) the preprocessing phases of direct, iterative, and hybrid methods for solving linear systems of equations, and (ii) high-performance tensor computations. The core topics of our contributions include partitioning and clustering in graphs and hypergraphs, matching in graphs, data structures and algorithms for sparse matrices and tensors (apart from partitioning), and task mapping and scheduling.

Sparse linear system solvers have a wide range of applications, as they are used at the heart of many numerical methods in computational science: whether a model uses finite elements or finite differences, or requires the optimization of a complex linear or nonlinear function, one often ends up solving a system of linear equations involving sparse matrices. There are therefore many application fields: structural mechanics, seismic modeling, biomechanics, medical image processing, tomography, geophysics, electromagnetism, fluid dynamics, econometric models, oil reservoir simulation, magneto-hydro-dynamics, chemistry, acoustics, glaciology, astrophysics, and circuit simulation, as well as work on hybrid direct-iterative methods.

Tensors, or multidimensional arrays, are becoming very important because of their use in many data analysis applications. The additional dimensions over matrices (or two-dimensional arrays) enable gleaning information that is otherwise unreachable. Tensors, like matrices, come in two flavors: dense and sparse. Dense tensors usually arise in physical and simulation applications: signal processing for electroencephalography (EEG, an electrophysiological monitoring method that records the electrical activity of the brain); hyperspectral image analysis; compression of large grid-structured data coming from high-fidelity computational simulations; quantum chemistry, etc. Dense tensors also arise in a variety of statistical and data science applications. Some of the cited applications have structured sparsity in the tensors. Sparse tensors, with no apparent or special structure, appear in data analysis and network science applications. Well-known applications dealing with sparse tensors include: recommender systems; computer network traffic analysis for intrusion and anomaly detection; clustering in graphs and hypergraphs modeling various relations; and knowledge graphs/bases such as those used in natural language learning.
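A minimal illustration of the sparse flavor: a 3-way sparse tensor (say, a hypothetical user x item x time interaction tensor from a recommender system) is commonly stored in coordinate (COO) format, keeping only the nonzero entries:

```python
import numpy as np

# One row of indices per nonzero entry, plus the values themselves.
coords = np.array([[0, 1, 0],
                   [2, 0, 1],
                   [2, 3, 1]])            # shape (nnz, 3): three nonzeros
vals = np.array([5.0, 3.0, 1.0])

dims = coords.max(axis=0) + 1             # tensor dimensions inferred from indices
density = len(vals) / np.prod(dims)       # fraction of entries actually stored
```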

Anne Benoit has been elected a Senior Member of Institut Universitaire de France.

PaStiX is a scientific library that provides a high performance parallel solver for very large sparse linear systems based on block direct and block ILU(k) methods. It can handle low-rank compression techniques to reduce the computation and the memory complexity. Numerical algorithms are implemented in single or double precision (real or complex) for LLt, LDLt and LU factorization with static pivoting (for non symmetric matrices having a symmetric pattern). The PaStiX library uses the graph partitioning and sparse matrix block ordering packages Scotch or Metis.

The PaStiX solver is suitable for any heterogeneous parallel/distributed architecture when its performance is predictable, such as clusters of multicore nodes with GPU accelerators or KNL processors. In particular, we provide a high-performance version with a low memory overhead for multicore node architectures, which fully exploits the advantage of shared memory by using a hybrid MPI-thread implementation.

The solver also provides some low-rank compression methods to reduce the memory footprint and/or the time-to-solution.

The ROMA team has been working on resilience problems for several years. In 2023, we focused on several of them.

The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. It was originally designed for preemptible, tightly-coupled applications. In an invited publication at IC3 2022, we had provided some background and various application scenarios to assess the usefulness and limitations of the formula. In 2023, we considerably extended the scope of this survey and submitted the contribution to the FGCS special issue scheduled for 2024, which will focus on JLESC collaboration results.
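For reference, the formula itself is elementary: with checkpoint cost C and platform MTBF mu (expressed in the same time unit), the first-order optimal checkpointing period is W = sqrt(2*C*mu):

```python
from math import sqrt

def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal period between checkpoints: sqrt(2 * C * mu)."""
    return sqrt(2.0 * checkpoint_cost * mtbf)

# Example: 10-minute checkpoints on a platform with a 9-hour MTBF.
period = young_daly_period(10 / 60, 9.0)  # in hours
```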

This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a random variable that obeys some well-known probability distribution. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet, to the best of our knowledge. We provide the optimal solution for a variety of probability distributions modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute, at the beginning of the execution, the optimal number of tasks to execute before checkpointing. Then, we design a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiate this second scenario with several examples of probability distributions for task durations.
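A simplified numerical illustration of the first scenario (a toy model for this summary, not the paper's exact formulation): if only the work saved by a checkpoint completing within the reservation [0, T] counts, and the checkpoint duration D is exponentially distributed, then starting the checkpoint at time t yields expected saved work t * P(D <= T - t), which can be maximized numerically:

```python
import numpy as np

def expected_work(t, T, rate):
    """Expected saved work when the checkpoint starts at time t:
    t units of work are saved only if the checkpoint finishes by T,
    i.e. only if D <= T - t (D exponential with the given rate)."""
    return t * (1.0 - np.exp(-rate * (T - t)))

T, rate = 10.0, 2.0                       # 10-hour reservation, mean checkpoint 0.5 h
ts = np.linspace(0.0, T, 100_001)
best_t = ts[np.argmax(expected_work(ts, T, rate))]
# Checkpoint neither too early (work wasted) nor too late (risk of not finishing).
```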

This work has been published in FTXS'2023, a workshop co-located with SC'2023 [17].

This work studies checkpointing strategies for parallel applications subject to failures.
The optimal strategy to minimize total execution time, or makespan, is well known when failure inter-arrival times obey an Exponential distribution,
but it is unknown for non-memoryless failure distributions.
We explain why the latter fact is misunderstood in recent literature.
We propose a general strategy that maximizes the expected efficiency until the next failure,
and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary
failure distributions.
Through extensive simulations, we show that the new strategy is always at least
as good as the Young/Daly strategy for various failure distributions. For distributions
with a high infant mortality (such as LogNormal with shape parameter

This work has been published in the ACM TOPC journal [8].

We report here the work undertaken by the ROMA team in multi-criteria strategies, which focuses on taking into account energy and memory constraints, but also budget constraints or specific constraints for scheduling online requests.

The drive to decarbonize the power grid to slow the pace of climate change has caused dramatic variation in the cost, availability, and carbon intensity of power. This has begun to shape the planning and operation of datacenters. This work focuses on the design of scheduling algorithms for independent jobs that are submitted to a platform whose resource capacity varies over time. Jobs are submitted online and assigned to a target machine by the scheduler, which is agnostic to the rate and amount of resource variation. The optimization objective is the goodput, defined as the fraction of time devoted to effective computations (re-execution does not count). We introduce several novel algorithms that: (i) decide which fraction of the resources can be used safely; (ii) maintain a risk index associated with each machine; and (iii) achieve a global load balance while mapping longer jobs to safer machines. We assess the performance of these algorithms using one set of actual workflow traces together with three sets of synthetic traces. The goodput achieved by our algorithms increases by up to 10% compared to standard first-fit approaches, while we never experience any loss in complementary metrics such as the maximum or average stretch.

This work has been published in PMBS'2023, a workshop co-located with SC'2023 [22].

Lossy compression and asynchronous I/O are two of the most effective solutions for reducing storage overhead and enhancing I/O performance in large-scale high-performance computing (HPC) applications. However, current approaches have limitations that prevent them from fully leveraging lossy compression, and they may also result in task collisions, which restrict the overall performance of HPC applications. To address these issues, we propose an optimization approach for the task scheduling problem that encompasses computation, compression, and I/O. Our algorithm adaptively selects the optimal compression and I/O queue to minimize the performance degradation of the computation. We also introduce an intra-node I/O workload balancing mechanism that evenly distributes the workload across different processes. Additionally, we design a framework that incorporates fine-grained compression, a compressed data buffer, and a shared Huffman tree to fully benefit from our proposed task scheduling. Experimental results with up to 16 nodes and 64 GPUs from ORNL Summit, as well as real-world HPC applications, demonstrate that our solution reduces I/O overhead by up to 3.8× and 2.6× compared to non-compression and asynchronous I/O solutions, respectively.

This work will appear in EuroSys'2024 [20].

This work focuses on energy minimization for the mapping and scheduling of real-time workflows under reliability constraints. Workflow instances are input periodically to the system. Each instance is composed of several tasks and must complete execution before the arrival of the next instance, and with a prescribed reliability threshold. While the shape of the dependence graph is identical for each instance, task execution times are stochastic and vary from one instance to the next. The reliability threshold is met by executing several replicas for each task. The target platform consists of identical processors equipped with Dynamic Voltage and Frequency Scaling (DVFS) capabilities. A different frequency can be assigned to each task replica to save energy, but it may have a negative effect on the deadline and reliability targets.
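To make the trade-off concrete, here is a classic textbook DVFS model (an illustrative formulation with hypothetical parameters, not necessarily the exact models used in this work): executing w operations at frequency f takes time w/f, consumes dynamic energy proportional to w*f^2, and the transient-fault rate grows exponentially as f decreases:

```python
import math

def exec_time(w, f):
    return w / f

def dynamic_energy(w, f):
    return w * f ** 2                     # time (w/f) times dynamic power (f^3)

def reliability(w, f, lam0=1e-5, d=3.0, fmax=1.0, fmin=0.5):
    """Probability that the task is not hit by a transient fault."""
    lam = lam0 * 10 ** (d * (fmax - f) / (fmax - fmin))
    return math.exp(-lam * exec_time(w, f))

# Halving the frequency divides the dynamic energy by 4 but hurts reliability.
e_high, e_low = dynamic_energy(100, 1.0), dynamic_energy(100, 0.5)
r_high, r_low = reliability(100, 1.0), reliability(100, 0.5)
```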

This difficult tri-criteria mapping and scheduling problem (energy, deadline, reliability) has been studied only recently for workflows with arbitrary dependence constraints. We investigate new mapping and scheduling strategies based upon layers in the task graph. These strategies better balance replicas across processors, thereby decreasing the time overlap between different replicas of a given task, and saving energy. We compare these strategies with two state-of-the-art approaches and a reference baseline on a variety of benchmark workflows. Our best heuristics achieve an average energy gain of 60% over the competitors and of 82% over the baseline.

This work has been published in the JPDC journal [15].

We study the Slack scheduling algorithm, which groups tasks into
packs. In particular, we analyze the performance of this algorithm
from an asymptotic point of view, under the assumption that the
execution times of the tasks follow a given probability distribution.
The study builds on a comparison of the load of the most heavily
loaded machine with that of the least loaded one.
Furthermore, we extend the results to the objective of minimizing
the energy consumption rather than the makespan, since reducing the
energy consumption of computing centers is an ever-growing concern
for economic and ecological reasons.
Finally, we perform extensive simulations to empirically assess the
performance of the algorithms with both synthetic and realistic
execution time distributions.

This work appeared in the proceedings of Euro-Par 2023 [18].

Directed acyclic graphs are commonly used to model scientific workflows, by expressing dependencies between tasks, as well as the resource requirements of the workflow. As a special case, rooted directed trees occur in several applications, for instance in sparse matrix computations. Since typical workflows are modeled by large trees, it is crucial to schedule them efficiently, so that their execution time (or makespan) is minimized. Furthermore, it is usually beneficial to distribute the execution on several compute nodes, hence increasing the available memory, and allowing us to parallelize parts of the execution. To exploit the heterogeneity of modern clusters in this context, we investigate the partitioning and mapping of tree-shaped workflows on two types of target architecture models: in AM1, each processor can have a different memory size, and in AM2, each processor can also have a different speed (in addition to a different memory size). We design a three-step heuristic for AM1, which adapts and extends previous work for homogeneous clusters. The changes we propose concern the assignment to processors (accounting for the different memory sizes) and the availability of suitable processors when splitting or merging subtrees. For AM2, we extend the heuristic for AM1 with a two-phase local search approach. Phase A is a swap-based hill climber, while (the optional) Phase B is inspired by iterated local search. We evaluate our heuristics for AM1 and AM2 with extensive simulations, and we demonstrate that exploiting the heterogeneity in the cluster significantly reduces the makespan, compared to the state of the art for homogeneous processors.

This is the extension of previous work published in 2022 at the HeteroPar workshop (held in conjunction with Euro-Par), which targeted only AM1 (same-speed processors). The extension has been published in CCPE [13].

This work introduces scheduling algorithms to maximize the expected number of independent tasks that can be executed on a parallel platform within a given budget and under a deadline constraint. The main motivation for this problem comes from imprecise computations, where each job has a mandatory part and an optional part, and the objective is to maximize the number of optional parts that are successfully executed, in order to improve the accuracy of the results. The optional parts of the jobs represent the independent tasks of our problem. Task execution times are not known before execution; instead, the only information available to the scheduler is that they obey some (unknown) probability distribution. The scheduler needs to acquire some information before deciding on a cutting threshold: instead of allowing all tasks to run until completion, one may want to interrupt long-running tasks at some point. In addition, the cutting threshold may be reevaluated as new information is acquired while the execution progresses. This work presents several algorithms to determine a good cutting threshold, and to decide when to re-evaluate it. In particular, we use the Kaplan-Meier estimator to account for tasks that are still running when making a decision. The efficiency of our algorithms is assessed through an extensive set of simulations with various budget and deadline values, ranging over 13 probability distributions. In particular, the AutoPerSurvival(40%,0.005) strategy is proved to achieve at least 77% of the upper bound, even in the worst case. This shows the robustness of our strategy.
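The Kaplan-Meier estimator handles right-censored observations, that is, tasks still running when the decision is made. A minimal sketch with hypothetical data (illustrative, not the code used in the paper):

```python
import numpy as np

def kaplan_meier(durations, completed):
    """durations: observed times; completed[i] is False if task i was still
    running (censored) at durations[i].  Returns the event times and the
    estimated survival probabilities at those times."""
    durations = np.asarray(durations, dtype=float)
    completed = np.asarray(completed, dtype=bool)
    times = np.unique(durations[completed])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)              # tasks still running before t
        deaths = np.sum((durations == t) & completed) # tasks completing at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return times, np.array(surv)

# Five tasks: two interrupted (censored) at times 3 and 8.
times, surv = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```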

This work was published in the journal Algorithmica [11].

Key-value stores distribute data across several storage nodes to handle large amounts of parallel requests. Proper scheduling of these requests impacts the quality of service, as measured by achievable throughput and (tail) latencies. In addition to scheduling, performance heavily depends on the nature of the workload and the deployment environment. It is, unfortunately, difficult to evaluate different scheduling strategies consistently under the same operational conditions. Moreover, such strategies are often hard-coded in the system, limiting flexibility. We have proposed Hector, a modular framework for implementing and evaluating scheduling policies in Apache Cassandra. Hector enables users to select among several options for key components of the scheduling workflow, from the request propagation via replica selection to the local ordering of incoming requests at a storage node. We have demonstrated the capabilities of Hector by comparing strategies in various settings. For example, we found that leveraging cache locality effects may be of particular interest: we proposed a new replica selection strategy, called Popularity-Aware, that can support 6 times the maximum throughput of the default algorithm under specific key access patterns. We have also shown that local scheduling policies have a significant effect when parallelism at each storage node is limited.

This work was presented at the ICPP 2023 conference [19].

Hardware accelerators, such as GPUs, now provide a large part of the computational power used for scientific simulations. GPUs come with their own (limited) memory and are connected to the main memory of the machine via a bus with limited bandwidth. Scientific simulations often operate on very large data, to the point of not fitting in the limited GPU memory. In this case, one has to turn to out-of-core computing: data are kept in the CPU memory, and moved back and forth to the GPU memory when needed for the computation. This out-of-core situation also happens when processing huge datasets stored on disk on multicore CPUs with limited memory. In both cases, data movement quickly becomes a performance bottleneck. Task-based runtime schedulers have emerged as a convenient and efficient way to manage large applications on such heterogeneous platforms. They are in charge of choosing which tasks to assign to which processing unit, and in which order they should be processed. In this work, we have focused on this problem of scheduling for a task-based runtime to improve data locality in an out-of-core setting, in order to reduce data movements. We have designed a data-aware strategy for both task scheduling and data eviction from limited memories. We have compared this to existing scheduling techniques in runtime systems. Using the StarPU runtime, we have shown that our strategy achieves significantly better performance when scheduling tasks on multiple GPUs with limited memory, as well as on multiple CPU cores with limited main memory.
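As an illustration of data-aware eviction (a Belady-style rule, used here as a stand-in rather than the strategy actually designed in this work): when memory is full, evict the data item whose next use by an upcoming task lies furthest in the future:

```python
def evict_victim(in_memory, task_queue):
    """in_memory: set of data ids currently loaded; task_queue: upcoming
    tasks, each described by the set of data ids it needs.
    Returns the id whose next use is furthest away (or never)."""
    def next_use(d):
        for step, needed in enumerate(task_queue):
            if d in needed:
                return step
        return float("inf")               # never needed again: ideal victim
    return max(in_memory, key=next_use)

# "a" is needed at step 0, "c" at step 1, "b" only at step 2: evict "b".
victim = evict_victim({"a", "b", "c"}, [{"a"}, {"c"}, {"a", "b"}])
```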

This work has been submitted for publication and is available as a research report [30]. A preliminary version of this work has been published in the journal FGCS [12].

We continued our work on the optimization of sparse solvers by concentrating on data locality when mapping tasks to processors, and by studying the tradeoff between memory and performance when using low-rank compression. We worked on combinatorial problems arising in sparse matrix and tensor computations. The computations involved direct methods for solving sparse linear systems and tensor factorizations. The combinatorial problems were based on matchings in bipartite graphs, partitionings, and hyperedge queries.

We investigate the maximum bottleneck matching problem in bipartite graphs: given a bipartite graph with nonnegative edge weights, determine a maximum-cardinality matching in which the minimum edge weight is maximized. To the best of our knowledge, there are two widely used solvers for this problem, based on two different approaches. A third approach is known in the literature, but it seems inferior to the other two, which is presumably why no implementation of it exists. We take this third approach, make theoretical observations to improve its behavior, and implement the improved method. Experiments show that the run time of the two existing solvers can be too high to be useful in many interesting cases. Furthermore, their performance is not predictable, and slight perturbations of the input graph lead to considerable changes in run time. In contrast, the proposed solver's performance is much more stable: it is almost always faster than or comparable to the two existing solvers, and its run time always remains low. This work was published at a conference [21], and the code is available online under the CeCILL-B license.
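For concreteness, here is the classic threshold formulation of the problem (a standard baseline, not our improved algorithm): binary-search the bottleneck value over the sorted edge weights, testing each candidate threshold with a plain augmenting-path matching:

```python
def max_matching(adj, n_left, n_right):
    """Kuhn's augmenting-path algorithm: maximum cardinality matching."""
    match_r = [-1] * n_right
    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match_r[v] == -1 or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    return sum(augment(u, set()) for u in range(n_left))

def bottleneck_matching(edges, n_left, n_right):
    """edges: list of (u, v, weight).  Returns the largest w such that the
    subgraph of edges with weight >= w still has a maximum-cardinality matching."""
    def size_at(w):
        adj = [[] for _ in range(n_left)]
        for u, v, wt in edges:
            if wt >= w:
                adj[u].append(v)
        return max_matching(adj, n_left, n_right)

    target = size_at(float("-inf"))           # maximum cardinality with all edges
    weights = sorted({wt for _, _, wt in edges})
    lo, hi, best = 0, len(weights) - 1, weights[0]
    while lo <= hi:                           # binary search over edge weights
        mid = (lo + hi) // 2
        if size_at(weights[mid]) == target:
            best, lo = weights[mid], mid + 1
        else:
            hi = mid - 1
    return best

# Matchings of size 2: {(0,0),(1,1)} with min weight 4, or {(0,1),(1,0)} with min 1.
edges = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 2.0), (1, 1, 4.0)]
bottleneck = bottleneck_matching(edges, 2, 2)
```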

We have worked on a conjecture of ours about partitioning 5-point stencils for perfect load balance and minimum communication volume, when the associated matrix undergoes matrix-vector multiplication on a distributed-memory parallel computer. A month-long experimentation with the Gurobi solver confirmed that the conjecture holds; however, a non-computational proof is still needed. We stated this in a Dagstuhl report [26].

We also worked on the design of parallel, communication-optimal algorithms for several dense matrix and tensor computations.

In this work, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memory-independent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communication-optimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.
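A sequential sketch of the kernel itself (illustrative; the paper studies the parallel communication cost, not this implementation): since C = A A^T is symmetric, only the lower-triangular blocks need to be computed, halving the flops:

```python
import numpy as np

def syrk_lower(A, b=64):
    """Blocked SYRK: fill only the block lower triangle of C = A @ A.T."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, i + 1, b):      # j <= i: lower-triangular blocks only
            C[i:i+b, j:j+b] = A[i:i+b, :] @ A[j:j+b, :].T
    return C

A = np.random.default_rng(1).standard_normal((128, 96))
C = syrk_lower(A)
```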

This work has been published in SPAA 2023 [16].

Multiple Tensor-Times-Matrix (Multi-TTM) is a key computation in algorithms for computing and operating with the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required (under mild conditions) to perform the Multi-TTM computation in parallel. The crux of the proof relies on analytically solving a constrained, nonlinear optimization problem. We also present a parallel algorithm to perform this computation that organizes the processors into a logical grid with twice as many modes as the input tensor. We show that with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. Finally, we show that our algorithm can significantly reduce communication compared to the straightforward approach of expressing the computation as a sequence of tensor-times-matrix operations when the input and output tensors vary greatly in size.
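A small numpy sketch of the Multi-TTM computation itself (illustrative): all modes are contracted at once, which is mathematically equivalent to a sequence of single tensor-times-matrix products:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 9, 10))                   # 3-way input tensor
A, B, C = (rng.standard_normal((d, 4)) for d in X.shape)

# Multi-TTM: Y[a,b,c] = sum_{i,j,k} X[i,j,k] * A[i,a] * B[j,b] * C[k,c]
Y = np.einsum("ijk,ia,jb,kc->abc", X, A, B, C)

# Same result as three successive tensor-times-matrix products:
T1 = np.einsum("ijk,ia->ajk", X, A)
T2 = np.einsum("ajk,jb->abk", T1, B)
Y_seq = np.einsum("abk,kc->abc", T2, C)
```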

This work will appear in the SIMAX journal [9].

2023 was the second year of the ChalResil associate team. ChalResil stands for Challenges in resilience at scale and is operated between ROMA (PI Yves Robert) and the Innovative Computing Laboratory of
the University of Tennessee Knoxville, USA (PI Thomas Herault). Many fundamental challenges in the resilience field have yet to be addressed, and ChalResil focuses on some critical ones.

The year 2023 was quite productive for ChalResil. In addition to the several joint results reported elsewhere in this document, we organized the 16th Scheduling for Large Scale Systems Workshop, held at the University of Tennessee in Knoxville, May 22-24, 2023. The workshop was organized by George Bosilca (UTK) and Yves Robert (Inria). The other three permanent members of ChalResil participated in the workshop and gave presentations. There was a total of 25 participants, 5 of them from Inria. Further details can be found on the workshop webpage online.

The MODS associate team has started in 2023. MODS
stands for Match and Order: improving direct solvers for cardiac
simulations and is operated between ROMA (PI Grégoire Pichon) and
SIMULA Laboratory, Oslo, Norway (PI Johannes Langguth). The goal of
the MODS project is to enhance robustness, scalability, and
performance of sparse direct solvers by developing novel parallel
matching and ordering algorithms. The results will be tested on and
applied to simulations of cardiac electrophysiology developed by
SIMULA.

During 2023, Grégoire Pichon and Bora Uçar (ROMA) visited SIMULA for a week in September. A visit by three SIMULA members is planned for January 2024. Further details on this associate team can be found online.

The PeachTree associate team (details available online) has reached its completion. The work was carried out locally by Somesh Singh and Bora Uçar, as the partner was not available.

The Homeland project is funded by the PHC Procope programme. It investigates various problems related to graphs and hypergraphs (such as clustering, streaming partitioning, orientation, and maximum independent set).

Julien Langou, professor at the University of Colorado Denver (USA), has been awarded an Inria International Chair to visit the ROMA team over the period 2023–2026. He spent 1.5 months with the team in June-July 2023 to start collaborations with researchers in ROMA.

The ANR Project Solharis was launched in November 2019, for a
duration of 48 months. It gathers five academic partners (the
HiePACS, Roma, RealOpt, STORM and TADAAM INRIA project-teams, and
CNRS-IRIT) and two industrial partners (CEA/CESTA and Airbus CRT). This
project aims at producing scalable direct methods for the solution
of sparse linear systems on large-scale, heterogeneous computing
platforms, based on task-based runtime systems.

The proposed research is organized along three distinct research thrusts. The first objective deals with the development of scalable linear algebra solvers on task-based runtimes. The second one focuses on the deployment of runtime systems on large-scale heterogeneous platforms. The last one is concerned with scheduling these particular applications on a heterogeneous and large-scale environment.

The ANR Project SPARTACLUS was launched in January 2023 for a
duration of 48 months. This is a JCJC project led by Grégoire
Pichon, including other participants of the ROMA team. This project
aims at building new ordering strategies to enhance the behavior of
sparse direct solvers using low-rank compression.

The objective of this project is to provide a common tool performing both the ordering and the clustering for sparse direct solvers using low-rank compression. We will provide statistics that are currently missing and that will help understand the compressibility of each block. The objective is to enhance sparse direct solvers, in particular targeting larger problems. The benefits will directly apply to academic and industrial applications using sparse direct solvers.

Anne Benoit, Andrew A. Chien (University of Chicago and Argonne National Laboratory, USA) and Yves Robert co-organized a workshop at the Paris Center of the University of Chicago, on March 29-31, 2023. The workshop gathered scheduling leaders from the United States, France, Europe, and Asia to discuss the research challenges facing the scheduling community, arising from the increasing fluctuations of renewable energy in the power grid. In such an environment, scheduling to reduce carbon emissions, reduce power costs, help decarbonize the power grid, or simply stabilize the grid all requires awareness of variable capacity.

The Workshop Report outlining the research challenges documented by this group of researchers has been completed and is available online. The full workshop notes, including position papers from all attendees as well as workshop organization details, are also available online.

George Bosilca (University Tennessee Knoxville, USA) and Yves Robert have organized the 16th Workshop on Scheduling for Large Scale Systems in Knoxville in May 2023. Further details can be found on the workshop webpage.