Section: New Results
Communication avoiding algorithms for linear algebra
Participants : Laura Grigori, Amal Khabou, Mathias Jacquelin, Sophie Moufawad.
The focus of this research is the design of efficient parallel algorithms for solving problems in numerical linear algebra, such as very large sets of linear equations and large least squares problems, often with millions of rows and columns. These problems arise in many numerical simulations, and solving them is very time consuming.
Our research focuses on developing new algorithms for linear algebra problems that minimize the required communication, in terms of both latency and bandwidth. In 2008 we introduced two communication avoiding algorithms for computing the LU and QR factorizations, which we refer to as CALU and CAQR (joint work with J. Demmel and M. Hoemmen from U.C. Berkeley, J. Langou from C.U. Denver, and H. Xiang, then at Inria) [18] [8]. Since then, we have continued designing communication avoiding algorithms for other operations in both dense and sparse linear algebra. Communication avoiding algorithms are now studied by several other groups, including groups at Inria, and they are starting to be implemented and made available in public libraries such as ScaLAPACK.
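To illustrate the idea behind CAQR for tall and skinny matrices, the sketch below shows the well-known TSQR scheme: each block of rows is factored locally, and a single small QR of the stacked R factors replaces many rounds of communication. This is a minimal sequential sketch in NumPy; the function name and the block count are our own choices for illustration, not the library implementation.

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Communication avoiding QR sketch for a tall-skinny matrix A.

    Stage 1: independent QR of each row block (no communication between blocks).
    Stage 2: one QR of the small stacked R factors, then reassemble the global Q.
    Assumes each block has at least as many rows as A has columns.
    """
    m, n = A.shape
    blocks = np.array_split(A, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
    # Second stage: QR of the (nblocks*n) x n matrix of stacked R factors
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Apply each n x n piece of Q2 to the corresponding local Q
    Q2blocks = np.split(Q2, nblocks, axis=0)
    Q = np.vstack([Qi @ Q2i for Qi, Q2i in zip(Qs, Q2blocks)])
    return Q, R

# Usage: factor a 1000 x 8 matrix; Q @ R reproduces A and R is upper triangular.
A = np.random.default_rng(0).standard_normal((1000, 8))
Q, R = tsqr(A)
```

In a parallel setting, stage 2 is performed along a reduction tree over the processors, which is what reduces the number of messages compared with column-by-column Householder QR.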
During 2012, our research [43] has focused on the design of LU decomposition with panel rank revealing pivoting (LU_PRRP), an LU factorization algorithm based on strong rank revealing QR panel factorization. LU_PRRP is more stable than Gaussian elimination with partial pivoting (GEPP), with a smaller theoretical upper bound on the growth factor.
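The growth factor referred to here measures how much the matrix entries can grow during elimination; for GEPP the classical worst-case bound is 2^(n-1), attained on Wilkinson's example. The sketch below (our own illustration, not the LU_PRRP code) implements plain GEPP in NumPy and computes the growth factor on that example.

```python
import numpy as np

def gepp(A):
    """LU factorization with partial pivoting; returns pivot order, L, U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Select the largest entry in column k as pivot (ties keep the first row)
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return piv, np.tril(A, -1) + np.eye(n), np.triu(A)

def growth_factor(A):
    """max |U_ij| / max |A_ij| for the GEPP factorization of A."""
    _, _, U = gepp(A)
    return np.abs(U).max() / np.abs(A).max()

# Wilkinson's example: 1 on the diagonal, -1 below, 1 in the last column.
# GEPP performs no row swaps and the last column doubles at every step.
n = 8
W = np.tril(-np.ones((n, n)), -1) + np.eye(n)
W[:, -1] = 1.0
print(growth_factor(W))  # 2**(n-1) = 128.0
```

A smaller growth-factor bound, as for LU_PRRP, translates directly into a better backward stability guarantee.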
Our work has also focused on designing algorithms that are optimal over multiple levels of memory hierarchy and parallelism. In [32] we present an algorithm for performing the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. This algorithm minimizes both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors based on MPI and Pthreads. We show that this implementation leads to better performance than routines implementing the LU factorization in well-known numerical libraries. For tall and skinny matrices, that is, matrices with many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a significant factor.
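The panel factorization underlying CALU selects pivot rows with a tournament over a reduction tree: each block of rows proposes candidate pivots via GEPP, and candidates are merged pairwise until one set of b winners remains. The sketch below is a minimal sequential illustration in NumPy; the function names are our own and this is not the MPI/Pthreads implementation described above.

```python
import numpy as np

def gepp_rows(W):
    """Return the b row indices GEPP would pick as pivots for panel W (rows >= cols)."""
    W = W.astype(float).copy()
    m, b = W.shape
    idx = np.arange(m)
    for k in range(b):
        p = k + np.argmax(np.abs(W[k:, k]))
        W[[k, p]], idx[[k, p]] = W[[p, k]], idx[[p, k]]
        W[k + 1:, k:] -= np.outer(W[k + 1:, k] / W[k, k], W[k, k:])
    return idx[:b]

def tournament_pivots(W, nblocks=4):
    """CALU-style tournament pivoting sketch for an m x b panel W."""
    m, b = W.shape
    # Leaves of the tree: candidate pivot rows from each block of rows
    cands = [blk[gepp_rows(W[blk])]
             for blk in np.array_split(np.arange(m), nblocks)]
    # Reduction tree: merge candidate sets pairwise until b winners remain
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            if i + 1 == len(cands):
                merged.append(cands[i])
            else:
                rows = np.concatenate([cands[i], cands[i + 1]])
                merged.append(rows[gepp_rows(W[rows])])
        cands = merged
    return cands[0]

# Usage: select 4 pivot rows for a 64 x 4 panel split over 4 blocks.
W = np.random.default_rng(1).standard_normal((64, 4))
piv = tournament_pivots(W)
```

In the two-level parallel setting, the leaves map to threads within a node and the upper tree levels to MPI processes across nodes, so each tree level costs only one message per participant.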