Section: New Results
Communication avoiding algorithms for linear algebra
Participants : Laura Grigori, Amal Khabou, Mathias Jacquelin, Sophie Moufawad.
The focus of this research is the design of efficient parallel algorithms for solving problems in numerical linear algebra, such as very large sets of linear equations and large least squares problems, often with millions of rows and columns. These problems arise in many numerical simulations, and solving them is very time consuming.
Our research focuses on developing new algorithms for linear algebra problems that minimize the required communication, in terms of both latency and bandwidth. In 2008 we introduced two communication avoiding algorithms for computing the LU and QR factorizations, which we refer to as CALU and CAQR (joint work with J. Demmel and M. Hoemmen from U.C. Berkeley, J. Langou from C.U. Denver, and H. Xiang, then at Inria) [18] [8]. Since then, we have continued designing communication avoiding algorithms for other operations in both dense and sparse linear algebra. Communication avoiding algorithms are now studied by several other groups, including groups at Inria, and they are starting to be implemented and made available in public libraries such as ScaLAPACK.
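To illustrate the idea behind CAQR for tall and skinny matrices, the sketch below shows the well-known TSQR scheme: each block of rows is factored locally, and a single small QR of the stacked R factors replaces many rounds of communication. This is a minimal sequential sketch in NumPy; the function name and the block count are our own choices for illustration, not the library implementation.

```python
import numpy as np

def tsqr(A, nblocks=4):
    """Communication avoiding QR sketch for a tall-skinny matrix A.

    Stage 1: independent QR of each row block (no communication between blocks).
    Stage 2: one QR of the small stacked R factors, then reassemble the global Q.
    Assumes each block has at least as many rows as A has columns.
    """
    m, n = A.shape
    blocks = np.array_split(A, nblocks, axis=0)
    Qs, Rs = zip(*(np.linalg.qr(B) for B in blocks))
    # Second stage: QR of the (nblocks*n) x n matrix of stacked R factors
    Q2, R = np.linalg.qr(np.vstack(Rs))
    # Apply each n x n piece of Q2 to the corresponding local Q
    Q2blocks = np.split(Q2, nblocks, axis=0)
    Q = np.vstack([Qi @ Q2i for Qi, Q2i in zip(Qs, Q2blocks)])
    return Q, R

# Usage: factor a 1000 x 8 matrix; Q @ R reproduces A and R is upper triangular.
A = np.random.default_rng(0).standard_normal((1000, 8))
Q, R = tsqr(A)
```

In a parallel setting, stage 2 is performed along a reduction tree over the processors, which is what reduces the number of messages compared with column-by-column Householder QR.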
During 2012, our research [43] has focused on the design of LU decomposition with panel rank revealing pivoting (LU_PRRP), an LU factorization algorithm based on strong rank revealing QR panel factorization. LU_PRRP is more stable than Gaussian elimination with partial pivoting (GEPP), with a smaller theoretical upper bound on the growth factor.
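The growth factor referred to here measures how much the matrix entries can grow during elimination; for GEPP the classical worst-case bound is 2^(n-1), attained on Wilkinson's example. The sketch below (our own illustration, not the LU_PRRP code) implements plain GEPP in NumPy and computes the growth factor on that example.

```python
import numpy as np

def gepp(A):
    """LU factorization with partial pivoting; returns pivot order, L, U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Select the largest entry in column k as pivot (ties keep the first row)
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return piv, np.tril(A, -1) + np.eye(n), np.triu(A)

def growth_factor(A):
    """max |U_ij| / max |A_ij| for the GEPP factorization of A."""
    _, _, U = gepp(A)
    return np.abs(U).max() / np.abs(A).max()

# Wilkinson's example: 1 on the diagonal, -1 below, 1 in the last column.
# GEPP performs no row swaps and the last column doubles at every step.
n = 8
W = np.tril(-np.ones((n, n)), -1) + np.eye(n)
W[:, -1] = 1.0
print(growth_factor(W))  # 2**(n-1) = 128.0
```

A smaller growth-factor bound, as for LU_PRRP, translates directly into a better backward stability guarantee.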
Our work has also focused on designing algorithms that are optimal over multiple levels of memory hierarchy and parallelism. In [32] we present an algorithm for performing the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. This algorithm minimizes both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors based on MPI and Pthreads. We show that this implementation leads to better performance than routines implementing the LU factorization in well-known numerical libraries. For tall and skinny matrices, that is, matrices with many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a significant factor.
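The panel factorization underlying CALU selects pivot rows with a tournament over a reduction tree: each block of rows proposes candidate pivots via GEPP, and candidates are merged pairwise until one set of b winners remains. The sketch below is a minimal sequential illustration in NumPy; the function names are our own and this is not the MPI/Pthreads implementation described above.

```python
import numpy as np

def gepp_rows(W):
    """Return the b row indices GEPP would pick as pivots for panel W (rows >= cols)."""
    W = W.astype(float).copy()
    m, b = W.shape
    idx = np.arange(m)
    for k in range(b):
        p = k + np.argmax(np.abs(W[k:, k]))
        W[[k, p]], idx[[k, p]] = W[[p, k]], idx[[p, k]]
        W[k + 1:, k:] -= np.outer(W[k + 1:, k] / W[k, k], W[k, k:])
    return idx[:b]

def tournament_pivots(W, nblocks=4):
    """CALU-style tournament pivoting sketch for an m x b panel W."""
    m, b = W.shape
    # Leaves of the tree: candidate pivot rows from each block of rows
    cands = [blk[gepp_rows(W[blk])]
             for blk in np.array_split(np.arange(m), nblocks)]
    # Reduction tree: merge candidate sets pairwise until b winners remain
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            if i + 1 == len(cands):
                merged.append(cands[i])
            else:
                rows = np.concatenate([cands[i], cands[i + 1]])
                merged.append(rows[gepp_rows(W[rows])])
        cands = merged
    return cands[0]

# Usage: select 4 pivot rows for a 64 x 4 panel split over 4 blocks.
W = np.random.default_rng(1).standard_normal((64, 4))
piv = tournament_pivots(W)
```

In the two-level parallel setting, the leaves map to threads within a node and the upper tree levels to MPI processes across nodes, so each tree level costs only one message per participant.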