## Section: New Results

### Communication avoiding algorithms

Our group continues to work on algorithms for dense and sparse linear algebra operations that minimize communication. During this year we focused on communication avoiding iterative methods and designing algorithms for computing rank revealing and low rank approximations of dense and sparse matrices.

In [9], we discuss sparse matrix-matrix multiplication (or SpGEMM), which is an important operation for many algorithms in scientific computing. In our previous work we have identified lower bounds on communication for this operation, which is the limiting factor of SpGEMM. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos–Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.

In [10] we discuss algorithms that not only aim at minimizing communication, but they also aim at reducing the number of writes to secondary storage. Most of the prior work does not distinguish between loads and stores, i.e., between reads and writes to a particular memory unit. But in fact there are some current and emerging nonvolatile memory technologies (NVM) where writes can be much more expensive (in time and energy) than reads. NVM technologies are being considered for scientific applications on extreme scale computers and for cluster computing platforms, in addition to commodity computers.

This motivates us to first refine prior work on communication lower bounds of algorithms which did not distinguish between loads and
stores to derive new lower bounds on writes to different levels of a memory hierarchy. When these
new lower bounds on writes are asymptotically smaller than the
previous bounds on the total number of loads and stores, we ask
whether there are algorithms that attain them. We call such
algorithms, that both minimize the total number of loads and stores
(i.e., are CA), and also do asymptotically fewer writes than reads,
*write-avoiding (WA)*. In this paper, we identify several classes
of problems where either sequential or parallel WA algorithms exist,
or provably cannot.

In [7] we introduce a new approach for reducing communication in Krylov subspace methods that consists of enlarging the Krylov subspace by a maximum of $t$ vectors per iteration, based on the domain decomposition of the graph of $A$.We show in this paper that the enlarged Krylov projection subspace methods lead to faster convergence in terms of iterations and parallelizable algorithms with less communication, with respect to Krylov methods.

In this paper we focus on Conjugate Gradient (CG), a Krylov projection method for symmetric (Hermitian) positive definite matrices. We discuss two new versions of Conjugate Gradient. The first method, multiple search direction with orthogonalization CG (MSDO-CG), is an adapted version of MSD-CG with the A-orthonormalization of the search directions to obtain a projection method that guarentees convergence at least as fast as CG. The second projection method that we propose here, long recurrence enlarged CG (LRE-CG), is similar to GMRES in that we build an orthonormal basis for the enlarged Krylov subspace rather than finding search directions. Then, we use the whole basis to update the solution and the residual. We compare the convergence behavior of both methods using different A-orthonormalization and orthonormalization methods and then we compare the most stable versions with CG and other related methods. Both methods converge faster than CG in terms of iterations, but LRE-CG converges faster than MSDO-CG since it uses the whole basis to update the solution rather than only $t$ search directions. And the more subdomains are introduced or the larger $t$ is, the faster is the convergence of both methods with respect to CG in terms of iterations. For example, for $t=64$ the MSDO-CG and LRE-CG methods converge in $75\%$ up to 98 less iteration with respect to CG for the different test matrices.

In [12] we present an algorithm for computing a low rank approximation of a sparse matrix based on a truncated LU factorization with column and row permutations. We present various approaches for determining the column and row permutations that show a trade-off between speed versus deterministic/probabilistic accuracy. We show that if the permutations are chosen by using tournament pivoting based on QR factorization, then the obtained truncated LU factorization with column/row tournament pivoting, LU_CRTP, satisfies bounds on the singular values which have similarities with the ones obtained by a communication avoiding rank revealing QR factorization. Experiments on challenging matrices show that LU_CRTP provides a good low rank approximation of the input matrix and it is less expensive than the rank revealing QR factorization in terms of computational and memory usage costs, while also minimizing the communication cost. We also compare the computational complexity of our algorithm with randomized algorithms and show that for sparse matrices and high enough but still modest accuracies, our approach is faster.