# Reconstructing householder vectors from tall-skinny QR

- The standard householder QR algorithm and Tall-Skinny QR: differences
- 1D parallel algorithms
- 1 D Householder QR
- 1 D Tall Skinny QR TSQR
- LU decomposition of A - R
- Cholesky-QR decomposition

- 2D parallel algorithms
- 2D block-cyclic householder QR
- Communication avoiding CAQR
- Butterfly CAQR

The present paper deals with the fact that the output of TSQR algorithm is not represented as householder format, hence was born the idea to provide a new version of TSQR algorithm while taking advantage of its multiple benefits such as: communication efficiency, less computational cost In addition, this algorithm is considered stable numerically, as is proven by many experiments.

This paper describes also some of the parallel algorithms which are based on the householder representation, and it presents alternatives to the householder reconstruction algorithm, but they suffer from numerical stability issue.

[...] 2D parallel algorithms All 2D parallel algorithms require the use of 1D algorithms in order to perform a factorization of each matrix panel and propose different methods to update the trailing matrix. What specifically characterizes this kind of algorithms is that the input matrix resides on processor grids. 2D block-cyclic householder QR This algorithm proceeds in two main phases: - Factorization of each matrix panel by using the 1D householder algorithm. - Updating the trailing matrix after the computation of T. There are two factors that can incur additional costs in term of interprocessor communication and memory-bandwidth: - The trailing matrix updates. [...]

[...] These trees represent the communication structure of the algorithms. The difference between a reduction and an all-reduction consists in the fact that the matrix final R is owned by only one processor D Householder QR The purpose of this algorithm is to produce results in householder format during the trailing matrix update phase, it proposes a method to compete the process of parallelization which is the use of two all reductions, even if they contribute to increase the latency cost since these two all reductions should arise on each column in the matrix D Tall Skinny QR TSQR It can be performed with any tree. [...]

[...] Before introducing the CAQR algorithm, it is opportune to present the algorithm which implements the application of the orthogonal factor as a binary reduction tree, its purpose is to update the trailing matrix. If we look for improved computation and communication costs, it is advisable to form just an implicit representation of householder matrices, not an explicit format of the orthogonal factor. Butterfly CAQR As mentioned in the latter paragraph, the CAQR algorithm relies on the use of a tree TSQR in order to update the trailing matrix, but this tree shape generates load imbalance issue, so that the use of butterfly CAQR algorithm appears more efficient than CAQR. [...]

[...] - To compute the householder vectors using the algorithm of modified LU. The main purpose of this algorithm is to guarantee sign agreement between the columns of the orthogonal factor and rows of R. One of its characteristics is to compute the upper triangular matrix T from the upper triangular factor U not directly from the householder vectors, that's why it features improved costs compared to the householder QR algorithm, concerning the different types of cost. The algorithm TSQR with householder reconstruction has an optimized version that allows to minimize the computation and communication costs. [...]