Research Highlights

Solving Sparse Linear Systems Faster than Matrix Multiplication

Presenting an algorithm that solves linear systems with sparse coefficient matrices asymptotically faster than matrix multiplication for any ω > 2.

arc of data cubes, illustration
Read the related Technical Perspective


Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an n × n linear system Ax = b is Õ(nω), where ω < 2.372864 is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly(n) condition number.

In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any ω > 2. This speedup holds for any input matrix A with o(nω−1/log(k(A))) non-zeros, where k(A) is the condition number of A. For poly(n)-conditioned matrices with Õ(n) nonzeros, and the current value of ω, the bit complexity of our algorithm to solve to within any 1/poly(n) error is O(n2.331645).

Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC ’06 ’07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices.

1. Introduction

Solving a linear system Ax=b is a basic algorithmic problem with direct applications to scientific computing, engineering, and physics, and is at the core of algorithms for many other problems, including optimization, data science, and computational geometry. It has enjoyed an array of elegant approaches, from Cramer’s rule and Gaussian elimination to numerically stable iterative methods to more modern randomized variants based on random sampling8,19 and sketching.23 Despite much recent progress on faster solvers for graph-structured linear systems8,9,19 progress on the general case has been elusive.

Most of the work in obtaining better running time bounds for linear systems solvers has focused on efficiently computing the inverse of A, or some factorization of it. Such operations are in turn closely related to the cost of matrix multiplication. Matrix inversion can be reduced to matrix multiplication via divide-and-conquer, and this reduction was shown to be stable when the word size for representing numbersb is increased by a factor of O( log n).4 The current best runtime of ω<2.38 follows a long line of work on faster matrix multiplication algorithms and is also the current best running time for solving Ax=b: when the input matrix/vector are integers, matrix multiplication based algorithms can obtain the exact rational solution using O(nω) word operations.20

Methods for matrix inversion or factorization are often referred to as direct methods in the linear systems literature. This is in contrast to iterative methods, which gradually converge to the solution. Iterative methods have low space overhead, and therefore are widely used for solving large, sparse, linear systems that arise in scientific computing. Another reason for their popularity is that iterative methods are naturally suited to producing approximate solutions of desired accuracy in floating point arithmetic, the de facto method for representing real numbers. Perhaps the most famous iterative method is the Conjugate Gradient (CG) / Lanczos algorithm.16 It was introduced as an O(n·nnz) time algorithm under exact arithmetic, where nnz is the number of non-zeros in the input matrix. However, this bound only holds under the Real RAM model where words have unbounded precision. When taking bit sizes into account, it incurs an additional factor of n. Despite much progress in iterative techniques in the intervening decades, obtaining gains over matrix multiplication in the presence of round-off errors has remained an open question.

The convergence and stability of iterative methods typically depend on some condition number of the input. When all intermediate steps are carried out using precision close to the condition number of A, the running time bounds of the CG algorithm, as well as other currently known iterative methods, depend polynomially on the condition number of the input matrix A. Formally, the condition number of a symmetric matrix A, κ(A), is the ratio between the maximum and minimum eigenvalues of A. Here the best known rate of convergence when all intermediate operations are restricted to bit-complexity O(log (κ(A))) is O(κ(A) log (1/ϵ)) iterations to achieve error ϵ. This is known to be tight if one restricts to matrix-vector multiplications in the intermediate steps.12,17 This means for moderately conditioned (e.g., with κ=poly(n)), sparse, systems, the best runtime bounds are still via direct methods, which are stable when O(log (1/κ)) words of precision are maintained in intermediate steps.4

Many of the algorithms used in practice in scientific computing for solving linear systems involving large, sparse matrices are based on combining direct and iterative methods: we will briefly discuss these perspectives in Section 1.3. In terms of asymptotic complexity, the practical successes of many such methods naturally lead to the question of whether one can provably do better than the O(min{nω,nnz·κ(A)}) time corresponding to the faster of direct or iterative methods. Somewhat surprisingly, despite the central role of this question in scientific computing and numerical analysis, as well as extensive studies of linear systems solvers, progress on it has been elusive. The continued lack of progress on this question has led to its use as a hardness assumption for showing conditional lower bounds for numerical primitives such as linear elasticity problems25 and positive linear programs.10 One formalization of such hardness is the Sparse Linear Equation Time Hypothesis (SLTH) from:10 SLTHkγ denotes the assumption that a sparse linear system with κnnz(A)k cannot be solved in time faster than nnz(A)γ to within relative error ϵ=n10k. Here, improving over the smaller running time of both direct and iterative methods can be succinctly encapsulated as refuting SLTHkmin{1+k/2,ω}.c

We provide a faster algorithm for solving sparse linear systems. Our formal result is the following (we use the form defined in:10 Linear Equation Approximation Problem, LEA).

Theorem 1.

Given a matrix A with maximum dimension n, nnz(A) non-zeros (whose values fit into a single word), along with a parameter κ(A) such that κ(A)σmax(A)/σmin(A), a vector b and error requirement ϵ, we can compute, under fixed point arithmetic, in time O ( max { n n z ( A ) ω 2 ω 1 n 2 , n 5 ω 4 ω + 1 }  log  2 ( κ / ϵ ) )

a vector x such that A x Π A b 2 2     ϵ   Π A b 2 2 ,

where c is a fixed constant and ΠA is the projection operator onto the column space ofA.

Note that ΠAb2=ATb(ATA), and when A is square and full rank, it is just b2.

The cross-over point for the two bounds is at nnz(A)=n3(ω1)ω+1. In particular, for the sparse case with nnz(A)=O(n), and the bound of ω2.38, we get an exponent of

max { 2 + ω 2 ω 1 , 5 ω 4 ω + 1 } < max { 2.28 , 2.34 } = 2.34.

As nnnz, this also translates to a running time of O(nnz5ω4ω+1), which as 5ω4ω+1=ω(ω2)2ω+1, refutes SLTHkω for constant values of k and any value of ω>2.

We can parameterize the asymptotic gain over matrix multiplication for moderately sparse instances. Here we use the O~(·) notation to hide lower-order terms, specifically O~(f(n)) denotes O(f(n)· log c(f(n))) for some absolute constant c.

Corollary 2.

For any matrixAwith dimension at mostn, O(nω1θ)non-zeros, and condition numbernO(1), a linear system inAcan be solved to accuracynO(1)in timeO~(max{n5ω4ω+1,nωθ(ω2)ω1}).

Here the cross-over point happens at θ=(ω1)(ω2)ω+1. Also, because 5ω4ω+1=ω(ω2)2ω+1, we can also infer that for any 0<θω2 and any ω>2, the runtime is o(nω), or asymptotically faster than matrix multiplication.

1.1 Idea

At a high level, our algorithm follows the block Krylov space method (see e.g., Chapter 6.12 of Saad16). This method is a multi-vector extension of the CG/Lanczos method, which in the single-vector setting is known to be problematic under round-off errors both in theory12 and in practice.16 Our algorithm starts with a set of s initial vectors, Bn×s, and forms a column space by multiplying these vectors by A repeatedly, m times. Formally, the block Krylov space matrix is

K = [   B | A B | A 2 B | | A m 1 B   ] .

The core idea of Krylov space methods is to efficiently orthogonalize this column space. For this space to be spanning, block Krylov space methods typically choose s and m so that sm=n.

The conjugate gradient algorithm can be viewed as an efficient implementation of the case s=1, m=n, with B set to b, the RHS of the input linear system. The block case with larger values of s was studied by Eberly, Giesbrecht, Giorgi, Storjohann, and Villard5 over finite fields, and they gave an O(n2.28)  timed algorithm for computing the inverse of an O(n)-sparse matrix over a finite field.

Our algorithm also leverages the top-level insight of the Eberly et al. results: the Gram matrix of the Krylov space matrix, is a block Hankel matrix. Solving linear systems in this Gram matrix, (AK)T(AK), lead to solvers for linear systems in A because

( ( A K ) Τ ( A K ) ) 1 = ( K Τ A Τ A K ) 1 = K 1 A 1 A Τ K Τ ,

so as long as A and K are both invertible, composing this on the left by K and on the right by AK gives

A 1 = K ( ( A K ) Τ ( A K ) ) 1 A T K T .

Eberly et al. viewed the Gram matrix as an m-by-m matrix containing s-by-s sized blocks, and critically leveraged the fact that the blocks along each anti-diagonal are identical:

( A K ) T ( A K ) = B T A 2 B B T A 3 B B T A 4 B B T A m + 1 B B T A 3 B B T A 4 B B T A 5 B B T A m + 2 B B T A 4 B B T A m + 1 B B T A 5 B B T A m + 2 B B T A 6 B B T A m + 3 B B T A m + 3 B B T A 2 m B

Formally, the s-by-s inner product matrix formed from AiB and AjB is BTAi+jB, and depends only on i+j. So instead of m2 blocks each of size s×s, we are able to represent an n-by-n matrix with only about m blocks.

Operations involving these m blocks of the Hankel matrix can be handled using O~(m) block operations. This is perhaps easiest seen for computing matrix-vector products using K. If we use {i} to denote the ith block of the Hankel matrix, and define

H { i , j } = M ( i + j )

for a sequence of matrices M, we get that the ith block of the product Hx can be written in block-form as

( H x ) { i } = j H { i , j } x { j } = j M ( i + j ) x { j } .

Note this is precisely the convolution of (a sub-interval) of M and x, with shifts indicated by i. Therefore, in matrix-vector multiplication (the “forward” direction), a speedup by a factor of about m is possible with fast convolution algorithms. The performance gains of the Eberly et al. algorithms5 can be viewed as being of a similar nature, albeit in the more difficult direction of solving linear systems. Specifically, they utilize algorithms for the Padé problem of computing a polynomial from the result of its convolution.1 Over finite fields, or under exact arithmetic, such algorithms for matrix Padé problems take O(m log m) block operations,1 for a total of O~(sωm) operations.

The overall time complexity is based on two opposing goals:

  1. Quickly generate the Krylov space: repeated multiplication by A allows us to generate AiB using O(ms·nnz)=O(n·nnz) arithmetic operations. Choosing a sparse B then allows us to compute BTAiB in O(n·s) arithmetic operations, for a total overhead of O(n2).

  2. Quickly invert the Hankel matrix. Each operation on an s-by-s block takes O(sω) time. Under the optimistic assumption of O~(m) block operations, the total is O~(m·sω).

Under these assumptions, and the requirement of nms, the total cost becomes about O(n·nnz+m·sω), which is at most O(n·nnz) as long as m>nω2ω1. However, this runtime complexity is over finite fields, where numerical stability is not an issue. Over the reals, under round-off errors, one must contend with numerical errors without blowing up the bit complexity. This is a formidable challenge; indeed, as mentioned earlier, with exact arithmetic, the CG method takes time O(n·nnz), but this is misleading since the computation is effective only when the word sizes are increased by a factor of n (to about n log κ words), which leads to an overall complexity of O(n2·nnz· log κ).

1.2 Our Contributions

Our algorithm can be viewed as the numerical generalization of the algorithms from.5 We work with real numbers of bounded precision, instead of entries over a finite field. The core of our approach can be summarized as follows.

Doing so requires developing tools for two topics that have been extensively studied in mathematics, doing so requires separately developing:

  1. Obtain low numerical cost solvers for block Hankel/Toeplitz matrices. Many of the prior algorithms rely on algebraic identities that do not generalize to the block setting, and are often (experimentally) numerically unstable.7

  2. Develop matrix anti-concentration bounds for analyzing the word lengths of inverses of random Krylov spaces. This is to upper bound the probability of random matrices being in some set of small measure, which in our case is the set of nearly singular matrices. Previously, such bounds were known assuming the matrix entries are independent18,21 but Krylov matrices have correlated columns.

Before we describe the difficulties and new tools needed, we first provide some intuition on why a factor m increase in word lengths may be the right answer by upper-bounding the magnitudes of entries in an m-step Krylov space. By rescaling, we may assume that the minimum singular value of A is at least 1/κ, and the maximum entry in A is at most κ. The maximum magnitude of (entries of) Amb is bounded by the maximum magnitude of A to the power of m, times a factor corresponding to the number of summands in the matrix product:

|| A m b || n m ||| A ||| m   || b || ( n k ) m     || b || κ O ( m ) || b || .

Here the last inequality is via the assumption of κn. So by forming K via horizontally concatenating Aig for sparse Gaussian vectors g, we have with high probability that the maximum magnitude of an entry of K, and in turn AK, is at most κO(m). In other words, O(m log κ) words in front of the decimal point is sufficient with high probability.

Should such a bound of O(m log κ) hold for all numbers that arise in the algorithm, including the matrix inversion steps, and the matrix B is sparse with O(n) entries, the cost of computing the block-Krylov matrices becomes O(m log κ·ms·nnz), while the cost of the matrix inversion portion encounters an overhead of O(m log κ), for a total of O~(m2sω log κ). In the sparse case of nnz=O(n), and nms, this becomes:

O ( n 2 m  log  κ + m 2 s ω  log  κ ) = O ( n 2 m  log  κ + n ω m ω 2  log  κ ) .

Due to the gap between n2 and nω, setting m appropriately gives improvement over nω when  log κ=no(1).

However, the magnitude of an entry in the inverse depends on the smallest magnitude, or in the matrix case, its minimum singular value. Bounding and propagating the min singular value, which intuitively corresponds to how close a matrix is to being degenerate, represents our main challenge. In exact/finite fields settings, non-degeneracies are certified via the Schwartz-Zippel lemma about polynomial roots. The numerical analog of this is more difficult — the Krylov space matrix K is asymmetric, even for a symmetric matrix A. It is much easier for an asymmetric matrix with correlated entries to be close to singular.

Consider for example a two-banded, two-block matrix with all diagonal entries set to the same random variable α (see Figure 1):

A i j = { 1 if   i = j   and   j n / 2 , α if   i = j + 1   and   j n / 2 , α if   i = j + 1   and   n / 2 < j , 2 if   i = j + 1   and   n / 2 < j , 0 otherwise .
Figure 1.  The difference between matrix anti-concentration over finite fields and reals: a matrix that is full rank for all α0, but is always ill conditioned.

In the exact case, this matrix is full rank unless α=0, even over finite fields. On the other hand, its minimum singular value is close to 0 for all values of α. To see this, it’s useful to first make the following observation about the min singular value of one of the blocks.

Observation 3.

The minimum singular value of a matrix with 1s on the diagonal, α on the entries immediately below the diagonal, and 0 everywhere else is at most |α|(n1), due to the test vector [1;α;α2;;(α)n1].

Then, in the top-left block, as long as |α|>3/2, the top left block has minimum singular value at most (2/3)n1. On the other hand, rescaling the bottom-right block by 1/α to get 1s on the diagonal gives 2/α on the off-diagonal. So as long as |α|<3/2, this value is at least 4/3, which in turn implies a minimum singular value of at most (3/4)n1 in the bottom right block. This means no matter what value α is set to, this matrix will always have a singular value that’s exponentially close to 0. Furthermore, the Gram matrix of this matrix also gives such a counter example to symmetric matrices with (non-linearly) correlated entries. Previous works on analyzing condition numbers of asymmetric matrices also encounter similar difficulties; a more detailed discussion of it can be found in Section 7 of Sankar et al.18

In order to bound the bit complexity of all intermediate steps of the block Krylov algorithm by O~(m· log κ), we devise a more numerically stable algorithm for solving block Hankel matrices, as well as provide a new perturbation scheme to quickly generate a well-conditioned block Krylov space. Central to both of our key components is the close connection between condition number and bit complexity bounds.

First, we give a more numerically stable solver for block Hankel/Toeplitz matrices. Fast solvers for Hankel (and closely related Toeplitz) matrices have been extensively studied in numerical analysis, with several recent developments on more stable algorithms.24 However, the notion of numerical stability studied in these algorithms is the variant where the number of bits of precision is fixed. Our attempts at converting these into asymptotic bounds yielded dependencies quadratic in the number of digits in the condition number, which in our setting translates to a prohibitive cost of O~(m2) (i.e., the overall cost would be higher than nω).

Instead, we combine developments in recursive block Gaussian elimination4 with the low displacement rank representation of Hankel/Toeplitz matrices.7 Such representations allow us to implicitly express both the Hankel matrix and its inverse by displaced versions of low-rank matrices. This means the intermediate size of instances arising from recursion is O(s) times the dimension, for a total size of O(n log n), giving a total of O~(nsω1) arithmetic operations involving words of size O~(m). We provide a rigorous analysis of the accumulation of round-off errors similar to the analysis of recursive matrix multiplication based matrix inversion from.4

Motivated by this close connection with the condition number of Hankel matrices, we then try to initialize with Krylov spaces of low condition number. Here we show that a sufficiently small perturbation suffices for producing a well conditioned overall matrix. In fact, the first step of our proof, showing that a small sparse random perturbation to A guarantees good separations between its eigenvalues, is a direct combination of bounds on eigenvalue separation of random Gaussians13 as well as min eigenvalue of random sparse matrices.11 This separation then ensures that the powers of A, A1,A2,Am, are sufficiently distinguishable from each other. Such considerations also come up in the smoothed analysis of numerical algorithms.18

The randomness of the Krylov matrix induced by the initial set of random vectors B is more difficult to analyze: each column of B affects m columns of the overall Krylov space matrix. In contrast, all existing analyses of lower bounds of singular values of possibly asymmetric random matrices18,21 rely on the randomness in the columns of matrices being independent. The dependence between columns necessitates analyzing singular values of random linear combinations of matrices, which we handle by adapting ϵ-net based proofs of anti-concentration bounds. Here we encounter an additional challenge in bounding the minimum singular value of the block Krylov matrix. We resolve this issue algorithmically: instead of picking a Krylov space that spans the entire n, we stop short by picking ms=nO~(m) This set of extra columns significantly simplifies the proof of singular value lower bounds. This is similar in spirit to the analysis of the minimum singular value of a random matrix, which is easier for a non-square matrix.15 In the algorithm, the remaining columns are treated as a separate block that we handle via a Schur complement at the very end of the algorithm. Since this block is small, so is its overhead on the running time.

1.3 History and Related Work

Our algorithm has close connections with multiple lines of research on efficient solvers for sparse linear systems. The topic of efficiently solving linear systems has been extensively studied in computer science, applied mathematics and engineering. For example, in the Society of Industrial and Applied Mathematics News’ ‘top 10 algorithms of the 20th century’, three of them (Krylov space methods, matrix decompositions, and QR factorizations) are directly related to linear systems solvers.3

At a high level, our algorithm is a hybrid linear systems solver. It combines iterative methods, namely block Krylov space methods, with direct methods that factorize the resulting Gram matrix of the Krylov space. Hybrid methods have their origins in the incomplete Cholesky method for speeding up elimination/factorization based direct solvers. A main goal of these methods is to reduce the Ω(n2) space needed to represent matrix factorizations/inverses. This high space requirement is often even more problematic than time requirements when handling large sparse matrices. Such reductions can occur in two ways: either by directly dropping entries from the (intermediate) matrices, or by providing more succinct representations of these matrices using additional structure.

The main structure of our algorithm is based on the latter line of work on solvers for structured matrices. Such systems arise from physical processes where the interactions between objects have invariances (e.g., either by time or space differences). Examples of such structure include circulant matrices, Hankel/Toeplitz matrices and distances from n-body simulations.7 Many such algorithms require exact preservation of the structure in intermediate steps. As a result, many of these works develop algorithms over finite fields.

More recently, there has been work on developing numerically stable variants of these algorithms for structured matrices, or more generally, matrices that are numerically close to being structured.24 However, these results only explicitly discussed in the entry-wise Hankel/Toeplitz case (which corresponds to block size s=1). Furthermore, because they rely on domain-decomposition techniques similar to fast multiple methods, they produce one bit of precision for each outer iteration loop. As the Krylov space matrix has condition number exp(Ω(m)), such methods would lead to another factor of m in the solve cost if invoked directly.

Instead, our techniques for handling and bounding numerical errors are more closely related to recent developments in provably efficient sparse Cholesky factorizations.9 These methods generate efficient preconditioners using only the condition that intermediate steps of Gaussian elimination, known as Schur complements, have small representations. They avoid the explicit generation of the dense representations of Schur complements by treating them as operators, and apply randomized tools to directly sample/sketch the final succinct representations, which have much smaller size and algorithmic cost.

On the other hand, previous works on sparse Cholesky factorizations required the input matrix to be decomposable into a sum of simple elements, often through additional combinatorial structure of the matrices. In particular, this line of work on combinatorial preconditioning was initiated through a focus on graph Laplacians, which are built from 2-by-2 matrix blocks corresponding to edges of undirected graphs.19 Since then, there have been substantial generalizations to the structures amenable to such approaches, notably to finite element matrices, and directed graphs/irreversible Markov chains. However, recent works have also shown that many classes of structures involving more than two variables are complete for general linear systems.25 Nonetheless, the prevalence of approximation errors in such algorithms led to the development of new ways to bound numerical round-off errors in algorithms, which will be critical to our elimination routine for block-Hankel matrices.

Key to recent developments in combinatorial preconditioning is matrix concentration.22 Such bounds provide guarantees for (relative) eigenvalues of random sums of matrices. For generating preconditioners, such randomness arises from whether each element is kept or not, and a small condition number (which in turn implies a small number of outer iterations using the preconditioner) corresponds to a small deviation between the original and sampled matrices. In contrast, we introduce randomness in order to obtain block Krylov spaces whose minimum eigenvalue is large. As a result, the matrix tool we need is anti-concentration, which somewhat surprisingly is far less studied. Previous works on it are related to similar problems of numerical precision18,21 and address situations where the entries in the resulting matrix are independent. Our bound on the min singular value of the random Krylov space also yields a crude bound for a sum of rectangular random matrices.

1.4 Subsequent Improvements and Extensions

Nie14 gave a more general and tighter version of matrix anti-concentration that also works for square matrices, answering an open question we posed. For an m-step Krylov space instantiated using nm vectors, Nie’s bound reduces the middle term in our analysis from O(n2m3) to O(n2m), thus leading to a running time of nnz·n·m+nω·m2ω, which matches the bound for finite fields. Moreovver, it does so without the padding step at the end. We elect to keep for this article our epsilon-net based analyses, and the padded algorithm required for it, both for completeness, and due to it being a more elementary approach toward the problem with a simpler proof.

Faster matrix multiplication is an active area of work with recent progresses. Due to the dependence on fast multiplication in our algorithm, such improvements also lead to improvements in the running time of solving sparse systems as well.

Our main result for solving sparse linear systems has also been extended to solving sparse regression, with faster than matrix multiplication bounds for sufficiently sparse matrices.6 The complexity of sparse linear programming remains an interesting open problem.

2. Algorithm

We describe the algorithm, as well as the running times of its main components in this section. To simplify the discussion, we assume the input matrix A is symmetric, and has poly(n) condition number. If it is asymmetric (but invertible), we implicitly apply the algorithm to ATA, using the identity A1=(ATA)1AT derived from (ATA)1=A1AT. Also, recall from the discussion after Theorem 1 that we use O~(·) to hide logarithmic terms in order to simplify runtimes.

Before giving details of our algorithm, we first discuss what constitutes a linear system solver algorithm, specifically the equivalence between many such algorithms and linear operators.

For an algorithm ALG that takes a matrix B as input, we say that ALG is linear if there is a matrix ZALG such that for any input B, the output of running the algorithm on B is the same as multiplying B by ZALG:

A LG B = Z A LG B .

In this section, in particular in the pseudocode in Algorithm 2, we use the name of the procedure, SOLVEA(b,δ), interchangeably with the operator corresponding to a linear algorithm that solves a system in A, on vector b, to error δ>0. In the more formal analysis, we will denote such corresponding linear operators using the symbol Z, with subscripts corresponding to the routine if appropriate.

Figure 2.  Pseudocode for block Krylov space algorithm: Solve·(·,·) are operators corresponding to linear system solving algorithms whose formalization we discuss at the start of this section.

This operator/matrix based analysis of algorithms was first introduced in the analysis of a recursive Chebyshev iteration by Spielman and Teng,19 with credits to the technique also attributed to V. Rokhlin. It has the advantage of simplifying the analyis of multiple iterations of such algorithms, as we can directly measure Frobenius norm differences between such operators and the exact ones that they approximate.

Under this correspondence, the goal of producing an algorithm that solves Ax=b for any b as input becomes equivalent to producing a linear operator ZA that approximates A1, and then running it on the input b. For convenience, we also let the solver take as input a matrix instead of a vector, in which case the output is the result of solves against each of the columns of the input matrix as the RHS.

The high-level description of our algorithm is in Figure 2.

Some of the steps of the algorithm require care for efficiency as well as for tracking the number of words needed to represent the numbers. We assume a bound on bit complexity of O~(m) when κ=poly(n) in the brief description of costs in the outline of the steps below.

We start by perturbing the input matrix, resulting in a symmetric positive definite matrix where all eigenvalues are separated by αA. Then we explicitly form a Krylov matrix from a sparse random Gaussian matrix, see Fig. 3. For any vector u, we can compute Aiu from Ai1u via a single matrix-vector multiplication in A. So computing each column of K requires O(nnz(A)) operations, each involving a length n vector with words of length O~(m). So we get the matrix K, as well as AK, in time

O ˜ ( n n z ( A ) n m ) .
Figure 3.  Randomized m-step Krylov Space Matrix with n-by-s sparse Gaussian GS as starter.

To obtain a solver for AK, we instead solve its Gram matrix (AK)T(AK). Each block of KTK has the form (GS)TAiGS for some 2i2m, and can be computed by multiplying (GS)T and AiGS. As AiGS is an n-by-s matrix, each non-zero in GS leads to a cost of O(s) operations involving words of length O~(m). Then because we chose GS to have O~(m3) non-zeros per column, the total number of non-zeros in GS is about O~(s·m3)=O~(nm2). This leads to a total cost (across the m values of i) of:

O ˜ ( n 2 m 3 ) .

The key step is then Step 2, a block version of the Conjugate Gradient method. It will be implemented using a recursive data structure based on the notion of displacement rank.7 To get a sense of why a faster algorithm may be possible, note that there are only O(m) distinct blocks in the matrix (AK)T(AK). So a natural hope is to invert these blocks by themselves; the cost of (stable) matrix inversion,4 times the O~(m) numerical word complexity, would then give a total of

O ˜ ( m 2 s ω ) = O ˜ ( m 2 ( n m ) ω ) = O ˜ ( n ω m 2 ω ) .

Of course, it does not suffice to solve these m s-by-s blocks independently. Instead, the full algorithm, as well as the SOLVEM operator, is built from efficiently convolving such s-by-s blocks with matrices using Fast Fourier Transforms. Such ideas can be traced back to the development of super-fast solvers for (entry-wise) Hankel/Toeplitz matrices.7

Choosing s and m so that n=sm would then give the overall running time, assuming that we can bound the minimum singular value of K by exp(O~(m)). This is a shortcoming of our analysis: we can only prove such a bound when nsmΩ(m). The underlying reason is that rectangular semi-random matrices can be analyzed using ϵ-nets, and thus are significantly easier to analyze than square matrices.

This means we can only use m and s such that nms=Θ(m), and we need to pad K with nms columns to guarantee a full rank, invertible, matrix. To this end, we add Θ(m) dense Gaussian columns to K to form Q, and solve the system AQ, and its associated Gram matrix (AQ)T(AQ) instead. These matrices are shown in Figure 4.

Figure 4.  Full matrix AQ and its Associated Gram Matrix (AQ)T(AQ). Note that by our choice of parameters m is much smaller than sn/m.

Since these additional columns are entry-wise i.i.d, the minimum singular value can be analyzed using existing tools18,21 namely lower bounding the inner product of a random vector against any vector. Thus, we can lower bound the minimum singular value of Q, and in turn AQ, by exp(O~(m)).

This bound in turn translates to a lower bound on the minimum eigenvalue of the Gram matrix of AQ, (AQ)T(AQ). Partitioning its entries by those from K and G gives four blocks: one (sm)-by-(sm) block corresponding to (AK)T(AK), one Θ(m)-by-Θ(m) block corresponding to (AG)T(AG), and then the cross terms. To solve this matrix, we apply block-Gaussian elimination, or equivalently, form the Schur complement onto the Θ(m)-by-Θ(m) corresponding to the columns in AG.

To compute this Schur complement, it suffices to solve the top-left block (corresponding to (AK)T(AK)) against every column in the cross term. As there are at most Θ(m)<s columns, this solve cost comes out to less than O~(sωm) as well. We are then left with a Θ(m)-by-Θ(m) matrix, whose solve cost is a lower order term.

So the final solver costs

O ˜ ( n n z ( A ) n m + n 2 m 3 + n ω m 2 ω )

which leads to the final running time by choosing m to balance the terms. This bound falls short of the ideal case given in Equation 1 mainly due to the need for a denser B to the well-conditionedness of the Krylov space matrix. Instead of O(n) non-zeros total, or about O(m) per column, we need poly(m) non-zero variables per column to ensure the an exp(O(m)) condition number of the block Krylov space matrix K. This in turn leads to a total cost of O(n·nnz·poly(m)) for computing the blocks of the Hankel matrix, and a worse trade off when summed against the nωmω2 term.

3. Acknowledgments

Richard Peng was supported in part by NSF CAREER award 1846218/2330255, and Santosh Vempala by NSF awards AF-1909756 and AF-2007443. We thank Mark Giesbrecht for bringing to our attention the literature on block-Krylov space algorithms.


    • 1. Beckermann, B. and Labahn, G. A uniform approach for the fast computation of matrix-type Padé approximants. SIAM J. on Matrix Analysis and Applications 15, 3 (1994), 804823.
    • 2. Blum, L. Computing over the reals: Where Turing meets Newton. Notices of the AMS 51, 9 (2004), 10241034.
    • 3. Cipra, B.A. The best of the 20th century: Editors name top 10 algorithms. SIAM News 33, 4 (2000), 12;
    • 4. Demmel, J., Dumitriu, I., Holtz, O., and Kleinberg, R. Fast matrix multiplication is stable. Numerische Mathematik 106, 2 (2007), 199224.
    • 5. Eberly, W. et al. Faster inversion and other black box matrix computations using efficient block projections. In Symbolic and Algebraic Computation, Intern. Symp., ISSAC 2007, Waterloo, Ontario, Canada, July 28 - August 1, 2007, Proceedings , 2007, 143150.
    • 6. Ghadiri, M., Peng, R., and Vempala, S.S. The bit complexity of efficient continuous optimization. In Proceedings of the 64th Symp. on Foundations of Computer Science, 2023;
    • 7. Gray, R.M. Toeplitz and Circulant Matrices: A Rev. Now Publishers Inc, 2006.
    • 8. Koutis, I., Miller, G.L., and Peng, R. A fast solver for a class of linear systems. Communications of the ACM 55, 10 (Oct. 2012), 99107.
    • 9. Kyng, R. Approximate Gaussian Elimination. PhD thesis, Yale University, 2017;
    • 10. Kyng, R., Wang, D., and Zhang, P. Packing LPs are hard to solve accurately, assuming linear equations are hard. Proceedings of the 2020 ACM-SIAM Symp. on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020. Chawla, S. (ed). SIAM, 2020, 279296;
    • 11. Luh, K. and Vu, V. Sparse random matrices have simple spectrum. arXiv preprint arXiv:1802.03662, 2018.
    • 12. Musco, C., Musco, C., and Sidford, A. Stability of the Lanczos method for matrix function approximation. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symp. on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, 2018, 16051624;
    • 13. Nguyen, H., Tao, T., and Vu, V. Random matrices: Tail bounds for gaps between eigenvalues. Probability Theory and Related Fields 167, 3-4, (2017), 777816.
    • 14. Nie, Z. Matrix anti-concentration inequalities with applications. In STOC ’22: 54th Annual ACM SIGACT Symp. on Theory of Computing, Rome, Italy, June 20 - 24, 2022 . Leonardi, S. and Gupta, A. (eds). ACM, (2022), 568581;
    • 15. Rudelson, M. and Vershynin, R. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the Intern. Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures. World Scientific, 2010, 15761602
    • 16. Saad, Y. Iterative Methods for Sparse Linear Systems, 2nd edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003;
    • 17. Sachdeva, S. and Vishnoi, N.K. Faster algorithms via approximation theory. Theoretical Computer Science 9, 2 (2013), 125210.
    • 18. Sankar, A., Spielman, D.A., and Teng, S.-H. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. on Matrix Analysis and Applications 28, 2 (2006), 446476.
    • 19. Spielman, D. and Teng, S. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM J. on Matrix Analysis and Applications 35, 3 (2014), 835885;
    • 20. Storjohann, A. The shifted number system for fast linear algebra on integer matrices. J. of Complexity 21, 4 (2005), 609650;
    • 21. Tao, T. and Vu, V.H. Smooth analysis of the condition number and the least singular value. Math. Comput. 79, 272 (2010), 23332352;
    • 22. Tropp, J.A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8, 1-2, (2015), 1230.
    • 23. Woodruff, D.P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10, 1-2, (2014), 1157;
    • 24. Xi, Y., Xia, J., Cauley, S., and Balakrishnan, V. Superfast and stable structured solvers for toeplitz least squares via randomized sampling. SIAM J. on Matrix Analysis and Applications 35, 1 (2014), 4472.
    • 25. Zhang, P. Hardness and Tractability For Structured Numerical Problems. PhD thesis, Georgia Institute of Technology, 2018.
    • Currently, ω<2.371866.
    • We will be measuring bit-complexity under fixed-point arithmetic. Here the machine word size is on the order of the maximum number of digits of precision in A, and the total cost is measured by the number of word operations. The need to account for bit-complexity of the numbers naturally led to the notion of condition number.2 The logarithm of the condition number measures the additional number of words needed to store A-1 (and thus A-1b) compared to A. In particular, matrices with poly(n) condition number can be stored with a constant factor overhead in precision, and are numerically stable under standard floating point number representations.
    • The hardness results in Kyng et al.10 were based on SLTH1.51.99 under the Real RAM model in part due to the uncertain status of conjugate gradient in different models of computation.
    • Rounding up two decimals places in the exponent.

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More