Abstract
Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special case of graph-structured linear systems, in the general setting the bit complexity of solving an n × n linear system Ax = b is Õ(n^ω), where ω < 2.372864 is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly(n) condition number.
In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any ω > 2. This speedup holds for any input matrix A with o(n^(ω−1)/log(κ(A))) non-zeros, where κ(A) is the condition number of A. For poly(n)-conditioned matrices with Õ(n) nonzeros, and the current value of ω, the bit complexity of our algorithm to solve to within any 1/poly(n) error is O(n^2.331645).
Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC ’06 ’07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices.
Introduction
Solving a linear system is a basic algorithmic problem with direct applications to scientific computing, engineering, and physics, and is at the core of algorithms for many other problems, including optimization, data science, and computational geometry. It has enjoyed an array of elegant approaches, from Cramer’s rule and Gaussian elimination to numerically stable iterative methods to more modern randomized variants based on random sampling8,19 and sketching.23 Despite much recent progress on faster solvers for graph-structured linear systems,8,9,19 progress on the general case has been elusive.
Most of the work in obtaining better running time bounds for linear systems solvers has focused on efficiently computing the inverse of A, or some factorization of it. Such operations are in turn closely related to the cost of matrix multiplication. Matrix inversion can be reduced to matrix multiplication via divide-and-conquer, and this reduction was shown to be stable when the word size for representing numbers is increased by a factor of O(log n).4 The current best runtime of about O(n^ω) follows a long line of work on faster matrix multiplication algorithms, and is also the current best running time for solving Ax = b: when the input matrix/vector are integers, matrix multiplication based algorithms can obtain the exact rational solution using Õ(n^ω) word operations.20
Methods for matrix inversion or factorization are often referred to as direct methods in the linear systems literature. This is in contrast to iterative methods, which gradually converge to the solution. Iterative methods have low space overhead, and therefore are widely used for solving large, sparse linear systems that arise in scientific computing. Another reason for their popularity is that iterative methods are naturally suited to producing approximate solutions of desired accuracy in floating point arithmetic, the de facto method for representing real numbers. Perhaps the most famous iterative method is the Conjugate Gradient (CG) / Lanczos algorithm.16 It was introduced as an O(n · nnz) time algorithm under exact arithmetic, where nnz is the number of non-zeros in the input matrix. However, this bound only holds under the Real RAM model, where words have unbounded precision. When taking bit sizes into account, it incurs an additional factor of n. Despite much progress in iterative techniques in the intervening decades, obtaining gains over matrix multiplication in the presence of round-off errors has remained an open question.
The convergence and stability of iterative methods typically depend on some condition number of the input. When all intermediate steps are carried out using precision close to the condition number of A, the running time bounds of the CG algorithm, as well as other currently known iterative methods, depend polynomially on the condition number of the input matrix A. Formally, the condition number of a symmetric matrix A, κ(A), is the ratio between the maximum and minimum eigenvalues of A. Here the best known rate of convergence when all intermediate operations are restricted to Õ(1) bit-complexity is O(√κ(A) · log(1/ε)) iterations to achieve error ε. This is known to be tight if one restricts to matrix-vector multiplications in the intermediate steps.12,17 This means for moderately conditioned (e.g., with κ(A) = poly(n)), sparse systems, the best runtime bounds are still via direct methods, which are stable when Õ(1) words of precision are maintained in intermediate steps.4
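For reference, the textbook convergence guarantee for CG on a symmetric positive definite system, stated here only for concreteness (this is the standard bound, not a result specific to this work), is:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}, \qquad \|x_t - x^{\ast}\|_{A} \le \epsilon\,\|x_0 - x^{\ast}\|_{A} \quad \text{after } t = O\!\left(\sqrt{\kappa(A)}\,\log(1/\epsilon)\right) \text{ iterations.}$$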
Many of the algorithms used in practice in scientific computing for solving linear systems involving large, sparse matrices are based on combining direct and iterative methods: we will briefly discuss these perspectives in Section 1.3. In terms of asymptotic complexity, the practical successes of many such methods naturally lead to the question of whether one can provably do better than the time corresponding to the faster of direct or iterative methods. Somewhat surprisingly, despite the central role of this question in scientific computing and numerical analysis, as well as extensive studies of linear systems solvers, progress on it has been elusive. The continued lack of progress on this question has led to its use as a hardness assumption for showing conditional lower bounds for numerical primitives such as linear elasticity problems25 and positive linear programs.10 One formalization of such hardness is the Sparse Linear Equation Time Hypothesis (SLTH) from:10 it denotes the assumption that a sparse linear system with polynomially bounded condition number cannot be solved, to within inverse-polynomial relative error, in time asymptotically faster than the better of the direct and iterative bounds. Here, improving over the smaller running time of both direct and iterative methods can be succinctly encapsulated as refuting the SLTH.
We provide a faster algorithm for solving sparse linear systems. Our formal result is the following (we use the form defined in:10 Linear Equation Approximation Problem, LEA).
Given a matrix A with maximum dimension n and nnz(A) non-zeros (whose values fit into a single word), along with a parameter κ such that κ(A) ≤ κ, a vector b, and an error requirement ε > 0, we can compute, under fixed point arithmetic, in time
a vector x such that
||Ax − Π_A b|| ≤ ε · ||Π_A b||,
where c is a fixed constant and Π_A is the projection operator onto the column space of A.
Note that Π_A = A(A^T A)^+ A^T, and when A is square and full rank, it is just the identity I.
The cross-over point for the two bounds is at a moderately super-linear number of non-zeros. In particular, for the sparse case with nnz(A) = O(n), and the current bound of ω < 2.372864, we get an exponent of (5ω − 4)/(ω + 1) ≈ 2.331645.
As , this also translates to a running time of , which as , refutes for constant values of and any value of .
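Plugging the current bound on ω into the sparse-case exponent (5ω − 4)/(ω + 1) recovers the figure quoted in the abstract:

$$\frac{5\omega - 4}{\omega + 1}\,\Big|_{\,\omega = 2.372864} \;=\; \frac{7.864320}{3.372864} \;\approx\; 2.331645 .$$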
We can parameterize the asymptotic gain over matrix multiplication for moderately sparse instances. Here we use the Õ(·) notation to hide lower-order terms; specifically, Õ(f(n)) denotes O(f(n) · log^c(f(n))) for some absolute constant c.
For any matrix A with dimension at most n, nnz(A) non-zeros, and condition number κ(A), a linear system in A can be solved to accuracy ε in time that is asymptotically smaller than matrix multiplication time whenever nnz(A) = o(n^(ω−1)/log(κ(A))).
Here the cross-over point happens at a moderately super-linear number of non-zeros. Also, because the sparse-case exponent (5ω − 4)/(ω + 1) is strictly less than ω for any ω > 2, we can also infer that for any ω > 2 and any nnz(A) = o(n^(ω−1)/log(κ(A))), the runtime is o(n^ω), or asymptotically faster than matrix multiplication.
Idea
At a high level, our algorithm follows the block Krylov space method (see e.g., Chapter 6.12 of Saad16). This method is a multi-vector extension of the CG/Lanczos method, which in the single-vector setting is known to be problematic under round-off errors both in theory12 and in practice.16 Our algorithm starts with a set of s initial vectors, G = [g_1, ..., g_s], and forms a column space by multiplying these vectors by A repeatedly, up to m − 1 times. Formally, the block Krylov space matrix is
K = [G, AG, A²G, ..., A^(m−1)G].
The core idea of Krylov space methods is to efficiently orthogonalize this column space. For this space to be spanning, block Krylov space methods typically choose s and m so that s · m ≈ n.
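The construction of K itself is just a sequence of sparse matrix-block products. A minimal sketch (with illustrative sizes, and a dense Gaussian G for readability; the algorithm itself uses a sparse random G and a perturbed A):

```python
# Minimal sketch of forming the block Krylov matrix K = [G, AG, ..., A^{m-1}G].
# Dimensions (n, s, m) and the choice of G are illustrative only.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

n, s, m = 64, 8, 8                        # here s * m = n, so K is square
A = sp.random(n, n, density=0.05, random_state=0, format="csr")
A = A + A.T + n * sp.identity(n)          # symmetrize and make well conditioned

G = rng.standard_normal((n, s))           # dense Gaussian block for simplicity

blocks = [G]
for _ in range(m - 1):
    blocks.append(A @ blocks[-1])         # each step is one sparse matrix-block product
K = np.hstack(blocks)                     # n-by-(s*m) block Krylov matrix
print(K.shape)
```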
The conjugate gradient algorithm can be viewed as an efficient implementation of the case s = 1, m = n, with g set to b, the RHS of the input linear system. The block case with larger values of s was studied by Eberly, Giesbrecht, Giorgi, Storjohann, and Villard5 over finite fields, and they gave an Õ(n^2.28) time algorithm for computing the inverse of an O(n)-sparse matrix over a finite field.
Our algorithm also leverages the top-level insight of the Eberly et al. results: the Gram matrix of the Krylov space matrix, K^T K, is a block Hankel matrix (for symmetric A), and so is the closely related matrix K^T A K. Solving linear systems in this block Hankel matrix leads to solvers for linear systems in A because
(K^T A K)^(−1) = K^(−1) A^(−1) K^(−T),
so as long as A and K are both invertible, composing this on the left by K and on the right by K^T gives
K (K^T A K)^(−1) K^T = A^(−1).
Eberly et al. viewed the Gram matrix as an m-by-m matrix containing s-by-s sized blocks, and critically leveraged the fact that the blocks along each anti-diagonal are identical:
Formally, the s-by-s inner product matrix formed from A^i G and A^j G is (A^i G)^T (A^j G) = G^T A^(i+j) G, and depends only on i + j. So instead of m² blocks each of size s-by-s, we are able to represent an n-by-n matrix with only about 2m distinct blocks.
Operations involving these blocks of the Hankel matrix can be handled using block operations. This is perhaps easiest seen for computing matrix-vector products against the Hankel matrix. If we use H_k to denote the kth distinct block of the Hankel matrix, and define the block partition
X = [X_1; X_2; ...; X_m]
for a sequence of matrices X_1, ..., X_m, we get that the ith block of the product can be written in block-form as
(HX)_i = Σ_j H_(i+j) X_j.
Note this is precisely the convolution of (a sub-interval of) the sequence of blocks {H_k} with the sequence {X_j}, with shifts indicated by i. Therefore, in matrix-vector multiplication (the “forward” direction), a speedup by a factor of about m is possible with fast convolution algorithms. The performance gains of the Eberly et al. algorithms5 can be viewed as being of a similar nature, albeit in the more difficult direction of solving linear systems. Specifically, they utilize algorithms for the Padé problem of computing a polynomial from the result of its convolution.1 Over finite fields, or under exact arithmetic, such algorithms for matrix Padé problems take Õ(m) block operations,1 for a total of Õ(s^ω · m) operations.
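To make the block-convolution view concrete, here is a small check (with made-up sizes) that applying a block Hankel matrix to a block vector is exactly the block convolution written above; the fast version would evaluate this convolution with FFTs instead of the direct double loop:

```python
# Block Hankel matrix-vector product as a block convolution.
import numpy as np

rng = np.random.default_rng(1)
s, m = 3, 4
H = [rng.standard_normal((s, s)) for _ in range(2 * m - 1)]    # blocks H_0, ..., H_{2m-2}

# Dense block Hankel matrix: block (i, j) equals H_{i+j}.
M = np.block([[H[i + j] for j in range(m)] for i in range(m)])

X = rng.standard_normal((m * s, 2))                             # a block vector with m blocks
X_blocks = [X[j * s:(j + 1) * s] for j in range(m)]

# Block convolution: the i-th output block is sum_j H_{i+j} X_j.
Y_blocks = [sum(H[i + j] @ X_blocks[j] for j in range(m)) for i in range(m)]
Y = np.vstack(Y_blocks)

assert np.allclose(M @ X, Y)
```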
The overall time complexity is based on two opposing goals:
Quickly generate the Krylov space: repeated multiplication by A allows us to generate K using O(nnz(A) · n) arithmetic operations. Choosing a sparse G then allows us to compute the Hankel blocks G^T A^i G in O(n²) further arithmetic operations, for a total overhead of O(nnz(A) · n).
Quickly invert the Hankel matrix. Each operation on an s-by-s block takes O(s^ω) time. Under the optimistic assumption of Õ(m) block operations, the total is Õ((n/m)^ω · m).
Under these assumptions, and the requirement of s · m ≈ n, the total cost becomes about nnz(A) · n + (n/m)^ω · m, which is less than n^ω as long as ω > 2 and nnz(A) ≤ n^(ω−1). However, this runtime complexity is over finite fields, where numerical stability is not an issue. Over the reals, under round-off errors, one must contend with numerical errors without blowing up the bit complexity. This is a formidable challenge; indeed, as mentioned earlier, with exact arithmetic the CG method takes time O(n · nnz), but this is misleading since the computation is effective only when the word sizes are increased by a factor of n (to about n words), which leads to an overall complexity of O(n² · nnz).
Our Contributions
Our algorithm can be viewed as the numerical generalization of the algorithms from.5 We work with real numbers of bounded precision, instead of entries over a finite field. The core of our approach can be summarized as follows.
Doing so requires separately developing tools for two topics that have been extensively studied in mathematics:
Obtain low numerical cost solvers for block Hankel/Toeplitz matrices. Many of the prior algorithms rely on algebraic identities that do not generalize to the block setting, and are often (experimentally) numerically unstable.7
Develop matrix anti-concentration bounds for analyzing the word lengths of inverses of random Krylov spaces. This is to upper bound the probability of random matrices being in some set of small measure, which in our case is the set of nearly singular matrices. Previously, such bounds were known assuming the matrix entries are independent18,21 but Krylov matrices have correlated columns.
Before we describe the difficulties and new tools needed, we first provide some intuition on why a factor m increase in word lengths may be the right answer, by upper-bounding the magnitudes of entries in an m-step Krylov space. By rescaling, we may assume that the minimum singular value of A is at least 1/κ(A), and the maximum entry in A is at most 1. The maximum magnitude of (entries of) A^m is bounded by the maximum magnitude of A to the power of m, times a factor corresponding to the number of summands in the matrix product:
max_ij |(A^m)_ij| ≤ n^(m−1) · (max_ij |A_ij|)^m ≤ n^m.
Here the last inequality is via the assumption that the maximum entry of A is at most 1. So by forming K via horizontally concatenating the blocks A^i G for sparse Gaussian columns G = [g_1, ..., g_s], we have with high probability that the maximum magnitude of an entry of K, and in turn of its Gram matrix, is at most n^O(m). In other words, O(m) words in front of the decimal point are sufficient with high probability.
Should such a bound of O(m) words hold for all numbers that arise in the algorithm, including the matrix inversion steps, and the matrix A is sparse with nnz(A) entries, the cost of computing the block-Krylov matrices becomes about nnz(A) · n · m, while the cost of the matrix inversion portion encounters an overhead of m, for a total of about nnz(A) · n · m + (n/m)^ω · m². In the sparse case of nnz(A) = O(n), and an appropriate choice of m, this becomes:
Õ(n² · m + n^ω / m^(ω−2)).     (1)
Due to the gap between n² and n^ω, setting m appropriately gives an improvement over n^ω when ω > 2.
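Balancing the two terms of Equation 1 indicates where this idealized bound lands (a back-of-the-envelope calculation under the same assumptions, not a formal statement):

$$n^{2}m + \frac{n^{\omega}}{m^{\omega-2}} \quad\text{is minimized at } m = n^{\frac{\omega-2}{\omega-1}}, \text{ giving } n^{\,2+\frac{\omega-2}{\omega-1}} = n^{\frac{3\omega-4}{\omega-1}} \approx n^{2.2716}.$$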
However, the magnitude of an entry in the inverse depends on the smallest magnitude of the quantity being inverted, or, in the matrix case, on its minimum singular value. Bounding and propagating the minimum singular value, which intuitively corresponds to how close a matrix is to being degenerate, represents our main challenge. In exact/finite field settings, non-degeneracies are certified via the Schwartz-Zippel lemma about polynomial roots. The numerical analog of this is more difficult: the Krylov space matrix K is asymmetric, even for a symmetric matrix A. It is much easier for an asymmetric matrix with correlated entries to be close to singular.
Consider for example a two-banded, two-block matrix with all diagonal entries set to the same random variable α (see Figure 1).
In the exact case, this matrix is full rank unless α = 0, even over finite fields. On the other hand, its minimum singular value is close to 0 for all values of α. To see this, it is useful to first make the following observation about the minimum singular value of one of the blocks.
The minimum singular value of a matrix with 1s on the diagonal, α on the entries immediately below the diagonal, and 0 everywhere else is at most |α|^(−(n−1)), due to the test vector (1, −α, α², ..., (−α)^(n−1)).
Then, for α in one range of values, the top-left block has an exponentially small minimum singular value by the observation above. On the other hand, rescaling the bottom-right block by 1/α to get 1s on the diagonal gives a rescaled off-diagonal entry; for α in the complementary range of values, this entry is bounded away from 1, which by the same observation implies an exponentially small minimum singular value in the bottom-right block. This means no matter what value α is set to, this matrix will always have a singular value that is exponentially close to 0. Furthermore, the Gram matrix of this matrix also gives such a counterexample for symmetric matrices with (non-linearly) correlated entries. Previous works on analyzing condition numbers of asymmetric matrices encounter similar difficulties; a more detailed discussion can be found in Section 7 of Sankar et al.18
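The observation above is easy to verify numerically; the following small check (an arbitrary size and value of α, chosen by us for illustration) compares the smallest singular value against the bound given by the test vector:

```python
# A bidiagonal matrix with 1s on the diagonal and alpha just below it has
# smallest singular value at most |alpha|^{-(n-1)}, witnessed by the
# geometric test vector x = (1, -alpha, alpha^2, ...).
import numpy as np

n, alpha = 30, 2.0
M = np.eye(n) + alpha * np.eye(n, k=-1)

x = (-alpha) ** np.arange(n)                       # geometric test vector
ratio = np.linalg.norm(M @ x) / np.linalg.norm(x)  # an upper bound on sigma_min

sigma_min = np.linalg.svd(M, compute_uv=False)[-1]
print(sigma_min, ratio, alpha ** -(n - 1))
# sigma_min <= ratio <= alpha^{-(n-1)}: all three are on the order of 2^{-29} here.
```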
In order to bound the bit complexity of all intermediate steps of the block Krylov algorithm by Õ(m) words per number, we devise a more numerically stable algorithm for solving block Hankel matrices, as well as provide a new perturbation scheme to quickly generate a well-conditioned block Krylov space. Central to both of our key components is the close connection between condition number and bit complexity bounds.
First, we give a more numerically stable solver for block Hankel/Toeplitz matrices. Fast solvers for Hankel (and closely related Toeplitz) matrices have been extensively studied in numerical analysis, with several recent developments on more stable algorithms.24 However, the notion of numerical stability studied in these algorithms is the variant where the number of bits of precision is fixed. Our attempts at converting these into asymptotic bounds yielded dependencies quadratic in the number of digits in the condition number, which in our setting translates to a prohibitive overhead of about m² (i.e., the overall cost would be higher than n^ω).
Instead, we combine developments in recursive block Gaussian elimination4 with the low displacement rank representation of Hankel/Toeplitz matrices.7 Such representations allow us to implicitly express both the Hankel matrix and its inverse by displaced versions of low-rank matrices. This means the intermediate instances arising from the recursion have size roughly the block size s times their dimension, for a total size of about n · s, giving a total of about (n/m)^ω · m arithmetic operations involving words of size Õ(m). We provide a rigorous analysis of the accumulation of round-off errors similar to the analysis of recursive matrix multiplication based matrix inversion from.4
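As an illustration of the structure these representations exploit, here is the standard low displacement rank fact for an entry-wise Toeplitz matrix (block Hankel/Toeplitz matrices behave analogously, with rank proportional to the block size); this is a generic illustration, not code from the paper:

```python
# For a Toeplitz matrix T and the down-shift matrix Z, the displacement
# T - Z T Z^T has rank at most 2, so T is determined by a rank-2 "displaced" matrix.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
n = 8
T = toeplitz(rng.standard_normal(n), rng.standard_normal(n))  # T[i, j] depends only on i - j

Z = np.eye(n, k=-1)                     # down-shift: (Zx)_i = x_{i-1}
D = T - Z @ T @ Z.T                     # the displacement of T

print(np.linalg.matrix_rank(D))         # prints 2 (with probability 1 over the random entries)
```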
Motivated by this close connection with the condition number of Hankel matrices, we then try to initialize with Krylov spaces of low condition number. Here we show that a sufficiently small perturbation suffices for producing a well-conditioned overall matrix. In fact, the first step of our proof, showing that a small sparse random perturbation to A guarantees good separations between its eigenvalues, is a direct combination of bounds on eigenvalue separation of random Gaussians13 as well as the minimum eigenvalue of random sparse matrices.11 This separation then ensures that the powers of A, namely A, A², ..., A^(m−1), are sufficiently distinguishable from each other. Such considerations also come up in the smoothed analysis of numerical algorithms.18
The randomness of the Krylov matrix induced by the initial set of random vectors is more difficult to analyze: each column of G affects m columns of the overall Krylov space matrix K. In contrast, all existing analyses of lower bounds on singular values of possibly asymmetric random matrices18,21 rely on the randomness in the columns of the matrices being independent. The dependence between columns necessitates analyzing singular values of random linear combinations of matrices, which we handle by adapting ε-net based proofs of anti-concentration bounds. Here we encounter an additional challenge in bounding the minimum singular value of the block Krylov matrix. We resolve this issue algorithmically: instead of picking a Krylov space that spans all of the space, we stop short by picking s · m slightly smaller than n. The resulting set of extra columns significantly simplifies the proof of singular value lower bounds. This is similar in spirit to the analysis of the minimum singular value of a random matrix, which is easier for a non-square matrix.15 In the algorithm, the remaining columns are treated as a separate block that we handle via a Schur complement at the very end of the algorithm. Since this block is small, so is its overhead on the running time.
History and Related Work
Our algorithm has close connections with multiple lines of research on efficient solvers for sparse linear systems. The topic of efficiently solving linear systems has been extensively studied in computer science, applied mathematics and engineering. For example, in the Society for Industrial and Applied Mathematics News’ ‘top 10 algorithms of the 20th century’, three of them (Krylov space methods, matrix decompositions, and QR factorizations) are directly related to linear systems solvers.3
At a high level, our algorithm is a hybrid linear systems solver. It combines iterative methods, namely block Krylov space methods, with direct methods that factorize the resulting Gram matrix of the Krylov space. Hybrid methods have their origins in the incomplete Cholesky method for speeding up elimination/factorization based direct solvers. A main goal of these methods is to reduce the space needed to represent matrix factorizations/inverses. This high space requirement is often even more problematic than time requirements when handling large sparse matrices. Such reductions can occur in two ways: either by directly dropping entries from the (intermediate) matrices, or by providing more succinct representations of these matrices using additional structure.
The main structure of our algorithm is based on the latter line of work on solvers for structured matrices. Such systems arise from physical processes where the interactions between objects have invariances (e.g., either by time or space differences). Examples of such structure include circulant matrices, Hankel/Toeplitz matrices, and distances from n-body simulations.7 Many such algorithms require exact preservation of the structure in intermediate steps. As a result, many of these works develop algorithms over finite fields.
More recently, there has been work on developing numerically stable variants of these algorithms for structured matrices, or more generally, matrices that are numerically close to being structured.24 However, these results are only stated explicitly for the entry-wise Hankel/Toeplitz case (which corresponds to block size s = 1). Furthermore, because they rely on domain-decomposition techniques similar to fast multipole methods, they produce one bit of precision per outer iteration loop. As the Krylov space matrix has condition number exponential in m, such methods would lead to another factor of roughly m in the solve cost if invoked directly.
Instead, our techniques for handling and bounding numerical errors are more closely related to recent developments in provably efficient sparse Cholesky factorizations.9 These methods generate efficient preconditioners using only the condition that intermediate steps of Gaussian elimination, known as Schur complements, have small representations. They avoid the explicit generation of the dense representations of Schur complements by treating them as operators, and apply randomized tools to directly sample/sketch the final succinct representations, which have much smaller size and algorithmic cost.
On the other hand, previous works on sparse Cholesky factorizations required the input matrix to be decomposable into a sum of simple elements, often through additional combinatorial structure of the matrices. In particular, this line of work on combinatorial preconditioning was initiated through a focus on graph Laplacians, which are built from 2-by-2 matrix blocks corresponding to edges of undirected graphs.19 Since then, there have been substantial generalizations to the structures amenable to such approaches, notably to finite element matrices, and directed graphs/irreversible Markov chains. However, recent works have also shown that many classes of structures involving more than two variables are complete for general linear systems.25 Nonetheless, the prevalence of approximation errors in such algorithms led to the development of new ways to bound numerical round-off errors in algorithms, which will be critical to our elimination routine for block-Hankel matrices.
Key to recent developments in combinatorial preconditioning is matrix concentration.22 Such bounds provide guarantees for (relative) eigenvalues of random sums of matrices. For generating preconditioners, such randomness arises from whether each element is kept or not, and a small condition number (which in turn implies a small number of outer iterations using the preconditioner) corresponds to a small deviation between the original and sampled matrices. In contrast, we introduce randomness in order to obtain block Krylov spaces whose minimum eigenvalue is large. As a result, the matrix tool we need is anti-concentration, which somewhat surprisingly is far less studied. Previous works on it are related to similar problems of numerical precision18,21 and address situations where the entries in the resulting matrix are independent. Our bound on the min singular value of the random Krylov space also yields a crude bound for a sum of rectangular random matrices.
Subsequent Improvements and Extensions
Nie14 gave a more general and tighter version of matrix anti-concentration that also works for square matrices, answering an open question we posed. For an m-step Krylov space instantiated using s vectors, Nie’s bound reduces the middle term in our analysis, leading to a running time that matches the corresponding bound for finite fields. Moreover, it does so without the padding step at the end. We elect to keep our epsilon-net based analyses, and the padded algorithm required for them, in this article, both for completeness and because they are a more elementary approach to the problem with a simpler proof.
Faster matrix multiplication is an active area of research with recent progress. Due to the dependence on fast matrix multiplication in our algorithm, such improvements lead to improvements in the running time for solving sparse linear systems as well.
Our main result for solving sparse linear systems has also been extended to solving sparse regression, with faster than matrix multiplication bounds for sufficiently sparse matrices.6 The complexity of sparse linear programming remains an interesting open problem.
Algorithm
We describe the algorithm, as well as the running times of its main components, in this section. To simplify the discussion, we assume the input matrix A is symmetric and has bounded condition number κ(A). If it is asymmetric (but invertible), we implicitly apply the algorithm to A^T A, using the identity A^(−1) = (A^T A)^(−1) A^T, which holds whenever A is invertible. Also, recall from the discussion after Theorem 1 that we use Õ(·) to hide logarithmic terms in order to simplify runtimes.
Before giving details of our algorithm, we first discuss what constitutes a linear system solver algorithm, specifically the equivalence between many such algorithms and linear operators.
For an algorithm that takes a matrix A and a vector b as input, we say that the algorithm is linear if there is a matrix Z such that, for any input b, the output of running the algorithm on b is the same as multiplying b by Z.
In this section, in particular in the pseudocode in Algorithm 2, we use the name of the procedure interchangeably with the operator corresponding to a linear algorithm that solves a system in A, on vector b, to error ε. In the more formal analysis, we will denote such corresponding linear operators using the symbol Z, with subscripts corresponding to the routine if appropriate.
This operator/matrix based analysis of algorithms was first introduced in the analysis of a recursive Chebyshev iteration by Spielman and Teng,19 with credit for the technique also attributed to V. Rokhlin. It has the advantage of simplifying the analysis of multiple iterations of such algorithms, as we can directly measure Frobenius norm differences between such operators and the exact ones that they approximate.
Under this correspondence, the goal of producing an algorithm that solves Ax = b for any b given as input becomes equivalent to producing a linear operator Z that approximates A^(−1), and then running it on the input b. For convenience, we also let the solver take a matrix instead of a vector as input, in which case the output is the result of solving against each of the columns of the input matrix as the RHS.
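As a toy illustration of this operator view (entirely our own example, not the algorithm of this paper): a linear solver is simply multiplication by a fixed matrix Z, and its quality can be measured by how close Z is to A^(−1), independently of any particular right-hand side.

```python
# A "linear" solver is a fixed operator Z; running it on b is just Z @ b.
import numpy as np

rng = np.random.default_rng(5)
n = 20
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)                       # a symmetric positive definite test matrix

Ainv = np.linalg.inv(A)
Z = Ainv + 1e-8 * rng.standard_normal((n, n))     # an approximate inverse operator

b = rng.standard_normal(n)
x = Z @ b                                         # "running the algorithm" on b
print(np.linalg.norm(A @ x - b), np.linalg.norm(Z - Ainv))
```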
The high-level description of our algorithm is in Figure 2.
Some of the steps of the algorithm require care for efficiency, as well as for tracking the number of words needed to represent the numbers. We assume a bound of Õ(m) words on the bit complexity of the numbers involved in the brief description of costs in the outline of the steps below.
We start by perturbing the input matrix, resulting in a symmetric positive definite matrix where all eigenvalues are separated by at least an inverse-polynomial amount. Then we explicitly form a Krylov matrix K from a sparse random Gaussian matrix G, see Fig. 3. For any vector x, we can compute Ax from x via a single matrix-vector multiplication. So computing each column of K requires O(nnz(A)) operations, each involving length-n vectors with words of length Õ(m). So we get the matrix K, as well as AK, in time
Õ(nnz(A) · n · m).
To obtain a solver for A, we instead solve the block Hankel matrix described above. Each of its blocks has the form G^T A^i G for some i, and can be computed by multiplying G^T and A^i G. As A^i G is an n-by-s matrix, each non-zero in G leads to a cost of s operations involving words of length Õ(m). Then, because we chose G to be sparse, the total number of non-zeros in G is small. This leads to a total cost (across the O(m) values of i) of about nnz(G) · s · m operations on words of length Õ(m).
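A minimal sketch of this step (illustrative sizes, dense G for readability): the 2m − 1 distinct blocks are obtained by alternating a sparse multiplication by A with a multiplication by G^T, and assembling them yields the block Hankel Gram matrix.

```python
# Forming the Hankel blocks H_i = G^T A^i G, i = 0, ..., 2m-2.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
n, s, m = 64, 8, 8
A = sp.random(n, n, density=0.05, random_state=3, format="csr")
A = A + A.T + n * sp.identity(n)                  # symmetric, well conditioned
G = rng.standard_normal((n, s))

blocks, P = [], G.copy()
for _ in range(2 * m - 1):
    blocks.append(G.T @ P)                        # H_i = G^T (A^i G), an s-by-s block
    P = A @ P                                     # advance to A^{i+1} G

# The (ms)-by-(ms) Gram matrix K^T K is block Hankel: block (i, j) is H_{i+j}.
M = np.block([[blocks[i + j] for j in range(m)] for i in range(m)])

# Cross-check against the explicit Gram matrix of K = [G, AG, ..., A^{m-1}G].
K = np.hstack([np.linalg.matrix_power(A.toarray(), i) @ G for i in range(m)])
assert np.allclose(M, K.T @ K)
```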
The key step is then Step 2, a block version of the Conjugate Gradient method. It will be implemented using a recursive data structure based on the notion of displacement rank.7 To get a sense of why a faster algorithm may be possible, note that there are only O(m) distinct blocks in the block Hankel matrix. So a natural hope is to invert these blocks by themselves; the cost of (stable) matrix inversion,4 times the numerical word complexity, would then give a total of
Õ((n/m)^ω · m²).
Of course, it does not suffice to solve these s-by-s blocks independently. Instead, the full algorithm, as well as the operator, is built by efficiently convolving such s-by-s blocks with matrices using Fast Fourier Transforms. Such ideas can be traced back to the development of super-fast solvers for (entry-wise) Hankel/Toeplitz matrices.7
Choosing s and m so that s · m ≈ n would then give the overall running time, assuming that we can lower bound the minimum singular value of the block Krylov matrix K. This is a shortcoming of our analysis: we can only prove such a bound when K has somewhat fewer than n columns. The underlying reason is that rectangular semi-random matrices can be analyzed using ε-nets, and thus are significantly easier to analyze than square matrices.
This means we can only use s and m such that s · m is slightly less than n, and we need to pad K with additional columns to guarantee a full rank, invertible, matrix. To this end, we add n − s · m dense Gaussian columns to K to form a square matrix, and solve the system in this padded matrix, and its associated Gram matrix, instead. These matrices are shown in Figure 4.
Since these additional columns are entry-wise i.i.d., the minimum singular value can be analyzed using existing tools,18,21 namely lower bounding the inner product of a random vector against any fixed vector. Thus, we can lower bound the minimum singular value of the padded matrix, and in turn of its Gram matrix.
This bound in turn translates to a lower bound on the minimum eigenvalue of the Gram matrix of the padded matrix. Partitioning its entries by those coming from K and those coming from the dense Gaussian columns gives four blocks: one (s · m)-by-(s · m) block corresponding to K, one small block corresponding to the padding columns, and then the cross terms. To solve this matrix, we apply block-Gaussian elimination, or equivalently, form the Schur complement onto the small block corresponding to the padding columns.
To compute this Schur complement, it suffices to solve the top-left block (corresponding to K) against every column in the cross term. As there are at most n − s · m such columns, this solve cost comes out to less than the costs above as well. We are then left with a small matrix on the padding columns, whose solve cost is a lower order term.
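A minimal sketch of this block elimination step on a generic 2-by-2 block partition (a dense stand-in for illustration; in the algorithm the large block is handled by the fast Hankel solver and the small block comes from the padding columns):

```python
# Solve [[H, B], [B^T, C]] [x1; x2] = [b1; b2] via the Schur complement onto C.
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 4                                   # k = number of padding columns
M = rng.standard_normal((n + k, n + k))
M = M @ M.T + (n + k) * np.eye(n + k)          # a symmetric positive definite test matrix
b = rng.standard_normal(n + k)

H, B, C = M[:n, :n], M[:n, n:], M[n:, n:]
b1, b2 = b[:n], b[n:]

Hinv_B = np.linalg.solve(H, B)                 # solve H against the k cross-term columns
Hinv_b1 = np.linalg.solve(H, b1)
S = C - B.T @ Hinv_B                           # k-by-k Schur complement

x2 = np.linalg.solve(S, b2 - B.T @ Hinv_b1)    # solve the small system
x1 = Hinv_b1 - Hinv_B @ x2                     # back substitute

assert np.allclose(M @ np.concatenate([x1, x2]), b)
```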
So the final solver cost is dominated by the terms above for generating the Krylov space and solving the block Hankel matrix, which leads to the final running time by choosing m to balance the terms. This bound falls short of the ideal case given in Equation 1 mainly due to the need for a denser G to ensure the well-conditionedness of the Krylov space matrix. Instead of O(nnz(A)) non-zeros in total, or about nnz(A)/n per column, we need a larger number of non-zero entries per column of G to ensure the desired bound on the condition number of the block Krylov space matrix K. This in turn leads to a higher total cost for computing the blocks of the Hankel matrix, and a worse trade-off when summed against the matrix inversion term.
Acknowledgments
Richard Peng was supported in part by NSF CAREER award 1846218/2330255, and Santosh Vempala by NSF awards AF-1909756 and AF-2007443. We thank Mark Giesbrecht for bringing to our attention the literature on block-Krylov space algorithms.