In every corner of machine learning and statistics, there is a need for estimators that work not just in an idealized model, but even when their assumptions are violated. Unfortunately, in high dimensions, being provably robust and being efficiently computable are often at odds with each other.
We give the first efficient algorithm for estimating the parameters of a high-dimensional Gaussian that is able to tolerate a constant fraction of corruptions that is independent of the dimension. Prior to our work, all known estimators either needed time exponential in the dimension to compute or could tolerate only an inverse-polynomial fraction of corruptions. Not only does our algorithm bridge the gap between robustness and algorithms, but also it turns out to be highly practical in a variety of settings.
1. Introduction
Machine learning is filled with examples of estimators that work well in idealized settings but fail when their assumptions are violated. Consider the following illustrative example: We are given samples X1, X2, … , XN from a one-dimensional Gaussian N(μ, σ²), and our goal is to estimate its mean μ and its variance σ². It is well-known that the empirical mean and empirical variance are effective, which are defined as
$$\hat{\mu} \;=\; \frac{1}{N}\sum_{i=1}^{N} X_i \qquad \text{and} \qquad \hat{\sigma}^2 \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(X_i - \hat{\mu}\big)^2.$$
In fact, these are examples of a more general paradigm within statistics called maximum likelihood estimation: When we know the distribution comes from some parametric family, we choose the parameters that are the most likely to have generated the observed data. In 1922, Ronald Fisher12 formulated the maximum likelihood principle. It has many wonderful properties (under various technical conditions), such as converging to the true parameters as the number of samples goes to infinity, a property called consistency. Moreover, it has asymptotically the smallest possible variance among all unbiased estimators, a property called asymptotic efficiency.
In 1960, John Tukey24 challenged the conventional wisdom in parametric estimation by asking a simple question: Are there provably robust methods to estimate the parameters of a one-dimensional Gaussian? He showed that various estimators that were not asymptotically consistent (and had thus fallen out of favor) outperformed the maximum likelihood estimator when the data is not exactly Gaussian, but instead comes from a distribution that is close to being Gaussian. His paper launched the field of robust statistics15,13 that seeks to design estimators that behave well in a neighborhood around the true model. In one dimension, robust statistics prescribes that it is better to use the empirical median than the empirical mean. Similarly, it is better to use the empirical median absolute deviation (or any number of other estimators based on quantiles) than the empirical standard deviation. See Section 3.1.
Although there is an urgent need for provably robust estimators in virtually every application of machine learning, there is a major obstacle to directly applying ideas from robust statistics. The difficulty is that virtually all provably robust estimators are hard to compute in high dimensions. In this work, we are interested in the following family of questions:
QUESTION 1.1. Let D be a family of distributions on Rd. Suppose we are given samples generated from the following process: First, m samples are drawn from some unknown distribution P in D. Then, an adversary is allowed to arbitrarily corrupt an ε-fraction of the samples. Can we efficiently find a distribution P’ in D that is f(ε, d)-close, in total variation distance, to P?
Our most important example is the direct generalization of John Tukey’s challenge24 to higher dimensions: Is there a provably robust algorithm for estimating the parameters of a high-dimensional Gaussian? Without algorithmic considerations, robust statistics already provides prescriptions such as the Tukey median25 and the minimum volume enclosing ellipsoid23 for estimating the high-dimensional mean and covariance, respectively. However, the best-known algorithms for computing these estimates run in time that is exponential in the dimension. In fact, we are not aware of any moderate-sized datasets with dimension larger than six where these estimates have been successfully computed!
In contrast, there are other techniques that one might try. For example, instead of computing the Tukey median, we could compute the coordinate-wise median. This can obviously be done in polynomial time but encounters a different sort of difficulty: It turns out that by adding corruptions along a direction that is not axis-aligned, an adversary can badly compromise the estimator. Quantitatively, even if an adversary is allowed to corrupt only an ε-fraction of the samples, they can force the estimator to find a Gaussian whose mean is as far as Ω(ε√d) from the true mean in ℓ2-distance, implying that the total variation distance is close to one.
The main meta-question behind our work is: Is it possible to design estimators that are both provably robust in high dimensions (i.e., they do not lose dimension-dependent factors in their robustness guarantees) and also efficiently computable? We will give the first provably robust and computationally efficient methods for learning the parameters of a high-dimensional Gaussian, as well as various other related models. In concurrent and independent work, Lai et al.20 gave alternative algorithms albeit with weaker guarantees. We discuss their work in Section 1.3.
The types of questions we study here also have roots in computational learning theory. They are related to the agnostic learning model of Kearns et al.17 where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible, among all functions in some given class. In contrast, we are interested in an unsupervised learning problem, but it is also agnostic in the sense that we want to find the approximately closest fit from among our family of distributions. Within machine learning, these types of problems are also called estimation under model misspecification. The usual prescription is to use the maximum likelihood estimator, which is unfortunately hard to compute in general. Even ignoring computational considerations, the maximum likelihood estimator is only guaranteed to converge to the distribution P’ in D that is closest (in Kullback-Leibler divergence) to the distribution from which the observations are generated. This is problematic because such a distribution is not necessarily close to P at all.
More broadly, in recent years, there has been considerable progress on a variety of problems in this domain, such as algorithms with provable guarantees for learning mixture models, phylogenetic trees, hidden Markov models, topic models and independent component analysis. These algorithms are based on the method of moments and crucially rely on the assumption that the observations were actually generated by a model in the family. However, this simplifying assumption is not meant to be exactly true, and it is an important direction to explore what happens when it holds only in an approximate sense. Our work can be thought of as a first step toward relaxing the distributional assumptions in these applications, and subsequent work in algorithmic robust statistics has given new methodologies for robustly estimating higher moments under weaker assumptions about the uncorrupted distribution.5, 10, 14, 19
All of our algorithms are based on a common recipe. The first step is to answer the following easier question: Even if we were given a candidate hypothesis P′, how could we tell if it is ε-close in total variation distance to P? The usual way to certify closeness is to exhibit a coupling between P and P′ that marginally samples from both distributions, with the property that the samples are the same with probability 1 – ε. However, we have no control over the process by which samples are generated from P, which we would need in order to produce such a coupling. And even then, the way that an adversary decides to corrupt samples can introduce complex statistical dependencies.
We circumvent this issue by working with an appropriate notion of parameter distance, which we use as a proxy for the total variation distance between two distributions in the class D. See Section 2.2. Various notions of parameter distance underlie various efficient algorithms for distribution learning in the following sense. If θ and θ′ are two sets of parameters that define distributions Pθ and Pθ′ in a given class D, a learning algorithm often relies on establishing the following type of relation between the total variation distance dTV(Pθ, Pθ′) and the parameter distance dp(θ, θ′):
$$d_{TV}(P_\theta, P_{\theta'}) \;\le\; \mathrm{poly}(d)\cdot d_p(\theta, \theta'). \qquad (1)$$
Unfortunately, in our agnostic setting, we cannot afford for (1) to have any dependence on the dimension d at all. Any such dependence would appear in the error guarantee of our algorithm. Instead, the starting point of our algorithms is a notion of parameter distance that satisfies
$$d_{TV}(P_\theta, P_{\theta'}) \;\le\; O\big(d_p(\theta, \theta')\big) \qquad (2)$$
that allows us to reformulate our goal of designing robust estimators, with distribution-independent error guarantees, as the goal of robustly estimating θ according to dp. In several settings, the choice of the parameter distance is rather straightforward. It is often the case that some variants of the ℓ2-distance between the parameters work.
Given our notion of parameter distance satisfying (2), our main ingredient is an efficient method for robustly estimating the parameters. We provide two algorithmic approaches that are based on similar principles. Our first approach is fast and practical, requiring only approximate eigenvalue computations. Our second approach relies on convex programming, which has the advantage that it is possible to mix in different types of constraints (such as those generated by the sum-of-squares hierarchy) to tackle more complicated settings. Notably, either approach can be used to give all of our concrete learning applications with nearly identical error guarantees. In what follows, we specialize to the problem of robustly learning the mean μ of a Gaussian whose covariance is promised to be the identity, which we will use to illustrate how both approaches operate. We emphasize that learning the parameters in more general settings requires many additional ideas.
Our first algorithmic approach is an iterative greedy method that, in each iteration, filters out some of the corrupted samples. In particular, given a set of samples S′ that contains a large set S of uncorrupted samples, an iteration of our algorithm either returns the sample mean of S′ or finds a filter that allows us to efficiently compute a set S″ ⊂ S′ that is much closer to S. Note the sample mean (even after we remove points that are obviously outliers) can be Ω(ε√d)-far from the true mean in ℓ2-distance. The filter approach shows that either the sample mean is already a good estimate for μ or else there is an elementary spectral test that rejects some of the corrupted points and almost none of the uncorrupted ones. The crucial observation is that if a small number of corrupted points are responsible for a large change in the sample mean, it must be the case that many of the corrupted points are very far from the mean in some particular direction.
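To see why, consider the following informal back-of-the-envelope version of this observation (a sketch of the intuition, not a precise statement). Suppose the uncorrupted points have mean approximately μ, and an ε-fraction of corrupted points with mean μbad shifts the sample mean by δ in ℓ2-distance:
$$\hat{\mu} \;\approx\; (1-\varepsilon)\,\mu + \varepsilon\,\mu_{\mathrm{bad}}, \qquad \|\hat{\mu} - \mu\|_2 = \delta \ \ \Longrightarrow\ \ \|\mu_{\mathrm{bad}} - \mu\|_2 \approx \delta/\varepsilon.$$
Letting v be the unit vector in the direction of μbad − μ, the corrupted points alone contribute roughly
$$\varepsilon\,\big(v \cdot (\mu_{\mathrm{bad}} - \hat{\mu})\big)^2 \;\approx\; \varepsilon\,(\delta/\varepsilon)^2 \;=\; \delta^2/\varepsilon$$
to the empirical variance along v, which noticeably exceeds the variance of 1 expected under N(μ, I) as soon as δ is much larger than √ε. This excess is precisely what the spectral test detects.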
Our second algorithmic approach relies on convex programming. Here, instead of rejecting corrupted samples, we compute appropriate weights wi for the samples Xi, so that the weighted empirical average is close to μ. We require the weights to be in the convex set Cτ, whose defining constraints are:
$$\sum_{i=1}^{N} w_i = 1, \qquad 0 \le w_i \le \frac{1}{(1-\varepsilon)N} \ \text{ for all } i, \qquad \Big\|\sum_{i=1}^{N} w_i\,(X_i - \mu)(X_i - \mu)^\top - I\Big\|_2 \;\le\; \tau.$$
We prove that any set of weights in Cτ yields a good estimate $\hat{\mu} = \sum_{i} w_i X_i$. The catch is that the set Cτ is defined based on μ, which is unknown. Nevertheless, it turns out that we can use the same types of spectral arguments that underlie the filtering approach to design an approximate separation oracle for Cτ. Combined with standard results in convex optimization, this yields our second algorithm for robustly estimating μ.
The third and final ingredient is some new concentration bounds. In both of the approaches above, at best we hope that we can remove all of the corrupted points and be left with only the uncorrupted ones, and then use standard estimators (e.g., the empirical average) on them. However, an adversary could have removed an ε-fraction of the samples in a way that biases the empirical average of the remaining uncorrupted samples. What we need are concentration bounds that show for sufficiently large N, for samples X1, X2, …, XN from a Gaussian with mean μ and identity covariance, that every (1 – ε)N set of samples produces a good estimate for μ. In some cases, we can derive such concentration bounds by appealing to known concentration inequalities and taking a union bound. However, in other cases (e.g., concentration bounds for degree two polynomials of Gaussian random variables), the existing concentration bounds are not strong enough, and we need other arguments to prove what we need.
Finally, we briefly discuss how to adapt our techniques to robustly learn the covariance. Suppose the mean is zero and consider the following convex set Cτ, where Σ is the unknown covariance matrix:
$$C_\tau \;=\; \Big\{ w :\ \sum_{i=1}^{N} w_i = 1,\ \ 0 \le w_i \le \frac{1}{(1-\varepsilon)N} \ \text{ for all } i,\ \ \Big\|\Sigma^{-1/2}\Big(\sum_{i=1}^{N} w_i\, X_i X_i^\top\Big)\Sigma^{-1/2} - I\Big\|_F \le \tau \Big\}.$$
Again, the constraints defining the convex set are based on the parameters of the distribution (this time, they use knowledge of Σ). We design an approximate separation oracle for this unknown convex set by analyzing the spectral properties of the fourth moment tensor of a Gaussian. It turns out that our algorithms for robustly learning the mean when the covariance is the identity and robustly learning the covariance when the mean is zero can be combined to solve the general problem.
We give the first efficient algorithms for agnostically learning several important distribution classes with dimension-independent error guarantees. Our main result is an algorithm for robustly learning a high-dimensional Gaussian with an almost optimal error guarantee. Throughout this paper, we write N = Ω̃(f(d, ε, δ)) when referring to our sample complexity, to signify that our algorithm works if N ≥ C f(d, ε, δ)polylog(f(d, ε, δ)) for a large enough universal constant C.
THEOREM 1.2. Let μ, Σ be arbitrary and unknown, and let ε > 0 be given. There is a polynomial time algorithm that, given an ε-corrupted set of N samples from N(μ, Σ) with N sufficiently large (a fixed polynomial in d and 1/ε), produces μ̂ and Σ̂ so that, with probability 0.99, we have dTV(N(μ̂, Σ̂), N(μ, Σ)) ≤ O(ε log^{3/2}(1/ε)).
In later work,5 we improved the sample complexity to N = O(d²/ε²),
which is optimal up to constant factors even when there are no corruptions. Moreover, it was observed in Diakonikolas et al.6 that the error guarantee of these algorithms can be improved to O(ε log(1/ε)), which is the best possible for statistical query algorithms.9
Beyond robustly learning a high-dimensional Gaussian, we give the first efficient robust learning algorithms with dimension-independent error guarantees for various other statistical tasks, such as robust estimation of a binary product distribution, robust density estimation for mixtures of any constant number of spherical Gaussians, and mixtures of two binary product distributions (under some natural balanced-ness condition). We emphasize that obtaining these results requires additional conceptual and technical ingredients. We defer a description of these results to the full version of our paper.
In concurrent and independent work, Lai et al.20 also study high-dimensional robust estimation. Their results hold more generally for distributions with bounded moments, but our guarantees are stronger (and optimal up to polylogarithmic factors) for the fundamental problem of robustly learning a Gaussian.
After both our and their work, there has been a flurry of activity in the area, such as algorithms for robust list learning when the fraction of corruptions is greater than a half,3 algorithms for sparse mean estimation whose sample complexity is sublinear in the dimension,2 lower bounds against statistical query algorithms,9 and extensions to other generative models with weaker moment conditions14, 19 and various supervised learning problems.22, 7 An overview of recent developments in the area can be found in Diakonikolas and Kane.8 We also note that spectral techniques for robust learning, which are relatives of our algorithms, appeared in earlier work.18, 1 These works employed a “hard” filtering step (for a supervised learning problem), which only removes outliers and consequently leads to errors that scale logarithmically with the dimension.
2. Preliminaries
Formally, we will work in the following corruption model: DEFINITION 2.1. For a given ε > 0 and an unknown distribution P, we say that S is an ε–corrupted set of samples from P of size N if S = G ∪ E \ Sr, where G is a set of N independent samples from P, Sr ⊂ G, and E and Sr satisfy |E| = |Sr| ≤ εN.
In other words, a set of samples is ε-corrupted if an ε-fraction of the samples has been arbitrarily changed, which we can think of as a two-step process: first the adversary removes the samples in Sr and then adds in its own arbitrarily chosen data points E. Note that the ε-corruption model is a strong model of corruption, giving the adversary more power than other classical notions of corruption, such as Huber’s contamination model. We can visualize how the adversary can change the probability density function of P as follows:
Here, the blue curve is the original density function, and the green curve is the new density function, which is merely approximately close to P. The regions where the blue curve lies above the green curve are places where the adversary has deleted samples, and the regions where the green curve is above the blue curve are places where it has injected samples. In fact, the true process is even more complicated because if an adversary first inspects the samples and then decides what to corrupt, even the samples it has not corrupted are no longer necessarily independent.
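As a concrete (and deliberately weak) illustration of this corruption process, the following Python sketch generates an ε-corrupted dataset in the sense of Definition 2.1 by deleting εN of the clean samples and planting εN identical outliers; a real adversary may of course inspect the data and act far more cleverly. The function name and the planted outlier are purely illustrative.

import numpy as np

def eps_corrupt(clean, eps, outlier, rng):
    # Definition 2.1: remove eps*N of the clean points (S_r) and add eps*N
    # adversarially chosen points (E).
    N, d = clean.shape
    k = int(eps * N)
    keep = rng.permutation(N)[: N - k]
    planted = np.tile(outlier, (k, 1))
    return np.vstack([clean[keep], planted])

rng = np.random.default_rng(0)
d, N, eps = 50, 10000, 0.1
clean = rng.standard_normal((N, d))                # i.i.d. samples from N(0, I)
corrupted = eps_corrupt(clean, eps, np.full(d, 3.0), rng)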
It turns out that this model has very close connections to a natural measure of distance between distributions, namely, total variation distance:
DEFINITION 2.2. Given two distributions P, Q over Rd with probability density functions p, q, respectively, the total variation distance between P and Q is given by
$$d_{TV}(P, Q) \;\triangleq\; \frac{1}{2}\int_{\mathbb{R}^d} \big|p(x) - q(x)\big|\, dx.$$
The reason for this connection is as follows:
FACT 2.3. Let P, Q be two distributions with dTV(P, Q) = ε. Let S be N i.i.d. samples from Q. Then, with probability at least 1 − exp(−Ω(εN)), S can be viewed as a set of (1 + o(1))ε–corrupted samples from P.
This means that, up to subconstant factors in the fraction of corrupted points, learning from ε-corrupted data is at least as hard as learning a distribution P when all we get are samples from some other distribution that is ε-close to it in total variation distance. This fact immediately implies that if we are given ε-corrupted samples from P, the best we can generally hope for is to recover some P̂ with dTV(P̂, P) = O(ε). As we shall see, it is often possible to match this lower bound (up to logarithmic factors).
2.2. Connections to parameter distance
In this paper, our focus is on robust Gaussian estimation, when P = N(μ, Σ) is a Gaussian distribution. That is, given a set of N ε-corrupted samples from some unknown Gaussian distribution P, the goal is to output a Gaussian distribution N(μ̂, Σ̂) such that dTV(N(μ̂, Σ̂), P) is small. As it turns out, learning a Gaussian in total variation distance is closely tied to learning the parameters of the distribution, in the natural affine-invariant measure. This is captured by the following two lemmata. Throughout this paper, we will write f(x) ≲ g(x) if f(x) ≤ Cg(x) for all x and some universal constant C. We also let ||A||F denote the Frobenius norm of a matrix A.
LEMMA 2.4. For any μ, μ′ ∈ Rd, we have
$$d_{TV}\big(\mathcal{N}(\mu, I),\ \mathcal{N}(\mu', I)\big) \;=\; \Theta\big(\min\{1,\ \|\mu - \mu'\|_2\}\big).$$
LEMMA 2.5. For any full-rank positive semidefinite matrices Σ, Σ′, we have
$$d_{TV}\big(\mathcal{N}(0, \Sigma),\ \mathcal{N}(0, \Sigma')\big) \;=\; \Theta\big(\min\{1,\ \|\Sigma - \Sigma'\|_{\Sigma}\}\big),$$
where ∥A∥Σ≜∥Σ−1/2 AΣ−1/2∥F.
The first lemma states that if the covariances are both the identity, then the total variation distance between the two Gaussians is essentially the ℓ2-distance between the means of the Gaussians, except when the means are far apart. Note that total variation distance is always at most 1, and so when the means are very far apart, we only get a constant lower bound.
The second lemma is similar: It says that if both means are zero, then the total variation distance is captured by the Frobenius norm distance between the covariances, but “pre-conditioned” by one of the covariances. This is simply a high-dimensional analog of the fact that in one dimension, if we wish to get a meaningful approximation to the variance of a Gaussian, we need to learn it to multiplicative error.
3. Robust Estimation
3.1. Univariate robust estimation
For the sake of exposition, we begin with robust univariate Gaussian estimation. A first observation is that the empirical mean is not robust: even changing a single sample can move our estimate by an arbitrarily large amount. To see this, let μ̂ = (1/N) Σi Xi be the empirical mean of the dataset before corruptions, and let μ̂′ be the empirical mean after increasing the value of the sample X1 by some amount t. Although standard concentration arguments imply that |μ̂ − μ| is small, we have that μ̂′ − μ̂ = t/N, which we can make arbitrarily large with our choice of t. Fortunately, we describe a simple approach based on order statistics, which will allow us to estimate both the mean and the variance, even when a constant fraction of our dataset has been corrupted.
The most well-known robust estimator for the mean of a Gaussian is the median. More precisely, we let
$$\hat{\mu} \;\triangleq\; \mathrm{median}(X_1, \ldots, X_N).$$
Similarly, as an estimate for the standard deviation, we can consider a rescaling of the median absolute deviation (MAD), letting
$$\hat{\sigma} \;\triangleq\; \frac{\mathrm{median}\big(|X_1 - \hat{\mu}|, \ldots, |X_N - \hat{\mu}|\big)}{\Phi^{-1}(3/4)},$$
where Φ−1 is the inverse of the Gaussian cumulative distribution function. The rescaling is required to make the MAD a consistent estimator for the standard deviation. The median and MAD allow us to robustly estimate the underlying Gaussian:
THEOREM 3.1. Given a set of N ε–corrupted samples from N(μ, σ²), with probability at least 1 – δ, we have
$$|\hat{\mu} - \mu| \;\le\; C\Big(\varepsilon + \sqrt{\tfrac{\log(1/\delta)}{N}}\Big)\,\sigma \qquad \text{and} \qquad |\hat{\sigma} - \sigma| \;\le\; C\Big(\varepsilon + \sqrt{\tfrac{\log(1/\delta)}{N}}\Big)\,\sigma,$$
for a universal constant C.
This estimator is the best of all possible worlds. It is provably robust. It can be computed efficiently. In fact, it also achieves the information-theoretically optimal sample complexity. The median and MAD are examples of robust estimators based on order statistics. There are other provably robust estimators based on winsorizing.13
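As a concrete illustration, here is a minimal Python sketch of the median and rescaled-MAD estimators described above, together with a small contamination experiment; the constant Φ−1(3/4) ≈ 0.6745 is the consistency rescaling mentioned above, and the specific numbers below are illustrative.

import numpy as np
from scipy.stats import norm

def robust_univariate(x):
    # Median as the robust mean estimate; rescaled MAD as the robust
    # standard deviation estimate.
    mu_hat = np.median(x)
    sigma_hat = np.median(np.abs(x - mu_hat)) / norm.ppf(0.75)
    return mu_hat, sigma_hat

rng = np.random.default_rng(1)
N, eps = 100000, 0.1
x = rng.normal(loc=2.0, scale=3.0, size=N)   # clean samples from N(2, 3^2)
x[: int(eps * N)] = 1e6                      # corrupt an eps-fraction of them
print(np.mean(x), np.std(x))                 # the empirical estimates are destroyed
print(robust_univariate(x))                  # median/MAD stay within O(eps)*sigma of (2, 3)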
3.2. Natural multivariate approaches that fail
There are many natural approaches for generalizing what we have learned in the one-dimensional case to the high-dimensional case. But as we will see, there is a tension between being provably robust and computationally efficient. First, consider a coordinate-by-coordinate approach where we robustly estimate the mean along each coordinate direction, and concatenate the d univariate estimates into an estimate for the d-dimensional mean vector. Although this achieves error Θ(ε) in each direction’s subproblem, combining the estimates results in an ℓ2 error of Θ(ε√d). In high-dimensional settings, this gives vacuous bounds on the total variation distance except for vanishingly small values of ε.
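The √d loss is easy to see in a small simulation sketch such as the one below: an ε-fraction of outliers is placed along the (non-axis-aligned) all-ones direction, the per-coordinate bias of the coordinate-wise median stays on the order of ε, and the ℓ2 error grows like √d. The constants and sizes here are illustrative.

import numpy as np

rng = np.random.default_rng(2)
eps, N = 0.1, 20000
for d in [10, 100, 1000]:
    X = rng.standard_normal((N, d))              # clean samples from N(0, I)
    X[: int(eps * N)] = 10.0                     # outliers along the all-ones direction
    err = np.linalg.norm(np.median(X, axis=0))   # l2 distance to the true mean (zero)
    print(d, err, err / np.sqrt(d))              # err grows ~ sqrt(d); the ratio stays roughly constant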
Alternatively, one could attempt to extend the median-based estimator to multivariate settings. Although the same definition of the median does not apply in more than one dimension, there are many ways to generalize it. One such generalization is the Tukey median,25 proposed specifically for the problem of robust estimation. The Tukey median of a dataset is the point (not necessarily in the dataset) that maximizes the minimum number of points on one side of any half-space through the point. Although this achieves the desired O(ε) accuracy, it is unfortunately NP-hard to approximate on worst-case datasets.16 Another multivariate notion of median is the geometric median, the point that minimizes the sum of ℓ2-distances to points in the dataset. Although this is efficiently computable in polynomial time, it unfortunately can also be shown to incur an error of Ω(ε√d).20
All the approaches mentioned so far have one of the following drawbacks:
- The optimization problem is NP-hard, making it intractable in settings of even moderate dimensionality.
- A large dimension-dependent factor appears in the error, resulting in very weak accuracy guarantees in high-dimensional settings.
At least one of these issues persists in all previously known approaches, and either one would preclude practical multivariate robust estimation. As a few more examples, tournament-based hypothesis selection methods give accurate results but are not computationally efficient. Alternatively, one could consider a pruning-based argument, which removes all points that are too far from the rest of the dataset. This is computationally efficient, but we again incur an error of Ω(ε√d).
The primary contribution of our main result is a method that avoids both these issues simultaneously, providing an algorithm that is computationally efficient and does not lose a dimension-dependent factor in the accuracy.
3.3. Robust mean estimation
To offer some insight into why things go wrong in multivariate settings, we delve a bit deeper into the pruning-based approach. For the time being, we restrict our attention to Gaussians with identity covariance. It is well-known that, given a dataset generated according to a d-dimensional spherical Gaussian distribution, all the data points will be tightly concentrated at a distance Θ(√d) from the mean. Thus, we can think about the distribution as being concentrated on a thin spherical shell.
A smart adversary can place all of its corruptions within the shell too, in such a way that they move the empirical mean by as much as Ω(ε√d) in ℓ2-distance. This demonstrates an intrinsic limitation of any algorithm that only looks for corruptions locally; as a result, any effective algorithm must remove points based on global properties of the dataset.
This is captured in the following key geometric lemma:
Lemma 3.2. Let ε ∈ (0, 1/2). Let S be an ε-corrupted set of points from N(μ, I) of size at least Ω(d/ε²). Let μ̂ and Σ̂ denote the empirical mean and covariance of S, that is,
$$\hat{\mu} \;=\; \frac{1}{|S|}\sum_{X_i \in S} X_i, \qquad \hat{\Sigma} \;=\; \frac{1}{|S|}\sum_{X_i \in S} (X_i - \hat{\mu})(X_i - \hat{\mu})^\top.$$
Then, with probability at least 0.99, we have:
$$\|\hat{\mu} - \mu\|_2 \;\le\; O\Big(\sqrt{\varepsilon\,\big\|\hat{\Sigma} - I\big\|_2} \;+\; \varepsilon\sqrt{\log(1/\varepsilon)}\Big).$$
This lemma is a slight rephrasing of Lemma 4.15 in Diakonikolas et al.4 At a high level, Lemma 3.2 states that if the true mean and the empirical (and potentially corrupted) mean are far apart, then the empirical variance must be noticeably different along some direction. Thus, the spectral norm of Σ̂ − I can be used to certify that our estimate is close to the true mean, in the sense that if ||Σ̂ − I||2 ≤ O(ε log(1/ε)), then ||μ̂ − μ||2 ≤ O(ε√(log(1/ε))).
On the other hand, when the empirical mean has been corrupted, Lemma 3.2 gives us a way to algorithmically make progress. It isolates a specific direction—namely, the top eigenvector of Σ̂ – I—in which the corrupted points must contribute a lot. Both of the algorithms we describe use information gleaned about the corruptions from the empirical moments in somewhat different ways.
Filtering approach. The filtering approach works by removing points from the dataset, using the above intuition. It proceeds as follows:
- Compute the top eigenvalue λ and eigenvector v of Σ̂ − I.
- If λ is sufficiently small, terminate and output μ̂.
- Otherwise, compute τi ≜ |v · (Xi − μ̂)| for each remaining point, and for an adaptively chosen threshold T, remove all Xi so that τi > T, and repeat.
If this is done carefully, then Lemma 3.2 guarantees that we always throw out many bad points compared to the number of good points we throw out. See Diakonikolas et al.4 for a detailed description of how the threshold is chosen. To make this formal, for any two sets A, B, define Γ(A, B)≜|AΔB|/|A|, which measures the relative size of the symmetric difference compared to the size of A. Then, we have the following guarantee for the filter:
Lemma 3.3 (INFORMAL). Let S = G ∪ E\Sr be an ε-corrupted set of points from N(μ, I) of size at least Ω̃(d/ε²). Then, with probability at least 0.99 and after a simple preprocessing step, the filter satisfies the following property: Given any S’ ⊆ S satisfying Γ(G, S‘) ≤ 2ε, the filter either
- outputs μ̂ so that ||μ̂ − μ||2 ≤ O(ε√(log(1/ε))), or
- outputs T so that Γ(G, T) ≤ Γ(G, S‘) − ε/α, where α = d log(d/ε) log(d log(d/ε)).
Note that Γ(G, S) ≤ 2ε initially. Now applying Lemma 3.3, we can guarantee that the procedure terminates after at most O(α) iterations. And when we terminate, again by Lemma 3.3, we are guaranteed to output μ̂, which is close to the true mean. As described, each application of the filter would require computing the top eigenvector of a d × d matrix, which would be prohibitively slow. However, it turns out that a rough approximation to the top eigenvector suffices for the correctness of the filter algorithm, and thus each iteration can be performed in nearly-linear time via an approximate power method.
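For concreteness, the following Python sketch implements the filtering loop in its simplest form. It uses a crude fixed tail threshold in place of the carefully chosen adaptive one analyzed in Diakonikolas et al.,4 and an exact eigendecomposition instead of the approximate power method, so it should be read as an illustration of the idea rather than the algorithm with its formal guarantees.

import numpy as np

def filter_mean(X, eps, c=10.0, max_iter=50):
    # Simplified filter for robust mean estimation when the inliers come from N(mu, I).
    X = X.copy()
    for _ in range(max_iter):
        d = X.shape[1]
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov - np.eye(d))
        i = np.argmax(np.abs(eigvals))
        lam, v = eigvals[i], eigvecs[:, i]
        if abs(lam) <= c * eps * np.log(1 / eps):
            return mu                              # empirical covariance looks Gaussian: accept
        tau = np.abs((X - mu) @ v)                 # projections onto the suspicious direction
        T = np.quantile(tau, 0.95)                 # crude fixed threshold (not the adaptive choice)
        X = X[tau < T]                             # remove the most extreme points and repeat
    return X.mean(axis=0)

For instance, this sketch can be run directly on an ε-corrupted dataset generated as in the snippet following Definition 2.1.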
Convex programming approach. The second approach uses the intuition behind Lemma 3.2 somewhat differently. Instead of trying to directly remove all of the consequential outliers, we seek to iteratively decrease their influence. Let S = G ∪ E\Sr be an ε-corrupted set of points. For each point Xi ∈ S, we associate a nonnegative weight wi. Ideally, we would want these weights to be uniform over S\E, and zero otherwise. A priori, the only thing we know about S\E is that it has size at least (1 – ε)N. Consequently, a natural constraint to put on the weights w is that they must lie within the convex hull of the set of weights that are uniform on sets of size (1 – ε)N. This is the following set:
$$W_{N,\varepsilon} \;\triangleq\; \Big\{ w \in \mathbb{R}^N :\ \sum_{i=1}^{N} w_i = 1 \ \text{ and } \ 0 \le w_i \le \frac{1}{(1-\varepsilon)N} \ \text{ for all } i \Big\}.$$
Given this, one can show that a slight extension of Lemma 3.2 implies that it suffices to find a set of weights w ∈ WN,ε so that the empirical distribution over S with these weights is spectrally close to the identity after centering at μ. To make this more formal, for any w ∈ WN,ε, let
$$\mu(w) \;\triangleq\; \sum_{i=1}^{N} w_i X_i \qquad \text{and} \qquad \Sigma(w) \;\triangleq\; \sum_{i=1}^{N} w_i \big(X_i - \mu(w)\big)\big(X_i - \mu(w)\big)^\top$$
be the mean and covariance of the empirical distribution over S, with weights w. Define also
$$M(w) \;\triangleq\; \sum_{i=1}^{N} w_i \big(X_i - \mu\big)\big(X_i - \mu\big)^\top$$
to be the empirical covariance, except we center at the true mean of the unknown distribution. Note that M(w) is a linear function of w, and so in particular, it is a convex function. Then, it suffices to solve the following convex problem:
$$\text{find } w \in W_{N,\varepsilon} \ \text{ such that } \ \big\|M(w) - I\big\|_2 \;\le\; C\,\varepsilon\log(1/\varepsilon), \qquad (5)$$
where C > 0 is some universal constant. If w ∈ WN,ε satisfies (5), then one can show that μ(w) is close to μ with high probability, provided that N = Ω(d/ε²).
There is an obvious difficulty in solving (5). Namely, the description of (5) requires knowledge of μ, the parameter that we wish to estimate! Luckily, we can still construct a separation oracle for (5), which will suffice for computing a solution. In particular, given w ∈ WN,ε, we want an algorithm that
- if w satisfies (5), outputs YES and
- otherwise, outputs a hyperplane ℓ so that ℓ(w) > 0 but ℓ(w’) < 0 for all w’ satisfying (5).
First, note that if we knew μ, then constructing such an oracle would be straightforward. We simply compute the largest eigenvalue λ of M(w) – I in magnitude, and its associated eigenvector v. If |λ| ≤ Cε log(1/ε), then output YES. If not, observe that the following is a separating hyperplane for w:
$$\ell(w') \;\triangleq\; \sigma\Big(\sum_{i=1}^{N} w'_i\,\big(v \cdot (X_i - \mu)\big)^2 \;-\; 1\Big) \;-\; C\,\varepsilon\log(1/\varepsilon), \qquad (6)$$
where σ is the sign of λ.
Now we need to remove the assumption that we know μ. The key insight is that Lemma 3.2 allows us to substitute the top eigenvalue of Σ(w) – I in magnitude and its associated eigenvector for λ and v, respectively, and μ(w) for μ in (6). At a high level, this is because if μ(w) is close to μ, then Σ(w) is very close to M(w). On the other hand, if μ(w) is far from μ, then Lemma 3.2 guarantees that the shift caused by centering at μ(w) rather than μ is overshadowed by the large eigenvalues of M(w) – I.
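To make the oracle concrete, here is a Python sketch of the approximate separation step just described: it computes μ(w) and Σ(w), examines the top eigenvalue of Σ(w) − I, and either accepts or returns the coefficients of a violated linear constraint of the form (6), with μ(w) and Σ(w) standing in for μ and M(w). It is an illustration of the oracle, not the full algorithm with its analysis.

import numpy as np

def approx_separation_oracle(w, X, eps, c=10.0):
    # Given weights w over the samples X (rows), either report that the weighted
    # second moment looks consistent with N(mu, I), or return a linear constraint
    #   sum_i coeffs[i] * w'_i <= offset
    # that every w' feasible for (5) should satisfy but the current w violates.
    d = X.shape[1]
    mu_w = w @ X                                       # weighted mean mu(w)
    centered = X - mu_w
    Sigma_w = (centered * w[:, None]).T @ centered     # weighted covariance Sigma(w)
    eigvals, eigvecs = np.linalg.eigh(Sigma_w - np.eye(d))
    i = np.argmax(np.abs(eigvals))
    lam, v = eigvals[i], eigvecs[:, i]
    thresh = c * eps * np.log(1 / eps)
    if abs(lam) <= thresh:
        return "feasible", mu_w
    sigma = np.sign(lam)
    coeffs = sigma * (centered @ v) ** 2               # linear functional of w', as in (6)
    return "cut", coeffs, sigma + thresh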
3.4. Robust covariance estimation
The same geometric intuition that underlies our algorithms for robustly learning the mean also forms the basis for our algorithms for robustly learning the covariance. This time, we momentarily restrict our attention to Gaussians with zero mean. In the case of robust mean estimation, Lemma 3.2 states that a shift in the first moment caused by a small fraction of outliers causes a noticeable deviation in the second moment. It turns out that the same principles work for robustly learning the covariance, we just need to use higher moments. In particular, if we want to detect when the empirical second moment has been compromised by a small fraction of outliers, there must be some evidence in the fourth moment. However, making this rigorous is technically involved.
At a high level, the main difficulty is that in the case of robust mean estimation, we know the structure of the second moment, even if we do not know the mean. Namely, we assume that the covariance is the identity. However, the structure of the fourth moment depends heavily on the unknown covariance, and as a result, it is nontrivial to formulate the proper analog of Lemma 3.2 for this setting.
Fortunately, the relationship between the second moment and the fourth moment of a Gaussian follows a predictable formula, as a special case of Isserlis’ theorem. For any vector v ∈ Rd, let v ⊗ v ∈ Rd² denote the tensor product of v with itself. Similarly, for any matrix M ∈ Rd×d, let M⊗2 ∈ Rd²×d² be its tensor product with itself. Finally, let M♭ ∈ Rd² be the d²-dimensional vector that comes from flattening M into a vector. Then, the key identity is the following: for any covariance matrix Σ, we have
$$\mathbf{E}_{X \sim \mathcal{N}(0,\Sigma)}\Big[(X \otimes X)\,(X \otimes X)^\top\Big] \;=\; 2\,\Sigma^{\otimes 2} \;+\; \Sigma^{\flat}\big(\Sigma^{\flat}\big)^\top, \qquad (7)$$
viewed as an operator on (flattened) symmetric matrices.
Consider the case where the unknown covariance Σ is well-conditioned, so that it suffices to learn Σ to small error in Frobenius norm. Let {X1, …, XN} be an ε-corrupted dataset and set Yi = Xi ⊗ Xi for all i ∈ [N]. If Yi is uncorrupted, then E[Yi] = Σ♭, so recovering Σ in Frobenius norm exactly corresponds to learning the mean of the uncorrupted Yi to small error in ℓ2 norm. Moreover, by (7), the covariance of the uncorrupted Yi is 2 Σ⊗2.
Thus, learning the covariance reduces to a complicated variant of the mean estimation problem, where the covariance depends on the unknown mean, but in a structured way. The relationship between them is sufficiently nice that if the empirical mean of the Yi is corrupted by outliers, then this still manifests as a large eigenvalue of the empirical covariance of the Yi. This allows us to formulate a more sophisticated analog of Lemma 3.2 for this setting (see Claim 4.29 in Diakonikolas et al.4). By leveraging this geometric structure, we can then devise generalizations of the filtering and convex programming approaches to robustly learn the covariance.
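As a quick sanity check of the identity (7) behind this reduction, the following snippet verifies numerically that, for a symmetric test matrix M, the variance of ⟨M, XX⊤⟩ under X ∼ N(0, Σ) matches the quadratic form of 2Σ⊗2 at M♭, namely 2 tr(MΣMΣ). The specific dimensions and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 500000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                         # a well-conditioned covariance
M = rng.standard_normal((d, d)); M = (M + M.T) / 2  # a symmetric test direction

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
vals = np.einsum('ni,ij,nj->n', X, M, X)            # <M, X X^T> for each sample
print(vals.var())                                   # empirical variance along direction M
print(2 * np.trace(M @ Sigma @ M @ Sigma))          # (M_flat)^T (2 Sigma tensor Sigma) M_flat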
3.5. Assembling the general algorithm
At this point, we have designed efficient algorithms to solve two important subcases of our general problem. Specifically, we can
- robustly estimate N(μ, I), up to error O(ε√(log(1/ε))) in total variation distance and
- robustly estimate N(0, Σ), up to error O(ε log(1/ε)) in total variation distance.
In fact, we can combine these primitives into an algorithm that works in the general case when both μ and Σ are unknown. The first observation is that we can use the doubling trick (even in the presence of noise) to zero out the mean. In particular, given two independent samples X1 and X2 from a distribution that is ε-close to a Gaussian N(μ, Σ), their difference X1 – X2 will be 2ε-close in distribution to N(0, 2Σ).
The second observation is that, given an estimate Σ̂ of the covariance, we can approximately whiten our dataset. After applying the transformation X ↦ Σ̂−1/2X to our data, we get noisy samples from N(Σ̂−1/2μ, Σ̂−1/2 Σ Σ̂−1/2).
This almost fits into the setting where just the mean is unknown, because the resulting covariance matrix is close to (but not exactly equal to) the identity matrix. Fortunately, we can exploit the robustness of our algorithms to handle this error, because the data is generated from a distribution that is O(ε log(1/ε))-close to a Gaussian with identity covariance. Putting the pieces together, we get an error guarantee of O(ε log^{3/2}(1/ε)). The overall algorithm is described in Algorithm 1. We will use X and Y to denote sets of samples that are fed into the various subroutines.
Algorithm 1 Algorithm for robustly learning a Gaussian
- function RECOVERROBUSTGAUSSIAN(ε, X1, …, X2N)
- Let Σ̂ ← LEARNCOVARIANCE(4ε, X) for X = {(X2i−1 − X2i)/√2 : i ∈ [N]}
- Let μ̃ ← LEARNMEAN(O(ε log(1/ε)), Y) for Y = {Σ̂−1/2Xi : i ∈ [N]}
- return N(Σ̂1/2μ̃, Σ̂)
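In code, the assembly step might look like the sketch below, where learn_covariance and learn_mean are placeholders for the two robust primitives above (illustrative names, not functions from an existing library), and where the particular pairing of the 2N samples in the doubling trick is one arbitrary choice.

import numpy as np
from scipy.linalg import inv, sqrtm

def recover_robust_gaussian(eps, X, learn_covariance, learn_mean):
    # X has 2N rows. learn_covariance / learn_mean are assumed to implement the
    # N(0, Sigma) and (approximately) N(mu, I) primitives, respectively.
    N = X.shape[0] // 2
    diffs = (X[:N] - X[N:2 * N]) / np.sqrt(2)        # doubling trick: approx. samples from N(0, Sigma)
    Sigma_hat = learn_covariance(4 * eps, diffs)
    root = np.real(sqrtm(Sigma_hat))
    Y = X[:N] @ inv(root).T                          # approximately whitened samples
    mu_white = learn_mean(eps, Y)                    # the pseudocode passes tolerance O(eps log(1/eps)) here
    return root @ mu_white, Sigma_hat                # undo the whitening on the mean estimate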
4. Experiments
Our algorithms (or rather, natural variants of them) not only have provable guarantees in terms of their efficiency and robustness but also turn out to be highly practical. In Diakonikolas et al.,5 we studied their performance on both synthetic and real-world data, and we discuss the results in this section.
In Figure 1, we demonstrate our results on synthetic data, for estimating the mean and covariance of a Gaussian. We compare our Filtering method with the algorithms of Lai et al.,20 the empirical plug-in estimator, the empirical estimator in combination with pruning, random sample consensus (RANSAC),11 and the geometric median (for mean estimation). The first row of plots displays mean estimation for an isotropic Gaussian (with ε = 0.1), the second row displays covariance estimation for an isotropic Gaussian, and the third row displays covariance estimation for a Gaussian with a highly-skewed covariance matrix (both with ε = 0.05). The first column of plots compares all the methods, whereas the second column of plots omits the less accurate methods, to allow a more fine-grained comparison between our algorithm and competitive methods. The x-axis of each plot indicates the dimension of the problem, and the y-axis indicates the error incurred by the estimation method, where the baseline of 0 is the error of the plug-in estimator on the uncorrupted data. In Figure 1, for the mean estimation plots, this error is measured via ℓ2-distance, whereas for covariance estimation, this is measured in terms of Mahalanobis distance.
Figure 1. Robust parameter estimation on synthetic data. Our method (filtering) is shown to outperform all alternatives for both mean estimation (first row) and covariance estimation (last two rows).
In all experiments, we found that our algorithm outperformed all other methods, often by substantial margins. As predicted by the theory, our error appears to remain constant as the dimension increases, whereas the error of all other methods grows (albeit minimally for the methods of Lai et al.,20 denoted LRV, whose error depends only logarithmically on the dimension). For mean estimation, our method performs better than LRV, which in turn performs much better than all alternatives. Similar trends are observed for covariance estimation, though the results are especially pronounced when estimating a skewed covariance, where our methods outperform all others by orders of magnitude.
In our semisynthetic experiments, displayed in Figure 2, we revisit a classic study of Novembre et al.21 In this study, the authors obtained a high-dimensional genomic dataset from the POPRES project. They annotated each data point with the individual’s country of origin and projected the dataset onto the top two principal components of the dataset. As displayed in the top-left plot in Figure 2, they found that the resulting projection closely resembles the map of Europe, thus leading to the adage that “genes mirror geography.” However, omitted from the description above is a crucial manual data curation process, in which immigrants were removed from the dataset, as they were considered to be genetic outliers. Our methods provide an automatic and principled way of removing outliers.
Figure 2. Robust exploratory data analysis on semisynthetic data. The top-left figure shows the projection of a high-dimensional genomic dataset onto its top two principal components, which resembles the map of Europe (center). In the presence of synthetic outliers, this structure is lost (top-right). Our methods for robust covariance estimation allow us to preserve this structure (bottom).
In our experiments, we worked with a projection of the original dataset onto the top 20 principal components. We injected synthetic noise points (ε = 0.1) into the dataset and repeated the experimental procedure described above. Even with a pruning step, we found the empirical estimator was not able to preserve the structure of Europe (top-right of Figure 2). However, our method (based on our robust Gaussian covariance estimation algorithm) was able to relatively faithfully recreate the original map of Europe (bottom-left and bottom-right of Figure 2). Despite our filter being designed for Gaussian data, the method worked on the genomic data (which is not necessarily Gaussian) with minimal alterations.