# Communications of the ACM

Research highlights

# Disentangling Gaussians

View as: Print Mobile App ACM Digital Library Full Text (PDF) In the Digital Edition Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook

The Gaussian mixture model (GMM) is one of the oldest and most widely used statistical models. It is comprised of a weighted combination of heterogeneous Gaussian sources. As a simple one-dimensional example, consider measurements of heights of adults in a certain population, where the distribution of heights can be closely approximated as a mixture of two univariate Gaussians, one for males and one for females.3 Can one recover the parameters of the Gaussians from unlabeled height measurements alone (with no gender information)?

Our work focuses on the case where the mixture consists of a small but unknown number of Gaussian "components" that may overlapthe combined density may even have a single peak, as in the height example, and the dimensionality may be high. Much of the previous work on this problem attempts to learn the parameters through clustering, and consequently needs to make a strong separation assumption on the components in the mixture. The primary contribution of our research is to avoid this assumption by instead basing our learning algorithm upon the algebraic structure of the mixture. Our algorithm succeeds even if the components overlap almost entirely, in which case accurate clustering is no longer possible.

We give a simple notion of "condition number" of a GMM which characterizes its complexity up to polynomial factors. Generally speaking, the conclusion is that the sample complexity and computational complexity of this general problem are in every way polynomial, except for the dependence on the number of Gaussians, which is necessarily exponential.

Statisticians have long known that from random samples from a GMM it is possible to identify the Gaussians in the limitone can eventually recover to arbitrary precision each Gaussian component's mean, variance, and proportion, given sufficiently many examples.14 However, their analysis provides no bounds on convergence rateas the number of samples increases, the convergence might be logarithmically slow even for two Gaussians in one dimension. Moreover, even if the problem has reasonable sample complexity, it is not clear, a priori, how to yield a computationally efficient algorithm. In practice, the heuristics in widespread use, such as the EM algorithm, suffer from local optima and have weak theoretical guarantees.

In seminal work, Dasgupta5 put forth a polynomial-time clustering approach for learning GMMs that is provably accurate under certain assumptions. In particular, if the Gaussians are sufficiently "well separated" then one can usually group together all the samples originating from the same Gaussian. As a by-product, parameter estimation follows easily by estimating the mean, covariance matrix, and proportion of samples from each cluster separately. In rapid succession, increasingly powerful clustering techniques that require "separability assumptions" between all pairs of Gaussians have since been analyzed.1, 2, 4, 6, 11, 16

In some sense, the parameter estimation problem is more fundamental than clustering, because accurate parameter estimates can be easily used for accurate clustering. And in the general case where the Gaussians may overlap, it is impossible to cluster the points with any degree of confidence, just as one cannot accurately predict gender from the fact that a person is, say, 1.7 m tall. Nonetheless, from height data alone one may still recover the means, variances, and mixing weights of the two constituent Gaussians. Hence, while one cannot hope to cluster the individual samples, one can decompose the Gaussian mixture and efficiently disentangle the Gaussian components.

1.1. Our approach

We describe a polynomial-time GMM learning algorithmwe emphasize that throughout, we favor clarity of presentation and ease of analysis at the cost of impracticality. The algorithm is based on the random projection method (see, e.g., Vempala15). Since the projection of a multinormal distribution onto one dimension is a normal distribution, the projection of a GMM is a GMM in one dimension. Roughly, the algorithm proceeds by projecting the data onto a sequence of vectors, solving the associated sequence of one-dimensional learning problems, and then reconstructing the solution to the original high-dimensional GMM problem.

There are several obstacles that must be surmounted to consummate this approach. First and foremost, is the question of solving the problem in one dimension. One-dimensional problems are often easy if one is willing to resort to brute-force search; for the problem of learning GMMs, it has already been mentioned that the bounds are necessarily exponential in the number of Gaussian components, hence one might expect that a crude brute-force search would suffice to solve the one-dimensional case. We show that this is true. The difficulty of the one-dimensional case is not in designing the algorithm, but in guaranteeing that this crude brute-force search can be conducted over a sufficiently coarse grid. We show this by establishing, what we term, the polynomially robust identifiability of GMMs, which we discuss in Section 4.

Supposing one has an efficient algorithm for the one-dimensional problem, a second obstacle in our highlevel approach is ensuring that the projected data that are given as inputs to the one-dimensional algorithm are meaningful. Consider, for example, a GMM that consists of two Gaussian components that have identical covariances, but different means. If, unluckily, we project the data onto a vector orthogonal to the difference in the means, then the resulting one-dimensional mixture will have just a single component. Further complicating this concern is the existence of GMMs, such as that depicted in Figure 3, for which two or more essentially non-overlapping components will, with very high probability, project to nearly identical Gaussians in a random projection. How can we hope to disentangle these components if, in nearly every projection, they are indistinguishable?

We demonstrate that this problem can only arise if a certain spectral condition holds. When this condition holds, however, we show that a clustering-based approach will be successful. Specifically, when the spectral condition is met, we will be able to partition the data samples into two sets, such that the Gaussian components from which the samples in the first set were drawn are (nearly) disjoint from the set of components from which the second set of samples were drawn. Thus this partition of the samples corresponds to a partition of the GMM into two sub-mixtures; we can then apply our algorithm recursively to each of these two sets of samples.

For clarity of exposition, in Section 3, we first describe a simplified version of our algorithm that does not require the clustering and recursion steps, though has slightly weaker performance guarantees. In Section 5, we describe the full algorithm.

### 2. Model and Results

The input to our algorithm is a sample set of n points in d dimensions, drawn independently from GMM , with mixing weights wi > 0 satisfying iwi = 1, and with k different Gaussian components Fi, where each Fi = N(i, i) is a distinct d-dimensional Gaussian with mean i Rd and covariance matrix i Rd×d. The output is an estimate GMM . The goal of the estimation algorithm is to correctly learn k and furthermore to approximate the parameter set {(w1, 1, 1), ..., (wk, k, k)}. Note that one cannot hope to learn the correct ordering of the components Fi, since any permutation results in an identical distribution.

To measure the distance between Gaussians N(, ) and N(', '), we employ the statistical distance. This natural metric is defined, more generally, for probability distributions. For distributions G and G' with respective probability density functions g and g', the statistical distance is defined as

This quantity is zero if the distributions are identical and one if the distributions have disjoint support. The main strength of statistical distance as a metric is that it measures the information theoretic similarity of two distributions; for instance, if two distributions A and B have Dtv(A, B) = 0.01, then no algorithm can distinguish a set of 100 samples from A from a set of 100 samples from B with probability more than 2/3. A second advantage of statistical distance is that it is scale invariant and affine invariant, meaning that an identical rescaling of two distributions does not affect the distance. In contrast, Euclidean error estimates such as '2 and change as one rescales the problem.a

We now define a quantity that controls "how hard" it is to learn a particular GMM.

DEFINITION 1. The condition number (F) of GMM is defined to be

Any estimation algorithm requires, at a minimum, a number of samples proportional to (F) to have any chance of accurate estimation. The reason is that one requires 1/wi samples to have a constant probability of encountering a single sample generated by Fi. Hence, for very small wi, a large sample size is necessary. Similarly, even if one exactly knows two probability distributions F and F', one requires 1/Dtv(F, F') samples to have a constant probability of distinguishing between the case that all samples arise from F or all samples arise from F'. Thus, at least a linear dependence on (F) is required, while our results have a polynomial dependence on (F).

We are now ready to state our main theorem.

THEOREM 1. For every k 1, there is a constant, ck, dependent on k, such that the following holds. For any d-dimensional GMM , and , the estimation algorithm when run on n samples independently drawn from F outputs GMM such that, with probability and there exists a permutation on {1, 2, ..., k} such that

The runtime of the estimation algorithm is polynomial in the number of samples, n.b

2.1. Interpretation

Our algorithm takes as input samples from a d-dimensional GMM F, and outputs estimates of the parameters of F. The main theorem shows that as the number of samples, n, increases the error of our estimates decreases as O(d/n)), where is a constant that only depends on the number of components, k. This rate of convergence is striking both because it exhibits a polynomial (as opposed to exponential) dependence on the number of dimensions, d, and because it exhibits an inverse polynomial dependence on the number of samples. This is in contrast to the inverse logarithmic rate of convergence that has been hypothesized.9 Lastly, the big-Oh notation hides a leading constant that has a polynomial dependence on the condition number of the mixture, (F).

In applications in which a learner has reasonable upper bounds on k and (F), we could determine upper bounds on how much data is required to learn a good estimateand run our algorithm accordingly. In contrast, if an upper bound on either k or (F) is missing, every estimator can be fooled into either outputting a mixture with the wrong number of components, or outputting a mixture which is far from the true GMM.

While the runtime and data requirement of our algorithm scale super-exponentially as a function of the number of Gaussian components, we show that, unfortunately, an exponential dependence on k is necessary at least in the case where the Gaussians overlap significantly. We give a simple construction of a mixture of k (polynomially) overlapping Gaussians that are exponentially close to a single Gaussian. This state of affairs is in contrast to the aforementioned clustering algorithms which depend only polynomially on k. Hence, when there are a large number of very well-separated Gaussians, clustering seems to be a better approach.

2.2. History and related work

Perhaps the earliest GMM study was conducted by Karl Pearson in the 1890s.12 He was given a dataset consisting of measurements of 1000 crabs found near Naples, and conjectured that the dataset arose as a GMM of two components, corresponding to two crab species. Pearson then attempted to recover estimates of the parameters of the two hypothesized species, using the method of moments. In particular, he computed empirical estimates of the first six (raw) moments , for i = 1, 2, ..., 6 from sample points x1, ..., xn R. Using only the first five moments, he solved a cleverly constructed ninth-degree polynomial, by hand, from which he derived a set of candidate mixture parameters. Finally, he heuristically chose the candidate among them whose sixth moment most closely agreed with the empirical estimate.

The potential problem with this approach, which Pearson acknowledged, was the issue of robust identifiability. Perhaps there exist two different mixtures, where the components of one mixture are very different from the components of the other mixture, but nevertheless the densities and the moments of the two mixtures are extremely similar. We show that this cannot be the case, in some sense validating Pearson's approach.

Recently, propitiated by the emergence of huge, high-dimensional datasets, the question of learning GMMs was revisited by the theoretical computer science community. In this body of work, initiated by Dasgupta,5 a line of computer scientists designed polynomial time algorithms for identifying and clustering in high dimensions.1, 2, 4, 6, 11, 16 The problem of clustering is that of partitioning the points into sets, with the hope that the points in each set are drawn from different Gaussians. This task generally requires the Gaussians to have little overlap (statistical distance near 1); in many such cases they were able to find computationally efficient algorithms for GMMs of more than two Gaussians.

Recall that our goal is to learn the mixture F = iwiFi on a component by component basis. An easier problem which has also received some attention is that of learning the density function of the entire mixture F without trying to figure out the parameters of the individual components. Recently, a polynomial-time density estimation algorithm was given for axis-aligned GMMs, without any nonoverlap assumption.8,c

Finally, we note that there is a vast literature on heuristics for the problem of learning GMMs, such as the EM algorithm.7 Our focus in this work is on algorithms with provable guarantees. Even though heuristics such as the EM algorithm often work well in practice, these approaches are not guaranteed to converge to the true parameters. Even worse, the EM algorithm (even for univariate mixtures of just two Gaussians) has been observed to converge very slowly (see Redner and Walker for a thorough treatment13) if the algorithm is initialized from a bad starting point.

### 3. A Simple Algorithm

We start by describing a simple algorithm that illustrates our approach to learning GMMs. While the performance guarantees of this algorithm are slightly weaker than those of the full algorithm, described in Section 5, the mechanics of the reduction of the high-dimensional problem into a series of one-dimensional learning problems is made clear.

One would expect that the problem of (approximately) determining the parameters of a univariate mixture of Gaussians should be algorithmically easy because we can resort to a brute-force search. However, the surprising difficulty is in proving that a brute-force search over a coarse grid of the parameters does not return wrong estimates. For example, what if there are two distinct (univariate) mixtures of Gaussians that do not have close parameters, but are nevertheless extremely close as distributions in statistical distance? The central challenge in proving that a brute-force search does work is in proving that this hypothetical example cannot arisethat two mixtures of (univariate) Gaussians that do not have similar parameters must be noticeably different.

To prove this we appeal to the method of moments, first introduced by Pearson. We prove that mixtures of k Gaussians are "polynomially robustly identifiable"that is if two mixtures of at most k Gaussian components have sufficiently different parameters, then one of the low-order moments will be noticeably different. In fact, one of the first 4k 2 moments will be noticeably different, and there are distinct mixtures of k Gaussians that can match exactly on the first linearly (in k) many moments.

Our one-dimensional learning algorithm will draw polynomially many samples from a GMM with k components, and compute the first 4k 2 moments of this set of samples. We refer to these as the empirical moments. Then for every candidate set of parameters {(1, 21, w1), ..., (k, 2k, wk)} in our grid, we analytically compute the first 4k 2 moments of the corresponding distribution and compare these values to the empirical moments. If each of the first 4k - 2 analytically computed moments is close enough to the corresponding empirical moment, then we declare success and output that candidate set of parameters.

Through elementary concentration bounds we can demonstrate that, whatever the true parameters are, the closest set of parameters in the grid will, with high probability, pass the test. Thus the algorithm will almost always output a set of parameters. The correctness of the algorithm then rests on demonstrating that the algorithm will never output a set of parameters that is too far from the true set of parameters. Section 4showing the polynomially robust identifiability of GMMsis dedicated to establishing the correctness of this algorithm. Notice that in the case of mixtures of two Gaussians, this brute-force search considers only the first six moments and hence we are able to obtain provable guarantees for a variant of Pearson's Sixth Moment Test.

3.2. A simple high-dimensional algorithm

Given samples from a high-dimensional GMM, our strategy is to reduce the problem to a series of one-dimensional learning problems. The goal is to obtain, for each component in the mixture, an estimate of this component's mean and co-variance matrix when projected onto many different directions. For each component, we can then use these estimates to set up a system of linear constraints on the high-dimensional mean and co-variance matrix of the component. This system can then be solved, yielding good estimates for these high-dimensional parameters.

We choose a vector v uniformly at random from the unit sphere and d2 perturbations v1,1 ..., vd,d of v. For each direction vi,j, we project the mixture onto direction vi,j and run our one-dimensional learning algorithm. Hence we obtain set of d2 parameters of one-dimensional mixtures of k Gaussians. We must now label the components of these d2 mixtures consistently, such that for each i = 1, ..., k, the ith Gaussian component in one of the d2 one-dimensional mixtures corresponds to the projection of the same high-dimensional component as the ith component of all the other one-dimensional mixtures.

In the general setting to which the full-algorithm of Section 5 applies, we will not always be able to do this consistent labeling. For the purposes of our simple high-dimensional learning algorithm, we will assume a certain technical condition that will make consistent labeling easy. Specifically, we assume that all components in the original mixture have noticeably different parameters. We prove that if each pair of componentssay, N(, ) and N(', ')has either '2 > , or 'Fr > , then with high probability over a randomly chosen direction v, the projected means or projected variances will differ by at least poly . Intuitively, this makes consistent labeling easy because when we project our data onto v, each component is noticeably different. Since each vi,j is a small perturbation of v, the projection of any high-dimensional component onto v and vi,j will be similar, allowing for easy consistent labeling.

For each component in the original mixture, after labeling, we have an estimate of the component's mean and variance when projected onto each direction vi,j. The one-dimensional parameters of the projection of a Gaussian are related to the high-dimensional parameters by a system of linear equations. If our one-dimensional estimates were not estimates, but were exact then we could back-solve this system to obtain the exact high-dimensional parameters. We bound the condition number of this system of equations, ensuring that if our one-dimensional estimates are close, then we obtain good estimates for the high-dimensional parameters.

3.3. Performance of the simple algorithm

The following proposition characterizes the performance of the Simple High-Dimensional Algorithm described above.

PROPOSITION 1. There exists a constant, ck dependent on k, such that given n independent samples from a GMM of k' k components in d dimensions, with probability at least 1 the simple high-dimensional algorithm, when run on inputs k, , and the n samples, will return a GMM such that there exists a labeling of the components satisfying, for all i, , and provided the following conditions hold:

• for all i, j k',

The runtime of the estimation algorithm is polynomial in the number of samples.

The key distinction between the performance of this algorithm and the performance of the more general algorithm that establishes Theorem 1 is in terms of the distance metric. Here, the input mixture is required to have components whose parameters differ from each other by at least . Our more general algorithm performs on samples from GMMs whose components can have arbitrarily similar parameters, provided that the statistical distance between components is at least . Additionally, the mixture returned by this simple algorithm is only guaranteed to have parameters that are very close in Euclidean distance to the true parametersthe mixture returned by the more general algorithm is guaranteed to be very close in both parameter distance and statistical distance.

### 4. Polynomially Robust Identifiability

It is well known that two GMMs whose components differ cannot have identical probability densities (see Teicher,14 for example)this is commonly referred to as "identifiability." Thus, given access to arbitrarily many samples, one can recover the parameters of a GMM to any desired accuracy. For the purposes of learning with limited data, however, we require a significantly more robust form of identifiability.

Consider two GMMs F, F' whose parameter sets, up to relabeling, differ by that is, no matter how one matches the components of the first GMM with the components of the second, there is some component whose mean, variance, or mixing weight differs by at least from the corresponding component in the other mixture. Given this discrepancy in parameter sets, what is the statistical distance D(F, F')? If the answer is inverse exponential in 1/, then our goal of polynomially efficient learning is hopelessall algorithms for learning GMMs would necessarily require an amount of data exponential in the desired accuracy. Stated differently, the convergence rate of an optimal estimator for the parameters of a GMM would be, at best, inverse-logarithmic in the number of samples.

Fortunately, such a dependence is not necessarywe show that GMMs are "polynomially robustly identifiable": if mixtures F, F' have parameter sets that differ by , and mixing weights at least , then their statistical distance is bounded below by poly(), for any fixed number of components. In fact, we prove this by showing that there must be a poly() discrepancy in the first few moments of F and F'. For mixtures of at most k components, considering the first 4k 2 moments is sufficient.

THEOREM 2. There exists a constant, ck, dependent on k, such that given a pair of one-dimensional GMMs, F, F', consisting of at most k components, with (F), (F') < 1/, if, for all i 4k 2,

then F, F' have the same number of components. Furthermore, there exists a correspondence between the components of F and F' such that the discrepancy in mean, variance, and mixing weight between corresponding components is at most .

The above theorem guarantees that if two GMMs have very similar low-order moments, the parameter sets must also be very similar, thereby guaranteeing the correctness of the simple brute-force search algorithm for learning one-dimensional GMMs presented in Section 3.

4.1. Deconvolution and moments

Our proof of the polynomially robust identifiability of GMMs relies on considering what happens if one "deconvolves" the probability density functions of a GMM by a Gaussian of carefully chosen variance. The convolution of two Gaussians is a Gaussian, just as the sum of two normal random variables is normal. Hence, we can also consider the "deconvolution" of the mixture by a Gaussian of variance, say, this is a simple operation which subtracts from the variance of each Gaussian in the mixture: Given a GMM with parameter set {(1, 21, w1), ..., (k, 2k, wk)}, we define the -deconvolved mixture to be the GMM with parameter set {(1, 21 , w1), ..., (k, 2k , wk)}, provided that < min 2i.

The intuition behind considering this transformed mixture is that by decreasing the variance of each Gaussian component, we are effectively disentangling the mixture by making the components overlap less. To illustrate this intuition, suppose we deconvolve by that is close to the minimum variance of any component. Unless the smallest variance Gaussian is closely matched (in both mean, variance and mixing weight) by a Gaussian in the other mixture, then the two mixtures will have large statistical distance after deconvolution. If, on the other hand, the smallest variance Gaussian is closely matched, then this pair of components can be stripped away from the respective GMMs as this pair contributes negligibly to the discrepancy between the two mixtures. We can then proceed by induction.

Unfortunately, the deconvolution transformation does not preserve the statistical distance between two distributions. However, we show that it, at least roughly, preserves the disparity in low-order moments of the distributions. Specifically, letting F(F) denote the result of -deconvolving mixture F, we show that if there is an i 4k 2 such that the ith moment of F(F) is at least poly() different than the ith moment of F(F') then there is a j 4k 2 such that the jth moment of F is at least poly() different than the jth moment of F'. To simplify notation, let Mi[F] = ExF[xi].

LEMMA 3. Suppose that each constituent Gaussian in F or F' has variances in the interval [, 1]. Then

The key observation here is that the moments of F and F(F) are related by a simple linear transformation, which can also be viewed as a recurrence relation for Hermite polynomials.

4.2. Discrepancy in moments

The deconvolution operation gives us a method to produce a large statistical distance between two mixtures with sufficiently different parameters. Additionally, deconvolution approximately preserves discrepancy in low-order moments. So all that remains is to demonstrate that two mixtures (after an appropriate deconvolution) not only have large statistical distance, but also have a non-negligible discrepancy in moments.

To accomplish this, we show that there are at most 4k 2 zero-crossings of the difference in densities, f = F(F) F(F') and then we construct a degree 4k 2 polynomial p(x) that always has the same sign as f(x), so that when p(x) is integrated against f(x) the result is at least poly(). We construct this polynomial so that the coefficients are bounded, and this implies that there is some moment i (at most the degree of the polynomial) for which the difference between the ith raw moment of F(F) and of F(F') is large.

The first step is to show that the point-wise difference between the density functions of any two mixtures of k Gaussians is either identically zero, or has at most 4k 2 zero crossings. This bound can be easily shown to be tight. Our proof of this fact relies on the following theorem, due to Hummel and Gidas10:

THEOREM 4 (THM 2.1 IN Hummel and Gidas10). Given f(x): , that is analytic and has n zeros, then for any 2 > 0, the function g(x) = f(x) ° N(0, 2, x) has at most n zeros.

Given this theorem, we can then prove an upper bound on the number of zero crossings by isolating the smallest variance component through deconvolution, removing this component and then proceeding inductively; by the above theorem, deconvolution does not decrease the number of zero crossings and since each component that we remove in this way is essentially a delta function, its removal reduces the number of zero crossings by at most two.

The proof of Theorem 2 then follows from assembling the pieces: Given a pair of one-dimensional GMMs whose parameter sets differ: (1) strip away all components in pairs of closely corresponding components, which has negligible effect on the discrepancy in moments of the pair of GMMs; (2) deconvolve by nearly the variance of the skinniest remaining component; (3) since this component is now nearly a Dirac delta function and there is no closely matching component in the second GMM, the deconvolved GMMs have non-negligible statistical distance; (4) non-negligible statistical distance implies non-negligible moment discrepancy; and (5) if there is a discrepancy in one the low-order moments of two GMMs, then after convolution by a Gaussian, there will still be a discrepancy in some low-order moment.

### 5. The Full Algorithm

We now motivate and describe our general algorithm for learning GMMs which, with high probability, returns a mixture whose components are accurate in terms of statistical distance. To get an intuitive sense for the types of mixtures for which the simple high-dimensional algorithm fails, consider the mixture of three components depicted in Figure 3. The two narrow components are very similar: both their means, and their covariance matrices are nearly identical. With overwhelming probability, the projection of this mixture onto any one-dimensional space will result in these two components becoming indistinguishable given any reasonable amount of data. Nevertheless, the statistical distance between these two components is close to one, and thus, information theoretically, we should be able to distinguish them.

How can we hope to disentangle these two components if, in nearly every one-dimensional projection, these components are indistinguishable? The intuition for the solution is also provided in the example: we can cluster out these two components and recurse. In particular, there is a vector (corresponding to the direction of small variance of these two components) such that if we project all the data onto this direction, the pair of narrow Gaussians is almost completely "disentangled" from the third component. Almost all of the data corresponding to the two narrow components will be contained within a small interval when projected on this direction, and almost none of the data generated by the third component will be contained in this interval.

If we are able to successfully perform such a clustering of the original mixture into two sub-mixtures, we can recurse. The central insight is that if we consider the sub-mixture corresponding to just the two narrow Gaussians, then we can re-scale the space by applying an affine transformation so that the resulting mean and variance are zero and one, respectively, in every direction. This re-scaling has the effect of stretching out this sub-mixture along the direction of small variance. In the resulting mixture of two Gaussians, if we project on a randomly chosen direction, the components will be noticeably different.

Our full algorithm will follow this general planin each step, our algorithm either learns a good estimate and outputs this estimate, or else will cluster the mixture into two proper sub-mixtures and recurse. The remainder of this section is devoted to explaining how we can learn a direction of small variance, and hence enable the clustering and recursion step if we are not able to directly apply the Simple High-Dimensional Algorithm of Section 3 to learn good estimates for the GMM components.

5.1. Learning a direction to project

How does one find a vector in which direction some of the components have small variance? Intuitively, finding this direction seems to require knowledge of the true mixture. Our approach will be to first learn an estimate of the mixture that is close to some partition of the true components, and thus gain some insight into the general structure of the mixture.

Suppose we add d-dimensional Gaussian noise to samples drawn from the example GMM of Figure 3. This would have the effect of "fattening" each component. After "fattening," the two narrow components would have extremely small statistical distance. So we could run our simple learning algorithm on this "fattened" mixture. Even though this distribution is a mixture of three Gaussians, the mixture is statistically extremely close to a mixture of two Gaussians. Our simple learning algorithm will return an estimate mixture of two Gaussians with the property that each component is close to a sub-mixture of the "fattened" distribution.

Thus one of the components in this estimate will correspond to the sub-mixture of the two narrow components. By examining this component, we notice that it is "skinny" (after adjusting the covariance matrix to account for the noise that we artificially added). Hence if we compute the smallest eigenvector of this co-variance matrix, we recover a direction which allows us to cluster the original mixture into two sub-mixtures and recurse.

If the Simple High-Dimensional Algorithm is run on samples from a GMM in which all components have large minimum eigenvalue (for example, if the samples have been "fattened"), then the algorithm, when run with target accuracy , will successfully learn the mixture provided that for each pair of components, either the statistical distance is at least , or at most ' << , where ' = p() for some polynomial p. In the case that some set of components all have pairwise statistical distance at most ', then the simple high-dimensional algorithm will never realize that these components correspond to separate components, and will simply return a single component in place of this set. The difficulty is when there exists some pair of components whose statistical distance lies within this bad window [p(), ]. In such an instance, the Simple High-Dimensional Algorithm has no provable guarantees.

To avoid the potential difficulty of finding a target accuracy for which no pair of components have statistical distance within the associated inadmissable window, one simply runs the high-dimensional algorithm with a range of target accuracies, 1, ..., k2 with i < p(i1). While we will never know which runs succeeded, there are at most pairwise statistical distances, and each pairwise statistical distance can fall into the inadmissible window of at most one run, and thus a majority of the runs will be successful. All that remains is to find a set of at least k2/2 runs which are consistent: given two sets of parameters returned by runs with target accuracies 1 < 2, we say they are consistent if there is some surjective mapping of the components returned by the 1 run into the components returned by the 2 run, such that each component has similar mean and covariance to its image. Thus, one can find such a chain of at least k2/2 consistent runs, yielding a set of accurate parameters.

### 6. Exponential Dependence on k

While the dependency of our algorithm on the number of components is super-exponential, we also describe a lower bound showing that at least an exponential dependency is necessary, even for mixtures in just one dimension. We show this by giving an explicit construction, for any k, of two one-dimensional GMMs F1, F2 consisting of at most k Gaussian components where the mixing weights and pairwise statistical distances between components are at least 1/2k. Additionally, in any correspondence between the components of one mixture F1 and the components of mixture F2, there is at least one component in F1 whose mean or variance differs by at least 1 from that of the corresponding component of F2. Nevertheless, Dtv(F1, F2) < 1/eO(k), and thus it is information theoretically impossible to distinguish a set of eo(k) samples from F1 from a set of eo(k) samples from F2, and hence impossible to return the component parameters to within ±1/2 with any probability greater than 0.6.

THEOREM 5. There exists a pair of GMMs, F1, F2 with at most k components each and condition numbers at most 2k such Dtv(F1, F2) = 1/eO(k), yet for any mapping between the components of F1 and F2, there will be a component whose variance differs by at least 1 from that of its image.

The construction hinges on the inverse exponential in k statistical distance between N(0, 2), and a mixture of infinitely many Gaussians of unit variance whose components are centered at multiples of , where the weight assigned to the component centered at is given by . Verifying this claim is an exercise in Fourier analysis. We then modify the construction slightly so that both GMMs have at most k components, and all components have weight at least 1/2k.

### 7. Conclusion

The primary contribution of our research is to get a first handle on the sample complexity and computational complexity of this problemhow many samples are required to learn, how much runtime is necessary, and in which parameters are these exponential or polynomial? While we do not know the optimal achievable rates, distinguishing between polynomial and exponential is a telling start.

Asymptotic guarantees are merely a guide as to which algorithms may perform well. Our current algorithm is not designed to be practical in any meaningful sense. However, we hope it opens the door to future work on algorithms that are both of practical utility and theoretically motivated, i.e., efficient estimators which do not suffer from local optima.

### References

1. Achlioptas, D., McSherry, F. On spectral learning of mixtures of distributions. In COLT (2005), 458469.

2. Arora, S., Kannan, R. Learning mixtures of arbitrary Gaussians. In STOC (2001), 247257.

3. Brainard, J., Burmaster, D.E. Bivariate distributions for height and weight of men and women in the United States. Risk Anal. 12, 2 (1992), 267275.

4. Brubaker, S.C., Vempala, S. Isotropic PCA and affine-invariant clustering. In FOCS (2008), 551560.

5. Dasgupta, S. Learning mixtures of Gaussians. In FOCS (1999), 634644.

6. Dasgupta, S., Schulman, L.J. A two-round variant of EM for Gaussian mixtures. In UAI (2000), 152159.

7. Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1 (1977), 138.

8. Feldman, J., Servedio, R.A., O'Donnell, R. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In COLT (2006), 2034.

9. Hsu, D., Kakade, S.M., Zhang, T. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

10. Hummel, R.A., Gidas, B.C. Zero crossings and the heat equation. Technical Report Number 111, Courant Institute of Mathematical Sciences at NYU, 1984.

11. Kannan, R., Salmasian, H., Vempala, S. The spectral method for general mixture models. SIAM J. Comput. 38, 3 (2008), 11411156.

12. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. London 185 (1984), 71110.

13. Redner, R.A., Walker, H.F. Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26 (1984), 195239.

14. Teicher, H. Identifiability of mixtures. Ann. Math. Statist. 32, 1 (1961), 244248.

15. Vempala, S. The random projection method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol.65. American Mathematical Society, 2004.

16. Vempala, S., Wang, G. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci. 68, 4 (2004), 841860.

### Authors

Ankur Moitra (moitra@ias.edu), Institute for Advanced Study, Princeton, NJ.

Gregory Valiant (gvaliant@eecs.berkeley.edu), University of California, Berkeley, Berkeley, CA.

### Footnotes

a. Moreover, guaranteeing low statistical distance implies low Euclidean distance, in the case where all parameters are bounded.

b. Runtime is measured in the Real RAM model of computing where arithmetic operations are performed on real numbers.

c. Axis-aligned Gaussians are those whose principal axes are parallel to the coordinate axes.

This work appeared as "Efficiently Learning Mixtures of Two Gaussians" (Kalai, Moitra, Valiant, STOC 2010) and "Settling the Polynomial Learnability of Mixtures of Gaussians" (Moitra, Valiant, FOCS 2010).

### Figures

Figure 1. The Gaussian approximations of the heights of adult women (red) and men (blue). Can one recover estimates of these Gaussian components given only the aggregate data without gender labels (black)? [Raw data from NHANES].

Figure 2. Illustration of the highlevel approach: (1) project the data onto a series of vectors and learn the parameters of the resulting one-dimensional GMMs, (2) determine a consistent labeling between the components of the recovered one-dimensional GMMs, and (3) for each component, combine the recovered one-dimensional parameters to reconstruct an estimate of the high-dimensional parameters.

Figure 3. An example of a GMM with three components F1, F2, F3, such that with high probability over random vectors, the one-dimensional projections of F2 and F3 will be very similar, despite Dtv(F2, F3) 1.

Figure. The Simple 1-D Algorithm

Figure. The Simple High-Dimensional Algorithm

Figure. The Full High-Dimensional Algorithm

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.