In studying the genetic basis of a disease, it is now common to select a set of relevant genes G, and to measure how strongly they are expressed in cell samples from a group of patients, some healthy and some ill.1 The expression level of a gene is mapped to a value in [-1, 1]. Each patient’s data is then a vector with one entry per gene, or equivalently, a point in R^|G|. The size of G is frequently in the thousands, which makes the data high-dimensional by present standards.
Analyzing data in a high-dimensional space R^d is rife with challenges. First, many statistical estimators are accurate only when the number of data points (call it n) is orders of magnitude larger than d. Some models, such as histograms, have error rates of the form n^(-1/d), so that to halve the error, you need 2^d times as much data: bad news even for double-digit d. A second difficulty is computational, as typified by the ever-important 2-means clustering problem: divide a data set into two groups, each with a designated center, so as to minimize the average squared distance from data points to their closest centers. Naive approaches run in time n^d, which is astronomical even for smallish d, and NP-hardness results have dampened hopes for dramatically better algorithms. As a result, such problems are attacked with heuristics that offer no guarantees on their solutions. Understanding the quality of these schemes is doubly tricky because they are often justified on intuitive grounds, inevitably informed by our experience of 2D and 3D space, even though high-dimensional spaces are full of strange and counterintuitive effects.
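To make the n^(-1/d) scaling concrete, here is a small back-of-the-envelope sketch (my own illustration, not from the article): it simply evaluates how many samples an estimator with error n^(-1/d) needs to reach a fixed accuracy, and how quickly that requirement explodes with d.

```python
# Back-of-the-envelope: if error ~ n^(-1/d), then error <= eps needs n >= eps^(-d).
def samples_needed(eps, d):
    """Roughly the smallest n with n**(-1/d) <= eps."""
    return eps ** (-d)

for d in (2, 10, 20, 50):
    print(f"d = {d:2d}: ~{samples_needed(0.1, d):.0e} samples for 10% error")

# Halving the error multiplies the requirement by 2^d:
print(f"at d = 20, halving the error costs a factor of {2**20:,}")
```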
Such complications are collectively referred to as the curse of dimension: A superstition that the murky realm of high dimension brings bad luck. But mathematical analysis is starting to clear away the haze. A pleasant surprise is that the counterintuitive geometry of high-dimensional space, once properly characterized, can be exploited to defeat other facets of the curse.
Even a solid ball, the simplest of objects, has unusual aspects in high dimension. For d = 2 or d = 3, a set of points picked at random from the unit ball B_d = {x in R^d : ||x|| ≤ 1} will have some significant fraction near the origin, say within distance ½ of it. But we wouldn’t find even one such point in high dimension unless we were to draw exponentially many points from the ball. This is because points at distance r < 1 from the origin constitute an r^d fraction of B_d, and this fraction goes rapidly to zero with rising dimension. For large d, the supposedly filled-in ball B_d is in fact well approximated by a thin, hollow shell, {x in R^d : 1 − ε ≤ ||x|| ≤ 1} for ε = O(1/d).
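A quick Monte Carlo check of this effect (my own illustration, using NumPy; the point counts and dimensions are arbitrary). The standard sampling trick of scaling a random direction by U^(1/d) gives points uniform in the ball.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unit_ball(n, d):
    """Draw n points uniformly from the unit ball in R^d."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)      # this radial law makes the points uniform
    return directions * radii[:, None]

for d in (2, 10, 50, 200):
    pts = sample_unit_ball(50_000, d)
    near = np.mean(np.linalg.norm(pts, axis=1) <= 0.5)
    print(f"d = {d:3d}: fraction within 1/2 = {near:.5f}   (exact: {0.5**d:.2e})")
```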
Here’s another curiosity. Suppose we need lots of vectors in R^d that are orthogonal (at right angles) to each other. How many can we get? Exactly d, because this is the maximum number of linearly independent vectors. But if we only need them approximately orthogonal, with angles of 90 ± ε degrees rather than exactly 90, then we can find exponentially many vectors. A collection of exp(O(ε^2 d)) vectors picked at random from the surface of B_d will with high probability satisfy the angle constraint.
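This, too, is easy to observe numerically. The sketch below (my own, using NumPy) draws twice as many random unit vectors as there are dimensions and reports how far the worst pairwise angle strays from 90 degrees; the sizes d = 1,000 and k = 2,000 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 2000                    # twice as many vectors as dimensions

v = rng.standard_normal((k, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # random points on the unit sphere

cosines = v @ v.T                    # cos(angle) for every pair of unit vectors
np.fill_diagonal(cosines, 0.0)       # ignore each vector paired with itself
worst = 90.0 - np.degrees(np.arccos(np.abs(cosines).max()))
print(f"{k} vectors in R^{d}: every pair is within {worst:.1f} degrees of orthogonal")
```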
These examples hint at the strangeness of high-dimensional space. However, such effects do not directly help with data analysis because they pertain to very specialized sets of points—those chosen randomly from the unit ball—whereas real data sets might look different. The trick is to take an arbitrary data set and then add randomness to it in such a way that the outcome is helpful and predictable.
An early groundbreaking result of this kind was Dvoretzky’s theorem.2 Let K be a convex body in R^d: for instance, the feasible region of a linear program with d variables. K could be enormously complicated. But a random slice through K (of appropriate dimension) will with high probability be almost ellipsoidal. More precisely, it will contain an ellipsoid E and be contained within (1 + o(1))E.
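The theorem is easy to see in action for one concrete body. The sketch below (my own illustration, not from the article) takes K to be the cross-polytope {x : ||x||_1 ≤ 1} in R^10000, cuts it with a random two-dimensional plane through the origin, and measures how far the resulting slice is from a perfect disk.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000

# Orthonormal basis (u, v) for a uniformly random 2D plane through the origin.
q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
u, v = q[:, 0], q[:, 1]

# Boundary of the slice of K = {x : ||x||_1 <= 1}: in direction theta within the
# plane, the slice extends out to radius 1 / ||cos(theta)*u + sin(theta)*v||_1.
thetas = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
radii = np.array([1.0 / np.abs(np.cos(t) * u + np.sin(t) * v).sum() for t in thetas])

print(f"d = {d}: max radius / min radius of the slice = {radii.max() / radii.min():.3f}")
# A ratio near 1 means the random 2D section is almost a round disk.
```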
A more recent result is the Johnson-Lindenstrauss theorem.3 Take any n points in Euclidean space of arbitrarily high dimension. If they are projected into a random subspace of O(log n) dimensions, distances between points will with high probability be almost perfectly preserved. Since clustering and other forms of data analysis frequently depend only on interpoint distances, the dimension of data sets can automatically be reduced to O(log n).
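Here is a minimal numerical sketch of that statement (my own, using NumPy), with the simplest dense Gaussian random map rather than the fast transform of the paper that follows; the sizes n = 300, d = 10,000, and target dimension k = 1,000 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 10_000, 1_000        # k is on the order of log(n) / eps^2

x = rng.standard_normal((n, d))                    # some high-dimensional points
proj = rng.standard_normal((d, k)) / np.sqrt(k)    # random projection matrix
y = x @ proj                                       # the same points mapped into R^k

def pairwise_distances(a):
    """Euclidean distances between all rows of a, via the Gram-matrix identity."""
    sq = (a ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (a @ a.T), 0.0))

mask = ~np.eye(n, dtype=bool)                      # skip zero self-distances
ratios = pairwise_distances(y)[mask] / pairwise_distances(x)[mask]
print(f"distance ratios after projection: {ratios.min():.3f} to {ratios.max():.3f}")
```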
The following paper by Nir Ailon and Bernard Chazelle, "Faster Dimension Reduction," presents an ingenious variant of this theorem that allows the projection to be computed especially fast.