Systems and Networking Research highlights

Technical Perspective: Robust Statistics Tackle New Problems

Posted May 1 2021

The following paper represents the beginning of a long and productive line of work on robust statistics in high dimensions. While robust statistics has long been studied, going back at least to Tukey,⁶ the recent revival centers on algorithmic questions that were largely unaddressed by the earlier statistical work.

Robust statistics centers on the question of how to extract information from data that may have been corrupted in some way. The most common form of robustness, also considered here, is robustness to outliers: some fraction of the data has been removed and replaced with arbitrary, erroneous points. A familiar instance of robust statistics is using the median instead of the mean: the median is insensitive to a few extreme points, whereas a single overly large value can completely skew the mean.
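The contrast between the mean and the median can be seen in a small sketch. The numbers below are made up purely for illustration; one clean point is replaced by an arbitrary outlier, matching the corruption model described above.

```python
import statistics

# Hypothetical clean measurements clustered around 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]

# Corruption model: remove one point and replace it with an arbitrary value.
corrupted = clean[:-1] + [1000.0]

clean_mean, clean_median = statistics.mean(clean), statistics.median(clean)
bad_mean, bad_median = statistics.mean(corrupted), statistics.median(corrupted)

print(f"clean:     mean={clean_mean:.2f}  median={clean_median:.2f}")
print(f"corrupted: mean={bad_mean:.2f}  median={bad_median:.2f}")
```

The single replaced point drags the mean far from 10, while the median of the corrupted sample is unchanged.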

There are several generalizations of the median to higher dimensions, each with its own robustness properties. These include the geometric median (which finds the point with minimum average Euclidean distance to the input dataset) and the Tukey median (which finds the "deepest" point in the dataset). However, the geometric median is fragile when the dimension is large, and the Tukey median is NP-hard to compute. In general, the older statistics literature offered no estimators that were both efficient to compute and accurate in high-dimensional settings, although Donoho¹ and Donoho and Gasko² did construct several estimators toward this general end whose ideas are still relevant today.
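As a rough illustration of the geometric median, the sketch below uses Weiszfeld's iteration, a standard iteratively reweighted averaging scheme that is not taken from the paper under discussion. The function name `geometric_median`, the iteration count, and the toy dataset are all assumptions made for this example.

```python
import math

def geometric_median(points, iters=200, eps=1e-12):
    """Approximate the point minimizing the average Euclidean distance to `points`."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    y = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        denom = 0.0
        for p in points:
            # Weight each point by the inverse of its distance to the current
            # estimate; eps guards against division by zero.
            w = 1.0 / max(math.dist(p, y), eps)
            denom += w
            for j in range(dim):
                num[j] += w * p[j]
        y = [v / denom for v in num]
    return y

# Four clean points on the unit square plus one far-away outlier.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
gm = geometric_median(pts)
naive_mean = [sum(p[j] for p in pts) / len(pts) for j in range(2)]
print("geometric median:", gm)
print("mean:            ", naive_mean)
```

In this low-dimensional toy case the geometric median stays inside the unit square while the mean is pulled out to roughly (20.4, 20.4). The example does not demonstrate the high-dimensional fragility mentioned above.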

This brings us to the current paper by Diakonikolas et al., which constructs efficient robust estimators for several problems, including robust mean estimation of a Gaussian. For this problem, all prior robust estimators either had unfavorable error in high dimensions or could not be computed in polynomial time. (Concurrent work by Lai et al.⁵ also develops a polynomial-time estimator with slightly worse bounds, and earlier work by Klivans et al.⁴ introduces related ideas for robust classification.)

While there are a number of variations, two key techniques in this paper underlie many of the results. The first, which has been most commonly used in subsequent work, is the "filtering algorithm." The basic idea is to check whether the distribution has some desirable property (such as bounded covariance) under which robust recovery would be straightforward. If the property holds, then we are done; otherwise, we hope that a certificate of the violation will allow us to "filter out" bad points and then try again. In this way, we must eventually succeed: the desired property will eventually hold since there are only a finite number of bad points.
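The check-certify-filter loop can be sketched in a few lines. This is a deliberately simplified, deterministic variant and not the algorithm from the paper: it assumes the desirable property is "the largest eigenvalue of the empirical covariance is at most `var_bound`", treats the top eigenvector as the certificate of violation, and removes one point per round (the one with the largest squared projection onto that direction). The names `filtered_mean`, `top_eigenpair`, and `var_bound`, along with the toy data, are hypothetical.

```python
import math

def mean_vec(pts):
    n, d = len(pts), len(pts[0])
    return [sum(p[j] for p in pts) / n for j in range(d)]

def covariance(pts, mu):
    n, d = len(pts), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for p in pts:
        c = [p[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / n
    return cov

def top_eigenpair(cov, iters=100):
    """Power iteration: approximate largest eigenvalue and its eigenvector."""
    d = len(cov)
    v = [1.0 + j for j in range(d)]  # fixed, arbitrary starting vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return lam, v

def filtered_mean(points, var_bound):
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        mu = mean_vec(pts)
        lam, v = top_eigenpair(covariance(pts, mu))
        if lam <= var_bound:
            return mu, pts  # property holds: the plain mean is trusted
        # Certificate of violation: direction v has too much variance.
        # Filter out the point that contributes most along v, then retry.
        scores = [sum((p[j] - mu[j]) * v[j] for j in range(len(mu))) ** 2
                  for p in pts]
        pts.pop(max(range(len(pts)), key=scores.__getitem__))
    return mean_vec(pts), pts

# Toy data: 18 clean points on a small grid plus 2 identical outliers.
clean = [(float(x), float(y)) for x in (-1, 0, 1) for y in (-1, 0, 1)] * 2
data = clean + [(30.0, 30.0)] * 2
mu, kept = filtered_mean(data, var_bound=1.5)
print("naive mean:   ", mean_vec(data))
print("filtered mean:", mu, "using", len(kept), "points")
```

The loop terminates because each round removes a point, mirroring the finiteness argument above. Practical filters typically remove points more carefully (for example by thresholding or randomized removal) so that few good points are discarded; that refinement is omitted here.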

The second technique constructs an "approximate separation oracle" for a certain convex program, such that approximately solving the convex program leads to an accurate robust estimate. The twist is that the convex program depends on the quantity that we are supposed to estimate in the first place! But this is where the approximate separation oracle comes in: it turns out to be possible to construct this separation oracle without knowing the quantity itself. This particular technique has not been as widely employed as the filtering algorithm, but is spiritually related to later ideas such as the non-convex analysis of Zhu et al.⁷

Since these early papers, researchers have explored a number of algorithmic questions including robust classification and regression, robustness in list-decoding models, sum-of-squares proofs for robustness, and non-convex optimization for robust recovery. The recent work has also moved beyond algorithmic questions to unearth new statistical insights, such as new estimators that work under fairly weak statistical assumptions, connections to minimum distance functionals introduced by Donoho and Liu,³ and robustness to new forms of corruptions defined by transportation metrics.

For those interested in learning more, there are now several tutorials, expositions, and courses on these topics, including the thesis of one of the authors, a related course at University of Washington, and my own lecture notes.

DOI

10.1145/3453937

May 2021 Issue

Published: May 1, 2021

Vol. 64 No. 5

Page: 106
