Systems and Networking Research highlights

Technical Perspective: Robust Statistics Tackle New Problems

Posted May 1 2021


The following paper represents the beginning of a long and productive line of work on robust statistics in high dimensions. While robust statistics has long been studied, going back at least to Tukey,⁶ the recent revival centers on algorithmic questions that were largely unaddressed by the earlier statistical work.

Robust statistics centers on the question of how to extract information from data that may have been corrupted in some way. The most common form of robustness, also considered here, is robustness to outliers: some fraction of the data has been removed and replaced with arbitrary, erroneous points. A familiar instance of robust statistics is using the median instead of the mean: the median is far less sensitive to extreme points, whereas a single overly large value can completely skew the mean.
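A minimal sketch of the median-versus-mean contrast, using made-up numbers chosen only for illustration: one of five measurements is replaced with an arbitrary large value, and the two estimates are compared.

```python
from statistics import mean, median

# Five hypothetical measurements clustered around 10 (illustrative values).
clean = [9.8, 10.1, 10.0, 9.9, 10.2]

# Corrupt the data by replacing a single point with an arbitrary large value.
corrupted = clean[:-1] + [1e6]

clean_mean, clean_median = mean(clean), median(clean)
bad_mean, bad_median = mean(corrupted), median(corrupted)

print("mean:  ", clean_mean, "->", bad_mean)      # dragged far from 10
print("median:", clean_median, "->", bad_median)  # stays inside the clean range
```

With one corrupted point out of five, the median still lands on one of the clean values, while the mean is determined almost entirely by the outlier.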

There are several generalizations of the median to higher dimensions, each with its own robustness properties. These include the geometric median (the point with minimum average Euclidean distance to the input dataset) and the Tukey median (the "deepest" point in the dataset). However, the geometric median is fragile when the dimension is large, and the Tukey median is NP-hard to compute. In general, the older statistics literature offered no estimators that were both efficient to compute and accurate in high-dimensional settings, although Donoho¹ and Donoho and Gasko² did construct several estimators toward this general end whose ideas are still relevant today.
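To make the geometric median concrete, here is a small sketch that approximates it with Weiszfeld's iteration, a standard method that is not discussed in the article and is used here only as an assumption for illustration. The function name, the iteration count, the `eps` guard, and the toy data are all hypothetical. The example is two-dimensional, so it does not exhibit the high-dimensional fragility mentioned above.

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the point minimizing the average Euclidean distance
    to `points`, via Weiszfeld's iteratively reweighted averaging."""
    dim = len(points[0])
    n = len(points)
    # Start from the coordinate-wise mean.
    y = [sum(p[i] for p in points) / n for i in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the
        # current estimate; eps guards against division by zero.
        weights = [1.0 / max(math.dist(p, y), eps) for p in points]
        total = sum(weights)
        y = [sum(w * p[i] for w, p in zip(weights, points)) / total
             for i in range(dim)]
    return y

# Four points on the unit square plus one far-away outlier (toy data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1000.0, 1000.0)]
gm = geometric_median(points)
avg = [sum(p[i] for p in points) / len(points) for i in range(2)]

print("geometric median:", gm)
print("mean:            ", avg)
```

On this toy input the geometric median stays near the four clustered points, while the coordinate-wise mean is pulled far toward the outlier.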

This brings us to the current paper by Diakonikolas et al., which constructs efficient robust estimators for several problems, including robust mean estimation of a Gaussian. For this problem, all prior robust estimators either had unfavorable error in high dimensions or could not be computed in polynomial time. (Concurrent work by Lai et al.⁵ also develops a polynomial-time estimator with slightly worse bounds, and earlier work by Klivans et al.⁴ introduces related ideas for robust classification.)

While there are a number of variations, two key techniques in this paper underlie many of the results. The first, which has been the more commonly used in subsequent work, is the "filtering algorithm." The basic idea is to check whether the data has some desirable property (such as bounded covariance) under which robust recovery is straightforward. If the property holds, then we are done; otherwise, we hope that a certificate of violation of the property will allow us to "filter out" bad points and then try again. In this way, we must eventually succeed: the desired property will eventually hold since there are only a finite number of bad points.
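The check-then-filter loop can be sketched as follows. This is a deliberately simplified illustration, not the procedure from the paper: it assumes inliers with roughly identity covariance, uses a hypothetical variance `threshold` of 2.0, and removes only the single most extreme point per round, whereas the actual algorithm uses carefully chosen thresholds and removal rules. All function names and the synthetic data are assumptions.

```python
import math
import random

def mean_vec(points):
    n, d = len(points), len(points[0])
    return [sum(p[i] for p in points) / n for i in range(d)]

def top_direction(points, mu, iters=50, seed=0):
    """Power iteration on the empirical covariance. Returns a unit
    direction v, the variance along v, and each point's projection."""
    d, n = len(mu), len(points)
    centered = [[p[i] - mu[i] for i in range(d)] for p in points]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(x[i] * v[i] for i in range(d)) for x in centered]
        # w = Cov @ v, computed as the average of (x . v) * x.
        w = [sum(s * x[i] for s, x in zip(proj, centered)) / n
             for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    norm_v = math.sqrt(sum(c * c for c in v)) or 1.0
    v = [c / norm_v for c in v]
    proj = [sum(x[i] * v[i] for i in range(d)) for x in centered]
    variance = sum(s * s for s in proj) / n
    return v, variance, proj

def filter_mean(points, threshold=2.0):
    """Simplified filter: while some direction has variance above
    `threshold`, drop the point most extreme along it and retry."""
    points = list(points)
    while len(points) > 1:
        mu = mean_vec(points)
        _, variance, proj = top_direction(points, mu)
        if variance <= threshold:
            return mu  # property holds: the plain mean is trusted
        # The high-variance direction is the certificate of violation.
        worst = max(range(len(points)), key=lambda i: abs(proj[i]))
        points.pop(worst)
    return mean_vec(points)

# Synthetic data: 200 standard Gaussian inliers in 5 dimensions with
# true mean 0, plus 20 identical outliers placed far away.
rng = random.Random(1)
dim = 5
inliers = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
outliers = [[8.0] * dim for _ in range(20)]
data = inliers + outliers

naive = mean_vec(data)
robust = filter_mean(data)
naive_err = math.sqrt(sum(c * c for c in naive))
robust_err = math.sqrt(sum(c * c for c in robust))
print("error of plain mean:   ", naive_err)
print("error of filtered mean:", robust_err)
```

Termination follows the same argument as in the text: each round removes a point, and once the outliers are gone the remaining data satisfies the variance bound, so the loop stops.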

The second technique constructs an "approximate separation oracle" for a certain convex program, such that approximately solving the convex program leads to an accurate robust estimate. The twist is that the convex program depends on the quantity that we are supposed to estimate in the first place! But this is where the approximate separation oracle comes in: it turns out to be possible to construct this separation oracle without knowing the quantity itself. This particular technique has not been as widely employed as the filtering algorithm, but it is spiritually related to later ideas such as the non-convex analysis of Zhu et al.⁷


Since these early papers, researchers have explored a number of algorithmic questions including robust classification and regression, robustness in list-decoding models, sum-of-squares proofs for robustness, and non-convex optimization for robust recovery. Recent work has also moved beyond algorithmic questions to unearth new statistical insights, such as new estimators that work under fairly weak statistical assumptions, connections to minimum distance functionals introduced by Donoho and Liu,³ and robustness to new forms of corruption defined by transportation metrics.

For those interested in learning more, there are now several tutorials, expositions, and courses on these topics, including the thesis of one of the authors, a related course at the University of Washington, and my own lecture notes.


DOI: 10.1145/3453937

May 2021 Issue

Published: May 1, 2021

Vol. 64 No. 5

Page: 106
