Communications of the ACM

ACM TechNews

A Data-Cleaning Tool For Building Better Prediction Models

Researchers show how ActiveClean rapidly improves model accuracy.

Software developed by computer scientists at Columbia University and the University of California, Berkeley, handles much of the dirty work of cleaning massive datasets.

Credit: Columbia University

Researchers at Columbia University and the University of California, Berkeley, have developed software that takes humans out of the most error-prone steps of cleaning big data.

The research team says ActiveClean is designed to analyze a user's prediction model to decide which mistakes to edit first, while updating the model as it works.

The researchers note the system uses machine learning to analyze a model's structure to understand what sorts of errors will throw the model off most. They say ActiveClean targets that data first, in decreasing priority, and cleans just enough data to give users assurance their model will be reasonably accurate. The researchers say with each pass, users see their model improve.
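The prioritize, clean, and update loop described above can be illustrated with a small sketch. It is not ActiveClean's actual algorithm, which this abstract does not detail. It assumes a linear model with squared-error loss, ranks records by the size of their loss gradient under the current model, and relies on a hypothetical `clean_fn` standing in for whatever process corrects a record. All function and parameter names here are illustrative.

```python
def predict(w, x):
    """Linear prediction w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def grad_norm(w, x, y):
    """Size of the squared-error gradient for one record (up to a constant)."""
    err = predict(w, x) - y
    return abs(err) * sum(xi * xi for xi in x) ** 0.5


def fit(w, data, lr=0.01, epochs=200):
    """Plain stochastic gradient descent, warm-started from w."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def prioritized_clean(records, clean_fn, budget, batch_size, w0):
    """Clean up to `budget` records, highest-impact first, refitting each pass.

    records:    list of (features, label) pairs, possibly dirty
    clean_fn:   hypothetical oracle, clean_fn(index, record) -> corrected record
    batch_size: positive number of records to clean per pass
    """
    records = list(records)
    cleaned = set()
    w = fit(w0, records)  # initial model trained on the dirty data
    while len(cleaned) < budget:
        candidates = [i for i in range(len(records)) if i not in cleaned]
        if not candidates:
            break
        # Records that pull hardest on the current model are cleaned first.
        candidates.sort(key=lambda i: grad_norm(w, *records[i]), reverse=True)
        take = min(batch_size, budget - len(cleaned))
        for i in candidates[:take]:
            records[i] = clean_fn(i, records[i])
            cleaned.add(i)
        w = fit(w, records)  # update the model after each cleaning pass
    return w, cleaned
```

Refitting after every pass matters in this sketch: once the most damaging records are repaired, the model shifts, and the ranking of the remaining records changes with it.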

With no data cleaning, a model trained on the Dollars for Docs dataset could detect an improper donation just 66% of the time. ActiveClean raised the detection rate to 90% by cleaning just 5,000 records, according to the researchers. By comparison, they say an active learning method required 10 times as much data, or 50,000 records, to reach a comparable detection rate.

"Dirty data is pervasive and prevents people from doing useful things," says Eugene Wu, a member of Columbia's Data Science Institute. "This is our first step towards automating the data-cleaning process."

From Columbia University


Abstracts Copyright © 2016 Information Inc., Bethesda, Maryland, USA

