
Communications of the ACM

ACM TechNews

System Cleans Messy Data Tables Automatically


The PClean system.

Massachusetts Institute of Technology researchers have created a new system that automatically cleans dirty data.

Credit: MIT News

A system developed by researchers at the Massachusetts Institute of Technology (MIT) automatically cleans "dirty data," correcting problems such as typos, misspellings, duplicates, missing values, and inconsistencies.

PClean combines background information about the database and possible issues with common-sense probabilistic reasoning to make judgment calls for specific databases and error types.

Its repairs are based on Bayesian reasoning, which weighs prior knowledge against ambiguous data to determine the most probable correct value, and the system can provide calibrated estimates of its uncertainty.
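
To illustrate the kind of Bayesian judgment call described here, the following is a minimal Python sketch. It is not PClean's code or API. The city names, the prior probabilities, and the per-edit typo rate are hypothetical values chosen for illustration. The sketch multiplies a prior over clean values by the likelihood of the observed entry under a simple typo model, then normalizes to get a posterior.

```python
# Hypothetical prior: how often each city appears in clean records (assumed values).
PRIOR = {"Boston": 0.6, "Austin": 0.3, "Houston": 0.1}


def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def likelihood(observed, true_value, typo_rate=0.1):
    """Assumed error model: every edit multiplies the probability by typo_rate."""
    return typo_rate ** edit_distance(observed, true_value)


def posterior(observed):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {city: p * likelihood(observed, city) for city, p in PRIOR.items()}
    total = sum(unnormalized.values())
    return {city: w / total for city, w in unnormalized.items()}


if __name__ == "__main__":
    for city, prob in sorted(posterior("Bostn").items(), key=lambda kv: -kv[1]):
        print(f"{city}: {prob:.4f}")
```

Under these assumed numbers, the entry "Bostn" is repaired to "Boston" with a posterior probability of about 0.99. The remaining probability mass on the other cities is the sort of uncertainty estimate the summary mentions.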

The researchers found that PClean, with just 50 lines of code, outperformed benchmarks in both accuracy and runtime.

From MIT News

Abstracts Copyright © 2021 SmithBucklin, Washington, DC, USA


 
