
Communications of the ACM

ACM TechNews

Data Mining the Past


Digitizing a newspaper article.

The algorithm ranks people's names by importance based on a number of attributes, including the context of the name, any title preceding the name, article length, and how frequently the name is mentioned in an article. The algorithm learns these attributes from the text.

Credit: townswebarchiving.com

Researchers at the State University of New York at Buffalo and India's International Institute of Information Technology Bangalore have developed an algorithm to convert old newspapers into searchable data by identifying and ranking people's names in order of their importance.

The researchers worked with the New York Public Library to analyze over 14,000 articles from The Sun published in November and December of 1894.

The algorithm relies exclusively on attributes drawn from text produced by optical character recognition (OCR) software, such as the context of each name, any title preceding it, article length, and how often the name is mentioned in an article.

Because the OCR text was garbled, the researchers modeled the attributes statistically and tested the algorithm on both raw OCR-generated text and articles cleaned up manually by schoolchildren.

They found that the algorithm ranked names with high precision even on the raw OCR text, as measured against the manually cleaned-up versions.
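
To make the idea of attribute-based ranking concrete, here is a minimal Python sketch. It is not the researchers' method: they modeled the attributes statistically and learned them from the text, while this toy version uses hand-picked weights. The function name `rank_names`, the title list, and the weights are all hypothetical, and the candidate names are assumed to have been identified already.

```python
import re
from typing import List, Tuple

# Hypothetical values chosen only for illustration; the article does not
# give the titles or weights used by the researchers.
TITLES = {"Mr.", "Mrs.", "Dr.", "Gov.", "Mayor", "Judge"}
W_FREQ = 1.0   # weight for mentions per 100 words
W_TITLE = 5.0  # bonus when a title precedes any mention of the name


def rank_names(article: str, names: List[str]) -> List[Tuple[str, float]]:
    """Score each candidate name and return (name, score) pairs, best first."""
    n_words = max(len(article.split()), 1)
    scored = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        starts = [m.start() for m in re.finditer(pattern, article)]

        # Mentions per 100 words, so article length is taken into account.
        freq = 100.0 * len(starts) / n_words

        # 1 if the word just before any mention is a known title, else 0.
        titled = 0
        for start in starts:
            preceding = article[:start].split()
            if preceding and preceding[-1] in TITLES:
                titled = 1
                break

        scored.append((name, W_FREQ * freq + W_TITLE * titled))

    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    text = (
        "Mayor Strong met Mr. Platt today. "
        "Mayor Strong spoke at length. John Doe watched."
    )
    for name, score in rank_names(text, ["Strong", "Platt", "John Doe"]):
        print(f"{name}: {score:.2f}")
```

In this sketch, a name that appears often and carries a title rises to the top. A real system working on garbled OCR text would also need to tolerate misspelled names and titles, which exact string matching does not.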

From University at Buffalo News Center

 

Abstracts Copyright © 2021 SmithBucklin, Washington, DC, USA


 
