Communications of the ACM

ACM Careers

The Plan to Mine the World's Research Papers

Carl Malamud in front of the data store of 73 million articles that he plans to let scientists text mine.

Credit: Smita Sharma / Nature

Carl Malamud is on a crusade to liberate information locked up behind paywalls. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Now, the 60-year-old American technologist is setting his sights on a new objective: freeing paywalled scientific literature.

Over the past year, Malamud has, without asking publishers, teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 to the present day. The cache, which is still being created, will be kept in a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.

No one will be allowed to read or download work from the repository, because that would breach publishers' copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world's scientific literature to pull out insights without actually reading the text.

The unprecedented project could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis.

But the depot's legal status isn't yet clear.

From Nature