Communications of the ACM

ACM Careers

NSF Provides $10.5 Million Grant to Develop Software for Un-Curated Data

The U.S. National Science Foundation has awarded a team of researchers more than $10 million over the next five years to develop software to manage and make sense of vast amounts of digital scientific data.

Boston University Assistant Professor of Earth & Environment Michael Dietze and researchers in Dietze's Ecological Forecasting lab will apply the software tools the team develops to two case studies focused on the responses of forests to climate change. In particular, both case studies aim to improve the representation of forests and other vegetation in Earth System Models, which is currently the second-largest source of uncertainty in climate change projections, behind cloud physics.

The first case study will focus on extracting vegetation data from historical land survey information collected by the U.S. General Land Office in the 1800s under the Homestead Act. Machine learning will be used to interpret handwritten survey notes and convert them to maps of pre-settlement forest composition and biomass. These data will form an important baseline from which to calibrate Earth System Models and judge the impacts of human land use and climate change.

The second case study will focus on curating and synthesizing contemporary data on the forest carbon cycle collected by individual researchers around the world. These data are 'small data' when viewed individually, but collectively their information has the transformative potential of 'big data' to improve and calibrate Earth System Models. This case study will build upon an ongoing bioinformatics project in the Ecological Forecasting lab, the Predictive Ecosystem Analyzer, which has developed tools for making forest models more accessible and their analysis more automated. Within the Department of Earth and Environment, Professor Dietze and his Ecological Forecasting lab specialize in assimilating data into forest models in order to improve predictions of how climate change, forest management, and natural disturbance will affect forest biodiversity and the carbon cycle.

"The information age has made it trivial for anyone to create and share vast amounts of digital data, including unstructured collections of images, video, and audio as well as documents and spreadsheets," says project leader Kenton McHenry, who leads the Image and Spatial Data Analysis division of the National Center for Supercomputing Applications (NCSA) at the University of Illinois along with project co-principal investigator Jong Lee. "But the ability to search and use the contents of digital data has become exponentially more difficult."

That's because digital data become trapped in outdated, difficult-to-read file formats and because metadata — the critical data about the data, such as when and how and by whom it was produced — is nonexistent.

The NCSA team, partnering with faculty at the University of Illinois at Urbana-Champaign, Boston University, and the University of North Carolina at Chapel Hill, will develop two services to make the contents of un-curated data collections accessible. The Data Access Proxy will chain together open/save operations within software applications in order to seamlessly transform unreadable files into readable ones. The Data Tilling Service will serve as a framework for content analysis tools in order to automatically assign metadata to files within un-curated collections.
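To make the idea of chaining open/save operations concrete, here is a minimal sketch, not the project's actual implementation. It assumes each application can be described as a converter that opens one format and saves another, and it searches for the shortest chain of such conversions between two formats. The application names (`app_a` and so on), the format list, and the function `find_conversion_chain` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical converters: each entry says one application can open files in
# the first format and save them in the second. None of these names come from
# the Brown Dog project; they are placeholders for illustration.
CONVERTERS = [
    ("app_a", "wpd", "doc"),
    ("app_b", "doc", "odt"),
    ("app_c", "odt", "pdf"),
    ("app_d", "doc", "txt"),
]


def find_conversion_chain(source, target, converters=CONVERTERS):
    """Return the shortest list of (app, from_format, to_format) steps.

    Uses breadth-first search over formats. Returns an empty list when no
    conversion is needed and None when no chain of converters exists.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, path = queue.popleft()
        for app, src, dst in converters:
            # Skip converters that cannot open this format, and formats
            # that were already reached by a chain at least as short.
            if src != fmt or dst in seen:
                continue
            new_path = path + [(app, src, dst)]
            if dst == target:
                return new_path
            seen.add(dst)
            queue.append((dst, new_path))
    return None


if __name__ == "__main__":
    for step in find_conversion_chain("wpd", "pdf") or []:
        print(step)
```

In this sketch a file in an unreadable format is made readable by passing it through each application in the returned chain in turn. A real service would also have to weigh how much information each conversion loses, which this example ignores.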

McHenry likens these two services to the Domain Name System (DNS), which makes the Internet easily navigable by translating domain names into the numerical IP addresses needed to locate computer devices and services and the information they provide.

"The two services we're developing are like a DNS for data, translating inaccessible un-curated data into information," he says.

Rather than starting from scratch and constructing a single piece of software, the NCSA team is building on its previous software development work and aims to use every possible source of automatable help already in existence. By patching together these various components, the team plans to build a "super mutt" of software, which it calls "Brown Dog."

The initial targets for the software are projects in geoscience, biology, engineering, and social science, but McHenry says the software could also be broadly useful to help manage individuals' ever-growing collections of photos, videos, and unstructured/un-curated data on the web.

Project collaborators include Barbara Minsker, professor and Arthur and Virginia Nauman Faculty Scholar and Faculty Affiliate at NCSA, and Praveen Kumar, Lovell Professor, both in the Department of Civil and Environmental Engineering at the University of Illinois at Urbana-Champaign; Michael Dietze, assistant professor in the Department of Earth and Environment at Boston University; and Richard Marciano, director of the Sustainable Archives & Leveraging Technologies lab at the University of North Carolina at Chapel Hill. The team also includes William Sullivan, Arthur Schmidt, and Jerome McDonough at the University of Illinois at Urbana-Champaign; Jay Alameda at NCSA; Luigi Marini and Rob Kooper as senior development staff and software architects within NCSA's Image and Spatial Data Analysis (ISDA) division; and Dave Mattson as project manager.

