Over the last decade, technologists have continued to push the envelope by creating more powerful computers and greater disk storage capacities. Advances in network technology have allowed programmers to link resources at increasingly sophisticated levels: the Internet let computers communicate, the Web brought tremendous amounts of data online, and today the global grid community is integrating distributed resources into cohesive, virtual supercomputers. The next challenge will be to extract knowledge from terabytes and even petabytes of online data collections using grid technologies and grid applications "intelligent" enough to access, integrate, and analyze these collections.
The dramatic possibilities for today’s computing environment have evolved from the first electronic computers, which were built to solve cryptography and ballistic firing trajectory problems for the military. These devices fulfilled a need not easily or cost-effectively met by people alone. The first computers focused on solving specific problems but quickly evolved into general-purpose machines. In the early 1960s Fortran IV became available, and scientific computing began in earnest. Since then, computing hardware and software have enabled advances in science and provided a mechanism for tackling the problems of modern life. We can now undertake terascale and even petascale problems requiring complex combinations of computation, data management, input from remote instruments, and visualization.
What we must achieve is the ability to harness the power of computers and networks to make sense of the mountains of data increasingly available in electronic form. The first decade of the new millennium is the "Data Decade," in which data growth is outpacing computational growth and many of the most important advances in science and engineering will result from the tight coupling of computation with the online analysis and synthesis of massive data collections. Biomedical images, repositories of proteins, stream gauge measurements of our waterways, digital maps of the world and the universe, and the search for the basic elements of matter are just some examples of collections that will provide raw material for online search and analysis. The acceleration of this torrent of data, together with our increasing need to curate, analyze, and synthesize it to achieve scientific advances, mandates a paradigm shift in the way we structure critical data-oriented applications.
The Data Decade will have a great impact on the next generation of computing hardware and software. The evolving importance of data management mandates that we consider not only teraflops and petaflops as metrics of importance, but terabytes and petabytes as well. Development of computational platforms that provide easy and online access to data will be critical to an increasing number of applications. Understanding enormous distributed data collections and making scientific inferences from them via simulation, modeling, and analysis are among the challenges that will be addressed directly by the TeraGrid project recently funded by the NSF.
The TeraGrid was proposed in response to a National Science Foundation solicitation and awarded in August 2001. The NSF solicitation outlined a distributed terascale facility (DTF) that embodied the agency’s vision of a national cyberinfrastructure. The $53 million award was made to the San Diego Supercomputer Center at the University of California, San Diego, the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign, Caltech, and Argonne National Lab to deploy a facility whose initial nodes will be at these four institutions.
More than half a petabyte of disk storage and a 40Gb/s national optical backbone complement 13 teraflops of aggregate compute power across the four sites. The architecture is based on commodity clusters, huge repositories of spinning disks, and networks 16 times faster than today’s fastest research networks. The software architecture recognizes the worldwide movement toward Linux and the increasing acceptance of Globus by the grid community. The TeraGrid provides the cornerstone of a national grid infrastructure and involves extensive software development by the four principal sites and the many participants in NSF’s Partnerships for Advanced Computational Infrastructure (PACI) program.
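Because the software stack centers on Linux and Globus, application codes would typically reach remote TeraGrid compute nodes through Globus middleware. The following is a minimal sketch of what such a submission might look like using the Globus Toolkit's globusrun client and its Resource Specification Language (RSL); the gatekeeper contact, executable, and data paths are hypothetical placeholders, not actual TeraGrid endpoints.

```python
import subprocess

# Hypothetical GRAM gatekeeper contact string for one TeraGrid node
# (host name and job manager are placeholders for illustration only).
GATEKEEPER = "tg-login.sdsc.example.org/jobmanager-pbs"

# Minimal RSL job description: run a 32-way analysis code against a
# data collection staged on the site's disk (paths are illustrative).
RSL = (
    "&(executable=/usr/local/apps/analyze)"
    "(arguments=--input /scratch/collections/proteins)"
    "(count=32)"
)

# Hand the job to the remote resource through the globusrun client;
# -o streams the job's output back to the caller. Assumes the Globus
# Toolkit is installed and valid grid credentials are in place.
subprocess.run(["globusrun", "-o", "-r", GATEKEEPER, RSL], check=True)
```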
The TeraGrid is an ambitious and exciting project. One of the most difficult challenges for application developers will be to leverage the immense capacity of the TeraGrid to provide online access to massive amounts of data. In particular, the synthesis of knowledge from data is among the most challenging applications for the TeraGrid. This class of applications will allow the TeraGrid to achieve its potential and enable its full use as a "knowledge grid."
Development of a knowledge grid will require the design and deployment of sophisticated tools that allow application developers to synthesize knowledge from data through mining, inference, and other techniques. Such knowledge synthesis tools and services will let scientists focus on culling usable information from the mass of available data. Contrast the tools used early on for Web searches with today’s sophisticated search engines: the first search tools provided targeted, text-only results, whereas modern search engines can make inferences, answer questions, and draw conclusions from masses of raw data. Integrating the ability to synthesize data into useful and usable information with the ability to perform sophisticated large-scale computation will enable a generation of new and revolutionary results.
Computer and computational scientists at the San Diego Supercomputer Center, within the PACI program, and throughout the science and engineering community are developing the next-generation systems that will allow the TeraGrid to function as a knowledge grid. For example, the Biomedical Informatics Research Network (BIRN) (www.nbirn.net) is being established to allow brain researchers at geographically distributed advanced imaging centers to share data acquired at multiple scales, or at the same scale but from different subjects. Some of these images constitute an evolving reference set that can then be used as essential data in assessing the status of therapies for neurological disorders such as multiple sclerosis or Alzheimer’s disease. Such applications are made possible by the availability of a distributed data grid with hundreds of terabytes of data online, enabling the TeraGrid to be used fully as a knowledge grid.
A knowledge grid is the convergence of a comprehensive computational infrastructure with the scientific data collections and applications that routinely support the synthesis of knowledge from those data. Grid-enabled software tools will be developed and deployed on terascale and petascale hardware. The TeraGrid is the critical starting point in the development of an infrastructure that responsibly meets the needs of the Data Decade. The integration of the data, computing, and networking hardware, the development of the software, and the coordination of a large and distributed human infrastructure pose difficult challenges. But the promise of the knowledge grid is vast, and the community as a whole must rise to this challenge in the Data Decade and lead us into a new decade of knowledge.