“As research on so many fronts is becoming increasingly dependent on computation, all science, it seems, is becoming computer science,” heralded science journalist George Johnson [7] in 2001. The U.K. Research Councils define e-science as “large-scale science carried out through distributed global collaborations enabled by networks, requiring access to very large data collections, very large-scale computing resources, and high-performance visualization.”1 These advanced computing technologies enable discipline scientists to study and better understand complex systems—physical, geological, biological, environmental, atmospheric—from the micro to the macro level, in both time and space.
This growing dependence on information technology and the benefits to research and society at large from persistent collaboration over even intercontinental distances, coupled with the ability to process, disseminate, and share information on unprecedented scales, have prompted U.S. federal agencies, notably the National Science Foundation (NSF), Department of Energy (DOE), National Institutes of Health (NIH), and NASA, to fund the cyberinfrastructure needed to empower e-science research and allied education [1].
The high-energy and nuclear physics (HENP) community is the most advanced in its efforts to develop globally connected, Grid-enabled, data-intensive systems [2]. HENP experiments are breaking new ground in our common understanding of the unification of forces, the origin and stability of matter, and the structures and symmetries governing the nature of matter and space-time in the universe. Among the principal goals at the high-energy frontier are finding the mechanism responsible for mass in the universe, the Higgs particles associated with mass generation, and the fundamental mechanism that led to the predominance of matter over antimatter in the observable cosmos (see Figure 1).
Experimentation at increasing energy scales, along with the increasing sensitivity and complexity of measurements, has increased the scale and cost of detectors and particle accelerators, as well as the size and geographic dispersion of scientific collaborations. The largest collaborations today include the Compact Muon Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), each building an experiment for the European Laboratory for Particle Physics (CERN) Large Hadron Collider (LHC) program and each involving 2,000 physicists from 150 institutions in 36 countries. The current generation of experiments includes BaBar at the Stanford Linear Accelerator Center (SLAC) and Dzero and the Collider Detector at Fermilab at the Fermi National Accelerator Laboratory, Batavia, IL.2
To improve the scientific community’s understanding of the fundamental constituents of matter and the nature of space-time itself, researchers must isolate and measure rare events. The Higgs particles, thought to be responsible for mass in the universe, are typically produced in only one of 10¹³ interactions. A new generation of particles at the upper end of the LHC’s energy reach may be produced only at the rate of a few events per year, or one event in 10¹⁵ to 10¹⁶ events.
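A rough calculation shows why such rarities translate into only a handful of usable events. The sketch below is illustrative only; the interaction rate of roughly 10⁹ per second and the roughly 10⁷ seconds of data-taking per year are assumptions, not figures from the article. Dividing the resulting ~10¹⁶ annual interactions by the quoted rarities gives on the order of a thousand Higgs candidates, and only a few frontier events, per year.

    # Back-of-the-envelope arithmetic for the rarities quoted above.
    # Both figures below are assumptions for illustration, not values from
    # the article: roughly 1e9 interactions per second and about 1e7
    # seconds of data-taking per year.
    INTERACTIONS_PER_SECOND = 1e9
    SECONDS_PER_YEAR = 1e7

    interactions_per_year = INTERACTIONS_PER_SECOND * SECONDS_PER_YEAR  # ~1e16

    for rarity, label in [(1e13, "Higgs-like signal (about 1 in 10^13)"),
                          (1e16, "new physics at the energy frontier (about 1 in 10^16)")]:
        print(f"{label}: ~{interactions_per_year / rarity:,.0f} events per year")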
The key to discovery is the ability to detect a signal with high efficiency and in many cases (such as the Higgs searches) also measure the energies and topology of the signal events precisely while suppressing large, potentially overwhelming, backgrounds. Physicists want to scan through the data repeatedly, devising new or improved methods of measuring and isolating the “new physics” events with increasing selectivity as they progressively learn how to suppress the backgrounds. The initial strategies for separating out the Higgs and other signals will rely on massive simulations performed with distributed Grid facilities. These strategies will evolve as the LHC data accumulates and a more precise picture of the backgrounds needing to be suppressed is obtained by studying the data itself.
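As a toy illustration of this kind of iterative selection, the following sketch tunes a single cut on simulated events to maximize an approximate significance S/√B. Everything in it is hypothetical, including the toy signal and background distributions and the variable being cut on; real LHC analyses rely on full detector simulation and far more sophisticated multivariate methods.

    # Toy sketch of cut-based selection tuning on simulated events.
    # All distributions and numbers here are hypothetical, for illustration only.
    import random

    random.seed(1)

    # Toy "simulated" events: the signal clusters at higher transverse momentum (pT).
    signal = [random.gauss(120.0, 15.0) for _ in range(1_000)]            # hypothetical signal pT values
    background = [random.expovariate(1 / 40.0) for _ in range(100_000)]   # hypothetical background pT values

    def significance(cut):
        """Approximate signal significance S / sqrt(B) for a given pT cut."""
        s = sum(1 for pt in signal if pt > cut)
        b = sum(1 for pt in background if pt > cut)
        return s / b ** 0.5 if b > 0 else 0.0  # guard against an empty background sample

    # Scan candidate cuts and keep the one with the best significance,
    # mimicking the progressive refinement of selections described above.
    best_cut = max(range(0, 200, 5), key=significance)
    print(f"best pT cut: {best_cut} GeV, significance ~ {significance(best_cut):.1f}")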
HENP scientists designing these experiments expect their data volumes to increase from the multi-petabyte to the exabyte (10¹⁸ bytes) range within the next 10 to 15 years; at the same time, the corresponding network speed requirements on each of the major links used in the field are expected to increase from 10Gbps to the Tbps range. Shaping the future cyberinfrastructure, HENP researchers are working with computer scientists to codevelop advanced networking testbeds (see the article by DeFanti et al. in this section) and Grid middleware systems (see the article by Foster and Grossman in this section).
Model Cyberinfrastructure
While the term e-science is relatively new, the concept of computational science was first popularized with the advent of the national supercomputer centers in 1986. Since becoming operational in 1994, NASA’s Earth Observing System Data and Information System (EOSDIS) has managed data from the agency’s earth science research satellites and field measurement programs at eight Distributed Active Archive Centers around the U.S. and at six foreign sites, providing data archiving, distribution, and information management services. Today, it manages and distributes data from EOS missions (Landsat-7, QuikSCAT, Terra, and ACRIMSAT); pre-EOS missions (UARS, SeaWiFS, TOMS-EP, TOPEX/Poseidon, and TRMM); and Earth Science Enterprise legacy data. The Terra spacecraft alone produces 194GB of raw data per day, and Landsat-7 produces 150GB per day—numbers that can quadruple after the data is processed. Volumes have only increased since 2000, when EOSDIS supported approximately 104,000 users and filled 3.4 million product requests. EOSDIS makes a strong case that the needs of a globally distributed research community, dependent on scientific instruments collecting terabytes of data daily, are manageable only through a distributed cyberinfrastructure. HENP, along with other discipline sciences, is extending and modernizing this model.
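To put the quoted daily volumes in perspective, a minimal bit of arithmetic, using only the figures from the text plus the stated quadrupling after processing, shows that these two missions alone approach half a petabyte of products per year:

    # Annualizing the daily volumes quoted above (illustrative arithmetic only).
    TERRA_GB_PER_DAY = 194       # raw data, from the text
    LANDSAT7_GB_PER_DAY = 150    # raw data, from the text
    PROCESSING_FACTOR = 4        # "numbers that can quadruple after the data is processed"

    raw_daily_gb = TERRA_GB_PER_DAY + LANDSAT7_GB_PER_DAY
    processed_daily_gb = raw_daily_gb * PROCESSING_FACTOR
    annual_tb = processed_daily_gb * 365 / 1_000

    print(f"raw: ~{raw_daily_gb} GB/day, processed: ~{processed_daily_gb} GB/day")
    print(f"roughly {annual_tb:,.0f} TB per year from these two missions alone")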
The LHC experiments, for example, adopted the Data Grid Hierarchy, or a structured ensemble of computing and data-handling facilities interconnected by networks, developed at the California Institute of Technology (see Figure 2). Data generated at the experiment is filtered, processed in real time, and stored at the rate of 100–1,500Mbps during much of the year (typically 200 days), resulting in petabytes per year of stored and processed binary data accessed and processed repeatedly by worldwide collaborators. Following initial processing and storage at the Tier0 facility at CERN, data is distributed over high-speed networks to approximately 10 national Tier1 centers in the U.S., Europe, and elsewhere. The data is further processed, analyzed, and stored at approximately 50 Tier2 regional centers, each serving small- to medium-size countries or regions of larger countries, including the U.S., the U.K., and Italy. Data subsets are accessed and further analyzed by physics groups through hundreds of Tier3 work-group servers and/or thousands of Tier4 desktops worldwide. Being able to use this global ensemble of systems depends on the development of Data Grids, or distributed data stores connected via high-speed networks, capable of managing and marshalling Tier-N resources and supporting collaborative software development around the world.
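A quick sanity check connects the quoted storage rate to the "petabytes per year" figure. Only the 100–1,500Mbps rate and the roughly 200 running days come from the text; the rest is unit conversion:

    # Sanity check: stored-data volume implied by the rates quoted above.
    # Only the 100-1,500Mbps rate and the ~200 running days per year come
    # from the text; the conversion itself is straightforward.
    SECONDS_PER_DAY = 86_400
    RUNNING_DAYS = 200

    for rate_mbps in (100, 1_500):
        bytes_per_year = rate_mbps / 8 * 1e6 * SECONDS_PER_DAY * RUNNING_DAYS
        print(f"{rate_mbps} Mbps sustained for {RUNNING_DAYS} days ~ "
              f"{bytes_per_year / 1e15:.2f} PB/year")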
The anticipated data rates and network bandwidths in Figure 2 correspond to a conservative baseline formulated using an evolutionary view of network technologies. Given the price/performance of available networking today, estimates indicate that worldwide scientific network needs will reach 10Gbps within the next two to three years, followed by scheduled and dynamic use of multiple 10Gbps wavelengths by the time LHC is scheduled to begin operation in 2007.
Hundreds of Petabytes, then Exabytes
HENP data will increase from petabytes in 2002 to hundreds of petabytes by 2007 and exabytes (10¹⁸ bytes) by 2012 to 2015. To build a useful yet flexible distributed system, data-intensive Grid middleware and services, as well as much larger network bandwidths, are required, so typical data transactions drawing 1–10TB and eventually 100TB subsamples from multi-petabyte data stores can be completed in a few minutes.
To understand the potential of HENP applications for overwhelming future planned networks, note that the compacted stored data are pre-filtered by a factor of 10⁶ to 10⁷ by the “Online System” (a large cluster of up to thousands of CPUs filtering data in real time) in Figure 2. But real-time filtering risks throwing away data from subtle new interactions that do not fit preconceptions or existing and proposed theories. The basic problem for e-scientists is how to discover new interactions from the particle collisions, down to the level of a few interactions per year out of the 10¹⁶ produced. In an ideal world, every collision produced in the detector would be analyzed. However, the ability to analyze every event without prefiltering is beyond both current and foreseen states of network and computing technologies.
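To get a feel for what such a rejection factor means in practice, the following sketch assumes (hypothetically) an interaction rate of about 10⁹ per second and roughly 1MB per stored event; neither figure comes from the article. At the stronger rejection factor the implied storage rate lands near the 100–1,500Mbps range in Figure 2, yet only one interaction in a million or ten million is ever recorded.

    # What the quoted online rejection factor leaves behind (illustrative only).
    # The 1e9 interactions-per-second rate and the ~1MB stored event size are
    # assumptions for this sketch, not figures from the article.
    INTERACTION_RATE_HZ = 1e9
    EVENT_SIZE_MB = 1.0

    for rejection in (1e6, 1e7):
        stored_hz = INTERACTION_RATE_HZ / rejection   # interactions surviving per second
        stored_mbps = stored_hz * EVENT_SIZE_MB * 8   # implied rate to storage, in Mbps
        print(f"rejection factor {rejection:.0e}: ~{stored_hz:.0f} events/s, "
              f"~{stored_mbps:,.0f} Mbps to storage")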
Completing transactions in minutes rather than hours is necessary to avoid bottlenecks in which the hundreds of requests submitted each day accumulate, leaving thousands of requests pending over long periods. Transactions on this scale correspond to data throughput of 10Gbps to 1Tbps for 10-minute transactions and up to 10Tbps for one-minute transactions.
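The conversion behind these figures is simple; a minimal sketch, ignoring protocol overhead and using the transaction sizes discussed earlier:

    # The conversion behind the throughput figures above (no protocol overhead assumed).
    def required_throughput_gbps(transaction_tb, minutes):
        """Sustained rate, in Gbps, needed to move transaction_tb terabytes in the given time."""
        return transaction_tb * 1e12 * 8 / (minutes * 60) / 1e9

    for tb in (1, 10, 100):
        for minutes in (10, 1):
            rate = required_throughput_gbps(tb, minutes)
            print(f"{tb:>3} TB in {minutes:>2} min -> {rate:,.0f} Gbps")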
The HENP community is thus a principal driver, architect, and codeveloper of Data Grids for defining middleware tools and techniques for data-intensive manipulation and analysis.3 It is also a principal driver, architect, and codeveloper of networking infrastructure, tools, and techniques for end-to-end data transmission. Recent activities include:
- In June 2003, a Caltech/CERN team achieved 0.94Gbps sustained throughput with a single IPv6 stream over a distance of 7,000 kilometers (Chicago to Geneva).4
- In February 2003, an international team of physicists and computer scientists transferred 1TB of data across 10,037 kilometers in less than an hour, from Sunnyvale, CA (near SLAC), to CERN in Geneva, sustaining a TCP single-stream rate of 2.38Gbps. This throughput is equivalent to transferring a full CD in 2.3 seconds, 1,565 CDs per hour, 200 full-length DVD movies in an hour, or a DVD in 18 seconds.5
- In November 2002 at the SC 2002 conference in Baltimore, Caltech used the new FAST TCP stack6 to achieve 8.6Gbps throughput over a 10,000 km path between Sunnyvale and Amsterdam, transferring 22TB of data in six hours in 10 TCP streams. (For more on FAST, see the article by Falk et al. in this section [6, 9].)
Achievable throughput will soon reach the limits of networks based on statically routed and switched paths. In the longer term, within 10 years, intelligent photonics, along with the dynamic use of wavelengths and construction and tearing down of wavelength paths through wavelength routing, represent a natural match for the peer-to-peer interactions required for data-intensive science. Integrating intelligent photonic switching with advanced protocols is an effective basis for using network infrastructures, wavelength by wavelength, and promises to put terabit networks within the reach, technically and financially, of scientists worldwide.
Conclusion
While HENP is a pioneer in cyberinfrastructure design, other major e-science efforts are also under way; two of them—one in biology and medical research, the other in earth science—are outlined in the sidebar “Emerging Cyberinfrastructure Communities.” The NIH-supported Biomedical Informatics Research Network (BIRN) project, which began in 2001, enables biomedical researchers and neuroscientists throughout the U.S. to collaborate in the study of brain disorders and obtain better statistics on the morphology of disease processes, ranging from multiple sclerosis to schizophrenia, by standardizing and cross-correlating data from many different imaging systems at scales from the molecular to the whole brain (see www.nbirn.net). EarthScope, funded in 2002 by NSF, enables geoscientists to better observe the structure and ongoing deformation of the North American continent by obtaining data from a network of multipurpose geophysical instruments and observatories (see www.earthscope.org).
The wealth of information promised by these pioneering efforts means new challenges in data acquisition, controlled secure sharing of access to distributed databases, distributed data processing, managed distribution, large-scale multidimensional visualization, and interdisciplinary collaboration across national and international networks on a scale unprecedented in the history of science.
Meanwhile, e-science faces unprecedented challenges in terms of: the data-intensiveness of the work (as the data being processed increases from terabytes to petabytes to exabytes); the complexity of the data (extracting detail from data sets generated by instruments); the timeliness of data transfers (whether bulk transfers for remote storage, smaller transfers for distributed computing and analysis, or real-time streams for collaboration); and the global extent and complexity of the collaborations, as international teams pursue data-intensive exploration and analysis in fundamentally new ways.
An integrated cyberinfrastructure promises the first distributed systems environment serving virtual organizations on a global scale. The new information technologies derived from enabling e-science communities can thus affect industrial and commercial operations as well. Resilient, self-aware systems, supporting large volumes of robust terabyte-scale and larger transactions and adapting to changing workloads, can provide a strong foundation for the distributed data-intensive business processes of multinational corporations.
These new systems might also lead to new modes of interaction between people and the persistent information in their daily lives. Learning to provide, efficiently manage, and absorb this information in a persistent, collaborative environment will profoundly affect everyone in terms of commerce, communications, health care, and entertainment, not just scientists and their experiments.
Figures
Figure. Deforestation in the Amazon Basin, Santa Cruz, Bolivia. Acquired in the false colors of the Landsat-7 satellite; healthy vegetation appears bright red. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)
Figure. Namib-Naukluft National Park, Namib Desert, Namibia. Coastal winds create the tallest sand dunes in the world here, some reaching 980 feet. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)
Figure. The Anti-Atlas Mountains, part of the Atlas Mountain Range in southern Morocco. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)
Figure 1. Simulated decay of the Higgs boson into four muons. (bottom) The high-momentum charged particles in the Higgs event. (top) How the event would appear in the Compact Muon Solenoid detector, submerged beneath many other background interactions. (Image created by the CMS Collaboration; see cmsinfo.cern.ch/Welcome.html.)
Figure. Biomedical and earth e-science. (top) Digital montage of a slice of a rat cerebellum, composed of 43,200 separate images acquired using a Biorad RTS 2000MP two-photon system attached to a Nikon TE30 microscope equipped with an Applied Precision, Inc. automated three-axis stage. It was fluorescently stained for inositol 1,4,5-trisphosphate receptor (IP3R), a type of intracellular calcium channel highly enriched in Purkinje cells (green); glial fibrillary acidic protein found in glial cells (red); and DNA within the nuclei of the cells (blue). (Image by T. Deerinck, S. Chow, J. Bouwer, H. Hakozaki, M. Martone, S. Peltier, and M. Ellisman of the National Center for Microscopy and Imaging Research, University of California, San Diego.) (bottom) Interferometric phase map of a major earthquake in the Mojave Desert, draped on top of a digital topography model and combining scans made by space-borne synthetic aperture radar before and after the earthquake. The white line denotes the observed surface rupture. (Image by Y. Fialko of the Cecil and Ida Green Institute of Geophysics and Planetary Physics, Scripps Institution of Oceanography, University of California, San Diego. The interferogram was processed using the Jet Propulsion Laboratory/Caltech Repeat Orbit Interferometry PACKage software. Original SAR data by the European Space Agency, distributed by Eurimage, Italy, and acquired via the WInSAR Consortium with funding from NSF, NASA, and USGS.)