Blueprint for the Future of High-Performance Networking

Data-Intensive E-Science Frontier Research

Large-scale e-science, including high-energy and nuclear physics, biomedical informatics, and Earth science, depends on an increasingly integrated, distributed cyberinfrastructure serving virtual organizations on a global scale.


“As research on so many fronts is becoming increasingly dependent on computation, all science, it seems, is becoming computer science,” heralded science journalist George Johnson [7] in 2001. The U.K. Research Councils define e-science as “large-scale science carried out through distributed global collaborations enabled by networks, requiring access to very large data collections, very large-scale computing resources, and high-performance visualization.”¹ These advanced computing technologies enable discipline scientists to study and better understand complex systems—physical, geological, biological, environmental, atmospheric—from the micro to the macro level, in both time and space.

This growing dependence on information technology and the benefits to research and society at large from persistent collaboration over even intercontinental distances, coupled with the ability to process, disseminate, and share information on unprecedented scales, have prompted U.S. federal agencies, notably the National Science Foundation (NSF), Department of Energy (DOE), National Institutes of Health (NIH), and NASA, to fund the cyberinfrastructure needed to empower e-science research and allied education [1].

The high-energy and nuclear physics (HENP) community is the most advanced in its efforts to develop globally connected, Grid-enabled, data-intensive systems [2]. HENP experiments are breaking new ground in our common understanding of the unification of forces, the origin and stability of matter, and the structures and symmetries governing the nature of matter and space-time in the universe. Among the principal goals at the high-energy frontier are finding the mechanism responsible for mass in the universe, the Higgs particles associated with mass generation, and the fundamental mechanism that led to the predominance of matter over antimatter in the observable cosmos (see Figure 1).

Experimentation at increasing energy scales, along with the increasing sensitivity and complexity of measurements, has increased the scale and cost of detectors and particle accelerators, along with the size and geographic dispersion of scientific collaborations. The largest collaborations today include the Compact Muon Solenoid (CMS) and A Toroidal LHC ApparatuS (ATLAS), each building an experiment for the European Laboratory for Particle Physics (CERN) Large Hadron Collider (LHC) program and each involving some 2,000 physicists from 150 institutions in 36 countries. The current generation of experiments includes BaBar at the Stanford Linear Accelerator Center (SLAC) and Dzero and the Collider Detector at Fermilab (CDF) at the Fermi National Accelerator Laboratory, Batavia, IL.²

To improve the scientific community’s understanding of the fundamental constituents of matter and the nature of space-time itself, researchers must isolate and measure rare events. The Higgs particles, thought to be responsible for mass in the universe, are typically produced in only one of 10¹³ interactions. A new generation of particles at the upper end of the LHC’s energy reach may be produced only at the rate of a few events per year, or one event in 10¹⁵ to 10¹⁶ interactions.
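
A back-of-envelope sketch makes this rarity concrete, using only the fractions quoted here and the roughly 10¹⁶ interactions produced per year cited later in this article; actual LHC trigger rates are far more involved, so treat this as an illustration, not a rate calculation.

```python
# Back-of-envelope rarity of signal events, using only figures quoted in
# this article: ~10^16 interactions produced per year, a Higgs produced in
# roughly one of 10^13 interactions, and new particles at one in 10^15 to
# 10^16 interactions.

INTERACTIONS_PER_YEAR = 1e16

signal_fractions = {
    "Higgs (1 in 10^13)": 1e-13,
    "New physics (1 in 10^15)": 1e-15,
    "New physics (1 in 10^16)": 1e-16,
}

for label, fraction in signal_fractions.items():
    events_per_year = INTERACTIONS_PER_YEAR * fraction
    print(f"{label}: ~{events_per_year:,.0f} signal events per year")

# Prints ~1,000 Higgs-like events per year, but only ~1-10 events per year
# for the rarest new-physics processes -- before any detection losses.
```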

The key to discovery is the ability to detect a signal with high efficiency and, in many cases (such as the Higgs searches), also to measure the energies and topology of the signal events precisely while suppressing large, potentially overwhelming backgrounds. Physicists want to scan through the data repeatedly, devising new or improved methods of measuring and isolating the “new physics” events with increasing selectivity as they progressively learn how to suppress the backgrounds. The initial strategies for separating out the Higgs and other signals will rely on massive simulations performed with distributed Grid facilities. These strategies will evolve as the LHC data accumulates and studying the data itself yields a more precise picture of the backgrounds to be suppressed.


The basic problem for e-scientists is how to discover new interactions from particle collisions, down to the level of a few interactions per year out of the 10¹⁶ produced.


HENP scientists designing these experiments expect their data volumes to increase from the multi-petabyte to the exabyte (10¹⁸ bytes) range within the next 10 to 15 years; at the same time, the corresponding network speed requirements on each of the major links used in the field are expected to increase from 10Gbps to the Tbps range. Shaping the future cyberinfrastructure, HENP researchers are working with computer scientists to codevelop advanced networking testbeds (see the article by DeFanti et al. in this section) and Grid middleware systems (see the article by Foster and Grossman in this section).


Model Cyberinfrastructure

While the term e-science is relatively new, the concept of computational science was first popularized with the advent of the national supercomputer centers in 1986. Since becoming operational in 1994, NASA’s Earth Observing System Data and Information System (EOSDIS) has managed data from the agency’s Earth science research satellites and field measurement programs at eight Distributed Active Archive Centers around the U.S. and at six foreign sites, providing data archiving, distribution, and information management services. Today, it manages and distributes data from EOS missions (Landsat-7, QuikSCAT, Terra, and ACRIMSAT); pre-EOS missions (UARS, SeaWiFS, TOMS-EP, TOPEX/Poseidon, and TRMM); and Earth Science Enterprise legacy data. The Terra spacecraft alone produces 194GB of raw data per day, and Landsat-7 produces 150GB per day—numbers that can quadruple after the data is processed. Volumes have only increased since 2000, when EOSDIS supported approximately 104,000 users and filled 3.4 million product requests. EOSDIS makes a strong case that the needs of a globally distributed research community, dependent on scientific instruments collecting terabytes of data daily, are manageable only through a distributed cyberinfrastructure. HENP, along with other discipline sciences, is extending and modernizing this model.
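
A quick arithmetic sketch, using only the Terra and Landsat-7 rates quoted above and the stated up-to-fourfold growth after processing, shows how such daily streams accumulate over a year; it is illustrative only, since real mission volumes vary with duty cycle and product level.

```python
# Rough annual data volume implied by the daily rates quoted above.
# Assumes continuous operation; the 4x factor reflects the article's note
# that volumes can quadruple after the data is processed.

terra_gb_per_day = 194      # Terra raw data
landsat7_gb_per_day = 150   # Landsat-7 raw data
processing_factor = 4       # upper bound quoted in the text

raw_tb_per_year = (terra_gb_per_day + landsat7_gb_per_day) * 365 / 1000
processed_tb_per_year = raw_tb_per_year * processing_factor

print(f"Raw data:         ~{raw_tb_per_year:.0f} TB/year")
print(f"After processing: up to ~{processed_tb_per_year:.0f} TB/year")
# ~126 TB/year raw and up to ~500 TB/year processed -- from two spacecraft.
```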

The LHC experiments, for example, adopted the Data Grid Hierarchy, or a structured ensemble of computing and data-handling facilities interconnected by networks, developed at the California Institute of Technology (see Figure 2). Data generated at the experiment is filtered, processed in real time, and stored at the rate of 100–1,500Mbps during much of the year (typically 200 days), resulting in petabytes per year of stored and processed binary data accessed and processed repeatedly by worldwide collaborators. Following initial processing and storage at the Tier0 facility at CERN, data is distributed over high-speed networks to approximately 10 national Tier1 centers in the U.S., Europe, and elsewhere. The data is further processed, analyzed, and stored at approximately 50 Tier2 regional centers, each serving small- to medium-size countries or regions of larger countries, including the U.S., the U.K., and Italy. Data subsets are accessed from and further analyzed by physics groups through one of hundreds of Tier3 work-group servers and/or thousands of Tier4 desktops worldwide. Being able to use this global ensemble of systems depends on the development of Data Grids, or distributed data stores connected via high-speed networks, capable of managing and marshalling Tier-N resources and supporting collaborative software development around the world.
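
The tier structure and data rates just described can be captured in a small model. The sketch below uses the tier counts and the 100–1,500Mbps, roughly 200-day figures from this paragraph; the data structure itself is purely illustrative and is not part of any actual Grid middleware.

```python
# A minimal model of the LHC Data Grid Hierarchy described above. Tier
# counts and data rates come from the article; the structure itself is
# illustrative, not a specification of any real Grid middleware.

from dataclasses import dataclass

@dataclass
class Tier:
    level: int
    approx_count: str
    role: str

HIERARCHY = [
    Tier(0, "1", "CERN: initial processing and archival storage"),
    Tier(1, "~10", "national centers in the U.S., Europe, and elsewhere"),
    Tier(2, "~50", "regional centers serving countries or regions"),
    Tier(3, "hundreds", "physics work-group servers"),
    Tier(4, "thousands", "physicists' desktops"),
]

def stored_pb_per_year(rate_mbps: float, days: int = 200) -> float:
    """Petabytes written per year at a sustained rate (100-1,500Mbps over ~200 days)."""
    bits = rate_mbps * 1e6 * days * 24 * 3600
    return bits / 8 / 1e15

for tier in HIERARCHY:
    print(f"Tier{tier.level} ({tier.approx_count}): {tier.role}")

print(f"Stored at Tier0: ~{stored_pb_per_year(100):.1f} to "
      f"~{stored_pb_per_year(1500):.1f} PB/year")
```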

The anticipated data rates and network bandwidths in Figure 2 correspond to a conservative baseline formulated using an evolutionary view of network technologies. Given the price/performance of available networking today, estimates indicate that worldwide scientific network needs will reach 10Gbps within the next two to three years, followed by the scheduled and dynamic use of multiple 10Gbps wavelengths by the time the LHC begins operation in 2007.


Hundreds of Petabytes, then Exabytes

HENP data will increase from petabytes in 2002 to hundreds of petabytes by 2007 and exabytes (10¹⁸ bytes) by 2012 to 2015. To build a useful yet flexible distributed system, data-intensive Grid middleware and services, as well as much larger network bandwidths, are required, so typical data transactions drawing 1–10TB, and eventually 100TB, subsamples from multi-petabyte data stores can be completed in a few minutes.

To understand how HENP applications could overwhelm even planned future networks, note that the compacted stored data are pre-filtered by a factor of 10⁶ to 10⁷ by the “Online System” (a large cluster of up to thousands of CPUs filtering data in real time) in Figure 2. But real-time filtering risks throwing away data from subtle new interactions that do not fit preconceptions or existing and proposed theories. The basic problem for e-scientists is how to discover new interactions from the particle collisions, down to the level of a few interactions per year out of the 10¹⁶ produced. In an ideal world, every collision produced in the detector would be analyzed. However, the ability to analyze every event without prefiltering is beyond both current and foreseeable network and computing technologies.
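
The article's own numbers quantify what that filtering implies: roughly 10¹⁶ interactions per year reduced by a factor of 10⁶ to 10⁷ leaves 10⁹ to 10¹⁰ stored events. The sketch below adds an assumed event size of about 1MB (an illustrative figure, not one taken from the text) to show how this lands in the petabyte-per-year range discussed earlier.

```python
# What the 10^6-10^7 online rejection factor implies, using the ~10^16
# interactions/year figure from the text. The ~1MB event size is an
# illustrative assumption, not a number taken from the article.

INTERACTIONS_PER_YEAR = 1e16
EVENT_SIZE_BYTES = 1e6  # assumed ~1MB per stored event

for rejection in (10**6, 10**7):
    stored_events = INTERACTIONS_PER_YEAR / rejection
    stored_pb = stored_events * EVENT_SIZE_BYTES / 1e15
    print(f"Rejection factor {rejection:,}: ~{stored_events:.0e} events/year, "
          f"~{stored_pb:.0f} PB/year at ~1MB/event")

# Even after rejecting all but one interaction in a million, the stored
# sample is petabytes per year -- and anything discarded online is lost.
```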


These new systems might also lead to new modes of interaction between people and the persistent information in their daily lives.


Completing transactions in minutes rather than hours is necessary to avoid bottlenecks; with up to hundreds of such requests per day, slower transactions would leave thousands of requests pending over long periods. Transactions on this scale correspond to data throughput of 10Gbps to 1Tbps for 10-minute transactions and up to 10Tbps for one-minute transactions.
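
These throughput figures follow directly from the transaction sizes and completion times; the short sketch below reproduces the arithmetic.

```python
# Sustained throughput needed to move the transaction sizes discussed above
# (1TB to 100TB subsamples) within a given number of minutes.

def required_gbps(terabytes: float, minutes: float) -> float:
    """Gigabits per second needed to move `terabytes` in `minutes`."""
    bits = terabytes * 1e12 * 8
    return bits / (minutes * 60) / 1e9

for tb in (1, 10, 100):
    for minutes in (10, 1):
        print(f"{tb:>3} TB in {minutes:>2} min -> ~{required_gbps(tb, minutes):,.0f} Gbps")

# 1TB in 10 minutes needs ~13Gbps, 100TB in 10 minutes ~1.3Tbps, and 100TB
# in one minute ~13Tbps -- the 10Gbps-to-10Tbps range quoted in the text.
```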

The HENP community is thus a principal driver, architect, and codeveloper of Data Grids, defining the middleware tools and techniques for data-intensive manipulation and analysis,³ as well as of the networking infrastructure, tools, and techniques for end-to-end data transmission. Recent activities include:

  • In June 2003, a Caltech/CERN team achieved 0.94Gbps sustained throughput with a single IPv6 stream over a distance of 7,000 kilometers (Chicago to Geneva).⁴
  • In February 2003, an international team of physicists and computer scientists transferred 1TB of data across 10,037 kilometers in less than an hour from SLAC in Sunnyvale, CA, to CERN in Geneva, sustaining a TCP single-stream rate of 2.38Gbps. This throughput is equivalent to transferring a full CD in 2.3 seconds, 1,565 CDs per hour, 200 full-length DVD movies in an hour, or a DVD in 18 seconds.⁵ (These equivalences are checked in the sketch following this list.)
  • In November 2002 at the SC 2002 conference in Baltimore, Caltech used the new FAST TCP stack⁶ to achieve 8.6Gbps throughput over a 10,000-kilometer path between Sunnyvale and Amsterdam, transferring 22TB of data in six hours using 10 TCP streams. (For more on FAST, see the article by Falk et al. in this section [6, 9].)
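
The everyday equivalences quoted for these records can be checked directly from the sustained rates. The sketch below does so, assuming nominal capacities of roughly 0.7GB for a CD and 4.7GB for a single-layer DVD; these capacities are assumptions, which is why the results land near, rather than exactly on, the figures quoted above.

```python
# Checking the rate equivalences quoted for the Feb. 2003 record (2.38Gbps,
# single TCP stream) and the Nov. 2002 FAST TCP demonstration (8.6Gbps).
# CD and DVD capacities are assumed nominal values, not taken from the text.

CD_GB = 0.70   # assumed CD capacity
DVD_GB = 4.7   # assumed single-layer DVD capacity

gb_per_second = 2.38 / 8  # 2.38Gbps expressed in gigabytes per second

print(f"One CD in ~{CD_GB / gb_per_second:.1f} s")            # ~2.4 s
print(f"One DVD in ~{DVD_GB / gb_per_second:.0f} s")          # ~16 s
print(f"~{gb_per_second * 3600 / CD_GB:,.0f} CDs per hour")   # ~1,530
print(f"1TB in ~{1000 / gb_per_second / 60:.0f} minutes")     # ~56 minutes

# Nov. 2002: 8.6Gbps sustained for six hours moves roughly
print(f"~{8.6 / 8 * 6 * 3600 / 1000:.0f} TB in six hours")    # ~23 TB
```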

Achievable throughput will soon reach the limits of networks based on statically routed and switched paths. In the longer term, within 10 years, intelligent photonics, with wavelengths used dynamically and wavelength paths constructed and torn down through wavelength routing, represents a natural match for the peer-to-peer interactions required for data-intensive science. Integrating intelligent photonic switching with advanced protocols provides an effective basis for exploiting network infrastructures wavelength by wavelength and promises to put terabit networks within the reach, technically and financially, of scientists worldwide.


Conclusion

While HENP is a pioneer in cyberinfrastructure design, other major e-science efforts are also under way; two of them—one in biology and medical research, the other in earth science—are outlined in the sidebar “Emerging Cyberinfrastructure Communities.” The NIH-supported Biomedical Informatics Research Network (BIRN) project, which began in 2001, enables biomedical researchers and neuroscientists throughout the U.S. to collaborate in the study of brain disorders and obtain better statistics on the morphology of disease processes, ranging from multiple sclerosis to schizophrenia, by standardizing and cross-correlating data from many different imaging systems at scales from the molecular to the whole brain (see www.nbirn.net). EarthScope, funded in 2002 by NSF, enables geoscientists to better observe the structure and ongoing deformation of the North American continent by obtaining data from a network of multipurpose geophysical instruments and observatories (see www.earthscope.org).

The wealth of information promised by these pioneering efforts means new challenges in data acquisition, controlled secure sharing of access to distributed databases, distributed data processing, managed distribution, large-scale multidimensional visualization, and interdisciplinary collaboration across national and international networks on a scale unprecedented in the history of science.

Meanwhile, e-science faces unprecedented challenges in terms of: the data-intensiveness of the work (as the data being processed increases from terabytes to petabytes to exabytes); the complexity of the data (extracting subtle detail from data sets generated by instruments); the timeliness of data transfers (whether bulk transfers for remote storage, smaller transfers for distributed computing and analysis, or real-time streams for collaboration); and the global extent and complexity of the collaborations, as international teams explore and analyze data-intensive research in fundamentally new ways.

An integrated cyberinfrastructure promises the first distributed systems environment serving virtual organizations on a global scale. The new information technologies derived from enabling e-science communities can thus affect industrial and commercial operations as well. Resilient, self-aware systems, supporting large volumes of robust terabyte-scale and larger transactions while adapting to changing workloads, can provide a strong foundation for the distributed data-intensive business processes of multinational corporations.

These new systems might also lead to new modes of interaction between people and the persistent information in their daily lives. Learning to provide, efficiently manage, and absorb this information in a persistent, collaborative environment will profoundly affect everyone in terms of commerce, communications, health care, and entertainment, not just scientists and their experiments.




Figures

UF1 Figure. Deforestation in the Amazon Basin, Santa Cruz, Bolivia. Acquired in the false colors of the Landsat-7 satellite; healthy vegetation appears bright red. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)

UF2 Figure. Namib-Naukluft National Park, Namib Desert, Namibia. Coastal winds create the tallest sand dunes in the world here, some reaching 980 feet. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)

UF3 Figure. The Anti-Atlas Mountains, part of the Atlas Mountain Range in southern Morocco. (U.S. Geological Survey Earth Resources Observation Systems Data Center and NASA Landsat Project Science Office)

F1 Figure 1. Simulated decay of the Higgs boson into four muons. (bottom) The high-momentum charged particles in the Higgs event. (top) How the event would appear in the Compact Muon Solenoid detector, submerged beneath many other background interactions. (Image created by the CMS Collaboration; see cmsinfo.cern.ch/Welcome.html.)

F2 Figure 2. The Large Hadron Collider Data Grid Hierarchy.


UF1-4 Figure. Biomedical and earth e-science. (top) Digital montage of a slice of a rat cerebellum, composed of 43,200 separate images acquired using a Biorad RTS 2000MP two-photon system attached to a Nikon TE30 microscope equipped with an Applied Precision, Inc. automated three-axis stage. It was fluorescently stained for inositol 1,4,5-trisphosphate receptor (IP3R), a type of intracellular calcium channel highly enriched in Purkinje cells (green); glial fibrillary acidic protein found in glial cells (red); and DNA within the nuclei of the cells (blue). (Image by T. Deerinck, S. Chow, J. Bouwer, H. Hakozaki, M. Martone, S. Peltier, and M. Ellisman of the National Center for Microscopy and Imaging Research, University of California, San Diego.) (bottom) Interferometric phase map draped on top of a digital topography model of a major earthquake in the Mojave Desert combining scans made by space-borne synthetic aperture radar before and after the earthquake. The white line denotes the observed surface rupture. (Image by Y. Fialko of the Cecil and Ida Green Institute of Geophysics and Planetary Physics, Scripps Institution of Oceanography, University of California, San Diego. The interferogram was processed using the Jet Propulsion Laboratory/Caltech Repeat Orbit Interferometry PACKage software. Original SAR data by the European Space Agency, distributed by Eurimage, Italy, and acquired via the WInSAR Consortium with funding from NSF, NASA, and USGS.)

    1. Atkins, D. Revolutionizing Science and Engineering Through Cyberinfrastructure. Report of the National Science Foundation Blue Ribbon Advisory Panel on Cyberinfrastructure. NSF, Arlington, VA, Jan. 2003; see www.communitytechnology.org/nsf_ci_report.

    2. Bunn, J. and Newman, H. Data-intensive grids for high-energy physics. In Grid Computing: Making the Global Infrastructure a Reality, F. Berman, G. Fox, and T. Hey, Eds. John Wiley & Sons, Inc., New York, 2003.

    3. Carlson, R. EarthScope Workshop Report: Scientific Targets for the World's Largest Observatory Pointed at the Solid Earth. Carnegie Institution, Washington, D.C. (Snowbird, UT, Oct. 10–12, 2001) (report published Mar. 2002); see www.earthscope.org/assets/es_wksp_mar2002.pdf.

    4. Henyey, T. EarthScope Project Plan: A New View into Earth. EarthScope Working Group, Sept. 2001; see www.earthscope.org/assets/es_proj_plan_hi.pdf.

    5. Hornberger, G. Review of EarthScope Integrated Science. Committee on the Review of EarthScope Science Objectives and Implementation Planning. National Research Council, Washington, D.C., 2001; see www.nap.edu/books/0309076447/html/.

    6. Jin, C., Wei, D., Low, S., Buhrmaster, G., Bunn, J., Choe, D., Cottrell, R., Doyle, J., Newman, H., Paganini, F., Ravot, S., and Singh, S. FAST kernel: Background theory and experimental results. Presented at the First International Workshop on Protocols for Fast Long-Distance Networks (CERN, Geneva, Switzerland, Feb. 3–4, 2003); see netlab.caltech.edu/pub/papers/pfldnet.pdf.

    7. Johnson, G. All science is computer science. The New York Times (Mar. 25, 2001); see www.nytimes.com/2001/03/25/weekinreview/25JOHN.html.

    8. Lee, D., Lin, A., Hutton, T., Akiyama, T., Shinji, S., Lin, F.-P., Peltier, S., and Ellisman, M. Global telescience featuring IPv6 at iGrid 2002. J. Future Gen. Comput. Syst. 19, 6 (Aug. 2003), 1031–1040.

    9. Low, S. Duality model of TCP and queue management algorithms. IEEE/ACM Trans. on Networking 11, 4 (Aug. 2003), 525–536; see netlab.caltech.edu/pub/papers/duality.ps.

    10. Newman, H., Legrand, I., and Bunn, J. A distributed agent-based architecture for dynamic services. Presented at Computing in High-Energy Physics (CHEP 2001) (Beijing, Sept. 3–7, 2001); see clegrand.home.cern.ch/clegrand/CHEP01/chep01_10-010.pdf.

    11. Newman, H. and Legrand, I. A Self-Organizing Neural Network for Job Scheduling in Distributed Systems. CMS Note 2001/009, The Compact Muon Solenoid Experiment, Jan. 8, 2001; see clegrand.home.cern.ch/clegrand/SONN/note01_009.pdf.

    12. Ravot, S. GridDT. Presented at the First International Workshop on Protocols for Fast Long-Distance Networks (CERN, Geneva, Feb. 3–4, 2003); see datatag.web.cern.ch/datatag/pfldnet2003/slides/ravot.ppt.

    The HENP research cited here is supported by the U.S. DOE Office of Science's Office of High Energy and Nuclear Physics award #DE-FG03-92ER40701. PPDG is supported by the DOE Office of Science HENP and Office of Mathematical, Information, and Computational Sciences, award #DE-FC02-01ER25459. The NSF supports Strategic Technologies for the Internet (Multi-Gbps TCP) award #ANI-0230967; iVDGL subcontract #UF01087 to NSF grant #PHY-0122557; and CMS Analysis: An Interactive Grid-Enabled Environment (CAIGEE) award #PHY-0218937.

    The HENP advanced network activities cited here were also made possible through the generous support of Cisco Systems, Level(3) Communications, and Intel.

    The BIRN research is supported by the National Center for Research Resources of the NIH by awards to all participating sites, as well as by NIH support for UCSD's National Biomedical Computational Resource (P41 RR08605) and National Center for Microscopy and Imaging Research (P41-RR04050). BIRN also leverages cyberinfrastructure being developed under NSF's support for the National Partnership for Advanced Computational Infrastructure (#ASC 975249).

    The U.S. Congress appropriated funding for EarthScope in 2002. NSF Program Solicitation 03-567 called for proposals to conduct research and education associated with EarthScope in accordance with the scientific targets defined by the EarthScope Planning Committee.

    1. www.research-councils.ac.uk/escience/

    2. CMS: cmsdoc.cern.ch; ATLAS: atlasexperiment.org; LHC: lhc.web.cern.ch/lhc; BaBar: www-public.slac.stanford.edu/babar; D0: www-d0.fnal.gov; CDF: www-cdf.fnal.gov.

    3. GriPhyN: www.griphyn.org; PPDG: www.ppdg.net; iVDGL: www.ivdgl.org; EU DataGrid: www.eu-datagrid.org; LHC Computing Grid: lcg.web.cern.ch/LCG; UK e-Science Programme: www.rcuk.ac.uk/escience; EU DataTAG: datatag.web.cern.ch.

    4. archives.internet2.edu/guest/archives/I2-NEWS/log200306/msg00003.html

    5. datatag.web.cern.ch/datatag/speed_record.html

    6. The Fast Active-Queue-Management Scalable TCP (FAST) algorithm; see netlab.caltech.edu/FAST.
