The ultimate computers in our long-term future will deliver exaflops-scale performance (or greater) and will look very different from today’s microprocessors and massively parallel computers. Ironically, however, their alien structures and operational behavior can be inferred from the same technology trends driving development of today’s conventional computing systems.
A vision of future computer architectures that are direct extrapolations of current trends is easily inspired by the explosive growth of today’s computer performance, price-performance, and applications (driven by Moore’s Law for device technology), as well as the more dramatic paradigm shifts brought on by the Internet, the Web, and grids. Yet an examination of these trends also reveals the possibility of something quite different in how we’ll organize, design, and fabricate our largest computers in the future. They even set the stage for a revolution in computer architecture that may displace the venerable and highly successful “von Neumann model” and its predominance over the past 50 years.
One class of innovative computing system being explored today by computer scientists at the California Institute of Technology’s Center for Advanced Computing Research is the continuum computer architecture (CCA), an ultra-fine-grain uniform structure that approximates a continuous 3D execution medium enabled through next-generation submicron logic devices. Future computers—whether major exaflops engines used to design and simulate everything from controlled fusion reactors to rapid-response medicines, to compact low-power robot brains for autonomous control of spacecraft, airplanes, automobiles, and homes, to embedded smartware in our clothes and bodies—may look less like today’s microprocessors and much more like CCAs.
Several concurrent trends in semiconductor and other technologies will force a rethinking of the physical structure and logical operation of parallel computer systems. Lithographic feature size will be driven below 0.05 microns by 2010, increasing the number of devices per unit area by at least an order of magnitude over today’s levels. Combined with increases in chip area, perhaps even toward wafer-scale integration, total chip capacity will increase by a factor of 100 to as much as 1,000. On-chip (local) clock speeds are likely to increase by a factor of 10, although chipwide and off-chip clock rate gains will be more modest at 3X.
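The capacity estimate above can be checked with a little arithmetic. The sketch below assumes that device density scales with the inverse square of feature size, and it takes 0.18 microns as a stand-in for today's feature size; that starting value and all names in the code are illustrative assumptions, not figures from this article.

```python
def density_gain(old_feature_um: float, new_feature_um: float) -> float:
    """Devices per unit area scale roughly with 1 / (feature size squared)."""
    return (old_feature_um / new_feature_um) ** 2


# Illustrative assumption: a present-day feature size of 0.18 microns.
ASSUMED_CURRENT_FEATURE_UM = 0.18
# From the text: feature size driven below 0.05 microns.
TARGET_FEATURE_UM = 0.05

gain = density_gain(ASSUMED_CURRENT_FEATURE_UM, TARGET_FEATURE_UM)

# Chip-area growth that would be needed to reach the quoted total capacity gains.
area_for_100x = 100 / gain
area_for_1000x = 1000 / gain

print(f"density gain: {gain:.2f}x")
print(f"area growth for 100x capacity: {area_for_100x:.1f}x")
print(f"area growth for 1,000x capacity: {area_for_1000x:.1f}x")
```

Under these assumptions the density gain alone is about 13X, so the quoted 100X to 1,000X capacity range would need chip area to grow by very roughly 8X to 77X, which is consistent with the mention of wafer-scale integration.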
Another advance is the merger of DRAM cell technology with CMOS logic devices to permit memory and processing on the same die, enabling new classes of structure impossible only a few years ago; IBM and Micron Technology fabrication facilities have already managed such a merger. The cost is still somewhat high, so its application is limited today, and the logic clock rate for this combined technology is not quite as high as the fastest pure CMOS logic available. Still, it is both functional and practical. External pin counts and speeds will also increase steadily, with thousands of pins possible per chip operating at a few GHz. Direct on-chip optical fiber links with time-division and wavelength-division multiplexing may ultimately deliver Tbps data rates per channel. Even more exotic technologies are possible as well.
The implications of these trends in combination with the limitations imposed by speed-of-light and design complexity will catalyze a renaissance in novel hardware design emphasizing locality of action, massive parallelism, replication of structure, nearest-neighbor communications, and decentralized control.
CCA responds to and exploits these trends while circumventing many of the constraints on scalability implicit in conventional design practices. Ultimately, it will merge the historically separate functions of logic, communications, and storage into a single composite fine-grain physical element (called a “fonton”) from which a global uniform 3D computing medium (called a “simultac”) may be realized through replication. A fonton holds a few words of data, each tagged with its own virtual name, and associates some logic directly with them; the result is that basic operations may be performed directly on any data element, anywhere in the system, at any time. A “word” is a collection of bytes making up a single instance of some data type that can be manipulated atomically. In the most general sense, word length (measured in bytes) varies depending on the type of data. A character may be 1B long, a floating-point number may be 8B long, and a record with a number of fields may be much larger.
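To make the idea of a fonton concrete, here is a minimal software sketch of one element: a handful of words keyed by virtual name, with logic that operates on them in place. The class name, method names, and the capacity of four words are all hypothetical; the article says only that a fonton holds "a few words" and does not specify an interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Fonton:
    """Hypothetical model of one fonton: a few tagged words plus local logic."""

    capacity: int = 4  # assumed value; the article says only "a few words"
    words: Dict[str, object] = field(default_factory=dict)  # virtual name -> word

    def store(self, name: str, value: object) -> bool:
        """Store a word under its virtual name; refuse a new name if full."""
        if name not in self.words and len(self.words) >= self.capacity:
            return False
        self.words[name] = value
        return True

    def apply(self, name: str, op: Callable[[object], object]) -> Optional[object]:
        """Run a basic operation in place on a named word, if it lives here."""
        if name not in self.words:
            return None  # the word is held by some other fonton
        self.words[name] = op(self.words[name])
        return self.words[name]


cell = Fonton()
cell.store("x", 41)
result = cell.apply("x", lambda v: v + 1)
missing = cell.apply("y", lambda v: v)
```

The point of the sketch is that storage and logic sit in the same element, so the operation goes to the named word rather than the word being fetched to a distant processor.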
A fonton supports nearest-neighbor communications and can act as a routing node for migrating information packets (called “burtons”) in a 3D mesh interconnect topology. At one instant, a simultac looks like a giant distributed memory; at another, it appears to be a giant 3D mesh network; then again, it can be a massive honeycomb of simple logic units connected in pipelines to their nearest neighbors, with all three modes operating concurrently. Conceptually, this structure can be extended in all three dimensions indefinitely, limited only by technology, power and cooling, and ultimately cost.
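One simple way to picture a burton hopping between neighboring fontons is dimension-order routing, a standard mesh-routing scheme in which a packet corrects its X coordinate first, then Y, then Z. The article does not say which routing policy a CCA would use, so the sketch below is an assumption chosen for clarity, and the function names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def next_hop(here: Coord, dest: Coord) -> Coord:
    """Return the neighboring fonton one step closer to dest (X, then Y, then Z)."""
    for axis in range(3):
        if here[axis] != dest[axis]:
            step = 1 if dest[axis] > here[axis] else -1
            hop = list(here)
            hop[axis] += step
            return (hop[0], hop[1], hop[2])
    return here  # already at the destination


def route(src: Coord, dest: Coord) -> List[Coord]:
    """List every fonton a burton visits, using only nearest-neighbor hops."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path


path = route((0, 0, 0), (2, 1, 0))
print(path)
```

Each hop changes exactly one coordinate by one, so every transfer is between adjacent fontons and the hop count equals the distance summed over the three axes. A real design would also need to steer around faulty or congested elements, which this sketch ignores.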
What most distinguishes the operation of this new class of system from conventional computers is the way it handles data and controls execution. Unlike today’s massively parallel processing systems, it needs no fixed location for data and no program counters controlling instruction sequencing. Instead, large data structures are distributed throughout the 3D storage environment; they migrate incrementally over time through the simultac by diffusion, as the logical structure changes and new space is required or locality requirements change. Streams of instructions—themselves a special kind of data structure—either move across the data, producing intermediate results, or sit distributed in a row of fontons through which data streams pass, also causing intermediate results to be created. Data pointers and program branches work together as a single unified semantic element to direct execution to data.
Operations occur as quickly as they possibly can, limited only by latency, without resource constraints. They provide the potential for extremely high throughput with a simple execution model that exploits all levels of program and data parallelism.
CCA-type structures, which may become commonplace by 2020, have other advantages too. Because data and program elements are dynamically and adaptively allocated to fontons, broken fontons can be isolated on the fly while the rest of the system performs normally, thus providing graceful degradation in the presence of either transient or permanent faults. The side surfaces of these enormous cubes provide exceptionally high I/O bandwidth for such applications as real-time image processing and rapid data set swapping from external mass storage. The high availability of execution logic supports real-time functionality as well.
More important than the specific details of CCA is that it provides but one example of the future possibilities in achieving extraordinary levels of information processing through innovative structures enabled by major advances in device technology. Such advances include nanotechnology, quantum dots, and superconductor logic, none of which is of direct practical value today but all of which could (within 20 years) provide the technologies needed to build future exaflops CCA computers.
Figure. Glimpses of our future. Image from the simulation of a likely scenario for the collision of the Milky Way and Andromeda galaxies in about three billion years. John Dubinski, University of Toronto, and Lars E. Hernquist, Harvard-Smithsonian Center for Astrophysics. Courtesy NASA and The Hubble Heritage Team (The Space Telescope Science Institute operated by the Association of Universities for Research in Astronomy, Inc. for NASA, under contract with the Goddard Space Flight Center, Greenbelt, MD).