
The Revolution Inside the Box

How changes in computer architecture are about to impact everyone in the IT business.

Computer architecture research is experiencing renewed vitality. No longer is the road ahead clear for microprocessors. Indeed, a decade ago the road seemed straightforward: deeper pipelines, more complex microprocessors, and little change to the core instruction set architecture. No longer. For a variety of technological reasons, manufacturers have embraced multicore CPUs for mainstream desktop computing. Such a change represents the biggest single risk these vendors have taken in decades, as they now expect software developers to embrace a programming model they have been reluctant to target in the past.

In this article I look back on computer architecture research over the past 10 years, examining what accounted for this change and what will happen because of it. I also survey the field of computer architecture research, looking at the problems we once thought were important to explore and how those problems will be exacerbated or mitigated in the years ahead.

Seven years ago, when I started as a young assistant professor, my computer science colleagues felt computer architecture was a solved problem. Words like “incremental” and “narrow” were often used to describe research under way in the field. In some ways, who could blame them? To a software developer, the hardware/software interface, the very core of the computer architecture research field, had remained unchanged for most of their professional lifetimes. Even the key microarchitectural innovations (pipelining, branch prediction, caching, and others) appeared to have been invented long ago. From the perspective of the rest of computer science, architecture was a solved problem. This perception had some very real consequences. NSF folded the computer systems architecture (CSA) program, together with a grab bag of areas from VLSI to graphics, into an omnibus “computing processes and artifacts” cluster. Large-scale DARPA programs to fund innovative architecture research in academia have recently wound down.

Around 2000, I would also characterize the collective mood of researchers in computer architecture as overly self-critical and bored with examining certain core topics in the field. The outside perspective of computer architecture had become the inside one. We would bemoan our field, nicknaming our premier technical conference the “International Symposium on Cache Architecture” instead of its true title, the International Symposium on Computer Architecture. We amusingly called our own innovations “yet another”12 take on an old problem.

Part I: The End of Innocence

In 2000, the roadmap ahead for desktop processing seemed clear to many. Processors with ever-deeper pipelines and faster clock frequencies would scale performance into the future.1,17 Researchers, myself among them, focused on the consequences of this trend, such as the wire-delay problem. It was hypothesized that clock frequencies would grow so fast, and wire delay improve so little, that it would take tens of cycles to send a signal across a large chip. The microarchitectures we build and ship today really are not equipped to work under such delay constraints.

However, faster clocks and deeper pipelines ran into more fundamental problems. More deeply pipelined processors are extremely complex to design and validate, and as designers struggled to manage this complexity they were also obtaining diminishing performance returns from the approach. More pipeline stages stretched the critical loops3 in the processor, increasing the number of cycles on the critical path of execution. Finally, while those pipeline stages enabled processors to be clocked faster, a linear increase in clock frequency produces a roughly cubic increase in power consumption.6 The power a commodity desktop processor can consume and still be economically viable is capped by packaging and cooling costs, which exert a downward pressure on clock frequency.
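
The cubic relationship follows from the standard first-order model of dynamic power; the sketch below is a back-of-the-envelope derivation, where the symbols α, C, and V_dd (activity factor, switched capacitance, and supply voltage) are textbook notation rather than quantities taken from this article.

```latex
P_{\text{dynamic}} = \alpha\, C\, V_{dd}^{2}\, f,
\qquad V_{dd} \propto f
\;\Rightarrow\;
P_{\text{dynamic}} \propto f^{3}
```

Because sustaining a higher clock frequency requires a roughly proportional increase in supply voltage, doubling f on a fixed design costs on the order of eight times the dynamic power, which packaging and cooling budgets cannot absorb.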

Collectively, these effects manifested themselves as a distinct change in the growth of processor frequency in 2004 (as indicated in Figure 1). Intel, in fact, stepped back from the aggressive clock scaling of the Pentium 4 with later products such as the Core 2. AMD never attempted to build processors at the same frequencies as Intel, but consequently suffered in the marketing game, in which consumers erroneously assume frequency is the only indicator of CPU performance.

Clock frequency is clearly not the same thing as performance. CPU performance must be measured by observing the execution time of real applications. Reasonable people can argue about the validity of the SPEC benchmark suite, and most would admit it under-represents memory and I/O effects. Nevertheless, when we consider the much larger trends in performance over several years, it is a reliable indicator of the progress computer architects and silicon technology have made.

Figure 2 depicts CPU performance from 1982 to 2007, as measured by several different generations of SPEC integer benchmarks. The world changed in June 2004. Examining this 25-year span, and now with four years of hindsight, it is clear we have a problem: we are no longer able to exponentially improve the performance of single-threaded applications.

The fact that we have been able to deliver this exponential improvement in the past has been a tremendous boon for the IT industry. Imagine if other industries, such as the auto or airline business, had at their core a driving source of exponential improvement. How would they change if miles per gallon or transport speed doubled every two years? Exponential performance improvement drives down cost and improves the user experience by enabling ever-richer applications. In fact, manufacturing and materials engineers, architects, and compiler writers have been so effective at translating Moore’s Law growth in chip resources23 into exponential performance improvements that many people erroneously use the two terms interchangeably. The question before us as a research field and an industry is, now that we no longer know how to translate Moore’s Law growth in available silicon area per unit dollar into exponential performance increases at relatively fixed cost, what are we going to do instead?

A Savior? Processor manufacturers have bet their future on a relatively straightforward (for them) solution: if we can’t make one core execute a thread any faster, place two cores on the die and modify the software to utilize the extra core. In the next generation, place four cores; the generation after that, eight; and so on. From a manufacturing standpoint, multicore, or “manycore,” as this approach is called, has several attractive qualities. First, we know how to build systems with higher peak performance: if the software can utilize them, more cores per die will equate to improved performance. Unlike single-threaded execution, where we really have no clear ideas left for scaling performance, multicore appears to offer us a path to salvation.

Second, again, if the software is there, a host of technological problems are mitigated by multicore. For example, as long as thread communication is kept to a minimum, it is more energy efficient to complete a fixed task using multiple threads than to execute one thread faster. Multiple smaller, simpler cores are easier to design than larger, complex ones, mitigating design and verification costs. Reliability, a growing problem in processor design, also becomes easier to address: simply place redundant cores on the die and, post-fabrication, route requests destined for defective units to one of the redundant cores, much as we do today with DRAMs. Or, even simpler, map the defective cores out entirely and sell the lower-cost part to a different market segment, as Sun Microsystems now does. Finally, wire delay, the grand challenge that motivated a flurry of research almost a decade ago, is also mitigated: simpler cores are smaller, and clock frequency can be reduced because performance can be had through thread-level parallelism.

All of this sounds fantastic, except for one thing: it is predicated on the software being multithreaded. Just as important for future scalability, thread parallelism must be found in software at a rate commensurate with Moore’s Law, which means if today we must find four independent threads of computation, in two years there must be eight, and two years after that 16.
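
Amdahl’s law, though not invoked by name in this article, is the standard way to make this requirement concrete: if a fraction p of a program’s work is parallelizable and n cores are available, the achievable speedup is bounded as shown below. The worked example with p = 0.9 is an illustration of mine, not a figure from the article.

```latex
\text{Speedup}(n) = \frac{1}{(1-p) + p/n},
\qquad
p = 0.9:\;\; \text{Speedup}(16) = \frac{1}{0.1 + 0.9/16} = 6.4,
\qquad
\lim_{n \to \infty} \text{Speedup}(n) = 10
```

Unless the parallel fraction itself keeps growing, doubling the core count every generation quickly stops paying off, which is exactly the scaling burden being handed to software developers.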


Processor manufacturers are not asking a small favor of software developers. From a programmer’s perspective, multicore CPUs currently look no different from the symmetric multiprocessors (SMPs) that have been around for decades. Such systems are not widely deployed on the home and business desktop, for good reason: they cost more, and they offer no significant performance advantage for the applications these users run. A reasonable question, then, is: What makes us think it’s going to work this time?

An optimist will make the following arguments. First, the cost difference now runs in the reverse direction: assuming we could build a faster single-threaded core at all, it would cost more, and design, validation, cooling, and manufacturing would assure that fact. Second, we know more about parallel programming now than ever before. Tools have actually improved, with methods to detect race conditions and automatically parallelize loops,10 and the resurgent interest in transactional programming will bear fruit. We have had many years of successful experience using parallelism in the graphics, server, and scientific computing domains. Third, and perhaps most importantly, it simply has to work. For this reason, software companies that need their products to achieve scalable performance must invest heavily in parallel programming. The hope is that this commercial emphasis on parallel computing will create solutions.
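
As a concrete illustration of the kind of tooling the optimist has in mind, the sketch below uses OpenMP,10 one of the loop-parallelization approaches cited above, to spread an embarrassingly parallel loop across cores. The array names and sizes are hypothetical, not taken from the article.

```c
/* A minimal OpenMP sketch: compile with, for example, gcc -fopenmp sum.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) {          /* set up some inputs */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* The pragma asks the runtime to divide the iterations among threads.
       Each iteration is independent, so no synchronization is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f (threads available: %d)\n", c[42], omp_get_max_threads());
    return 0;
}
```

The appeal is that the sequential structure of the code is preserved; the burden of thread creation and scheduling is pushed onto the compiler and runtime.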

A pessimist will counter thus: parallelism on the desktop has never worked, because the technical effort it takes to write threaded code does not align with the economic forces driving desktop software developers. Writing parallel code is more difficult than writing sequential code; it is more error-prone and harder to debug, owing to the nondeterminism of thread memory interleavings. Furthermore, I have yet to meet anyone who thinks the industry will successfully parallelize its large legacy code bases. Once a large application has been designed for a single-threaded execution model, it is extremely difficult to tease it apart and parallelize it. This means programmers must feel the economic pressure to create threaded code from day one, not as a later revision of the code base. Parallelizing code must be as much of a priority as writing correct code or achieving a certain time to market.
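
The nondeterminism the pessimist worries about is easy to demonstrate. In the hypothetical sketch below (mine, not the article’s), two threads increment a shared counter without synchronization, and the final value can change from run to run because the increments interleave unpredictably.

```c
/* Compile with, for example, gcc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared and unprotected */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* The "expected" answer is 2000000; depending on how the updates
       interleave, some can be lost and the printed value can differ
       from one run to the next. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Bugs of this kind may not show up in testing at all, which is what makes threaded code so much harder to get right than the sequential code most desktop developers write today.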

A more realistic view of the future is somewhere between these two extremes. Parallelizing legacy code is widely viewed as a dead-end, but building compelling add-ons to existing applications and then “bolting on” these features to legacy codes is possible. One does not need to change the entire code base of a word processor, for example, in order to bolt on a speech recognition engine that exploits multicore. Furthermore, some applications that drive sales of new machines, such as interactive video games, have ample data parallelism that is relatively easy to extract with stream-based programming.

Finally, programmers will end up writing parallel software without realizing that is what they are doing. For example, programmers who utilize SQL databases will see their application’s performance improve just by virtue of some other developer’s effort spent on parallelizing the database engine itself. Extending this idea further, building parallel frameworks that fit various application classes (business, Web services, games, and so on) will enable programmers to more easily exploit multicore processors without having to bite off the whole complexity of parallel programming.

Part II: The Architecture Research Community

Given this technology environment, what do computer architects currently research? To answer this question, it is best to look back over the last decade and understand what we thought were important research problems, and what happened to them.

The memory wall. A workshop held in conjunction with the 1997 International Symposium on Computer Architecture (ISCA) focused on the memory wall and the research occurring on proposed solutions to it. The memory wall is the problem that accesses to main memory are significantly slower than computation. There are two aspects to it: high latency to memory (hundreds of times the latency of a basic ALU operation inside a CPU) and constrained bandwidth. Excitement at the time centered on solutions that proposed placing computational logic in the DRAM.11,19,20,29,32 Such solutions never achieved broad acceptance in the marketplace because they required programmers to alter their software and required DRAM manufacturers to restructure their business models. DRAM is a commodity, and its producers compete on cost; adding logic to DRAM makes the devices expensive and system specific.


While technically feasible, it is a different business, one that DRAM manufacturers chose not to enter. However, less radical solutions, such as prefetching, stream buffers,18 and ever-larger on-chip caches,22 did take hold commercially. Moreover, programmers became more amenable to tuning their applications to the memory hierarchy architects provide them. Cache-conscious data structures and algorithms are an effective, yet burdensome, way to achieve performance.
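
As a small, hypothetical illustration of what “cache-conscious” means in practice (the example is mine, not the article’s), the two routines below compute the same sum over a matrix, but the row-major walk touches memory sequentially and uses every word of each cache line it fetches, while the column-major walk strides through memory and can miss on nearly every access.

```c
#include <stdio.h>

#define N 2048
static double a[N][N];

/* Strided traversal: consecutive accesses are N*sizeof(double) bytes apart. */
static double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Sequential traversal: matches the row-major layout of C arrays. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_row_major(), sum_column_major());
    return 0;
}
```

The burden the author refers to is that this kind of layout-and-traversal reasoning has to be repeated for every performance-critical data structure.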

The memory wall is still with us. Accessing DRAM continues to take hundreds of times as many cycles as a basic ALU operation. While the drop in the growth of processor clock speed means that memory latency is less of a growing concern, the switch to multicore presents new challenges in bandwidth and consistency. Having all these CPU cores on a single die means they will need a Moore’s Law growth in bandwidth to memory in order to operate efficiently. At the moment we are not pin-limited in providing this bandwidth, but we quickly will be, so we can expect a host of future research that looks at the memory wall again, this time from a bandwidth rather than a latency perspective.

Along with memory performance comes the evolution of the memory model. In the past it was thought that providing a sequentially consistent system with reasonable performance was not possible; hence we devised a range of relaxed consistency approaches.30 It is natural for programmers to assume multicore systems are sequentially consistent, however, and recent work8 suggests that architectures can use speculation16 to provide it. Looking forward, as much as can be done must be done to make programming parallel systems as easy as possible. This author believes that will push hardware vendors toward providing sequentially consistent systems.
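
The classic “store buffering” litmus test shows what is at stake. The hypothetical sketch below (mine, not the article’s) uses C11 atomics with relaxed ordering, under which both threads can read 0, an outcome sequential consistency forbids.

```c
/* Compile with, for example, gcc -std=c11 -pthread sb.c
   Under sequential consistency, r1 == 0 && r2 == 0 is impossible;
   with relaxed ordering, hardware and compiler reordering can produce it. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0, y = 0;
static int r1, r2;

static void *thread_a(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Using memory_order_seq_cst instead of relaxed rules out (0,0). */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```

If, as the author argues, hardware moves toward sequential consistency, ordinary programmers never have to reason about outcomes like this, which is exactly the point.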

Power. ISCA 98 brought a whole new vocabulary to architects. Terms such as power, energy, energy-delay product, BIPS per watt, and so on would henceforth be part of the research parlance, with real debate about which quantity was most important to optimize. The slide that defined the power problem34 plotted process generation on the x-axis and power per unit area on a log-scale y-axis, with points for various Intel processors, a hot plate, a nuclear reactor, a rocket nozzle, and the surface of the sun. The message was clear: change something, or processors would soon have to be cooled by some technology capable of cooling the surface of the sun, clearly a ridiculous design point. The research on power was vast, starting with techniques for measuring power5,42 in microarchitectures. Since then, authors have either included a power analysis of their work in their papers or been asked for one by reviewers!
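
For readers outside the area, the new vocabulary reduces to a few textbook relations (the formulas below are standard definitions, not taken from the article): energy is power integrated over time, the energy-delay product penalizes designs that save energy only by running slowly, and BIPS per watt is simply performance per unit power.

```latex
E = \int P \, dt \approx P \cdot t,
\qquad
\mathrm{EDP} = E \cdot t,
\qquad
\mathrm{BIPS/W} = \frac{\text{billions of instructions per second}}{\text{power (watts)}}
```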

In reality, at the microarchitectural level, dynamic voltage/frequency scaling27 (DVFS), a circuit technique that reduces operating frequency and supply voltage when processors are idle or need only operate at reduced performance, and clock gating are highly effective. Several ideas for reducing power beyond DVFS and clock gating have been proposed, and they do work, but the most bang for the buck comes from doing these two techniques well. Looking forward, power will continue to cap performance and drive design considerations. The macroscopic environment in which we consider power issues has changed slightly since 1998, however. The massive power consumption of data centers sends the companies that operate them shopping for the best physical and regulatory environment in which to obtain cheap energy. These companies will benefit from multicore devices, as their software is task-parallel, and using multiple simple cores is a more energy-efficient way to compute than using single complex CPUs. The usefulness of portable devices is also effectively constrained by power, as improvements in battery technology continue to come in single-digit percentages. Thus, architects will continue to consider power in their ideas, as it remains an important design consideration.

Design Complexity. In the mid-1990s, a community of architects began to focus on the complexity of modern CPU designs.31 Processors today contain approximately 1,000 times more core (non-cache) transistors than those of 30 years ago. It is just not possible to produce a bug-free design for such a complex device with the engineering methodologies we currently employ. Such complex designs are also difficult to innovate on, as design changes cannot be reasoned about locally. Moreover, large monolithic designs often have long wires, which consume power and constrain the clock cycle. These motivations spurred several projects to propose fairly radical changes to the processing model.26,36,37 A wealth of less radical, more localized solutions were also developed, among them ways to reduce instruction-scheduling logic,14 reduce the complexity of out-of-order structures,9,38 and provide a perceived increase in processor issue width through coarser management of fetched instructions.4

Design complexity is still an issue today, but the switch to multicore has effectively halted the growth in core complexity. With processor vendors banking their future on multicore, they expect performance to come from additional cores, not more complex ones. Moreover, there are strong arguments that, if thread parallelism is available in applications, vendors will switch to more energy-efficient, simpler cores. In effect, the trend in core complexity could actually reverse. How far this trend will go no one knows, but if software does indeed catch up and become thread-parallel, we could see heterogeneous multicore devices, with one or a handful of complex cores and a sea of simple, reduced-ISA ones,43 providing the most performance per dollar and per watt.

Reliability. In 2001, concerns about both hard and soft faults began to appear at ISCA. Yet another new vocabulary term appeared in our literature: the high-energy particle. As silicon feature sizes shrink, the quantity of charge held on any particular wire in a microprocessor is also reduced. Normally this is a benefit (lower power, faster switching), but it also means the charge on that wire can be on the order of the charge induced by an alpha particle striking the silicon lattice. This is not a new problem: “hardened” microprocessors have been built for decades for the space industry, as electronics in space must operate without the natural high-energy-particle absorption the atmosphere provides. Now our earthbound devices must also deal with this problem. Architects proposed a cornucopia of techniques to deal with faults, from the radical, such as alternative processor designs2 and ways to use simultaneous multithreaded devices,24,39 to the more easily adoptable by industry, such as cache designs with better fault resilience. Important work25 also better characterized which parts of the microarchitecture are actually susceptible to a dynamic fault.

Reliability continues to play an important part in architecture research, but the future presents some differing technology trends. It is this author’s opinion that Moore’s Law will not stop anytime soon, but its continuation will not come from shrinking feature sizes down to a handful of atoms in width.44 Rather, die stacking will continue to provide ever more chip real estate. These dies will have a fixed (or even larger) feature size, and thus the growth in dynamic faults due to shrinking features should actually stop. Moreover, if multicore does prove to be a market success, then reliability can be achieved without enormous complexity: processors with manufacturing faults can be mapped out, and for applications that require high reliability, multiple cores can be used to perform the computation redundantly. Nevertheless, despite this positive long-term outlook, work to improve reliability will always have purpose, as improved reliability leads directly to improved yields (and, in the future, improved performance if redundant cores are not required) and thus reduced costs.

Evaluation techniques. How architects do research has changed dramatically over the decades. When ISCA first started in the 1970s, papers typically provided paper designs and qualitative or simple analytical arguments for an idea’s effectiveness. Research techniques changed significantly in the early 1980s with the ability to simulate new architecture proposals, and thus to provide quantitative evidence to back up intuition. Simulation and quantitative approaches have their place, but, misused, they provide an easy way to produce a lot of meaningless but convincing-looking data. Sadly, it is now commonly accepted in our community that the absolute value of any data presented in a paper isn’t meaningful. We take solace in the hope that the trends, the relative differences between two data points, likely have a corresponding difference in the real world.

As an engineer, I find this approach to our field sketchy but workable; as a scientist, I find it a terrible place to be. It is nearly impossible to do something as simple as reproduce the results in a paper. Doing so from the paper alone requires starting from the same simulation infrastructure as the authors, implementing the idea as the authors did, and then executing the same benchmarks, compiled with the same compiler with the same settings, as the authors. Starting from scratch on this isn’t tractable, and the only real way to reproduce a paper’s results is to ask the authors to share their infrastructure. Another, more insidious problem with simulation is that it is too easy to make mistakes when implementing a component model. Because it is common, and even desirable, to separate functional ISA modeling from performance modeling, these performance-model errors can go unnoticed, leading to entirely incorrect data and conclusions. Despite these drawbacks, quantitative data is seductive to reviewers, and simulation is the most labor-efficient way to produce it.

Looking forward, the picture is muddled. Simulation will continue to be the most important tool in the computer architect’s toolbox. The need to model ever more parallel architectures, however, will force the community to continue exploring different modeling techniques because, for the moment, the tools used in computer architecture research are built on single-threaded code bases. Thus, simulating an exponentially increasing number of CPU cores means an exponential increase in simulation time. Fortunately, several paths forward exist. Work on high-level performance models13,28 provides accurate relative performance data quickly, suitable for coarsely mapping a design space. Techniques to sample35,41 simulation data enable architects to explore longer-running simulations with reasonable confidence. Finally, renewed interest in prototyping and in using FPGAs for simulation40 will allow architects to explore ideas that require cooperation with language and application researchers, as the speed of FPGA-based simulation is just fast enough to be usable by software developers.

There are several advantages to pre-built and shared tools for architecture research. They are enablers, allowing research groups to avoid starting from scratch. Shared tools have another benefit: the bugs and inaccuracies in them can be revealed and fixed over time. Shared tools also make it easier to re-create other people’s work. There has been, and will continue to be, a downside to the availability of pre-built tools, however. Just as SimpleScalar7 created a flood of research on superscalar microarchitecture, the availability of pre-canned tools and benchmarks for CMPs will create a flood of research that is one delta away from existing CMP designs. But is this the type of research academics should be conducting? As academics, shouldn’t we be looking much farther downfield, to the places where industry is not yet willing to go? This is an age-old quandary in our community, and the debate will certainly continue for the foreseeable future.

In the computer architecture field there is a cynical saying that goes something like “we design tomorrow’s systems with yesterday’s benchmarks.” This author finds the statement extreme, but there is some underlying merit to it. For example, there are far more managed-code and scripting-language developers out there than C/C++ ones, yet the majority of benchmarks used in our field are written in C. Fortunately, this is changing, with newer benchmarks such as SPECjvm and SPECjbb, and a few researchers are starting to focus on the performance issues of managed and scripted code. Looking forward, there is a very real need for realistic multithreaded benchmarks. Recent work suggests a kernel-driven approach is sufficient.33 As with the whole of architecture evaluation techniques, the jury is still out on the proper methodology.

Instruction-Level Parallelism. Finally, a large number of architects, myself among them, are still putting enormous effort into finding additional instruction-level parallelism (ILP). Some of these architects don’t have complete faith that multicore will be a success. Others recognize that improvements in single-threaded performance benefit multicore as well, as parts of applications will be sequential or require a few threads to execute quickly. Over the years, these researchers have sought ILP in every nook and cranny of the design space: new instruction set architectures, new execution models, better branch predictors, caches, register management, instruction scheduling, and so on. The list of areas explored is endless.


Alongside the development of x86 microprocessors in the 1990s and 2000s, Intel and HP sank enormous effort and dollars into developing another line of processors, Itanium,22 designed to gain performance from ILP. Itanium is a Very Long Instruction Word (VLIW) processor, in the mold of far earlier work on the subject.15 Such processors promise performance from ILP at reduced complexity compared to superscalar designs, by relying on sophisticated compilation technology. VLIW is a fine idea; it communicates more semantic knowledge about fine-grained parallelism from the software to the hardware. If such an approach is technically useful, why don’t you have an Itanium processor on your desktop? In a nutshell, such processors never achieved a price point that fit well in the commodity PC market. Moreover, in order to maintain binary compatibility with x86, sophisticated binary translation mechanisms had to be employed, and after such translation existing code saw little to no performance benefit from executing on Itanium. Consumers were loath to spend more on a system that was no faster, if not slower, than the cheaper alternative, for the promise that faster native-code applications would someday arrive. There is a lesson here for multicore systems as well: without tangible benefits, consumers will not spend money on new hardware just for its technical superiority.

Does ILP still matter? This author would argue it does. As mentioned earlier, even parallel programs have sequential parts. Legacy code still matters to the IT industry. Is there more ILP to be had? This is a more difficult question to answer. The seminal work in this area21 suggests there is. Extracting it from applications, however, is no trivial matter; the low-hanging fruit was gone before I even entered the field! What is required is aggressive speculation to address the memory wall, the inherent difficulty of predicting certain branches, and the false control and memory dependencies introduced by the imperative programming model. This must be carried out by architectures that are simple to design and validate, lack monolithic control structures, and are backward compatible, if not with the binaries, then with the programming model.

What are we doing now? A look at the ISCA 2007 conference program provides a good overview of the type of research being done in our community. A survey of the papers published that year reveals the following: 18 papers focused on multicore (eight on core and memory design, six on transactional programming, four on on-chip interconnect); six focused on single-core devices and/or applications; six on special-purpose or streaming/media devices; four on power reduction; and three fell in the general area of “beyond CMOS.” Figure 3 extends this data out over the last seven years of ISCA, building on the work of Hill,45 who tracked papers published in ISCA by category from 1973 to 2001. That data showed a precipitous rise and fall of interest in multiprocessor research, while the data from the last seven years depicts a renewed and vigorous multiprocessor research environment.

The Most Exciting Time

In my lifetime, this is the most exciting time for computer architecture research; indeed, people far older and wiser than me46 contend it is the most exciting time for architecture since the invention of the computer. What makes it exciting is that architecture is in the unique position of being at the center of the future of computer science and the IT industry. Innovations in architecture will impact everything from education to determining the new winners and losers in the IT business. Central to this excitement, for me as an academic, is that there is no clear way to proceed. Multicore devices are being sold, and parts of the software ecosystem will utilize them, but the research and product space is far more fluid and open to new ideas now than ever before. Thus, while we are central to the future directions of computer science, we really lack a clear vision for how to proceed. What could be better than that?

Figures

Figure 1. CPU clock speed.

Figure 2. SPEC integer CPU performance over a 25-year time span.

Figure 3. Papers published in ISCA 2001–2006.

References

    1. Agarwal, V., Hrishikesh, M.S., Keckler, S.W., and Burger, D. Clock rate versus IPC: The end of the road for conventional microarchitectures. SIGARCH Comput. Archit. News 28, 2, (2000), 248–259.

    2. Austin, T.M. DIVA: A reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (1999), 196.

    3. Borch, E. Tune, E., Manne, S., and Emer, J. Loose loops sink chips. In Proceedings of the Eighth International Symposium on High-Performance Computer Architecture. Feb. 2–6, 2002, 299–310.

    4. Bracy, A., Prahlad, P., and Roth, A. Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth. In Proceedings of the 37th Annual IEEE/ ACM International Symposium on Microarchitecture. IEEE Computer Society, Washington, D.C., 2004, 18–29.

    5. Brooks, D., Tiwari, V., and Martonosi, M. Wattch: A framework for architectural-level power analysis and optimizations. SIGARCH Comput. Archit. News 28, 2, (2000), 83–94.

    6. Brooks, D.M., Bose, P., Schuster, S.E. Jacobson, H., Kudva, P.N. Buyuktosunoglu, A., Wellman, J-D., Zyuban, V., Gupta, M., and Cook, P.W. Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro 20, 6 (2000), 26–44.

    7. Burger, D., and Austin, T.M. The simplescalar tool set, version 2.0. SIGARCH Comput. Archit. News 25, 3 (1997), 13–25.

    8. Ceze, L., Tuck, J., Montesinos, P., and Torrellas, J. Bulksc: Bulk enforcement of sequential consistency. SIGARCH Comput. Archit. News 35, 2 (2007), 278–289.

    9. Cristal, A., Ortega, D., Llosa, J., and Valero, M. Out-of-order commit processors. In Proceedings of the 10th International Symposium on High-Performance Computer Architecture (2004), 48.

    10. Dagum, L., and Menon, R. OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science and Engineering 5, 1 (Jan.–Mar. 1998), 46–55.

    11. Draper, J., Chame, J., Hall, M., Steele, C., Barrett, T., LaCoss, J., Granacki, J., Shin, J., Chen, C., Kang, C.W., Kim, I., and Daglikoca, G. The architecture of the diva processing-in-memory chip. In Proceedings of the 16th International Conference on Supercomputing., ACM, NY, 2002, 14–25.

    12. Eden, A.N., and Mudge, T. The YAGS branch prediction scheme. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (Nov. 30–Dec. 2, 1998), 69–77.

    13. Eeckhout, L., Stougie, B., Bosschere, K.D., and John, L.K. Control flow modeling in statistical simulation for accurate and efficient processor design studies. SIGARCH Comput. Archit. News 32, 2 (2004), 350.

    14. Ernst, D., Hamel, A., and Austin, T. Cyclone: A broadcast-free dynamic instruction scheduler with selective replay. In Proceedings of the 30th Annual International Symposium on Computer Architecture (June 9–11, 2003), 253–262.

    15. Fisher, J.A. Very long instruction word architectures and the eli-512. In Proceedings of the 10th Annual International Symposium on Computer Architecture (Los Alamitos, CA, 1983). IEEE Computer Society Press, 140–150.

    16. Hill, M.D. Multiprocessors should support simple memory-consistency models. IEEE Computer 31, 8 (1998), 28–34.

    17. Hinton, G., Upton, M., Sager, D., Boggs, D., Carmean, D., Roussel, P., Chappell, T., Fletcher, T., Milshtein, M., Sprague, M., Samaan, S., and Murray, R. A 0.18-μm CMOS IA-32 processor with a 4-GHz integer execution unit. IEEE Journal of Solid-State Circuits 36, 11 (Nov. 2001), 1617–1627.

    18. Jouppi, N.P. Improving direct-mapped cache performance by the addition of a small fully associative cache and prefetch buffers. SIGARCH Comput. Archit. News, 18, 3a (1990), 364–373.

    19. Kang, Y., Huang, W., Yoo, S.-M., Keen, D., Ge, Z., Lam, V., Torrellas, J., and Pattnaik, P. FlexRAM: Toward an advanced intelligent memory system. In Proceedings of the International Conference on Computer Design (1999), 192.

    20. Kogge, P., Sunaga, T., Miyataka, H., Kitamura, K., and Retter, E. Combined DRAM and logic chip for massively parallel systems. In Proceedings of the Conference on Advanced Research in VLSI (1995), 4.

    21. Lam, M.S., and Wilson, R.P. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture. ACM, NY, 1992, 46–57.

    22. McNairy, C., and Soltis, D. Itanium 2 processor microarchitecture. IEEE Micro 23, 2 (Mar.–Apr. 2003), 44–55.

    23. Moore, G. Cramming more components onto integrated circuits. Electronics (Apr. 1965), 114–117.

    24. Mukherjee, S., Kontz, M., and Reinhardt, S., Detailed design and evaluation of redundant multithreading alternatives. In Proceedings of the 29th Annual International Symposium on Computer Architecture (2002), 99–110.

    25. Mukherjee, S., Weaver, C., Emer, J., Reinhardt, S., and Austin, T., A systematic methodology to compute the architectural vulnerability factors for a highperformance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 3–5, 2003), 29–40.

    26. Nagarajan, R., Sankaralingam, K., Burger, D., and Keckler, S.W. A design space evaluation of grid processor architectures. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, Washington, D.C. 2001, 40–51.

    27. Nielsen, L.S., and Niessen, C. Low-power operation using self-timed circuits and adaptive scaling of the supply voltage. IEEE Trans. Very Large Scale Integr. Syst., 2, 4 (1994), 391–397.

    28. Oskin, M., Chong, F.T., and Farrens, M. HLS: Combining statistical and symbolic simulation to guide microprocessor designs. SIGARCH Comput. Archit. News 28, 2 (2000), 71–82.

    29. Oskin, M., Chong, F.T., and Sherwood, T. Active pages: A computation model for intelligent memory. SIGARCH Comput. Archit. News 26, 3 (1998), 192–203.

    30. Pai, V.S., Ranganathan, P., Adve, S.V., and Harton, T. An evaluation of memory consistency models for shared-memory systems with ilp processors. SIGPLAN Notices 31, 9 (1996), 12–23.

    31. Palacharla, S. Complexity-effective superscalar processors. Ph.D. thesis, 1998.

    32. Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R., and Yelick, K. A case for intelligent RAM. IEEE Micro 17, 2 (Mar.–Apr. 1997), 34–44.

    33. Patterson, D., Keutzer, K., Asanovic, K., Yelick, K., and Bodik, R. The landscape of parallel computing research: A view from Berkeley. 2007.

    34. Pollack, F. Keynote: New microarchitecture challenges in the coming generations of CMOS process technologies. 32nd Annual International Symposium on Microarchitecture, 1999.

    35. Sherwood, T., Perelman, E., Hamerly, G., and Calder, B. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, NY, 2002, 45–57.

    36. Swanson, S., Michelson, K., Schwerin, A., and Oskin, M. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (2003). IEEE Computer Society, Washington, D.C., 291.

    37. Taylor, M.B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, F., Johnson, P., Lee, J.-W., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Strumpen, V., Frank, M., Amarasinghe, S., and Agarwal, A. The raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro 22, 2 (2002), 25–35.

    38. Valero, M., Gonzalez, A., Topham, N.P., and Cruz, C. Multiple-banked register file architectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture (2000), 316.

    39. Vijaykumar, T., Pomeranz, I., and Cheng, K. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture (2002), 87–98.

    40. Wawrzynek, J., Patterson, D., Oskin, M., Lu, S.-L., Kozyrakis, C., Hoe, J.C., Chiou, D., and Asanovic, K. RAMP: Research accelerator for multiple processors. IEEE Micro 27, 2 (2007), 46–57.

    41. Wunderlich, R., Wenisch, T., Falsafi, B., and Hoe, J. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (June 9–11, 2003), 84–95.

    42. Ye, W., Vijaykrishnan, N., Kandemir, M., and Irwin, M.J. The design and use of simplepower: A cycle-accurate energy estimation tool. In Proceedings of the 37th Conference on Design Automation. ACM, NY (2000), 340–345

    43. http://www.news.com/2100-1006_3-6119618.html.

    44. http://www.itrs.net/.

    45. http://pages.cs.wisc.edu/~markhill/mp2001.html.

    46. Personal communication with Burton Smith.

Mark Oskin (oskin@cs.washington.edu) is an associate professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.

    DOI: http://doi.acm.org/10.1145/1364782.1364799
