On-chip hardware coherence can scale gracefully as the number of cores increases.
The following letter was published in the Letters to the Editor in the September 2012 CACM (http://cacm.acm.org/magazines/2012/9/154585).
To appreciate why a key assumption of "Why On-Chip Cache Coherence Is Here to Stay" by Milo M.K. Martin et al. (July 2012)that on-chip multicore architectures mandate local cachesmay be problematic, consider the following examples of a shared variable in a parallel program a processor would write into:
Example 1. No other processor seeks to read or write the variable, in which case little harm is done copying the variable to a location local to the processor (register or scratch pad), accessing it as needed, and then storing it back to a shared location or variable; and
Example 2. Other processors need to read from and/or write into that variable.
The two local-cache cases compared in the article, with or without cache coherence, require considerable traffic to ensure coherent access to the variable. However, if all write updates in a parallel program are done to a shared location using prefix-sum or other transactional-memory type instruction, traffic is proportional to the actual number of accesses to that variable by all processors. This way of performing write updates represents a significant improvement over the automatic cache coherence advocated by Martin et al. in which every access requires broader notification.
Overall, the bet Martin et al. advanced in the article on large private caches for parallel on-chip computing has yet to prove itself as a good allocation of silicon resources; for example, in a 1,000-core design, one more word in each private cache could mean 1,000 fewer words in shared cachenot necessarily a good deal in terms of ease of programming and overall performance.
My own recent research(1),(2) at the University of Maryland suggests the traditional emphasis on private caches could be the main reason programming current multicores is still too difficult for most programmers. Though the code-backward-compatibility argument is compelling for serial code, the difficulty of parallel programming in general, and for locality in particular, remains the biggest obstacle inhibiting adoption of multicores.
College Park, MD
(1) Vishkin, U. Computer Memory Architecture Methods for Hybrid Serial and Parallel Computers; U.S. Patents 7,707,388 and 8,145,879.
(2) Vishkin, U. Using simple abstraction to reinvent computing for parallelism. Commun. ACM 54, 1 (Jan. 2011), 7585.
Displaying 1 comment