Communications of the ACM

Contributed articles

Why On-Chip Cache Coherence Is Here to Stay

[Image: colored-blocks cityscape. Credit: Dave Bollinger]

On-chip hardware coherence can scale gracefully as the number of cores increases.

CACM Administrator

The following letter was published in the Letters to the Editor in the September 2012 CACM (
--CACM Administrator

To appreciate why a key assumption of "Why On-Chip Cache Coherence Is Here to Stay" by Milo M.K. Martin et al. (July 2012), that on-chip multicore architectures mandate local caches, may be problematic, consider the following examples of a shared variable in a parallel program that a processor would write into:

Example 1. No other processor seeks to read or write the variable, in which case little harm is done copying the variable to a location local to the processor (register or scratch pad), accessing it as needed, and then storing it back to a shared location or variable; and

Example 2. Other processors need to read from and/or write into that variable.

The two local-cache cases compared in the article, with or without cache coherence, require considerable traffic to ensure coherent access to the variable. However, if all write updates in a parallel program are done to a shared location using a prefix-sum or other transactional-memory-type instruction, traffic is proportional to the actual number of accesses to that variable by all processors. This way of performing write updates represents a significant improvement over the automatic cache coherence advocated by Martin et al., in which every access requires broader notification.
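As a rough illustration of the prefix-sum style of update described above, here is a minimal software sketch in Python. It assumes prefix-sum behaves like an atomic fetch-and-add: the shared location is incremented and the caller receives the value it held just before. The names `SharedCounter`, `prefix_sum`, and `run` are hypothetical, and the lock merely stands in for hardware atomicity. The sketch models the semantics and counts accesses; it does not measure real interconnect traffic.

```python
import threading


class SharedCounter:
    """Hypothetical shared location updated only via an atomic prefix-sum."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stand-in for a hardware atomic primitive
        self.accesses = 0              # one count per update issued by any worker

    def prefix_sum(self, increment):
        """Atomically add `increment`; return the value held before the add."""
        with self._lock:
            old = self._value
            self._value += increment
            self.accesses += 1
            return old

    def read(self):
        with self._lock:
            return self._value


def run(num_workers=8, updates_per_worker=100):
    """Have several workers update one shared location; return a summary."""
    counter = SharedCounter()
    seen = [[] for _ in range(num_workers)]  # each worker writes only its own list

    def worker(i):
        for _ in range(updates_per_worker):
            seen[i].append(counter.prefix_sum(1))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    all_seen = sorted(v for per_worker in seen for v in per_worker)
    return counter.read(), counter.accesses, all_seen
```

In this model the number of operations on the shared location equals the number of updates issued, and every caller receives a distinct prior value, so no two workers ever observe the same slot.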

Overall, the bet Martin et al. advanced in the article on large private caches for parallel on-chip computing has yet to prove itself as a good allocation of silicon resources; for example, in a 1,000-core design, one more word in each private cache could mean 1,000 fewer words in shared cache, not necessarily a good deal in terms of ease of programming and overall performance.
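The trade-off in the 1,000-core example can be checked with a few lines of arithmetic. The sketch below assumes a fixed total on-chip budget measured in words and an equal silicon cost per word in private and shared cache; both are simplifications, and the function name and budget figure are hypothetical.

```python
def shared_words_left(total_words, cores, private_words_per_core):
    """Words left for the shared cache under a fixed on-chip word budget."""
    private_total = cores * private_words_per_core
    if private_total > total_words:
        raise ValueError("private caches exceed the total budget")
    return total_words - private_total


budget = 4_000_000  # hypothetical total budget, in words
before = shared_words_left(budget, cores=1000, private_words_per_core=1024)
after = shared_words_left(budget, cores=1000, private_words_per_core=1025)
print(before - after)  # 1000 words moved out of the shared cache
```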

My own recent research(1),(2) at the University of Maryland suggests the traditional emphasis on private caches could be the main reason programming current multicores is still too difficult for most programmers. Though the code-backward-compatibility argument is compelling for serial code, the difficulty of parallel programming in general, and for locality in particular, remains the biggest obstacle inhibiting adoption of multicores.

Uzi Vishkin
College Park, MD


(1) Vishkin, U. Computer Memory Architecture Methods for Hybrid Serial and Parallel Computers; U.S. Patents 7,707,388 and 8,145,879.

(2) Vishkin, U. Using simple abstraction to reinvent computing for parallelism. Commun. ACM 54, 1 (Jan. 2011), 75–85.
