
Technical Perspective: Rethinking Caches for Throughput Processors


Caches have been a mainstay of computer design for nearly 50 years and a continued subject of academic and industry research. Caches are intended to capture locality in the instructions or data of a program, enabling a computer system to provide the illusion of a large fast memory, when in fact the underlying hardware only provides physical structures that are small and fast (caches) or large and slow (main memory). CPU systems today have as many as four levels of cache spanning on-chip SRAM to off-chip embedded DRAM.

CPU systems have typically used caches to hide memory latency. Without caching, a single-threaded program would spend the vast majority of its time stalled, waiting for data to return from off-chip DRAM. However, throughput-oriented computing systems, such as vector processors and GPUs, can exploit parallelism to tolerate memory latency, reducing their reliance on caches for latency reduction. GPUs in particular use massive multithreading to tolerate latency: when one thread executes a load that accesses main memory, other threads can execute, keeping the processor busy. Rather than being sensitive to memory latency, throughput-oriented systems tend to be sensitive to memory bandwidth. As a result, their memory hierarchies have traditionally employed caches to reduce DRAM bandwidth demand rather than to reduce latency.
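To make the latency-hiding argument concrete, here is a back-of-the-envelope sketch in Python (the latency and compute figures are hypothetical, not drawn from the paper): by a Little's-law style argument, the number of concurrent threads needed to keep the processor busy is roughly the memory latency divided by the compute time each thread performs between memory requests.

```python
# Back-of-the-envelope latency hiding via multithreading (hypothetical numbers).
# Threads needed ~= memory latency / compute cycles per thread between loads.

memory_latency_cycles = 400   # assumed DRAM round-trip latency
cycles_between_loads = 8      # assumed compute per thread between memory requests

threads_needed = memory_latency_cycles // cycles_between_loads
print(f"Threads needed to hide memory latency: {threads_needed}")  # -> 50
```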

These different objectives for the cache have led to different trade-offs in modern GPU and CPU systems. For example, each of the 15 streaming multiprocessors on a contemporary NVIDIA Kepler GPU has up to 2,048 threads sharing a 256KB register file and a total of 96KB of level-1 data cache. The 15 streaming multiprocessors share 1.5MB of on-chip level-2 cache. Thus, with the maximum number of threads executing, each thread has private access to 128 bytes of register file, a per-thread share of 48 bytes of level-1 cache, and about 50 bytes of level-2 cache. By contrast, the 12-core IBM POWER8 processor provides about three orders of magnitude more cache capacity, with 8KB of L1 and 64KB of L2 cache per thread. While GPU caches focus on capturing data shared among many threads, CPU caches seek to capture temporal locality within a single thread.
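The per-thread figures are simple division; the short Python calculation below redoes the arithmetic with the capacities quoted above (the CPU values are the per-thread shares stated in the text, not derived from the POWER8's cache sizes).

```python
# Per-thread on-chip storage, using the Kepler GPU figures quoted above.
KB = 1024
threads_per_sm = 2048
num_sms = 15

gpu_regfile_per_thread = 256 * KB // threads_per_sm             # 128 bytes
gpu_l1_per_thread = 96 * KB // threads_per_sm                   # 48 bytes
gpu_l2_per_thread = 1.5 * KB * KB / (num_sms * threads_per_sm)  # ~51 bytes

cpu_l1_per_thread = 8 * KB    # per-thread L1 share quoted for the POWER8
cpu_l2_per_thread = 64 * KB   # per-thread L2 share quoted for the POWER8

print(gpu_regfile_per_thread, gpu_l1_per_thread, round(gpu_l2_per_thread))
print(f"CPU-to-GPU L2 ratio per thread: {cpu_l2_per_thread / gpu_l2_per_thread:.0f}x")
```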

However, as GPUs have become mainstream parallel processing engines, many applications targeting GPUs now have data locality more amenable to traditional caching. The challenge then is to design caching systems that exploit both spatial and temporal locality, without scaling the cache capacity by the orders of magnitude per thread required to match CPU architectures.

Historically, data locality optimizations have focused solely on the cache itself, exploring allocation policies, eviction policies, and basic cache organization. The work by Rogers et al. in the following paper takes a different approach to improving cache behavior and overall system performance, focusing instead on thread scheduling for multithreaded processors. The basic idea is simple: when a thread has data locality, schedule it more often to increase the likelihood that its data stays in the cache. This has the effect of reducing the number of threads executing in a given window of time, which (perhaps somewhat counterintuitively) increases throughput and overall performance. Thus, under some circumstances, "less is more."
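To make the "less is more" intuition concrete, the following is a minimal sketch of locality-aware throttling in Python. It is a hypothetical illustration, not the scheduling algorithm from the paper: it simply caps the number of active threads so that each one's estimated working set fits in the level-1 data cache, and reverts to the full thread count when no per-thread reuse is detected.

```python
# Hypothetical sketch of locality-aware thread throttling (not the paper's
# exact scheduler): keep only as many threads active as the L1 can hold
# working sets for, and run all threads when there is no reuse to protect.

def active_thread_limit(l1_bytes, working_set_bytes, max_threads):
    """Return how many threads the scheduler should keep active this interval."""
    if working_set_bytes <= 0:
        return max_threads                      # no reuse detected: run wide
    fit = l1_bytes // working_set_bytes         # threads whose data fits in L1
    return max(1, min(max_threads, fit))

# Example: 96KB L1, each thread reusing roughly 4KB of data, 2,048 resident threads.
print(active_thread_limit(96 * 1024, 4 * 1024, 2048))   # -> 24 active threads
```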


The architecture described in the paper has a number of virtues. First, it requires no programmer input or hints about how many threads to keep active. Second, it automatically adapts to the locality behavior of the program, ultimately reverting to the baseline "maximum threads" approach when thread-level data locality is insufficient. In addition to the performance benefits shown in the study, this approach to improving locality also reduces bandwidth requirements and the power consumed in transferring data across the chip boundary.

This paper illustrates two aspects of contemporary computer architecture research. The first is the rethinking of conventional wisdom that is required when a new architectural model, technology, or application domain emerges. The second is the opportunity available when co-optimizing across traditional computer architecture boundaries. Thread scheduling in particular has the potential to improve not just the L1 cache, as described in this paper, but also the secondary caches, DRAM, and interconnect architecture.
