Caches have been a mainstay of computer design for nearly 50 years and remain a subject of ongoing academic and industry research. Caches are intended to capture locality in a program's instructions or data, enabling a computer system to provide the illusion of a large, fast memory when in fact the underlying hardware provides only physical structures that are either small and fast (caches) or large and slow (main memory). CPU systems today have as many as four levels of cache, spanning on-chip SRAM to off-chip embedded DRAM.
CPU systems have typically used caches as a means to hide memory latency. Without caching, a single-threaded program would spend the vast majority of its time stalled, waiting for data to return from off-chip DRAM. However, throughput-oriented computing systems, such as vector processors and GPUs, are able to exploit parallelism to tolerate memory latency, reducing their reliance on the latency-reduction benefits of a cache. GPUs in particular use massive multithreading to tolerate latency: when one thread executes a load instruction that accesses main memory, other threads can execute, keeping the processor busy. Rather than being sensitive to memory latency, throughput-oriented systems tend to be sensitive to memory bandwidth. As a result, their memory hierarchies have traditionally employed caches to reduce DRAM bandwidth demand rather than to reduce latency.
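To make the required amount of parallelism concrete, a rough application of Little's law relates the memory bandwidth a throughput processor must sustain, the latency of each access, and the number of memory requests that must be kept in flight. The bandwidth, latency, and access size used below are illustrative assumptions, not figures for any particular GPU:
\[
\text{requests in flight} \;\approx\; \frac{\text{bandwidth} \times \text{latency}}{\text{access size}}
\;=\; \frac{400\,\text{GB/s} \times 400\,\text{ns}}{128\,\text{B}}
\;=\; 1250 .
\]
Sustaining on the order of a thousand independent outstanding memory accesses is far more than one thread, or even a handful of threads, can generate, which is why GPUs lean on thousands of concurrently scheduled threads rather than deep caches to keep the memory system busy.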