The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.
Have you ever tried to do so many things that you could not get anything done? There is a classic psychological principle, known as unitary-resource theory, which states that the amount of attention humans can devote to concurrently performed tasks is limited in capacity [16]. Attempting to divide this attention among too many tasks at once can have a detrimental impact on your performance. We explore a similar issue in the context of highly multithreaded hardware and use human multitasking as an analogy to help explain our findings. Our paper studies architectures that are built to efficiently switch between tens of tasks on a cycle-by-cycle basis, and asks the question: even though massively multithreaded processors can switch between these tasks quickly, when should they?