When programmers want to improve their program’s performance, they often turn to profilers to tell them what code to optimize. A profiler can measure many aspects of a program’s behavior, such as how many times each method is typically called, how long each method typically takes, and which methods typically lie on the critical path. Unfortunately, this information is not always relevant, and it can often cause programmers to waste their time on optimizations that have little impact on overall performance.
In particular, conventional profilers struggle to help developers optimize multithreaded programs. A simple but useful example is a program in which an initial thread waits for several worker threads to complete. If each worker runs for approximately the same amount of time, then most profilers will report that each worker accounted for an equal share of the execution time. At the same time, optimizing any individual worker will have a minimal impact on overall execution time since the program will only finish after all workers are done. A programmer could waste countless hours optimizing code without making the program run any faster.
In the following paper, Curtsinger and Berger describe a better approach called causal profiling; causal profilers tell programmers exactly how much speed-up bang to expect for their optimization buck. That is, a causal profiler can predict with spooky accuracy that speeding up a line of code by x% will improve overall responsiveness or throughput by y%. At first blush this seems like magic. No tool can make an arbitrary line of code arbitrarily faster and then measure the sped-up line’s impact on overall execution time. And yet Curtsinger and Berger developed a tool called Coz that would seem to do the impossible.
The key observations underlying Coz are that arbitrarily accelerating code is infeasible, but arbitrarily slowing code is quite easy, and all a tool has to do is adjust the relative speeds of code fragments in order to predict how a change in one fragment will impact overall performance. Put another way, Curtsinger and Berger realized one can measure the impact of speeding up a target code fragment by x% by slowing everything but the target by x% and measuring the overall slowdown. This wonderful technique is called a virtual speedup, and it is one of my favorite pieces of work in the past few years. Using virtual speedups for code profiling is such a clever, simple, and powerful idea that you wish you had thought of it yourself.
While the insights underlying Coz are undeniably clever, the implementation and evaluation results are equally impressive.
While the insights underlying Coz are undeniably clever, the implementation and evaluation results are equally impressive. Conceptually, whenever Coz runs a target line of code, it pauses all parallel threads by a delay that is less than the average runtime of the target line. However, naïvely pausing threads each time a target line executes would require heavy code instrumentation that could significantly skew the results and undermine the utility of the profiler. Instead, Coz uses a lightweight sampling scheme that requires no instrumentation at all. In Coz, each thread maintains a local counter and periodically samples its program counter and stack using Linux’s interface to the CPU’s hardware performance counters. When a thread finds it is executing the target method, it increments the global counter. When another thread processes its sample, it compares the value of its local counter to the global counter, and if the local counter is behind, increments the counter and pauses the thread. This sampling scheme is extremely elegant, and it allows Coz to set the virtual speedup of an execution by simply adjusting the ratio of the delay and the sampling period.
Finally, Coz’s predictions proved to be accurate across a wide range of benchmarks. Maybe most impressive, the authors used Coz to fix a really nasty and longstanding performance bug in a deployed hash table. Coz reported the greatest optimization opportunity in a file-compression application was a method that traversed the linked list of a hash-table bucket. Coz continued to identify this particular line even after the authors increased the number of buckets. Upon closer inspection, the authors discovered the table’s hashing function was assigning entries to only 3% of the available buckets. Fixing the buggy hash function required changing three lines of code and led to a nearly 9% end-to-end benchmark speedup. Coz not only identified the bug and predicted the impact of a fix, but it allowed the authors to discover and resolve the problem in just a few hours.
Above all else, what makes Coz noteworthy is that it exemplifies the best kind of systems work: it is elegant, insightful, and practical all at once. Papers that simultaneously achieve all of these qualities are exceedingly rare, and I suspect that practitioners and researchers will continue returning to it for many years to come.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment