All Sliders to the Right

Dear KV,

I work for a company that has what I can only describe as a voracious appetite for expensive servers, but it seems that simply buying systems with “all sliders to the right”—as one of our developers calls it—is not getting us the performance we are expecting. We are buying the biggest boxes with the most cores and the highest CPU frequencies, but the difference between this year’s model and last year’s seems not to be that great. We have optimized all our code for multithreading, and yet, something seems to be not quite right. Our team has been digging into this with various tools, some of which seem to give conflicting answers, but maybe it is just that machines really are not getting that much faster each year, and it is time to step off this expensive merry-go-round for a year or two.
Is Bigger Really Better?

Dear Bigger,

There are many reasons why this year’s model is not any better than last year’s model, and many reasons why performance fails to scale. It is true that the days of upgrading every year and getting a free (wait, not free, but expensive) performance boost are long gone, as we are not really getting single cores that are faster than approximately 4 GHz. One thing many software developers fail to understand at a sufficiently deep level is the hardware on which their software runs.

Those who are working in the cloud have little choice but to trust their cloud provider’s assertions about the performance of any particular instance, but for those of us who still work with real metal (and it seems that you are one of us) there is a plethora of parameters about the underlying hardware that matter to the overall performance of the system. In no particular order, these parameters include the sizes of the various caches, the number of cores, the number of CPUs, and, crucially, the oft-overlooked bus structure of the system.

The diagrams we show computer science students that purport to illustrate how the CPU, memory, and I/O devices are connected in a computer are woefully out of date and often represent an idealized computer circa 1970 or 1975. These diagrams are, of course, complete fictions when you are looking at a modern server, but people still believe them. The most common big server used in a data center is a two-CPU system with a lot of memory and, hopefully, some fast I/O devices inside such as a 10-100G network card and fast flash storage. All the components are fast in and of themselves, but how they are connected may surprise you. The bus structure of a modern server is no longer flat, which means that access to memory and, in particular, I/O devices may go over a much slower path than you assume.

For example, each I/O device, such as a NIC (network interface card) or flash memory device, is close to only one of the two CPUs you have put into that big server, meaning that if your workload is I/O intensive, the second CPU is actually at a disadvantage and, in many cases, is completely useless from a performance standpoint. Each I/O transaction to the second CPU may even have to traverse the first CPU to get to the I/O devices, and the bus between the CPUs may be far more constrained than the connection between either CPU and its memory or cache. It is for all these reasons that sometimes, depending on what your system does, it may be far more cost-effective, and efficient, to buy systems with only a single CPU, because on a single-CPU system all cores are as close as possible to cache, memory, and I/O devices.
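
One way to see which CPU an I/O device sits next to, at least on Linux, is to ask the kernel. The sketch below is a minimal illustration, not a definitive tool: it assumes a Linux system that exposes a numa_node file for each PCI device under /sys/bus/pci/devices, and the function name pci_numa_nodes is made up for this example. A value of -1 means the kernel does not know, or the machine has only one node.

```python
from pathlib import Path


def pci_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI device address to the NUMA node the kernel reports for it.

    Assumes the Linux sysfs layout; returns an empty dict anywhere else.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        try:
            nodes[dev.name] = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # this device exposes no readable numa_node file
    return nodes


if __name__ == "__main__":
    for addr, node in pci_numa_nodes().items():
        label = "unknown" if node < 0 else f"node {node}"
        print(f"{addr}: {label}")
```

If the NIC and the flash storage both report the same node, the threads that do most of the I/O are the ones worth keeping on that node's cores.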

Of course, it is not like the system manufacturers explain any of this. Mostly, you wind up reading various data sheets about the bus structure that exists in whatever system you are buying and then hope the documents you found online—either at the manufacturer’s website or in a more tech-focused news provider, like the ones that talk about high-end gaming systems—are not lying to you. And when I say lying, I mean I am sure the companies are totally honest and upstanding but they forgot to put the errata online. Companies would never lie about the performance problems inherent in their highest-end systems. And if you believe that …

All of this gets a bit complicated because it depends on your workload and what you consider to be important. Are you CPU-bound? If so, then you care more about raw CPU power than anything else. Are you I/O-bound? Then you care more about I/O transactions per second. And then there are the moments when management says, “Just make it go faster.”
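
A rough first test for the CPU-bound versus I/O-bound question is to compare how much CPU time a piece of work consumed with how much wall-clock time it took. The following is only a sketch under simple assumptions: cpu_fraction is a hypothetical helper name, the workload is any Python callable, and the measurement covers the whole process, so a multithreaded workload can legitimately score above 1.0.

```python
import time


def cpu_fraction(workload):
    """Return CPU time divided by wall-clock time for one call of workload().

    Near 1.0 (or above, with several threads) hints at CPU-bound work;
    near 0.0 hints that the process mostly waited, for example on I/O.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    busy = cpu_fraction(lambda: sum(range(5_000_000)))
    idle = cpu_fraction(lambda: time.sleep(0.2))
    print(f"busy loop: {busy:.2f}  sleeping: {idle:.2f}")
```

A single number like this will not tell you where the time went, only which family of questions to ask next.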

Another thing that makes all this difficult, as you point out, is that the tooling for looking at performance is all over the place. There are books written about this stuff (I happen to favor things written by Brendan Gregg—see “the Flame Graph”; https://bit.ly/3XLvxa1), but they really only scratch the surface. It pays to spend a good deal of time reading over those data sheets, thinking about where your code runs and what it needs, and then designing experiments to see if this year’s model really is better than last year’s model for your particular application. In servers, as in much else, it is not the size that matters so much as how you use what you have.
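
As a starting point for such an experiment, the sketch below times the same workload several times and compares medians between two machines. It is an illustration only: time_runs and speedup are hypothetical names, the workload is whatever callable stands in for your application, and a real comparison would also control for warm-up, thermal state, and where the threads and I/O devices sit.

```python
import statistics
import time


def time_runs(workload, repeats=5):
    """Run workload() `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def speedup(old_durations, new_durations):
    """Ratio of median run times; above 1.0 means the new machine was faster."""
    return statistics.median(old_durations) / statistics.median(new_durations)


if __name__ == "__main__":
    # Placeholder numbers standing in for runs collected on two machines.
    last_year = [2.0, 2.2, 2.1]
    this_year = [1.9, 2.0, 2.1]
    print(f"speedup: {speedup(last_year, this_year):.2f}x")
```

Run time_runs on each machine with the same workload and the same input, save the durations, and feed both lists to speedup. Medians are used here because a single slow run should not decide a purchasing question.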