My team and I have spent the past eight weeks debugging an application performance problem in a system we moved to a cloud provider. Now, after celebrating that achievement, we thought we would tell you the story and see if you have any words of wisdom.
In 2016, our management decided that, to save money, we would move all our services from self-hosted servers in two racks in our small in-office data-center to the cloud so we could take advantage of the elastic pricing available from most cloud providers. Our system uses fairly generic, off-the-shelf, open source components, including Postgres and Memcached, to provide the back-end storage to our Web service.
Over the past two years we built up a good deal of expertise in tuning the system for performance, so we thought we were in a good place to understand what we needed when we moved the service to the cloud. What we found was quite the opposite.
Our first problem was very inconsistent response times to queries. The long tail of long queries of our database began to grow the moment we moved our systems into the cloud service, but each time we went to look for a root cause, the problem would disappear. The tools we would normally use to diagnose the issues we found on bare metal also gave far more varied results than expected. In the end, some of the systems could not be allocated elastically but had to be statically allocated, so the service would behave in a consistent manner. The savings management expected were never realized. Perhaps the only bright side is that we no longer have to maintain our own deployment tools, because deployment is handled by the cloud provider.
We wonder, is this really a common problem, or could we have done something that would have made this transition less painful?
Rained on Our Parade
Clearly, your management has never heard the phrase, "You get what you pay for." Or perhaps they heard it and did not realize it applied to them. The savings in cloud computing come at the expense of a loss of control over your systems, which is summed up best by the popular sticker "The Cloud Is Just Other People's Computers."
All the tools you built during those last two years work only because they have direct knowledge of the system components down to the metal, or at least as close to the metal as possible. Once you move a system into the cloud, your application is sharing resources with other, competing systems, and if you are taking advantage of elastic pricing, then your machines may not even be running until the cloud provider deems them necessary. Request latency is dictated by the immediate availability of resources to answer the incoming request. These resources include CPU cycles, data in memory, data in CPU caches, and data on storage. In a traditional server, all these resources are controlled by your operating system at the behest of the programs running on top of the operating system; but in a cloud, there is another layer, the virtual machine, which adds another turtle to the stack, and even when it is turtles all the way down, that extra turtle is going to be the source of resource variation. This is one reason you saw inconsistent results after you moved your system to the cloud.
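One way to make that variation visible is to compare the median latency with the tail rather than looking at an average. The following is a minimal sketch, not a description of any tool your team used: the function name `latency_summary` and both sample sets are hypothetical, invented only to show how a small fraction of slow requests leaves the median untouched while the 99th percentile balloons.

```python
import statistics


def latency_summary(samples_ms):
    """Return the median, 99th percentile, and maximum of latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p99": cuts[98],
        "max": max(samples_ms),
    }


# Hypothetical data: a steady bare-metal run, and a run where 5 percent of
# requests stall while waiting for a contended resource.
steady = [10.0] * 100
noisy = [10.0] * 95 + [200.0] * 5

for name, samples in (("steady", steady), ("noisy", noisy)):
    print(name, latency_summary(samples))
```

Both runs report the same median, so a dashboard that tracks only typical latency would show nothing wrong; the difference appears only in the tail.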
Let's think only about the use of CPU caches for a moment. Modern CPUs gain quite a bit of their overall performance from having large, efficiently managed L1, L2, and sometimes L3 caches. The CPU caches are shared among all programs, but in the case of a virtualized system with several tenants, the amount of cache available to any one program, such as your database or Memcached server, decreases linearly with the addition of each tenant. If you had a beefy server in your original colocation facility, you were definitely gaining a performance boost from the large caches in those CPUs. The very same server running in a cloud provider is going to give your programs drastically less cache space with which to work.
With less cache, fewer things are kept in fast memory, meaning your programs now need to go to regular RAM, which is often much slower than cache. Those accesses to memory are now competing with other tenants that are also squeezed for cache. Therefore, although the real server on which the instances are running might be much larger than your original hardware (perhaps holding nearly a terabyte of RAM), each tenant receives far worse performance in a virtual instance of the same memory size than it would if it had a real server with the same amount of memory.
Let's imagine this with actual numbers. If your team owned a modern dual-processor server with 128 gigabytes of RAM, each processor would have 16 megabytes (not gigabytes) of L2 cache. If that server is running an operating system, a database, and Memcached, then those three programs share that 16 megabytes. Taking the same server and increasing the memory to 512 gigabytes, and then having four tenants, means the available cache space has now shrunk to one-fourth of what it was: each tenant now receives only four megabytes of L2 cache and must compete with three other tenants for all the same resources it had before. In modern computing, cache is king, and if your cache is cut, you are going to feel it, as you did when trying to fix your performance problems.
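The arithmetic above can be written out as a short sketch. It assumes the cache is split evenly among tenants, which is an idealization: real hardware shares cache dynamically, so a noisy neighbor can leave you with even less. The function name `cache_share_mb` is hypothetical, and the 16-megabyte figure is the one used in the example above.

```python
def cache_share_mb(total_cache_mb: float, tenants: int) -> float:
    """Even split of a shared cache among tenants (an idealized assumption)."""
    if tenants < 1:
        raise ValueError("tenants must be at least 1")
    return total_cache_mb / tenants


L2_CACHE_MB = 16  # per-processor figure from the example in the text

for tenants in (1, 2, 4, 8):
    share = cache_share_mb(L2_CACHE_MB, tenants)
    print(f"{tenants} tenant(s): {share:g} MB of cache each")
```

With four tenants the function returns 4 megabytes, matching the one-fourth figure above, while the RAM given to each tenant (512 gigabytes split four ways) stays at the original 128 gigabytes.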
Most cloud providers offer systems that are non-elastic, as well as elastic, but having a server always available in a cloud service is more expensive than hosting one at a traditional colocation facility. Why is that? It is because the economies of scale for cloud providers work only if everyone is playing the game and allowing the cloud provider to dictate how resources are consumed.
Some providers now have something called Metal-as-a-Service, which I really think ought to mean a 1980s-era metal band shows up at your office, plays a gig, and smashes the furniture, but alas, it is just the cloud providers' way of finally admitting cloud computing is not really the right answer for all applications. For systems that require deterministic performance guarantees to work well, you really must think very hard about whether or not a cloud-based system is the right answer, because providing deterministic guarantees requires quite a bit of control over the variables in the environment. Cloud systems are not about giving you control; they are about the owner of the systems having the control.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.