Space Time Discontinuum

Increasingly, calculations are based on timed events originating at many sources. By comparing, contrasting, joining, and noodling over these inputs, you can derive some interesting results. If the inputs to these calculations come from disparate computers (or sensors), you can’t always be sure of how quickly the information will propagate. Hence, you can’t be sure when you will get an answer to the requested calculation.

If you can’t promise when you will get the answer, what are you going to do? You can wait to get the perfect answer, or you can give an imperfect answer more promptly by basing it on partial knowledge.

How can you meet your SLAs (service-level agreements)? Sometimes, it’s not an easy and seamless continuum of options but more of a discontinuum.

Events and time. In many systems, events come with a timestamp. These may come from temperature sensors, motion detectors, factory floors, your cable box, your security system, automated tollbooths on the freeway, and much more. One source of information that is increasingly important is the monitoring of events in datacenters. These events may be used by an automated management system or by humans in complex error spelunking. It is common to try to see which events happened close in time to another particular event. Patterns in time proximity are essential to controlling these complex environments.

Events and space. Now, all this is cool except when the information you want is “over there.” In a distributed system, if the stuff is not “here,” it’s “over there.” When something is far away and over there, it may take a loooonnng time to get information in and out. It can be kind of like when the only road up the canyon gets washed out in a flood. I refer to remote nodes as being “over yonder” whether they are in the same rack in the datacenter or thousands of miles away.

Space, Distance, and Traffic Jams

If it’s over there with an open queuing network, it will usually get here in a timely fashion. Almost all modern networks are open queuing networks. In such a network, there’s no practical limit to the amount of stuff that may try to share the network. Highway 101 through San Francisco is an open queuing network. It works great until it doesn’t.

Right now, I’m writing this on an airplane. I’m glad the modern fly-by-wire systems use closed queuing networks. This type of network carefully manages the messages allowed in. By ensuring the network never ingests more work, it can ensure a bounded time for the network to digest the work it already has. Assuming no hardware faults (or a bounded number of hardware faults), the closed queuing network can do what is requested within a specified period of time. I don’t want a traffic jam on the fly-by-wire system controlling the flaps on the plane. I’m grateful for a closed queuing network.

In datacenters, on the other hand, you don’t see closed queuing networks. Instead, you see a freeway that usually has a lot of lanes. It mostly works OK.

Reviewing our queuing. Delays in propagating events are not just caused by the network. It’s common for events to get pushed into some queuing system as they are published. The published events are then consumed by one or more subscribers. These subscribers typically enqueue the event in yet another queue (ad nauseam) as the event rattles its way through the datacenter or Internet trying to seek its destination like a fly buzzing around a light on the back porch.

Misconstruing our queuing. Each of these queuing systems presents a whole new opportunity for delay. It’s almost as if the event gets frozen in time and, hopefully, gets defrosted within the desired time window. Each stage of these queuing systems is very likely to propagate the event quickly. Each time you add a “very likely to propagate the event quickly,” you also add a “sometimes it will be slow.”

Not only can queuing systems be slow, they rarely have end-to-end guarantees. Unless the final point of consumption is coordinated with the originating sender, the event will sometimes go kablooey and never arrive.

Service levels are a key component of engineering a system for use by a customer. These may deal with system performance, system availability, or quality of the returned results:^4,5

SLA (service-level agreement). This is a contract with the customer. It may list financial compensation if you don’t meet the numbers. It may be an understanding with your boss that you will miss part of your bonus if things go awry.
SLO (service-level objective). This is your internal target for the metric in question. You always want your SLO to be better than your SLA (or you are not trying to please your customer).
SLI (service-level indicator). This is what you actually measure against the metric. Hopefully, your SLI is better than your SLO, which is better than your SLA.

To ensure many different sources of events are combined, you should leave some room in your SLA to allow for slowness in getting the source data from all those sources. The more sources, the more likely you will blow your SLA as one or more of them blow their SLAs.

Managing Tail Latencies in Idempotent Operations

One of my favorite articles is “The Tail at Scale,” by Jeffrey Dean and Luis Andre Barroso.¹ In this article, the authors outline how a service requiring a whole bunch of inputs can use timeout-based retries to dramatically improve the probability of getting all the inputs needed to meet the desired 99.9% SLA for response. In this case, sufficient replicas exist to get the answer without loss of quality, simply by retrying and consuming more resources. Key to this is that it is OK to retry the laggard requests because they are idempotent. It does not cause harm to do them two or more times.

Knowing you can’t know. The more complex the set of inputs, the more likely you won’t see everything in a timely fashion. The more complex the store-and-forward queuing, the more likely stuff will arrive too late or not at all. The more distant the sources of your inputs, the more challenges you may have.

As we have seen, sometimes it can be effective to retry the request for input. In particular, in some systems, retrying can ensure all the inputs are available quickly.

In other systems, the inputs are not simply fetched but are rattling their way through queues similar to Highway 101 through San Francisco. In these environments, the processing probably has to simply cut off with what it has and do the best it can. This means you can’t guarantee the stuff is ready when you want it.

So, if you know you can only probably know, what’s the plan?

Approximating queries. There is some fun new work describing analytics with approximate answers. By expressing sampling operators, some systems can provide really good answers based on a small subset of all the inputs one would normally examine.^2,3 In the cited systems, there is more focus on sampling for performance when everything is working and timely. Still, it’s quite similar to what you would do to build systems that return answers based on what is available in a timely fashion.

Returning partial answers. Many systems work to give some answer in a timely fashion even if they are wounded. Many sophisticated websites will dribble out partial answers to the browser as they become available: The text for a product description may arrive before the product image; the product image may arrive before the user reviews. This decoupling yields a faster overall result and, in general, a more satisfied user.

When dealing with answers relying on data from a distance, it is important to consider how to decouple results and, where possible, return a degraded answer quickly when and if that best meets the needs of the business.

When dealing with answers relying on data from a distance, it’s important to consider how to decouple results and, where possible, return a degraded answer quickly when and if that best meets the needs of the business. What does it take to keep on going?

As the Black Knight from “Monty Python and the Holy Grail” says, “‘Tis but a scratch” (https://www.youtube.com/watch?v=ZmInkxbvlCs; thank you to Peter Vosshall for raising the Black Knight in discussion when we were both younger many years ago).

The Disconcerting Discontinuum

Back when you had only one database for an application to worry about, you didn’t have to think about partial results. You also didn’t have to think about data arriving after some other data. It was all simply there.

Now, you can do so much more with big distributed systems, but you have to be more sophisticated in the tradeoff between timely answers and complete answers. The best systems will adapt and interpret their problems as, “‘Tis but a scratch!”