Watchdogs vs. Snowflakes

Dear KV,

I have been working with a distributed job-control system for a large computing cluster for the past year. The system was developed in-house by one of the co-founders of the company, and he continues to work on it sporadically, while a small team of us adds new features and tries to fix bugs. The code isn’t terrible, but it has one major defect—if the system doesn’t have enough jobs in its queues, it tends to freeze up. I have been working with one other person on my team to diagnose the problem, but it has been assigned a very low priority by management because as long as we add dummy jobs when the system would otherwise be idle, the bug does not occur. I have never seen a system act like this, and I have to wonder: Is this kind of problem common in distributed job-control systems?

Jobless

Dear Jobless,

Is the specific problem of a system freezing up because of starvation common in distributed job-control systems? It has been my experience that each distributed system is a precious snowflake—and KV does not like snow!

Let’s first address the high-level issue: no one cares whether you fix the bug because, if you put in dummy jobs, the system “just works.” The phrase “just works” is one of the most overused in computing, and what it really indicates, in this case, is that someone is intellectually lazy, or that his or her motivation lies elsewhere. “Why should we care that we’re running our systems at 100% power draw, when fixing the problem would cost time and money?” Apart from the fact that computing now consumes a significant percentage of the world’s electricity, leaving a bug like this unaddressed can have other deleterious side effects.

That a system can randomly jam does not just indicate a serious bug in the system; it is also a major source of risk. You do not say what your distributed job-control system controls, but let’s just say I hope it is not something with significant, real-world side effects—like a power station, jet aircraft, or financial trading system. The risk, of course, is that the system will jam, not when it is convenient for someone to add a dummy job to clear the jam, but during some operation that could cause data loss or return incorrect results. I rather suspect that having a system like this jam while coordinating, for example, the balancing of electrical power across a power grid would have spectacular and perhaps fatal results.

I am not saying every bug must be fixed at the expense of doing otherwise productive work, but it is bugs like this one that, in my experience, tend to hit at the absolute worst possible time. And if the team knew about the bug in advance, the failure is compounded by the embarrassment of admitting that the risk was known long before it materialized.

It is difficult to say much about the technical issue without looking into the system itself. (Remember KV’s earlier comment about snowflakes.) The most common way of handling this type of freezing is itself not completely satisfying, and that is to have a watchdog process that sees if the system is making progress and restarts it after a suitable timeout when it believes the system is stuck.
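As a rough illustration of the idea, here is a minimal sketch of the progress-checking half of a watchdog. It assumes the monitored system exposes some monotonically changing progress value, such as a count of completed jobs; the `Watchdog` class, its `observe` and `reset` methods, and the injectable `clock` are all hypothetical names invented for this sketch, not part of any particular job-control system.

```python
import time
from typing import Callable, Optional


class Watchdog:
    """Flags a monitored system as stuck when its progress value stops changing."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_progress: Optional[int] = None
        self.last_change = clock()

    def observe(self, progress: int) -> bool:
        """Record the latest progress value; return True if a restart is due."""
        now = self.clock()
        if progress != self.last_progress:
            # The system did some work since the last check, so restart the timer.
            self.last_progress = progress
            self.last_change = now
            return False
        # No progress: a restart is due once the quiet period reaches the timeout.
        return (now - self.last_change) >= self.timeout_s

    def reset(self) -> None:
        """Call after a restart so the fresh process gets a full timeout window."""
        self.last_progress = None
        self.last_change = self.clock()
```

A monitoring loop would call `observe()` periodically with the current job count and, when it returns `True`, restart the system and call `reset()`. What "restart" means is deliberately left out of the sketch, because that is where the real trouble starts.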

There are several problems with the watchdog approach. The first is what the watchdog will actually do. Some watchdogs operate by restarting a stuck process, and they do this bluntly, by killing the process and restarting it. If the computations undertaken by the system are all idempotent, then there is little risk because any operation that did not complete will be restarted from the beginning and should have no side effects. Most systems, however, have side effects, which means such restarts can cause a cascade of errors through the whole system. If the errors are obvious, then a human operator might be able to roll back the system to a known good state and start the system again. But what if the errors are a type of silent corruption, returning incorrect answers (as I mentioned at the beginning of this column)? In that case, the watchdog is likely to do more harm than good.
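To make the idempotency point concrete, consider a toy example in which a watchdog kills a process after a job has taken effect but before completion was recorded, so the job is run again after the restart. The `Ledger` class and its method names are hypothetical, and deduplicating by job ID is only one common way to make an operation idempotent, offered here as an assumption rather than a prescription.

```python
class Ledger:
    """Toy state holder used to compare a blind update with an idempotent one."""

    def __init__(self) -> None:
        self.balance = 0
        self.applied_jobs: set[str] = set()

    def credit_blind(self, amount: int) -> None:
        # Not idempotent: every replay changes the state again.
        self.balance += amount

    def credit_once(self, job_id: str, amount: int) -> None:
        # Idempotent: a replay of an already-applied job is a no-op.
        if job_id in self.applied_jobs:
            return
        self.balance += amount
        self.applied_jobs.add(job_id)


# Simulate a restart that replays the same job a second time.
blind = Ledger()
blind.credit_blind(100)
blind.credit_blind(100)           # replay after restart: silently double-counted

safe = Ledger()
safe.credit_once("job-42", 100)
safe.credit_once("job-42", 100)   # replay after restart: ignored

print(blind.balance, safe.balance)
```

Note that nothing in the blind version raises an error; the answer is simply wrong, which is exactly the silent corruption described above. In a real distributed system the check and the update would also have to happen atomically and survive the restart, which this in-memory sketch does not attempt.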

Even if a watchdog approach is not otherwise harmful, there is a second problem: choosing an appropriate timeout duration. Since the system becomes jammed when it does not have enough work, some people will want to set the watchdog timeout to be very short so as to prevent these jams from reducing the overall efficiency of the system. A very short watchdog timeout has the potential to make the system thrash, since each restart caused by the watchdog firing will require the system to do work to return to its running state. All the work done by the system when a process is restarted is pure overhead; it does not help the system perform the work it was intended to do. Conversely, setting a watchdog timeout to be too long risks having the system remain stuck for long periods, again reducing overall efficiency. Too often, the choice of these timeouts is accomplished by a form of black magic, referred to as “taking a wild guess,” followed by a heuristic, which is referred to as “taking another wild guess,” to see if it is better than the first.
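The trade-off can be put into rough numbers with a back-of-the-envelope model. The sketch below assumes you have measured the quiet periods of a healthy system (gaps between progress updates when nothing was stuck), that each real jam stalls the system for the full timeout before the watchdog fires, and that every restart has a fixed cost. The function name, the model itself, and all the figures are invented for illustration, not measurements from any real system.

```python
def wasted_seconds(timeout_s: float, healthy_gaps_s: list[float],
                   jam_count: int, restart_cost_s: float) -> float:
    """Estimate the time lost to the watchdog for one candidate timeout.

    healthy_gaps_s: quiet periods observed while the system was NOT stuck;
        any gap at least as long as the timeout causes a needless restart.
    jam_count: number of real jams; each stalls for timeout_s before a restart.
    """
    spurious_restarts = sum(1 for gap in healthy_gaps_s if gap >= timeout_s)
    stall_time = jam_count * timeout_s
    return (spurious_restarts + jam_count) * restart_cost_s + stall_time


# Made-up measurements: six healthy quiet periods, four real jams,
# and a restart that costs 30 seconds of pure overhead.
gaps = [1, 2, 2, 3, 8, 20]
for timeout in (2, 10, 60):
    print(timeout, wasted_seconds(timeout, gaps, jam_count=4, restart_cost_s=30))
```

With these invented figures, the 2-second timeout loses 278 seconds (mostly to needless restarts), the 60-second timeout loses 360 seconds (mostly to sitting stuck), and the 10-second timeout loses 190. The model is crude, but even a crude model fed with measured data beats a second wild guess.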

Do not underestimate the number of production systems that use these approaches. I believe if we truly knew how many of the systems we depend on every day used black magic under the hood, we would all be more likely to buy land in Wyoming, build bunkers, and live in them.

Unfortunately, as KV has discussed before, debugging distributed systems is difficult, but it turns out that not debugging them and having them fail catastrophically makes for even more difficult days.


June 2018 Issue


Communications of the ACM (CACM) is now a fully Open Access publication.