Inside Risks

The Foresight Saga

In hindsight, the lack of foresight is an old problem that keeps recurring. The most prevalent situation seems to be "we had a backup system, but it failed when it was needed." For example, the Los Angeles Air Route Traffic Control Center in Palmdale, CA, was shut down for three hours on the evening of July 18, 2006, when a power outage forced an automatic cutover to the backup power system, which subsequently failed. Two separate human errors at the Palmdale Center silenced LA-area airport communications in September 2004, when both the main system and the backup failed. A power failure disrupted Reagan National Airport for almost eight hours on April 10, 2000, when the backup generator also failed. At the Westbury, Long Island, air traffic control center in June 1998, a software upgrade failed, as did the attempt to revert to the old software. The new LA-area El Toro air traffic control computer failed 104 times in a single day in 1989; its predecessor was unavailable, having already been decommissioned. The new British Swanwick air traffic control system failed after an attempted software upgrade in June 2004; restoring the backup system took two hours, halting air traffic throughout England. In 1991, an accidentally misconfigured AT&T standby generator drained the backup batteries, closing the three major NY-area airports.

The Swedish central train-ticket sales/reservation system and its backup both failed. The Washington, D.C. Metro Blue Line computer system and its backup both failed on June 6, 1997. San Francisco's BART computers crashed on three consecutive days of attempted software upgrades, a situation complicated by a backup failure as well.

On one occasion, Nasdaq experienced a power failure after which the backup power system also failed; on another, a Nasdaq software upgrade caused a computer crash, and the backup system failed as well. In November 2005, a software bug caused the worst-ever Japanese stock exchange system crash; the monthly software upgrade failed, and the backup (running the same software) also failed. In 1999, the Singapore Stock Exchange system crashed repeatedly due to erroneous interactions with the backup system.

A power surge shut down the Nine Mile Point nuclear station in Oswego, NY, and the supposedly uninterruptible backup power system failed as well, triggering a site-area emergency. An Australian TV channel went off the air due to multiple system failures and a power outage; failure of the backup system took down the national phone system. New York City's 911 system crashed during a test of the backup generator; the backup system failed for an hour, the main system for six hours. In 1998, a malfunction of the Galaxy IV satellite onboard control system caused massive outages of U.S. pager service, with the backup switch also failing. In other examples, in which no recovery was possible, the NY Public Library lost all of its computerized references, and the Dutch decommissioned an old system for tracking criminals before its replacement had been successfully cut over.

Further problems arise when constructive uses of operational redundancy fail their intended purpose. In December 1986, the ARPANET had seven dedicated trunk lines between New York and Boston, except that they all ran through the same conduit, which was accidentally cut by a backhoe. Quite a few similar problems have resulted from localized failures of seemingly parallel power, communications, or aircraft hydraulic lines. Common-mode failures can be particularly problematic in distributed systems, even with redundancy designed to ensure high reliability (such as majority voting). For example, Brunelle and Eckhardt describe a case in which two independently developed faulty programs had similar bugs and consistently outvoted the correct one. Moreover, if multiple subsystems all share the same security vulnerabilities, then whenever one of them can be compromised, so can all of them, at the same time, despite the redundancy. Similarly, using an n-error-correcting code when more than n errors are likely is a poor idea.
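To make the majority-voting pitfall concrete, consider the following minimal sketch in Python. The three version functions are hypothetical stand-ins, not drawn from the Brunelle and Eckhardt study: when two of three independently developed versions happen to share a bug, a 2-out-of-3 voter faithfully selects the wrong answer.

    # A minimal sketch of 2-out-of-3 majority voting, showing how a
    # common-mode bug defeats redundancy. All three versions are
    # hypothetical illustrations.
    from collections import Counter

    def version_a(x):
        return x * x          # correct implementation

    def version_b(x):
        return x * abs(x)     # sign-handling bug

    def version_c(x):
        return x * abs(x)     # the same bug, introduced independently

    def majority_vote(x):
        """Return the answer produced by at least two of the three versions."""
        results = [version_a(x), version_b(x), version_c(x)]
        answer, votes = Counter(results).most_common(1)[0]
        return answer

    # For x = -3 the correct result is 9, but the two faulty versions
    # agree on -9 and outvote the correct one:
    print(majority_vote(-3))   # prints -9

The voter itself works exactly as designed; the failure lies in the violated assumption that the versions' errors are statistically independent.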

In all of these cases, evident shortcomings existed in the design, implementation, and testing of backup and recovery facilities. It is difficult to test for situations that occur very rarely. On the other hand, if backup and recovery facilities must be exercised frequently, the overall system is probably poorly designed and operated. Backup and recovery are processes that must be carefully integrated with their associated systems, with suitable quality control, periodic reverification, and maintenance of compatibility. Considerable foresight, systematic planning, and periodic testing are essential.
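As one illustration of what periodic reverification might look like in practice, here is a minimal sketch in Python. The directory layout and the copy-based restore step are hypothetical placeholders for a real backup tool, and the comparison assumes the live data has not changed since the backup was taken: a backup is not trusted until it has actually been restored to scratch space and verified file by file.

    # A minimal restore-drill sketch: restore the backup somewhere
    # disposable and check every file against the live data.
    import hashlib
    import pathlib
    import shutil
    import tempfile

    def sha256(path: pathlib.Path) -> str:
        """Hash a file so a restored copy can be compared to the original."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def verify_restore(live_dir: str, backup_dir: str) -> bool:
        """Restore backup_dir to scratch space and compare it to live_dir."""
        live = pathlib.Path(live_dir)
        with tempfile.TemporaryDirectory() as scratch:
            restored = pathlib.Path(scratch) / "restore"
            # Placeholder for the real restore procedure of whatever
            # backup system is actually in use.
            shutil.copytree(backup_dir, restored)
            for live_file in live.rglob("*"):
                if live_file.is_file():
                    twin = restored / live_file.relative_to(live)
                    if not twin.is_file() or sha256(twin) != sha256(live_file):
                        return False  # this backup would fail when needed
        return True

Run on a schedule, a drill like this turns "the backup failed when it was needed" from a surprise into an alarm raised long before the failure matters.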
