BLOG@CACM

Just Press Reboot

Posted by Bertrand Meyer

After you pass security at Schiphol, Amsterdam’s international airport, go past the Christian Dior boutique and leave the Pretentious and Overpriced Mediocre Chocolate Shop (maybe not the exact name) on your left. On the wall you will see a big red button labeled "Reboot Schiphol." Just press it in case your flight is delayed.

Not really, but that is what Reuters seems to believe. On February 1, the airport was down for a day, causing a major disruption of European traffic. (I was supposed to fly through Amsterdam that day, on my way to a public lecture on software reliability—"Software Without Bugs?"—at the University of Toulouse. I managed to get rerouted, and on the plus side I knew what to put on my first slide.) According to the Reuters article [1], the incident was due to "a serious computer problem" and "a reboot of computer systems at one of Europe’s largest flight hubs failed to resolve the issue."

OK, ignore the absurdity of the very idea of rebooting Schiphol; why does the press continue to talk about "computer problems"? Computer hardware does fail, albeit rarely, but that is obviously not what simultaneously happened to dozens or hundreds of computers at Schiphol. It was a software failure. A failure is caused [2] by a fault. A fault almost always results from a programmer’s mistake. It will be a sign of maturity when everyone starts calling these "computer problems" what they are. All right, maybe not exactly what they are, programmer mistakes, since that assumes the analysis is complete, but at least "software problems": of that we are certain.
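To make the terminology concrete, here is a minimal sketch, in Python, of that causal chain; it is purely illustrative and has nothing to do with Schiphol’s actual systems. A programmer’s mistake leaves a fault in the code, and the fault produces a failure only when the program is run on the wrong input:

    # Hypothetical illustration of the chain mistake -> fault -> failure.
    def average_delay(delays_in_minutes):
        # MISTAKE: the programmer forgot about the empty-list case.
        # FAULT: this line divides by zero when the list is empty.
        return sum(delays_in_minutes) / len(delays_in_minutes)

    average_delay([10, 20, 30])   # the fault stays dormant; no failure observed
    average_delay([])             # FAILURE: the program raises ZeroDivisionError

The fault sits in the code from the day of the mistake; the failure is the observable event, and may only show up much later, under conditions nobody exercised in testing.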

As usual, the countless press articles on the incident were mostly bounced-off versions of each other, starting from the Reuters piece. Only one that I saw, from the (UK) Independent, added anything of substance: "the fault apparently occurred with radar correlation software, which compares and assesses information from primary and secondary radar." OK, that would make sense: maybe the Great Amsterdam Shutdown is another instance of the classical problem, reported as far back as 1981 [3], of what we may call the Bumbling Policeman Syndrome. It arises from an attempt to improve reliability by adding a redundant component that polices what the others are doing, and ends up causing havoc even as the rest works correctly. Maybe this is what happened in Amsterdam. But I want to know the details!
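Only as a hypothetical sketch (the names, numbers, and logic below are invented, not what actually ran at Schiphol or on the Shuttle), here is what a bumbling policeman can look like in code: a redundant checker that compares two healthy inputs and, because of its own fault, shuts the whole system down.

    # Hypothetical "Bumbling Policeman": a checker added for reliability
    # that itself brings the system down. Purely illustrative.
    def tracks_agree(primary_km, secondary_km, tolerance_km=0.0):
        # MISTAKE: a zero tolerance treats ordinary sensor noise as a fault.
        return abs(primary_km - secondary_km) <= tolerance_km

    def policeman(primary_km, secondary_km):
        if not tracks_agree(primary_km, secondary_km):
            # Overreaction: rather than flagging one discrepancy,
            # the checker declares the whole system unreliable and halts it.
            raise SystemExit("radar correlation mismatch: shutting down")
        return primary_km

    policeman(100.0, 100.0)    # fine
    policeman(100.0, 100.02)   # both radars are healthy, yet the checker halts everything

The redundant component was supposed to increase reliability; here it is the only part that fails.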

Therein lies the fundamental issue. Software mistakes happen, and as a result so do software failures. But we need to learn from them. A (non-exhaustive) search, six weeks later, yields no information published after the day of the incident. Nothing! Yet someone must have investigated what went wrong, and we, the public, including the Schiphol-traveling public, are entitled to know.

I have made this point before, to the point of tediousness [4, 5, 6, 7], and will make it once more:

Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.

Rather, it’s by accidents.

What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the U.S. the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967. Any accident must, by law, be analyzed thoroughly; airplanes themselves carry the famous "black boxes" whose only purpose is to provide evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.

Whenever such post-mortem analyses have occurred for software catastrophes, they have proved tremendously useful; all good software engineering courses study the Ariane-5 failure, thanks to Gilles Kahn’s masterful dissection (reported in [8]), and the U.S. Government Accountability Office’s report on the lethal Patriot missile bug [9].

These are isolated examples, exceptions in fact. Like the Schiphol shutdown, many software failures make headlines for a day while a large number of people feel the consequences, and then drop off the journalists’ radar (including their "secondary radar" if they have one) and out of our consciousness. This is wrong. Someone will know what happened: any organization experiencing such a major incident will commission an internal study. But the rest of us have both a need and a right to know. Engineering can only progress if we learn from our mistakes.

More than ever, we need laws that require such disclosure. For any software-related mishap having caused trouble beyond a certain threshold, measured by financial loss or disruption of people’s lives, particularly if public money is involved, there should be a legal requirement to report promptly and accurately, at the organization’s expense, what happened and why.

It is time to start lobbying for such laws.

References

[1] Anthony Deutsch: Long Delays at Amsterdam’s Schiphol Airport Due to Computer Problem, Reuters UK, 1 February 2017, available here.

[2] IEEE Standard Classification for Software Anomalies, IEEE Std 1044-2009, available here (with subscription).

[3] John R. Garman: The "BUG" heard ’round the world: discussion of the software problem which delayed the first shuttle orbital flight, in ACM SIGSOFT Software Engineering Notes, Vol. 6, No. 5, October 1981, pages 3-10, accessible here (text access requires subscription).

[4] Bertrand Meyer: The one sure way to advance software engineering, 21 August 2009, see here (in my personal blog).

[5] Bertrand Meyer: Dwelling on the point, blog article, see here.

[6] Bertrand Meyer: Analyzing a software failure, 24 May 2010, blog article, see here.

[7] Bertrand Meyer: Again: The One Sure Way to Advance Software Engineering, this blog, available here.

[8] Jean-Marc Jézéquel and Bertrand Meyer: Design by Contract: The Lessons of Ariane, in Computer (IEEE), vol. 30, no. 1, January 1997, pages 129-130, available here.

[9] US GAO (Government Accountability Office): Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, report IMTEC-92-26, 4 February 1992, available here.
