In February, Apple revealed and fixed a Secure Sockets Layer (SSL) vulnerability that had gone undiscovered since the release of iOS 6.0 in September 2012. It left users vulnerable to man-in-the-middle attacks thanks to a short circuit in the SSL/TLS (Transport Layer Security) handshake algorithm introduced by the duplication of a
goto statement. Since the discovery of this very serious bug, many people have written about potential causes. A close inspection of the code, however, reveals not only how a unit test could have been written to catch the bug, but also how to refactor the existing code to make the algorithm testableas well as more clues to the nature of the error and the environment that produced it.
This article addresses five big questions about the SSL vulnerability: What was the bug (and why was it bad)? How did it happen (and how didn't it)? How could a test have caught it? Why didn't a test catch it? How can we fix the root cause?
The Apple SSL vulnerability, formally labeled CVE-2014-1266, was produced by the inclusion of a spurious, unconditional
goto statement that bypassed the final step in the SSL/TLS handshake algorithm. According to the National Vulnerability Database (http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-1266) and the Common Vulnerabilities and Exposure (CVE) Standard Vulnerability Entry (http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-1266), the bug existed in the versions of iOS, OS X, and the Apple TV operating system shown in the accompanying table.
These formal reports describe the bug as follows: "The
SSLVerifySignedServerKeyExchange function in libsecurity_ssl/lib/sslKeyExchange.c in the Secure Transport feature in the Data Security component ... does not check the signature in a TLS Server Key Exchange message, which allows man-in-the-middle attackers to spoof SSL servers by using an arbitrary private key for the signing step or omitting the signing step." This error is visible by searching for the function name within Apple's published open-source code (http://opensource.apple.com/source/Security/Security-55471/libsecurity_ssl/lib/sslKeyExchange.c) and looking for this series of statements:
Those familiar with the C programming language will recognize that the first
goto fail is bound to the
if statement immediately preceding it; the second is executed unconditionally. This is because whitespace, used to nest conditional statements for human readability, is ignored by the C compiler; curly braces must enclose all statements bound to an if statement when more than one statement applies.
goto fail statements appearing throughout the algorithm are a common idiom in C for releasing resources when a function has encountered a fatal error prior to completion. In the flawed code, a successful
update() call will result in an unconditional jump to the end of the function, before the final step of the algorithm; and the return value will indicate the handshake was successful. In essence, the algorithm gets short-circuited.
For users of Apple's Safari and other Secure Transport-based applications on the affected platforms, "secure" connections were vulnerable to man-in-the-middle attacks, whereby an attacker in a position to relay messages from a client to a "secure" server across the Internet can impersonate the server and intercept all communications after the bogus handshake. (Users of products incorporating their own SSL/TLS implementations, such as Google Chrome and Mozilla Firefox, were not affected.) Though it is unknown whether this vulnerability was ever exploited, it rendered hundreds of millions of devices (and users) vulnerable over the course of 17 months.
Apple was criticized for patching the vulnerability for iOS devices and Apple TV on Friday, February 21, 2014, making knowledge of the vulnerability widespread, but delaying the patch for OS X Mavericks until the following Tuesday. This four-day window left users who were not aware of the iOS patch vulnerable to a now very public exploit.
Many have noted apparently missing factors that could have caught the bug. Coding standardsspecifically those enforcing the use of indentation and curly bracescombined with automated style-checking tools and code reviews, may have drawn attention to the repeated statement. An automated merge may have produced the offending extra line, and the developer may have lacked sufficient experience to detect it. Had coverage data been collected, it would have highlighted unreachable code. Compiler and static-analysis warnings also could have detected the unreachable code.
Others noted the code appears to lack unit tests, which likely would have caught the bug. While many of the other tools and practices might have been sufficient to prevent this specific vulnerability, a deeper problem, which ultimately produced the repeated
goto statement, would have been prevented by proper unit-testing discipline.
Some question whether adequate system testing took place, while others argue that because system testing cannot find every bug, this was merely an unfortunate case of one that happened to slip through. Others claim use of the
goto statement and/or deliberate sabotage is to blame. None of these claims stands up to scrutiny.
GO TO not "considered harmful." Since it is one of the more popular theories, let's dispatch the argument that the use of
goto is to blame for this vulnerability. Many have referred to the popular notion that
goto is "considered harmful," based on Edsger Dijkstra's letter published in the March 1968 Communications of the ACM. This is what Dijkstra actually said in "A Case against the GO TO Statement": "I do not claim that the clauses mentioned are exhaustive in the sense that they will satisfy all needs; but whatever clauses are suggested (for example, abortion clauses) they should satisfy the requirement that a programmer-independent coordinate system can be maintained to describe the process in a helpful and manageable way."9 In other words, "abortion clauses" to release a function's resources may still rely on
goto, absent other direct language support.
This C language "abortion clause" idiom is legitimate and well understood, and is directly supported by other languages. For example, in C++, automatic destructors implement the Resource Acquisition Is Initialization (RAII) idiom; Java employs try/catch/finally blocks (http://docs.oracle.com/javase/tutorial/essential/exceptions/handling.html); Go provides the
defer(), panic(), and
recover() mechanisms (http://blog.golang.org/deferpanic-and-recover); and Python has
try/except/finally blocks (http://docs.python.org/3/reference/compound_stmts.html#try) and the
with statement, which is used to implement RAII (http://docs.python.org/3/reference/compound_stmts.html#the-with-statement). Absent these mechanisms, in C this remains a legitimate application of the
goto statement, lest the code become bloated with repetitive statements or the control structure become nested so deeply as to hinder readability and maintainability.
In fact, a misplaced
return statement could have produced the same effect. Imagine a macro such as the following had been defined:
Then the bug might have appeared in this incarnation:
Even enforcing the use of curly braces might not have prevented the error, as they could be mismatched:
The blame for this vulnerability does not lie with the
goto statement. A proper unit test would have caught the error regardless of how it was written.
Code duplication. The handshake algorithm in which the extra
goto statement appears is duplicated six times throughout the code. Figure 1 shows the algorithm containing the repeated
goto fail line as it appears in the
SSLVerifySignedServerKeyExchange() function. Figure 2 shows the block immediately preceding this algorithm. This duplication is the critical factor leading to the manifestation of the vulnerability, and it can be traced directly to a lack of unit testing disciplinebecause of the absence of the craftsmanship and design sense that testable code requires. Someone writing code with unit testing in mind would have ensured only one copy of the algorithm existednot only because it is theoretically more proper, but because it would have been easier to test.
The coder could not "smell" (http://blog.codinghorror.com/code-smells/) the duplicate code as he or she was writing itor copying it for the second, third, fourth, fifth, or sixth time! This indicates a pattern of poor habits over time, not a single careless mistake. Ultimately, this speaks to a cultural issue, not a moment of individual error.
Landon Fuller published a proof-of-concept unit test implemented in Objective-C,10 using the Xcode Testing Framework.2 Fuller notes that "there's no reason or excuse for [the
SSLVerifySignedServerKeyExchange() function] not being fully tested for" all of the potential error conditions. This proof of concept, however, misses the opportunity to look deeper into the code and provide full test coverage of this particularly critical algorithmso critical that it appears six times in the same file.
Step one in properly testing the algorithm is to extract it into a separate functionwhich in itself might have prevented the duplicate
goto fail that caused the bug, since a single block of code is less susceptible to editing or automated merge errors than six blocks of nearly identical code (Figure 3).
The two earlier blocks of code from
SSLVerifySignedServerKeyExchange() now appear as in Figure 4.
This works because the
HashReference is a "jump table" structure, and SSLHashMD5 and SSLHashSHA1 are instances of
HashReference, which point to specific hash algorithm implementations. The
HashReference interface makes it straightforward to write a small test exercising every path through the isolated
HashHandshake() algorithm using a
HashReference stub, and to verify that it would have caught this particular error:
The code for
tls _ digest _ test.c is viewable at http://goo.gl/PBt9S7. Security-55471-bugfix-and-test.tar.gz (http://goo.gl/tnvIUm) contains all of my proof-of-concept changes;
build.sh automates downloading the code, applying the patch, and building and running the test with a single command. The test and the patch are very quick efforts but work as a stand-alone demonstration without requiring the full set of dependencies needed to build the entire library. The demonstration admittedly does not address further duplication or other issues present in the code.
The point of all this is, if an exprogrammer who has been out of the industry for 2.5 years can successfully refactor and test this code within a couple of hours, never having seen it before, why didn't the engineer or team responsible for the code properly test it 17 months earlier?
Several articles have attempted to explain why the Apple SSL vulnerability made it past whatever tests, tools, and processes Apple may have had in place, but these explanations are not sound, especially given the above demonstration to the contrary in working code. The ultimate responsibility for the failure to detect this vulnerability prior to release lies not with any individual programmer but with the culture in which the code was produced. Let's review a sample of the most prominent explanations and specify why they fall short.
Adam Langley's oft-quoted blog post13 discusses the exact technical ramifications of the bug but pulls back on asserting that automated testing would have caught it: "A test case could have caught this, but it is difficult because it is so deep into the handshake. One needs to write a completely separate TLS stack, with lots of options for sending invalid handshakes."
This "too hard to test" resignation complements the "I don't have time to test" excuse Google's Test Mercenaries, of which I was a member, often heard (though, by the time we disbanded, testing was well ingrained at Google, and the excuse was rarely heard anymore).11 As already demonstrated, however, a unit test absolutely would have caught this, without an excess of difficulty. Effectively testing the algorithm does not require "a completely separate TLS stack;" a well-crafted test exercising well-crafted code would have caught the errorindeed, the thought of testing likely would have prevented it altogether.
Unfortunately, some adopted Langley's stance without considering that the infeasibility of testing everything at the system level is why the small, medium, and large test size schema exists (that is unit, integration, and system to most of the world outside Google).8 Automated tests of different sizes running under a continuous integration system (for example, Google's TAP, Solano CI) are becoming standard practice throughout the industry. One would expect this to be a core feature of a major software-development operation such as Apple's, especially as it pertains to the security-critical components of its products.
Writing for Slate, David Auerbach breaks down the flaw for nonprogrammers and hypothesizes that the bug might have been caused by a merge error (based on this diff: https://gist.github.com/alexyakoubian/9151610/revisions; look for green line 631), but then concludes: "I think the code is reasonably good by today's standards. Apple would not have released the code as open source if it were not good, and even if they had, there would have been quite an outcry from the open source community if they had looked it over and found it to be garbage."3
Auerbach's conclusion assumes that everything Apple releases is high quality by definition, that it has every reasonable control in place to assure such high quality, and that all open-source code receives the focused scrutiny of large numbers of programmers (thanks to Stephen Vance for pointing this out specifically in his comments on my presentation)at least, programmers motivated to report security flaws. The actual code, however, suggests a lack of automated testing discipline and the craftsmanship that accompanies it, as well as the absence of other quality controls, not the fallibility of the existing discipline that Auerbach imagines Apple already applies.
Security guru Bruce Schneier notes, "The flaw is subtle, and hard to spot while scanning the code. It's easy to imagine how this could have happened by error. ... Was this done on purpose? I have no idea. But if I wanted to do something like this on purpose, this is exactly how I would do it."15 Schneier's focus is security, not code quality, so his perspective is understandable; but the evidence tips heavily in favor of programmer error and a lack of quality controls.
Delft University computer science professor Arie van Deursen notes many industry-standard tools and practices that could have caught the bug; but despite self-identifying as a strong unit-testing advocate, he demurs from asserting the practice should have been applied: "In the current code, functions are long, and they cover many cases in different conditional branches. This makes it hard to invoke specific behavior. ... Thus, given the current code structure, unit testing will be difficult."16 As already demonstrated, however, this one particular, critical algorithm was easy to extract and test. Software structure can be changed to facilitate many purposes, including improved testability. Promoting such changes was the job of the Test Mercenaries at Google.
My former Test Mercenary colleague C. Keith Ray noted both in his comments on van Deursen's post and in his own blog: "Most developers who try to use TDD [test-driven development] in a badly designed, not-unit-tested project will find TDD is hard to do in this environment, and will give up. If they try to do 'test-after' (the opposite of TDD's test-first practice), they will also find it hard to do in this environment and give up. And this creates a vicious cycle: untested bad code encourages more untested bad code."14
I largely agree with Ray's statement but had hoped he might seize the opportunity to mention the obvious duplicate code smell and how to eliminate it. Again, that was our stock-in-trade as Test Mercenaries. Absence of TDD in the past does not preclude making code more testable now, and we have a responsibility to demonstrate how to do so.
Columbia University computer science professor Steven M. Bellovin provides another thorough explanation of the bug and its ramifications, but when he asks "why they didn't catch the bug in the first place," his focus remains on the infeasibility of exhaustive system-level testing: "No matter how much you test, you can't possibly test for all possible combinations of inputs that can result to try to find a failure; it's combinatorially impossible."4
As demonstrated, this vulnerability was not a result of insufficient system testing; it was because of insufficient unit testing. Keith Ray himself wrote a "Testing on the Toilet"8 article, "Too Many Tests,"11 explaining how to break complex logic into small, testable functions to avoid a combinatorial explosion of inputs and still achieve coverage of critical corner cases ("equivalence class partitioning"). Given the complexity of the TLS algorithm, unit testing should be the first line of defense, not system testing. When six copies of the same algorithm exist, system testers are primed for failure.
Such evidence of a lack of developer testing discipline, especially for security-critical code, speaks to a failure of engineering and/or corporate culture to recognize the importance and impact of unit testing and code quality, and the real-world costs of easily preventable failuresand to incentivize well-tested code over untested code. Comments by an anonymous ex-Apple employee quoted by Charles Arthur in The Guardian2 support this claim:
"Why didn't Apple spot the bug sooner?
Given the complexity of the TLS algorithm, unit testing should be the first line of defense, not system testing. When six copies of the same algorithm exist, system testers are primed for failure.
"The former programmer there says, 'Apple does not have a strong culture of testing or test-driven development. Apple relies overly on dogfooding [using its own products] for quality processes, which in security situations is not appropriate. ...
"What lessons are there from this?
"But the former staffer at Apple says that unless the company introduces better testing regimesstatic code analysis, unit testing, regression testing'I'm not surprised by this ... it will only be a matter of time until another bomb like this hits.' The onlyminimal comfort: 'I doubt it is malicious.' "
Reviewer Antoine Picard, commenting on the similarity between this security vulnerability and reported problems with Apple's MacBook power cords, noted: "When all that matters is the design, everything else suffers."12
Those with unit-testing experience understand its productivity benefits above and beyond risk prevention; but when the inexperienced remain stubbornly unconvinced, high-visibility bugs such as this can demonstrate the concrete value of unit testingin working code.
Seize the teachable moments! Write articles, blog posts, flyers, give talks, start conversations; contribute working unit tests when possible; and hold developers, teams, and companies responsible for code quality.
Over time, through incremental effort, culture can change. The Apple flaw, and the Heartbleed bug discovered in OpenSSL in April 2014after this article was originally draftedcould have been prevented by the same unit-testing approach that my Testing Grouplet (http://mike-bland.com/tags/testing-grouplet.html), Test Certified,5 Testing on the Toilet, and Test Mercenary partners in crime worked so hard to demonstrate to Google engineering over the course of several years. By the time we finished, thorough unit testing had become the expected cultural norm. (My commentary on Heartbleed, with working code, is available at http://mike-bland.com/tags/heartbleed.html.)
Culture change isn't easy, but it's possible. If like-minded developers band together across teams, across companies, even across the industrysuch as is beginning to happen with the Automated Testing Boston Meetup (http://www.meetup.com/Automated-Testing-Boston/), its sister groups in New York, San Francisco, and Philadelphia, and the AutoTest Central community blog (http://autotestcentral.com/)and engage in creative pursuits to raise awareness of such issues and their solutions, change will come over time.
The goal is to have this and other articlesincluding my "Goto Fail, Heartbleed, and Unit Testing Culture" article published by Martin Fowler (http://martinfowler.com/articles/testing-culture.html)drive discussion around the Apple SSL and Heartbleed bugs, spreading awareness and improving the quality of discourse; not just around these specific bugs, but around the topics of unit testing and code quality in general. These bugs are a perfect storm of factors that make them ideal for such a discussion:
If we do not seize upon these opportunities to make a strong case for the importance and impact of automated testing, code quality, and engineering culture, and hold companies and colleagues accountable for avoidable flaws, how many more preventable, massively widespread vulnerabilities and failures will occur? What fate awaits us if we do not take appropriate corrective measures in the wake of
goto fail and Heartbleed? How long will the excuses last, and what will they ultimately buy us?
And what good is the oft-quoted bedrock principle of open-source software, Linus's Law"Given enough eyeballs, all bugs are shallow"if people refuse to address the real issues that lead to easily preventable, catastrophic defects?
I have worked to produce artifacts of sound reasoning based on years of experience and hard evidenceworking code in the form of the Apple patch-and-test tarball and heartbeat_test.c, contributed to OpenSSL (http://goo.gl/1F7SKs)to back up my rather straightforward claim: a unit-testing culture most likely could have prevented the catastrophic
goto fail and Heartbleed security vulnerabilities.
High-profile failures such as the Apple SSL/TLS vulnerability and the Heartbleed bug are prime opportunities to show the benefits of automated testing in concrete terms; to demonstrate technical approaches people can apply to existing code; and to illustrate the larger, often cultural, root causes that produce poor habits and bugs. Given the extent to which modern society has come to depend on software, the community of software practitioners must hold its members accountable, however informally, for failing to adhere to fundamental best practices designed to reduce the occurrence of preventable defectsand must step forward not to punish mistakes but to help address root causes leading to such defects. If you see something, say something!
Further reading. This article is based on my presentation, "Finding More than One of the Same Worm in the Apple" (http://goo.gl/F0URUR), corresponding material,56,7,8 and excerpts from my blog post (http://mike-bland.com/2014/04/15/goto-fail-tott.html).
Figure 5 is by Catherine Laplace, based on the author's sketch of an image from the Testing Grouplet/EngEDU Noogler Unit Testing lecture slides for new Google engineers.
My deepest gratitude to former Google colleagues, new associates from the Automated Testing Boston Meetup, and generous acquaintances whom I've met only online: David Plass, Isaac Truett, Stephen Vance, RT Carpenter, Gleb Bahmutov, Col Willis, Chris Lopez, and Antoine Picard. Thanks to Sarah Foster of the AutoTest Meetups and the AutoTest Central blog for providing a forum to discuss this issue. Thanks to Guido van Rossum, George Neville-Neil, and Martin Fowler for supporting my efforts.
Security is Harder than You Think
John Viega and Matt Messier
Nine IM Accounts and Counting
Browser Security Case Study: Appearances Can Be Deceiving
Jeremiah Grossman, Ben Livshits, Rebecca Bace and George Neville-Neil
1. Apple Inc. Xcode overview, 2014; http://bit.ly/1kXUAzD
2. Arthur, C. Apple's SSL iPhone vulnerability: How did it happen, and what next? The Guardian, (Feb. 25, 2014); http://www.theguardian.com/technology/2014/feb/25/apples-ssl-iphone-vulnerability-how-did-it-happen-and-what-next.
3. Auerbach, D. An extraordinary kind of stupid. Slate (Feb. 25, 2014); http://slate.me/1o75yGs
4. Bellovin, S.M. Goto Fail. SMBlog (Feb. 23, 2014); https://www.cs.columbia.edu/~smb/blog/2014-02/2014-02-23.html.
5. Bland, M. Test Certified, 2011; http://mike-bland.com/2011/10/18/test-certified.html.
6. Bland, M. Testing on the Toilet, 2011; http://mike-bland.com/2011/10/25/testing-on-the-toilet.html.
7. Bland, M. Test Mercenaries, 2012; http://mike-bland.com/2012/07/10/test-mercenaries.html.
8. Bland, M. AutoTest Central, 2014; http://autotestcentral.com/small-medium-and-large-test-sizes
9. Dijkstra, E. A case against the GO TO statement. Commun. ACM 11, 3 (Nov. 1968), 147148; http://www.cs.utexas.edu/users/EWD/ewd02xx/EWD215.PDF.
10. Fuller, L. TestableSecurity: demonstrating that
SSLVerifySignedServerKeyExchange() is trivially testable, 2014; https://github.com/landonf/Testability-CVE-2014-1266.
11. Google, Inc. Too many tests. Google Testing Blog (Feb. 21, 2008); http://googletesting.blogspot.com/2008/02/in-movie-amadeus-austrian-emperor.html.
12. Greenfield, R. Why Apple's power cords keep breaking. The Wire (July 30, 2012); http://www.thewire.com/technology/2012/07/why-apples-power-cords-keep-breaking/55202/.
13. Langley, A. Apple's SSL/TLS bug. Imperial Violet (Feb. 22, 2014); https://www.imperialviolet.org/2014/02/22/applebug.html.
14. Ray, C.K. TDD and signed SSLVerifySignedServerKeyExchange. Exploring Agile Solutions: Software Development with Agile Practices (Feb. 23, 2014); http://agilesolutionspace.blogspot.com/2014/02/tdd-and-signed-sslverifysignedserverkey.html.
15. Schneier, B. Was the iOS SSL flaw deliberate? Schneier on Security: A Blog Covering Security and Security Technology (Feb. 2014); https://www.schneier.com/blog/archives/2014/02/was_the_ios_ssl.html.
16. van Deursen, A. Learning from Apple's #gotofail security bug. Arie van Deursen: Software Engineering in Theory and Practice (Feb. 22, 2014); http://avandeursen.com/2014/02/22/gotofail-security/.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.
I could not agree more about the unit testing and if possibly even more about the evils of copy & paste coders.
However ... this bug would have been a shining red error in my Eclipse workspace since trivial static analysis would have detected it. There are better arguments for unit tests imho ...
@Peter Kriens: Part of the argument is that the bug could've taken another form, e.g. mismatched braces. Static analysis, while a very helpful tool that might've caught this one manifestation of the bug, likely wouldn't have caught mismatched braces. A unit test would have. A unit test such as the proof-of-concept test I wrote to accompany this article should be called for regardless, to ensure that the correct failure points are triggered given various flavors of bogus inputs. Then there's the design pressure to eliminate duplication, which I believe is directly responsible for the genesis of this bug.
Either way, it appears neither unit testing nor static analysis was brought to bear here, and possibly not even code review (or, at best, the diff was so big the reviewer was blind to the buried error). That points to the deeper cultural issue I raised, where subpar code quality is indicative of a development (and possibly corporate) culture that does not take its social responsibilities seriously enough, or at least had a dangerous lapse in this one case. Having most of the media bandwidth spent on talking about how hard it would've been to test for, or how the code was pretty good anyways, doesn't help solve the problem, as it's not just a technical problem. In addition, we need people to say something more than just "X would've caught it" or "goto is e-ville" or "Nobody should be using C anymore". None of these responses go far enough to help solve the problem.
What better arguments for unit testing are there? Seriously, if they exist, please share them. We need more concrete, compelling arguments that show the value of unit testing in particular, and code quality practices in general. Bear in mind that pure rational arguments often aren't effective on the other side of the chasm, to borrow the metaphor from Geoffrey Moore's "Crossing the Chasm". Such appeals get heads to nod among the like-minded Innovators and Early Adopters (described in Moore's book), but everyone else needs more meat, in terms of examples and, best of all, experience (hence the proof-of-concept test).
Part of my motivation in writing this was to inspire more people to share their concrete arguments and experiences, even if only internally in their own companies. Bugs such as these, where we can both point clearly at the code and the downstream implications, provide opportunities like no other to make the case that code quality matters, and unit testing in particular is one of the best tools at our disposal to prevent a multitude of really straightforward coding errors early--and produces a host of beneficial second-order effects when done well. I also go into far greater depth in my "Goto Fail, Heartbleed, and Unit Testing Culture" article for Martin Fowler, linked from the final section of the article. (This article is seven pages in the printed CACM, including images; the other article, printed from Chrome, minus the acknowledgements at the end, is forty-seven.)
Displaying all 2 comments