Opinion
Security and Privacy

Stop Using Vulnerability Counts to Measure Software Security

Shifting the focus from assessing volume to assessing effectiveness.


Every link. Every email message. Every phone call. Every download. From end users to engineers to executives, we all have to make daily decisions about what software to trust. And every day, we see our systems updating to patch yet another vulnerability. We must continually ask: Is this software secure enough to do this?

So we go to metrics to make these decisions. Metrics are great: they take complex nuance and boil it down to a number or a label. All you need is data. The National Vulnerability Database (NVD) has become the international standard repository for reported vulnerabilities across the software industry, open source and closed. Since 1999, the NVD has amassed records of over 150,000 vulnerabilities. This project is near and dear to me, as I have devoted my career to studying the NVD, uncovering the engineering failures that led to those vulnerabilities.

But, with all this data, I see the same mistake being made. I see it when peer reviewing scientific papers, in industry, in the classroom, and in media coverage. Everywhere.

Vulnerability fix count is not the same as vulnerability count. Just because a project has a history of vulnerabilities fixed does not mean it is less secure. We ignore the fact that just finding those vulnerabilities likely took an above-and-beyond effort from an ethical Good Samaritan who probably did not even have cybersecurity as a required part of their degree. We ignore that some vulnerabilities suddenly become easier to find because fuzzer technology advanced or APIs were deprecated. We ignore the chilling effect of engineers not being candid about their mistakes.

So, I beg of you, Communications readers: Please stop using vulnerability fix counts to measure security. Vulnerability data has more nuance than that. Let’s dissect this rhetorically, explore its variations, and discuss an alternative approach.

The Two Airplane Mechanics

Suppose you get to choose one of two airplanes to fly, and you only have a 15-minute meeting with each plane’s mechanic. The first mechanic is dressed in a suit and gives you a well-rehearsed pitch. He talks about his pedigree and his qualifications. He tells you about his 1,000-point checklist, mentioning every technical buzzword about aerospace you have ever heard. He assures you that he has checked thoroughly and, of course, has found no problems.

The second mechanic shows up a little late, in a stained T-shirt, grease under her fingernails, and a sharp pencil behind her ear. Her clipboard is packed with meticulously documented printouts of every fix she made to this plane from the last 10 years. Forgetting to introduce herself, she dives right into what she fixed recently, what parts of the plane wear out faster, and what she would like to do better next time. She shows you what she checked, what she fixed, how she fixed it, how she found it, how she thinks the mistake was originally made, and the steps the team has already taken to prevent it from happening again.

Which plane do you fly on?

We do the digital equivalent of this in cybersecurity all the time.

Variations on a False Argument

I see two forms of this false argument: false comparisons and justifying a new security practice. The false comparison is when we use phrases such as “Microsoft fixed 30 vulnerabilities but macOS only had 12.” What does that really mean? Is Microsoft more secure because more problems were fixed? Or did macOS have fewer vulnerabilities to begin with? Since we do not know the actual number of vulnerabilities (and, try as we might, I have yet to see a solid way to estimate it), we cannot know whether those fixes represent more effort in quality assurance or more secure software.

I see the false comparison being made between systems, languages, and frameworks. I see software projects claiming to be better alternatives because they have not found as many vulnerabilities. I see rankings of which types of vulnerabilities are “bigger problems” based on recent fix trends. These arguments are problematic.

I also see vulnerability fix counts used to justify a new security practice. Academic researchers are especially guilty of this. “People have found and fixed thousands of SQL injection vulnerabilities, so here’s a way to find SQL injection vulnerabilities.” If people are so good at finding and fixing SQL injection vulnerabilities—which were already easy to find and trivial to fix—then why do we need a new way to find them? This self-fulfilling prophecy has flooded the literature with techniques for finding SQL injections. Meanwhile, researchers ignore the thousand other types of vulnerabilities.

The Mousetrap Paradox

This false argument leaves out a key fact: the vulnerability is gone now. The problem you are pointing to no longer exists. And, by the time a vulnerability has made its way into the public database, the development team has had time to reflect and enact process improvements. As with Heisenberg measuring subatomic particles, the moment you measure it, you have changed it.

Suppose you are curious whether your house has mice, so you set a trap. The next day, you discover you have successfully caught a mouse. This is both bad news and good news. Your house did have at least one mouse, maybe more. But, on the bright side, you were effective at trapping the mouse; had you used the wrong bait or placed the trap in the wrong spot, you might have concluded that your house had no mice.

A vulnerability fix is a fascinating philosophical paradox: it is a record of a mistake that is now gone. It is both a defeat and a triumph.

Cybersecurity is so hard to reason about because it is defined by what is not happening. A system is considered secure until it’s found to be insecure. As my coach used to say, “Experience is what you get right after you need it”—security is only ever really known in hindsight.

Mixed-Up Incentives

Perhaps the worst consequence is that it creates a terrible incentive structure for engineers. If you are punished every time you admit a mistake, then the answer is easy: stop admitting mistakes. I recently conducted workshops with professional software engineers about learning cybersecurity from vulnerability history. We dived into some source code fixes to actual, historical vulnerabilities in well-known open source projects. I asked a developer who had spent the last half-hour learning about a vulnerability in Tomcat: “What would it have taken for someone on the development team to find this?” To paraphrase his answer: “To find an issue like this, you have to go above and beyond, deviate from your never-ending task list and consider the wider implications, then convince your colleagues that the problem is real, then convince management that you need to cut a new release, make change logs, write test cases, and so forth. I cannot fathom seeing developers in our organization pursuing that, though obviously I wish we would.”

Members of the team later likened this workshop experience to “therapy” because they had conversations with their manager that they had never had before. Mistakes were never on the agenda. Behind every vulnerability fix is a symphony of human error. Software developers need a safe space to admit their mistakes. They need to think critically about improvement and to work out solutions without the pressure for results. They need a place to record, track, manage, and resolve those mistakes without oversimplified metrics setting their meeting agendas for them. Let’s not pollute bug databases by messing with their incentives.

An Uncommon Denominator

In the world of software metrics, we talk a lot about denominators. Instead of raw counts of defects, for example, we prefer defect density. The term “normalizing” describes this conversion from raw numbers to a ratio or proportion that can be compared across projects, releases, languages, time, and other confounding factors. Good denominators are often neutral. For example, the number of source lines of code (SLOC) is a useful denominator because some projects are large and others are small. Denominators help filter out the noise in the numerators.

I propose that we start using the Number of Vulnerabilities Reported (NumVulnsReported) as the denominator in our security metrics. This has the nice side effect that if NumVulnsReported is zero, then your security posture is undefined. It shifts the conversation to the effectiveness of your vulnerability discovery efforts rather than their volume. The metric proposed in the next section uses NumVulnsReported as its denominator.
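
As a minimal sketch of this idea (the function name and the numbers are purely illustrative, not drawn from any real project or tool), a NumVulnsReported denominator turns a raw count into a comparable ratio and leaves the metric undefined when nothing has been reported:

    from typing import Optional

    def normalized_by_reports(count: int, num_vulns_reported: int) -> Optional[float]:
        """Normalize a raw count by NumVulnsReported.

        Returns None when NumVulnsReported is zero: with no reported
        vulnerabilities at all, the security posture is undefined, not perfect.
        """
        if num_vulns_reported == 0:
            return None
        return count / num_vulns_reported

    print(normalized_by_reports(3, 12))  # 0.25
    print(normalized_by_reports(0, 0))   # None: undefined, not "secure"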

Vulnerability Recidivism Metrics

So counting vulnerabilities is bad? Not at all. The best metrics are actionable, meaning that they drive process improvement by identifying problems as opposed to outcomes. What if we measured a potential lack of process improvement? We propose the term vulnerability recidivism to refer to engineering failures that result in vulnerabilities repeating in various ways. Here’s one such metric.

Type Recidivism: NumRepeatedTypeVulnsReported / NumVulnsReported, or the percentage of vulnerabilities whose type has been reported before in this project. If a security practice is effective, it should find many different types of vulnerabilities. Additionally, if the team is looking for design-level improvements instead of only patching, then this metric will remain low. For example, a project might patch a new integer overflow vulnerability every few weeks without doing a comprehensive review of how it handles security-critical integers in its design. The Common Weakness Enumeration (CWE), the taxonomic “animal kingdom” of vulnerabilities, would be useful for determining whether a new “type” of vulnerability was found.
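
Here is a short sketch of how Type Recidivism could be computed, assuming only that each reported vulnerability carries a CWE identifier and that reports are processed in chronological order (the function and data are illustrative, not part of any standard tooling):

    from typing import Optional

    def type_recidivism(cwe_ids: list[str]) -> Optional[float]:
        """NumRepeatedTypeVulnsReported / NumVulnsReported: the fraction of
        reported vulnerabilities whose CWE type was already seen in this project."""
        if not cwe_ids:
            return None  # no reported vulnerabilities: the metric is undefined
        seen: set[str] = set()
        repeats = 0
        for cwe in cwe_ids:  # assumes chronological order of reports
            if cwe in seen:
                repeats += 1
            else:
                seen.add(cwe)
        return repeats / len(cwe_ids)

    # Three integer overflows (CWE-190) and one SQL injection (CWE-89):
    print(type_recidivism(["CWE-190", "CWE-89", "CWE-190", "CWE-190"]))  # 0.5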

Vulnerability recidivism can be explored in other ways, such as with modules and authors: counting the number of reported vulnerabilities that were in a previously vulnerable module, or that came from an author whose code was associated with vulnerabilities before.

These three metrics (type, module, and author recidivism) “stack” nicely too, because a vulnerability can exhibit one or more of those recidivisms, increasing the evidence of a lack of process improvement. If the team sees a vulnerability with a repeated type, a repeated module, and a repeated author, then perhaps a broader discussion is needed. As the old adage goes: “Fool me once, shame on you. Fool me twice, shame on me.”
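
A hypothetical sketch of that stacking, assuming each reported vulnerability record carries a CWE type, a module, and an author (the record fields and data are invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class ReportedVuln:
        cwe_id: str   # vulnerability type, e.g. "CWE-190"
        module: str   # module where the fix landed
        author: str   # author of the code that introduced the flaw

    def recidivism_scores(history: list[ReportedVuln]) -> list[int]:
        """For each reported vulnerability, in chronological order, count how
        many of its type, module, and author appeared in earlier reports (0-3).
        A score of 3 is a strong signal that a broader discussion is needed."""
        seen_types: set[str] = set()
        seen_modules: set[str] = set()
        seen_authors: set[str] = set()
        scores: list[int] = []
        for v in history:
            score = ((v.cwe_id in seen_types)
                     + (v.module in seen_modules)
                     + (v.author in seen_authors))
            scores.append(score)
            seen_types.add(v.cwe_id)
            seen_modules.add(v.module)
            seen_authors.add(v.author)
        return scores

    history = [
        ReportedVuln("CWE-190", "parser.c", "alice"),
        ReportedVuln("CWE-89",  "query.c",  "bob"),
        ReportedVuln("CWE-190", "parser.c", "alice"),  # repeats type, module, author
    ]
    print(recidivism_scores(history))  # [0, 0, 3]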

A Hard Problem

In March 2024, the U.S. White House released a report discussing recommendations for cybersecurity. Much of the media coverage of this report dwelled on a discussion surrounding memory safety, which is a worthy discussion for another time. But the report also includes this gem: “Software measurability is one of the hardest open research problems to address; in fact, cybersecurity experts have grappled with this problem for decades.” As one of those experts, I can confirm. Measuring security is both necessary and hard. Good measurement can only happen when developers are safe to admit mistakes and improve their process. Above-and-beyond effort ought to be celebrated and incentivized. Let’s start looking deeper into vulnerability history to learn how we can do better in the future.
