How CrowdStrike Stopped Everything

Cybersecurity experts know the CIA Triad well: that’s the acronym for Confidentiality, Integrity, and Availability, all of which are essential to data security. Organizations must keep data secret so not just anyone can see it; they must maintain its accuracy and reliability, and they must keep the information available to those who need it.

Organizations ensure data confidentiality with access controls and encryption. They maintain data integrity through backup and recovery and version, access, and security controls. Data holders keep information available through redundancy, backups, and systems monitoring and updates.

However, keeping data available means more than preventing data loss. It is not only ransomware that puts data out of reach. Data, applications, and systems become unavailable in IT system outages. The data is still there, but no one can get to it.

On July 19, 2024, a global IT outage caused by cybersecurity company CrowdStrike brought down millions of Windows computers worldwide, making data and applications inaccessible. Hospitals canceled surgeries, airlines grounded planes, and 911 call centers suffered outages lasting several hours. These and countless other systems were unavailable. It was not a cyberattack or a data breach, but it was still a security issue, the biggest of its kind.

CrowdStrike Root Cause Analysis

CrowdStrike, in its Falcon Content Update Remediation and Guidance Hub, reported, “On July 19, 2024, a Rapid Response Content update was delivered to certain Windows hosts, evolving the new capability first released in February 2024. The sensor expected 20 input fields, while the update provided 21. The mismatch resulted in an out-of-bounds memory read, causing a system crash.”

The root cause analysis (RCA) indicates that one or more CrowdStrike programmers did not check their inputs before pushing an update to the CrowdStrike Falcon Windows Sensor in production. The update (Channel File 291) contained a mismatch between the expected and actual number of input fields. The update forced the Windows sensor to read the 21st input field, which it could not. That triggered an exception the system could not handle, which led to the Windows Blue Screen of Death (BSOD). It broke the software and bricked the Windows devices.
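The kind of input check described above can be illustrated with a minimal Python sketch. It is not CrowdStrike's code: the names EXPECTED_FIELDS, validate_field_count, and load_update are hypothetical, and the only detail taken from the RCA quote is the 20-versus-21 field count. Python raises an error on a bad index rather than reading out-of-bounds memory, so the sketch shows only the validation step that rejects a mismatched update before it is used.

```python
# Hypothetical illustration only; not the Falcon sensor's actual logic.
EXPECTED_FIELDS = 20  # the count the sensor expected, per the RCA quote


def validate_field_count(fields, expected=EXPECTED_FIELDS):
    """Return True only if the update supplies exactly the expected number of fields."""
    return len(fields) == expected


def load_update(fields, expected=EXPECTED_FIELDS):
    """Accept an update only after its field count has been checked."""
    if not validate_field_count(fields, expected):
        # Refuse the update instead of letting later code index past the end.
        raise ValueError(
            f"rejected update: expected {expected} fields, got {len(fields)}"
        )
    return list(fields)
```

Under these assumptions, an update carrying 21 fields is refused with a clear error at load time instead of crashing the code that consumes it.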

The extent of the outage depends on how you look at the data. “We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than 1% of all Windows machines,” said David Weston, vice president of Enterprise and OS Security at Microsoft, in a Microsoft blog.

It’s Fixed

According to the Falcon Content Update Remediation and Guidance Hub, ~99% of Windows sensors were online as of July 29 at 5 pm PT. Regarding the Windows sensor uptime, Kevin Benacci, senior director of corporate communications at CrowdStrike, relayed via email, “We typically see a variance of ~1% week-over-week in sensor connections.”

According to Kelly Whitten, principal at Kekst CNC, a global communications consultancy, responding for press relations for CrowdStrike, the ~1% variance happens constantly. It is attributable to things like turning Windows systems off.

Critical infrastructure, business-critical, and everyday applications

The Microsoft Teams meeting application was among those disrupted by the outage. “Everyday applications that people rely on, like cloud-based productivity tools and communication platforms, went dark, bringing business operations and daily routines to a grinding halt. It reminded us how modern economies intertwine with these systems and how devastating it can be when they fail,” said Trevor Horwitz, chief information security officer and founder at TrustNet Inc., a provider of cybersecurity and compliance products and services.

Digital signage at New York City’s LaGuardia Airport displayed the Microsoft Windows BSOD error message. “The outage disrupted a diverse array of essential systems vital for the functioning of hospitals, airports, stock exchanges, airlines, financial institutions, and retail outlets,” said Bob Bruns, chief information security officer at Avanade, a global provider of IT consulting and services focused on the Microsoft platform.

Emergency call centers were down, and hospitals delayed, diverted, or canceled clinical medical procedures. “These disruptions are resulting in some clinical procedure delays, diversions, or cancellations. [The] impact is also being felt indirectly as a result of local emergency call centers being down,” said John Riggi, American Hospital Association (AHA) national advisor for Cybersecurity and Risk, in an AHA Cybersecurity Advisory.

The scale of the outage points to the gravest consequence of all. “The failures cascaded as dependent systems crashed, halting operations across multiple sectors. Emergency services and day-to-day business applications became inaccessible, affecting national and international services, from healthcare to transportation. At this scale, it can easily be said that this cost lives, as well as untold loss of income and aggravation for individuals,” said John Marcato, chief technology officer at TWE Solutions Inc., an IT solutions and consulting firm.

There were sea freight lapses at the ports of Rotterdam, Netherlands, and Gdansk, Poland. “Airline reservation systems, logistics platforms, and even smart traffic management systems were affected, causing delays and cancelations. It disrupted business travel, trade, and everyday commutes,” said Bruns.

In the U.K., London taxi drivers’ card payment systems were down, forcing travelers to pay in cash. Some had to pay extra to cover trips to ATMs to get the cash to reach their original destinations.

The CrowdStrike outage disrupted TV broadcasters such as Sky News, which went off the air. Social Security Administration offices closed for a day. Grocery stores saw system crashes.

Verizon said on its website that due to the CrowdStrike outage, some of its customer service and store operations may have been limited, possibly creating longer wait times.

Preventing a recurrence

While it may be impossible to prevent such outages, some experts believe their impact could be mitigated by making systems resilient.

Said Jen Easterly, director of the U.S. Cybersecurity and Infrastructure Security Agency (CISA) in a personal LinkedIn article post, “Even in a world where we make major progress on Security-By-Design and Security-By-Demand, bad things will still happen. Whether it’s a technology outage caused by faulty code or a cyber-attack caused by Chinese cyber actors, we must expect there to be disruption. We should plan for it, prepare for it, and build our systems and our networks to withstand it as much as possible, as well as train and resource our people to manage through it.”

Said Avanade’s Bruns, “While we can and must take every precaution to prevent such outages, we must also be realistic: no system is infallible, and technology evolves faster than we can anticipate. The truth is, it’s not a matter of whether but when the next global IT disruption will occur.”

On the other hand, one technology expert said the CrowdStrike outage was preventable because deploying software updates the way CrowdStrike did is not a reasonable practice. According to Marcato, no competent software developer would choose to roll out a software update so widely all at once; it’s not done in the modern world.

“It is not a new or complex concept. You do your full internal testing, and once approved, you roll out to a small and varied distribution of systems, allowing for a waiting period and then a slightly larger group,” said Marcato.

“You can still get a new update out in under a week, but you can stop if you see problems. This would have had minimal impact, and you would never have heard about it if it had been approached this way.”
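Marcato's description of a staged rollout can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any vendor's deployment system: the function staged_rollout, its deploy and healthy callbacks, and the 1%/10%/100% stages are all hypothetical. The waiting period between stages and the "varied distribution" of hosts are not modeled; the health check after each stage stands in for them.

```python
def staged_rollout(hosts, stage_percents, deploy, healthy):
    """Deploy to progressively larger groups, halting at the first unhealthy stage.

    stage_percents is a cumulative list such as [1, 10, 100].
    Returns (updated_hosts, halted).
    """
    remaining = list(hosts)
    total = len(remaining)
    updated = []
    for percent in stage_percents:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, total * percent // 100)
        batch = remaining[: max(0, target - len(updated))]
        remaining = remaining[len(batch):]
        for host in batch:
            deploy(host)
            updated.append(host)
        # A real system would wait here before checking health.
        if not all(healthy(host) for host in batch):
            return updated, True  # stop: the problem stays in this small group
    return updated, False
```

With 100 hosts and stages of 1%, 10%, and 100%, an update that crashes every host it touches stops after a single machine, while a healthy update still reaches all 100.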

Lost data, fraudsters, and scammers

In the CrowdStrike outage, some data was not created, while other data was not recorded. Said TrustNet’s Horwitz, “During the downtime, hospitals couldn’t update patient records, leading to gaps in critical medical histories. In financial services, transaction data wasn’t captured, causing discrepancies that could have serious financial implications. The data can’t be recovered. It simply doesn’t exist.”

Threat actors, fraudsters, and scammers have capitalized on the high-profile global outage as they did with the pandemic and other prominent world events. “As with any major disruption, bad actors quickly exploited the situation,” said Horwitz. “During the outage, we saw a spike in phishing attempts, with scammers sending out emails that appeared to be from trusted IT providers, aiming to steal credentials or install malware.”

Future-prepping

Recalling David Weston’s numbers, the outage affected less than 1% of Windows machines. What happens when an IT outage affects 10%, 5%, or even just 2% of Windows installations, or those of other equally ubiquitous software?

We cannot know for sure, but we can make better preparations. We already know the steps: mature software development, testing, and release practices; building in security during initial product design; system and infrastructure resilience; and response training.

David Geer is a journalist who focuses on issues related to cybersecurity. He writes from Cleveland, OH, USA.
