
Communications of the ACM

ACM News

What Caused the Facebook Outage?



According to The Washington Post, the outage of Facebook and its WhatsApp messaging service stalled the personal and professional lives of tens of millions of people.

Credit: Facebook

On October 4, 2021, Facebook and its properties Facebook Messenger, WhatsApp, Instagram, and Oculus suffered a global outage. People and businesses could not use the sites or services. The outage started around 11:39 a.m. Eastern Time, according to a blog post from Kentik, a network observability company, and ended that evening.

According to many major news reports, Facebook employees and contractors also could not log on to the social media titan's internal tools for work and business communications. Facebook engineers could not gain physical access to datacenter facilities to fix the issue.

Dissecting Essential Outage Events

According to a statement from Facebook VP of Infrastructure Santosh Janardhan in a Facebook Engineering blog post, routers manage data traffic between Facebook datacenters globally. Facebook engineers frequently take parts of the company's global backbone network offline to repair fiber lines, add capacity, or update router software.

During such routine maintenance, the system that manages Facebook's global backbone network capacity sent a command to see how much bandwidth was available. A command is a set of instructions that software and devices understand and use. The command errantly took down all the connections in the backbone network, disconnecting all the Facebook datacenters. The system has a tool that audits these commands to prevent such mistakes, but a bug kept the tool from stopping the command. The specific command, audit tool, and bug have not been publicly identified.
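The idea of auditing a command before it runs can be sketched in a few lines of Python. This is only an illustration, not Facebook's tool: the `Command` class, the link names, and the 25% threshold are all hypothetical. The sketch rejects any command that would disable too large a share of the backbone, and it "fails closed," so an error inside the audit blocks the command rather than letting it through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A hypothetical backbone-management command (not a real Facebook type)."""
    name: str
    links_taken_down: frozenset  # backbone links the command would disable


def audit(command: Command, all_links: frozenset, max_fraction: float = 0.25) -> bool:
    """Return True only if the command affects an acceptable share of links.

    Fails closed: any error raised during the audit rejects the command.
    """
    try:
        if not all_links:
            return False
        affected = command.links_taken_down & all_links
        return len(affected) / len(all_links) <= max_fraction
    except Exception:
        return False


def execute(command: Command, live_links: set, all_links: frozenset) -> bool:
    """Apply the command to live_links only if the audit approves it."""
    if not audit(command, all_links):
        return False
    live_links.difference_update(command.links_taken_down)
    return True


# Hypothetical four-link backbone between four datacenters.
ALL_LINKS = frozenset({"dc1-dc2", "dc2-dc3", "dc3-dc4", "dc4-dc1"})
live = set(ALL_LINKS)

maintenance = Command("drain-one-link", frozenset({"dc1-dc2"}))
errant = Command("assess-capacity", ALL_LINKS)

ran_maintenance = execute(maintenance, live, ALL_LINKS)  # 1 of 4 links: allowed
ran_errant = execute(errant, live, ALL_LINKS)            # 4 of 4 links: blocked
```

In this sketch the single-link maintenance command runs, while the command that would disconnect every link is refused and the remaining three links stay up.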

Further, according to the Facebook Engineering blog, because Facebook's Domain Name System (DNS) servers could not reach the company's datacenters, those servers signaled the Border Gateway Protocol (BGP) system, which "advertises" Facebook's IP addresses to the Internet, to "withdraw those BGP advertisements." As Janardhan wrote in his Facebook Engineering blog post, "Our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the Internet to find our servers."
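A toy model may help show why this behavior made healthy DNS servers vanish from the Internet. The Python below is a simplified sketch under stated assumptions, not Facebook's implementation: the `DnsNode` class, the node names, and the prefixes (taken from address ranges reserved for documentation) are hypothetical. Each DNS node advertises its prefix only while it can reach at least one datacenter, so when the backbone goes down every node withdraws at once.

```python
from dataclasses import dataclass


@dataclass
class DnsNode:
    """A hypothetical DNS server location that announces a prefix via BGP."""
    name: str
    prefix: str
    advertised: bool = True

    def health_check(self, reachable_datacenters: set) -> None:
        # Advertise only while at least one datacenter is reachable;
        # otherwise withdraw, even though the DNS software is still running.
        self.advertised = len(reachable_datacenters) > 0


def routable_prefixes(nodes) -> set:
    """Prefixes the rest of the Internet can currently find a route to."""
    return {node.prefix for node in nodes if node.advertised}


nodes = [
    DnsNode("edge-a", "192.0.2.0/24"),
    DnsNode("edge-b", "198.51.100.0/24"),
]

# Normal operation: the backbone is up, so datacenters are reachable.
for node in nodes:
    node.health_check({"dc1", "dc2"})
before = routable_prefixes(nodes)

# Backbone down: no datacenter is reachable from any DNS node.
for node in nodes:
    node.health_check(set())
after = routable_prefixes(nodes)
```

Here `before` contains both prefixes and `after` is empty: the servers are still running, but no route to them remains. A health check that is sensible for one failing node becomes a total outage when every node fails the check for the same reason.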

More Than Bugs

A bug in an audit tool started it all. When asked about the audit tool, Facebook's PR and communications manager Tom Parnell said, "We have nothing further to share beyond the blog post that you noted below:

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/."

Yet experts say the audit bug itself was not the main contributor to the outage. It is alarming that the audit tool could be a single point of failure, says Johnny Young, a.k.a. JohnE Upgrade, cybersecurity expert emeritus, IBM's Cloud division. "Taken at face value, the statement by VP of Infrastructure Santosh Janardhan says the architecture of the Facebook network is so poorly designed that a hacker who'd breached it could take down their entire global network with a single router command," says Young.

Says Fred Cohen, CEO of security and risk management consultancy Management Analytics and a seminal researcher on many cybersecurity defenses, "No single act should have this level of consequence. It demonstrates poor practice, unsound change control, and lacking redundancy and fail-safes for high-consequence events. All tools have 'bugs,' thus the need for redundancy and fail-safes. They were negligent."

Outage Effects

The outage hampered businesses that count on the social media giant. Stephen Light, co-owner of Colorado-based mattress firm Nolah Sleep, says, "We rely on Facebook and Instagram advertisements. The outage coincided with a week-long sale. For an entire day of our event, we were unable to reach our audience."

According to The Washington Post, the WhatsApp communications blackout stalled the personal and professional lives of tens of millions of people. Doug Madory, director of Internet Analysis for network observability service Kentik, says, "Outside the United States, it is very common to use WhatsApp in lieu of SMS messages. My family recently lived in Spain for a year, and all communications with my boys' school and their sports teams were over WhatsApp. It was assumed that you used WhatsApp. I think many businesses were disrupted, along with many people's daily routines, by not having WhatsApp."

Facebook paid the price for its catastrophe. The financial losses from the embarrassing outage probably total about $60 million, according to Ars Technica. In addition, Facebook founder Mark Zuckerberg lost about $7 billion in personal wealth during the outage due to a nearly 5% drop in the price of the company's stock, according to Yahoo Finance.

Coincidence, Not Correlation

Coincidentally, the Facebook outage happened the Monday morning after the CBS 60 Minutes episode featuring Facebook whistleblower Frances Haugen, the former Facebook data scientist who worked in the social giant's unit on civic integrity. Haugen leaked many internal Facebook documents, which she says demonstrate that the company knew its products were harming teens' and pre-teens' mental health and well-being, according to transcripts of her Congressional testimony.

There does not appear to be any relationship between the outage and the damaging revelations from Haugen's interviews, testimonies, and copied Facebook papers. However, they focus critical attention on Facebook's technical and business practices and their effects on society, business, and culture.

David Geer is a journalist who focuses on issues related to cybersecurity. He writes from Cleveland, OH, USA.


 
