On October 4, 2021, Facebook—along with its subsidiaries, Instagram and WhatsApp—went down. For more than six hours, 3.5 billion users were cut off from their digital data—and from the online worlds that house them.
Nate Foster, Associate Professor of Computer Science in the Ann S. Bowers College of Computing and Information Science, works to develop languages and tools that make it easy for programmers to build secure and reliable systems. He recently addressed three aspects of the unprecedented occurrence: cause, impact, and prevention.
Cause of the Outage: "Facebook is likely to publish a detailed 'incident postmortem' about this outage in the coming days. But based on information released so far, it seems it was caused by a bug in the configuration for an Internet router (i.e., BGP) that also disconnected Facebook's domain name servers (i.e., DNS) from the rest of the world. In addition, there have been some reports that the bug was introduced by a flaw in an automated network management system, though these reports haven't been publicly confirmed. So far, there is no indication that the outage was due to malicious activity."
Impact of the Outage: "The impact of the outage was significant, both externally and internally. For external users, the outage meant that Facebook, Instagram, and WhatsApp were all unavailable for much of the day. In addition, Facebook is now used by many small businesses (e.g., on their marketplace) and as an authentication service for other websites. Within Facebook, there were some reports, again unconfirmed, that employees could not get access to buildings at corporate sites, presumably because the ID card system relies on Facebook's network. So the impact of this disruption was enormous."
Preventing Future Outages: "There are technical approaches that could be used to prevent outages like this in the future. For instance, one can imagine redesigning Facebook's network architecture to provide better failsafes, even if the main routes to the Internet are disconnected. Another approach is to use formal verification to validate the router configurations before they are deployed, an idea that has been used by other large network operators in an attempt to reduce the frequency and severity of outages."
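As a rough illustration of the pre-deployment validation idea Foster describes, the sketch below checks one invariant before a routing change is accepted: every critical DNS server must still fall inside at least one advertised prefix. This is a simple invariant check, far weaker than true formal verification, and it is not Facebook's tooling. The function names, the addresses (taken from documentation-only ranges), and the idea of modeling a configuration as a list of advertised prefixes are all assumptions made for the example.

```python
import ipaddress


def unreachable_servers(advertised_prefixes, critical_servers):
    """Return the critical server addresses not covered by any advertised prefix."""
    networks = [ipaddress.ip_network(prefix) for prefix in advertised_prefixes]
    missing = []
    for server in critical_servers:
        address = ipaddress.ip_address(server)
        # A server is considered reachable if some advertised prefix contains it.
        if not any(address in network for network in networks):
            missing.append(server)
    return missing


def safe_to_deploy(advertised_prefixes, critical_servers):
    """Accept a proposed configuration only if no critical server is cut off."""
    return not unreachable_servers(advertised_prefixes, critical_servers)


# Hypothetical data: documentation-only address ranges, not real Facebook prefixes.
DNS_SERVERS = ["192.0.2.53", "198.51.100.53"]
current_config = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
proposed_config = ["203.0.113.0/24"]  # this change would withdraw both DNS prefixes

print(safe_to_deploy(current_config, DNS_SERVERS))        # True
print(unreachable_servers(proposed_config, DNS_SERVERS))  # ['192.0.2.53', '198.51.100.53']
```

A real verifier would reason about the full routing policy and every possible network state rather than a flat list of prefixes, but the principle is the same: a change that breaks a stated invariant is rejected before it reaches production routers.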
For related coverage of Foster's research, see:
- Nate Foster and Team Win Most Influential Paper for a Network Programming Language
- Multiple Google Faculty Research Awards for Cornell CS, including Nate Foster and Thomas Ristenpart
See also Brooke Erin Duffy's remarks on the outage.