Why Facebook’s outage should be a wake-up call for internet services

October 10, 2021

By Jake Madders

The recent widespread Facebook outage has highlighted the rapidly growing reliance on a select few, centralized technology companies, exposing the fragility of scaled, global networks. Like many others, my first thought when I heard the news was to share it over WhatsApp. But the Facebook-owned messaging service was out too, exposing the integral role that the social media giant plays in millions of people’s lives. With more than 2 billion users across 180 countries, WhatsApp is a key communication tool for people and businesses around the world every day, and the impact of an outage is astronomical.

The reality is that all internet service providers operating at scale require such a high level of automation that one wrong code deployment can take down an entire system. With free communication platforms such as Facebook, WhatsApp, and Instagram, we’ve built up a false sense of security. There’s an expectation that giant multinational tech conglomerates, due to their almost limitless budgets, will put in the necessary fail-safes—but this isn’t always the case.

This issue, however, doesn’t just affect Facebook – any internet service provider has the potential for mass outages. It’s time for companies to ensure that their network infrastructure is resilient and has strong security measures built in.

The “route” of the problem

It’s impossible to say exactly what happened with Facebook on Monday evening, but the scale of the outage isn’t surprising. In a distributed, global setup such as Facebook’s, thousands of automated devices are required to maintain the network. Automation is necessary for giant companies to scale, allowing Facebook’s many applications to run continuously under very high usage.

But one wrong code deployment is then replicated thousands of times across all of the devices, and the problem multiplies with every device it reaches. This is why an outage can be so widespread and last so long, requiring troubleshooting on each machine. A single misplaced character has the potential to cause this problem.

To run an internet-service business, there has to be a point where the network physically connects to the internet through a router, which means proper planning must go into that connection. To be as secure as possible, these measures need to be added when the network is being built.

More than a social detox

For a lot of people online, an evening without Facebook’s many social media applications served a useful purpose: a detox from screen addiction. There’s no question that it was nice to have a break from the endless WhatsApp group chats. But what if the outage had lasted another day? Or a week? At Hyve, we’ve always ensured that we use multiple communication channels alongside WhatsApp, such as Slack. But many businesses and self-employed individuals were struck by the realization that they were operating almost exclusively out of these applications.

When it comes to communication, diversifying your comms channels is very important. Outages will happen and you can’t let them destroy your business operations; there are many fail-safes you can implement to avoid your own network suffering the same fate.

An effective disaster recovery plan can include fast rollback functionality built into every device. When changes or updates are being rolled out and something goes wrong centrally, a rollback plan can require any change to a device to be confirmed before it becomes permanent. If an engineer doesn’t confirm it within 20 seconds, the device automatically reverts to its original settings, stopping the mass changes from taking hold.
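To make the confirm-or-revert idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Device` class, its method names, and the injectable `clock` are hypothetical and do not correspond to any vendor’s API or to how Facebook’s devices work. The 20-second window is the figure mentioned above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Device:
    """Hypothetical device that keeps a change only if it is confirmed in time."""
    config: dict
    confirm_timeout: float = 20.0              # seconds, the window described above
    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    _previous: Optional[dict] = field(default=None, init=False)
    _deadline: Optional[float] = field(default=None, init=False)

    def apply(self, new_config: dict) -> None:
        """Activate a change provisionally and start the confirmation window."""
        self.tick()
        if self._deadline is not None:
            raise RuntimeError("another change is still awaiting confirmation")
        self._previous = dict(self.config)
        self.config = dict(new_config)
        self._deadline = self.clock() + self.confirm_timeout

    def confirm(self) -> bool:
        """Make the pending change permanent; False if nothing is pending."""
        self.tick()
        if self._deadline is None:
            return False
        self._previous = None
        self._deadline = None
        return True

    def tick(self) -> None:
        """Revert to the saved settings if the window has expired."""
        if self._deadline is not None and self.clock() >= self._deadline:
            self.config = self._previous
            self._previous = None
            self._deadline = None
```

In this sketch a real device would call `tick()` from a timer. The point of the design is that an engineer who is locked out by a bad change cannot confirm it, so the device recovers on its own.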

Additionally, as companies scale further, the number of employees with potential access to the many thousands of network devices also increases. Internet services need to ensure that there are substantial access policies in place, so that not just anyone can reach important devices. This should also include diligent sign-off processes and checks – the more safety procedures, the higher the chances of stopping issues before they become widespread.
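As a rough illustration of an access policy combined with a sign-off check, the sketch below gates a change on two conditions. Everything in it is an assumption made for the example: the `ChangeRequest` type, the `may_deploy` function, the per-device allow-list, and the default of two independent approvals.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical record of who wants to change which device."""
    author: str
    device: str
    approvals: FrozenSet[str] = frozenset()


def may_deploy(change: ChangeRequest,
               allowed: Dict[str, Set[str]],
               required_approvals: int = 2) -> bool:
    """Allow a change only for an authorized author with enough sign-offs."""
    # Access policy: the author must be on the allow-list for this device.
    if change.author not in allowed.get(change.device, set()):
        return False
    # Sign-off: authors cannot approve their own change.
    independent = {name for name in change.approvals if name != change.author}
    return len(independent) >= required_approvals
```

A deployment tool could call `may_deploy` before pushing anything to a device, so that a change from an unauthorized account, or one without enough reviewers, never leaves the engineer’s machine.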

It’s also possible to switch quickly to secondary backup devices to mitigate mass outages. When it comes to storage, having a backup location with a different storage provider significantly reduces the chances of losing important data and makes it much easier to recover. In all cases, having a dedicated team of engineers on location is incredibly important – it’s surprising that Facebook, with its gigantic global team, didn’t seem to have this system in place.
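Switching to a secondary device can be reduced to a simple rule: use the first device in a priority list that passes a health check. The sketch below assumes a hypothetical `pick_active` function and a caller-supplied `is_healthy` check. Real failover also has to handle state and traffic redirection, which this example leaves out.

```python
from typing import Callable, Optional, Sequence


def pick_active(devices: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy device in priority order, or None if all fail."""
    for name in devices:
        if is_healthy(name):
            return name
    return None
```

Listing the primary first means traffic returns to it as soon as it is healthy again, while a `None` result signals that engineers need to step in.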

If these security measures haven’t already been implemented, it will likely be necessary to plan a temporary outage to build these fail-safes in. But the huge long-term benefits far outweigh the costs.

The nature of internet services

There are inherent risks that can never be completely eliminated when running an internet service—it’s the nature of the business. But if steps aren’t taken to reduce these risks as much as possible, future outages of this kind could be disastrous, with bigger and longer outages that wreck countless businesses. The more we become reliant on a select few giant tech conglomerates, the bigger these risks become—when they go down, we all go down. This is a wake-up call for internet service companies everywhere.

About the author

Jake Madders is managing director and co-founder of Hyve, a UK-based managed hosting provider.
