By Jake Madders
The recent widespread Facebook outage has highlighted our rapidly growing reliance on a select few centralized technology companies, exposing the fragility of scaled, global networks. Like many others, my first thought when I heard the news was to share it over WhatsApp. But the Facebook-owned messaging service was out too, underlining the integral role that the social media giant plays in millions of people’s lives. With more than 2 billion users across 180 countries, WhatsApp is a key communication tool for people and businesses around the world every day, and the impact of an outage is enormous.
The reality is that all internet service providers operating at scale require such a high level of automation that one wrong code deployment can take down an entire system. With free communication platforms such as Facebook, WhatsApp, and Instagram, we’ve built up a false sense of security. There’s an expectation that giant multinational tech conglomerates, due to their almost limitless budgets, will put in the necessary fail-safes—but this isn’t always the case.
This issue, however, doesn’t just affect Facebook – any internet service provider has the potential for mass outages. It’s time for companies to ensure that their network infrastructure is resilient and built with strong security measures from the start.
The “route” of the problem
It’s impossible to say exactly what happened with Facebook on Monday evening, but the scale of the outage isn’t surprising. With remote, global setups such as Facebook’s, thousands of automated devices are required to maintain the network. Automation is a requirement for giant companies to scale appropriately, allowing Facebook’s many applications to operate continuously under very high usage.
But one wrong code deployment is then replicated thousands of times across all of the devices, and the scale of the problem multiplies with it. This is why an outage can be so widespread and last so long, requiring almost endless troubleshooting for each machine. Just one misplaced character has the potential to cause this problem.
To run an internet-service business, the network has to physically connect to the internet at some point through a router, which means proper planning must be put in place. To be as secure as possible, these measures need to be added when the network is being built.
More than a social detox
For a lot of people online, an evening without Facebook’s many social media applications served a useful purpose: a detox from screen addiction. There’s no question that it was nice to have a break from the endless WhatsApp group chats. But what if the outage had lasted another day? Or a week? At Hyve, we’ve always ensured that we use multiple communication channels alongside WhatsApp, such as Slack. But many businesses and self-employed individuals were struck by the realization that they were operating almost exclusively out of these applications.
When it comes to communication, diversifying your comms channels is very important. Outages will happen and you can’t let them destroy your business operations; there are many fail-safes you can implement to avoid your own network suffering the same fate.
An effective disaster recovery plan can include fast rollback functionality built into every device. When changes or updates are being rolled out and something goes wrong centrally, a rollback plan can be set up so that any change to a device must be confirmed before it becomes permanent. If an engineer doesn’t confirm the change within 20 seconds, the device automatically reverts to its original settings, stopping the mass changes from taking hold.
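The confirm-or-revert idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular vendor or Facebook implements it: the `Device` class, its method names, and the dictionary-based configuration are all hypothetical, and the 20-second window is taken from the text.

```python
import time
from typing import Callable, Dict, Optional


class Device:
    """Hypothetical network device whose config changes must be confirmed."""

    def __init__(self, name: str, config: Dict[str, str]) -> None:
        self.name = name
        self.config = dict(config)
        self._previous: Optional[Dict[str, str]] = None  # saved settings
        self._deadline: Optional[float] = None           # confirm-by time

    def apply(self, new_config: Dict[str, str], confirm_window: float = 20.0,
              now: Callable[[], float] = time.monotonic) -> None:
        """Apply a change provisionally and start the confirmation window."""
        self._previous = dict(self.config)
        self.config = dict(new_config)
        self._deadline = now() + confirm_window

    def confirm(self, now: Callable[[], float] = time.monotonic) -> bool:
        """Make the pending change permanent if the window is still open."""
        if self._deadline is None:
            return False  # nothing pending
        if now() > self._deadline:
            self._rollback()
            return False  # too late: the change has been reverted
        self._previous = None
        self._deadline = None
        return True

    def tick(self, now: Callable[[], float] = time.monotonic) -> bool:
        """Periodic check: revert if the window expired. True if reverted."""
        if self._deadline is not None and now() > self._deadline:
            self._rollback()
            return True
        return False

    def _rollback(self) -> None:
        # Restore the settings saved before the unconfirmed change.
        if self._previous is not None:
            self.config = self._previous
        self._previous = None
        self._deadline = None
```

The point of the design is that the safe outcome is the default: if the change cuts the engineer off from the device, no confirmation arrives and the device restores itself without anyone needing to reach it.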
Additionally, as companies scale further, the number of employees with potential access to the many thousands of network devices also increases. Internet services need to ensure that there are robust access policies in place that prevent anyone and everyone from having access to important devices. This should also include diligent sign-off processes and checks – the more safety procedures, the higher the chances of stopping issues before they become widespread.
It’s also possible to switch quickly to secondary backup devices to mitigate mass outages. When it comes to storage, having a backup location with a different storage provider significantly reduces the chances of losing important data. Both measures make outages much easier to mitigate. In all cases, having a dedicated team of engineers on location is incredibly important – it’s surprising that Facebook, with its gigantic global team, didn’t seem to have this system in place.
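As a rough illustration of switching to a secondary device, the selection logic might look like the sketch below. The function name `pick_active`, the device names, and the health-check callback are hypothetical; a real setup would rely on whatever monitoring and routing tools are already in place.

```python
from typing import Callable, Optional, Sequence


def pick_active(devices: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy device in priority order, or None."""
    for device in devices:  # primary first, then backups
        if is_healthy(device):
            return device
    return None  # nothing healthy: escalate to on-site engineers
```

Listing the primary first and the backups after it means traffic only moves to a secondary device when the primary fails its health check.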
If these security measures haven’t already been implemented, it will likely be necessary to plan a temporary outage to build these fail-safes in. But the huge long-term benefits far outweigh the costs.
The nature of internet services
There are inherent risks that can never be completely eliminated when running an internet service—it’s the nature of the business. But if steps aren’t taken to reduce these risks as much as possible, future outages of this kind could be disastrous, with bigger and longer outages that wreck countless businesses. The more we become reliant on a select few giant tech conglomerates, the bigger these risks become—when they go down, we all go down. This is a wake-up call for internet service companies everywhere.
About the author
Jake Madders is managing director and co-founder of Hyve, a UK-based managed hosting provider.