From WhatsApp to Greggs - why is tech going down more?
What links Greggs, maker of the UK’s most popular sausage roll, and tech titans Apple and Meta?
In March and April 2024, all have seen customers struggle to access some of their services - from baked goods, to Big Macs and WhatsApp messages - because of IT outages.
Coincidence? Not according to experts, who say such outages now really are happening more frequently.
These recent high-profile cases have thrust one particular website into the spotlight.
Downdetector is a platform which monitors web outages - its data gives an idea of the extent of the problems companies have been facing recently.
More than 1.75 million user-reported issues were flagged worldwide for WhatsApp on 3 April, according to the site.
It says tens of thousands were also reported for the App Store and Apple TV.
Neither firm responded to the BBC's questions about what had caused their outages.
But Brennen Smith, vice-president of technology at Downdetector's parent company Ookla, says such instances reflect what they are seeing - which is more outages taking place and a higher number of reports from users as they happen.
"The internet is not exactly getting more stable," he told the BBC.
To understand why, you need to understand a little more about the internet itself.
Like software, it is composed of many layers. And every time regulators demand changes to platforms, consumers seek seamless access to data or investors push for buzzy new features like AI chatbots, new layers are added.
Introducing more layers and complexity creates more risk of things going wrong.
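A rough way to see why extra layers add risk is to multiply the odds together. The sketch below assumes a request must pass through several layers in a row, that each layer fails independently, and that each one works 99.9% of the time. Those numbers and the function names are illustrative assumptions, not figures from Downdetector or Ookla.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(per_layer: float, layers: int) -> float:
    """Chance that every one of `layers` independent layers is working.

    Assumes the layers sit in series: if any single layer is down,
    the whole service is down.
    """
    return per_layer ** layers


def expected_downtime_minutes(availability: float) -> float:
    """Average minutes per year a service with this availability is down."""
    return (1 - availability) * MINUTES_PER_YEAR


# Illustrative only: the same 99.9% layer, stacked 1, 5 and 20 deep.
for depth in (1, 5, 20):
    overall = chain_availability(0.999, depth)
    print(depth, round(overall, 4), round(expected_downtime_minutes(overall)))
```

Under these assumptions, every added layer lowers the overall figure, even though no individual layer has become less reliable.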
"Right now there's a push for these mega giants to incorporate very game-changing new technology into their products and services," Mr Smith said.
"I think with the push for innovation now, we're going to start to see tech companies move faster [but] it comes at the risk of potentially breaking things."
Moving parts and thundering herds
The other thing to bear in mind with the internet: there are lots of different things that can make it fall over. Typos in code, faulty hardware, power failures and cyber attacks are just a few examples of why a service might go down.
Even severe weather, such as heatwaves, storms and natural disasters, can affect data centres - the huge halls housing powerful computers, known as servers, upon which online services rely.
"There are a lot of moving parts, and if just one of those goes wrong you can see problems," says Sam Kirkman of cyber-security firm NetSPI.
Another issue is that, over the last decade, lots of firms have moved from managing their servers and infrastructure in-house to putting them on the cloud.
That has enabled those firms to do more "faster than they ever could before", Mr Kirkman told the BBC - but it also means a single outage in one place at the cloud service provider can "cascade across a lot of the platforms, technologies and companies we use today".
Glitches for some of the biggest names in the industry - namely Amazon Web Services (AWS), Microsoft Azure and Google Cloud - have previously led to downtime for thousands of customers.
Even glitches at smaller, yet heavily relied-upon, providers like Fastly and Cloudflare have had a knock-on effect for services.
The UK government's portal gov.uk was among major platforms knocked offline when Fastly had issues in June 2021.
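The cascade effect can be sketched as a walk through a map of who depends on whom. The service names below are hypothetical and the map is invented for illustration; it is not a description of how any real provider or customer is set up.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it relies on.
DEPENDS_ON = {
    "shop-website": ["cdn", "payments-api"],
    "mobile-app": ["payments-api"],
    "payments-api": ["cloud-region-a"],
    "cdn": [],
    "status-page": [],
    "cloud-region-a": [],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or indirectly relies on `failed`."""
    # Invert the map so we can look up "who depends on X?".
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outwards from the failed component.
    hit: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in hit:
                hit.add(service)
                queue.append(service)
    return hit


print(sorted(affected_by("cloud-region-a", DEPENDS_ON)))
```

In this toy map, a fault in one cloud region reaches the shop website and the mobile app even though neither names the region as a direct dependency.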
Sudden spikes in demand for a service can cause prolonged or complex outages, especially during high-traffic events like Black Friday or low-staffed periods like bank holidays or weekends.
Theories that Fridays see more outages than other days of the week may just be speculation, Mr Smith says.
But he notes many firms do have policies not to ship updates or changes on Fridays.
"Less humans have hands on keyboards, less eyes on monitoring systems. It's a time where you don't want to be rolling out changes," he says.
IT glitches affecting Nationwide, McDonald's and Sainsbury's all took place on or began on Fridays in March, though they have been attributed to different causes.
More widely, engineers trying to patch problems and get a service back online during outages can also find themselves contending with a stampede of users trying to get hold of it.
Cloudflare said it encountered one of these so-called "thundering herds" when, during an outage stemming from a data centre power failure in November 2023, a slew of requests initially overwhelmed a recovery site.
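A toy simulation gives a feel for the problem. It assumes 10,000 waiting clients and a recovery site that can handle 500 requests per tick; both figures and the function names are made up for illustration. It compares everyone retrying in the same instant with retries spread at random over a minute, a commonly used mitigation often called jitter. It is not a description of what Cloudflare did.

```python
import random
from collections import Counter


def arrivals_per_tick(clients: int, spread_ticks: int, seed: int = 0) -> Counter:
    """Count how many clients retry in each tick.

    Each client picks one tick at random in [0, spread_ticks).
    With spread_ticks=1 every client lands on tick 0: the herd.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return Counter(rng.randrange(spread_ticks) for _ in range(clients))


def overloaded_ticks(arrivals: Counter, capacity: int) -> int:
    """Number of ticks in which demand exceeds what the site can serve."""
    return sum(1 for count in arrivals.values() if count > capacity)


CLIENTS = 10_000   # assumed number of waiting users
CAPACITY = 500     # assumed requests the recovery site handles per tick

herd = arrivals_per_tick(CLIENTS, spread_ticks=1)
spread = arrivals_per_tick(CLIENTS, spread_ticks=60)

print("peak, all at once:", max(herd.values()))
print("peak, spread out: ", max(spread.values()))
print("overloaded ticks: ", overloaded_ticks(herd, CAPACITY), "vs", overloaded_ticks(spread, CAPACITY))
```

The total number of requests is the same in both cases; only the timing differs, which is why a service can be swamped at the moment it comes back.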
'Technical debt'
Underpinning all of this is another fundamental truth of the online world: while the services and products on offer grow ever more sophisticated, its basic architecture is, often, quite antiquated.
In other words, the modern internet relies "on a fabric of really old technology", says Mr Kirkman.
He highlights Border Gateway Protocol (BGP) - one of the internet's most important protocols for determining where traffic goes - as a good example, shown by Meta's six-hour outage in October 2021.
Misconfigured BGP updates by Facebook meant it essentially stopped talking to the rest of the internet.
And users of its platforms were likewise left unable to communicate with families or manage their businesses.
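In very simplified terms, networks use BGP to announce which blocks of addresses they can reach, and other networks can only send traffic to blocks that have been announced. The sketch below is a heavily reduced model of that idea, not real BGP. The class name, the "provider-a" label and the example address block (a range reserved for documentation) are all assumptions for illustration.

```python
import ipaddress
from typing import Optional


class RoutingTable:
    """Toy routing table: remembers which address blocks have been announced."""

    def __init__(self) -> None:
        self._routes: dict[ipaddress.IPv4Network, str] = {}

    def advertise(self, prefix: str, next_hop: str) -> None:
        """Record that `prefix` is reachable via `next_hop`."""
        self._routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix: str) -> None:
        """Forget a previously announced prefix."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[str]:
        """Return the next hop for `address`, or None if no route exists."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RoutingTable()
table.advertise("192.0.2.0/24", "provider-a")   # hypothetical announcement
print(table.lookup("192.0.2.10"))               # a route exists

table.withdraw("192.0.2.0/24")                  # the announcement disappears
print(table.lookup("192.0.2.10"))               # nowhere to send the traffic
```

Once the announcement is gone, the servers behind it may be running normally, but the rest of the network has no path to them.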
Mr Kirkman says BGP represents an ongoing challenge because it has to be maintained but cannot be easily updated, and minor misconfigurations can take down entire platforms.
It highlights what he says some might consider "technical debt" as an issue potentially affecting the whole of the internet.
These problems are not new. But our growing reliance upon online services means they are becoming an ever bigger challenge for firms seeking to prevent them.
"What really we're seeing is that people are caring more and more," says Mr Smith.
"I'd say now more than ever, it's really important that services are able to stay resilient to stay online and still bring new innovations and features to market," he adds.