Slow recovery from IT outage begins as experts warn of future risks


Services began to come back online on Friday evening after an IT failure wreaked havoc worldwide. But full recovery could take weeks, experts have said, after airports, healthcare services and businesses were hit by the “largest outage in history”.

Flights and hospital appointments were cancelled, payroll systems seized up and TV channels went off air after a botched software upgrade hit Microsoft’s Windows operating system.

The update came from the US cybersecurity company CrowdStrike, and left workers facing a “blue screen of death” as their computers failed to start. Experts said every affected PC might have to be fixed manually, but by Friday night some services had started to recover.

As recovery continues, experts say the outage underscored concerns that many organisations are not well prepared to implement contingency plans when a single point of failure such as an IT system, or a piece of software within it, goes down. Such outages will happen again, they warn, until more contingencies are built into networks and organisations introduce better back-ups.

In the UK, Whitehall crisis officials were coordinating the response through the Cobra committee. Ministers were in touch with their sectors to tackle the fallout from the IT failure, and the transport secretary, Louise Haigh, said she was working “at pace with industry” after trains and flights were affected.

Pat McFadden (@patmcfaddenmp) tweeted on July 19, 2024: “Many people are being affected by today’s IT outages impacting services across the country and globally. Ministers are working with their sectors and respective industries on the issue. I am in close contact with teams coordinating our response through the COBR response system.”

A Microsoft spokesperson said on Friday: “We’re aware of an issue affecting Windows devices due to an update from a third-party software platform. We anticipate a resolution is forthcoming.”

Texas-based CrowdStrike confirmed the outage was due to a software update from one of its products and was not caused by a cyber-attack.

Its founder and chief executive, George Kurtz, said he was “deeply sorry for the impact that we’ve caused to customers”, adding there had been a “negative interaction” between the update and Microsoft’s operating system.

CrowdStrike’s stock price fell dramatically over the course of the day, dropping by as much as 13% at some points in trading.

Elon Musk, the chief executive of Tesla, said the outage caused “a seizure to the automotive supply chain”, while banks in Kenya and Ukraine reported issues with their digital services, and supermarkets in Australia had problems with payments.

Govia Thameslink Railway (GTR) – the parent company of Southern, Thameslink, Gatwick Express and Great Northern – warned passengers to expect delays.

According to the service status monitoring website Downdetector, users in the UK were reporting issues with the services of Visa, BT, big supermarket chains, banks, online gaming platforms and media outlets.

The Sky News and CBBC channels were also temporarily off-air in the UK before resuming broadcasting, while Australia’s ABC was also affected.

In financial services, Metro Bank reported problems with its phone lines in the UK and Santander said card payments “may be affected”. Monzo said some customers were reporting issues, while some bankers at JP Morgan were unable to log on to their systems and the London Stock Exchange said there were problems with its news service.

Troy Hunt, a leading cybersecurity consultant, said the scale of the IT failure was unprecedented.

“I don’t think it’s too early to call it: this will be the largest IT outage in history,” he tweeted.

“This is basically what we were all worried about with Y2K, except it’s actually happened this time,” he added, referring to the millennium bug that worried IT experts in the run-up to 2000 – but ultimately did not cause serious damage.

The UK’s chartered institute for IT, the BCS, said it could take days and weeks for systems to recover, although some fixes will be easier to implement.

“In some cases, the fix may be applied very quickly,” said Adam Leon Smith, a BCS fellow. “But if computers have reacted in a way that means they’re getting into blue screens and endless loops it may be difficult to restore and that could take days and weeks.”

Alan Woodward, a professor of cybersecurity at the University of Surrey, said the fix required a manual reboot of affected machines and “most standard users would not know how to follow the instructions”. Organisations with thousands of PCs distributed in different locations face a tougher task, he added.

“It’s just sheer numbers. For some organisations it could certainly take weeks,” he said.

From Amsterdam to Zurich, Singapore to Hong Kong, airport operators flagged technical issues that were disrupting their services. While some airports halted all flights, in others airline staff had to check in passengers manually.

Among the companies affected on Friday was Ryanair, Europe’s largest airline, which said on its website: “Potential disruptions across the network due to a global third-party system outage … We advise passengers to arrive at the airport three hours in advance of their flight to avoid any disruptions.”

Heathrow, Europe’s biggest airport, said it was “working hard” to get passengers “on their way”.

A spokesperson for Heathrow said: “We continue to work with our airport colleagues to minimise the impact of the global IT outage on passenger journeys. Flights continue to be operational and passengers are advised to check with their airlines for the latest flight information.”

In the US, flights were grounded owing to communications problems that appeared to be linked to the outage. American Airlines, Delta and United Airlines were among the carriers affected.

Berlin airport temporarily halted all flights on Friday. The aviation analytics company Cirium said 5,078 flights – 4.6% of those scheduled – were cancelled globally on Friday, including 167 UK departures and 171 arrivals.

GP practices in the UK said they were unable to access patient records or book appointments. Surgeries reported on social media that they could not access the EMIS Web system.

It is understood that 999 services were unaffected by the outage, but the Royal Surrey NHS Trust, in the south of England, declared a critical incident and cancelled radiotherapy appointments scheduled for Friday morning. The National Pharmacy Association confirmed that UK services could be affected.

A spokesperson for Keir Starmer said they were unaware of the problem having any impact on government services, but added they recognised the impact it was having more broadly. Reports from the Netherlands also suggested there may be problems within the health service.

The Israeli health ministry said “the global malfunction” had affected 16 hospitals, while in Germany the Schleswig-Holstein university hospital in the north of the country said it had cancelled all planned operations in Kiel and Lübeck.

Ted Wheeler, the mayor of Portland, Oregon, issued an emergency declaration stating that certain essential city services including emergency communications were affected by the outage.

The University of Surrey’s Alan Woodward said the outage was caused by an IT product called CrowdStrike Falcon which monitors the security of large networks of PCs and downloads a piece of monitoring software to every machine.

“The product is used by large organisations that have significant numbers of PCs to ensure everything is monitored. Sadly, if they lose all the PCs they can’t operate, or only at a much reduced service level,” said Woodward.

Steven Murdoch, a professor of security engineering at University College London, said many organisations could struggle to carry out the fix swiftly.

“The problem is occurring before the computer is connected to the internet so there is no way to fix the problem remotely, so that requires someone to come out … and fix the problem,” said Murdoch, adding that companies and organisations that have cut back on IT staff or outsourced their IT work would find their ability to address the problem hampered.

However, Ciaran Martin, the former chief executive of the National Cyber Security Centre, said that unlike adversarial cyber-attacks, this problem had already been identified and a solution had been flagged.

“The recovery is not about getting on top of the situation but getting back up. I think it’s unlikely to be very newsworthy in terms of ongoing disruption this time next week,” he said.

Difficulties for businesses in the US were compounded by separate problems with Microsoft’s Azure cloud computing business that occurred on Thursday.

Reuters contributed to this report




