CrowdStrike update causes global outages: Analysis

A botched CrowdStrike software update resulted in a massive global outage of computer systems, primarily impacting Microsoft Azure customers. The problem surfaced in the early hours of Friday, July 19, when organizations began encountering the notorious "blue screen of death." Dustin Sachs, chief technologist at CyberRisk Alliance, explained, "At about 1 AM Eastern US time, companies started noticing errors that caused their computers to stop functioning. It was quickly traced to an update for CrowdStrike Falcon, a cybersecurity tool."

(Full transcript of video available below)

The scope of the disruption is staggering. Major airlines such as American Airlines, Delta, and United have been affected, alongside the UK's National Health Service and various hospital labs in the US. "The scope of this is growing minute by minute," Sachs said, adding that financial institutions, including credit banks, have also reported issues.

Despite the widespread impact, Sachs emphasized that this was not a cyberattack. "CrowdStrike is making it very clear that this was not a cyberattack. This was not a ransomware attack. It was an update issue," he clarified. The automatic update, intended to enhance security, inadvertently caused systems to fail, highlighting potential flaws in the update process.

The incident underscores the importance of rigorous testing before deploying updates. "We can't just install an update when it comes out. There must be some level of testing," Sachs advised. He drew parallels to the SolarWinds incident, suggesting that more thorough pre-deployment checks could prevent similar occurrences.

Both CrowdStrike and Microsoft have responded promptly. "CrowdStrike and Microsoft have been very good about getting out quickly with information on what to do," Sachs said. Remediation steps have been provided to affected customers, and efforts are underway to implement fixes.

The timeline for full recovery remains uncertain. "The remediation fixes should at least stop the bleeding and allow some level of operations to resume," Sachs mentioned, though he acknowledged that returning to full operational capacity could take more time. IT teams are expected to work through the weekend to assess and mitigate the impact.

While the incident has caused significant disruption across various sectors, the swift response from CrowdStrike and Microsoft offers hope for a prompt resolution. The event serves as a stark reminder of the critical need for thorough testing and cautious deployment of software updates to maintain cybersecurity integrity.

(Full Transcript of video interview with Dustin Sachs follows)

Tom Spring

Hi, I'm Tom Spring with SC Media, and I'm joined today by Dustin Sachs, chief technologist at CyberRisk Alliance. Thank you for joining us today to talk about the CrowdStrike update incident that is causing massive outages across the globe and is significantly impacting Microsoft customers. Dustin, thank you for joining us once again, and please introduce yourself.

Dustin Sachs

Yeah, absolutely. Thank you, Tom. My name is Dustin Sachs. I am a Doctor of Computer Science as well as the Chief Technologist at CyberRisk Alliance. I have a little over 18 years of experience in the cybersecurity industry as a practitioner, and am now here running content and working with our community.

Tom Spring

The question I think we would all love to know at these early stages is what happened? What do we know has happened in terms of this outage?

Dustin Sachs

So as of Friday, the 19th of July, at about 11 a.m., what we know is that at around one o'clock Eastern US time this morning, companies around the world started noticing that they were encountering what's known as the blue screen of death. They were encountering an error that was causing their computers to stop functioning, and very quickly found out that it was traced to an update, an automatic update that appears to have run at most of these companies for a software known as CrowdStrike, and specifically CrowdStrike Falcon, which is a cybersecurity tool that monitors computers to determine if there's been any sort of compromise of those computers. And it looks like that software update had a flaw in it that caused these computers to start to fail.

Tom Spring

Okay, so we're just now getting a good, clear picture of just how widespread this outage is. Can you give us a sense of who is being impacted? I know that Microsoft customers have been primarily impacted by this update that CrowdStrike is blaming for these outages. Can you talk about who is impacted? How big is the scope of the outages?

Dustin Sachs

Yeah, absolutely. So, you know, as you said, it's primarily Microsoft-based customers. The list seems to be changing, you know, minute to minute. But as of now, we know that major airlines, American Airlines, Delta, United, were all impacted. We know that the National Health Service in the UK was impacted. I've gotten reports that some of the labs, the hospital labs here in the US, have been impacted. We've heard about credit banks being impacted. So the scope of this, I think, is growing minute by minute.

Tom Spring

Can you talk a little bit more about the number of sectors within the economy that are being impacted by this outage?

Dustin Sachs

Yeah, it's going to cross pretty much every sector. The key connector is that they were using this cybersecurity software. So, I would expect we're going to see impact across, you know, every sector of industry.

Tom Spring

What you're mentioning now, in regard to cybersecurity, is super important. CrowdStrike is making it very clear that this was not a cyberattack. This has nothing to do with its systems being compromised. It has to do with an update. However, what can we talk about when it comes to the cybersecurity implications of this incident? You know, certainly this type of widespread outage is going to have implications for an IT team's ability to protect networks.

Dustin Sachs

Yeah. I mean, I think what you said is spot on. You know, they came out very quickly and said this was not a cyber incident, this was not a ransomware attack. This was not what we have traditionally seen. The biggest implication is, you know, it certainly seems clear, or is becoming clear, that this was an automatic update that many companies were running, because it happened, you know, in the middle of the night, which is a good thing. You know, it's a good sign that, yes, companies are using software to keep themselves secure, and they are regularly updating it. There are certainly things to be considered about how quickly updates are being applied and whether or not testing is being done. But the biggest challenge is going to be over the next couple of days and weeks, while CrowdStrike remediates whatever the issue was: organizations are perhaps not going to be able to monitor their own cybersecurity. So the worry now is, is an actual cyber incident going to occur that some company is not going to see because its software is not working?

Tom Spring

You know, as we get into this in terms of lessons learned, what are some of the lessons learned at 11 a.m. ET, day one, of this incident? I'm sure hindsight is going to be 20/20. However, when we consider the lessons to be learned from an issue like this, what are they? Is there anything that we can gather at these early stages in terms of lessons learned and advice?

Dustin Sachs

I mean, I think with this one, and even going back to, you know, SolarWinds a couple of years ago, the biggest lesson, I think, that can easily be taken away already is that we can't just install an update when it comes out, when it's available. There has to be some level of testing done on the implications of that update before it's implemented. And you know, the fact that, again, this happened overnight in the US certainly would tend to indicate that it was an automatic update that was running on systems, one that probably hadn't been as fully tested by the organizations implementing it as it should have been.

Tom Spring

Okay. What do we know in terms of how Microsoft is handling this situation and its communications to its customers that are being impacted, and any remediation that may be available to customers right now?

Dustin Sachs

Yeah, so CrowdStrike and Microsoft have both been very good about getting out very quickly with information on what to do. There is a Microsoft article, I believe, on their TechNet about the exact steps to take if you're using CrowdStrike in an Azure environment. CrowdStrike has put out to their customers a step-by-step guide of what to do, so the remediation steps are out there. I think one of the first silver linings in all of this is that both organizations have been very quick to get the fix out to people. It's a matter of now implementing those fixes as quickly as possible, and then figuring out what the full impact of this was.

Tom Spring

Is there a sense that we're going to be back to normal anytime soon? What is the recovery time on something like this, even with the patches, the workarounds and the updates that are available to customers today?

Dustin Sachs

It certainly seems like, at this early juncture, the remediation fixes that have been put out will at least stop the bleeding and will at least allow some level of operations to resume. CrowdStrike has been very open about the fact that they're not sure exactly how long it might take to get back to 100%. I think, you know, from what I'm seeing, what I'm reading, what I'm hearing from my contacts in the industry, getting back up to at least some level of operation should be a fairly easy thing, and should be something that can be done, you know, within the next couple of hours or days. But overall, how long it will take is still an unknown.

Tom Spring

So if you work in IT, you're working through the weekend, is what you're saying?

Dustin Sachs

For many organizations, yeah. I mean, the fact that this happened on a Friday morning is kind of one of those things that we always joke about. But unfortunately, yeah, assessing the full impact, even if you get back up and running today operationally, the assessment of "how badly were we affected?" is probably going to take the weekend, because there was a six- or seven-hour period for some companies before they realized they had been impacted, if they didn't know at one in the morning.

Tom Spring

I want to ask a dangerous question, and that is, can you break it down technically, in terms of, you know, putting it in simple terms, exactly what happened here, technically at a high level, in terms of what caused this problem?

Dustin Sachs

I want to be very careful, because we don't know a lot. We still don't know a lot about exactly what happened. Based on what we know now and what has been reported, what likely occurred was that this software update was developed by CrowdStrike over the past weeks or months, and one of the steps that always occurs towards the end is a quality assurance check.

They check to make sure the software operates as expected. And it appears, from what we know now, that something in that check got missed, and there was a piece of code that is doing something it shouldn't be doing. It does not appear, by any stretch or from any report that it was a malicious action, but simply a failure to catch something that was innocuous under normal circumstances, and the software was then pushed out, and because that check hadn't been done, it didn't work.

So, think of it like any other product that breaks immediately when you get it, because they didn't check to make sure that it was glued on correctly, or they didn't make sure the screw was as tight as it should have been. So, this seems to be a very non-malicious incident, but certainly something where some failure of process appears to have occurred. But again, we don't know the full extent, and this could change, and likely will change, over the next days and weeks.

Tom Spring

I want to thank you so much for your time, for breaking this down and giving us a quick snapshot of the totality of the story and where we are today, early on in reporting what happened. So thank you so much, Dustin. I really appreciate your time.

Dustin Sachs

Thank you, Tom.