
How could Microsoft let the CrowdStrike meltdown happen?


CrowdStrike bears the ultimate responsibility for the global IT disaster of July 19, 2024, in which 8.5 million Windows machines worldwide failed to boot.

CrowdStrike’s apparent negligence is shocking: the cybersecurity provider pushed a kernel-accessing content update through flawed QA-testing software. The company has faced a mountain of criticism and derision, all of it deserved.

But there's another, potentially even more disturbing angle.

How could Microsoft let this happen? How could it allow an improperly tested kernel-level update?

In the 1990s, I QA-tested Windows drivers for graphics, video, and 3D hardware. I loved it. It was hard work, and most of it was spent testing and re-testing and explaining to engineers exactly where the problems were. Then came a new round of testing. Rinse and repeat.

After months of effort, after the driver bug list had been whittled down to insignificance, there was a final round — Microsoft’s round.

Microsoft didn’t take my word that the driver wouldn’t crash Windows. We had to run the software through Microsoft’s QA process and then submit the software to Microsoft for final approval.

Not all software needed that last testing round — just software that accessed the lower levels of Windows.

Microsoft’s developer tools give engineers different layers of access to the operating system, from UI features such as dialog boxes down to lower-level parts such as the system kernel. If developers want to write desktop applications — word processors, photo editors, weather apps — that access only higher-level functions, they can do so without Microsoft’s approval.
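To make that layering concrete, here’s a minimal sketch of my own (not from the article, and deliberately simplified): a user-mode C program that touches Windows only through a high-level Win32 API, with comments contrasting it against the kernel-mode driver path that historically had to pass Microsoft’s WHQL testing and signing.

```c
#include <windows.h>

int main(void)
{
    /* MessageBoxW is a high-level, user-mode Win32 call. Windows
       validates the arguments; if this program misuses the API, the
       call fails or the process crashes, but the kernel is untouched.
       No sign-off from Microsoft is needed to ship this. */
    MessageBoxW(NULL, L"Hello from user mode", L"Demo", MB_OK);

    /* A kernel-mode driver, by contrast, implements DriverEntry in a
       .sys binary built with the Windows Driver Kit and runs inside
       the kernel itself. One bad pointer dereference there can
       blue-screen the whole machine, which is why driver code went
       through Microsoft's WHQL testing and must be signed. */
    return 0;
}
```

Built with MSVC (cl demo.c user32.lib), this runs with no approval process at all; a .sys driver takes the much longer certification road described above.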

Because of this tiered system, it’s long been safe to develop Windows desktop applications. If the code has bugs, the program crashes, but nothing lethal happens to the computer. At worst, the user just restarts the PC.

This open model has been critical to Microsoft’s success. Letting anyone develop and ship desktop applications while keeping the operating system safe helped make Microsoft the biggest, most successful software company in the world.

Underlying this success was a seriousness about internal development, maintenance, and quality that ran deep in Microsoft’s culture.

Some may argue that Microsoft can no longer test every lower-level piece of software now that software is distributed online rather than on floppy disks. Security has also become paramount. Maybe it’s no longer feasible to wait for Microsoft QA approval before deploying a security-software update that prevents a vulnerability exploit.

I disagree. When it comes to Windows kernel updates, responsible software companies with large deployments should wait.

Firms that deploy security software at scale should also hire qualified security professionals who won’t panic when confronted with an exploitable vulnerability. Those professionals can prepare for possible exploitation and defend the network until a patch is ready.

Isn’t waiting a few days better than praying some third-party vendor doesn’t brick all the machines? Doesn’t it make sense that if a company runs a 10,000-node network, the company, rather than a third-party software vendor, is ultimately responsible for keeping it running properly?

But if there’s a ton of vendors who want to write kernel-level code, Microsoft can’t possibly test them all, right? Well, Microsoft used to, and it still can.

First, there are levels of developer partnership that large software firms can buy to get priority access to Microsoft QA rounds. Second, Microsoft can create (and may already have) an array of automated QA test suites that can quickly confirm a build is fit for preliminary deployment.

Let’s walk through a hypothetical scenario:

Acme Cybersecurity finds that its latest release has a bug. Its DevSecOps team is on it, and a fix will come by EOD tomorrow. Acme informs its client Blue Sky Airlines, which employs security professionals who can watch for the potential threat while they await the patch.

Acme explains that even once its fix is finished, it still needs to wait for confirmation from Microsoft. Acme has paid a hefty annual fee to Microsoft to be a “close partner,” so its software changes get priority and QA testing will be done within three business days.

Simultaneously, Acme submits its latest code to Microsoft’s automated QA tests. That takes one day, and the cleared code gets stamped as a preliminary release. Blue Sky’s IT department can decide for itself whether to deploy the preliminary version or to wait for the final build.

With this process, Blue Sky Airlines waits an extra one to three days, but responsibility for the software update is appropriately distributed. If attackers probe the known vulnerability in the meantime, the security team deals with it. At worst, the team has a rough few days and has to power-cycle a few systems.

Some large cybersecurity clients like Blue Sky Airlines may want to have their cake and eat it too. They may want the convenience of Microsoft’s open access development model without taking full responsibility for their own systems.

But if a large firm decides to rely upon a huge installed base of technology, it needs to employ qualified IT and security professionals.

Executives must hire people who, if told their systems are temporarily vulnerable, can protect the network until there’s a patch ready. That’s far better than outsourcing all responsibility to a cybersecurity vendor, crossing their fingers, and hoping nothing goes wrong.

Let’s return to what really disturbs me. In the last few decades, we’ve seen too many companies favor feature cramming, quick releases, and revenue maximization over quality, testing, and maintenance.

Historically, that hasn’t described Microsoft. But has it also fallen under this spell, this Boeing “efficiency” creed of cutting key services, infrastructure, and QA? Has Microsoft lost its culture and soul? Is it still a well-maintained outfit the world can rely on?

Or, has Microsoft decided that instead of maintaining the testing infrastructure from back in the ‘90s, it will now just trust vendors with access to low-level systems and hope they don’t screw up?

To me, this is a much scarier notion than having a cybersecurity company push out a buggy update. There are dozens of other cybersecurity firms, but there’s only one Microsoft.

A 2009 agreement with EU regulators did force Microsoft to give third-party security vendors low-level access to Windows, but this doesn’t absolve the company of responsibility for its products.

I’ve long assumed that Microsoft was a reliable backbone firm. I shudder to think that we can no longer depend on it.

Mike Mathog, freelance data scientist

Mike Mathog spent seven years QAing a vast array of hardware and Microsoft software for both media companies and technology firms. He’s now a freelance data scientist living in San Francisco.
