The recent CrowdStrike outage sent shockwaves through the IT community. Millions of devices worldwide were crippled by a faulty update, underscoring a critical truth: blindly auto-updating Windows applications is a risky gamble.
While the convenience of automatic updates is undeniable, the potential consequences are far-reaching. From system instability to data breaches, the pitfalls of this approach are numerous. In a statement, Microsoft estimated the scale of the damage: "We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services."
In this post, we delve into the specific risks associated with auto-updates and provide practical steps to protect your organization from similar catastrophes.
What caused the CrowdStrike Falcon Blue Screen of Death?
CrowdStrike Falcon is a piece of software designed to protect devices from viruses, ransomware, and other suspicious activity. It is very common for large businesses to have this type of software installed, especially given its strong standing in analyst research. Like many cybersecurity products, CrowdStrike installs a boot-start driver so that protection loads as early as possible and the device only initializes in a secure state. And this is where the problem started. Let's look at it in more detail.
CrowdStrike Falcon is a device driver operating at kernel level
CrowdStrike works at a deep level within the operating system, the so-called Ring 0 or kernel level, where the operating system's core functionality runs. The kernel is where drivers talk to hardware and other devices, manage memory, schedule threads, and much more. In contrast to user mode, which is typically responsible for running applications, kernel mode has very deep access and visibility across the entire system. If kernel code detects a critical failure, it blue-screens the system to prevent other, more destructive outcomes.
To understand this issue better, we need to look at how CrowdStrike Falcon works. The software requires kernel-level access because it analyzes a wide range of application behaviors, rather than just examining application file definitions, so it can proactively detect malicious behavior before it is even categorized and listed in a formal definition. To achieve this, CrowdStrike Falcon is written as a device driver and lives down at the kernel level, giving it unfettered access to the system's data structures and services.
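If you are curious, you can see which boot-start, kernel-mode drivers your own Windows device loads before almost anything else. The PowerShell below is a minimal, read-only sketch using the standard Win32_SystemDriver CIM class; on a machine running Falcon, the CrowdStrike sensor driver appears in this list.

```powershell
# Minimal, read-only sketch: enumerate the boot-start drivers on this device,
# i.e. the kernel-mode components that load at the earliest stage of startup.
Get-CimInstance -ClassName Win32_SystemDriver |
    Where-Object { $_.StartMode -eq 'Boot' } |
    Sort-Object Name |
    Select-Object Name, DisplayName, State, PathName |
    Format-Table -AutoSize
```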
As a footnote, Microsoft is now reviewing whether third-party vendors should be able to access the OS kernel at all, having previously been rebuked by cybersecurity vendors and EU regulators when it tried to restrict that access. Apple, in contrast, successfully locked down macOS in 2020 to prevent exactly this kind of failure.
CrowdStrike also utilizes dynamic definition files
To protect users from installing unsafe drivers, Microsoft issues a WHQL (Windows Hardware Quality Labs) certification to drivers that have passed its WHQL testing. As long as nothing in the driver changes, the certification remains valid. But in the case of CrowdStrike, where every minute counts in getting a new update out to millions of devices to stop a new zero-day attack from spreading, obtaining a WHQL certificate for every release isn't feasible. So, instead of certifying each update, and to increase speed of deployment, CrowdStrike relies on dynamic definition files (it terms these "rapid response content") that the driver reads but that aren't included in the driver itself.
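To see what driver certification looks like in practice, you can inspect the Authenticode signature on any driver file with PowerShell. The snippet below is purely illustrative (disk.sys is an arbitrary, Microsoft-signed in-box driver chosen as an example); the point is that the signature covers the driver binary itself, not the dynamic content files the driver later reads.

```powershell
# Illustration only: show the signing status and signer of a driver binary.
# disk.sys is just an example of a Microsoft-signed, in-box driver.
Get-AuthenticodeSignature -FilePath 'C:\Windows\System32\drivers\disk.sys' |
    Select-Object Status, @{ Name = 'Signer'; Expression = { $_.SignerCertificate.Subject } }
```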
This approach allows for some scary scenarios. For example, the rapid response content could contain actual program code that the driver then executes in kernel mode. Since the driver itself has never changed, it remains certified, but the update changes the way the driver operates. It appears that in this case, a rapid response content file was downloaded (in the form of a .sys file), but a mistake at CrowdStrike's end meant that the file contained only zeros and no definition code.
Because the driver itself was not touched, there were few sanity checks or parameter validations within the operating system to contain it, nor was there any phased deployment, so the file's contents were pushed to over 8.5 million systems. When the boot-start driver tried to read the empty definition file, it crashed, hence the Blue Screen of Death (BSOD) at startup.
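To make the "sanity checks and parameter validation" point concrete, here is the kind of basic content check that would flag an all-zero definition file before it is ever consumed. This is purely illustrative: the file path is a made-up placeholder, and it is not how the Falcon sensor actually validates its channel files.

```powershell
# Hypothetical sanity check of a definition/content file before it is used.
# The path below is a placeholder for illustration, not a real CrowdStrike file.
$definitionFile = 'C:\ExampleVendor\Definitions\update-291.bin'

$bytes = [System.IO.File]::ReadAllBytes($definitionFile)
if ($bytes.Length -eq 0 -or -not ($bytes | Where-Object { $_ -ne 0 })) {
    Write-Warning "Definition file '$definitionFile' is empty or all zeros - refusing to load it."
} else {
    Write-Output "Definition file passes the basic content check ($($bytes.Length) bytes)."
}
```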
The fix is easy, but recovery will take a long time (and a lot of resources)
Although CrowdStrike deployed a corrected update very quickly, the problem is that you cannot deploy an update to a machine that won't turn on; it requires physical access to the machine. To recover a device, you had to boot into safe mode, open a command prompt, and either run a PowerShell script provided by your IT department or manually delete the offending .sys file.
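For reference, the manual fix that many IT departments scripted boiled down to a one-liner run from Safe Mode or the Windows Recovery Environment, removing the channel files matching the pattern CrowdStrike published in its remediation guidance. A sketch, not official tooling:

```powershell
# Sketch of the widely circulated manual remediation, run from Safe Mode or WinRE.
# Deletes the faulty channel file(s) matching CrowdStrike's published pattern.
Get-ChildItem -Path 'C:\Windows\System32\drivers\CrowdStrike' -Filter 'C-00000291*.sys' -ErrorAction SilentlyContinue |
    Remove-Item -Force -Verbose
```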
If you had BitLocker, the standard disk encryption for Windows PCs in many businesses, it took a lot more effort. Because you had to reboot into safe mode, you could not connect to Wi-Fi and enter your user credentials to unlock your hard drive. To unlock a BitLocker-protected computer, you had to find another computer, sign in with your account, and retrieve a recovery key that you then manually entered into your BSOD-locked device. A very unwieldy process for your average end user.
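One practical lesson for IT teams: make sure BitLocker recovery keys are escrowed centrally before you need them. The sketch below, run from an elevated PowerShell prompt on a healthy machine, shows the recovery password for C: and backs it up to Active Directory; BackupToAAD-BitLockerKeyProtector is the equivalent for Entra ID (Azure AD) joined devices. Treat it as a minimal example, assuming the built-in BitLocker module and a single recovery password protector.

```powershell
# Minimal sketch: surface the BitLocker recovery password for C: and escrow it
# centrally, so the helpdesk can hand it to users stuck at a recovery screen.
$protector = (Get-BitLockerVolume -MountPoint 'C:').KeyProtector |
    Where-Object { $_.KeyProtectorType -eq 'RecoveryPassword' } |
    Select-Object -First 1

"Recovery key for $env:COMPUTERNAME : $($protector.RecoveryPassword)"

# Back the protector up to Active Directory (use BackupToAAD-BitLockerKeyProtector
# for Entra ID / Azure AD joined devices instead).
Backup-BitLockerKeyProtector -MountPoint 'C:' -KeyProtectorId $protector.KeyProtectorId
```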
CrowdStrike outage holds a critical warning about auto-updates
CrowdStrike, in an effort to deploy updates as fast as possible, pushed out over-the-air (OTA) updates of its software with inadequate sanity checking and parameter validation at its end of the release management process. A heavy reliance on highly automated release management procedures to ensure that updates were not impactful failed in this case, and a complete rollout began across every device running the agent. It brings to the fore the question of how much risk we are carrying in our software application updates.
Are organizations auto-updating their applications, crossing their fingers, and hoping for the best?
Historically, the first thing IT teams would do when deploying an application to their devices was turn off automatic updates. The reasoning was to maintain control over the change and ensure appropriate testing of updates before rolling them out. This came at a cost, both in speed of deployment (think zero-day vulnerabilities) and in the resources needed to run the process. The robustness of the Windows 10 operating system and the maturing of the application environment, which has all but removed DLL conflict hell, have given digital workplace leaders the confidence to think differently. Consequently, some companies have started to take a more hands-off approach to their application updates.
This is understandable, considering the volume of change these teams have to manage, especially for security patches, has increased drastically. According to Security Magazine, the average enterprise device has 67 different applications installed, and ten percent of the enterprise devices examined in that study had more than 100 unique applications installed. Add to this the number of application versions and the potential number of different builds, and you have a very complex environment.
On an enterprise scale with tens or even hundreds of thousands of employees, this results in thousands of applications and application versions an organization needs to manage. Most of them have multiple updates a year, making it impossible to stay on top of this manually. Even with application owner programs in place, the resources needed would be enormous.
This raises the question of operational efficiency. Manually testing and deploying every update is unmanageable, so enabling automatic updates for stable applications seems reasonable: the time savings outweigh the occasional problematic update. Most of the time, everything is fine... until it suddenly isn't.
Auto-updating your apps introduces some risk
However, while a blanket "auto-update everything" approach is tempting for an already stretched-thin IT team, it introduces a multitude of risks, especially when dealing with wide-ranging applications, applications that hit core OS features, and custom business applications that aren't on every device.
- Wide-ranging applications. Applications that interact with multiple systems or other software may cause unforeseen conflicts or face compatibility issues. An update in one app could break functionality in another, leading to disruptions in workflows and processes. Automating updates eliminates the crucial step of internal testing, potentially introducing bugs or security vulnerabilities into the system that could have been caught beforehand.
- Applications that hit core OS features. Updates to system-critical applications can destabilize the entire operating system (as this incident has demonstrated). Bugs in these updates could lead to crashes, blue screens of death, or even data loss. Recovering from a problematic core-feature update can be complex and time-consuming, and automating these updates removes the ability to easily roll back in case of issues.
- Business applications not on every device. Businesses often rely on custom applications or specific versions of software to maintain key operations. Auto-updating these applications on all devices, even those not utilizing them, can disrupt workflows for users who need the older version. In addition, business applications often integrate with other systems within the organization, so auto-updating one application could break existing integrations, causing delays or errors in essential business processes.
At Juriba, we recommend against a blanket "auto-update everything" approach, as it introduces unnecessary risk. Instead, we have always highlighted the importance of an agile and automated process that allows for quick testing of updates. While some applications, like the Microsoft Office suite or Google Chrome, can be auto-updated, we recommend carefully considering where to draw the line based on your risk profile when it comes to automatically updating applications.
Eliminate risks while increasing IT agility with automated application testing
If you find yourself struggling to keep your applications up-to-date and having to resort to auto-updates for operational efficiency, I hear you. I have been in your shoes, and I want to share a few thoughts with you.
Consider implementing automated application testing. To get started, begin with a thorough review and categorization of your application estate. By identifying your application coverage (which applications are critical, which have extensive dependencies, and which can tolerate auto-updates), you can better manage potential issues and prioritize resources effectively.
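As a starting point, you can pull a raw inventory straight from each device and tag it against a simple policy map. The sketch below reads installed applications from the standard registry uninstall keys; the $riskMap entries are hypothetical examples, and in practice the categorization would come out of your estate review (or a tool that does this for you).

```powershell
# Illustrative sketch only: inventory installed applications from the registry
# and tag each one with an update policy. The $riskMap entries are hypothetical.
$riskMap = @{
    'CrowdStrike'   = 'ControlledRollout'   # kernel-level / security agent
    'Google Chrome' = 'AutoUpdateOk'        # broad, well-tested consumer updates
    'Contoso LOB'   = 'ManualTestFirst'     # hypothetical custom business app
}

$uninstallKeys = @(
    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\*',
    'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Uninstall\*'
)

Get-ItemProperty -Path $uninstallKeys -ErrorAction SilentlyContinue |
    Where-Object { $_.DisplayName } |
    Select-Object DisplayName, DisplayVersion, Publisher,
        @{ Name = 'UpdatePolicy'; Expression = {
            $name  = $_.DisplayName
            $match = $riskMap.Keys | Where-Object { $name -like "*$_*" } | Select-Object -First 1
            if ($match) { $riskMap[$match] } else { 'Unclassified' }
        } } |
    Sort-Object UpdatePolicy, DisplayName |
    Format-Table -AutoSize
```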
Next, I recommend that you explore ways to use automation for agile and fast testing of updates. This involves setting up automated testing frameworks that can quickly validate updates in controlled environments before they are rolled out company-wide. Agile automation not only speeds up the testing process, but also ensures that updates are robust and compatible with existing systems, minimizing the risk of unexpected disruptions.
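What that automated validation can look like in its simplest form is a smoke test that a pipeline runs against a pilot machine or VM after the update lands, before it is approved for wider deployment. The sketch below uses Pester, the de facto PowerShell test framework; the application name, install path, and version are made up for illustration.

```powershell
# Hypothetical post-update smoke test (Pester). App name, path, and version are examples.
Describe 'Contoso LOB application - post-update smoke test' {
    BeforeAll {
        $exePath = 'C:\Program Files\Contoso\ContosoApp.exe'   # hypothetical install path
    }

    It 'is installed at the expected location' {
        Test-Path $exePath | Should -BeTrue
    }

    It 'reports at least the expected version' {
        [version](Get-Item $exePath).VersionInfo.ProductVersion |
            Should -BeGreaterOrEqual ([version]'4.2.0')
    }

    It 'starts and stays up for a few seconds' {
        $proc = Start-Process -FilePath $exePath -PassThru
        Start-Sleep -Seconds 5
        $proc.HasExited | Should -BeFalse
        Stop-Process -Id $proc.Id -Force
    }
}
```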
Reviewing and amending rollout plans is another crucial step. Rather than pushing updates to all devices simultaneously, organizations should adopt a phased, wave- or ring-based approach (as should vendors such as CrowdStrike, by the way!). Starting with a pilot group allows IT teams to monitor the effects of the update, address any issues that arise, and ensure that broader deployments proceed smoothly. This staged rollout reduces the impact of potential problems and ensures a more stable update process, while having only a minor impact on rollout speed.
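Deployment rings do not need to be complicated. Even a simple definition like the sketch below, where the names, percentages, and soak times are hypothetical and would map onto your own device collections in Intune, Configuration Manager, or similar, enforces the "deploy, soak, check telemetry, then widen" discipline that was missing in this incident.

```powershell
# Hypothetical ring definitions - sizes and soak times are illustrative only.
$deploymentRings = @(
    @{ Name = 'Ring 0 - IT pilot';         PercentOfEstate = 1;  SoakTimeHours = 24 }
    @{ Name = 'Ring 1 - early adopters';   PercentOfEstate = 10; SoakTimeHours = 48 }
    @{ Name = 'Ring 2 - broad deployment'; PercentOfEstate = 60; SoakTimeHours = 72 }
    @{ Name = 'Ring 3 - critical devices'; PercentOfEstate = 29; SoakTimeHours = 0  }
)

foreach ($ring in $deploymentRings) {
    '{0}: {1}% of devices, soak for {2}h and review health telemetry before widening' -f $ring.Name, $ring.PercentOfEstate, $ring.SoakTimeHours
}
```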
Finally, it is essential to remediate any applications with a high-risk profile that are set to auto-update. These apps, due to their critical nature or complex integrations, should be given special attention. Ensuring that high-risk applications are manually reviewed and tested before updates can prevent significant operational disruptions and enhance overall system stability. By adopting these strategies, enterprises can maintain up-to-date applications without compromising on security and reliability.
With the right tools, you can now build this type of agile environment where change is managed fast but with the appropriate levels of validation.
Conclusion
Overall, the CrowdStrike incident highlighted the importance of having robust application readiness processes to catch issues before they hit production. And the trend towards more automated updates underscores the need for that kind of proactive, agile approach to managing the application estate.
Microsoft has also recently published a blog post on best practices for resiliency and application management on the Windows OS: https://techcommunity.microsoft.com/t5/windows-it-pro-blog/windows-resiliency-best-practices-and-the-path-forward/ba-p/4201550
I am here to help. If you have any questions about this article or the approach I am suggesting, feel free to email me. If you want to see what this looks like in practice, feel free to set up a demo, and one of our team members will walk you through everything live.