Understanding the root cause of IT outages

Learn from the recent global IT outage caused by a faulty update to CrowdStrike's Falcon software. Explore the root cause, impacts, and solutions for preventing future disruptions.

Shftrs

Aug 2, 2024

The recent global IT outage caused by a faulty update to CrowdStrike's Falcon software underscores a critical lesson: not all IT disruptions stem from cyberattacks. To prevent future incidents, we must understand the true causes of these outages. This article explores the root cause of the outage, the immediate impacts, and the steps taken to address and prevent such issues in the future.

Incident overview

On July 19, 2024, at 04:09 UTC, CrowdStrike deployed a configuration update for the Falcon sensor used on Windows systems. The update aimed to enhance telemetry capabilities by tracking new and evolving threat techniques. Unfortunately, this update led to widespread system crashes, affecting Windows hosts running Falcon sensor version 7.11 and above. As a result, Windows computers around the world displayed the Blue Screen of Death, as reported by Forbes.

"The outage was caused by a defect found in a Falcon content update for Windows hosts. Mac and Linux hosts were not impacted. This was not a cyberattack," said George Kurtz, CrowdStrike Founder and CEO, in an official statement shortly after the incident.

At least 8.5 million Windows devices were affected, marking this as one of the most significant IT disruptions in recent history, according to BBC.

What went wrong?

The outage was triggered by a faulty Rapid Response Content update from CrowdStrike, intended to address new cybersecurity threats quickly. This update introduced an error in a content file that caused Windows systems to crash with a Blue Screen of Death (BSOD). While CrowdStrike’s standard updates undergo rigorous testing, this Rapid Response Content update slipped through due to a flaw in the validation process, which failed to catch the error before deployment.

CrowdStrike's QA Process

CrowdStrike typically follows a rigorous QA process for updates, including automated and manual testing, validation, and controlled rollouts. This ensures that standard Sensor Content is thoroughly vetted before deployment.

However, according to CrowdStrike, the update that caused the recent outage was part of their Rapid Response Content, which is designed to address emerging threats quickly and goes through a faster, less extensive testing process. This difference allowed the error to slip through, leading to the system crashes.

“It’s evident that for such mission-critical software running on millions of computers, every change—no matter how small it may seem—should be subject to a full QA procedure, including staged rollouts”, says Talal Haj Bakry, a security researcher at Mysk.

Common pitfalls in update management and software deployment

This incident highlights several common pitfalls in update management and software deployment that can lead to significant disruptions:

Inadequate testing: Rushing updates without thorough testing can result in undetected errors that disrupt systems.
Poor deployment timing: Coordinating and communicating the timing of updates is crucial to avoid disruptions.
Weak error detection and rollback mechanisms: Without strong error detection and rollback options, issues from updates can quickly escalate.
Lack of communication: Transparent and timely communication with customers about issues and resolutions is essential to maintain trust and minimize disruption.

SHFTRS integration: A proactive approach to error prevention

One effective approach to mitigating these pitfalls is leveraging advanced test automation platforms like SHFTRS. Here’s how SHFTRS can contribute:

Automated testing: SHFTRS automates comprehensive testing, reducing the risk of errors in production.
Continuous integration and deployment: Supports frequent, reliable updates by catching issues early.
Error detection: Enhances the ability to detect and address errors before they impact users.
Real-time monitoring: Provides immediate feedback, enabling swift responses to emerging issues.

Immediate response and resolution

Upon discovering the problem, CrowdStrike acted swiftly to revert the problematic update. The defective content was replaced with a stable version. The company communicated the incident to affected customers and partners, apologizing for the disruption and assuring them of the continuity of their services.

“Channel File 291 was identified and fixed 78 minutes after it was released, at 1:27 a.m. EDT on July 19. A logic error in our Content Validator (software that performs control checks on content before deployment) has also been fixed and is currently in testing, with a release date scheduled for early August 2024”, CrowdStrike stated in their Post Incident Review.

Although CrowdStrike acted quickly to resolve the issue, Tech Informed reported that many companies still had to go through every affected device and manually reboot it in “Safe Mode”. Time Magazine also reported that it may take some time before systems across the globe are fully up and running as normal again. Experts agree it’s too early to determine the full financial impact of Friday’s global outage, but Patrick Anderson, CEO of Michigan’s Anderson Economic Group, estimates that costs could easily exceed $1 billion (CNN).

Technical insights

The technical root of the issue lay in a specific channel file within the Falcon sensor’s directory, located at %WINDIR%\System32\drivers\CrowdStrike. The problematic file had a timestamp indicating it was deployed at 04:09 UTC. The corrected version, with a timestamp of 05:27 UTC, was restored and served as the active content. Hosts that had not been updated during the problematic window were not affected and required no further action.
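The timestamp check described above can be sketched in a few lines of Python. This is an illustrative sketch, not CrowdStrike's tooling: the two cutoff times come from this article, while the file name pattern `C-00000291*.sys` and the use of the file's modification time as "the timestamp" are assumptions that should be verified against CrowdStrike's own remediation guidance.

```python
from datetime import datetime, timezone
from pathlib import Path

# Timestamps quoted in this article (July 19, 2024, UTC).
FAULTY_DEPLOYED = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIX_DEPLOYED = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# Assumption: Channel File 291 matches this glob pattern.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def classify_timestamp(modified: datetime) -> str:
    """Classify a channel file by a timezone-aware modification time."""
    if modified >= FIX_DEPLOYED:
        return "corrected"
    if modified >= FAULTY_DEPLOYED:
        return "problematic"
    return "predates-update"


def scan_directory(directory: Path) -> dict[str, str]:
    """Map each matching channel file name to its classification."""
    results = {}
    for path in directory.glob(CHANNEL_FILE_PATTERN):
        # Assumption: the file's mtime reflects when the content was delivered.
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        results[path.name] = classify_timestamp(mtime)
    return results


# Length of the window in which the faulty content was being served.
exposure_minutes = (FIX_DEPLOYED - FAULTY_DEPLOYED).total_seconds() / 60
print(exposure_minutes)  # 78.0
```

On a Windows host, `scan_directory` would be pointed at the CrowdStrike directory under `%WINDIR%\System32\drivers`. The 78-minute result matches the interval CrowdStrike reported between release and fix.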

Preventative measures and future steps

To prevent future issues, CrowdStrike is enhancing testing and validation for Rapid Response Content updates, which require robust checks despite their speed. They are also improving monitoring, rollback capabilities, and customer control over update deployment to ensure better management and quicker issue resolution.
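To make the combination of staged rollout, monitoring, and rollback concrete, here is a minimal Python sketch. It does not describe CrowdStrike's actual deployment system. The function `staged_rollout`, its callbacks, the host names, and the 1% failure threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool
    stages_deployed: int
    rolled_back: bool


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.01,  # hypothetical threshold
) -> RolloutResult:
    """Deploy stage by stage; halt and roll back if a stage is unhealthy."""
    deployed: list[str] = []
    for index, stage in enumerate(stages):
        for host in stage:
            deploy(host)
            deployed.append(host)
        # Check health before the update is allowed to reach the next stage.
        failures = sum(1 for host in stage if not is_healthy(host))
        if stage and failures / len(stage) > max_failure_rate:
            for host in deployed:
                rollback(host)
            return RolloutResult(False, index + 1, True)
    return RolloutResult(True, len(stages), False)


# Usage: one canary host crashes, so production hosts are never touched.
broken = {"canary-2"}
state: dict[str, str] = {}
result = staged_rollout(
    stages=[["canary-1", "canary-2"], ["prod-1", "prod-2", "prod-3"]],
    deploy=lambda host: state.__setitem__(host, "new"),
    is_healthy=lambda host: host not in broken,
    rollback=lambda host: state.__setitem__(host, "old"),
)
print(result.rolled_back, "prod-1" in state)  # True False
```

The design choice worth noting is that health is evaluated per stage, so a defect is contained to the smallest group that exposes it instead of reaching every host at once.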

Conclusion

Understanding IT outage root causes is key to improving system resilience and preventing future disruptions. The CrowdStrike incident highlights the challenges of managing dynamic updates in cybersecurity. By analyzing the failure and taking corrective actions—such as enhanced testing, better monitoring, and more granular update controls—CrowdStrike aims to prevent similar issues. Additionally, using tools like SHFTRS for more rigorous testing and deployment can further enhance system reliability. These steps provide valuable lessons for other organizations to strengthen their update management and maintain service integrity.

Sources:

Corvin, A. (2024, July 25). Five lessons from the CrowdStrike Windows IT outage. Tech Informed. Retrieved on August 1, 2024, from: https://techinformed.com/five-lessons-from-the-crowdstrike-windows-it-outage/

CrowdStrike. (2024, July 24). Preliminary Post Incident Review. Remediation and Guidance Hub: Falcon Content Update for Windows Hosts. https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/

Isidore, C. (2024, July 22). Costs from the global outage could top $1 billion – but who pays the bill is harder to understand. CNN. Retrieved on August 1, 2024, from: https://edition.cnn.com/2024/07/21/business/crowdstrike-outage-cost/index.html

O'Flaherty, K. (2024, July 24). CrowdStrike Reveals New Details About What Caused Windows Outage. Forbes. Retrieved on July 29, 2024, from: https://www.forbes.com/sites/kateoflahertyuk/2024/07/24/crowdstrike-reveals-new-details-about-what-caused-windows-outage/

Kurtz, G. (2024, July 19). To Our Customers and Partners. CrowdStrike. Retrieved on July 29, 2024, from: https://www.crowdstrike.com/blog/to-our-customers-and-partners/

Schneid, R. (2024, July 19). CrowdStrike’s Role In the Microsoft IT Outage, Explained. Time Magazine. Retrieved on August 1, 2024, from: https://time.com/7000476/microsoft-it-outage-crowdstrike-role-what-happened-explanation/

Tidy, J. (2024, July 20). CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says. BBC. Retrieved on July 31, 2024, from: https://bbc.com/news/articles/cpe3zgznwjno
