The 11 key takeaways from the recent IT outage: Results & risks

Key takeaways from the recent IT outage caused by a faulty CrowdStrike update highlight the importance of rigorous testing, QA processes, and transparent communication in building stronger IT systems and cybersecurity resilience.


The recent global IT outage caused by a faulty CrowdStrike update has delivered a wake-up call to IT professionals and organizations. Disrupting millions of Windows devices worldwide, this incident highlights the critical challenges of managing software updates and ensuring cybersecurity. Learning from this disruption, we have gathered 11 key takeaways that offer valuable insights for building stronger, more resilient IT systems.

1. Rigorous testing is crucial

The CrowdStrike outage highlights the essential role of rigorous testing for all software updates. It is vital to conduct thorough tests to identify potential issues before deployment. Rapid response updates, which are designed to address urgent threats quickly, must still undergo comprehensive testing to prevent introducing new problems. The failure to do so in this instance led to significant disruptions, demonstrating that even expedited updates cannot afford to skip detailed testing phases.

The Washington Times recently reported that CrowdStrike is now being sued by angry investors over allegedly false statements made before the software update that led to the outage. They claim that “this inadequate software testing created a substantial risk that an update to Falcon could cause major outages”.

2. The importance of a full QA process

A full Quality Assurance (QA) process is fundamental to ensuring the reliability of software updates. This process should cover not only standard updates but also rapid response updates. The outage revealed that even updates intended for quick deployment need to go through a thorough QA procedure that validates their stability and compatibility. With a comprehensive QA process in place, organizations can catch and fix potential issues before they reach end users, avoiding widespread disruptions. According to Forbes, CrowdStrike has now pledged to improve its QA and testing process as a result of this incident.

3. Effective error detection and rollback mechanisms


Robust error detection and rollback mechanisms are vital for mitigating the impact of faulty updates. In this case, CrowdStrike's response focused on quickly issuing a fix rather than on rolling back to an earlier version of the software. Enhanced error detection and rollback capabilities could have minimized the initial disruption. Ensuring that systems can quickly revert to a stable state is essential for managing software updates effectively.
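To make the idea of "detect, then revert" concrete, here is a minimal Python sketch. It is not how CrowdStrike or any specific product handles updates; the function name `apply_update_with_rollback` and the `install` and `health_check` callables are hypothetical stand-ins for whatever an organization's own deployment tooling provides.

```python
from typing import Callable


def apply_update_with_rollback(
    current_version: str,
    new_version: str,
    install: Callable[[str], None],   # hypothetical: installs the given version
    health_check: Callable[[], bool],  # hypothetical: True if the system is healthy
) -> str:
    """Install new_version and return the version left running.

    If installation raises or the health check fails, the previous
    version is reinstalled (the rollback step).
    """
    try:
        install(new_version)
        healthy = health_check()
    except Exception:
        # An error during install or checking counts as an unhealthy update.
        healthy = False

    if healthy:
        return new_version

    install(current_version)  # revert to the last known stable version
    return current_version


# Example with stand-in callables: the health check fails, so we roll back.
installed_log: list[str] = []
running = apply_update_with_rollback(
    "1.0", "1.1", installed_log.append, lambda: False
)
print(running)        # 1.0
print(installed_log)  # ['1.1', '1.0']
```

The key design choice in this sketch is that the previous version is kept available until the new one has passed its health check, so reverting never depends on the faulty update working.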

4. Transparent communication is key

Clear and timely communication with customers is critical during an IT outage. After the incident, CrowdStrike was quick to provide support and an official apology, but not every customer felt that this was enough. Before the incident, CrowdStrike had a strong reputation: it was valued at over $83 billion and was known for rapid threat detection, having aided major investigations such as the Sony hack and the Democratic National Committee breaches. It serves around 29,000 customers, including over 500 from the Fortune 1000 (The Verge).

However, according to AP News, Ed Bastian, the CEO of the American airline Delta, recently stated that the company is facing $500 million in costs due to the outage. AP News reported that Delta was one of the companies hit hardest by the incident, and although CrowdStrike has offered free consulting advice, it has yet to offer Delta monetary compensation. This could harm CrowdStrike’s reputation even further, and it underscores the importance of managing customer expectations and maintaining trust. Effective communication can reduce confusion and frustration, providing users with the information they need to navigate the disruption.

“A lack of faith in cyber products is something that’s likely to impact the entire cyber community for months to come. CTOs and CIOs who are already trying to convince boards to invest more in security tooling now have a greater task”, Tech Informed commented after the outage.

5. Automated testing can reduce risks

Automated testing platforms play a crucial role in identifying errors before updates are deployed. Tools like SHFTRS, which offer comprehensive testing and validation, can help catch issues early and reduce the risk of deploying faulty updates. Leveraging automated testing can enhance overall update management and system reliability.

6. Update timing and coordination

The timing and coordination of software updates can significantly impact their success. Deploying updates during periods of low activity or using staged rollouts can help mitigate risks and minimize disruptions. Careful planning and coordination are essential to avoid unintended consequences and ensure a smooth update process. 

“What Crowdstrike was doing was rolling out its updates to everyone at once. That is not the best idea. Send it to one group and test it. There are levels of quality control it should go through,” said Eric O’Neill, a former FBI counterterrorism and counterintelligence operative and cybersecurity expert.
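The "send it to one group and test it" approach can be sketched as a staged rollout. The Python below is an illustration under stated assumptions, not a description of any vendor's pipeline: the function `staged_rollout`, the `deploy` callable, the stage percentages (1%, 10%, 100%), and the 1% failure threshold are all hypothetical choices.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_percents: Sequence[int],   # cumulative share of the fleet per stage
    deploy: Callable[[str], bool],   # hypothetical: True if the host updated cleanly
    max_failure_rate: float = 0.01,  # assumed threshold for halting
) -> tuple[list[str], bool]:
    """Deploy in growing stages and halt as soon as one stage fails too often.

    Returns (hosts updated successfully, whether the rollout completed).
    """
    updated: list[str] = []
    done = 0  # number of hosts already attempted
    for percent in stage_percents:
        target = min(len(hosts), max(done, len(hosts) * percent // 100))
        batch = hosts[done:target]
        if not batch:
            continue  # stage too small to contain any new host
        failures = 0
        for host in batch:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        done = target
        if failures / len(batch) > max_failure_rate:
            return updated, False  # stop before reaching the remaining hosts
    return updated, done == len(hosts)


# Example: a faulty update fails on every host, so only the first stage is hit.
fleet = [f"host-{i}" for i in range(100)]
attempted: list[str] = []


def faulty_deploy(host: str) -> bool:
    attempted.append(host)
    return False


updated, completed = staged_rollout(fleet, (1, 10, 100), faulty_deploy)
print(len(attempted), completed)  # 1 False
```

In this example the faulty update reaches one host out of a hundred before the rollout halts, which is the point of staging: the damage is limited to the size of the first group.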

7. Developing robust incident response plans

A well-defined incident response plan is crucial for managing and recovering from IT outages. The CrowdStrike incident highlights the need for organizations to have a clear and actionable response strategy. An effective plan should include protocols for detecting, addressing, and communicating about issues to ensure a swift resolution.

Simon Newman, co-founder of Cyber London and International Cyber Expo Advisory Council member, emphasized that organizations need better response plans: “This incident demonstrates the need for every organization to have a robust Incident Response Plan in place that is regularly reviewed and tested to minimize the impact and recover quickly”.

8. Monitoring and real-time feedback

Continuous infrastructure and application monitoring and real-time feedback are essential for detecting and addressing issues as they arise. Improved monitoring systems can provide early warnings of potential problems, allowing for quicker responses and minimizing the impact on users. Real-time feedback helps organizations stay informed and act proactively to prevent or mitigate disruptions.
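As a small illustration of what an early warning can look like, the Python sketch below tracks the failure rate over the most recent events and signals an alert when it crosses a threshold. The class name `ErrorRateMonitor`, the window size, and the threshold are hypothetical values chosen for the example; real monitoring platforms offer far richer capabilities.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the failure rate over a sliding window of recent events."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # deque with maxlen drops the oldest event automatically.
        self.events: deque[bool] = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one event and return True if an alert should fire."""
        self.events.append(ok)
        failures = sum(1 for event_ok in self.events if not event_ok)
        return failures / len(self.events) > self.threshold


# Example: ten healthy check-ins, then a run of failures.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
for _ in range(10):
    monitor.record(True)

alerts = [monitor.record(False) for _ in range(3)]
print(alerts)  # [False, False, True]
```

The alert fires on the third failure, when three of the last ten events have failed and the rate exceeds the 20% threshold. Tuning the window and threshold is a trade-off between catching problems early and avoiding false alarms.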

9. Increased risk of cyber attacks

Following a major IT outage, organizations may face increased risks of cyber attacks. Disruptions can create vulnerabilities that threat actors might exploit, particularly if systems are left in a weakened state or if quick fixes lead to security oversights. The CrowdStrike incident underscores the importance of maintaining strong cybersecurity measures and ensuring that systems are thoroughly secured following an outage to prevent subsequent attacks.

“Business owners need to stop viewing cybersecurity services as merely a cost and instead as an essential investment in their company’s future,” said Javad Abed, an assistant professor of information systems at Johns Hopkins Carey Business School.

10. The role of industry standards and collaboration


Collaboration within the industry and adherence to established standards can improve overall IT resilience. Sharing insights and best practices helps organizations learn from each other’s experiences and enhance their own processes. Industry-wide efforts to develop and implement standards can contribute to better update management and system reliability.

11. Financial implications and security costs

The financial fallout from the outage was immense, with costs potentially exceeding $1 billion, according to CNN. Beyond the direct financial implications, organizations must consider the long-term security costs associated with such disruptions, including increased risk of data breaches and cyber attacks that can arise from weakened security postures.

Sources:

Corvin, A. (2024, July 25). Five lessons from the CrowdStrike Windows IT outage. Tech Informed. Retrieved on August 6, 2024, from: https://techinformed.com/five-lessons-from-the-crowdstrike-windows-it-outage/

Chapman, M. (2024, July 31). Delta CEO says airline is facing $500 million in costs from global tech outage. AP News. Retrieved on August 6, 2024, from: https://apnews.com/article/delta-crowdstrike-outage-buttigieg-5902eb14e0ec23697d8ffd9c0e63f23e

Isidore, C. (2024, July 22). Costs from the global outage could top $1 billion – but who pays the bill is harder to understand. CNN. Retrieved on August 6, 2024, from: https://edition.cnn.com/2024/07/21/business/crowdstrike-outage-cost/index.html

Matthews, B. (2024, August 1). CrowdStrike sued by shareholders in aftermath of global Microsoft outage. Washington Times. Retrieved on August 6, 2024, from: https://www.washingtontimes.com/news/2024/aug/1/crowdstrike-sued-by-shareholders-in-aftermath-of-g/

O'Flaherty, K. (2024, July 24). CrowdStrike Reveals New Details About What Caused Windows Outage. Forbes. Retrieved on July 29, 2024, from: https://www.forbes.com/sites/kateoflahertyuk/2024/07/24/crowdstrike-reveals-new-details-about-what-caused-windows-outage/

O'Flaherty, K. (2024, July 29). CrowdStrike—How Microsoft Will Protect 8.5 Million Windows Machines. Forbes. Retrieved on August 6, 2024, from: https://www.forbes.com/sites/kateoflahertyuk/2024/07/29/crowdstrike-how-microsoft-will-protect-85-million-windows-machines/

Roth, E. (2024, July 19). What is CrowdStrike, and what happened? The Verge. Retrieved on August 9, 2024, from: https://www.theverge.com/2024/7/19/24201864/crowdstrike-outage-explained-microsoft-windows-bsod

Williams, K. (2024, July 20). The CrowdStrike fail and next global IT meltdown already in the making. CNBC. Retrieved on August 6, 2024, from: https://www.cnbc.com/2024/07/20/the-crowdstrike-fail-and-next-global-it-meltdown-already-in-the-making.html