Navigating IT Challenges: Lessons from the recent CrowdStrike global outage

In today's interconnected world, even a brief disruption can cause significant issues. We recently saw this firsthand with the global outage experienced by users of CrowdStrike Falcon, one of the most widely used Endpoint Detection and Response (EDR) platforms. The event is a stark reminder of how fragile our digital infrastructure can be. For organisations relying on IT services, it's crucial to understand what happened, what the impacts were, and how to prepare for similar events in future.

On Friday 19 July 2024, a defective Falcon content configuration update for Windows hosts triggered a major global outage. Many businesses were unable to access their workstations and servers, which severely disrupted business-as-usual (BAU) activities. The impact was immediate and felt across the globe.

Operational delays were inevitable, especially in sectors like finance and healthcare where access to data is paramount. Financial losses followed: businesses lost revenue when they could not reach their payment systems, customers could not access their banking facilities, and contingency measures brought additional costs.
Overall, this incident highlighted the intricate and interconnected nature of modern IT systems and the importance of planning ahead for such events.

Incident Management and Response

While some organisations were able to support themselves, many had to rely heavily on external support. As customers began calling in, Inde’s team sprang into action and proactively contacted every customer using CrowdStrike to check in and offer support where needed. Inde identified more than 170 affected servers across a number of customers and promptly assembled a P1 team. It comprised a technical response team of specialists who monitored and analysed the situation and supplied the latest remediation guidance, and a team of engineers who helped customers implement the fixes and bring their systems back online.

Some customers were hit particularly hard. One lost remote access to their servers during an unrelated power outage affecting 1500 homes, which left their systems unreachable for remediation. A logistics company experienced a complete loss of systems, severely impacting customer-facing services, while a healthcare customer faced critical issues when their printers and helpdesk went down.

The process of fixing these issues was highly manual and resource-intensive. Servers kept crashing and were unable to restart, which in cloud environments meant their disks had to be temporarily relocated for remediation. Fortunately, Inde’s P1 team was able to provide on-site expertise and on-call support where needed. The result was a robust and rapid fix, with most customer systems back online and fully functional by the early hours of Saturday.
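To make the disk-relocation step more concrete, here is a minimal sketch of what remediation on a relocated disk could look like. It is not a record of the tooling Inde used. It assumes the affected disk has been attached to a healthy machine and its Windows directory is reachable at a path you pass in, and it relies on CrowdStrike's publicly documented workaround of removing channel files matching `C-00000291*.sys` from the `System32\drivers\CrowdStrike` folder. The function names are hypothetical, and the script defaults to a dry run that only reports what it would delete.

```python
from pathlib import Path
from typing import List

# File pattern taken from CrowdStrike's public remediation guidance for this incident.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_defective_channel_files(windows_root: Path) -> List[Path]:
    """Return matching channel files under <windows_root>/System32/drivers/CrowdStrike."""
    driver_dir = windows_root / "System32" / "drivers" / "CrowdStrike"
    if not driver_dir.is_dir():
        return []  # Falcon is not installed on this disk, or the path is wrong
    return sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))


def remediate(windows_root: Path, dry_run: bool = True) -> List[Path]:
    """List the matching files and delete them only when dry_run is False."""
    matches = find_defective_channel_files(windows_root)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remediate(Path("F:/Windows"))` (a hypothetical mount point) would list the files without touching them; passing `dry_run=False` deletes them, after which the disk can be returned to its original server.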

Endpoint Detection and Response (EDR) platforms like CrowdStrike present IT teams with a dilemma. If the EDR agent starts too late in the Windows boot sequence, it may fail to detect malware running at the lowest level of the operating system, or may be disabled by it. If it starts early, as a kernel driver, a defect in that driver can stop the machine from booting at all. Boot priority is a privilege and not a right, and developers of Windows kernel drivers are required to uphold extremely high quality-assurance standards. When they don't get this right, things can go very wrong. If you’d like to understand more about the technical aspects of this, our security experts recommend this video by Dave Plummer, a former Microsoft engineer known for creating Task Manager, amongst many other programs.

Key Learnings and Recommendations

The CrowdStrike outage underscored the necessity for robust incident response plans in organisations. These plans should include:

Clear Protocols: Establish protocols for communication, escalation, and recovery to minimise disruption during events.
Diverse IT Environment: Implement a multi-layered IT strategy that spreads critical functions across multiple vendors, so that a failure at any single vendor cannot take everything down.
Regular Backups and Testing: Ensure backups are secure, accessible, and frequently tested to facilitate quicker recovery during an outage, allowing prompt restoration of data and resumption of operations.
Vendor Risk Management: Conduct thorough due diligence and ongoing assessments of vendors’ reliability, understanding their contingency plans and capacity to handle disruptions.
Comprehensive Disaster Recovery (DR) and Business Continuity Plan (BCP): Assess how your business can operate without technology and ensure data recovery even if servers are irreparably damaged.
Staging Updates: Deploy updates to small test groups first and then in phases to the wider environment, so that a defective update is caught early and causes minimal business disruption.

Taking these steps ensures that your organisation is better equipped to handle similar events, safeguarding operations and maintaining the trust of clients and stakeholders.
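The staged-update recommendation above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any vendor's deployment tooling: the ring names, their sizes, the 2% failure threshold and the `apply_update` callback are all hypothetical. Hosts are assigned to rings by hashing their names, so each host stays in the same ring between runs, and the rollout stops as soon as one ring exceeds the failure threshold.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

# Hypothetical rings: (name, cumulative share of the fleet reached by that stage).
RINGS = [("canary", 0.05), ("early", 0.25), ("broad", 1.00)]


def assign_ring(hostname: str) -> str:
    """Map a hostname to a ring deterministically using a hash of its name."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    for name, cumulative_share in RINGS:
        if fraction < cumulative_share:
            return name
    return RINGS[-1][0]


def staged_rollout(
    hosts: Iterable[str],
    apply_update: Callable[[str], bool],  # returns True if the host is healthy afterwards
    max_failure_rate: float = 0.02,
) -> Dict[str, object]:
    """Update ring by ring and halt when a ring's failure rate exceeds the threshold."""
    by_ring: Dict[str, List[str]] = {name: [] for name, _ in RINGS}
    for host in hosts:
        by_ring[assign_ring(host)].append(host)

    completed: List[str] = []
    failed_hosts: List[str] = []
    for name, _ in RINGS:
        ring_hosts = by_ring[name]
        failures = [h for h in ring_hosts if not apply_update(h)]
        failed_hosts.extend(failures)
        if ring_hosts and len(failures) / len(ring_hosts) > max_failure_rate:
            # Stop here: later rings never receive the update.
            return {"halted_at": name, "completed": completed, "failed_hosts": failed_hosts}
        completed.append(name)
    return {"halted_at": None, "completed": completed, "failed_hosts": failed_hosts}
```

In practice `apply_update` would wrap whatever deployment and health-check mechanism your environment already uses. The point of the design is that a defective update is stopped at the first ring that shows failures, so later rings are never exposed to it.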

Preparing for the future

Organisations should conduct regular security assessments to evaluate potential vulnerabilities within their IT environment. Closer collaboration with IT service providers can lead to better preparedness and faster recovery during crises. Adopting a culture of continuous improvement is also crucial: learning from past incidents, updating response strategies, and investing in training for IT teams. Staying informed about industry developments and emerging threats allows organisations to adapt quickly and stay ahead of potential disruptions.

The CrowdStrike outage highlighted just how important preparedness is in our digital age. By learning from this incident and implementing robust strategies, your organisation can enhance its resilience against future IT disruptions. In a world where technology is constantly evolving, staying vigilant and prepared isn’t just an option; it’s a necessity.

Our team at Inde can work with your IT team to identify areas of concern and implement effective fixes, ensuring your business remains operational even in the face of significant IT disruptions. Inde has successfully worked with global customers to roll out new services across their IT infrastructure, such as during business mergers. We help identify the architecture of your IT infrastructure, pinpoint potential areas that could be affected, and implement strategies to mitigate risks and ensure seamless operations.

If you’d like to know more, please reach out to our Cloud & Identity Director Chris Burke.

About the author

Chris Burke

With over two decades of dedicated experience in the field of Information Technology, Chris (Burko to all but the most distant of relatives) is a seasoned professional who has made significant contributions across diverse regions, including New Zealand, Australia, and the United Kingdom. His expertise lies in successfully navigating the complexities of large-scale cloud migrations, spearheading integration projects, and seamlessly executing identity provider migrations. His specialised focus on large-scale cloud migrations showcases his ability to harness cutting-edge technologies and methodologies to drive organisational efficiency and competitiveness. His hands-on experience in overseeing integration projects has resulted in streamlined operations and enhanced collaboration within organisations. Outside of work, he finds joy in a variety of activities, including perfecting the art of BBQing, indulging in the occasional bout of amateur guitar playing and ensuring his kids' bikes are always in top-notch condition. A nature lover at heart, Burko digs the refreshing vibes of ocean dips and wraps it all up with a well-deserved beer to unwind.