Today's systems are inherently complex, with some component parts often operating in, or close to, suboptimal or failure modes. Left unchecked, as complexity increases, the compounding of failure modes will inevitably lead to catastrophic system failure. Chaos Days help us address this risk by spending time deliberately inducing failures, then analysing the response. This session summarises our experience of running Chaos Days on a large-scale platform. We'll explore the what, why, how and when of running a Chaos Day.
As engineers, we spend a lot of our time thinking about how best to shield our clients and customers from the risks inherent in the systems we build. We ask ourselves 'what's the worst that could happen?', and work hard to mitigate that risk. A common risk in most systems, particularly distributed ones, is the unexpected failure of a component part. As a system's complexity and its number of subsystems grow, so does the likelihood of a subsystem failure. Subsystem failures can compound in such a way that catastrophic system failure becomes a certainty; the only uncertainty is when the system will fail.
Chaos Engineering addresses the risks inherent in distributed systems that stem from unexpected component failure. It does so by running experiments that deliberately induce different types of failure in different components and explore their impact on the wider system. Outcomes are then analysed and the learnings applied to improve the system's resilience. These learnings also deepen our understanding of the system and its failure modes, which helps us identify new failure scenarios. This feedback loop informs subsequent rounds of experimentation, and so the cycle repeats. In addition, planned failures give teams a safe environment in which to practise their incident response and refine how they conduct the subsequent postmortems.
Chaos experiments can take many forms, ranging from continuous, automated failure injection (made famous by the Netflix Chaos Monkey) to one-off Chaos Days (similar to Amazon's Game Day), where disruption is manually instigated. Chaos Engineering is similar in ethos to 'building quality in': it's a mindset, not a toolset. You don't need to be running EKS on AWS to benefit from being curious about failure modes and how to improve a system's resilience to them; it just requires a focus on 'building resilience in'.
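To make the automated end of that spectrum a little more concrete, here is a minimal, hypothetical sketch of a failure-injection experiment: it stops one randomly chosen container and then probes a health endpoint to see whether the wider system degrades gracefully. The container names, health URL and Docker-based tooling are illustrative assumptions only, not details of the platform or tooling described in this session.

```python
import random
import subprocess
import time
import urllib.request

# Hypothetical targets and health check - illustrative assumptions,
# not part of the platform described in this session.
CANDIDATES = ["orders-service", "catalogue-service", "payments-service"]
HEALTH_URL = "http://localhost:8080/health"

def inject_failure() -> str:
    """Stop one randomly chosen container (the 'blast radius' of this experiment)."""
    victim = random.choice(CANDIDATES)
    subprocess.run(["docker", "stop", victim], check=True)
    return victim

def system_still_healthy(timeout: float = 5.0) -> bool:
    """Probe a health endpoint to see whether the wider system still responds."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    victim = inject_failure()
    time.sleep(10)  # give the platform time to detect and react to the failure
    print(f"Stopped {victim}; system healthy afterwards: {system_still_healthy()}")
    # Restore the victim so the experiment is reversible.
    subprocess.run(["docker", "start", victim], check=True)
```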
This session shares our experience of running Chaos Days over the last year with one of our clients: a major Government department that hosts around 60 distributed, digital delivery teams. These teams design, deliver and support hundreds of microservices that serve online content to the department's varied customers.
The microservices all run on a single platform, which is itself operated by seven Platform Teams, each responsible for a distinct area (infrastructure, security and so on). Inspired by the Netflix Chaos Monkey and Amazon's Game Day, the Platform Teams have planned and executed several Chaos Days, to see just how well they and the Platform cope when everything that could go wrong does go wrong.
More details: https://confengine.com/agile-india-2020/proposal/13278
Conference Link: https://2020.agileindia.org