Your customers shouldn’t find problems before you do. When we develop software and make architectural decisions, we try to anticipate potential problems: ambiguous user interfaces, performance bottlenecks, and unexpected edge cases. Generally we do a good job of it, but as system complexity grows, the mental models we use to plan and understand those systems don’t always keep pace. So what do we do about this? We can test all the things! By using automation, we test complex scaling scenarios to validate our mental models and to identify unanticipated side effects.
One of the issues we recently dealt with was supporting a major change in our traffic patterns. Although overall load stayed the same, the stress points produced by that load changed significantly. Major shifts like these always have the potential to disrupt our service, and in turn, disrupt our customers’ ability to keep their systems running. We had some predictions about how our system would react to the new load profile, but we wanted to validate those predictions ourselves rather than waiting for our customers to experience service degradation.
Although each engineering team had some idea of how these changes would affect the performance of their own services and had work scheduled to address those issues, I wanted to make sure we were all equipped to make informed prioritization and planning decisions. All I had to do was figure out a way to consolidate the efforts of more than 90 engineers into one focused attack on our scaling challenges.
Fortunately, I didn’t have to start from scratch: I could build on existing attitudes of collaboration and ownership, and on a culture of reliability that has produced a rich toolset for testing resilience and scalability. This talk will outline how we used those tools and developed new ones, what we learned in the process, and the challenges of consolidating the efforts of separate teams toward a specific, common initiative.