SREcon16 Europe - Production Improvement Review: Taking a Bite Out of Repair Debt
Martin Check, Microsoft
Azure SRE works with services of widely varying maturity, ranging from fully federated DevOps teams to fully tiered IT/Ops teams, and everything in between. The one thing all of these services have in common is that they have outages. While they all respond and recover in different ways, SRE has to collect and leverage data in a common manner across all services to prevent outages and drive reliability up consistently. In this talk we’ll discuss how SRE leverages diverse data sets to drive improvements across this heterogeneous set of services. SRE ensures that teams are rigorously completing post-incident reviews and addressing their live site debt. We not only look at the actual repair debt, but we’ve also introduced a new concept called “virtual debt,” which shows where a service’s incident response faltered but no appropriate repair was logged. Virtual debt is affectionately referred to as “PacMan debt” due to the appearance of the chart.
View the full SREcon16 Europe Program at https://www.usenix.org/conference/srecon16europe/program