DevOpsDays Philadelphia 2018 - Metrics, Alerts, and Dashboards by Mauricio Linhares
In this presentation we’ll learn which metrics matter most in our systems (upper and lower bounds, SLAs/SLOs), what dashboards are for, why different consumers need different dashboards, and why dashboards are for gathering more information about an outage rather than for discovering that one is happening. We’ll also cover, sadly, alerting: what to think about before adding a new alert (can we automate the response? is it really actionable? do we have expectations for when it will trigger?) and how to avoid alert burnout. The main goal is to help teams and managers make sense of their data by collecting meaningful information, presenting it in a way that is useful to everyone involved, and not drowning teams in noise.
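
The three questions to ask before adding an alert can be sketched as a small review function. This is only an illustration in Python: the `AlertProposal` type, its field names, the decision labels, and the order in which the questions are checked are all assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class AlertProposal:
    """Hypothetical record of the answers to the three pre-alert questions."""
    name: str
    response_can_be_automated: bool  # can we automate the response?
    is_actionable: bool              # is it really actionable?
    has_trigger_expectation: bool    # do we know when it should trigger?


def review_alert(proposal: AlertProposal) -> str:
    """Suggest what to do with a proposed alert (illustrative policy only)."""
    if proposal.response_can_be_automated:
        # A response that a script can perform should not page a person.
        return "automate"
    if not proposal.is_actionable:
        # Nobody can act on it, so it would only add noise.
        return "reject"
    if not proposal.has_trigger_expectation:
        # Without an expectation we cannot tell a real signal from noise.
        return "define-expectation"
    return "page"
```

For example, `review_alert(AlertProposal("disk-full", True, True, True))` suggests `"automate"`, while a proposal that fails the actionability question is rejected before it can contribute to alert burnout.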