The talk will start with a quick overview of Datadog's rapid growth and the challenges that resulted, to illustrate the point at which a simple primary and secondary on-call team starts to fall apart.
In hindsight the signs are obvious; in the thick of it, however, it is hard to step back and realize that the on-call team and processes are falling apart. It should be said that what was in place worked and met our needs for a long time. You have to start somewhere. The evolution is what I focus on, while sharing the tricks that make that evolution easier.
The talk will then go into some of the patterns Datadog found useful, such as refining our incident management processes and roles, growing the depth of the on-call team, and eventually switching to per-team rotations, along with the challenges involved in this evolution.
We will highlight some of the useful tricks and tools Datadog has used, such as:
- Structured service templates to help with on-call training
- On-call training and shadow ops rotations
- The use of GitHub Issues to track on-call tasks for handoff and to use as training examples
- Scheduled on-call handoffs that include systematically reviewing the sources of alerts to kill noise
- Providing a way to capture monitor feedback from every alert notification
- Patterns of using GitHub projects to track where each on-call member stands with respect to service training
- Scripts to use in conjunction with the service templates and on-call scheduling to show each on-call member a list of what changed since the last time they were on-call