Having a software engineer on-call at a tech company ensures that if any failures or alerts occur, someone is dedicated to responding to them. Most engineers will have experienced some form of being on-call over their career. I have seen many posts on Hacker News about being on-call with negative opinions and poor experiences. Thankfully, I cannot commiserate with these examples, so my goal with this post is to show how we have tried to reduce the pain points of being on-call at ClassDojo.
Our core engineers rotate being on-call on a weekly basis, switching on Monday mornings, US west coast time. With our current set of engineers, an individual can expect to have approximately 4 shifts a year, or one every 3 months, though this changes as our team grows. If your shift falls on a planned vacation week or you have something planned on a certain day, other engineers on the team will happily switch and support. We use PagerDuty to schedule shifts and handle alerts. It also allows us to easily schedule overrides. Engineers who are doing risky work will often override the on-call schedule until they have completed their task. Other engineers are always willing to take a few hours of someone else’s on-call shift if there is an availability conflict. Folks are pretty flexible about being on-call, which makes things better for everyone.
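To make the rotation arithmetic concrete, here is a minimal sketch of a weekly round-robin. It is not how PagerDuty or our schedule is actually configured: the roster names, the `epoch` date, and the roster size of 13 (picked only because 52 / 13 gives the "about 4 shifts a year" figure) are all assumptions for illustration. It also ignores the exact Monday-morning handover time and any overrides.

```python
from datetime import date, timedelta


def shift_start(day: date) -> date:
    """Return the Monday that begins the weekly shift containing `day`."""
    return day - timedelta(days=day.weekday())


def on_call_for(day: date, roster: list[str], epoch: date) -> str:
    """Pick who is on-call for `day` under a plain weekly round-robin.

    `epoch` is any date inside a week in which roster[0] was on-call
    (a hypothetical reference point, not a real schedule).
    """
    weeks_elapsed = (shift_start(day) - shift_start(epoch)).days // 7
    return roster[weeks_elapsed % len(roster)]


def shifts_per_year(roster_size: int) -> float:
    """Approximate number of weekly shifts one engineer takes per year."""
    return 52 / roster_size


# Hypothetical three-person roster, starting on Monday 2024-01-01.
roster = ["ana", "ben", "chen"]
epoch = date(2024, 1, 1)
current = on_call_for(date(2024, 1, 10), roster, epoch)  # second week of the rotation
```

As the roster grows, `shifts_per_year` shrinks, which is the effect described above: each person's shifts get further apart.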
For the most part we don’t have separation of concerns, so our engineers operate across the stack. People who work on our services also write the tests and are responsible for ensuring production is healthy.
The question on a lot of people's minds is about being woken up at 2am. We try to reduce any middle-of-the-night wake-ups with our humanity > PagerDuty policy. We have engineers across a few timezones, so we have additional rotating schedules where alerts will be redirected to colleagues who are awake in Europe and South America. That is not to say that they won’t have to wake anyone up if something is wrong, but they can act as a first line of defense if there is a middle-of-the-night alert. If you do happen to get woken up, we don’t expect you to work a normal workday. This isn’t an exercise in sleep deprivation. When an alert goes off at night, we assess the situation and fix the alert if needed. We take people being woken up seriously.
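The "first line of defense" idea can be sketched as a small routing rule: page the on-call engineer if it is daytime for them, otherwise try a colleague who is awake, and only wake the on-call engineer if nobody else is. This is an illustrative sketch, not our PagerDuty configuration. The `Responder` type, the waking hours of 08:00 to 22:00, and the fixed UTC offsets (which ignore daylight saving time) are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Responder:
    name: str
    utc_offset_hours: float  # simplification: fixed offset, no DST handling


def is_awake(r: Responder, now_utc: datetime,
             wake_hour: int = 8, sleep_hour: int = 22) -> bool:
    """True if the responder's local time falls within assumed waking hours."""
    local = now_utc.astimezone(timezone(timedelta(hours=r.utc_offset_hours)))
    return wake_hour <= local.hour < sleep_hour


def route_alert(now_utc: datetime, on_call: Responder,
                first_line: list[Responder]) -> Responder:
    """Choose who gets paged first for an alert raised at `now_utc`."""
    if is_awake(on_call, now_utc):
        return on_call
    for responder in first_line:
        if is_awake(responder, now_utc):
            return responder
    # Nobody is in waking hours: fall back to the on-call engineer.
    return on_call


# Hypothetical people and offsets, for illustration only.
on_call = Responder("west-coast-engineer", -8)
first_line = [Responder("europe-engineer", 1), Responder("south-america-engineer", -3)]

night_alert = datetime(2024, 1, 10, 10, 0, tzinfo=timezone.utc)  # 02:00 on the west coast
paged = route_alert(night_alert, on_call, first_line)
```

The fallback branch matters: the first line reduces wake-ups, but an alert still has to reach someone.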
The engineer on-call is expected to focus on triage-based work, not on their product team tasks. At a minimum this means acknowledging any alerts, investigating issues, and gathering other engineers if they need support in fixing the issue. The expectation is not that you will know how to respond to everything, but that you can manage the situation and see the issue through to the end with support if needed. The main application is a monolith and every route is linked to a team. We try to route alerts intelligently to the specific teams that own them, which makes on-call work more of a general safety net. Especially during our busy back-to-school season, it can be helpful for the on-call engineer to point out issues arising for a specific team that they might not have noticed.
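Linking every route to a team can be pictured as an ownership table with a fallback. The sketch below is a guess at the shape of such a lookup, not our actual implementation: the route prefixes, team names, and the longest-prefix rule are all hypothetical. Anything without an owner falls through to the on-call engineer, which is the "general safety net" role.

```python
# Hypothetical ownership table: route prefix -> owning team.
ROUTE_OWNERS = {
    "/api/messages": "messaging-team",
    "/api/messages/attachments": "media-team",
    "/api/classes": "classroom-team",
}


def owner_for(route: str, owners: dict[str, str], fallback: str = "on-call") -> str:
    """Return the team owning `route`, using the longest matching prefix.

    A prefix matches only on whole path segments, so "/api/messages"
    does not claim "/api/messagesX".
    """
    matches = [
        prefix for prefix in owners
        if route == prefix or route.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        return fallback
    return owners[max(matches, key=len)]


alert_target = owner_for("/api/messages/123", ROUTE_OWNERS)
```

Picking the longest prefix lets a more specific owner take precedence over a general one without reordering the table.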
Alerts should not constantly be going off. Runbooks should be created or updated, and thresholds on monitors should be tweaked. If a major incident does occur, an investigation should take place, a post-mortem should be organized, and follow-up tasks should be prioritized.
Engineers might have cross-team work that they set aside to do during their shift, and there is always work that can be picked up from one of our guilds. Many engineers enjoy this time to work on improvements they didn’t have time to do during regular weeks or to learn something new. “Culling Containers with a Leaky Bucket” is a recent project that resulted from triage issues. Projects that automate tasks or improve our test speed are highly celebrated.
As our engineering team grows, some interesting cons might crop up. If engineers start to have shifts months and months apart, will newcomers get good training in this area? Will people feel confident being on-call when they do it so infrequently? We can look at partitioning days better as our pool of colleagues across time zones grows. We could look at shortening shifts, but it might be a bumpy transition.
We can easily get focused on only fixing problems within our product team domain. Being on-call allows us to lift our heads and see what is happening across the company. There might not always be an epic mystery to solve, but the breathing room while on-call can help engineers recognize patterns across teams and give them the time to implement improvements.
On a personal note: I had a lot of anxiety being on-call when I first started at ClassDojo. The breadth of the alerts and not knowing if I was going to get woken up or have an incident over the weekend all contributed to my stress. Over time my confidence grew in step with my knowledge of our system and my ability to investigate different problems. Even if I couldn’t solve the problem myself, I felt better reaching out for help when I could present what I believed the issue was. Telling a critical issue (raise the alarm) apart from something that can be fixed on Monday is still a nuance I am working on. As we scale and expand, there is much more to learn and more alerts to finesse.
I hope that I have convinced you that being on-call at ClassDojo should not be seen as a negative. Every time I am on-call now, I take it as a challenge to improve something (no matter how small) and an opportunity to learn. I find enjoyment in a good investigation and supporting other teams when I see something abnormal happening.