Being On-call at ClassDojo

Sarah Mahovlich

2023-02-22

Having a software engineer on-call at a tech company ensures that if any failures or alerts occur, there is someone dedicated to responding to them. Most engineers will have experienced some form of being on-call over their career. I have seen many posts on Hacker News about being on-call filled with negative opinions and poor experiences. Thankfully, I can't relate to those experiences, so my goal with this post is to show how we have tried to reduce the pain points of being on-call at ClassDojo.

The Basics

Our core engineers rotate being on-call on a weekly basis, switching on Monday mornings, US west coast time. With our current set of engineers, an individual can expect to have approximately 4 shifts a year or one every 3 months, though this changes as our team grows. If your shift falls on a planned vacation week or you have something planned on a certain day, other engineers on the team will happily switch and support. We use PagerDuty to manage scheduling shifts and handling alerts. It also allows us to easily schedule overrides. Engineers who are doing risky work will often override the on-call alerts until they have completed their task. Other engineers are always willing to take a few hours of someone else’s on-call shift if they have some availability conflict. Folks are pretty flexible about being on-call, which makes things better for everyone.

For the most part we don’t have separation of concerns, so our engineers operate across the stack. People who work on our services also write the tests and are responsible for ensuring production is healthy.

Inconvenient Alerts

The question on a lot of people's minds is about being woken up at 2am. We try to reduce any middle-of-the-night wake-ups with our humanity > PagerDuty policy. We have engineers across a few timezones, so we have additional rotating schedules where alerts will be redirected to colleagues who are awake in Europe and South America. That's not to say they won't have to wake anyone up if something is wrong, but they can act as a first line of defense if there is a middle-of-the-night alert. If you do happen to get woken up, we don't expect you to work a normal workday afterwards. This isn't an exercise in sleep deprivation. When an alert goes off at night, we assess the situation and fix the alert if needed. We take people being woken up seriously.

Expectations

The engineer on-call is expected to focus on triage-based work, not on their product team tasks. At a minimum this means acknowledging any alerts, investigating issues, and gathering other engineers if they need support in fixing the issue. The expectation is not that you will know how to respond to everything, but that you can manage the situation and see the issue through to the end with support if needed. The main application is a monolith and every route is linked to a team. We try to route alerts intelligently to the specific team that owns them, making on-call work more of a general safety net. Especially during our busy back-to-school season, it can be helpful for the on-call engineer to point out issues arising for a specific team that they might not have noticed.

Alerts should not constantly be going off: runbooks should be created or updated, and thresholds on monitors should be tweaked. If a major incident does occur, an investigation should take place, a post-mortem should be organized, and follow-up tasks prioritized.

On-call Projects

Engineers might have cross team work that they set aside to do during their shift, or there is always work that can be picked up from one of our guilds. Many engineers enjoy this time to work on improvements they didn’t have time to do during regular weeks or take time to learn something new. Culling Containers with a Leaky Bucket is a recent project that resulted from triage issues. Projects that automate tasks or improve our test speed are highly celebrated.

Growing Pains

As our engineering team grows, some interesting cons might crop up. If engineers start to have shifts months and months apart, will newcomers get good training in this area? Will people feel confident being on-call when they do it so infrequently? We can look at partitioning days better as our pool of colleagues across time zones grows. We could look at shortening shifts, but it might be a bumpy transition.

Conclusion

We can easily get focused on only fixing problems within our product team domain. Being on-call allows us to lift our heads and see what is happening across the company. There might not always be an epic mystery to solve, but the breathing room while on-call gives engineers the chance to recognize patterns across teams and the time to implement improvements.

On a personal note: I had a lot of anxiety being on-call when I first started at ClassDojo. The breadth of the alerts and not knowing if I was going to get woken up or have an incident over the weekend all contributed to my stress. Over time my confidence grew, along with my knowledge of our system and my ability to investigate different problems. Even if I couldn't solve the problem myself, I felt better reaching out for help when I could present what I believed the issue was. Figuring out what is a critical issue (raise the alarm) versus something that can be fixed on Monday is still a nuance I am working on. As we scale and expand, there is much more to learn and more alerts to finesse.

I hope that I have convinced you that being on-call at ClassDojo should not be seen as a negative. Every time I am on-call now, I take it as a challenge to improve something (no matter how small) and an opportunity to learn. I find enjoyment in a good investigation and supporting other teams when I see something abnormal happening.

    ClassDojo occasionally has a few containers get into bad states that they're not able to recover from. This normally happens when a database connection gets into a bad state -- we've seen this with Redis, MySQL, MongoDB, and RabbitMQ connections. We do our best to fix these problems, but we also want our containers to have a chance of recovering on their own without manual intervention. We don't want to wake people up at night if we don't need to! Our main strategy to make that happen is having our containers decide whether they should try restarting themselves.

    The algorithm we use for this is straightforward: every ten seconds, the container checks if it's seen an excessive number of errors. If it has, it tries to claim a token from our shutdown bucket. If it's able to claim a token, it starts reporting that it's down to our load balancer and container manager (in this case, nomad). Our container manager will take care of shutting down the container and bringing up a new one.

    On every container, we keep a record of how many errors we've seen over the past minute. Here's a simplified version of what we're doing:

    // track the timestamp of every server error we see
    let recentErrorTimes: number[] = [];
    function serverError(...args: unknown[]) {
      recentErrorTimes.push(Date.now());
    }
    
    export function getPastMinuteErrorCount () {
      const oneMinuteAgo = Date.now() - 60_000;
      // prune entries older than a minute so the array doesn't grow without
      // bound, then count the errors from the past minute
      recentErrorTimes = recentErrorTimes.filter((t) => t >= oneMinuteAgo);
      return recentErrorTimes.length;
    }
    

    Check out "ERROR, WARN, and INFO aren't actionable logging levels" for some more details on ClassDojo's approach to logging and counting errors.

    After tracking our errors, we can then check whether we've seen an excessive number of errors on an interval. If we've seen an excessive number of errors we'll use a leaky token bucket to decide whether or not we should shut down. Having a leaky token bucket for deciding whether or not we should try to shut down the container is essential: if we don't have that, a widespread issue that's impacting all of our containers would cause ALL of our containers to shut down and we'd bring the entire site down. We only want to cull a container when we're sure that we're leaving enough other containers to handle the load. For us, that means we're comfortable letting up to 10 containers shut themselves down without any manual intervention. After that point, something is going wrong and we'll want an engineer in the loop.

    let isUp = true;
    const EXCESSIVE_ERROR_COUNT = 5;
    const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
    
    export async function check () {
      if (!isUp) return;
      // if we've seen an excessive number of errors AND we're able to claim a
      // shutdown token, start reporting that this container is down
      if (getPastMinuteErrorCount() >= EXCESSIVE_ERROR_COUNT && await canHaveShutdownToken()) {
        isUp = false;
        return;
      }
    
      // otherwise, check again in ten seconds
      await delay(10_000);
      check();
    }
    
    export function getIsUp () {
      return isUp;
    }
    

    At this point, we can use getIsUp to start reporting that we're down to our load balancer and to our container manager. We'll go through our regular graceful server shutdown logic, and when our container manager brings up a new container, starting from scratch should make it likely that we avoid whatever issue caused the problem in the first place.

    router.get("/api/haproxy", () => {
      if (getIsUp()) return 200;
      return 400;
    });
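    
    To tie the pieces together, here's a minimal sketch of how the loop could be kicked off at server startup. The startErrorWatcher name and the "./errorWatcher" module path are hypothetical -- the point is just that check() starts running once the server boots, so the health endpoint above reflects the container's state from then on.
    
    import { check } from "./errorWatcher"; // hypothetical module containing check/getIsUp
    
    // hypothetical startup hook: kick off the ten-second check loop once the
    // server is listening; check() re-schedules itself until it claims a
    // shutdown token and flips isUp to false
    export function startErrorWatcher() {
      check().catch(() => {
        // the watcher should never take the process down on its own
      });
    }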
    

    We use Redis for our leaky token bucket. If something goes wrong with the connection to Redis, our culling algorithm won't work and we're OK with that. We don't need our algorithm to be perfect -- we just want it to be good enough to increase the chance that a container is able to recover from a problem on its own.

    For our leaky token bucket, we decided to do the bare minimum: we wanted to have something simple to understand and test. For our use case, it's OK to have the leaky token bucket fully refill every ten minutes.

    /**
     * returns errorWatcher:0, errorWatcher:1,... errorWatcher:5
     * based on the current minute past the hour
     */
    export function makeKey(now: Date) {
      const minutes = Math.floor(now.getMinutes() / 10);
      return `errorWatcher:${minutes}`;
    }
    
    const TEN_MINUTES_IN_SECONDS = 10 * 60;
    const BUCKET_CAPACITY = 10;
    export async function canHaveShutdownToken(now = new Date()): Promise<boolean> {
      const key = makeKey(now);
      const multi = client.multi();
      multi.incr(key);
      multi.expire(key, TEN_MINUTES_IN_SECONDS);
      try {
        const results = await multi.execAsync<[number, number]>();
        return results[0] <= BUCKET_CAPACITY;
      } catch (err) {
        // if we fail here, we want to know about it
        // but we don't want our error watcher to cause more errors
        sampleLog("errorWatcher.token_fetch_error", err);
        return false;
      }
    }
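    
    As a quick illustration of how the bucket behaves (a hypothetical usage sketch, not code from our codebase): calls within the same ten-minute window share one Redis key, so only the first BUCKET_CAPACITY callers get a token.
    
    // hypothetical demo: the first 10 requests in a window are allowed to shut
    // down; requests 11 and 12 are refused until the window rolls over
    async function demoBucket() {
      const now = new Date(); // fix "now" so every call hits the same key
      for (let i = 1; i <= 12; i++) {
        const allowed = await canHaveShutdownToken(now);
        console.log(`request ${i}: ${allowed}`); // true for 1-10, false after
      }
    }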
    

    See "Even better rate-limiting" for a description of how to set up a leaky token bucket that incorporates data from the previous time period to avoid sharp discontinuities between time periods.

    Our container culling code has been running in production for several months now, and it has been working quite well! Over the past two weeks, it successfully shut down 14 containers that weren't going to be able to recover on their own and saved a few engineers from needing to do any manual intervention. The one drawback has been that it makes it easier to ignore the underlying issues that cause containers to get into these bad states in the first place, but it's a tradeoff we're happy to make.

      On our teams, we do our best to ensure that we're fully focused on the most important thing that our team could be doing. To do that, we often "swarm" on the top priority: this is some internal documentation on what that looks like in practice!

      Swarming is when a team is as fully dedicated to their absolute top priority as possible. This is the ideal state for maximum productivity. This is a list of patterns for swarming. No one pattern works in every scenario. Some patterns can be combined! It might be the right move for a group to change patterns in the middle of the work! We're probably also missing great patterns! Get creative! These are just suggestions and starting points.

      💡 Also please note that full-team swarming is often impossible or suboptimal in many cases. You can't just keep adding more people to the same task and always expect that addition to be effective (e.g., nine women can't make a baby in one month).

      Group Programming

      Group programming is any kind of programming with more than 1 person on one screen at a time. There are many different kinds and you should decide which one makes sense instead of just adopting one for every situation. You will want to consider the goal of this type of collaboration, the skill levels and background knowledge of everyone involved, the responsibilities of everyone in the group, and the type of work you're trying to accomplish.

      Roles

      • "navigator" — the person talking about what needs to be done on a high level
      • "driver" — the person with their hands on the keyboard making that happen
      • "observer" — anyone else that is just observing

      Every group programming pattern here has a single navigator and single driver for the sake of simplicity.

      Observers will have an extremely hard time staying engaged unless they are very regularly rotated into more engaging roles.

      Notes

      • Group programming is most useful when you're introducing new people to a code-base, or working on an integration between two platforms.
      • Group programming is extremely exhausting and should involve heavy, heavy use of breaks. I'd recommend 15 minutes per hour AT MINIMUM and adjusting as you feel necessary.
      • Not everyone in the group needs to be a software developer for this to be useful.
      • Group programming with just two people is generally called pair programming or pairing.
      • Group programming with more people is generally called mob programming or mobbing.

      Pair Programming

      Peer Pairing

      A type of pair programming where both developers have roughly the same level of expertise and background knowledge.

      • Recommended for: Really hard things or really unfun work that is less unfun with a friend.
      • Not Recommended for: Simple straightforward tasks: try Divide and Conquer instead. Cases where the work would benefit from more eyes, or when we want to spread knowledge/skills with more people in similar ways: try mob programming.

      Mentor Pairing

      A type of pair programming where one developer is trying to teach another developer something as they do the work together. The two developers could have things to learn from each other, and one of them could possibly not even be a developer at all.

      You'll want to be very thoughtful about who should be the navigator and who should be the driver depending on your teaching goals. You may or may not want to rotate roles.

      • Recommended for: Cases where we want one person to teach another person something in a hands-on way. Hands-on teaching on real work is generally pretty engaging and practical compared to more academic approaches.
      • Not Recommended For: Times when multiple people need the same mentoring — try Directed Mobbing.

      Ping-Pong Pairing

      A type of pair programming where the developers practice TDD, with one person writing a test and the other writing the app code that passes it. You could have only one person writing the tests during the session, or the duties could rotate. Recommended for: really complex and detailed use cases where an adversarial testing technique can drive out a better result.

      Not recommended for: straightforward building (like CRUD stuff)

      Mob Programming

      Mob programming is group programming with more than 2 people. The additional people beyond the driver and the navigator are almost always observers to keep things simple. The traditional recommendation is to change the driver every 10 minutes. Some examples are Directed Mobbing and Peer Mobbing.

      Directed Mobbing

      Directed mobbing is a form of mob programming where there's a group trying to learn something from a particular expert. In general, that expert should never be driving so that the learners are always engaged and learning in a hands-on fashion. Usually the expert will be the navigator, but there may be a point when the learners can even take on the navigator role and allow the expert to continue just as an observer who's around for consultation. If there comes a point where the expert is no longer necessary at all, this style may no longer be useful.

      • Recommended for: teaching/guiding many people the same thing at once.
      • Not Recommended for: groups that have graduated to having the teacher as an observer for an extended amount of time. Move on to something more engaging/efficient!

      💡 Check out this great blog post written by Melissa Dirdo for more on ClassDojo's approach to mobbing.

      Peer Mobbing

      Peer mobbing is a form of mob programming when everyone has roughly the same level of ability and it makes sense to rotate them into navigator/driver positions equally.

      • Recommended for: stuff the team is working through/learning together that they all want to know/understand. getting the team on the same page about practices/processes/conventions. working through a hairy problem that could use lots of different ideas/perspectives.
      • Not Recommended for: simple straightforward things. things that are low value for everyone to learn.

      Divide and Conquer

      If stories are sliced well vertically, a particular story may still have divisible parts along other lines that could allow them to be worked on simultaneously without much overhead. This requires that everyone involved gets in a group and does actual upfront planning and breakdown, but it can be pretty quick ("You do the frontend and I'll do the backend", "You do the model and I'll do the controller"). This does not require any tracking whatsoever in Asana (and probably shouldn't), so once things are broken up well vertically in Asana, feel free to break up a particular story horizontally to collaborate on it if you think that's the best way. You'll probably want to solve integrations between layers first before actually building the layers. There may be other sub-pieces to cut out other than horizontal layers as well!

      • Recommended for: When everything is simple and straightforward and everyone knows what to do and how to do it. When everybody is exhausted from group programming and wants to just bang out some code and listen to Iron Maiden. When someone being mentored is ready to try flying solo to see if they really can apply their new learnings on their own.
      • Not recommended for: Work that can't be parallelized. Teams that don't have widespread capability on the subject matter (they're just going to continuously interrupt each other asking questions). Work that, when broken down, is not straightforward.