Slack is the Worst Info-Radiator

When the ClassDojo engineering team was in the office, we loved our information radiators: we had multiple huge monitors showing broken Jenkins builds, alerts, and important performance statistics. They worked amazingly well: they helped us keep our CI/CD pipelines fast & unblocked, keep the site up & fast, and build an engineering culture that prioritized the things we showed on them. That worked while the whole team was in the office, but when we went fully remote, our initial attempt at moving the same information into a Slack channel failed completely, and we had to find a different way to get the same value.

[Image: an open office with a row of 4 monitors displaying production metrics across the back wall]

Most teams have an #engineering-bots channel of some sort: it quickly fills up with alerts & broken builds, and everyone quickly learns to ignore it. For most of these things, knowing that something broke isn't particularly interesting: we want to know the current state of the world, and that's impossible to glean from a Slack channel (unless everyone on the team has inhuman discipline around claiming & updating these alerts).

We had, and still have, an #engineering-bots channel that gets hundreds of messages per day. As far as I know, every engineer on the team has that channel muted because the signal-to-noise ratio is far too low. That meant we occasionally missed alerts entirely because they quickly scrolled out of view, and important builds would stay broken for weeks. This made build fixes expensive, let small production issues linger, and slowed down our teams.

[Image: a Slack channel full of alert messages]

After about a year of frustration, we decided we needed to give people a way to set up in-home info-radiators. We had a few requirements for a remote-work info-radiator:

  1. It needed to be configurable: teams needed a way to see only their broken builds & the alerts that they cared about. Most of the time, the info-radiator shouldn't show anything at all!
  2. It needed to be on an external display: not everyone had an office setup with enough monitor real-estate to dedicate to a dashboard page and keep it visible
  3. It needed to display broken builds from multiple Jenkins locations, broken builds from GitHub Actions, and triggered alerts from Datadog and PagerDuty on a single display

We set up a script that fetches data from Jenkins, GitHub Actions, Datadog, PagerDuty, and Prowler, transforms it into an easily consumable JSON file, and uploads that file to S3. We then have a simple progressive web app, installed on small, cheap Android displays, that fetches the JSON file regularly, filters it for the builds each person cares about, and renders them nicely.
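
As a rough sketch of what that pipeline can look like (the type, function, and bucket names below are illustrative, not our exact code), the aggregation script just normalizes every source into a common status record and uploads a single JSON file:

import { S3 } from "@aws-sdk/client-s3";

// Hypothetical normalized shape that every source (Jenkins, GitHub Actions,
// Datadog, PagerDuty, Prowler) gets converted into.
type StatusItem = {
  source: "jenkins" | "github" | "datadog" | "pagerduty" | "prowler";
  name: string; // build, monitor, or alert name
  owner: string; // team or person, used for filtering on each display
  status: "ok" | "broken" | "alerting";
  url: string; // link back to the build or alert
  updatedAt: string; // ISO timestamp
};

const s3 = new S3({ region: "us-east-1" });

export async function publishRadiatorData(fetchers: Array<() => Promise<StatusItem[]>>) {
  // fetch from every source in parallel; one failing fetcher shouldn't hide the others
  const results = await Promise.allSettled(fetchers.map((fetch) => fetch()));
  const items = results.flatMap((r) => (r.status === "fulfilled" ? r.value : []));

  await s3.putObject({
    Bucket: "example-info-radiator", // hypothetical bucket name
    Key: "status.json",
    Body: JSON.stringify({ generatedAt: new Date().toISOString(), items }),
    ContentType: "application/json",
  });
}

The progressive web app on each display then only has to fetch status.json on an interval, filter the items by owner, and render whatever is currently broken or alerting.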

[Image: an info-radiator with a broken build highlighted]
[Image: a small Android display running the info-radiator on a desk]

These remote info-radiators have made it much simpler to stay on top of alerts & broken builds, and have sped us up as an engineering organization. There's been a lot written about how valuable info-radiators can be for a team, but I never appreciated their value until we didn't have them, and the work we put into making sure we had remote ones has already more than paid for itself.

    Having a software engineer on-call at a tech company ensures that if any failures or alerts occur, there is someone dedicated to responding to them. Most engineers will have experienced some form of being on-call over their career. I have seen many posts on Hacker News about being on-call, most of them with negative opinions and poor experiences. Thankfully, I cannot commiserate with those examples, so my goal in this post is to show how we have tried to reduce the pain points of being on-call at ClassDojo.

    The Basics

    Our core engineers rotate being on-call on a weekly basis, switching on Monday mornings, US west coast time. With our current set of engineers, an individual can expect to have approximately 4 shifts a year or one every 3 months, though this changes as our team grows. If your shift falls on a planned vacation week or you have something planned on a certain day, other engineers on the team will happily switch and support. We use PagerDuty to manage scheduling shifts and handling alerts. It also allows us to easily schedule overrides. Engineers who are doing risky work will often override the on-call alerts until they have completed their task. Other engineers are always willing to take a few hours of someone else’s on-call shift if they have some availability conflict. Folks are pretty flexible about being on-call, which makes things better for everyone.

    For the most part we don’t have separation of concerns, so our engineers operate across the stack. People who work on our services also write the tests and are responsible for ensuring production is healthy.

    Inconvenient Alerts

    The question on a lot of people’s minds is about being woken up at 2am. We try to reduce middle-of-the-night wake-ups with our humanity > PagerDuty policy. We have engineers across a few timezones, so we have additional rotating schedules where alerts are redirected to colleagues who are awake in Europe and South America. That’s not to say they won’t wake anyone up if something is wrong, but they can act as a first line of defense for a middle-of-the-night alert. If you do get woken up, we don’t expect you to work a normal workday afterwards. This isn’t an exercise in sleep deprivation. When an alert goes off at night, we assess the situation and fix the alert if needed. We take people being woken up seriously.

    Expectations

    The engineer on-call is expected to focus on triage-based work, not on their product team tasks. At a minimum this means acknowledging any alerts, investigating issues, and gathering other engineers if they need support in fixing the issue. The expectation is not that you will know how to respond to everything, but that you can manage the situation and see the issue through to the end, with support if needed. The main application is a monolith, and every route is linked to a team. We try to route alerts intelligently to the specific team that owns them, which makes on-call work more of a general safety net. Especially during our busy back-to-school season, it can be helpful for the on-call engineer to point out issues arising for a specific team that they might not have noticed.
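
    To make the route-to-team idea concrete, here’s a minimal sketch of what that ownership mapping can look like; the team names and route patterns are made up for illustration, and this isn’t necessarily how our routing is actually implemented:

    type Team = "messaging" | "classroom" | "platform";

    // hypothetical mapping from route patterns to the team that owns them
    const routeOwners: Array<{ pattern: RegExp; team: Team }> = [
      { pattern: /^\/api\/messages/, team: "messaging" },
      { pattern: /^\/api\/classes/, team: "classroom" },
    ];

    export function teamForRoute(path: string): Team {
      // anything unclaimed falls back to the general on-call safety net
      return routeOwners.find(({ pattern }) => pattern.test(path))?.team ?? "platform";
    }

    An alert about elevated errors on /api/messages routes would then page the messaging team’s rotation first, with the on-call engineer as the backstop.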

    Alerts should not constantly be going off. Runbooks should be created or updated, and thresholds on monitors should be tweaked. If a major incident does occur, an investigation should take place, a post-mortem should be organized, and follow-up tasks should be prioritized.

    On-call Projects

    Engineers might have cross-team work that they’ve set aside to do during their shift, and there is always work that can be picked up from one of our guilds. Many engineers enjoy this time to work on improvements they didn’t have time for during regular weeks, or take the time to learn something new. Culling Containers with a Leaky Bucket is a recent project that resulted from triage issues. Projects that automate tasks or improve our test speed are highly celebrated.

    Growing Pains

    As our engineering team grows, some interesting cons might crop up. If engineers start to have shifts months and months apart, will newcomers get good training in this area? Will people feel confident being on-call when they do it so infrequently? We can look at partitioning days better as our pool of colleagues across time zones grows. We could look at shortening shifts, but it might be a bumpy transition.

    Conclusion

    We can easily get focused on only fixing problems within our product team domain. Being on-call allows us to lift our heads and see what is happening across the company. There might not always be an epic mystery to solve, but the breathing room while on-call gives engineers the chance to recognize patterns across teams and the time to implement improvements.

    On a personal note: I had a lot of anxiety about being on-call when I first started at ClassDojo. The breadth of the alerts, and not knowing if I was going to get woken up or have an incident over the weekend, all contributed to my stress. Over time my confidence grew, along with my knowledge of our system and my ability to investigate different problems. Even if I couldn’t solve the problem myself, I felt better reaching out for help when I could present what I believed the issue was. Judging what is a critical issue (raise the alarm) vs. what can wait until Monday is still a nuance I am working on. As we scale and expand, there is much more to learn and more alerts to finesse.

    I hope that I have convinced you that being on-call at ClassDojo should not be seen as a negative. Every time I am on-call now, I take it as a challenge to improve something (no matter how small) and an opportunity to learn. I find enjoyment in a good investigation and supporting other teams when I see something abnormal happening.

      ClassDojo occasionally has a few containers get into bad states that they're not able to recover from. This normally happens when a connection to a backing service gets into a bad state -- we've seen this with Redis, MySQL, MongoDB, and RabbitMQ connections. We do our best to fix these problems, but we also want our containers to have a chance of recovering on their own without manual intervention. We don't want to wake people up at night if we don't need to! Our main strategy for making that happen is having our containers decide whether they should try restarting themselves.

      The algorithm we use for this is straightforward: every ten seconds, the container checks whether it's seen an excessive number of errors. If it has, it tries to claim a token from our shutdown bucket. If it's able to claim a token, it starts reporting that it's down to our load balancer and container manager (in our case, Nomad). Our container manager then takes care of shutting down the container and bringing up a new one.

      On every container, we keep a record of how many errors we've seen over the past minute. Here's a simplified version of what we're doing:

      let recentErrorTimes: number[] = [];

      export function serverError(...args: unknown[]) {
        // (logging of args omitted in this simplified version)
        recentErrorTimes.push(Date.now());
      }

      export function getPastMinuteErrorCount () {
        const oneMinuteAgo = Date.now() - 60_000;
        // drop anything older than a minute so the array doesn't grow forever
        recentErrorTimes = recentErrorTimes.filter((t) => t >= oneMinuteAgo);
        return recentErrorTimes.length;
      }
      

      Check out "ERROR, WARN, and INFO aren't actionable logging levels" for some more details on ClassDojo's approach to logging and counting errors.

      After tracking our errors, we can then check whether we've seen an excessive number of errors on an interval. If we've seen an excessive number of errors we'll use a leaky token bucket to decide whether or not we should shut down. Having a leaky token bucket for deciding whether or not we should try to shut down the container is essential: if we don't have that, a widespread issue that's impacting all of our containers would cause ALL of our containers to shut down and we'd bring the entire site down. We only want to cull a container when we're sure that we're leaving enough other containers to handle the load. For us, that means we're comfortable letting up to 10 containers shut themselves down without any manual intervention. After that point, something is going wrong and we'll want an engineer in the loop.

      let isUp = true;
      const EXCESSIVE_ERROR_COUNT = 5;
      const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

      export async function check () {
        // once we've decided to shut down, stay down
        if (!isUp) return;
        if (getPastMinuteErrorCount() >= EXCESSIVE_ERROR_COUNT && (await canHaveShutdownToken())) {
          isUp = false;
          return;
        }

        // otherwise, check again in ten seconds
        await delay(10_000);
        return check();
      }

      export function getIsUp () {
        return isUp;
      }
      

      At this point, we can use getIsUp to start reporting that we're down to our load balancer and our container manager. We'll go through our regular graceful server shutdown logic, and when our container manager brings up a new container, starting from scratch makes it likely that we'll avoid whatever issue caused the problem in the first place.

      // simplified health-check route: the load balancer and container manager
      // treat any non-2xx response as "this container is down"
      router.get("/api/haproxy", () => {
        if (getIsUp()) return 200;
        return 400;
      });
      
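      What the graceful shutdown itself looks like depends on your setup; as a minimal sketch, assuming a Node HTTP server and a container manager that sends SIGTERM once the health check is failing:

      import { createServer } from "http";

      const server = createServer(/* request handler */);

      process.on("SIGTERM", () => {
        // stop accepting new connections and exit once in-flight requests have drained,
        // so the container manager can replace us with a fresh container
        server.close(() => process.exit(0));
        // safety valve: don't drain forever if a connection never closes
        setTimeout(() => process.exit(1), 30_000).unref();
      });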

      We use Redis for our leaky token bucket. If something goes wrong with the connection to Redis, our culling algorithm won't work, and we're OK with that. We don't need the algorithm to be perfect -- we just want it to be good enough to increase the chance that a container is able to recover from a problem on its own.

      For our leaky token bucket, we decided to do the bare minimum: we wanted to have something simple to understand and test. For our use case, it's OK to have the leaky token bucket fully refill every ten minutes.

      /**
       * returns errorWatcher:0 through errorWatcher:5,
       * based on which ten-minute window of the hour we're in
       */
      export function makeKey(now: Date) {
        const minutes = Math.floor(now.getMinutes() / 10);
        return `errorWatcher:${minutes}`;
      }
      
      const TEN_MINUTES_IN_SECONDS = 10 * 60;
      const BUCKET_CAPACITY = 10;
      export async function canHaveShutdownToken(now = new Date()): Promise<boolean> {
        const key = makeKey(now);
        const multi = client.multi();
        multi.incr(key);
        multi.expire(key, TEN_MINUTES_IN_SECONDS);
        try {
          const results = await multi.execAsync<[number, number]>();
          return results[0] <= BUCKET_CAPACITY;
        } catch (err) {
          // if we fail here, we want to know about it
          // but we don't want our error watcher to cause more errors
          sampleLog("errorWatcher.token_fetch_error", err);
          return false;
        }
      }
      

      See "Even better rate-limiting" for a description of how to set up a leaky token bucket that incorporates data from the previous time period to avoid sharp discontinuities between periods.
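
      I won't reproduce that post here, but as a rough sketch of the idea (assuming the same Redis client and makeKey as above, and a promise-based GET), you can weight the previous window's count by how much of it still overlaps a sliding ten-minute window:

      const WINDOW_SECONDS = 10 * 60;

      export async function slidingWindowShutdownCount(now = new Date()): Promise<number> {
        const previousWindow = new Date(now.getTime() - WINDOW_SECONDS * 1000);
        const [current, previous] = await Promise.all([
          client.get(makeKey(now)),
          client.get(makeKey(previousWindow)),
        ]);

        // fraction of the current ten-minute window that has already elapsed
        const elapsed = ((now.getMinutes() % 10) * 60 + now.getSeconds()) / WINDOW_SECONDS;

        // the previous window's tokens "leak away" as the current window fills in
        return Number(current ?? 0) + Number(previous ?? 0) * (1 - elapsed);
      }

      canHaveShutdownToken could then compare this blended count against BUCKET_CAPACITY instead of the raw INCR result, so the bucket never snaps back to empty right at a window boundary.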

      Our container-culling code has been running in production for several months now, and it's been working quite well! Over the past two weeks, it successfully shut down 14 containers that weren't going to be able to recover on their own, and saved a few engineers from needing to intervene manually. The one drawback is that it makes it easier to ignore the underlying issues that put these containers into bad states in the first place, but that's a tradeoff we're happy to make.
