Posts By: Will Keleher

Culling Containers with a Leaky Bucket

ClassDojo occasionally has a few containers get into bad states that they can't recover from. This normally happens when a database connection gets into a bad state -- we've seen it with Redis, MySQL, MongoDB, and RabbitMQ connections. We do our best to fix these problems, but we also want our containers to have a chance of recovering on their own without manual intervention. We don't want to wake people up at night if we don't need to! Our main strategy for making that happen is to have each container decide whether it should try restarting itself.

The algorithm we use for this is straightforward: every ten seconds, the container checks whether it's seen an excessive number of errors. If it has, it tries to claim a token from our shutdown bucket. If it's able to claim a token, it starts reporting that it's down to our load balancer and container manager (in this case, Nomad). Our container manager then takes care of shutting down the container and bringing up a new one.

On every container, we keep a record of how many errors we've seen over the past minute. Here's a simplified version of what we're doing:

let recentErrorTimes: number[] = [];
function serverError(...args: unknown[]) {
  recentErrorTimes.push(Date.now());
  // drop timestamps older than a minute so this array doesn't grow without bound
  recentErrorTimes = recentErrorTimes.filter((t) => t >= Date.now() - 60_000);
}

export function getPastMinuteErrorCount() {
  return recentErrorTimes.filter((t) => t >= Date.now() - 60_000).length;
}

Check out "ERROR, WARN, and INFO aren't actionable logging levels" for some more details on ClassDojo's approach to logging and counting errors.

After tracking our errors, we can check on an interval whether we've seen an excessive number of them. If we have, we use a leaky token bucket to decide whether to shut down. The leaky token bucket is essential: without it, a widespread issue impacting all of our containers would cause ALL of them to shut down at once and bring the entire site down. We only want to cull a container when we're sure we're leaving enough other containers to handle the load. For us, that means we're comfortable letting up to 10 containers shut themselves down without any manual intervention. Past that point, something bigger is going wrong and we want an engineer in the loop.

let isUp = true;
const EXCESSIVE_ERROR_COUNT = 5;
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function check () {
  if (!isUp) return;
  if (getPastMinuteErrorCount() >= EXCESSIVE_ERROR_COUNT && await canHaveShutdownToken()) {
    isUp = false;
    return;
  }

  await delay(10_000);
  check();
}

export function getIsUp () {
  return isUp;
}

At this point, we can use getIsUp to start reporting that we're down to our load balancer and to our container manager. We'll go through our regular graceful server-shutdown logic, and when our container manager brings up a new container, starting from scratch makes it likely that we'll avoid whatever issue caused the problem in the first place.

router.get("/api/haproxy", () => {
  if (getIsUp()) return 200;
  return 400;
});

We use Redis for our leaky token bucket. If something goes wrong with the connection to Redis, our culling algorithm won't work, and we're OK with that. We don't need our algorithm to be perfect -- we just want it to be good enough to increase the chance that a container is able to recover from a problem on its own.
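
The snippets below reference a Redis client named client whose multi() supports a promise-returning execAsync. A minimal sketch of one way that might be set up (assuming the classic node_redis client promisified with Bluebird, which is what execAsync suggests) could look like this:

import redis from "redis";
import Bluebird from "bluebird";

// promisifying the node_redis prototypes adds promise-returning *Async variants
// of each method, which is where multi().execAsync() comes from
Bluebird.promisifyAll(redis.RedisClient.prototype);
Bluebird.promisifyAll(redis.Multi.prototype);

export const client = redis.createClient({ url: process.env.REDIS_URL });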

For our leaky token bucket, we decided to do the bare minimum: we wanted to have something simple to understand and test. For our use case, it's OK to have the leaky token bucket fully refill every ten minutes.

/**
 * returns errorWatcher:0, errorWatcher:1,... errorWatcher:5
 * based on the current minute past the hour
 */
export function makeKey(now: Date) {
  const minutes = Math.floor(now.getMinutes() / 10);
  return `errorWatcher:${minutes}`;
}

const TEN_MINUTES_IN_SECONDS = 10 * 60;
const BUCKET_CAPACITY = 10;
export async function canHaveShutdownToken(now = new Date()): Promise<boolean> {
  const key = makeKey(now);
  const multi = client.multi();
  multi.incr(key);
  multi.expire(key, TEN_MINUTES_IN_SECONDS);
  try {
    const results = await multi.execAsync<[number, number]>();
    return results[0] <= BUCKET_CAPACITY;
  } catch (err) {
    // if we fail here, we want to know about it
    // but we don't want our error watcher to cause more errors
    sampleLog("errorWatcher.token_fetch_error", err);
    return false;
  }
}

See "Even better rate-limiting" for a description of how to set up a leaky token bucket that incorporates data from the previous time period to avoid sharp discontinuities between time periods.

Our container-culling code has been running in production for several months now, and it has been working quite well! Over the past two weeks, it successfully shut down 14 containers that weren't going to be able to recover on their own and saved a few engineers from needing to intervene manually. The one drawback is that it makes it easier to ignore the underlying issues that cause containers to get into these bad states in the first place, but that's a tradeoff we're happy to make.

    Automated and semi-automated code migrations using shell text manipulation tools are great! Turning a migration task that might take multiple days or weeks of engineering effort into one that you can accomplish in a few minutes can be a huge win. I'm not remotely an expert at these migrations, but I thought it'd still be useful to write up the patterns that I use consistently.

    Use ag, rg, or git grep to list files

    Before anything else, you need to edit the right files! If you don't have a way of finding your codebase's files, you might accidentally edit random cache files, package files, editor files, or other dependencies. Editing those files is a good way to end up throwing away a codebase and cloning it from scratch again.

    I normally use ag -l . to list files because ag, the Silver Searcher, is set up to respect .gitignore already. A simple find and replace might look like ag -l . | xargs gsed -i 's|bad pattern|replacement|'. It'd be simpler to do that replacement with your editor, but the ag -l . | xargs gsed -i pattern is one that you can expand on in a larger script.

    Pause for user input: not all migrations are fully automatable

    A lot of migrations can't actually be fully automated. In those cases, it can be worth building a miniature tool to make editing faster (and more fun!).

    # spaces in file names will kill this for loop
    # thankfully, I've never worked in a code base where people put spaces in filenames
    for file in $(ag -l bad_pattern); do
      echo "how should we replace bad_pattern in ${file}? Here's context:"
      ag -C 3 bad_pattern "${file}"
      echo ""
      read -r good_pattern
      # quoting in sed commands is tricky!
      # using `${var}` rather than $var avoids potential problems here
      gsed -i "s|bad_pattern|${good_pattern}|" "${file}"
    done
    

    You can expand this pattern to look for a number and choose an appropriate option, but just having something that speeds up going through files makes life better!

    Handle relative import paths with for loops

    I've often needed to add a new import statement with a relative path to files as part of a migration, and every time I've been surprised that my editor hasn't been able to help me out more: what am I missing? I normally use a for loop and increase both the max-depth of files I'm looking at and the number of ../ on the path:

    dots="."
    import_path="/file/path"
    for ((depth=0; depth<5; depth++)); do
      dots="${dots}/.."
      # only consider TypeScript files up to this depth
      for file in $(ag -l --depth "${depth}" | grep '\.ts$'); do
        # skip files that already import this path
        if ! grep -q "${import_path}" "${file}"; then
          gsed -i "1i import '${dots}${import_path}';" "${file}"
        fi
      done
    done
    

    Rely on your code formatter

    Not needing to worry about code formatting is AMAZING. If your codebase is set up with a code formatter (like prettier or gofmt), you can make changes without worrying about whitespace and let the formatter fix things up later. It may even make sense to intentionally remove whitespace from a pattern to make a replacement simpler to write!

    Use the right tool for the job

    1. Some code migrations require a tool that looks at the AST rather than the text in a code file and transforms that AST. These tools are more powerful & flexible than shell tools, but they require a bit more effort to get working. In NodeJS, there's jscodeshift and codemods (there's a rough sketch of a jscodeshift transform after this list). I don't know what's available for other languages.
    2. Your editor & language might support advanced migrations. If it does, learning how to do those migrations with your editor will likely be more effective than using these techniques or may prove a useful complement to these techniques.
    3. Bash tools like sed, awk, grep, and cut are designed to deal with text and files. Code is text and files! Other tools work, but they might not be designed to deal with files and streams of text.
    4. Shell tools are great, but a tool you know well and are excited about using is better than a tool you don't want to learn! Whatever programming language you're most comfortable with should have ways of dealing with and changing files and text. Having some way of manipulating text & files is important! There are even tools like rb or nq (I wrote this one!) that let you use the Ruby or NodeJS syntax you're familiar with on the command line in a script you're writing.
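
    To give a flavor of the AST-based approach from the first item above, here's a rough sketch (not a tested codemod) of a jscodeshift transform that renames one identifier to another. The identifier names are made up for illustration, and you'd run it with something like npx jscodeshift -t rename-identifier.ts src/:

    // rename-identifier.ts -- a minimal jscodeshift transform sketch
    import type { Transform } from "jscodeshift";

    const transform: Transform = (fileInfo, api) => {
      const j = api.jscodeshift;
      return j(fileInfo.source)
        // find every identifier named `badPattern` (a hypothetical name)
        .find(j.Identifier, { name: "badPattern" })
        // ...and swap it for `goodPattern`
        .replaceWith(() => j.identifier("goodPattern"))
        .toSource();
    };

    export default transform;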

    Use sed: it's designed for this

    sed is the streaming text editor, and it's the perfect tool for many code migrations. A surprising number of code migrations boil down to replacing a code pattern that happens on a single line with a different code pattern: sed makes that easy. Here are a few notes:

    1. If you're on a mac, you'll want to download a modern version of sed. I use gnu-sed: brew install gnu-sed
    2. use | (or anything else!) as your delimiter rather than /. sed takes the first character after the command as the delimiter, and / will show up in things that you want to replace pretty often! Writing gsed 's|/path/file.js|/path/file.ts|' is nicer than gsed 's/\/path\/file.js/\/path\/file.ts/'.
    3. In gsed, the --null-data (-z) option separates lines by NUL characters, which lets you easily match and edit multiline patterns. If you use this, don't forget to use the g flag at the end to get all matches: everything in a file will be on the same 'line' for sed.
    4. When referring to shell variables, use ${VAR_NAME} rather than $VAR_NAME. This will simplify using them in sed commands.
    5. Use -E (or -r with gsed) for extended regular expressions and use capture groups in your regular expressions. git grep -l pattern | xargs gsed -Ei 's|pat(tern)|\1s are birds|g'

    ("perl pie" (perl -pi -e) can be another good tool for finding and replacing patterns! It's just not one I know.)

    Many migrations might take multiple steps

    When you're migrating code, don't worry about migrating everything at once. If you can break down the problem into a few different commands, those individual commands can be simple to write: you might first replace a function call with a different one and then update import statements to require the new function that you added.

    When you write a regular expression in a find-and-replace, you can sometimes get false positives. Rather than trying to update your regular expression to skip the false positives, I often find it simpler to write a regular expression to replace those false positives with a temporary pattern, update the remaining matches, and then replace the temporary pattern.

    With all of this, you'll need to rely on git (or another version control system). It's really easy to make mistakes! If you don't have an easy way to undo mistakes, you'll be sad.

    Automate ALL the code migrations!

    Manipulating text & files like this is a skill, and it's one that takes some practice to learn. Even if it's much slower to automate a code change, spending the time to automate it will help you build the skills to automate larger, more complex, and more valuable code migrations. I remember spending over an hour trying to figure out how to automate changing a pattern that was only in 10 spots in our codebase. It would have taken 5 minutes to do manually, but I'm glad I spent 10x the time doing it the slow way with shell tools because that experience made me capable of tackling more complex migrations that wouldn't be feasible to do manually.

      In the dark times before AsyncLocalStorage, it could be hard to tell why a request would occasionally time out. Were there multiple relatively slow queries somewhere in that route's code? Was another request on the same container saturating a database connection pool? Did another request block the event loop? It was possible to use tracing, profiling, and logs to track down problems like these, but it could be tricky; setting up per route metrics using AsyncLocalStorage makes it a ton easier!

      When ClassDojo set up our AsyncLocalStorage-based per-route instrumentation we found things like:

      • a route that occasionally made 30,000+ database requests because it was fanning out over a large list of items
      • another route that blocked the event-loop for 15-20 seconds a few times a day, and caused timeouts for any other requests that our server was handling at the same time
      • a third route that was occasionally fetching 500,000+ items to render simple counts to return to clients

      I wrote about this a bit more in AsyncLocalStorage Makes the Commons Legible. If you're not familiar with ClassDojo, it's a parent-teacher communication platform. Our monolithic NodeJS web-server backend API normally serves ~10,000 requests per second.

      I'd like to go through some of the details of how we set up this per-request instrumentation. For this post, we'll be starting with a relatively standard NodeJS web-server with pre-router middleware, and a router that finds an appropriate route to handle a request. It should look something like this:

      // onFinished here is presumably the on-finished npm package: it runs its
      // callback once the response has finished being sent
      app.use(({ req, res }, next) => {
        const start = Date.now();
        onFinished(res, () => afterResponse(req, res, start));
        next();
      });
      app.use(rateLimitingMiddleware);
      app.use(bodyParserMiddleware);
      app.use(allOfTheRestOfOurMiddleware);
      
      app.use(router);
      app.use(notFoundMiddleware);
      

      To add instrumentation to this setup, we'll want to do the following:

      1. Create a per-request async store in our first middleware
      2. Store details about the database request caused by our request in our request's async store
      3. Send the request's database request details to our data lake.
      4. If any of the database request details violate our per-request limits, we log it as a server-error so that a team can see it & take action

      Starting our per-request async store

      In a NodeJS web server, each middleware calls the next, so if we start an async local storage context in our very first middleware, every subsequent middleware should have access to the same storage context. (I had a lot of trouble understanding why this worked, so I wrote up a simplified gist that hopefully demonstrates what's going on.)
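
      If that propagation still feels a bit magical, here's a tiny standalone sketch (separate from the middleware below) showing that any work scheduled inside run() keeps seeing the same store:

      import { AsyncLocalStorage } from "async_hooks";

      const als = new AsyncLocalStorage<{ requestId: number }>();

      function logRequestId(label: string) {
        // getStore() returns whichever store was active when this async work was scheduled
        console.log(label, als.getStore()?.requestId);
      }

      als.run({ requestId: 1 }, () => {
        setTimeout(() => logRequestId("timeout"), 10); // logs: timeout 1
        Promise.resolve().then(() => logRequestId("promise")); // logs: promise 1
      });
      als.run({ requestId: 2 }, () => {
        setTimeout(() => logRequestId("timeout"), 10); // logs: timeout 2
      });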

      import { AsyncLocalStorage } from "async_hooks";
      
      export const requestTrackingAsyncLocalStore = new AsyncLocalStorage();
      
      // requestTrackingAsyncLocalStoreMiddleware wraps the downstream koa middlewares inside an async local storage context
      export function requestTrackingAsyncLocalStoreMiddleware({ req, res }, next) {
        const store = {
          requestCost: {}, // starts empty; the increment() calls below fill it in
          req,
          res,
        };
        // running the next middleware in the chain in the context of this 'run' makes sure that all calls
        // to getStore() in the scope of this request are bound to the correct store instance
        return requestTrackingAsyncLocalStore.run(store, next);
      }

      // add this to the app! (this would be in a different file)
      app.use(requestTrackingAsyncLocalStoreMiddleware);
      app.use(rateLimitingMiddleware);
      app.use(....);
      

      Store details about request behavior in our per-request async store

      Now that we have a per-request async local store, we can grab it and start using it! We'll want to learn:

      1. How many database requests do we make over the course of an HTTP request? Are we running into the N+1 query problem on any of our routes?
      2. How long do those database requests take in total? Requests that take a long time can indicate spots where we're doing a lot of expensive work.
      3. How many documents are these requests returning? If we're processing 10,000s of documents in NodeJS, that can slow down a server quite a bit, and we may want to move that work to our database instead.

      To answer those questions, we can add a small increment helper that our database wrappers call:

      import _ from "lodash";

      export function increment(type: "request_count" | "duration" | "document_count", table: string, n: number = 1) {
        const store = requestTrackingAsyncLocalStore.getStore();
        // we'll probably want to track this to see if we're losing async context over the course of a request
        if (!store) return;
        _.set(store, ["requestCost", type], _.get(store, ["requestCost", type], 0) + n);
        _.set(store, ["requestCost", "byTable", table, type], _.get(store, ["requestCost", "byTable", table, type], 0) + n);
      }
      

      If we add code that wraps our database client's request, it should hopefully be easy to add these increment calls at an appropriate point.
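
      As an illustration (this isn't our actual database wrapper), a sketch of what that wrapping might look like, assuming a hypothetical runQuery(table, query) function that resolves to an array of documents:

      // hypothetical low-level client call -- substitute your driver's equivalent
      declare function runQuery<T>(table: string, query: object): Promise<T[]>;

      async function instrumentedQuery<T>(table: string, query: object): Promise<T[]> {
        const start = Date.now();
        try {
          const docs = await runQuery<T>(table, query);
          increment("document_count", table, docs.length);
          return docs;
        } finally {
          // count the request and its duration even if the query throws
          increment("request_count", table, 1);
          increment("duration", table, Date.now() - start);
        }
      }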

      Handle the request report

      Once we have this request report, we can do whatever we'd like with it! At ClassDojo, we log a server-error whenever a route is doing anything particularly egregious: that way, we get quick feedback when we've made a mistake. We also use a firehose to send this data to redshift (our data lake) so that we can easily query it. Either way, this is something that we can do after we're done sending our response to the client:

      app.use(requestTrackingAsyncLocalStoreMiddleware);
      app.use(({ req, res }, next) => {
        // AsyncResource (also from "async_hooks") preserves the async context for
        // the "finish" listener, which fires once the response has been sent
        res.on("finish", new AsyncResource("requestTrackingLogging").bind(() => {
          const store = requestTrackingAsyncLocalStore.getStore();
          if (!store) throw new Error(`Something has gone awry with our async tracking!`);
          if (isEgregiouslyBad(store.requestCost)) logOutBadRequest(store);
          requestCostFirehose.write(store);
        }));
        next();
      });
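
      The isEgregiouslyBad check above is just a set of per-request limits. The exact thresholds aren't in this post, so here's a hypothetical version with made-up numbers to show the shape of it:

      interface RequestCost {
        request_count?: number;
        duration?: number; // total database time in ms
        document_count?: number;
      }

      // made-up thresholds: tune these to whatever "egregious" means for your service
      export function isEgregiouslyBad(requestCost: RequestCost): boolean {
        return (
          (requestCost.request_count ?? 0) > 1_000 ||
          (requestCost.duration ?? 0) > 10_000 ||
          (requestCost.document_count ?? 0) > 50_000
        );
      }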
      

      Tracking down places where we lose async context

      While the async store might feel like magic, it's not, and some common situations will cause you to lose async context:

      1. Using callbacks rather than promises. In those situations, you'll need to create an AsyncResource to bind the current async context:
      setTimeout(new AsyncResource("timeout").bind(() => doRequestTrackingThings()), 1);
      redisClient.get("key", new AsyncResource("redisGet").bind(() => doRequestTrackingThings()));

      2. Some promise libraries might not support async_hooks. Bluebird does, but requires setting asyncHooks to true: Bluebird.config({ asyncHooks: true });

      It may take a bit of work to track down and fix all of the places where you're losing async context. Setting up your increment calls to log out details about those situations can help!

      export function increment(type: "request_count" | "duration" | "document_count", table: string, n: number = 1) {
        const store = requestTrackingAsyncLocalStore.getStore();
        if (!store) {
          logServerError(`We lack async context for a call to increment ${type} ${table} by ${n}`, new Error().stack);
          return;
        }
        ...
      }
      

      Increased Observability is great!

      Putting effort into increasing the observability of a system can make that system much easier to manage. For a NodeJS web-server, we've found a lot of benefits in using AsyncLocalStorage to improve per-request visibility: it has let us improve latency on a few routes, reduced our event-loop blocking, and given us a better view of opportunities to improve performance.
