Culling Containers with a Leaky Bucket

ClassDojo occasionally has a few containers get into bad states that they're not able to recover from. This normally happens when a database connection gets into a bad state -- we've seen this with Redis, MySQL, MongoDB, and RabbitMQ connections. We do our best to fix these problems, but we also want our containers to have a chance of recovering on their own without manual intervention. We don't want to wake people up at night if we don't need to! Our main strategy for making that happen is having our containers decide whether they should try restarting themselves.

The algorithm we use for this is straightforward: every ten seconds, the container checks if it's seen an excessive number of errors. If it has, it tries to claim a token from our shutdown bucket. If it's able to claim a token, it starts reporting that it's down to our load balancer and container manager (in this case, nomad). Our container manager will take care of shutting down the container and bringing up a new one.

On every container, we keep a record of how many errors we've seen over the past minute. Here's a simplified version of what we're doing:

let recentErrorTimes: number[] = [];
function serverError(...args: unknown[]) {
  recentErrorTimes.push(Date.now());
}

export function getPastMinuteErrorCount() {
  const oneMinuteAgo = Date.now() - 60_000;
  // prune old entries so the array doesn't grow without bound
  recentErrorTimes = recentErrorTimes.filter((t) => t >= oneMinuteAgo);
  return recentErrorTimes.length;
}

Check out "ERROR, WARN, and INFO aren't actionable logging levels" for some more details on ClassDojo's approach to logging and counting errors.

After tracking our errors, we can check on an interval whether we've seen an excessive number of them. If we have, we'll use a leaky token bucket to decide whether or not we should shut down. The leaky token bucket is essential here: without it, a widespread issue impacting all of our containers would cause ALL of our containers to shut down at once and bring the entire site down. We only want to cull a container when we're sure we're leaving enough other containers to handle the load. For us, that means we're comfortable letting up to 10 containers shut themselves down without any manual intervention. Past that point, something bigger is going wrong and we'll want an engineer in the loop.

let isUp = true;
const EXCESSIVE_ERROR_COUNT = 5;
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function check() {
  if (!isUp) return;
  if (getPastMinuteErrorCount() >= EXCESSIVE_ERROR_COUNT && await canHaveShutdownToken()) {
    isUp = false;
    return;
  }

  await delay(10_000);
  check();
}

export function getIsUp() {
  return isUp;
}

At this point, we can use getIsUp to start reporting that we're down to our load balancer and container manager. The container will go through our regular graceful shutdown logic, and when our container manager brings up a replacement, starting from scratch makes it likely that the new container avoids whatever caused the problem in the first place.

router.get("/api/haproxy", () => {
  if (getIsUp()) return 200;
  return 400;
});

We use Redis for our leaky token bucket. If something goes wrong with the connection to Redis, our culling algorithm won't work and we're OK with that. We don't need our algorithm to be perfect -- we just want it to be good enough to increase the chance that a container is able to recover from a problem on its own.

For our leaky token bucket, we decided to do the bare minimum: we wanted to have something simple to understand and test. For our use case, it's OK to have the leaky token bucket fully refill every ten minutes.

/**
 * returns errorWatcher:0, errorWatcher:1, ... errorWatcher:5
 * based on the current minute past the hour
 */
export function makeKey(now: Date) {
  const minutes = Math.floor(now.getMinutes() / 10);
  return `errorWatcher:${minutes}`;
}

const TEN_MINUTES_IN_SECONDS = 10 * 60;
const BUCKET_CAPACITY = 10;
export async function canHaveShutdownToken(now = new Date()): Promise<boolean> {
  const key = makeKey(now);
  const multi = client.multi();
  multi.incr(key);
  multi.expire(key, TEN_MINUTES_IN_SECONDS);
  try {
    const results = await multi.execAsync<[number, number]>();
    return results[0] <= BUCKET_CAPACITY;
  } catch (err) {
    // if we fail here, we want to know about it
    // but we don't want our error watcher to cause more errors
    sampleLog("errorWatcher.token_fetch_error", err);
    return false;
  }
}

See "Even better rate-limiting" for a description of how to set up a leaky token bucket that incorporates data from the previous time period to avoid sharp discontinuities between time periods.
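Here's a minimal sketch of that idea, reusing the client, makeKey, TEN_MINUTES_IN_SECONDS, BUCKET_CAPACITY, and sampleLog pieces from above (canHaveShutdownTokenSmoothed is a hypothetical name, not the function from that post): weight the previous window's count by how much of a sliding ten-minute window it still covers, so the bucket doesn't snap back to full capacity at the window boundary.

export async function canHaveShutdownTokenSmoothed(now = new Date()): Promise<boolean> {
  const currentKey = makeKey(now);
  // the bucket key for the previous ten-minute window
  const previousKey = makeKey(new Date(now.getTime() - 10 * 60 * 1000));
  const multi = client.multi();
  multi.incr(currentKey);
  // keep the key alive long enough to be read back as the "previous" window
  multi.expire(currentKey, 2 * TEN_MINUTES_IN_SECONDS);
  multi.get(previousKey);
  try {
    const [currentCount, , previousCount] = await multi.execAsync<[number, number, string | null]>();
    // fraction of the current ten-minute window that has already elapsed
    const elapsedFraction = (now.getMinutes() % 10) / 10;
    // the previous window counts less and less as the current window fills in
    const weightedCount = currentCount + Number(previousCount ?? 0) * (1 - elapsedFraction);
    return weightedCount <= BUCKET_CAPACITY;
  } catch (err) {
    sampleLog("errorWatcher.token_fetch_error", err);
    return false;
  }
}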

Our container culling code has been running in production for several months now, and it's been working quite well! Over the past two weeks, it successfully shut down 14 containers that weren't going to be able to recover on their own and saved a few engineers from needing to do any manual intervention. The one drawback has been that it makes it easier to ignore the underlying issues that cause these containers to get into bad states in the first place, but that's a tradeoff we're happy to make.

    On our teams, we do our best to ensure that we're fully focused on the most important thing that our team could be doing. To do that, we often "swarm" on the top priority: this is some internal documentation on what that looks like in practice!

    Swarming is when a team is as fully dedicated to their absolute top priority as possible. This is the ideal state for maximum productivity. This is a list of patterns for swarming. No one pattern works in every scenario. Some patterns can be combined! It might be the right move for a group to change patterns in the middle of work! We're probably also missing great patterns! Get creative! These are just suggestions and starting points.

    💡 Also please note that full-team swarming is impossible or suboptimal in many cases. You can't just keep adding more people to the same task and always expect that addition to be effective (e.g., nine women can't make a baby in one month).

    Group Programming

    Group programming is any kind of programming with more than 1 person on one screen at a time. There are many different kinds and you should decide which one makes sense instead of just adopting one for every situation. You will want to consider the goal of this type of collaboration, the skill levels and background knowledge of everyone involved, the responsibilities of everyone in the group, and the type of work you're trying to accomplish.

    Roles

    • "navigator" — the person talking about what needs to be done on a high level
    • "driver" — the person with their hands on the keyboard making that happen
    • "observer" — anyone else that is just observing

    Every group programming pattern here has a single navigator and single driver for the sake of simplicity.

    Observers will have an extremely hard time staying engaged unless they are very regularly rotated into more engaging roles.

    Notes

    • Group programming is most useful when you're introducing new people to a codebase or working on an integration between two platforms.
    • Group programming is extremely exhausting and should involve heavy, heavy use of breaks. I'd recommend 15 minutes per hour AT MINIMUM and adjusting as you feel necessary.
    • Not everyone in the group needs to be a software developer for this to be useful.
    • Group programming with just two people is generally called pair programming, or pairing.
    • Group programming with more people is generally called mob programming, or mobbing.

    Pair Programming

    Peer Pairing

    A type of pair programming where both developers have roughly the same level of expertise and background knowledge.

    • Recommended for: Really hard things or really unfun work that is less unfun with a friend.
    • Not Recommended for: Simple, straightforward tasks (try Divide and Conquer instead), or cases where the work would benefit from more eyes or where we want to spread knowledge/skills to more people in similar ways (try mob programming).

    Mentor Pairing

    A type of pair programming where one developer is trying to teach another developer something as they do the work together. The two developers could have things to learn from each other, and one of them could possibly not even be a developer at all.

    You'll want to be very thoughtful about who should be the navigator and who should be the driver depending on your teaching goals. You may or may not want to rotate roles.

    • Recommended for: Cases where we want one person to teach another person something in a hands-on way. Hands-on teaching on real work is generally pretty engaging and practical compared to more academic approaches.
    • Not Recommended For: Times when multiple people need the same mentoring — try Directed Mobbing.

    Ping-Pong Pairing

    A type of pair programming where the developers practice TDD, with one person writing a test and the other writing the app code that passes it. You could have only one person writing the tests during the session, or the duties could rotate.

    • Recommended for: Really complex and detailed use cases where an adversarial testing technique can drive out a better result.
    • Not Recommended for: Straightforward building (like CRUD stuff).

    Mob Programming

    Mob programming is group programming with more than 2 people. The additional people beyond the driver and the navigator are almost always observers to keep things simple. The traditional recommendation is to change the driver every 10 minutes. Some examples are Directed Mobbing and Peer Mobbing.

    Directed Mobbing

    Directed mobbing is a form of mob programming where there's a group trying to learn something from a particular expert. In general, that expert should never be driving, so that the learners are always engaged and learning in a hands-on fashion. Usually the expert will be the navigator, but there may be a point when the learners can take over the navigation role and allow the expert to continue as an observer who's around for consultation. If there comes a point where the expert is no longer necessary at all, this style may no longer be useful.

    • Recommended for: teaching/guiding many people the same thing at once.
    • Not Recommended for: groups that have graduated to having the teacher as an observer for an extended amount of time. Move on to something more engaging/efficient!

    💡 Check out this great blog post written by Melissa Dirdo for more on ClassDojo's approach to mobbing.

    Peer Mobbing

    Peer mobbing is a form of mob programming where everyone has roughly the same level of ability and it makes sense to rotate them into navigator/driver positions equally.

    • Recommended for: stuff the team is working through/learning together that they all want to know/understand; getting the team on the same page about practices/processes/conventions; working through a hairy problem that could use lots of different ideas/perspectives.
    • Not Recommended for: simple, straightforward things; things that are low value for everyone to learn.

    Divide and Conquer

    If stories are sliced well vertically, a particular story may still have divisible parts along other lines that could allow them to be worked on simultaneously without much overhead. This requires that everyone involved gets in a group and does actual upfront planning and breakdown, but it can be pretty quick ("You do the frontend and I'll do the backend", "You do the model and I'll do the controller"). This doesn't require any tracking in Asana (and probably shouldn't have any), so once things are broken up well vertically in Asana, feel free to break up a particular story horizontally to collaborate on it if you think that's the best way. You'll probably want to solve integrations between layers first before actually building the layers. There may be other sub-pieces to cut out besides horizontal layers as well!

    • Recommended for: When everything is simple and straightforward and everyone knows what to do and how to do it. When everybody is exhausted from group programming and wants to just bang out some code and listen to Iron Maiden. When someone being mentored is ready to try flying solo to see if they really can apply their new learnings on their own.
    • Not recommended for: Work that can't be parallelized. Teams that don't have widespread capability on the subject matter (they're just going to continuously interrupt each other asking questions). Work that, when broken down, is not straightforward.

      Automated and semi-automated code migrations using shell text manipulation tools are great! Turning a migration task that might take multiple days or weeks of engineering effort into one that you can accomplish in a few minutes can be a huge win. I'm not remotely an expert at these migrations, but I thought it'd still be useful to write up the patterns that I use consistently.

      Use ag, rg, or git grep to list files

      Before anything else, you need to edit the right files! If you don't have a way of finding your codebase's files, you might accidentally edit random cache files, package files, editor files, or other dependencies. Editing those files is a good way to end up throwing away a codebase and cloning it from scratch again.

      I normally use ag -l . to list files because ag, the Silver Searcher, is set up to respect .gitignore already. A simple find and replace might look like ag -l . | xargs gsed -i 's|bad pattern|replacement|'. It'd be simpler to do that replacement with your editor, but the ag -l . | xargs gsed -i pattern is one that you can expand on in a larger script.

      Pause for user input: not all migrations are fully automatable

      A lot of migrations can't actually be fully automated. In those cases, it can be worth building a miniature tool to make editing faster (and more fun!).

      # spaces in file names will kill this for loop
      # thankfully, I've never worked in a code base where people put spaces in filenames
      for file in $(ag -l bad_pattern); do
        echo "how should we replace bad_pattern in ${file}? Here's context:"
        ag -C 3 bad_pattern "${file}"
        echo ""
        read -r good_pattern
        # quoting in sed commands is tricky!
        # using `${var}` rather than $var avoids potential problems here
        gsed -i "s|bad_pattern|${good_pattern}|" "${file}"
      done

      You can expand this pattern to look for a number and choose an appropriate option, but just having something that speeds up going through files makes life better!
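      Here's a sketch of that numbered-options variant (the replacement candidates are made up; swap in whatever your migration needs):

      for file in $(ag -l bad_pattern); do
        ag -C 3 bad_pattern "${file}"
        echo "1) replacement_one  2) replacement_two  3) type a replacement  4) skip"
        read -r choice
        case "${choice}" in
          1) good_pattern="replacement_one" ;;
          2) good_pattern="replacement_two" ;;
          3) read -r good_pattern ;;
          *) continue ;;
        esac
        gsed -i "s|bad_pattern|${good_pattern}|" "${file}"
      done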

      Handle relative import paths with for loops

      I've often needed to add a new import statement with a relative path to files as part of a migration, and every time I've been surprised that my editor hasn't been able to help me out more: what am I missing? I normally use a for loop and increase both the max-depth of files I'm looking at and the number of ../ on the path:

      dots="."
      import_path="/file/path"
      for ((depth=0; depth<5; depth++)); do
        dots="${dots}/.."
        # only touch TypeScript files at this depth that don't already have the import
        for file in $(ag -l --depth "${depth}" | grep '\.ts$'); do
          if ! grep -q "${import_path}" "${file}"; then
            gsed -i "1i import '${dots}${import_path}';" "${file}"
          fi
        done
      done

      Rely on your code formatter

      Not needing to worry about code formatting is AMAZING. If your codebase is set up with a code formatter (like prettier or gofmt), it allows you to make changes without worrying about whitespace and then let the code formatter fix things later. It may even make sense to intentionally remove whitespace from a pattern in order to make a replacement simpler to write!
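      As a sketch of that trick (oldCall and newCall are made-up names, and this assumes a prettier setup):

      # collapse any whitespace between the call name and its paren
      ag -l 'oldCall\s*\(' | xargs gsed -Ei 's|oldCall\s*\(|oldCall(|g'
      # now one simple pattern matches every call site
      ag -l 'oldCall\(' | xargs gsed -i 's|oldCall(|newCall(|g'
      # let the formatter clean up anything the replacement left ugly
      npx prettier --write .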

      Use the right tool for the job

      1. Some code migrations require a tool that looks at the AST rather than the text in a code file and transforms that AST. These tools are more powerful & flexible than shell tools, but they require a bit more effort to get working. In NodeJS, there's jscodeshift and codemods (see the sketch after this list). I don't know what's available for other languages.
      2. Your editor & language might support advanced migrations. If they do, learning how to do those migrations with your editor will likely be more effective than these techniques, or at least a useful complement to them.
      3. Bash tools like sed, awk, grep, and cut are designed to deal with text and files. Code is text and files! Other tools work, but they might not be designed to deal with files and streams of text.
      4. Shell tools are great, but a tool you know well and are excited about using is better than a tool you don't want to learn! Whatever programming language you're most comfortable with should have ways of dealing with and changing files and text. Having some way of manipulating text & files is what's important! There are even tools like rb or nq (I wrote this one!) that let you use the Ruby or NodeJS syntax you're familiar with on the command line in a script you're writing.
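      As a sketch of what an AST-based migration looks like, here's a minimal jscodeshift transform that renames one function call to another (oldCall and newCall are hypothetical names):

      // rename-call.js: run with `jscodeshift -t rename-call.js src/`
      module.exports = function transformer(file, api) {
        const j = api.jscodeshift;
        return j(file.source)
          // find every call expression whose callee is the identifier oldCall
          .find(j.CallExpression, { callee: { name: "oldCall" } })
          .forEach((path) => {
            path.node.callee = j.identifier("newCall");
          })
          .toSource();
      };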

      Use sed: it's designed for this

      sed is the streaming text editor, and it's the perfect tool for many code migrations. A surprising number of code migrations boil down to replacing a code pattern that happens on a single line with a different code pattern: sed makes that easy. Here are a few notes:

      1. If you're on a mac, you'll want to download a modern version of sed. I use gnu-sed: brew install gnu-sed
      2. use | (or anything else!) as your delimiter rather than /. sed takes the first character after the command as the delimiter, and / will show up in things that you want to replace pretty often! Writing gsed 's|/path/file.js|/path/file.ts|' is nicer than gsed 's/\/path\/file.js/\/path\/file.ts/'.
      3. In gsed, the --null-data (-z) option separates lines by NUL characters, which lets you easily match and edit multiline patterns (there's a sketch after this list). If you use this, don't forget to use the g flag at the end to get all matches: everything in a file will be on the same 'line' for sed.
      4. When referring to shell variables, use ${VAR_NAME} rather than $VAR_NAME. This will simplify using them in sed commands.
      5. Use -E (or -r with gsed) for extended regular expressions and use capture groups in your regular expressions. git grep -l pattern | xargs gsed -Ei 's|pat(tern)|\1s are birds|g'
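      As a sketch of the --null-data option from note 3 (myFunction is a made-up name), this joins a call that got split across two lines back onto one:

      # with -z the whole file is one 'line', so \n can appear in the pattern
      git grep -l 'myFunction($' | xargs gsed -z -E -i 's|myFunction\(\n\s*|myFunction(|g'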

      ("perl pie" (perl -pi -e) can be another good tool for finding and replacing patterns! It's just not one I know.)

      Many migrations might take multiple steps

      When you're migrating code, don't worry about migrating everything at once. If you can break down the problem into a few different commands, those individual commands can be simple to write: you might first replace a function call with a different one and then update import statements to require the new function that you added.

      When you write a regular expression in a find-and-replace, you can sometimes get false positives. Rather than trying to update your regular expression to skip the false positives, I often find it simpler to write a regular expression to replace those false positives with a temporary pattern, update the remaining matches, and then replace the temporary pattern.
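      Here's a sketch of that placeholder trick (fetchUser, fetchUserAvatar, and fetchAccount are made-up names): imagine renaming fetchUser to fetchAccount without touching fetchUserAvatar.

      # 1. hide the false positives behind a temporary token
      ag -l fetchUserAvatar | xargs gsed -i 's|fetchUserAvatar|__KEEP_ME__|g'
      # 2. migrate the remaining, real matches
      ag -l fetchUser | xargs gsed -i 's|fetchUser|fetchAccount|g'
      # 3. restore the false positives
      ag -l __KEEP_ME__ | xargs gsed -i 's|__KEEP_ME__|fetchUserAvatar|g'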

      With all of this, you'll need to rely on git (or another version control system). It's really easy to make mistakes! If you don't have an easy way to undo mistakes, you'll be sad.

      Automate ALL the code migrations!

      Manipulating text & files like this is a skill, and it's one that takes some practice to learn. Even if it's much slower to automate a code change, spending the time to automate it will help you build the skills to automate larger, more complex, and more valuable code migrations. I remember spending over an hour trying to figure out how to automate changing a pattern that was only in 10 spots in our codebase. It would have taken 5 minutes to do manually, but I'm glad I spent 10x the time doing it the slow way with shell tools because that experience made me capable of tackling more complex migrations that wouldn't be feasible to do manually.
