Canary Containers at ClassDojo in Too Much Detail

Canary releases are pretty great! ClassDojo uses them as part of our continuous delivery pipeline: having a subset of real users use & validate our app before continuing with deploys allows us to safely & automatically deploy many times a day.

Our canary releases are conceptually simple:

  1. we start canary containers with a new container image
  2. we then route some production traffic to these containers
  3. we monitor them: if any canary container sees a problem, we stop our pipeline; if they handle enough traffic without problems, we start a full production deploy

Simple enough, right? There are a few details that go into setting up a system like this, and I'd like to take you through how ClassDojo does it. Our pipeline works well for our company's needs, and I think it's a good example of what this kind of canary-gated deploy can look like.

The key pieces of our system:

  1. We have a logging taxonomy that lets us accurately detect server-errors that we want to fix. ("Errors" that we don't want to fix aren't actually errors!)
  2. HAProxy, Consul, and Nomad let us route a subset of production traffic to a group of canary containers running new code
  3. Our canary containers expose a route that reports the count of errors seen and the count of total requests handled, and a monitoring script in our Jenkins pipeline polls that route
  4. The monitoring script will stop our deployment if it sees a single error. If it sees 75,000 successful production requests, it will let the deploy go to production. (75,000 is an arbitrary number that gives us a 99.9% chance of catching errors that happen once every 10^4 requests; the quick check after this list shows the arithmetic.)
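To sanity-check that number: if a bug causes an error once every 10^4 requests, the chance that 75,000 requests all succeed is (1 - 1/10^4)^75,000 ≈ e^-7.5 ≈ 0.00055, so a canary serving that much traffic catches a bug at that rate about 99.9% of the time.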

Starting canary containers

ClassDojo uses Nomad for our container orchestration, so once we've built a Docker image and tagged it with our `updated_image_id`, we can deploy it by running `nomad run api-canary.nomad`.

```hcl
// api-canary.nomad
job "api-canary" {
  group "api-canary-group" {
    count = 8
    task "api-canary-task" {
      driver = "docker"
      config {
        image = "updated_image_id"
      }
      service {
        name = "api-canary"
        port = "webserver_http"
        // this registers this port on these containers with consul as eligible for "canary" traffic
      }
      resources {
        cpu    = 5000 # MHz
        memory = 1600 # MB
        network {
          port "webserver_http" {}
        }
      }
    }
  }
}
```

Nomad takes care of running these 8 (`count = 8`) canary containers on our Nomad clients. At this point, we have running containers, but they're not serving any traffic.
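If you want to confirm the canaries actually came up before any traffic shifts over, Nomad's standard CLI can show the job's allocations (this check is a convenience, not a step in our pipeline):

```shell
# shows allocation status for the 8 canary containers
nomad job status api-canary
```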

Routing traffic to our canary containers

Remember that Nomad job file we looked at above? Part of what it was doing was registering a service in Consul. We tell Consul that the `webserver_http` port can provide the `api-canary` service.

```hcl
service {
  name = "api-canary"
  port = "webserver_http"
}
```

We use HAProxy for load balancing, and we use consul-template to regenerate our HAProxy configs every 30 seconds based on the service information that Consul knows about.

```
backend api
  mode http
  # I'm omitting a *ton* of detail here!
  # See https://engineering.classdojo.com/2021/07/13/haproxy-graceful-server-shutdowns
  # for how we do graceful deploys with HAProxy

{{ range service "api-canary" }}
  server canary_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}

# as far as HAProxy is concerned, the canary containers above should be treated the
# same as our regularly deployed containers: it will round-robin traffic to all of them
{{ range service "api" }}
  server api_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}
```
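Our real wiring has more going on, but a minimal sketch of rendering that template and reloading HAProxy might look like this (the template path, output path, and reload command here are illustrative assumptions; `-template "source:destination:command"` and `-wait` are standard consul-template flags):

```shell
# Sketch only: paths and the reload command are illustrative, not our exact setup.
# -wait batches rapid Consul changes before re-rendering the config.
consul-template \
  -template "haproxy.cfg.tpl:/etc/haproxy/haproxy.cfg:systemctl reload haproxy" \
  -wait "30s:60s"
```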

Monitoring our canary containers

Whenever we see an error, we increment a local counter saying that we saw the error. What counts as an error? For us, an error is something we need to fix (most often 500s or timeouts): if something can't be fixed, it's part of the system, and we need to design around it. If you're curious about our approach to categorizing errors, Creating An Actionable Logging Taxonomy digs into the details. Having an easy way of identifying real problems that should stop a canary deploy is the key piece that makes this system work.

```typescript
let errorCount = 0;
export const getErrorCount = () => errorCount;

export function logServerError(errorDetails: ErrorDetails) {
  errorCount++;
  metrics.increment("serverError");
  winstonLogger.log("error", errorDetails);
}
```

Similarly, whenever we finish with a request, we increment another counter saying we saw the request. We can then expose both of these counts on our status route. There are probably better ways of publishing this information to our monitoring script than via our main server, but it works well enough for our needs.

```typescript
router.get("/api/errorAndRequestCount", () => {
  return {
    errorCount: getErrorCount(),
    requestCount: getRequestsSeenCount(),
    ...otherInfo,
  };
});
```
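We don't show `getRequestsSeenCount` above; a minimal sketch of the request-side counter, assuming an Express-style middleware (the middleware itself is illustrative, not our exact code):

```typescript
import type { Request, Response, NextFunction } from "express";

// illustrative counterpart to getErrorCount: count each request once its
// response has finished, matching "whenever we finish with a request" above
let requestsSeenCount = 0;
export const getRequestsSeenCount = () => requestsSeenCount;

export function countRequests(req: Request, res: Response, next: NextFunction) {
  res.once("finish", () => {
    requestsSeenCount++;
  });
  next();
}
```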

Finally, we can use consul-template to re-generate our list of canary hosts & ports, and write a monitoring script to check the `/api/errorAndRequestCount` route on all of them. If we see an error, we can run `nomad job stop api-canary && exit 1`, and that will stop our canary containers & our deployment pipeline.

```shell
consul-template -template canary.tpl:canary.txt -once
```

```
{{ range service "api-canary" }}
{{ .Address }}:{{ .Port }}
{{ end -}}
```
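The rendered canary.txt is just a newline-separated list of host:port pairs; with made-up addresses, it would look something like:

```
10.0.1.17:24567
10.0.1.23:31942
```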

Our monitoring script watches our canary containers until it sees that they've handled 75,000 requests without an error. (75,000 is a little bit of an arbitrary number: it's large enough that we'll catch relatively rare errors, and small enough that we can serve that traffic on a small number of containers within a few minutes.)

```javascript
const fs = require("fs");
const fetch = require("node-fetch");
const { execSync } = require("child_process");

const canaryContainers = fs
  .readFileSync("./canary.txt")
  .toString()
  .split("\n")
  .map((s) => s.trim())
  .filter(Boolean);

const GOAL_REQUEST_COUNT = 75_000;

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async function main() {
  while (true) {
    let totalRequestCount = 0;
    for (const container of canaryContainers) {
      // node-fetch needs an absolute URL, so we add the scheme explicitly
      // (assuming plain http inside our network)
      const { errorCount, requestCount } = await fetch(
        `http://${container}/api/errorAndRequestCount`
      ).then((res) => res.json());
      totalRequestCount += requestCount;
      if (errorCount) {
        // stopping our canary containers is normally handled by the next stage in our pipeline
        // putting it here for illustration
        console.error("oh no! canary failed");
        execSync(`nomad job stop api-canary`);
        return process.exit(1);
      }
    }

    if (totalRequestCount >= GOAL_REQUEST_COUNT) {
      console.log("yay! canary succeeded");
      execSync(`nomad job stop api-canary`);
      return process.exit(0);
    }

    await delay(1000);
  }
})();
```

Nary an Error with Canary

We've been running this canary setup (with occasional changes) for over eight years now. It's been a key part of our continuous delivery pipeline, and it's let us move quickly and safely. Without it, we would have shipped a lot more errors to production, our overall error rate would likely be higher, and our teams wouldn't be able to move as quickly as they do. Our setup definitely isn't perfect, but it's still hugely valuable, and I hope that sharing it helps your team create a better one.

Interested in working in an engineering culture that values automated testing, continuous delivery, and high collaboration? ClassDojo is hiring and we'd love to chat!

Will Keleher

A former teacher, Will is an engineering manager focused on database performance, team effectiveness, and site-reliability. He believes most problems can be solved with judicious use of `sed` and `xargs`.
