Canary Containers at ClassDojo in Too Much Detail

Canary releases are pretty great! ClassDojo uses them as part of our continuous delivery pipeline: having a subset of real users use & validate our app before continuing with deploys allows us to safely & automatically deploy many times a day.

Our canary releases are conceptually simple:

  1. we start canary containers with a new container image
  2. we then route some production traffic to these containers
  3. we monitor them: if the canaries see a problem, we stop our pipeline; if they don't, we start a full production deploy

Simple enough, right? There are a few details that go into setting up a system like this, and I'd like to take you through how ClassDojo does it. Our pipeline works well for our company's needs, and I think it's a good example of what this kind of canary-gated deploy can look like.

The key pieces of our system:

  1. We have a logging taxonomy that lets us accurately detect server-errors that we want to fix. ("Errors" that we don't want to fix aren't actually errors!)
  2. HAProxy, Consul, and Nomad let us route a subset of production traffic to a group of canary containers running new code
  3. Our canary containers expose a route that reports the count of errors seen and the count of total requests served, which a monitoring script in our Jenkins pipeline can hit
  4. The monitoring script stops our deployment if it sees a single error. If it sees 75,000 successful production requests, it lets the deploy continue to production. (75,000 is an arbitrary number that gives us a ~99.9% chance of catching errors that happen once every 10^4 requests; the math is sketched below.)
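
Why does 75,000 requests give us those odds? If a bug causes an error once every 10^4 requests, the chance that all 75,000 canary requests complete without surfacing it is (1 - 10^-4)^75,000 ≈ e^-7.5 ≈ 0.00055, so we'd expect to catch a bug of that frequency about 99.9% of the time.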

Starting canary containers

ClassDojo uses Nomad for our container orchestration, so once we've built a Docker image and tagged it with our updated_image_id, we can deploy it by running nomad run api-canary.nomad.

// api-canary.nomad
job "api-canary" {
  group "api-canary-group" {
    count = 8
    task "api-canary-task" {
      driver = "docker"
      config {
        image = "updated_image_id"
      }
      service {
        name = "api-canary"
        port = "webserver_http"
        // registering this port with consul marks these containers as eligible for "canary" traffic
      }
      resources {
        cpu    = 5000 # MHz
        memory = 1600 # MB
        network {
          port "webserver_http" {}
        }
      }
    }
  }
}

Nomad takes care of running these 8 (count = 8) canary containers on our Nomad clients. At this point, we have running containers, but they're not serving any traffic.

Routing traffic to our canary containers

Remember that Nomad job file we looked at above? Part of what it was doing was registering a service in Consul: we tell Consul that the webserver_http port on these containers provides the api-canary service.

service {
  name = "api-canary"
  port = "webserver_http"
}

We use HAProxy for load balancing, and we use consul-template to generate updated HAProxy configs every 30 seconds based on the service information that Consul knows about.

backend api
  mode http
  # I'm omitting a *ton* of detail here!
  # See https://engineering.classdojo.com/2021/07/13/haproxy-graceful-server-shutdowns for how we do graceful deploys with HAProxy

{{ range service "api-canary" }}
  server canary_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}

# As far as HAProxy is concerned, the canary containers above should be treated
# the same as our regularly deployed containers: it will round-robin traffic to all of them.
{{ range service "api" }}
  server api_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}
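
Once consul-template renders that template, the backend ends up looking something like this (the addresses and ports here are invented for illustration):

backend api
  mode http
  server canary_10.0.0.21:25461 10.0.0.21:25461
  server canary_10.0.0.22:31022 10.0.0.22:31022
  server api_10.0.0.11:24873 10.0.0.11:24873
  server api_10.0.0.12:29310 10.0.0.12:29310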

Monitoring canary

Whenever we see an error, we increment a local counter saying that we saw the error. What counts as an error? For us, an error is something we need to fix (most often 500s or timeouts): if something can't be fixed, it's part of the system, and we need to design around it. If you're curious about our approach to categorizing errors, Creating An Actionable Logging Taxonomy digs into the details. Having an easy way of identifying real problems that should stop a canary deploy is the key piece that makes this system work.

// Module-level counter: each container tracks its own error count in memory.
let errorCount: number = 0;
export const getErrorCount = () => errorCount;

// ErrorDetails, metrics, and winstonLogger are defined elsewhere in our codebase.
export function logServerError(errorDetails: ErrorDetails) {
  errorCount++;
  metrics.increment("serverError");
  winstonLogger.log("error", errorDetails);
}

Similarly, whenever we finish with a request, we increment another counter saying we saw the request.
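
The request counter mirrors the error counter; a minimal sketch might look like this (assuming an Express-style app; the countRequests middleware is illustrative, not our exact code):

import type { NextFunction, Request, Response } from "express";

let requestsSeenCount = 0;
export const getRequestsSeenCount = () => requestsSeenCount;

// Illustrative Express-style middleware: bump the counter once each response finishes.
export function countRequests(req: Request, res: Response, next: NextFunction) {
  res.on("finish", () => {
    requestsSeenCount++;
  });
  next();
}

We can then expose both of these counts on our status route. There are probably better ways of publishing this information to our monitoring script than via our main server, but it works well enough for our needs.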

router.get("/api/errorAndRequestCount", () => {
  return {
    errorCount: getErrorCount(),
    requestCount: getRequestsSeenCount(),
    ...otherInfo,
  };
});

Finally, we can use consul-template to re-generate our list of canary hosts & ports, and write a monitoring script to check the /api/errorAndRequestCount route on all of them. If we see an error, we can run nomad job stop api-canary && exit 1, and that will stop our canary containers & our deployment pipeline.

consul-template -template canary.tpl:canary.txt -once

{{ range service "api-canary" }}
  {{ .Address }}:{{ .Port }}
{{end -}}

Our monitoring script watches our canary containers until it sees that they've handled 75,000 requests without an error. (75,000 is a little bit of an arbitrary number: it's large enough that we'll catch relatively rare errors, and small enough that we can serve that traffic on a small number of containers within a few minutes.)

const fs = require("fs");
const fetch = require("node-fetch");
const { execSync } = require("child_process");

// canary.txt is the host:port list that consul-template wrote for us above.
const canaryContainers = fs
  .readFileSync("./canary.txt")
  .toString()
  .split("\n")
  .map((s) => s.trim())
  .filter(Boolean);

const GOAL_REQUEST_COUNT = 75_000;

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async function main() {
  while (true) {
    let totalRequestCount = 0;
    for (const container of canaryContainers) {
      const { errorCount, requestCount } = await fetch(
        `http://${container}/api/errorAndRequestCount`
      ).then((res) => res.json());
      totalRequestCount += requestCount;
      if (errorCount) {
        // stopping our canary containers is normally handled by the next stage in our pipeline
        // putting it here for illustration
        console.error("oh no! canary failed");
        execSync(`nomad job stop api-canary`);
        return process.exit(1);
      }
    }

    if (totalRequestCount >= GOAL_REQUEST_COUNT) {
      console.log("yay! canary succeeded");
      execSync(`nomad job stop api-canary`);
      return process.exit(0);
    }

    await delay(1000);
  }
})();

Nary an Error with Canary

We've been running this canary setup (with occasional changes) for over eight years now. It's been a key part of our continuous delivery pipeline and has let us move quickly and safely. Without it, we would have shipped many more errors all the way to production, our overall error rate would likely be higher, and our teams would not be able to move as quickly as they can. Our setup definitely isn't perfect, but it's still hugely valuable, and I hope that sharing it will help your team create an even better one.

Interested in working in an engineering culture that values automated testing, continuous delivery, and high collaboration? ClassDojo is hiring and we'd love to chat!

Many companies, like Unity and Apple, are using Entity Component Systems (ECS) to make games, because having tightly packed data leads to cache efficiency and great performance. ECS isn't just great for performance, though: it leads to good code decoupling, enables easy composition, and lends itself to TDD and automated testing.

Entity Component Systems are Easy to Compose

In the standard Object Oriented Programming (OOP) approach to writing games, you often create extremely complex and deep inheritance trees; a Player class might inherit from a SceneObject, and that SceneObject might inherit from a Transform class, and that Transform class might rely on a PlayerManager class to even be instantiated. This can be quite complex to understand and write tests for.

ECS models this with composition instead. The entity (E) is just an ID, components (C) are data associated with that entity, and systems (S) operate on and modify groups of components. You build up functionality by adding multiple components to an entity and writing systems that look for particular groups of components.

Example pseudocode:

PositionComponent {
  x, y, z
}

WindComponent {
  // No data - just used as a tag
}

WindSystem {
  update() {
    world.each(position: PositionComponent, wind: WindComponent) {
      position.x++;
    }
  }
}

// Add entities to the world
playerEntity = world.addEntity(PositionComponent());
enemyEntity = world.addEntity(PositionComponent(), WindComponent());

// Add systems
world.addSystem(WindSystem());

while (true) {
  // Update all the systems each frame
  world.systems(system) {
    system.update();
  }
}

In the example code above, the enemy entity moves every frame while the player entity does not: the WindSystem only matches entities that have both a PositionComponent and a WindComponent. The great part about using composition is that it's easy to add and remove functionality from entities.

So, if something happened that caused us to want the player to be affected by wind as well, we could simply add the WindComponent to the player entity.

Example pseudocode:

world.addComponent(playerEntity, WindComponent());

We can also make the system itself slightly more complex by removing the WindComponent when an entity reaches the side of the screen.

Example pseudocode:

WindSystem {
  update() {
    world.each(position: PositionComponent, wind: WindComponent) {
      if (position.x > 100) {
        world.removeComponent(wind)
      } else {
        position.x++;
      }
    }
  }
}

These examples are slightly simplified, but even real systems tend to be small and focused because they only have one job to perform. This makes them easy to understand and reason about.
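
To make the world.each calls above less magical, here's a minimal TypeScript sketch of one way a world might store components and answer those queries. Everything here (World, each, the storage layout) is illustrative rather than code from a real engine:

type Entity = number;

class PositionComponent {
  constructor(public x = 0, public y = 0, public z = 0) {}
}

class WindComponent {} // no data: used purely as a tag

class World {
  private nextId: Entity = 0;
  // One Map per component type: component class -> (entity -> component instance).
  private stores = new Map<Function, Map<Entity, unknown>>();

  addEntity(...components: object[]): Entity {
    const id = this.nextId++;
    for (const component of components) this.addComponent(id, component);
    return id;
  }

  addComponent(entity: Entity, component: object): void {
    let store = this.stores.get(component.constructor);
    if (!store) {
      store = new Map();
      this.stores.set(component.constructor, store);
    }
    store.set(entity, component);
  }

  removeComponent(entity: Entity, type: Function): void {
    this.stores.get(type)?.delete(entity);
  }

  // Yield every entity that has all of the requested component types,
  // along with its instance of the first type.
  *each<T>(first: new (...args: any[]) => T, ...rest: Function[]): Generator<[Entity, T]> {
    for (const [entity, component] of this.stores.get(first) ?? []) {
      if (rest.every((type) => this.stores.get(type)?.has(entity))) {
        yield [entity, component as T];
      }
    }
  }
}

// The WindSystem from the pseudocode, written against this sketch.
class WindSystem {
  update(world: World): void {
    for (const [entity, position] of world.each(PositionComponent, WindComponent)) {
      if (position.x > 100) {
        world.removeComponent(entity, WindComponent);
      } else {
        position.x++;
      }
    }
  }
}

Real ECS libraries usually store components in tightly packed arrays rather than maps (that's where the cache efficiency comes from), but the query shape is the same.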

Entity Component Systems are Easy to Test

ECS also helps when it comes to automated testing. In OOP, our class might have a model, animation, and all sorts of other data attached to it that we don't necessarily want to load for every test. With composition, we can create just the components and systems we're currently testing.

Example tests that we could write for our WindSystem include:

  • Entities with PositionComponent and WindComponent move the expected amount per update
  • Entities without the WindComponent don't move
  • Entities lose the WindComponent after x > 100

These tests are easy to write and fast to run since they don’t load or run any unnecessary logic.

Example pseudocode:

test("Entities without the WindComponent don't move") {
  position = PositionComponent();
  world.addEntity(position);
  world.addSystem(WindSystem());

  startingX = position.x;
  world.update();
  expect(position.x == startingX);
}

Conclusion

ECS has a lot of benefits outside of performance that can really help create better code that is easier to reason about and modify. We’ve started using it at ClassDojo for some projects and are really enjoying it.

Does this sound like a better way to create games? Come join us in ClassDojo Engineering, we’re hiring!

I've done a few Advent of Code problems, and one thing that's occasionally tripped me up has been my erroneous assumption that accessing the key of an object is practically free. It's a fast constant-time operation, but whatever hashing Node is doing internally to find a key can sometimes be a performance hot-spot. There's a massive difference in speed between Array[index] and Object[key], and that made me start to wonder whether Object[key] was slow enough that there were situations where it'd make more sense to use Array.includes instead.

When I benchmarked it, I was pretty astounded to see that on a sorted array, Array.includes is faster on arrays of 10,000 strings than Object[key]. Internally, I'm guessing Array.includes is using binary search when it knows that the array is sorted, which is a cool optimization! Here's the gist if you want to test it out. (I'm running these tests on Node 14 on a Mac Mini.)
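
The results below are in Benchmark.js output format, so the harness looked roughly like this sketch (the container contents, sizes, and looked-up key here are illustrative, not the exact gist):

const Benchmark = require("benchmark");

const size = 10_000;
// A sorted array of strings, plus the same strings as object keys and as a Set.
const keys = [...Array(size).keys()].map((i) => `key_${i}`).sort();
const needle = keys[Math.floor(size / 2)];
const asObject: Record<string, boolean> = Object.fromEntries(keys.map((k) => [k, true]));
const asSet = new Set(keys);

new Benchmark.Suite()
  .add(`${size} object access`, () => asObject[needle])
  .add(`${size} set`, () => asSet.has(needle))
  .add(`${size} array`, () => keys.includes(needle))
  .on("cycle", (event: any) => console.log(String(event.target)))
  .on("complete", function (this: any) {
    console.log(`Fastest for ordered ${size} is ${this.filter("fastest").map("name")}`);
  })
  .run();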

What about unsorted arrays?

When the array is unsorted, Object[key] look-ups start beating Array.includes pretty quickly: once we have ~20 elements, we'd expect to see faster look-ups from keyed object access. This is much more in line with what I expected when setting up this test.

What about Set.has?

Set.has is consistently slower than Object[key]. It can handle more cases than Object[key] can, but from a speed perspective, it's not in the running. I still often use it over Object[key] because I think it does a better job of expressing the intent to someone who's reading the code, and it's still a fast-enough constant time look-up.

What does this all mean?

Absolutely nothing! I still use Set.has for the majority of situations where I need to check membership in a list. For most of the code I write, expressing clear intent is far more important than optimizing performance, and I have yet to profile non-Advent-of-Code performance and find Object[key] or Set.has being a hotspot.

Benchmark Results

10 object access x 25,715,807 ops/sec ±0.70% (94 runs sampled)
10 set x 18,422,013 ops/sec ±0.85% (96 runs sampled)
10 array x 28,105,533 ops/sec ±0.88% (89 runs sampled)
Fastest for ordered 10 is array
40 object access x 20,175,473 ops/sec ±0.90% (92 runs sampled)
40 set x 19,632,409 ops/sec ±0.73% (92 runs sampled)
40 array x 27,999,865 ops/sec ±1.09% (94 runs sampled)
Fastest for ordered 40 is array
100 object access x 21,842,706 ops/sec ±0.70% (95 runs sampled)
100 set x 18,283,045 ops/sec ±0.78% (95 runs sampled)
100 array x 27,860,691 ops/sec ±1.03% (94 runs sampled)
Fastest for ordered 100 is array
1000 object access x 20,026,184 ops/sec ±0.82% (92 runs sampled)
1000 set x 14,656,626 ops/sec ±1.37% (89 runs sampled)
1000 array x 23,392,774 ops/sec ±1.31% (90 runs sampled)
Fastest for ordered 1000 is array
10000 object access x 14,367,754 ops/sec ±0.58% (93 runs sampled)
10000 set x 12,213,161 ops/sec ±0.51% (95 runs sampled)
10000 array x 28,067,504 ops/sec ±0.94% (92 runs sampled)
Fastest for ordered 10000 is array
100000 object access x 9,106,145 ops/sec ±1.55% (93 runs sampled)
100000 set x 7,918,164 ops/sec ±0.92% (91 runs sampled)
100000 array x 78,374 ops/sec ±58.42% (86 runs sampled)
Fastest for ordered 100000 is object access
1000000 object access x 3,296,346 ops/sec ±1.05% (93 runs sampled)
1000000 set x 2,076,943 ops/sec ±1.24% (91 runs sampled)
1000000 array x 490 ops/sec ±3.50% (40 runs sampled)
Fastest for ordered 1000000 is object access
10 object access x 24,869,178 ops/sec ±0.80% (91 runs sampled)
10 set x 19,181,356 ops/sec ±0.98% (92 runs sampled)
10 array x 27,691,570 ops/sec ±0.99% (90 runs sampled)
Fastest for random 10 is array
40 object access x 27,450,529 ops/sec ±1.30% (86 runs sampled)
40 set x 12,483,804 ops/sec ±1.30% (91 runs sampled)
40 array x 16,590,091 ops/sec ±0.92% (92 runs sampled)
Fastest for random 40 is object access
100 object access x 31,040,999 ops/sec ±1.24% (87 runs sampled)
100 set x 12,386,070 ops/sec ±1.14% (91 runs sampled)
100 array x 16,389,384 ops/sec ±1.22% (91 runs sampled)
Fastest for random 100 is object access
1000 object access x 25,365,248 ops/sec ±1.52% (88 runs sampled)
1000 set x 11,275,049 ops/sec ±1.51% (91 runs sampled)
1000 array x 16,294,268 ops/sec ±1.23% (88 runs sampled)
Fastest for random 1000 is object access
10000 object access x 16,907,288 ops/sec ±1.47% (89 runs sampled)
10000 set x 8,905,038 ops/sec ±1.47% (91 runs sampled)
10000 array x 14,777,156 ops/sec ±1.19% (91 runs sampled)
Fastest for random 10000 is object access
...