
Posts By: Will Keleher

Red and Blue Function Mistakes in JavaScript

Bob Nystrom's What Color is Your Function does an amazing job of describing why it can be painful when programming languages have different rules for calling synchronous and asynchronous functions. Promises and async/await have simplified things in JavaScript, but it's still a language with "red" (async) and "blue" (sync) functions, and I consistently see a few understandable errors from red vs. blue function confusion. Let's go through some of the most common mistakes – none of these are bad things to get wrong, they're just a symptom of how confusing this can be!

Omitting await from try/catch blocks

The most common mistake I see is omitting await from try/catch blocks with async functions. The code looks reasonable, but the catch block will only be able to catch synchronously thrown errors. To make matters worse, error handling logic is often less well tested than the happy path when everything works, which makes this pattern more likely to sneak its way into production code.

async function throwsError() {
  throw new Error("alas! an error");
}

try {
  return throwsError();
} catch (err) {
  console.error("Oh no! This catch block isn't catching anything", err);
}

An async function that throws is the equivalent of a Promise.reject, and when written that way, it's a bit clearer what's going on:

try {
  return Promise.reject(new Error("alas! an error"));
} catch (err) {
  console.error("It's clearer that this `catch` can't catch that `Promise.reject`. This is equivalent to the earlier code");
}

Personally, I'm starting to wonder whether using try and catch blocks at all is a mistake when dealing with async code. They take up space and don't offer the pattern matching that a library like Bluebird can add to catch when you only want to handle specific known errors: await tryThing().catch(NotFoundErrorClass, handleErrPattern) feels substantially cleaner to me than the equivalent try/catch block.
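Here's a rough sketch of that difference, assuming Bluebird is installed; the NotFoundError class and fetchThing helper are made up for illustration:

const Bluebird = require("bluebird");

class NotFoundError extends Error {}

// stand-in for a real data fetch that sometimes can't find what it's looking for
async function fetchThing() {
  throw new NotFoundError("no thing found");
}

// the try/catch version needs an instanceof check and a rethrow to avoid
// swallowing unrelated errors
async function getThingVerbose() {
  try {
    return await fetchThing();
  } catch (err) {
    if (!(err instanceof NotFoundError)) throw err;
    return null;
  }
}

// Bluebird's filtered catch only handles the error class you name;
// everything else keeps propagating
function getThing() {
  return Bluebird.resolve(fetchThing()).catch(NotFoundError, () => null);
}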

Array.filter(async () => false)

In recent years, JavaScript has added lots of useful Array methods like filter, map, forEach, and flatMap, and JavaScript programmers often use libraries like lodash to write functional code rather than writing for loops. Sadly, none of those Array methods or lodash helpers work with red async functions, which makes them a common source of coding errors.

const things = [true, false, 1, 0, "", new Date("not a date") - 0];
const filteredThings = things.filter(async (thing) => thing);

How many things do we end up with in filteredThings? Surprisingly, the answer has little to do with JavaScript type coercion: filteredThings will be the same size as things. An async function returns a Promise and even a Promise that resolves to false is still a truthy value: Boolean(Promise.resolve(false)) === true. If we want to do any sort of filtering using an async function, we need to switch out of blue sync mode and into red async mode.

(async function () {
  // You should use a library like Bluebird rather than filtering like this! This is only for illustration
  const things = [true, false, 1, 0, "", new Date("not a date") - 0];
  const predicateValues = await Promise.all(things.map(async (thing) => thing));
  const filteredThings = things.filter((_thing, i) => predicateValues[i]);
})();

When you see Array.filter(async (thing) => thing) written out like that, the mistake is pretty clear. It can be harder to notice when you see code like const goodThings = things.filter(isGoodThing); you need to check whether isGoodThing is red or blue.

Array.forEach(async...

We see a similar problem when people use Array.forEach with an async function:

const fruitStatus = {};
["apple", "tomato", "potato"].forEach(async (food) => {
  fruitStatus[food] = await isFruit(food);
});
return fruitStatus;

In some ways, this is a more dangerous pattern. Depending on when you check, fruitStatus may have some, none, or all of the correct isFruit values. If isFruit is normally fast, problems and bugs might not manifest until isFruit slows down. A bug that only shows up some of the time is much harder to debug than one that's always there.
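One way to fix it is to switch fully into async mode and wait for every isFruit check before anyone reads fruitStatus. A small sketch, using the same hypothetical isFruit predicate as above:

async function getFruitStatus() {
  const fruitStatus = {};
  // map + Promise.all waits for every isFruit call to settle; forEach would not
  await Promise.all(
    ["apple", "tomato", "potato"].map(async (food) => {
      fruitStatus[food] = await isFruit(food);
    })
  );
  return fruitStatus;
}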

Await off my shoulders

Despite how easy it is to make mistakes with async/await, I still love it – it feels easier to work with than Promises or callbacks. Dealing with asynchronous code is still one of the harder parts of programming in JavaScript, but tools like bluebird, the TypeScript no-unnecessary-condition rule, and the eslint promise plugin can help surface these easy-to-make red/blue function mistakes early. Hopefully, seeing the mistakes we often make will help you avoid some frustrating minutes of debugging.

Canary releases are pretty great! ClassDojo uses them as part of our continuous delivery pipeline: having a subset of real users use & validate our app before continuing with deploys allows us to safely & automatically deploy many times a day.

Our canary releases are conceptually simple:

  1. we start canary containers with a new container image
  2. we then route some production traffic to these containers
  3. we monitor them: if a container sees a problem, we stop our pipeline. If they don't see problems, we start a full production deploy

Simple enough, right? There are a few details that go into setting up a system like this, and I'd like to take you through how ClassDojo does it. Our pipeline works well for our company's needs, and I think it's a good example of what this kind of canary-gated deploy can look like.

The key pieces of our system:

  1. We have a logging taxonomy that lets us accurately detect server-errors that we want to fix. ("Errors" that we don't want to fix aren't actually errors!)
  2. HAProxy, Consul, and Nomad let us route a subset of production traffic to a group of canary containers running new code
  3. Our canary containers expose a route with the count of seen errors and the count of total requests that a monitoring script in our jenkins pipeline can hit
  4. The monitoring script will stop our deployment if it sees a single error. If it sees 75,000 successful production requests, it will let the deploy go to production. (75,000 is an arbitrary number that gives us a ~99.9% chance of catching errors that happen on 1 in 10^4 requests; the quick sketch below checks that math.)
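The arithmetic behind that claim: the chance that an error occurring on 1 in 10^4 requests hides for all 75,000 canary requests is (1 - 10^-4)^75,000 ≈ 0.0006, so we catch it roughly 99.94% of the time. A two-line sketch of the same calculation:

// chance an error with a 1-in-10^4 rate never appears across 75,000 canary requests
const missProbability = (1 - 1e-4) ** 75_000;
console.log(`catch probability ≈ ${(1 - missProbability).toFixed(4)}`); // ≈ 0.9994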

Starting canary containers

ClassDojo uses Nomad for our container orchestration, so once we've built a docker image and tagged it with our updated_image_id, we can deploy it by running nomad run api-canary.nomad.

// api-canary.nomad
job "api-canary" {
  group "api-canary-group" {
    count = 8
    task "api-canary-task" {
      driver = "docker"
      config {
        image = "updated_image_id"
      }
      service {
        name = "api-canary"
        port = "webserver_http"
        // this registers this port on these containers with consul as eligible for "canary" traffic
      }
      resources {
        cpu    = 5000 # MHz
        memory = 1600 # MB
        network {
          port "webserver_http" {}
        }
      }
    }
  }
}

Nomad takes care of running these 8 (count = 8) canary containers on our nomad clients. At this point, we have running containers, but they're not serving any traffic.

Routing traffic to our canary containers

Remember that nomad job file we looked at above? Part of what it was doing was registering a service in consul. We tell consul that the webserver_http port can provide the api-canary service.

service {
  name = "api-canary"
  port = "webserver_http"
}

We use HAProxy for load-balancing, and we use consul-template to generate updated haproxy configs every 30 seconds based on the service information that consul knows about.

backend api
  mode http
  # I'm omitting a *ton* of detail here!
  # See https://engineering.classdojo.com/2021/07/13/haproxy-graceful-server-shutdowns for how we do graceful deploys with HAProxy

{{ range service "api-canary" }}
  server canary_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}

  # as far as HAProxy is concerned, the canary containers above should be treated the same
  # as our regularly deployed containers: it will round-robin traffic to all of them
{{ range service "api" }}
  server api_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
{{ end }}

Monitoring canary

Whenever we see an error, we increment a local counter saying that we saw the error. What counts as an error? For us, an error is something we need to fix (most often 500s or timeouts): if something can't be fixed, it's part of the system, and we need to design around it. If you're curious about our approach to categorizing errors, Creating An Actionable Logging Taxonomy digs into the details. Having an easy way of identifying real problems that should stop a canary deploy is the key piece that makes this system work.

let errorCount: number = 0;
export const getErrorCount = () => errorCount;
export function logServerError(errorDetails: ErrorDetails) {
  errorCount++;
  metrics.increment("serverError");
  winstonLogger.log("error", errorDetails);
}

Similarly, whenever we finish with a request, we increment another counter saying we saw the request. We can then expose both of these counts on our status route. There are probably better ways of publishing this information to our monitoring script rather than via our main server, but it works well enough for our needs.

router.get("/api/errorAndRequestCount", () => {
  return {
    errorCount: getErrorCount(),
    requestCount: getRequestsSeenCount(),
    ...otherInfo,
  };
});

Finally, we can use consul-template to re-generate our list of canary hosts & ports, and write a monitoring script to check the /api/errorAndRequestCount route on all of them. If we see an error, we can run nomad job stop api-canary && exit 1, and that will stop our canary containers & our deployment pipeline.

consul-template -template canary.tpl:canary.txt -once

{{ range service "api-canary" }}
  {{ .Address }}:{{ .Port }}
{{ end -}}

Our monitoring script watches our canary containers until it sees that they've handled 75,000 requests without an error. (75,000 is a little bit of an arbitrary number: it's large enough that we'll catch relatively rare errors, and small enough that we can serve that traffic on a small number of containers within a few minutes.)

const fs = require("fs");
const fetch = require("node-fetch");
const { execSync } = require("child_process");

// canary.txt contains one "address:port" entry per line (generated by consul-template above)
const canaryContainers = fs
  .readFileSync("./canary.txt")
  .toString()
  .split("\n")
  .map((s) => s.trim())
  .filter(Boolean);
const GOAL_REQUEST_COUNT = 75_000;

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async function main() {
  while (true) {
    let totalRequestCount = 0;
    for (const container of canaryContainers) {
      const { errorCount, requestCount } = await fetch(
        `http://${container}/api/errorAndRequestCount`
      ).then((res) => res.json());
      totalRequestCount += requestCount;
      if (errorCount) {
        // stopping our canary containers is normally handled by the next stage in our pipeline;
        // putting it here for illustration
        console.error("oh no! canary failed");
        execSync(`nomad job stop api-canary`);
        return process.exit(1);
      }
    }

    if (totalRequestCount >= GOAL_REQUEST_COUNT) {
      console.log("yay! canary succeeded");
      execSync(`nomad job stop api-canary`);
      return process.exit(0);
    }

    await delay(1000);
  }
})();

Nary an Error with Canary

We've been running this canary setup (with occasional changes) for over eight years now. It's been a key part of our continuous delivery pipeline and has let us move quickly and safely. Without it, we would have shipped a lot more errors out to production, our overall error rate would likely be higher, and our teams would not be able to move as quickly as they can. Our setup definitely isn't perfect, but it's still hugely valuable, and I hope that sharing it helps your team create an even better one.

Interested in working in an engineering culture that values automated testing, continuous delivery, and high collaboration? ClassDojo is hiring and we'd love to chat!

When multiple teams share a resource like a code-base, a database, or an analytics pipeline, that resource can often suffer from the Tragedy of the Commons. Well-intentioned engineers can add slow code, skip important improvements because those improvements aren't aligned with their team's incentives, and generally fail to leave the codebase or database better than they found it.

ClassDojo has a monolithic backend web-server API, but it and its backing databases often suffered from hard-to-diagnose performance problems because engineers would accidentally add code with poorly indexed queries, code that fanned out into huge numbers of database requests, or code that blocked the event loop, and they wouldn't realize that that code was causing any problems. We intentionally made our Model layer request-agnostic, which helps us create better tests & simpler code, but it meant that we didn't have a way of instrumenting database & performance cost per request until Node 14 started supporting AsyncLocalStorage.

AsyncLocalStorage "creates stores that stay coherent through asynchronous operations." This lets us reference a consistent context over the course of a request to get information about the route, store information about how many database requests that request has triggered, count how many documents we've fetched, and dramatically increase our visibility into what every request is doing. Because every single one of our routes is tied to a product area, and each of our product areas is owned by a team, we were suddenly able to give each team insight into any performance problems they'd inadvertently added to the codebase. None of our engineers and teams ever wanted to cause performance problems, but the lack of pre-AsyncLocalStorage system legibility meant that these issues were difficult to track down, and teams always had higher priority work than tracking down unattributed performance problems.

When we set up our AsyncLocalStorage per-request instrumentation system, we found some crazy things:

  • a route that occasionally made 30,000+ database requests because it was fanning out over a large list of items
  • another route that blocked the event-loop for 15-20 seconds a few times a day, and caused timeouts for any other requests that our server was handling at the same time
  • a third route that was occasionally fetching 500,000+ items to render information to return to clients
  • and plenty of other routes that weren't doing things nearly this egregious, but were still bad enough that we wanted to fix them

Once we had this instrumentation, it was incredibly simple for the right team to find and fix these issues. We'd looked into some of these & similar issues before, but finding & fixing them was incredibly time-consuming. Now, we have a single log that triggers an alert and tells you everything you need to know to track down the problem.

What did we track?

  1. Whenever we query one of our databases, we now include a comment on the query with the name of the route that triggered it (see the sketch after this list). When we go through our database's slow query logs, this comment shows us where the query actually came from, and lets us accurately attribute our database load. This makes our database optimization work so much simpler because many of our query patterns could be caused by many different routes!
  2. We also track:
  • the number of database queries made per database and per table-model over the course of a request
  • the number of items returned over the course of a request per database and per table-model
  • the total duration spent querying per-database and per table-model

Whenever we exceed safe thresholds for any of these metrics over the course of a request, we log out the details and send an alert to the appropriate team. We also send all of this data to our data warehouse, Redshift, so we can run more detailed queries about what each route is doing.

  3. We track overall event-loop-lag per container, and whenever event-loop-lag crosses 1 second (which is incredibly high!), we log out the same AsyncLocalStorage-based request-cost details for all recent requests. This always points to situations where we've fetched a lot of data, and are then doing something computationally complex with it.
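As a sketch of the query-comment idea from item 1, a small helper can read the current route name out of the AsyncLocalStorage store from the earlier sketch and prefix it onto the query; withRouteComment, db, and classId here are hypothetical, not our actual code:

// prefix a query with the route that triggered it, so slow-query logs point
// straight back at the responsible route (and therefore the responsible team)
function withRouteComment(sql) {
  const store = requestContext.getStore(); // requestContext from the earlier sketch
  const routeName = store ? store.routeName : "unknown-route";
  return `/* route: ${routeName} */ ${sql}`;
}

// a query issued while handling /api/someRoute shows up in the slow query log as:
// /* route: /api/someRoute */ SELECT * FROM posts WHERE class_id = ?
db.query(withRouteComment("SELECT * FROM posts WHERE class_id = ?"), [classId]);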

Monoliths need Transparency

It's hard to overestimate how much using AsyncLocalStorage like this has helped us improve our backend web-server's performance. One of the major drawbacks of a monolith like ours is that teams can't easily isolate their code's performance from that of the rest of the monolith. Having granular insight into what's actually happening is what will let us keep scaling our monolith easily for a long time to come.
