When multiple teams share a resource like a code-base, a database, or an analytics pipeline, that resource can often suffer from the Tragedy of the Commons. Well intentioned engineers can add slow code, not prioritize important improvements because those improvements aren't aligned with team incentives, and just generally not leave the codebase or database better than they found it.
ClassDojo has a monolithic backend web-server API, but it and its backing databases often suffered from hard to diagnose performance problems because engineers would accidentally add code with poorly indexed queries, with database fanouts, or that blocked the event loop, and they wouldn't realize that that code was causing any problems. We intentionally made our Model layer request-agnostic, which helps us create better tests & simpler code, but it meant that we didn't have a way of instrumenting database & performance cost by request until Node 14 started supporting AsyncLocalStorage
.
AsyncLocalStorage "creates stores that stay coherent through asynchronous operations." This lets us reference a consistent context over the course of a request to get information about the route, store information about how many database requests that request has triggered, count how many documents we've fetched, and dramatically increase our visibility into what every request is doing. Because every single one of our routes is tied to a product area, and each of our product areas is owned by a team, we were suddenly able to give each team insight into any performance problems they'd inadvertently added to the codebase. None of our engineers and teams ever wanted to cause performance problems, but the lack of pre-AsyncLocalStorage system legibility meant that these issues were difficult to track down, and teams always had higher priority work than tracking down unattributed performance problems.
When we set up our AsyncLocalStorage per-request instrumentation system, we found some crazy things:
- a route that occasionally made 30,000+ database requests because it was fanning out over a large list of items
- another route that blocked the event-loop for 15-20 seconds a few times a day, and caused timeouts for any other requests that our server was handling at the same time
- a third route that was occasionally fetching 500,000+ items to render information to return to clients
- and plenty of other routes that weren't doing things nearly this egregious, but were still bad enough that we wanted to fix them
Once we had this instrumentation, it was incredibly simple for the right team to find and fix these issues. We'd looked into some of these & similar issues before, but finding & fixing them was incredibly time-consuming. Now, we have a single log that triggers an alert and tells you everything you need to know to track down the problem.
What did we track?
- Whenever we query one of our databases, we now include a comment with the query with the route name that triggered the request. When we go through our database's slow query logs, this comment shows us where the query actually came from, and lets us accurately attribute our database load. This makes our database optimization work so much simpler because many of our query patterns could be caused by many different routes!
- We also track:
- the number of database queries made per database and per table-model over the course of a request
- the number of items returned over the course of a request per database and per table-model
- the total duration spent querying per-database and per table-model
Whenever we exceed safe thresholds for any of these metrics over the course of a request, we log out the details and send an alert to the appropriate team. We also send all of this data to our data warehouse, Redshift, so we can run more detailed queries about what each route is doing.
- We track overall event-loop-lag per container, and whenever event-loop-lag crosses 1 second (which is incredibly high!), we log out the same AsyncLocalStorage based request-cost details for all recent requests. This always points to situations where we've fetched a lot of data, and are then doing something computationally complex with it.
Monoliths need Transparency
It's hard to overestimate how much using AsyncLocalStorage
like this has helped easily improve our backend web-server's performance. One of the major drawbacks to a monolith like ours can be that teams aren't able to easily isolate their code's performance from that of the rest of the monolith. Having insight into what is actually happening on a granular level is what's going to allow us to continue having an easy time scaling our monolith for a long time.