18,957 tests in under 6 minutes: ClassDojo's approach to backend testing

We're pretty proud of our backend test suite. We have a lot of tests, and developers can run the full test suite locally in under six minutes. These aren't simple unit tests—they're tests that hit multiple databases and routes with only a minimal amount of stubbing for external dependencies.

9 years ago, we were proud of running 2,000 tests in 3 minutes. Not much has changed from that initial post—we're still writing a bunch of tests, but we've put a lot of effort over the years into making sure our test suite has stayed acceptably fast for people.

Why do we run our tests this way?

First off though, why are we making things so hard for ourselves? When we write our tests, we don't stub out our databases at all. Many of our tests are resource tests—those tests hit a real running server, the resource code issues real queries against Redis/MySQL/MongoDB/memcached containers, and if it makes any changes to those databases, we need to reset the databases fully before the next test run.

We think that the database is an integral part of the system that we're testing. When you stub a database query, that means that you're not testing the query. And I don't know about you, but I've gotten plenty of database queries wrong.

Similarly, we like to run a full server for any resource level tests. The middleware that runs for each resource matters. We want our tests to match our production environment as much as possible.

How do we make the tests fast?

You'll see recommendations online to limit this style of testing, where you have full databases that you're querying, not because it's worse, but because it ends up being too slow. We've needed to put a lot of work into test speed over the years, and if you take nothing else away from this post it should be that if you treat test speed as an organizational priority, you can make a pretty big impact.

  1. Make sure engineers have fast computers. First off, if we had done nothing else over the past 9 years, our tests would have gotten faster because computers have gotten better over that time period. And we make sure to buy nice computers for engineers on our team because we care that tests and builds are speedy. (The M1 and M2 chips for Macs have been quite nice!)

  2. Use OrbStack rather than Docker Desktop. The single easiest change to speed up our tests was switching from Docker Desktop to OrbStack to run containers locally. On Macs, it is so much faster than Docker Desktop. On some engineers' machines, tests run twice as fast on OrbStack. Aside from speed, we've found that it's been more stable for folks—fewer randomly missing volumes and needing to restart Docker to get things going again. We're huge fans.

    That said, it's worth noting that using Docker or OrbStack will still slow down your tests. If we ran databases directly on our machines rather than through containers, our tests would be faster. But the extra effort to get everyone to install and maintain MySQL, Redis, MongoDB, Memcached, and everything else just isn't worth the test speed increase that it brings for us. Other organizations might have different trade-offs.

  3. Speed up fixture resets. The slowest part of our tests is resetting fixtures. Whenever one of our tests writes to a database, we need to undo those changes before the next test starts. The core trick to doing this quickly is to only undo changes to the tables that actually changed rather than resetting every single table. All of our database operations go through the same code, so it's relatively straightforward to track which tables are "dirty" and then only reset those tables.

    A few details:

    • We tested out tracking things at the row level rather than the table level for resets, but it didn't improve performance. For MySQL, our basic table-resetting strategy is turning off foreign key checks, truncating the table, and then using LOAD DATA LOCAL INFILE to load fixture data from a volume that's mounted into the MySQL container.
    • For MongoDB resets, we found that the fastest technique was creating a shadow collection for every collection that we could restore from whenever we needed to.
    • When there's a hard MySQL delete, we don't know whether it might be a cascading delete, so we have code that counts how many rows are in each table. If a table has fewer rows after the test, we reset it. And if it doesn't have fewer rows because data has also been inserted, our regular code will have marked that table as dirty.
    • For MySQL updates, we have some (slightly janky) code to pull out the list of tables that might be updated by the update query when it's a query with multiple tables.
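    The dirty-table tracking described above can be sketched roughly like this. This is a minimal illustration, not our actual code: `DirtyTracker` and the `/fixtures` CSV path are hypothetical names.

```typescript
// A minimal, hypothetical sketch of dirty-table tracking. DirtyTracker and
// the /fixtures CSV path are illustrative, not ClassDojo's actual code.
class DirtyTracker {
  private dirty = new Set<string>();

  // The shared query layer calls this whenever a write touches a table.
  markDirty(table: string): void {
    this.dirty.add(table);
  }

  // SQL to restore one table from its fixture dump: disable FK checks,
  // truncate, bulk-load the fixture CSV back in, re-enable FK checks.
  resetStatements(table: string): string[] {
    return [
      "SET FOREIGN_KEY_CHECKS = 0",
      `TRUNCATE TABLE \`${table}\``,
      `LOAD DATA LOCAL INFILE '/fixtures/${table}.csv' INTO TABLE \`${table}\``,
      "SET FOREIGN_KEY_CHECKS = 1",
    ];
  }

  // Between tests: reset only the tables that changed, then start clean.
  drain(): string[] {
    const statements = [...this.dirty].flatMap((t) => this.resetStatements(t));
    this.dirty.clear();
    return statements;
  }
}
```

    The key property is that a test touching two tables pays for two resets, not one reset per table in the schema.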
  4. Run tests in parallel. The next important piece of keeping tests fast is being able to run the test suite in parallel, which means we need multiple copies of our databases. This was a relatively straightforward task that took a lot of blood, sweat, and tears to actually make happen. We run our tests with mocha, which supports a --parallel option, so our tests look for MOCHA_WORKER_ID in the environment to decide which database to connect to: test_db_${MOCHA_WORKER_ID}.
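    The worker-to-database mapping amounts to something like this (the `testDbName` helper is a hypothetical sketch, not our actual code):

```typescript
// Hypothetical helper showing the worker-to-database mapping; testDbName is
// illustrative, not the actual ClassDojo code.
function testDbName(
  base: string,
  env: Record<string, string | undefined>
): string {
  // mocha --parallel workers each see their own MOCHA_WORKER_ID; fall back
  // to worker 0 for non-parallel runs.
  const workerId = env.MOCHA_WORKER_ID ?? "0";
  return `${base}_${workerId}`;
}

// e.g. testDbName("test_db", process.env) on worker 3 yields "test_db_3"
```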

  5. Measure what's slow. Like any other optimization problem, the first step is measuring how long things actually take. Having guesses about why tests are slow can lead to a ton of wasted effort that doesn't actually move the needle. We haven't done any fancy profiling here—instead, we hook into our existing instrumentation to generate a report of where time is being spent over the course of our tests. It's not perfect, but it gives us a good enough sense of where time is going to be useful.
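    The shape of that report can be sketched like this; the `TimingReport` class and bucket names are assumptions for illustration, not our real instrumentation.

```typescript
// Illustrative sketch of aggregating instrumentation timings into a report;
// TimingReport and the bucket names are assumptions, not ClassDojo's code.
class TimingReport {
  private totals = new Map<string, number>();

  // Instrumentation wrappers (around queries, HTTP calls, etc.) call this.
  record(bucket: string, ms: number): void {
    this.totals.set(bucket, (this.totals.get(bucket) ?? 0) + ms);
  }

  // Buckets sorted slowest-first, which is usually enough to decide where
  // to spend optimization effort.
  report(): Array<[string, number]> {
    return [...this.totals.entries()].sort((a, b) => b[1] - a[1]);
  }
}
```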

The future

We're proud of where our tests are, but there's still a ton of space to improve things over the next 9 years. We're going to keep writing lots of tests, so we need to make sure that those tests continue to be speedy. A few things that are on our minds:

  • Setting up fixture "scenarios." Currently, we always reset our fixtures back to the same base scenario. That base scenario is difficult to change because it can impact a huge number of tests if we're tweaking fixture data that is used in a lot of spots, so it'd be nice to have better support for temporarily setting up a new "base" state that multiple tests can reference.
  • Retrying the row-level resets. In theory, these should be faster than truncating and restoring the tables, so we want to try out better profiling to make that happen.
  • Improving our Redis fixture resets. Redis is fast enough that we've been lazy with our Redis fixture resets—we just flush the db and then restore it, so there's room to improve performance.
  • Running our tests with more detailed profiling to generate a flame graph and see if there are any hotspots in our code that we could improve. Funnily enough, we've actually sped up our production app a few times while optimizing our tests—things that are slow in testing are often slow in production too!

    Image of test file with a fixtureId imported. The editor is hovering over fixtureId.teacher1Id which shows a wealth of information about which classes, students, parents, and schools that teacher is connected to.

    Our tests used to be full of hundreds of random ids like 5233ac4c7220e9000000000d, 642c65770f6aa00887d97974, 53c450afac801fe5b8000019 that all had relationships to one another. 5233ac4c7220e9000000000d was a school, right? Was 642c65770f6aa00887d97974 a parent in that school? What if I needed a teacher and a student in that school too—how did I find appropriate test entities? It wasn't an insurmountable problem—you could either poke around the CSVs that generated our fixtures or query the local database—but it slowed down writing and maintaining tests. Some engineers even had a few of these ids memorized; they'd say things like "Oh, but 000d is parent4! Let's use 002f instead. They're in a school." or "I quite like 9797, it's a solid class for story-related tests."

    To make navigating our fixture IDs and their relationships a bit simpler, I wrote a simple script to query our database and decorate names for these fixture ids (e.g., student2Id, school5Id) with the most common relationships for that entity. For a school, we show teachers, parents, students, and classes. For a parent, we show children, teachers, schools, classes, and message threads.

    /**
     * - name: Student TwoParents
     * - **1 current classes**: classroom3Id
     * - **0 past classes**:
     * - **2 parents**: parent10Id, parent11Id
     * - **1 current teachers**: teacher1Id
     * - **schoolId**: none
     */
    export const student5Id = "57875a885eb4ec6cb0184d68";
    

    Being able to write a line like import { student5Id } from "../fixtureIds"; and then hover over it to see that if we wanted a parent for that student, we could use parent10Id, makes writing tests a bit more pleasant. The script to generate this fixture-id file was pretty straightforward:

    1. Get ordered lists of all of the entities in our system and assign them names like parent22Id or teacher13Id[^1]
    2. Set up a map between an id like 57875a885eb4ec6cb0184d68 and student5Id.
    3. For each entity type, write model queries to get all of the relationship IDs that we're interested in.
    4. Use JS template strings to create nicely formatted JSDoc-decorated strings and write those strings to a file.
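    Step 4 might look roughly like this. This is an illustrative sketch: the `renderFixtureExport` helper and the relationship fields are assumptions modeled on the JSDoc comment format shown above.

```typescript
// An illustrative sketch of rendering one JSDoc-decorated export for the
// generated fixtureIds file; renderFixtureExport and StudentRelations are
// hypothetical names, not ClassDojo's actual code.
type StudentRelations = {
  name: string;
  currentClasses: string[];
  parents: string[];
};

function renderFixtureExport(
  varName: string,
  id: string,
  rel: StudentRelations
): string {
  // JS template strings build up the JSDoc block line by line.
  return [
    "/**",
    ` * - name: ${rel.name}`,
    ` * - **${rel.currentClasses.length} current classes**: ${rel.currentClasses.join(", ")}`,
    ` * - **${rel.parents.length} parents**: ${rel.parents.join(", ")}`,
    " */",
    `export const ${varName} = "${id}";`,
  ].join("\n");
}
```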

    One small pain point I ran into was migrating all of our existing test code to reference these new fixture IDs. I wrote a script to find variable declarations whose value matched /[0-9a-f]{24}/, delete those lines, update the variable name with the fixture-id name from the file, and then add an appropriate import statement to the top of the file. (Shell patterns for easy automated code migrations talks through patterns I use to do migrations like this one.)
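    The core of that migration script might look something like this hedged sketch, where `migrateLine` and the `idToName` map are hypothetical names:

```typescript
// A hedged sketch of the codemod's core; migrateLine and idToName are
// hypothetical. Callers would rewrite usages of the old variable name and
// add the matching import from "../fixtureIds".
const ID_LINE = /^const (\w+) = "([0-9a-f]{24})";$/;

function migrateLine(
  line: string,
  idToName: Map<string, string>
): { line: string; importName?: string } {
  const match = line.match(ID_LINE);
  if (!match) return { line };
  const fixtureName = idToName.get(match[2]);
  if (!fixtureName) return { line }; // unknown id: leave the line alone
  // Drop the local declaration; the fixture-id import replaces it.
  return { line: "", importName: fixtureName };
}
```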

    After setting up these initial fixtureId JSDoc comments, we've added JSDoc comments for more and more of our collections; it's proven to be a useful tool. We also set up a complementary fixtureEntities file that exports the same information as TS documents so that it's straightforward to programmatically find appropriate entities. All in all, it's made our test code nicer to work with—I just wish we'd made the change sooner!

    [^1]: Whenever we add any new IDs to our fixtures, we need to make sure that they come after the most recent fixtureID. Otherwise we'll end up with fixtureID conflicts!

      Intro

      In my first week at ClassDojo, a small bug fix I was working on presented an intriguing opportunity. A UI component was displaying stale data after an update due to a bug within a ClassDojo library named Fetchers, a React Hooks implementation for fetching and managing server state on the client. I couldn't help but ask “Why do we have a custom React Query lookalike in our codebase?” A quick peek at the Git history revealed that Fetchers predate React Query and many other similar libraries, making it our best option at the time. Four years have passed since then and we now have several other options to consider. We could stick with Fetchers, but why not use an established library that does the exact same thing? After discussing the tradeoffs with other engineers, it became apparent that React Query was a better fit for our codebase. With a vision set, we formulated a plan to migrate from Fetchers to React Query.

      The Tradeoffs, or Lack Thereof

      Like most engineering problems, deciding between Fetchers and React Query meant evaluating some tradeoffs. With Fetchers, we had complete control and flexibility over the API, designing it directly against our use cases. With React Query, we would have to relinquish control over the API and adapt to its interface. What ended up being a small downgrade in flexibility came with a huge reduction in overall cost: maintaining Fetchers involved time and effort spent writing, evolving, debugging, testing, and documenting the library, and that was not cheap. Fortunately, React Query supports all the existing use cases that Fetchers did and then some, so we're not really giving up anything.

      As if that wasn't enough to convince us, Fetchers also had a few downsides that were crucial in our decision-making. The first was that Fetchers was built on top of redux, a library we're actively working on removing from our codebase (for other, unrelated reasons). The second, a consequence of the first, is that Fetchers didn't support callbacks or promises for managing the lifecycle of mutations. Instead, we only returned the status of a mutation through the hook itself. Often, prop drilling would separate the mutation function from the status props, splitting the mutation trigger and result/error handling across separate files. Sometimes the status props were ignored completely since it wasn't immediately obvious whether a mutation already had handling set up elsewhere.

      // Fetchers Example
      
      // Partial typing to illustrate the point
      type FetcherResult<Params = unknown> = {
        start: (params: Params) => void;
        done: boolean;
        error: Error;
      }
      
      // More on this later...
      const useFetcherMutation = makeOperation({ ... });
      
      const RootComponent = () => {
        const { start, done, error }: FetcherResult = useFetcherMutation();
      
        useEffect(() => {
          if (error) {
            // handle error
          } else if (done) {
            // handle success
          }
        }, [done, error]);
      
        return (
          <ComponentTree>
            <LeafComponent start={start} />
          </ComponentTree>
        )
      }
      
      const LeafComponent = ({ start }) => {
        const handleClick = () => {
          // No way to handle success/error here, we can only call it.
          // There may or may not be handling somewhere else...?
          start({ ... });
        };
      
        return <button onClick={handleClick}>Start</button>;
      }
      

      With React Query, the mutation function itself lets the mutation trigger and its success/error handling be co-located:

      // React Query
      
      // Partial typing to illustrate the point
      type ReactQueryResult<Params = unknown, Result = unknown> = {
        start: (params: Params) => Promise<Result>;
      }
      
      // More on this later...
      const useReactQueryMutation = makeOperation({ ... });
      
      const RootComponent = () => {
        const { start }: ReactQueryResult = useReactQueryMutation();
      
        return (
          <ComponentTree>
            <LeafComponent start={start} />
          </ComponentTree>
        )
      }
      
      const LeafComponent = ({ start }) => {
        // Mutation trigger & success/error cases colocated
        const handleClick = async () => {
          try {
            const result = await start({ ... });
            // handle success
          } catch(ex) {
            // handle error
          }
        }
      
        return <button onClick={handleClick}>Start</button>;
      }
      

      Finally, Fetchers’ cache keys were string-based, which meant they couldn’t provide granular control for targeting multiple cache keys like React Query does. For example, a cache key’s pattern in Fetchers looked like this:

      const cacheKey = 'fetcherName=classStoryPost/params={"classId":"123","postId":"456"}'
      

      In React Query, we get array-based cache keys that support objects, allowing us to target certain cache entries for invalidation using partial matches:

      const cacheKey = ['classStoryPost', { classId: '123', postId: '456' }];
      
      // Invalidate all story posts for a class
      queryClient.invalidateQueries({ queryKey: ['classStoryPost', { classId: '123' }] });
      

      The issues we were facing were solvable problems, but not worth the effort. Rather than continuing to invest time and energy into Fetchers, we decided to put it towards migrating our codebase to React Query. The only question left was “How?”

      The Plan

      At ClassDojo, we have a weekly “web guild” meeting for discussing, planning, and assigning engineering work that falls outside the scope of teams and their product work. We used these meetings to drive discussions and gain consensus around a migration plan and divvy out the work to developers.

      To understand the plan we agreed on, let’s review Fetchers. The API consists of three primary functions: makeMemberFetcher, makeCollectionFetcher, and makeOperation. Each is a factory function for producing hooks that query or mutate our API. The hooks returned by each factory function are almost identical to React Query’s useQuery, useInfiniteQuery, and useMutation hooks. Functionally, they achieve the same things, but with different options, naming conventions, and implementations. The similarities between the hooks returned from Fetchers’ factory functions and React Query made for the perfect place to target our migration.

      The plan was to implement alternate versions of Fetchers' factory functions with the same API interfaces, but using React Query hooks under the hood. By doing so, we could ship both implementations simultaneously and use a feature switch to toggle between the two. Additionally, we could rely on Fetchers' unit tests to catch any differences between the two.

      Our plan felt solid, but we still wanted to be careful in how we rolled out the new implementations so as to minimize risk. Given that we were rewriting each of Fetchers' factory functions, each had the possibility of introducing its own class of bugs. On top of that, our front end had four different apps consuming the Fetchers library, layering on additional usage patterns and environmental circumstances. Spotting errors thrown inside the library code is easy, but spotting errors that cascade out to other parts of the app as a result of small changes in behavior is much harder. We decided on a phased rollout of each factory function, one at a time and app by app, so that any error spikes would be isolated to one implementation or one app at a time, making it easy to spot which implementation had issues. Below is some pseudocode that illustrates the sequencing of each phase:

      for each factoryFn in Fetchers:
        write factoryFn using React Query
        for each app in ClassDojo:
          rollout React Query factoryFn using feature switch
          monitor for errors
          if errors:
            turn off feature switch
            fix bugs
            repeat
      

      What Went Well?

      Abstractions made the majority of this project a breeze. The factory functions provided a single point of entry to replace our custom logic with React Query hooks. Instead of having to assess all 365 usages of Fetcher hooks, their options, and how they map to a React Query hook, we just had to ensure that the hook returned by each factory function behaved the same way it did before. Additionally, swapping implementations between Fetchers and React Query was just a matter of changing the exported functions from Fetchers’ index file, avoiding massive PRs with 100+ files changed in each:

      // before migration
      
      export { makeMemberFetcher } from './fetchers';
      
      // during migration
      
      import { makeMemberFetcher as makeMemberFetcherOld } from './fetchers';
      import { makeMemberFetcher as makeMemberFetcherNew } from './rqFetchers';
      
      const makeMemberFetcher = isRQFetchersOn ? makeMemberFetcherNew : makeMemberFetcherOld;
      
      export { makeMemberFetcher };
      

      Our phased approach played a big role in the success of the project. The implementation of makeCollectionFetcher worked fine in the context of one app, but surfaced some errors in the context of another. It wasn't necessarily easy to know what was causing the bug, but the surface area we had to scan for potential problems was much smaller, allowing us to iterate faster. Phasing the project also naturally lent itself to parallelizing the development process and getting many engineers involved. Getting the implementations of each factory function to behave exactly the same as before was not an easy process. We went through many iterations of fixing broken tests before the behavior matched up correctly. Doing all of that with a single engineer would have been a slow and painful process.

      How Can We Improve?

      One particular pain point with this project was Fetchers' unit tests. Theoretically, they should have been all we needed to verify the new implementations. Unfortunately, they were written with dependencies on implementation details, making it difficult to run them against a new implementation. I spent some time trying to rewrite them, but quickly realized the effort wasn't worth the payoff. Instead, we relied on unit and end-to-end tests throughout the application that naturally hit these codepaths. The downside was that we spent a lot of time stepping through and debugging those other tests to understand what was broken in our new implementations. This was a painful reminder to write unit tests that only observe inputs and outputs.

      Another pain point was the manual effort involved in monitoring deployments for errors. When we rolled out the first phase of the migration, we realized it wasn't so easy to tell whether we were introducing new errors. There was a lot of existing noise in our logs, which meant babysitting the deployments and checking reported errors to confirm whether each error was new. We also realized we didn't have a good mechanism for scoping our error logs down to the latest release only. We've since augmented our logs with better tags to make it easier to query for the "latest" version. We've also set up a weekly meeting to triage error logs to specific teams so that we don't end up in the same situation again.

      What's Next?

      Migrating to React Query was a huge success. It freed us from maintaining a complex chunk of code that very few developers even understood. Now we've started asking ourselves, "What's next?" We've already started using lifecycle callbacks to deprecate our cache invalidation and optimistic update patterns. Those patterns were built on top of redux to subscribe to lifecycle events in Fetchers' mutations, but now we can simply hook into the onMutate, onSuccess, and onError callbacks provided by React Query. Next, we're going to look at using async mutations to simplify how we handle the UX for success and error cases. There are still a lot of patterns left over from Fetchers, and it will be a continued effort to rethink how we can simplify things using React Query.

      Conclusion

      Large code migrations can be really scary. There’s a lot of potential for missteps if you’re not careful. I personally believe that what made this project successful was treating it like a refactor. The end goal wasn’t to change the behavior of anything, just to refactor the implementation. Trying to swap one for the other without first finding their overlap could have made this a messy project. Instead, we wrote new implementations, verified they pass our tests, and shipped them one by one. This project also couldn’t have happened without the excellent engineering culture at ClassDojo. Instead of being met with resistance, everyone was eager and excited to help out and get things moving. I’m certain there will be more projects like this to follow in the future.
