Even Better Rate Limiting

You may have read our post from a few years ago about our sliding log rate limiter, which gave us a rolling window. That approach has been working well since then, but as traffic grew, we recently found ourselves rebuilding the rate limiter to improve performance. With some relatively small changes, we were able to cut our redis CPU usage by a factor of 40!

Our requirements haven't really changed too much since that last post. We still want a rolling window. And we still need a distributed solution, which means we're sticking with redis. The final requirement from last time was having a minimum distance between requests; turns out, this hasn't been important for us. However, if that's something you need, do stay tuned until the end!

The Trouble with Sets

The challenge with our rate limiting approach (you can find the details in our previous post) was mainly caused by the expensive sorted-set operations we were doing in redis: zrange and zadd have a time complexity of O(log(N)), which isn't bad, but zremrangebyscore is O(log(N) + M)! That operation removes M items from a sorted set, and we were using it to remove all request records older than one interval. So, as a user's request rate gets higher, M gets higher as well ... and instead of protecting us, the rate limiter could actually assist a single IP address in executing a successful DoS attack! In practice, our monitoring alerted us well before that point, but the extra load still caused requests to queue up and time out in our API and upstream in HAProxy.

The nice thing about sorted-set-based rate limiting is its very high resolution: it tracks individual request timestamps, so we know exactly how many requests have happened in the given time interval. That means we can know that, say, it's been exactly one interval since your last request, and let your new request go through. However, given these performance issues, we decided to move to an approach that optimizes for performance while accepting a lower-resolution result.

Sliding Windows

The solution we've switched to is called a sliding window, or sliding counter, algorithm. Instead of storing a sorted set in redis with a rateId key, we're storing simple integers in redis with a rateId-interval key.

The rateId is some way to identify or bucket the actions we're trying to limit. If it's a global rate limit, that might be the client IP address. If it's resource specific, it might be both the client IP address and the path, like 1.2.3.4:/api/widgets for example.

In our new approach, here's what happens when a request occurs:

  1. Based on the request, we can determine two keys: 1.2.3.4:/api/widgets-T0 for the previous interval and 1.2.3.4:/api/widgets-T1 for the current interval.
  2. We GET the 1.2.3.4:/api/widgets-T0 key from redis (if it does not exist, we default to 0).
  3. We INCR the 1.2.3.4:/api/widgets-T1 key in redis (if it does not exist, redis creates it at 0, increments it, and returns 1).
  4. We sum these two counts, multiplying the previous count by a weight. For example, if we're 42% of the way through the current interval, we multiply the previous count by 100% - 42% = 58%, or 0.58. The result is our estimate of the number of actions taken by that user in the last interval; if it exceeds the limit, we reject the request.

Essentially, we're using the previous and current interval counts to create a "synthetic" sliding interval. You can see how this is lower resolution: we don't know the exact time of past requests, only which interval they fell in. In practice, this means a user may have to wait up to two whole intervals before they can use their full limit again.
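
Here's a minimal sketch of that flow in TypeScript using the ioredis client. The helper name, exact key scheme, and TTL handling here are illustrative assumptions, not necessarily how our production limiter is written:

import Redis from "ioredis";

const redis = new Redis();

// Illustrative helper: returns true if the request should be allowed.
// rateId is something like "1.2.3.4:/api/widgets"; limit is the max number
// of requests allowed per intervalMs.
async function isAllowed(rateId: string, limit: number, intervalMs: number): Promise<boolean> {
  const now = Date.now();
  const currentInterval = Math.floor(now / intervalMs);
  const previousKey = `${rateId}-${currentInterval - 1}`;
  const currentKey = `${rateId}-${currentInterval}`;

  // fraction of the way through the current interval, e.g. 0.42 for 42%
  const elapsed = (now % intervalMs) / intervalMs;

  const [previousCount, currentCount] = await Promise.all([
    redis.get(previousKey), // previous interval's count (null if it never existed)
    redis.incr(currentKey), // count this request in the current interval
  ]);
  // let old counters expire once they can no longer affect a window
  await redis.pexpire(currentKey, intervalMs * 2);

  const estimate = currentCount + (Number(previousCount) || 0) * (1 - elapsed);
  return estimate <= limit;
}

Note that in this sketch even rejected requests increment the current counter, which makes the limiter stricter for clients that keep hammering it.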

For example, let's say your limit is 100 requests per 1 minute. And let's say a user made 100 requests in the first 0.25 minutes of an interval, then stopped, and didn't make another request for 1 minute. At 1.25 minutes (0.25 minutes into the second interval) our previous interval count will be 100, and our current interval count will be 1 (because we just incremented it for the current request). So our "number of actions taken in the last interval" will be 1 + (0.75*100) or 76. We would only allow the user to make 24 more requests (and the current one would succeed, so 25 in total).

If the user waited until 1.75 minutes, the math would be 1 + (0.25*100) = 26, and we would allow 74 more requests (plus the 1 current, for 75 in total).

T0                              T1                              T2
|-------:-------:-------:-------|-------:-------:-------:-------|
        |                               |               |
     100 reqs                    25 reqs allowed   75 reqs allowed

Of course, this low resolution can work against us as well. Let's say those initial 100 requests had come in right at the end of the first interval, say, at 0.99 minutes. At 1.25 minutes (only ~16 seconds later), we'd still allow 25 requests.

Is that acceptable? Well, that depends on your infrastructure and application. But there is a way to tweak it.

Tweaking the Resolution

Notice that this gives the lowest possible resolution for a sliding-window approach, because we track just one counter per interval; it's why the window we enforce in practice can be twice as long as the interval we configure.

We could improve the resolution here by tracking multiple counters per interval. For example, if your limit is 100 requests per 1 minute, you could track intervals of 30 seconds in redis. Then, for a given request, you'll get the current interval (n) and the previous two intervals (n-1 and n-2), and weight the oldest (n-2) count in the same way. Try this math with the first example above, and you'll see that at 1.25 minutes, we would allow 50 requests instead of just 25.

Or, in our second example, if those 100 requests had come in at 0.99 minutes, we would still be blocking all requests at 1.25 minutes.

Basically, the smaller your counter buckets, the higher the resolution of your result. It'd be pretty simple to parameterize the bucket size to allow custom resolutions per endpoint, per client, and so on.
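
Extending the sketch above (and reusing its redis client), the parameterized version might look something like this; with bucketsPerInterval = 1 it reduces to the two-key approach:

async function isAllowedHighRes(
  rateId: string,
  limit: number,
  intervalMs: number,
  bucketsPerInterval: number
): Promise<boolean> {
  const bucketMs = intervalMs / bucketsPerInterval;
  const now = Date.now();
  const currentBucket = Math.floor(now / bucketMs);
  const elapsed = (now % bucketMs) / bucketMs; // fraction of the way through the current bucket

  // count this request in the current bucket
  const currentCount = await redis.incr(`${rateId}-${currentBucket}`);
  await redis.pexpire(`${rateId}-${currentBucket}`, intervalMs * 2);

  // fetch the previous `bucketsPerInterval` buckets; only part of the oldest one
  // still falls inside the sliding window, so it gets the (1 - elapsed) weight
  const previousKeys: string[] = [];
  for (let i = 1; i <= bucketsPerInterval; i++) {
    previousKeys.push(`${rateId}-${currentBucket - i}`);
  }
  const previousCounts = (await redis.mget(...previousKeys)).map((count) => Number(count) || 0);

  let estimate = currentCount;
  previousCounts.forEach((count, i) => {
    const isOldest = i === previousCounts.length - 1;
    estimate += isOldest ? count * (1 - elapsed) : count;
  });
  return estimate <= limit;
}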

What about minimum distance?

One feature that our previous rate limiter supported—which we dropped here—is enforcing a minimum amount of time between two individual requests from the same client. Since we're no longer tracking timestamps for individual requests, we can't really support that.

Or can we?

Actually, it's pretty simple: just add a second limit! For example, a given endpoint might have limits of 100 requests per minute, and 2 requests per second. By configuring both of these, we can ensure a particular cadence of requests from a given client.
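
In code, that can be as simple as checking both windows with the earlier sketch, using key prefixes to keep the two counters separate:

async function isAllowedWithCadence(rateId: string): Promise<boolean> {
  const [perMinuteOk, perSecondOk] = await Promise.all([
    isAllowed(`minute:${rateId}`, 100, 60 * 1000), // 100 requests per minute
    isAllowed(`second:${rateId}`, 2, 1000),        // 2 requests per second
  ]);
  return perMinuteOk && perSecondOk;
}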

Performance Wins

The performance improvement that we've seen in our redis usage has been pretty amazing. In our staging environment tests (using jmeter), we consistently see ~1% Engine CPU Utilization with the new rate limiter, whereas the old one sees ~40% under the same load.

It's not hard to see why. GET and INCR are O(1) operations in redis, and will remain fast no matter how much traffic we're seeing.

Performance improvements almost always come with a compromise: sometimes, that's increased complexity or a reduced feature set. In this case, we were willing to sacrifice some accuracy in order to keep our response times snappy, even under load.

    When multiple teams share a resource like a code-base, a database, or an analytics pipeline, that resource can often suffer from the Tragedy of the Commons. Well intentioned engineers can add slow code, not prioritize important improvements because those improvements aren't aligned with team incentives, and just generally not leave the codebase or database better than they found it.

    ClassDojo has a monolithic backend web-server API, but it and its backing databases often suffered from hard to diagnose performance problems because engineers would accidentally add code with poorly indexed queries, with database fanouts, or that blocked the event loop, and they wouldn't realize that that code was causing any problems. We intentionally made our Model layer request-agnostic, which helps us create better tests & simpler code, but it meant that we didn't have a way of instrumenting database & performance cost by request until Node 14 started supporting AsyncLocalStorage.

    AsyncLocalStorage "creates stores that stay coherent through asynchronous operations." This lets us reference a consistent context over the course of a request to get information about the route, store information about how many database requests that request has triggered, count how many documents we've fetched, and dramatically increase our visibility into what every request is doing. Because every single one of our routes is tied to a product area, and each of our product areas is owned by a team, we were suddenly able to give each team insight into any performance problems they'd inadvertently added to the codebase. None of our engineers and teams ever wanted to cause performance problems, but the lack of pre-AsyncLocalStorage system legibility meant that these issues were difficult to track down, and teams always had higher priority work than tracking down unattributed performance problems.

    When we set up our AsyncLocalStorage per-request instrumentation system, we found some crazy things:

    • a route that occasionally made 30,000+ database requests because it was fanning out over a large list of items
    • another route that blocked the event-loop for 15-20 seconds a few times a day, and caused timeouts for any other requests that our server was handling at the same time
    • a third route that was occasionally fetching 500,000+ items to render information to return to clients
    • and plenty of other routes that weren't doing things nearly this egregious, but were still bad enough that we wanted to fix them

    Once we had this instrumentation, it was incredibly simple for the right team to find and fix these issues. We'd looked into some of these & similar issues before, but finding & fixing them was incredibly time-consuming. Now, we have a single log that triggers an alert and tells you everything you need to know to track down the problem.

    What did we track?

    1. Whenever we query one of our databases, we now include a comment with the query with the route name that triggered the request. When we go through our database's slow query logs, this comment shows us where the query actually came from, and lets us accurately attribute our database load. This makes our database optimization work so much simpler because many of our query patterns could be caused by many different routes!
    2. We also track:
    • the number of database queries made per database and per table-model over the course of a request
    • the number of items returned over the course of a request per database and per table-model
    • the total duration spent querying per-database and per table-model

    Whenever we exceed safe thresholds for any of these metrics over the course of a request, we log out the details and send an alert to the appropriate team. We also send all of this data to our data warehouse, Redshift, so we can run more detailed queries about what each route is doing.

    3. We track overall event-loop-lag per container, and whenever event-loop-lag crosses 1 second (which is incredibly high!), we log out the same AsyncLocalStorage based request-cost details for all recent requests. This always points to situations where we've fetched a lot of data, and are then doing something computationally complex with it.
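
    To make the query tagging and threshold alerting above concrete, here's a rough sketch building on the context object from the previous snippet; the threshold values and logging are purely illustrative:

    // tag each query with the route that triggered it, so slow-query logs are attributable
    function withRouteComment(query: string): string {
      const ctx = requestContext.getStore();
      return ctx ? `${query} /* route: ${ctx.routeName} */` : query;
    }

    // at the end of a request, compare accumulated counts against per-metric thresholds
    const THRESHOLDS = { dbQueryCount: 1000, documentsFetched: 10000 };

    function reportRequestCost(ctx: RequestContext) {
      for (const [metric, max] of Object.entries(THRESHOLDS)) {
        const value = ctx[metric as keyof typeof THRESHOLDS];
        if (value > max) {
          // in production this would alert the team that owns ctx.routeName
          console.warn("request exceeded cost threshold", { route: ctx.routeName, metric, value, max });
        }
      }
    }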

    Monoliths need Transparency

    It's hard to overestimate how much using AsyncLocalStorage like this has helped us improve our backend web-server's performance. One of the major drawbacks to a monolith like ours can be that teams aren't able to easily isolate their code's performance from that of the rest of the monolith. Having insight into what is actually happening on a granular level is what's going to allow us to continue having an easy time scaling our monolith for a long time.

      When the ClassDojo engineering team started using TypeScript on our backend, I was incredibly frustrated. It felt like I needed to constantly work around the type system, and it felt like the types weren't providing any value at all. And, they weren't providing any value to me: I had fundamental misconceptions about how TypeScript worked, and my attempts to work around the Type system by using lots of any and as created some truly abominable TypeScript code.

      Type Inference

      Whenever you create a variable using let, const, or var, and don't provide a type, TypeScript will infer the type: this means it will use its best guess for the type. These best guesses are pretty good, but they aren't magic!

      If you use let or var, TypeScript will assume that it can re-assign to any similar value. If you write let method = "GET";, TypeScript will infer that method is a string, and will let you later reassign method: method = "KonMari". If instead you use const, const method = "GET", TypeScript will infer that method is of type "GET". This means when you use let, you'll often need to use a type. let method: HTTPMethod = "GET" will allow only type-safe reassignment.

      When TypeScript infers types for Objects and Arrays, it will similarly give its best guess. In JavaScript, objects are mutable: if you set up a variable like const request = { method: "GET" }, TypeScript will assume that the type you want is { method: string } so that you can update the method field. A type of { method: string } won't be usable by something that wants { method: HTTPMethod }, so you need to either explicitly tell it the type (const request: { method: HTTPMethod } = { method: "GET" }), or tell TypeScript that it's safe to infer a stricter type (const request = { method: "GET" as const }).
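
      For example (this snippet isn't from the handbook or our codebase; HTTPMethod is assumed to be a union of the standard method strings):

      type HTTPMethod = "GET" | "PUT" | "POST" | "DELETE";

      let method = "GET";        // inferred as string, so reassignment to any string is allowed
      const constMethod = "GET"; // inferred as the literal type "GET"

      const request = { method: "GET" };                              // { method: string }
      const typedRequest: { method: HTTPMethod } = { method: "GET" }; // explicit annotation
      const constRequest = { method: "GET" as const };                // { method: "GET" }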

      Here's a TypeScript playground that explores type inference in a bit more depth, and the type inference part of the handbook is great if you want a detailed introduction.

      Why isn't it narrowing?

      One of the things that I found most frustrating about TypeScript was writing a clause that felt like it should have narrowed the type of a variable and not have it actually narrow. I'd often use as to bypass the type system entirely and force something to narrow, but overriding TypeScript like this is a real anti-pattern. When I finally learned about type narrowing, and type guards, TypeScript became so much more pleasant to use.

      One of the very first things I learned about JavaScript was that you should never use in with Objects because in doesn't differentiate between an object's properties and those of its prototype. In TypeScript, in is crucial to use for type narrowing. If you have a function that takes a type like Error | (Error & { type: "NotFound", table: string }) | (Error & { type: "NotAllowed", reason: string }), you can't write if (!err.type) return because TypeScript doesn't know whether err has that field or not. Instead, you need to write if (!("type" in err)) return and TypeScript won't error, and will correctly narrow the type.

      One related confusion was why I couldn't use if statements to narrow a type. I'd try to write code like the following:

      Don't do this!

      type NotFound = Error & { NotFound: true; table: string };
      type ServerError = Error & { ServerError: true; message: string };
      type ClientError = Error & { ClientError: true; status: number };
      
      function getResponse(err: Error | NotFound | ServerError | ClientError) {
        if ((err as ClientError).ClientError) {
          return { status: (err as ClientError).status, message: "client error" };
        }
        if ((err as NotFound).NotFound) {
          const notFoundError = err as NotFound;
          return {
            status: 404,
            message: `not found in ${notFoundError.table}`
          }
        }
      
        return {
          status: 500,
          message: (err as ServerError).message || "server error",
        }
      }
      

      TypeScript can automatically discriminate a union type if the members of a union all share a field. This can lead to much simpler, type-safe, and easier to work with code!

      
      type NotFound = Error & { type: "NotFound", table: string };
      type ServerError = Error & { type: "ServerError", message: string; };
      type ClientError = Error & { type: "ClientError", status: number; };
      
      function getResponse(err: Error | NotFound | ServerError | ClientError) {
        // doing this first lets TypeScript discriminate using the `type` property
        if (!("type" in err)) {
          return {
            status: 500,
            message: "server error",
          }
        }
      
        if (err.type === "ClientError") {
          // err now has type ClientError, so we can use status!
          return { status: err.status, message: "client error" };
        }
      
        if (err.type === "NotFound") {
          return {
            status: 404,
            message: `not found in ${err.table}`
          }
        }
      
        // it even narrows this to ServerError!
        // although it may be wiser to have an explicit `if` statement and assert that every case is handled
        return {
          status: 500,
          message: err.message,
        }
      }
      

      Finally, if there are more complex types that you need to narrow, writing a custom type guard function is always an option! This is a function that returns a boolean that lets TypeScript narrow a type whenever you call it!

      // assuming an HTTPMethod union like this one:
      type HTTPMethod = "GET" | "PUT" | "POST" | "DELETE";

      // `method is HTTPMethod` tells callers that any string that you've checked with this function is an HTTPMethod
      function isHTTPMethod(method: string): method is HTTPMethod {
        return ["GET", "PUT", "POST", "DELETE"].includes(method);
      }
      

      Here's a TypeScript playground where you can see how this works.

      "Brands" are required for Nominal Typing

      TypeScript uses "structural typing", which means that only the shape of something matters, and the name of the type is completely irrelevant. If something looks like a duck, it's a duck as far as TypeScript is concerned. This is surprising if you're coming from a "nominal typing" background where two types with the same shapes can't be assigned to each other.

      // this is valid TypeScript
      type EmailAddress = string;
      type Url = string;
      const emailAddress: EmailAddress = "myEmail@classdojo.com";
      const url: Url = emailAddress;
      

      If you want nominal-style types for a string or number, you need to use "Brands" to create "impossible" types.

      type EmailAddress = string & { __impossible_property: "EmailAddress" };
      type URL = string & { __impossible_property: "Url" };
      const emailAddress = "myEmail@classdojo.com" as EmailAddress; // we need to use `as` here because the EmailAddress type isn't actually possible to create in JS
      const url: URL = emailAddress; // this now errors the way we'd want
      

      (The above isn't how you'd actually want to write this code: I'm writing it in a way that hopefully makes it a bit more clear what's going on. If you'd like to see a more realistic example, take a look at this playground)

      TypeScript isn't just "JS with types"

      TypeScript is a great language, but just treating it as "JS with types" is a surefire recipe for frustration. It takes time to learn the language, and I wish I'd spent more of that time up-front reading through the docs: it'd have saved a ton of pain. The TypeScript handbook is great, and there are tons of great resources online to help understand TypeScript better. There's still a ton I don't know about TypeScript, but finally learning about inference, type narrowing, and structural typing has made developing with TypeScript so much nicer.
