Posts By: Andrew Burgess

Engineering Dojo, Episode 1: Building Payment Systems

As you might know, ClassDojo offers a paid subscription for parents, to give them extra tools for interacting with both their teachers and their kids. For several years, we built and supported a system for managing those subscriptions. That’s non-trival work: it involves integrations with several app stores and payment providers and wrangling different data models for each.

In the last year, we’ve switched to a third-party solution: RevenueCat. It’s not easy switching payment systems when you’ve got several years of data in the old system, and need to keep it running until the new integration is ready to go.

I recently sat down with Urjit and Sarah, two of the engineers who led this project, and asked them about the challenges they faced and the lessons they learned as they replaced the wheels of our bus without making a pit stop. You can listen to the conversation below:

Listen to Episode 1, Building Payment Systems

Note: we're still in the process of setting up a podcast feed, so stay tuned for that!

Engineering Dojo Podcast

I recently had to change 189 files in our code base, all in almost the same way. Rather than doing it manually, I decided to brush up on my command-line text manipulation ... and ended up taking it further than I expected.

The Mission

The changes were pretty simple. In our API code, we have TypeScript definitions for every endpoint. They look something like this:

1interface API {
2 "/api/widget/:widgetId": {
3 GET: {
4 params: {
5 widgetId: MongoId;
6 };
7 response: WidgetResponse;
8 }
9 }

You'll notice the params are defined twice: once in the URL key string (as :widgetId) and again in the GET attribute (under params); we are moving to a TypeScript template literal string parser to get the type information out of the URL key string itself, and so I wanted to remove the params key from these definitions. But with 189 files to change, the usual manual approach wasn't so inviting.

So, I set myself the challenge of doing it via the command line.

Step 1: Remove the lines

I'll be honest, when I started, this was the only step I had in mind. I needed to do a multi-line find-and-replace, to remove params: { ... }; a quick grep showed me that this pattern was unique to the places I wanted to change; however, I could have narrowed the set of files I was searching to just our endpoints in src/resources if necessary. For doing the replacement, I thought sed might be the right tool, but new lines can be challenging to work with ... so I ended up learning my first bit of perl to make this work.

Here's what I ended up doing (I've added line breaks for readability):

1grep -r --files-with-matches "params: {" ./src | while read file;
2 do
3 perl -0777 -pi -e 's/ *params: {[^}]*};\n//igs' "$file";
4 done

This one-liner uses grep to recursively search my src directory to find all the files that have the pattern I want to remove. Actually, I usually reach for ag (the silver searcher) or ripgrep, but grep is already available pretty much everywhere. Then, we'll loop over the files and use perl to replace that content.

Like I said, this was my first line of perl, but I'm fairly sure it won't be my last. This technique of using perl for find-and-replace logic is called a perl pie. Here's what it does:

  • 0777 means perl will read in the entire file
  • p wraps that one-liner in the conventional perl script wrapper.
  • i means that perl will change the file in place; if you aren't making this change in a git repo like I am, you can do something like i.backup and perl will create a copy of the original file, so you aren't making an irreversible change.
  • e expects an argument that is your one-line program

Oh, and the program itself:

1s/ *params: {[^}]*};\n//igs

This is typical 's/find/replace/flags' syntax, and you know how regexes work. The flags are global, case-insensitive, and single-line (where . will also match newlines).

So, this changed the 189 files, in exactly the way I wanted. At this point, I was feeling great about my change. Reviewed the changes, committed it and started the git push.

Step 2: Remove unused imports

Not so fast. Our pre-push hooks caught a TypeScript linting issue:

1error TS6133: 'MongoId' is declared but its value is never read.
35 import { MongoId } from "our-types";
4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ah, yeah, that makes sense. URL parameters are strings, but we have a MongoId type that's a branded string. I forgot about this step, but that's why we have pre-push checks! We'll need to remove those imports.

How can we do this? Well, let's get a list of the files we changed in our most recent commit:

1git show --name-only | grep ^src

We add the grep to only find the files within our top-level src directory (and to remove the commit information).

Then, we need to find all the files that include MongoId only once. If a file references MongoId multiple times, then we don't want to remove the import, because clearly we're still using it. If the file only references MongoId once, we can remove the import ... but we have to consider that it might not be the only thing we're importing on that line. For starters, grep's -c flag to count the number of occurrences per file.

1for file in $(git show --name-only | grep ^src)
2 do
3 grep -c MongoId "$file"
4 done

A simple for loop works here, because I know the only whitespace is the linebreaks between the file names. Once we have the count, we can check to see that there's only 1 match:

1for file in $(git show --name-only | grep ^src)
2 do
3 if [ $(grep -c MongoId "$file") = 1 ]; then; echo "..."; fi
4 done

We're using an if statement here, to check that the occurrence count is 1. If it is, we want to do something. But what? Remember, we might be importing multiple things on that line, so that leaves us with three possible actions:

  1. Remove the whole line when MongoId is the only item imported.
  2. Remove MongoId, when it's the first item imported on that line. Don't miss that following comma!
  3. Remove , MongoId when it's not the first item on the that line. Don't miss the preceding comma!

There are many ways we could do this, so let's have some fun with reading input from the command line! To be clear, this isn't the best way to do it. We could easily match our three cases above with perl or sed. But we've already used that pattern in this project, and reading input in a shell script is an incredibly useful tool to have in your toolbox.

At this point, we probably want to move this into an actual shell script, instead of running it like a one-off on the command line:

3for file in $(git show --name-only | grep ^src)
4 do
5 if [ $(grep -c MongoId "$file") = 1 ]
6 then
7 echo ""
8 echo "====================="
9 echo "1 - remove whole line"
10 echo "2 - remove first import"
11 echo "3 - remove other import"
12 echo ""
13 echo "file: $file"
14 echo "line: $(grep MongoId "$file" | grep -v "^//")"
15 echo -n "> "
17 read choice
19 echo "your choice: $choice"
21 case "$choice" in
22 1)
23 sed -i '' "/MongoId/d" "$file";
24 ;;
25 2)
26 perl -i -pe "s/MongoId, ?//" "$file";
27 ;;
28 3)
29 perl -i -pe "s/, ?MongoId//" "$file";
30 ;;
31 *)
32 echo "nothing, skipping line"
33 ;;
34 esac
35 fi

Don't be intimidated by this, it's mostly echo statements. But we're doing some pretty cool stuff here.

Inside our if statement, we start by echoing some instructions, as well as the file name and the line that we're about to operate on. Then, we read an input from the command line. At this point, the script will pause and wait for us to type some input. Once we hit <enter> the script will resume and assign the value we entered to our choice variable.

Once we have determined our choice, we can do the correct replacement using the bash equivalent of a switch/case statement. For case 1, we're using sed's delete line command d. For cases 2 and 3, we'll use perl instead of sed, because it will operate only on the matched text, and not on the whole line. Finally, the default case will do nothing.

Running this script, we can now walk through the files, one by one, and review each change. It reduces our work to one keystroke per file, which is way less than opening each file, finding the line, removing the right stuff.

And that's it! While we don't use command-line editing commands every day, keeping these skills sharp will speed up your workflow when the right task comes along.

You may have read our post from a few years ago implementing a rolling-window rate limiter, where we talked about the implementation of our sliding log rate limiter. That approach has been working well since then, but with increases in traffic, we recently found ourselves rebuilding the rate limiter to improve performance. With some relatively small changes, we were able to improve our redis CPU usage by 40x!

Our requirements haven't really changed too much since that last post. We still want a rolling window. And we still need a distributed solution, which means we're sticking with redis. The final requirement from last time was having a minimum distance between requests; turns out, this hasn't been important for us. However, if that's something you need, do stay tuned until the end!

The Trouble with Sets

The challenge with our rate limiting approach (you can find the details in our previous post) was mainly caused by the expensive set operations that we were doing in redis: zrange and zadd have a time complexity of O(log(n)), which isn't bad, but zremrangebyscore is O(log(N) + M)! This operation removes M number of items from a sorted set, and in our case, we were using it to remove all request records from greater than one interval ago. So, as a user's request rate gets higher, M will get higher as well ... and instead of protecting us, the rate limiter could actually assist a single IP address in executing a successful DoS attack! As it was, our monitoring alerted us before that, but it still resulted in requests queuing up and timing out in our API and upstream in HAProxy.

The nice thing about the sorted-set-based rate limiting is that it has very "high resolution" rate limiting: it's actually tracking request timestamps, and so we can know exactly how many requests have happened in the given time interval. This means that we can know that, say, it's been exactly one interval since your last request, and let your new request go through. However, due to these performance issues, we decided to go with an approach that optimizes performance, while accepting a lower resolution result.

Sliding Windows

The solution we've switched to is called a sliding window, or sliding counter, algorithm. Instead of storing a sorted set in redis with a rateId key, we're storing simple integers in redis with a rateId-interval key.

The rateId is some way to identify or bucket the actions we're trying to limit. If it's a global rate limit, that might be the client IP address. If it's resource specific, it might be both the client IP address and the path, like for example.

In our new approach, here's what happens when a request occurs:

  1. Based on the request, we can determine two keys: for the previous interval and for the current interval.
  2. We GET the from redis (if it does not exist, we will default to 0).
  3. We INCR the from redis (if it does not exist, redis will default to 0, then increment it, and then return 1).
  4. We sum these two counts together, multiplying the previous count by a weight. For example, if we're 42% of the way through the current interval, we'll multiply the previous count by 100% - 42% = 58%, or 0.58. The result is the estimate of the number of actions taken by that user in the last interval.

Essentially, we're using the previous and current interval counts to create a "synthetic" interval. And you can see how this is lower-resolution: we don't know the exact time of past requests, only the interval. This means that a user may have to wait up to 2 whole intervals to be able to max out in new requests.

For example, let's say your limit is 100 requests per 1 minute. And let's say a user made 100 requests in the first 0.25 minutes of an interval, then stopped, and didn't make another request for 1 minute. At 1.25 minutes (0.25 minutes into the second interval) our previous interval count will be 100, and our current interval count will be 1 (because we just incremented it for the current request). So our "number of actions taken in the last interval" will be 1 + (0.75*100) or 76. We would only allow the user to make 24 more requests (and the current one would succeed, so 25 in total).

If the user waited until the 1.75 minutes, the math would be 1 + (0.25*100) and we would allow 74 (plus the 1 current) requests.

1T0 T1 T2
3 | | |
4 100 reqs 25 reqs allowed 75 reqs allowed

Of course, this low resolution can work against us as well. Let's say those initial 100 requests had come in right at the end of the first interval, say, at 0.99 minutes. At 1.25 minutes (only ~16 seconds later), we'd still allow 25 requests.

Is that acceptable? Well, depends on your infrastructure and application. But, there is a way to tweak this.

Tweaking the Resolution

Notice that this produces the lowest possible resolution for a sliding window approach, because we track one counter per interval; this is why our practical interval is twice as long as we actually configure.

We could improve the resolution here by tracking multiple counters per interval. For example, If your limit is 100 requests per 1 minute, you could track intervals of 30 seconds in redis. Then, for a given request, you'll get the current interval (n) and the previous two intervals (n-1 and n-2), and weight the n-2 count in the same way. Try this math with the first example above, and you'll see that at 1.25 minutes, we would allow 50 requests instead of just 25.

Or, in our second example, if those 100 request came in at 0.99 minutes, we would still be blocking all requests at 1.25 minutes.

Basically, the smaller your interval counter, the higher the resolution of your result. It'd be pretty simple to parameterize that counter size to allow custom resolutions per endpoint, client, and so on.

What about minimum distance?

One feature that our previous rate limiter supported—which we dropped here—is enforcing a minimum amount of time between two individual requests from the same client. Since we're no longer tracking timestamps for individual requests, we can't really support that.

Or can we?

Actually, it's pretty simple: just add a second limit! For example, a given endpoint might have limits of 100 requests per minute, and 2 requests per second. By configuring both of these, we can ensure a particular cadence of requests from a given client.

Performance Wins

The performance improvement that we've seen in our redis usage has been pretty amazing. In our staging environment tests (using jmeter), we consistently see ~1% Engine CPU Utilization with the new rate limiter, where the old one sees ~40% with the same load.

It's not hard to see why. GET and INCR are O(1) operations in redis, and will remain fast no matter how much traffic we're seeing.

Performance improvements almost always come with a compromise: sometimes, that's increased complexity or a reduced feature set. In this case, we were willing to sacrifice some accuracy in order to keep our response times snappy, even under load.