Posts By: Will Keleher

The 3 things I didn't understand about TypeScript

When the ClassDojo engineering team started using TypeScript on our backend, I was incredibly frustrated. It felt like I needed to constantly work around the type system, and it felt like the types weren't providing any value at all. And, they weren't providing any value to me: I had fundamental misconceptions about how TypeScript worked, and my attempts to work around the Type system by using lots of any and as created some truly abominable TypeScript code.

Type Inference

Whenever you create a variable using let, const, or var, and don't provide a type, TypeScript will infer the type: this means it will use its best guess for the type. These best guesses are pretty good, but they aren't magic!

If you use let or var, TypeScript will assume that it can re-assign to any similar value. If you write let method = "GET";, TypeScript will infer that method is a string, and will let you later reassign method: method = "KonMari". If instead you use const, const method = "GET", TypeScript will infer that method is of type "GET". This means when you use let, you'll often need to use a type. let method: HTTPMethod = "GET" will allow only type-safe reassignment.

When TypeScript infers types for Objects and Arrays, it will similarly give its best guess. In JavaScript, objects are mutable: if you set up a variable like const request = { method: "GET" }, it'll assume that the type that you want is { method: string } to let you update the method field. A type of { method: string } won't be usable by something that wants { method: HTTPMethod }, so you need to either explicitly tell it the type (const request: { method: HTTPMethod } = { method: "GET" }, or tell TypeScript that it's safe to infer a stricter type (const request = { method: "GET" as const }).

Here's a TypeScript playground that explores type inference in a bit more depth, and the type inference part of the handbook is great if you want a detailed introduction.

Why isn't it narrowing?

One of the things that I found most frustrating about TypeScript was writing a clause that felt like it should have narrowed the type of a variable and not have it actually narrow. I'd often use as to bypass the type system entirely and force something to narrow, but overriding TypeScript like this is a real anti-pattern. When I finally learned about type narrowing, and type guards, TypeScript became so much more pleasant to use.

One of the very first things I learned about JavaScript was that you should never use in with Objects because in doesn't differentiate between an object's properties and those of its prototype. In TypeScript, in is crucial to use for type narrowing. If you have a function that takes a type like Error | (Error & { type: "NotFound", table: string }) | (Error & { type: "NotAllowed", reason: string }), you can't write if (!err.type) return because TypeScript doesn't know whether err has that field or not. Instead, you need to write if (!("type" in err)) return and TypeScript won't error, and will correctly narrow the type.

One related confusion was why I couldn't use if statements to narrow a type. I'd try to write code like the following:

Don't do this!

1type NotFound = Error & { NotFound: true; table: string };
2type ServerError = Error & { ServerError: true; message: string };
3type ClientError = Error & { ClientError: true; status: number };
5function getResponse(err: Error | NotFound | ServerError | ClientError) {
6 if ((err as ClientError).ClientError) {
7 return { status: (err as ClientError).status, message: "client error" };
8 }
9 if ((err as NotFound).NotFound) {
10 const notFoundError = err as NotFound;
11 return {
12 status: 404,
13 message: `not found in ${notFoundError.table}`;
14 }
15 }
17 return {
18 status: 500,
19 message: (err as ServerError).message || "server error",
20 }

TypeScript can automatically discriminate a union type if the members of a union all share a field. This can lead to much simpler, type-safe, and easier to work with code!

1type NotFound = Error & { type: "NotFound", table: string };
2type ServerError = Error & { type: "ServerError", message: string; };
3type ClientError = Error & { type: "ClientError", status: number; };
5function getResponse(err: Error | NotFound | ServerError | ClientError) {
6 // doing this first lets TypeScript discriminate using the `type` property
7 if (!("type" in err)) {
8 return {
9 status: 500,
10 message: "server error",
11 }
12 }
14 if (err.type === "ClientError") {
15 // err now has type ClientError, so we can use status!
16 return { status: err.status, message: "client error" };
17 }
19 if (err.type === "NotFound") {
20 return {
21 status: 404,
22 message: `not found in ${err.table}`
23 }
24 }
26 // it even narrows this to ServerError!
27 // although it may be wiser to have an explicit `if` statement and assert that every case is handled
28 return {
29 status: 500,
30 message: err.message,
31 }

Finally, if there are more complex types that you need to narrow, writing a custom type guard function is always an option! This is a function that returns a boolean that lets TypeScript narrow a type whenever you call it!

1// `method isHTTPMethod` tells callers that any string that you've checked with this function is an HTTPMethod
2function isHTTPMethod (method: string): method is HTTPMethod {
3 return ["GET", "PUT", "POST", "DELETE"].includes(method);

Here's a TypeScript playground where you can see how this works.

"Brands" are required for Nominal Typing

TypeScript uses "structural typing", which means that only the shape of something matters, and the name of the type is completely irrelevant. If something looks like a duck, it's a duck as far as TypeScript is concerned. This is surprising if you're coming from a "nominal typing" background where two types with the same shapes can't be assigned to each other.

1// this is valid TypeScript
2type EmailAddress = string;
3type Url = string;
4const emailAddress: EmailAddress = "myEmail@classdojo.com";
5const url: Url = emailAddress;

If you want nominal-style types for a string or number, you need to use "Brands" to create "impossible" types.

1type EmailAddress = string & { __impossible_property: "EmailAddress" };
2type URL = string & { __impossible_property: "Url" };
3const emailAddress = "myEmail@classdojo.com" as EmailAddress; // we need to use `as` here because the EmailAddress type isn't actually possible to create in JS
4const url: URL= emailAddress; // this now errors the way we'd want

(The above isn't how you'd actually want to write this code: I'm writing it in a way that hopefully makes it a bit more clear what's going on. If you'd like to see a more realistic example, take a look at this playground)

TypeScript isn't just "JS with types"

TypeScript is a great language, but just treating it as "JS with types" is a surefire recipe for frustration. It takes time to learn the language, and I wish I'd spent more of that time up-front reading through the docs: it'd have saved a ton of pain. The TypeScript handbook is great, and there are tons of great resources online to help understand TypeScript better. There's still a ton I don't know about TypeScript, but finally learning about inference, type narrowing, and structural typing has made developing with TypeScript so much nicer.

A decent number of programming tasks boil down to manipulating text and files:

  • automating code migrations
  • exporting & transforming data
  • exploring logs
  • looking for patterns in a codebase

Knowing the basics of shell tools that are designed to manipulate text & files can make these tasks simple, and the interactive pipe-based programming environment of the shell makes figuring out the right command fun. It makes sense that tons of engineers evangelize learning tools like sed, awk, ag/rg, find, cut, head, tail, xargs, jq, and all of the other amazing shell tools that are designed for dealing with streams of text and files.

That said, while shell tools are useful, it is always possible to accomplish the same tasks with whatever programming language you regularly use! As long as you have the ability to easily manipulate text and files in a script, you'll find that it can make entire classes of tasks easy to accomplish. If you aren't comfortable manipulating text and files like this, you won't even notice the potential opportunities you have to make your own job easier, and to provide important business value.

At ClassDojo, we have people who use Bash, Go, and NodeJS for these scripts. All of these languages work well! Some of the high-value text & file manipulation scripts we've written have done things like:

  • automatically rewrite require statements in our tests to allow converting our tests to Typescript (Bash)
  • do a large code migration to allow targeted detection of changes to fixtures & indexes (NodeJS)
  • set up a MongoDB export job (NodeJS)
  • set up product-areas for all of our HTTP routes to allow better team ownership over errors (Bash)
  • set up fast local database restores (NodeJS and Bash)
  • parallelize tests (Go)
  • parallelize & cache lint steps (Bash)

Shell scripting can be difficult to get into, but it doesn't mean that you can't start getting the same value out of whatever language you regularly use. Getting comfortable with text manipulation, file manipulation, and regular expressions is incredibly useful no matter what tool you choose to get there.

Say you had a few petabytes of user ids lying around somewhere, and someone told you that they were going to use a few kb of memory to estimate the "cardinality", or the number of distinct ids, of those petabytes of ids with decently high accuracy. It'd feel like magic! The HyperLogLog algorithm makes that magic cardinality estimation happen through the Power of Random Numbers, and some clever use of data structures.

I wanted to set up a mongo-backed hyperloglog, but I really struggled to understand the wikipedia page, the initial paper [1], or any of the improvements of the HLL++ algorithm [2] because I didn't really understand the underlying ideas and how they fit together. This post is an attempt to help describe the ideas that make the HyperLogLog work without getting too much into the details.

The Power of Random Numbers

Before we start doing any cardinality estimation, let's flip some coins. I'm going to go into a different room, and for every coin in the room, I'm going to flip that coin until I flip it and it comes up heads. I'm then going to come back and tell you the maximum number of attempts it took me to flip a head. If I tell you the maximum number of attempts it took me before I saw a head was 2, you'd be justified in guessing that there aren't that many coins in the other room. If instead I tell you that one of the coins took me 7 flips before seeing a head, you might guess that there were a decent number of coins in the other room.

This game isn't particularly fun or useful, but we can use the same technique on random numbers. We can look at the maximum number of leading zeros in the binary representation of a random number to get a measure of its "rarity". By this arbitrary measure, 1xxxxxxx and 01xxxxxx will both be common, and 0000001x is quite rare. If we play the same game, and I go into another room and look through my random number collection and tell you that the largest stretch of 0s (or tails) I saw before a 1 (or head) was 7, you similarly might guess that my random number collection is quite extensive (and it is).

We can formalize the idea that 000001x is low & therefore relatively rare a little bit more. If we go digit by digit through the number, we can treat each digit as a coin flip, which would have a 50/50 chance of being a 1. So, if we look at the number of leading 0s (the number of flipped "tails") or the number of leading 1s (the number of flipped "heads"), we can get a sense for how rare a particular number is.

  • if the number starts with a no 0s (i.e., a 1), that's not that rare: 1/2 of randomly generated numbers start with 1
  • if the number has at least 2 0s followed by a 1, that's more rare: 1/2 ^ 2 randomly generated numbers will start with 2 0s
  • if a number has 6 leading zeros, that's even more rare: 1 / 2 ^ 6 randomly generated numbers will start with 6 leading zeros.
1number, rarity
200000000, 1 / 2 ^ 8
300000001, 1 / 2 ^ 7
40000001x, 1 / 2 ^ 6
5000001xx, 1 / 2 ^ 5
600001xxx, 1 / 2 ^ 4
70001xxxx, 1 / 2 ^ 3
8001xxxxx, 1 / 2 ^ 2
901xxxxxx, 1 / 2 ^ 1
101xxxxxxx, 1 / 2 ^ 0

Now that we have all that, we can take those "rarity" chances, and come up with some atrociously bad cardinality estimates for how many distinct numbers someone generated. If I say that the maximum number of tails I saw before seeing heads was 1 (01xxxxxx), guessing that I'd seen somewhere around 2 coins in the other room feels reasonable. If instead I told you that the maximum number of tails I saw before seeing a head was 3 tails (0001xxxx), I probably flipped about 8 different coins.

This estimator is incredibly inaccurate, and only works on numbers, but at least it's small!

(It's worth noting that we could have chosen to look at any pattern in these numbers for our rarity guesses. Trailing 0s, leading 1s, or any other pattern that we can turn into a chance)

Inaccurate string estimation

Going from an awful estimator that works on randomly small generated numbers to an awful estimator that works on randomly generated strings is simple! We can use a hash function on the strings to generate pseudo-randomly distributed numbers, and use that for counts of leading zeros. We'll also start using numbers that go up to 2^32 to have a few more leading zeros to work with for larger estimates.

Let's say we were going through the dictionary and we ran into the word "hello":

  1. We use a hashing function to generate a random number: md5 -s 'hello': 5d41402abc4b2a76b9719d911017c592
  2. We can convert that long hex number to a usable binary number by taking the first 8 characters 5d41402a to create 01011101010000010100000000101010 (1564557354), which has 1 leading 0.

If we go through the dictionary on my computer with this algorithm, we'll run into the word "confervoid" (resembling confervae (a type of algae) especially in being made up of branching filaments) that has an md5 hash that starts with 00003b2e and whose binary representation (00000000000000000011101100101110) has 18 leading zeros. At this point (which is not actually that far through the dictionary!), our algorithm would estimate that this dictionary has 2^18 (262,144) words in it, and no other word causes our algorithm to exceed this estimate. The dictionary on my computer actually has 235,886 words in it, which is surprisingly close to the ballpark estimate; we got quite lucky. In general, rough estimates like this one can have incredibly high variance, and much of the HyperLogLog algorithm is trying to deal with that variance.

1cat /usr/share/dict/words | \
2 xargs -n 1 md5 -s `# generate hashes for all words` | \
3 sed -E 's/.+ = //' `# get rid of "MDF (word) = " prefix`| \
4 nq --string-input 's => s.slice(0, 8)' | \
5 nq 's => Math.clz32(parseInt(s, 16))'| \
6 nq --reduce 0 '(max, curr) => curr > max ? curr : max'
7# returns 18
8# md5 is slow, so running this on the full dictionary takes a bit

(nq is a little library that makes it easier to use node syntax on the command line)

The Wisdom of the Buckets

Only tracking a single estimate for the cardinality of a set can lead to estimates with incredibly high variance. The insight behind the LogLog family of cardinality estimators is that you can track multiple estimates, and then average them together in some way.

One simple way of keeping multiple estimates is using multiple hashing functions (or using the same hashing function with different seeds) to generate different pseudorandom numbers. You can then average the estimates together to get a decent final estimate. You can even improve the estimate a bit if you throw away some of the outlying estimates, or if you simply take the median estimate.

Rather than hashing the same value multiple times, the HyperLogLog algorithm instead hashes the value once. It uses part of the randomly generated number to choose an estimate to update, and the rest to calculate that same count of leading zeros. If we take our same 'hello' example from before:

  1. We use a hashing function to generate a random number: md5 -s 'word': c47d187067c6cf953245f128b5fde62a
  2. Let's take the first character to choose which estimate to update (0-f): we'll update estimate c. Because we're using the first character to decide on which estimate to update, that means that we'll end up with 16 estimates.
  3. We then take the next 8 characters 47d18706 to calculate the number of leading zeros (it happens to have 1 leading zero). If that's more leading zeros than the current estimate has, we'll update the estimate.

Then, once we have those estimates, we need to combine them together. One relatively naive way of sticking them together could be something like taking the median estimate and then multiplying it by the number of estimates: estimate = median_estimate * number_of_estimates. When people looked at this problem carefully, they figured out a better algorithm to figure out how to combine those partial estimates into an estimate of the whole. The formula they came up with is: estimate = number_of_estimates * harmonic_mean_of_estimates * magic_collisions_constant. Harmonic means are relatively insensitive to outliers, and the magic constant adjusts for the chance that multiple values hash to the same bucket & value.

Dealing with small cardinalities

If you test out this HyperLogLog algorithm, you'll start noticing variance is quite high for low cardinalities (meaning the estimates for smaller sets can be pretty far off). The algorithm described above starts being decently accurate around 3 * number_of_estimates, and many implementations keep around 2^14 = 16,384 distinct estimates, which means this algorithm is inaccurate below ~50k. That's a pretty big range to be inaccurate!

(Redis's implementation of hyperloglog includes formulae improvements described in New cardinality estimation algorithms for HyperLogLog sketches[3] that reduce variance and improve the accuracy of how you stick the different estimates together, but most implementations & discussions I've seen online use Linear Counting instead)

The estimate = number_of_estimates * harmonic_mean_of_estimates * magic_collisions_constant formula has high variance for low cardinalities, so let's not use it at all when estimated cardinalities are low. Other algorithms, like Linear Counting, work quite well when you know the range of cardinalities that you'll want accurate estimates for, and take up very little memory.

Linear Counting is a beautiful little algorithm: set up a lot of zeros or "buckets" (say 2^32 of them). Whenever a value (like "world") comes in, hash it to a pseudo-random number (like 5d41402a = 1564557354), modulo it by the number of zeros, and then put a 1 (a "marker") at that spot. A naive counting technique would be to just count the number of markers, and this works pretty well! A more accurate counting technique is to include the chance that you're seeing a hash collision from two different values, which increases as you add more items, and use the formula -total_bucket_count * log(buckets_at_0 / total_bucket_count).

Rather than having separate Linear Counting buckets, and HyperLogLog estimates, almost all HyperLogLog implementations have their estimates (or "registers") perform double duty. The presence of any value in a bucket is used as a Linear Counting marker when estimated cardinality is low. This saves space, and is beautiful, and wonderfully clever. But, if you wanted increased accuracy at low cardinalities, you could set up a completely distinct Linear Counter for low cardinalities and increase the cardinality estimator's size a bit.

Dealing with large cardinalities

Most HyperLogLog implementations use 2^64 bit estimators that are accurate up to ~2^57, which is such an absurdly large number that the theoretical bias here shouldn't be an actual concern. You can use a bias correction table (described in [2]), or a better formula (described in [3]), but almost every use case won't need to worry about counting numbers that large.

Counting this all up

Despite what the Wikipedia article might have you believe, HyperLogLogs boil down to few simple things:

  1. Looking at the rarity of random patterns from hashed functions give us rough cardinality estimates
  2. Figuring out some way to average estimates reduces variance. Using a harmonic mean has so far proved to be a good estimate averaging technique
  3. Even averaged estimates have high variance at low cardinalities, so we should use a completely different algorithm there that has nothing whatsoever to do with the zero counting and averaging algorithm except that it can take advantage of the same data store (Linear Counting)

Hopefully, that bit of background makes future readings about HyperLogLogs more intelligible. There's some great stuff out there!

[1] Philippe Flajolet, Eric Fusy, Olivier Gandouet, et al. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[2] Stefan Heule, Marc Nunkesser, Alex Hall HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm https://dl.acm.org/doi/abs/10.1145/2452376.2452456)

[3]: Otmar Ertl, New cardinality estimation algorithms for HyperLogLog sketches https://arxiv.org/abs/1702.01284

Newer posts
Older posts