Posts By: Will Keleher

HyperLogLog-orrhea

Say you had a few petabytes of user ids lying around somewhere, and someone told you that they were going to use a few kb of memory to estimate the "cardinality", or the number of distinct ids, of those petabytes of ids with decently high accuracy. It'd feel like magic! The HyperLogLog algorithm makes that magic cardinality estimation happen through the Power of Random Numbers, and some clever use of data structures.

I wanted to set up a mongo-backed HyperLogLog, but I really struggled to understand the Wikipedia page, the initial paper [1], or any of the improvements in the HLL++ algorithm [2] because I didn't really understand the underlying ideas and how they fit together. This post is an attempt to describe the ideas that make HyperLogLog work without getting too far into the details.

The Power of Random Numbers

Before we start doing any cardinality estimation, let's flip some coins. I'm going to go into a different room, and for every coin in the room, I'm going to flip that coin until it comes up heads. I'm then going to come back and tell you the maximum number of flips it took me to see a head. If I tell you that maximum was 2, you'd be justified in guessing that there aren't that many coins in the other room. If instead I tell you that one of the coins took me 7 flips before I saw a head, you might guess that there were a decent number of coins in the other room.

This game isn't particularly fun or useful, but we can use the same technique on random numbers. We can look at the number of leading zeros in the binary representation of a random number to get a measure of its "rarity". By this arbitrary measure, 1xxxxxxx and 01xxxxxx will both be common, and 0000001x is quite rare. If we play the same game, and I go into another room and look through my random number collection and tell you that the largest stretch of 0s (or tails) I saw before a 1 (or head) was 7, you similarly might guess that my random number collection is quite extensive (and it is).

We can formalize the idea that a number like 0000001x is small & therefore relatively rare a little bit more. If we go digit by digit through the number, we can treat each digit as a coin flip that has a 50/50 chance of being a 1. So, if we look at the number of leading 0s (the number of flipped "tails") or the number of leading 1s (the number of flipped "heads"), we can get a sense for how rare a particular number is.

  • if the number starts with no 0s (i.e., it starts with a 1), that's not that rare: 1 / 2 of randomly generated numbers start with a 1
  • if the number has at least 2 0s followed by a 1, that's more rare: 1 / 2 ^ 2 of randomly generated numbers will start with at least 2 0s
  • if a number has 6 leading zeros, that's even more rare: 1 / 2 ^ 6 of randomly generated numbers will start with at least 6 leading zeros.
number, rarity
00000000, 1 / 2 ^ 8
00000001, 1 / 2 ^ 7
0000001x, 1 / 2 ^ 6
000001xx, 1 / 2 ^ 5
00001xxx, 1 / 2 ^ 4
0001xxxx, 1 / 2 ^ 3
001xxxxx, 1 / 2 ^ 2
01xxxxxx, 1 / 2 ^ 1
1xxxxxxx, 1 / 2 ^ 0

Now that we have all that, we can take those "rarity" chances and come up with some atrociously bad cardinality estimates for how many distinct numbers someone generated. If I say that the maximum number of tails I saw before seeing a head was 1 (01xxxxxx), guessing that I'd generated somewhere around 2 distinct numbers feels reasonable. If instead I told you that the maximum number of tails I saw before a head was 3 (0001xxxx), I probably generated about 2 ^ 3 = 8 different numbers.

This estimator is incredibly inaccurate, and only works on numbers, but at least it's small!
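
Here's what that awful estimator might look like as a tiny TypeScript sketch, assuming 32-bit random numbers and using Math.clz32 to count leading zeros:

// remember the most leading zeros we've ever seen, and guess 2^that
let maxLeadingZeros = 0;

function addRandomNumber(n: number) {
  maxLeadingZeros = Math.max(maxLeadingZeros, Math.clz32(n));
}

function terribleEstimate(): number {
  return 2 ** maxLeadingZeros;
}

for (let i = 0; i < 1000; i++) {
  addRandomNumber(Math.floor(Math.random() * 2 ** 32));
}
console.log(terribleEstimate()); // somewhere in the ballpark of 1000, but often off by 2-4x or more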

(It's worth noting that we could have chosen to look at any pattern in these numbers for our rarity guesses: trailing 0s, leading 1s, or any other pattern that we can turn into a chance.)

Inaccurate string estimation

Going from an awful estimator that works on small randomly generated numbers to an awful estimator that works on randomly generated strings is simple! We can use a hash function on the strings to generate pseudo-randomly distributed numbers, and count leading zeros on those. We'll also start using numbers that go up to 2^32 so we have a few more leading zeros to work with for larger estimates.

Let's say we were going through the dictionary and we ran into the word "hello":

  1. We use a hashing function to generate a random number: md5 -s 'hello': 5d41402abc4b2a76b9719d911017c592
  2. We can convert that long hex number to a usable binary number by taking the first 8 characters 5d41402a to create 01011101010000010100000000101010 (1564557354), which has 1 leading 0.

If we go through the dictionary on my computer with this algorithm, we'll run into the word "confervoid" (resembling confervae, a type of algae, especially in being made up of branching filaments), which has an md5 hash that starts with 00003b2e and whose binary representation (00000000000000000011101100101110) has 18 leading zeros. At this point (which is not actually that far through the dictionary!), our algorithm would estimate that this dictionary has 2^18 (262,144) words in it, and no other word causes our algorithm to exceed this estimate. The dictionary on my computer actually has 235,886 words in it, which is surprisingly close to the ballpark estimate; we got quite lucky. In general, rough estimates like this one can have incredibly high variance, and much of the HyperLogLog algorithm is about dealing with that variance.

cat /usr/share/dict/words | \
  xargs -n 1 md5 -s  `# generate hashes for all words` | \
  sed -E 's/.+ = //' `# get rid of "MD5 (word) = " prefix` | \
  nq --string-input 's => s.slice(0, 8)' | \
  nq 's => Math.clz32(parseInt(s, 16))'| \
  nq --reduce 0 '(max, curr) => curr > max ? curr : max'
# returns 18
# md5 is slow, so running this on the full dictionary takes a bit

(nq is a little library that makes it easier to use node syntax on the command line)

The Wisdom of the Buckets

Only tracking a single estimate for the cardinality of a set can lead to estimates with incredibly high variance. The insight behind the LogLog family of cardinality estimators is that you can track multiple estimates, and then average them together in some way.

One simple way of keeping multiple estimates is using multiple hashing functions (or using the same hashing function with different seeds) to generate different pseudorandom numbers. You can then average the estimates together to get a decent final estimate. You can even improve the estimate a bit if you throw away some of the outlying estimates, or if you simply take the median estimate.

Rather than hashing the same value multiple times, the HyperLogLog algorithm instead hashes the value once. It uses part of the randomly generated number to choose an estimate to update, and the rest to calculate the count of leading zeros. Let's walk through the same steps with a new word, "word":

  1. We use a hashing function to generate a random number: md5 -s 'word': c47d187067c6cf953245f128b5fde62a
  2. Let's take the first character to choose which estimate to update (0-f): we'll update estimate c. Because we're using the first character to decide on which estimate to update, that means that we'll end up with 16 estimates.
  3. We then take the next 8 characters 47d18706 to calculate the number of leading zeros (it happens to have 1 leading zero). If that's more leading zeros than the current estimate has, we'll update the estimate.

Then, once we have those estimates, we need to combine them together. One relatively naive way of sticking them together could be something like taking the median estimate and then multiplying it by the number of estimates: estimate = median_estimate * number_of_estimates. When people looked at this problem carefully, they came up with a better way to combine those partial estimates into an estimate of the whole: estimate = number_of_estimates * harmonic_mean_of_estimates * magic_collisions_constant. Harmonic means are relatively insensitive to outliers, and the magic constant adjusts for the chance that multiple values hash to the same bucket & value.
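
Here's a rough TypeScript sketch of that whole flow, using 16 buckets chosen by the first hex character (as in the example above) and Node's built-in crypto module for md5. Real implementations use many more buckets and faster hashes; the 0.673 below is the published bias-correction constant for 16 buckets.

import { createHash } from "crypto";

const BUCKET_COUNT = 16; // the first hex character picks which of the 16 estimates to update
const buckets = new Array<number>(BUCKET_COUNT).fill(0);

function add(value: string) {
  const hash = createHash("md5").update(value).digest("hex");
  const bucket = parseInt(hash[0], 16);
  // store the position of the first 1-bit in the next 32 bits (leading zeros + 1),
  // so a register of 0 means "nothing has landed in this bucket yet"
  const firstOneBitPosition = Math.clz32(parseInt(hash.slice(1, 9), 16)) + 1;
  buckets[bucket] = Math.max(buckets[bucket], firstOneBitPosition);
}

function estimate(): number {
  const MAGIC_COLLISIONS_CONSTANT = 0.673;
  const harmonicMeanOfEstimates =
    BUCKET_COUNT / buckets.reduce((sum, register) => sum + 2 ** -register, 0);
  return MAGIC_COLLISIONS_CONSTANT * BUCKET_COUNT * harmonicMeanOfEstimates;
}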

Dealing with small cardinalities

If you test out this HyperLogLog algorithm, you'll notice that variance is quite high for low cardinalities (meaning the estimates for smaller sets can be pretty far off). The algorithm described above only starts being decently accurate around 3 * number_of_estimates, and many implementations keep around 2^14 = 16,384 distinct estimates, which means this algorithm is inaccurate below ~50k. That's a pretty big range to be inaccurate!

(Redis's implementation of HyperLogLog includes the improved formulas described in New cardinality estimation algorithms for HyperLogLog sketches [3], which reduce variance and improve the accuracy of how you stick the different estimates together, but most implementations & discussions I've seen online use Linear Counting instead.)

The estimate = number_of_estimates * harmonic_mean_of_estimates * magic_collisions_constant formula has high variance for low cardinalities, so let's not use it at all when estimated cardinalities are low. Other algorithms, like Linear Counting, work quite well when you know the range of cardinalities that you'll want accurate estimates for, and take up very little memory.

Linear Counting is a beautiful little algorithm: set up a lot of zeros or "buckets" (say 2^32 of them). Whenever a value (like "hello") comes in, hash it to a pseudo-random number (like 5d41402a = 1564557354), modulo it by the number of buckets, and then put a 1 (a "marker") at that spot. A naive counting technique would be to just count the number of markers, and this works pretty well! A more accurate counting technique accounts for the chance that you're seeing a hash collision from two different values, which increases as you add more items, and uses the formula -total_bucket_count * log(buckets_at_0 / total_bucket_count).
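
Here's a small sketch of Linear Counting on its own (with a much more modest number of buckets than 2^32; real implementations size the bitmap to the range of cardinalities they care about):

import { createHash } from "crypto";

const TOTAL_BUCKET_COUNT = 2 ** 14;
const markers = new Uint8Array(TOTAL_BUCKET_COUNT); // starts as all zeros

function mark(value: string) {
  const hash = parseInt(createHash("md5").update(value).digest("hex").slice(0, 8), 16);
  markers[hash % TOTAL_BUCKET_COUNT] = 1; // put a marker at that spot
}

function linearCountingEstimate(): number {
  const bucketsAtZero = markers.reduce((zeros, marker) => zeros + (marker === 0 ? 1 : 0), 0);
  // account for collisions: -m * ln(zeros / m) is more accurate than just counting markers
  return -TOTAL_BUCKET_COUNT * Math.log(bucketsAtZero / TOTAL_BUCKET_COUNT);
}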

Rather than having separate Linear Counting buckets and HyperLogLog estimates, almost all HyperLogLog implementations have their estimates (or "registers") perform double duty: the presence of any value in a bucket is used as a Linear Counting marker when estimated cardinality is low. This saves space, and is beautiful, and wonderfully clever. But, if you wanted increased accuracy at low cardinalities, you could set up a completely distinct Linear Counter for low cardinalities and increase the cardinality estimator's size a bit.
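
Continuing the HyperLogLog sketch from above, the double-duty version might look something like this (the 3 * number_of_estimates cutoff is the rough rule of thumb mentioned above, not the carefully tuned threshold a real implementation would use):

// reuses `buckets`, `BUCKET_COUNT`, and `estimate()` from the earlier sketch
function combinedEstimate(): number {
  const hyperLogLogEstimate = estimate();
  if (hyperLogLogEstimate > 3 * BUCKET_COUNT) return hyperLogLogEstimate;

  // at low cardinalities, a register that's still 0 doubles as "no Linear Counting marker here"
  const bucketsAtZero = buckets.filter((register) => register === 0).length;
  if (bucketsAtZero === 0) return hyperLogLogEstimate;
  return -BUCKET_COUNT * Math.log(bucketsAtZero / BUCKET_COUNT);
}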

Dealing with large cardinalities

Most HyperLogLog implementations use 64-bit hashes and are accurate up to ~2^57, which is such an absurdly large number that the theoretical bias here shouldn't be an actual concern. You can use a bias correction table (described in [2]), or a better formula (described in [3]), but almost every use case won't need to worry about counting numbers that large.

Counting this all up

Despite what the Wikipedia article might have you believe, HyperLogLogs boil down to a few simple things:

  1. Looking at the rarity of patterns in hashed values gives us rough cardinality estimates
  2. Figuring out some way to average estimates together reduces variance. Using a harmonic mean has so far proved to be a good estimate-averaging technique
  3. Even averaged estimates have high variance at low cardinalities, so we use a completely different algorithm there (Linear Counting) that has nothing to do with the zero-counting-and-averaging algorithm, except that it can take advantage of the same data store

Hopefully, that bit of background makes future readings about HyperLogLogs more intelligible. There's some great stuff out there!

[1] Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[2] Stefan Heule, Marc Nunkesser, Alex Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. https://dl.acm.org/doi/abs/10.1145/2452376.2452456

[3] Otmar Ertl. New cardinality estimation algorithms for HyperLogLog sketches. https://arxiv.org/abs/1702.01284

    The standard syslog-based way to handle logs is to divide logs into categories like FATAL, ERROR, WARN, INFO, and DEBUG and use those categories to adjust which logs you see. Having a standard logging taxonomy is incredibly useful when you're dealing with logs for many different systems, but for our backend web-servers, we didn't find these syslog-based divisions to be actionable or easily searchable. We instead created our own categories that map to how we want to handle our logs. We divided our logs into server-errors, client-errors, and sampled-investigation-logs, and added tags to route logs in these categories to the team & person who's best suited to handle them.

    What's the ERROR with ERROR logs?

    ERROR is a pretty broad category! For our team, we find it's useful to divide errors into ones that should be fixed on the web-server (server-errors: these are normally 5xxs) and integration errors that should be fixed on a client (client-errors: these are normally 4xxs). If something can't be fixed, then it's not an error: it's part of the system that we need to instrument and design around. Separating these out lets us alert on appropriate levels separately, and makes it easier to route these logs to the right team. Just because something is a 4xx doesn't mean it's necessarily a client-error: we often have routes that have expected 4xx responses.

    Additionally, just because a dependency errored doesn't mean that counts as an error for our application. There are plenty of times when a dependency throws or logs an error that we're able to easily recover from. Those situations shouldn't be treated as if something went wrong: those are times when something went right! We likely want to instrument and alert on these failures, but they don't count as errors for our application.

    Finally, we want to route both server-errors and client-errors to the correct team, so we've decorated all of our routes with a product area, and we use that product area to tag these server/client-error logs with the appropriate team. This lets us set up per-team error budgets, and gives each team an easy way to search through the things that they care about. This gives us many of the visibility benefits of a microservices architecture while letting us continue to maintain a single monolith.
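
    As a rough sketch of what that decoration could look like (the helper, product areas, and team names here are hypothetical, not our actual API):

    import express from "express";

    const app = express();

    // hypothetical mapping from product area to owning team
    const productAreaToTeam: Record<string, string> = {
      portfolios: "classroom-team",
      messaging: "communications-team",
    };

    // hypothetical helper: every route declares its product area when it's registered
    function getWithProductArea(path: string, productArea: string, handler: express.RequestHandler) {
      app.get(path, (req, res, next) => {
        res.locals.productArea = productArea;
        handler(req, res, next);
      });
    }

    getWithProductArea("/api/story-posts", "portfolios", (req, res) => res.json([]));

    // error-handling middleware tags server-error logs with the team that owns the product area
    app.use((err: Error, req: express.Request, res: express.Response, next: express.NextFunction) => {
      const team = productAreaToTeam[res.locals.productArea] || "unknown-team";
      console.error("server-error", JSON.stringify({ team, path: req.path, message: err.message }));
      res.sendStatus(500);
    });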

    What do people use INFO and DEBUG logs for?

    In general, INFO and DEBUG logs are useful when we run into a system error that we want to understand better, or when we're trying to understand a phenomenon.

    Many teams add INFO and DEBUG logs on the off-chance that they'll turn out to be useful, and when something goes wrong in production, they can temporarily turn on that detailed logging to try and debug. Rather than doing that, we instead keep a record of useful actions & info, and log those actions and info out whenever we log a server or client error. For our backend web servers, that data is the current request, the user making the request, and database requests the server is currently making. For our web clients, we store enough recent redux actions that we can see how a user got into an error state. The only difference is we don't log any of that information until it becomes useful, and then we package it with our server-error or client-error logs.
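
    For the web-client case, a stripped-down sketch of that idea might look something like this (the buffer size and function names are made up for illustration):

    // keep a small in-memory buffer of recent actions; it costs nothing until an error happens
    const recentActions: string[] = [];

    function recordAction(action: string) {
      recentActions.push(action);
      if (recentActions.length > 50) recentActions.shift();
    }

    function logClientError(message: string, details: Record<string, any>) {
      // the context only gets logged once it's actually useful: alongside a client-error
      console.error(`client-error ${message}`, JSON.stringify({ ...details, recentActions }));
    }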

    The other main use for these INFO/DEBUG logs can be to explore a phenomenon. Some things happen too often to log every occurrence, so we set up exploratory sample logs. Whenever we call our sampleLog function with the name of the phenomenon we're exploring, we increment a metric tracking how many times it's happened, and if an individual container hasn't logged info about the phenomenon recently, we log out the details. This lets us casually add new logs to our codebase without worrying about volume.

    Here's a pseudocode example of how that might work:

    import ms from "ms"; // converts strings like "1 minute" to milliseconds

    // `metrics` and `toMetricName` are app-specific helpers (e.g., a statsd client and a metric-name sanitizer)
    const sampleLogRateLimiter: Record<string, number> = {};
    function sampleLog(
      phenomenonName: string,
      message: string,
      details: Record<string, any>
    ) {
      const metricName = toMetricName(phenomenonName);
      metrics.increment(metricName);
      const now = Date.now();
      const lastLoggedAt = sampleLogRateLimiter[phenomenonName] || 0;
      const shouldLog = now - lastLoggedAt > ms("1 minute");
      if (shouldLog) {
        sampleLogRateLimiter[phenomenonName] = now;
        // we normally use winston here :)
        console.log(
          `${phenomenonName} ${message}`,
          JSON.stringify({ stack: new Error().stack, ...details })
        );
      }
    }
    

    What about WARNING logs?

    WARNING logs are almost always ignored, and it's easy for them to get completely out of control. We don't have any warning logs: we try to either remove them, or turn them into legitimate server or client error logs that we'll actually pay attention to.

    Our Logging Taxonomy

    • things we want a backend engineer on a product team to look at and fix: server-error logs
    • things we want a frontend client engineer on a product team to look at and fix: client-error logs
    • things we're curious about: sampling exploratory logs

    Overall, this system works well for us! We've been able to keep our error rate incredibly low, and it's been easy to quickly discover & fix new bugs & issues. It would have been much harder without setting up our logs in categories that match up to how we wanted to use them.

      Fully graceful incremental deploys are hard. Any number of events can deliver brief spates of 5xx errors to your clients, and getting all of the pieces right isn't trivial. And, if something isn't quite right, it can be hard to detect problems from a few seconds of downtime on a server over the course of a deploy.

      Seeing 5xx errors from HAProxy because it was unable to connect to a server that it thought was available, or because a server stopped sending data, often points to issues with graceful shutdown logic. If you're at a company that deploys rarely and that isn't under heavy load, focusing on these few seconds might seem a bit crazy. So why should you care about those few seconds?

      There's a real sense of satisfaction that comes with making something solid and dependable that users can rely on — not to mention a few practical reasons:

      • It's simplest to alert on low error rates (especially 0!). If you have an "expected error rate" that has to accommodate deploys, it can be trickier to discover problems, and you may get false-positive error alerts.
      • "Treat servers like cattle, not pets": handling shutdowns gracefully allows you to safely rotate instances and put servers on spot instances.
      • Graceful deploys let you easily scale in and out as traffic changes over the course of a day without sending 5xxs to clients.

      At ClassDojo, we deploy 10 to 20 times a day, and we need to scale in and out to handle traffic throughout the school day. Having graceful web server shutdowns behind HAProxy lets us make that happen.

      What does a graceful shutdown look like?

      Let's not worry about how we update HAProxy configuration, or how we actually get new code deployed to an instance (or a new container running on K8s or Nomad or wherever). Our HAProxy is happily running with this configuration:

      backend the-trunk
        mode http
        option httpchk GET /status
        timeout server 20s
        rspadd X-Clacks-Overhead:\ GNU\ Terry\ Pratchett
      
        server tower-0:version1 10.1.0.127:8080 minconn 1 maxconn 10 rise 2 fall 2 check inter 2s
        server tower-1:version2 10.1.0.127:8081 minconn 1 maxconn 10 rise 2 fall 2 check inter 2s
        server tower-2:version2 10.1.0.127:8082 minconn 1 maxconn 10 rise 2 fall 2 check inter 2s
      
      • option httpchk GET /status: make http GET requests to the /status route on the running server. (note: you can do this with a tcp connection check to a port, but I like the clarity and simplicity of HTTP. Doing these checks over tcp is much cheaper, and a good choice if you're running load balancers under heavier load)
      • server tower-0:version1 10.1.0.127:8080: where the web-server is
      • rise 2: require 2 200s to mark a server as up
      • fall 2: require 2 failed checks (e.g., 4xx or 5xx responses) to mark a server as down
      • check inter 2s: check /status every 2 seconds

      Given this configuration, our shutdown logic should look like:

      1. Send a signal (SIGTERM) to one of the web server processes.
      2. The web server updates its /status route to start returning 503s, indicating that it's down.
      3. After two failing checks, HAProxy stops sending new traffic to the server. The server may still be handling requests.
      4. The server waits for all of the remaining requests to complete. It can then safely clean up and shut down. If any requests aren't complete by the time the server is shutting down, it should log an error.

      Here's some simplified Node.js pseudocode that illustrates what those steps might look like. (We know there are more maintainable ways of writing much of this code: this is intended as illustration only.)

      const app = require('express')();
      app.listen(8081);
      
      const haproxyStatus = {
        outstandingRequestCount: 0,
        // 'up' may start off false if you need to do any setup before serving traffic
        up: true,
      }
      
      // tracking outstanding requests lets us see whether it's safe to shut down
      app.use((req, res, next) => {
        haproxyStatus.outstandingRequestCount++;
        res.once("finish", () => haproxyStatus.outstandingRequestCount--);
        next();
      });
      
      // health check route
      app.get('/status', (req, res) => haproxyStatus.up ? res.sendStatus(200) : res.sendStatus(503));
      
      // regular routes
      app.get('/small_gods', (req, res) => res.send("Time is a drug. Too much of it kills you. — Terry Pratchett"));
      app.get('/jingo', (req, res) => res.send("Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life. — Terry Pratchett"));
      
      function delay(ms: number) {
        return new Promise((resolve) => setTimeout(resolve, ms));
      }
      
      async function drainRequests () {
        if (haproxyStatus.up) {
          throw new Error("We cannot drainRequests until HAProxy is aware we're down");
        }
        while (true) {
          if (haproxyStatus.outstandingRequestCount === 0) return;
          await delay(100);
        }
      }
      
      async function reportDownToHAProxy() {
        // we start reporting that we're down to HAProxy
        // but it takes time for HAProxy to run its health checks `fall` times
        haproxyStatus.up = false;
      
        // server tower-0 10.1.0.127:8080 minconn 1 maxconn 10 rise 2 fall 2 check inter 2s port 8081
        const CHECK_INTER = 2_000; // check inter 2s
        const FALL_COUNT = 2; // fall 2
      
        // (note: if you have a single load balancer, you could count the number of `/status` requests that you've responded to to determine whether the load balancer is aware that you're down.
        // This works, but it gets more complicated when you have multiple load balancers that you need to wait for, and seems a little harder than just calculating a time that's _probably_ safe)
      
        await delay(FALL_COUNT * CHECK_INTER);
      }
      
      async function cleanUp () {/*any cleanup work you have*/}
      
      async function gracefulShutdown () {
        await reportDownToHAProxy();
        // at this point, HAProxy should be aware that this server is down and will stop sending us requests
        // we likely still have a few requests outstanding, and `outstandingRequests` will be > 0
        // (note: we're not worrying about connection timeouts here. If we were, we'd add time to our delay)
      
        // timeout server 20s
        const MAX_REQUEST_TIME = 20_000;
        await Promise.race([
          drainRequests(),
          delay(MAX_REQUEST_TIME),
        ]);
      
        // after draining requests, do any other cleanup work necessary to make exiting safe, such as closing database connections
        await Promise.race([
          cleanUp(),
          delay(5_000),
        ]);
      
        // at this point, all requests should be complete (or canceled by HAProxy)
        // and we should naturally exit because there are no longer any listeners or callbacks
        setTimeout(() => {
          console.error("We have to force exit because we didn't clean up all callbacks (unref means we don't need to worry about this setTimeout)")
          process.exit(1);
        }, 1000).unref()
      }
      
      // our deployment process sends a signal to the running process or container
      // to let it know that it's time to shut down
      process.on("SIGTERM", gracefulShutdown);
      

      How can you tell whether things are going wrong?

      One of the most important parts of software engineering is knowing when things are going wrong: thankfully, HAProxy logs make it pretty clear when something is broken with web server shutdowns. HAProxy logs have a ton of information in them, but we're only concerned with requests that terminate with sC-- or S*--. s indicates there was an HAProxy -> web-server connection error, and S indicates the server broke the TCP connection. The next character gives information about where in the request lifecycle the connection error happened. The HAProxy docs on termination states are incredibly useful for understanding these termination problems.
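
      As a rough illustration, a small Node script along these lines could tally the termination states you care about from an HAProxy log file (the log path and regex here are assumptions about a fairly standard HTTP log format, not a drop-in tool):

      import { readFileSync } from "fs";

      // count termination states that start with s/S (server-side problems), e.g. "SH--" or "sC--"
      const logText = readFileSync("/var/log/haproxy.log", "utf8");
      const counts: Record<string, number> = {};
      for (const match of logText.matchAll(/ ([sS][A-Z]--) /g)) {
        counts[match[1]] = (counts[match[1]] || 0) + 1;
      }
      console.log(counts); // e.g. { "SH--": 3, "sC--": 1 }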

      Example log with SH-- termination state:

      10.1.0.127:8080 [01/Jul/2021:00:09:00.000] http-in the-trunk/tower-1:version0 0/0/0/-1/3022 502 0 - - SH-- 0/0/0/0/0 0/0 "GET /jingo HTTP/1.1"

      Shutting down this post

      This style of incremental graceful web-server deploy isn't the only way of tackling this problem. Some engineering teams only deploy rarely, and accept seconds or minutes of downtime during deploys. Other teams use blue-green deploys to switch over all web servers at once. Our engineering team values the speed, stability, reliability, and autoscaling that our graceful web-server shutdowns enable. And, while it takes a bit of code to make happen, the fundamental idea is simple: wait until HAProxy is aware that a server is down, and then start waiting for any outstanding requests to complete.
