Posts By: Will Keleher

Large Analytics SQL Queries are a Code Smell

A single large query in SQL can be hard to understand, test, debug, or change in the same way that an over-large function in code can be. A large query is also much harder to write! Feedback loops while writing large queries are slow and you'll often find yourself needing to guess at where the problem in your query is.

When I started writing analytics queries, I wrote some pretty rough ones that are now hard to debug and maintain! (Apologies to everyone who has had to deal with any of my old SQL queries.) Over time, I think I've gotten better at them, and I wanted to write down some of the things I'm doing differently now that I hope will be useful to other people in similar situations.

One caveat: I'm a product engineer who rarely writes these sorts of queries! This advice is much less applicable to someone who does this daily & uses better, more focused tools for analysis.

Temporary tables, views, and with

Large queries are hard, but it's pretty simple to break a large query into smaller pieces. For analytics queries, I normally create small temporary tables (often with the temporary keyword) that normalize data, filter out deleted rows and rows I'm not interested in, and organize my data into a format that makes querying easy. Views or with clauses can accomplish similar things, but I like using temporary tables for this because they cache results and make subsequent queries faster.

I also try to put constants into a temporary table or with clause. When working on a query, it can be easy to forget to update a constant in one spot and then get completely meaningless results. (Shoutout to Ben Haley for showing me this trick!)

All of this might sound a little abstract: let's take a somewhat contrived query and try to refactor it. We want to bucket and count US-based teachers who were active in 2021 by how many classes they created during that time period. Here's what that might look like as a single query:

select 
  case when class_count < 5 then class_count::varchar else 'many' end as bucket,
  count(*)
  from (
    select count(distinct class.classId) as class_count
    from teacher
    join user_teacher ON teacher.teacherId = user_teacher.teacherid
    -- left join class_teacher to make sure we're counting teachers who haven't created classes
    left join class_teacher on class_teacher.teacherId = user_teacher.teacherId and class_teacher.creator
    left join user USING(userId)
    join class using(classId)
    join (
      select distinct teacherId
      from teacher_active
      where active_date between '2021-01-01' and '2022-01-01'
    ) as ats on teacher.teacherId = ats.teacherId
    and class.createdat between '2021-01-01' and '2022-01-01'
    and not class.autocreated_demo
    and lower(user.country) in ('usa', 'us')
    group by teacher.teacherId
  ) as teacher_class_counts
group by 1
 

This query isn't particularly complex, but it's still enough logic that I'd be a little worried about changing it or verifying that it's correct. I'd be tempted to try to pull out constants and then separate out the filtering logic from the calculation logic.

drop table if exists _constant;
create temporary table _constant as (
 select '2021-01-01' as start_date, '2022-01-01' as end_date
);
 
drop table if exists _teacher;
create temporary table _teacher as (
 -- us_user is probably overkill: this might be better in the `where` clause!
 with us_user as (
   select userId
   from user
   where lower(country) in ('usa', 'us')
 )
 select distinct teacherId
 from teacher_active
 join user_teacher USING(teacherId)
 join us_user using(userid)
 where active_date between (select start_date from _constant)
   and (select end_date from _constant)
);
drop table if exists _class;
create temporary table _class as (
 select classId
 from class
 where class.createdat between (select start_date from _constant)
   and (select end_date from _constant)
   and not class.autocreated_demo
);
 
drop table if exists _classes_created_by_teacher;
create temporary table _classes_created_by_teacher as (
 with class_creator as (
   select class_teacher.*
   from class_teacher
   join _class USING(classId)
   where class_teacher.creator
 )
 select teacherId, count(distinct classId) as classes_created
 from _teacher
 left join class_creator using(teacherId)
 group by teacherId
);
 
select
 case when classes_created < 5 then classes_created::varchar else 'many' end as bucket,
 count(*)
from _classes_created_by_teacher
group by bucket;
 

It's arguable whether this is actually better! The initial query is short enough that it's not that much logic to understand: it might be the right size for the team that you're working with. There are also certainly better ways of factoring this same query that could make the logic even more clear. Overall though, I'd much rather work with the updated query:

  • if something is broken in the query, I can easily and quickly examine the tables that I've created to see if my results match my expectations
  • if I have a testing library of some sort, I can set up simple assertions about what the tables I'm using look like
  • the overall query will run faster because results are cached and because query planners don't always generate optimal plans for large queries. While working on this query, I'll have faster feedback loops
  • I'll be able to tell which parts of this query are slow and optimize if necessary
  • it's easier to focus on adapting & improving a single part
  • the sub-queries I've turned into tables could become useful non-temporary cache tables later if I ever tackle other similar problems

I think many data-focused engineers use Jupyter notebooks and pandas to break down large queries. How you break a large query into smaller pieces matters much less than actually doing that breakdown!

Make feedback loops FAST!

One of the most frustrating parts of working on a large query is that feedback loops can be slow. Making a change and waiting tens of minutes can completely kill any programming flow or focus that you have.

  • Break up large queries into smaller ones that are quick to run!
  • Use select without a from to quickly test out behavior. You can run queries like select extract('year' from GETDATE()); or select 1 != null, 1 is not null to quickly check your understanding. This can be especially useful for testing out regular expressions and formatting date fields for charts. There's no need for a from or to run a full query.
  • If a full table is slow to query, it might make sense to create a temporary table with representative data that you can use to build up your queries
  • Good indexes or sort-keys can drastically improve query speed! The most common mistake I've made is setting up a compound sort key that starts with a too-precise timestamp. If the first key in a compound sortkey has second or millisecond precision, the keys after it will almost never get used. So, rather than creating a sortkey like (createdAtMs, event), it's probably better to sort and query on a key like (day, event). (This might mean that queries need to include both day and createdAtMs.)
  • validate your work as you go! The quicker you realize that your assumptions about a table or column are incorrect, the better. Running a query that checks whether a column is unique, what values a column can hold, or just what the data looks like can save a ton of time! (There's a small sketch of this after the list.)
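
As an example of that last point, here's the kind of quick check I might run before trusting a table. This sketch assumes the teacher_active table from the earlier queries and verifies my guess that it holds one row per teacher per active_date:

-- if this returns any rows, the "one row per teacher per active_date" guess is wrong
select teacherId, active_date, count(*)
from teacher_active
group by teacherId, active_date
having count(*) > 1
limit 10;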

In general, putting effort into how quickly you get feedback while working makes it much easier to find flow and be effective. A little bit of effort put into setting up nice tables, improving data layout, and optimizing sortkeys can pay large dividends.

    Bob Nystrom's What Color is Your Function does an amazing job of describing why it can be painful when programming languages have different rules for calling synchronous and asynchronous functions. Promises and async/await have simplified things in JavaScript, but it's still a language with "red" (async) and "blue" (sync) functions, and I consistently see a few understandable errors from red vs. blue function confusion. Let's go through some of the most common mistakes – none of these are bad things to get wrong, they're just a symptom of how confusing this can be!

    Omitting await from try/catch blocks

    The most common mistake I see is omitting await from try/catch blocks with async functions. The code looks reasonable, but the catch block will only be able to catch synchronously thrown errors. To make matters worse, error handling logic is often less well tested than the happy path when everything works, which makes this pattern more likely to sneak its way into production code.

    async function throwsError () {
      throw new Error("alas! an error");
    }
    
    try {
      return throwsError();
    } catch (err) {
      console.error("Oh no! This catch block isn't catching anything", err);
    }
    

    An async function that throws is the equivalent of a Promise.reject, and when written that way, it's a bit clearer what's going on:

    try {
      return Promise.reject(new Error("alas! an error"));
    } catch (err) {
      console.error("It's clearer that this `catch` can't catch that `Promise.reject`. This is equivalent to the earlier code");
    }
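
    Adding await is the fix here: once the promise is awaited inside the try block, the rejection becomes a thrown error that the catch block can handle:

    try {
      // awaiting here means the rejection is thrown inside this try block
      return await throwsError();
    } catch (err) {
      console.error("This catch block now catches the error", err);
    }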
    

    Personally, I'm starting to wonder whether using try and catch blocks at all is a mistake when dealing with async code. They take up space and don't offer the same pattern matching that a library like Bluebirdjs can add to catch when you only want to catch some specific known errors: await tryThing().catch(NotFoundErrorClass, handleErrPattern) feels substantially cleaner to me than the equivalent try/catch block.

    Array.filter(async () => false)

    In recent years, JavaScript has added lots of useful Array methods like filter, map, forEach, and flatMap, and JavaScript programmers often use libraries like lodash to write functional code rather than writing for loops. Sadly, none of those Array methods or lodash helpers work with red async functions, and they're a common source of coding errors.

    const things = [true, false, 1, 0, "", new Date("not a date") - 0];
    const filteredThings = things.filter(async (thing) => thing);
    

    How many things do we end up with in filteredThings? Surprisingly, the answer has little to do with JavaScript type coercion: filteredThings will be the same size as things. An async function returns a Promise and even a Promise that resolves to false is still a truthy value: Boolean(Promise.resolve(false)) === true. If we want to do any sort of filtering using an async function, we need to switch out of blue sync mode and into red async mode.

    (async function () {
      // You should use a library like Bluebird rather than filtering like this! this is only for illustration
      const things = [true, false, 1, 0, "", new Date("not a date") - 0];
      const predicateValues = await Promise.all(things.map(async (thing) => thing));
      const filteredThings = things.filter((_thing, i) => predicateValues[i]);
    })();
    

    When you see Array.filter(async (thing) => thing) written out like that, the mistake is pretty clear. It can be harder to notice when you see code like const goodThings = things.filter(isGoodThing); you need to check whether isGoodThing is red or blue.
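
    As the comment in the snippet above hints, a promise library can do this filtering for you. Here's a minimal sketch using Bluebird's Promise.filter (assuming the bluebird package is installed):

    const Bluebird = require("bluebird");

    (async function () {
      const things = [true, false, 1, 0, "", new Date("not a date") - 0];
      // Bluebird awaits each async predicate before filtering, so resolved falsy values are dropped
      const filteredThings = await Bluebird.filter(things, async (thing) => thing);
    })();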

    Array.forEach(async...

    We see a similar problem when people use Array.forEach with an async function:

    const fruitStatus = {};
    ["apple", "tomato", "potato"].forEach(async (food) => {
      fruitStatus[food] = await isFruit(food);
    });
    return fruitStatus;
    

    In some ways, this is a more dangerous pattern. Depending on when you check, fruitStatus may have some, none, or all of the correct isFruit values. If isFruit is normally fast, problems and bugs might not manifest until isFruit slows down. A bug that only shows up some of the time is much harder to debug than one that's always there.
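
    The fix is to switch into async mode and await each check before using fruitStatus. Here's a sketch using the same isFruit function, with a plain for...of loop in place of forEach:

    (async function () {
      const fruitStatus = {};
      for (const food of ["apple", "tomato", "potato"]) {
        // awaiting inside for...of guarantees every entry is set before fruitStatus is used
        fruitStatus[food] = await isFruit(food);
      }
      return fruitStatus;
    })();

    If you'd rather run the isFruit checks in parallel, you can map each food to a promise and await Promise.all before building the object.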

    Await off my shoulders

    Despite how easy it is to make mistakes with async/await, I still love it – it feels easier to work with than Promises or callbacks. Dealing with asynchronous code is still one of the harder parts of programming in JavaScript, but tools like bluebird, the TypeScript no-unnecessary-condition rule, and the eslint promise plugin can help surface these easy-to-make red/blue function mistakes early. Hopefully, seeing the mistakes we often make will help you avoid some frustrating minutes debugging.

      Canary releases are pretty great! ClassDojo uses them as part of our continuous delivery pipeline: having a subset of real users use & validate our app before continuing with deploys allows us to safely & automatically deploy many times a day.

      Our canary releases are conceptually simple:

      1. we start canary containers with a new container image
      2. we then route some production traffic to these containers
      3. we monitor them: if a container sees a problem, we stop our pipeline. If they don't see problems, we start a full production deploy

      Simple enough, right? There are a few details that go into setting up a system like this, and I'd like to take you through how ClassDojo does it. Our pipeline works well for our company's needs, and I think it's a good example of what this kind of canary-gated deploy can look like.

      The key pieces of our system:

      1. We have a logging taxonomy that lets us accurately detect server-errors that we want to fix. ("Errors" that we don't want to fix aren't actually errors!)
      2. HAProxy, Consul, and Nomad let us route a subset of production traffic to a group of canary containers running new code
      3. Our canary containers expose a route with the count of seen errors and the count of total requests that a monitoring script in our jenkins pipeline can hit
      4. The monitoring script will stop our deployment if it sees a single error. If it sees 75,000 successful production requests, it will let the deploy go to production. (75,000 is an arbitrary number that gives us over a 99.9% chance of catching errors that happen once per 10^4 requests; the quick math is below.)
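
      (Checking that math: if an error shows up on 1 out of every 10^4 requests, the chance that 75,000 requests all succeed is (1 - 1/10^4)^75000, which is roughly 0.00055, so the canary has about a 99.9% chance of seeing at least one error before we promote the deploy.)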

      Starting canary containers

      ClassDojo uses Nomad for our container orchestration, so once we've built a docker image and tagged it with our updated_image_id, we can deploy it by running nomad run api-canary.nomad.

      // api-canary.nomad
      job "api-canary" {
        group "api-canary-group" {
          count = 8
          task "api-canary-task" {
            driver = "docker"
            config {
              image = "updated_image_id"
      
            }
            service {
              name = "api-canary"
              port = "webserver_http"
              // this registers this port on these containers with consul as eligible for "canary" traffic
            }
            resources {
              cpu = 5000 # MHz
              memory = 1600
      
              network {
                port "webserver_http"{}
              }
            }
          }
        }
      }
      

      Nomad takes care of running these 8 (count = 8) canary containers on our nomad clients. At this point, we have running containers, but they're not serving any traffic.

      Routing traffic to our canary containers

      Remember that nomad job file we looked at above? Part of what it was doing was registering a service in consul. We tell consul that the webserver_http port can provide the api-canary service.

      service {
        name = "api-canary"
        port = "webserver_http"
      }
      

      We use HAProxy for load-balancing, and we use consul-template to generate updated haproxy configs every 30 seconds based on the service information that consul knows about.

      backend api
        mode http
        # I'm omitting a *ton* of detail here!
        # See https://engineering.classdojo.com/2021/07/13/haproxy-graceful-server-shutdowns for how we do graceful deploys with HAProxy
      
      {{ range service "api-canary" }}
        server canary_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
      {{ end }}
      
      # as far as HAProxy is concerned, the canary containers above should be treated the same as our regularly deployed containers. It will round robin traffic to all of them
      {{ range service "api" }}
        server api_{{ .Address }}:{{ .Port }} {{ .Address }}:{{ .Port }}
      {{end}}
      

      Monitoring canary

      Whenever we see an error, we increment a local counter saying that we saw the error. What counts as an error? For us, an error is something we need to fix (most often 500s or timeouts): if something can't be fixed, it's part of the system, and we need to design around it. If you're curious about our approach to categorizing errors, Creating An Actionable Logging Taxonomy digs into the details. Having an easy way of identifying real problems that should stop a canary deploy is the key piece that makes this system work.

      let errorCount: number = 0;
      export const getErrorCount = () => errorCount;
      export function logServerError(errorDetails: ErrorDetails) {
        errorCount++;
        metrics.increment("serverError");
        winstonLogger.log("error", errorDetails);
      }
      

      Similarly, whenever we finish with a request, we increment another counter saying we saw the request. We can then expose both of these counts on our status route. There are probably better ways of publishing this information to our monitoring script than via our main server, but it works well enough for our needs.

      router.get("/api/errorAndRequestCount", () => {
        return {
          errorCount: getErrorCount(),
          requestCount: getRequestsSeenCount(),
          ...otherInfo,
        };
      });
      

      Finally, we can use consul-template to re-generate our list of canary hosts & ports, and write a monitoring script to check the /api/errorAndRequestCount route on all of them. If we see an error, we can run nomad job stop api-canary && exit 1, and that will stop our canary containers & our deployment pipeline.

      consul-template -template canary.tpl:canary.txt -once

      {{ range service "api-canary" }}
        {{ .Address }}:{{ .Port }}
      {{end -}}
      

      Our monitoring script watches our canary containers until it sees that they've handled 75,000 requests without an error. (75,000 is a little bit of an arbitrary number: it's large enough that we'll catch relatively rare errors, and small enough that we can serve that traffic on a small number of containers within a few minutes.)

      const fs = require("fs");
      const canaryContainers = fs
        .readFileSync("./canary.txt")
        .toString()
        .split("\n")
        .map((s) => s.trim())
        .filter(Boolean);
      const fetch = require("node-fetch");
      const { execSync } = require("child_process");
      const GOAL_REQUEST_COUNT = 75_000;
      
      const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
      
      (async function main() {
        while (true) {
          let totalRequestCount = 0;
          for (const container of canaryContainers) {
            const { errorCount, requestCount } = await fetch(
              `http://${container}/api/errorAndRequestCount`
            ).then((res) => res.json());
            totalRequestCount += requestCount;
            if (errorCount) {
              // stopping our canary containers is normally handled by the next stage in our pipeline
              // putting it here for illustration
              console.error("oh no! canary failed");
              execSync(`nomad job stop api-canary`);
              return process.exit(1);
            }
          }
      
          if (totalRequestCount >= GOAL_REQUEST_COUNT) {
            console.log("yay! canary succeeded");
            execSync(`nomad job stop api-canary`);
            return process.exit(0);
          }
      
          await delay(1000);
        }
      })();
      

      Nary an Error with Canary

      We've been running this canary setup (with occasional changes) for over eight years now. It's been a key part of our continuous delivery pipeline and has let us move quickly and safely. Without it, we would have shipped a lot more errors fully out to production, our overall error rate would likely be higher, and our teams would not be able to move as quickly as they can. Our setup definitely isn't perfect, but it's still hugely valuable, and I hope that sharing our setup will help your team create a better one.
