Creating an Actionable Logging Taxonomy

Will Keleher

kelwill

2021-07-20

The standard syslog-based way to handle logs is to divide logs into categories like FATAL, ERROR, WARN,INFO, and DEBUG and use those categories to adjust which logs you see. Having a standard logging taxonomy is incredibly useful when you're dealing with logs for many different systems, but for our backend web-servers, we didn't find these syslog-based divisions to be actionable or easily searchable. We instead created our own categories that map to how we want to handle our logs. We divided our logs into server-errors, client-errors, and sampled-investigation-logs, and added tags to route logs in these categories to the team & person who's best suited to handle them.

What's the ERROR with `ERROR` logs?

ERROR is a pretty broad category! For our team, we find it's useful to divide into errors that should be fixed on the web-server (server-errors: these are normally 5xxs), and integration errors that should be fixed on a client (client-errors: these are normally 4xxs). If something can't be fixed, then it's not an error: it's part of the system that we need to instrument and design around. Separating these out lets us alert on appropriate levels separately, and makes it easier to route these logs to the right team. Just because something is a 4xx doesn't mean it's necessarily a client-error: we often have routes that have expected 4xx responses.

Additionally, just because a dependency errored, doesn't mean that counts as an error for our application. There are plenty of times when a dependency throws or logs an error that we're able to easily recover from. Those situations shouldn't be treated as if something went wrong: those are times when something went right! We likely want to instrument and alert on these failures, but they don't count as errors for our application.

Finally, we want to route both server-errors and client-errors to the correct team, so we've decorated all of routes with a product area, and use that product area to tag these server/client-error logs with the appropriate team. This lets us set up per-team error budgets, and gives each team an easy way to search through the things that they care about. This gives us many of the visibility benefits of a microservices architecture while letting us continue to maintain a single monolith.

What do people use `INFO` and `DEBUG` logs for?

In general, INFO and DEBUG logs are useful when we run into a system error that we want to understand better, or when we're trying to understand a phenomenon.

Many teams add INFO and DEBUG logs on the off-chance that they'll turn out to be useful, and when something goes wrong in production, they can temporarily turn on that detailed logging to try and debug. Rather than doing that, we instead keep a record of useful actions & info, and log those actions and info out whenever we log a server or client error. For our backend web servers, that data is the current request, the user making the request, and database requests the server is currently making. For our web clients, we store enough recent redux actions that we can see how a user got into an error state. The only difference is we don't log any of that information until it becomes useful, and then we package it with our server-error or client-error logs.

The other main use for these INFO/DEBUG logs can be to explore a phenomenon. Some things happen too often to log every occurence, so we set up exploratory sample logs. Whenever we call our sampleLog function with the name of the phenomenon we're exploring, we increment a metric tracking how many times it's happened, and if an individual container hasn't logged info about the phenomenon recently, we log out the details. This lets us casually add new logs to our codebase without worrying about volume.

Here's a pseudocode example of how that might work:

const sampleLogRateLimiter: Record<string, number> = {};
function sampleLog(
  phenomenonName: string,
  message: string,
  details: Record<string, any>
) {
  const metricName = toMetricName(phenomenonName);
  metrics.increment(metricName);
  const now = Date.now();
  const lastLoggedAt = sampleLogRateLimiter[phenomenonName] || 0;
  const shouldLog = now - lastLoggedAt > ms("1 minute");
  if (shouldLog) {
    sampleLogRateLimiter[phenomenonName] = now;
    // we normally use winston here :)
    console.log(
      `${phenomenonName} ${message}`,
      JSON.stringify({ stack: new Error().stack, ...details })
    );
  }
}

What about `WARNING` logs?

WARNING logs are almost always ignored, and it's easy for them to get completely out of control. We don't have any warning logs: we try to either remove them, or turn them into legitimate server or client error logs that we'll actually pay attention to.

Our Logging Taxonomy

things we want a backend engineer on a product team to look at and fix: server-error logs
things we want a frontend client engineer on a product team look at and fix: client-error logs
things we're curious about: sampling exploratory logs

Overall, this system works well for us! We've been able to keep our error rate incredibly low, and it's been easy to quickly discover & fix new bugs & issues. It would have been much harder without setting up our logs in categories that match up to how we wanted to use them.

Will Keleher

A former teacher, Will is an engineering manager focused on database performance, team effectiveness, and site-reliability. He believes most problems can be solved with judicious use of `sed` and `xargs`.

Creating an Actionable Logging Taxonomy

What's the ERROR with ERROR logs?

What do people use INFO and DEBUG logs for?

What about WARNING logs?

Our Logging Taxonomy

What's the ERROR with `ERROR` logs?

What do people use `INFO` and `DEBUG` logs for?

What about `WARNING` logs?