How We Built a DataOps Platform

Felix Yuan

sticksman

2022-02-07

DataOps has always been the vision for the ClassDojo team, even if we didn’t think of it using that term. Our goal has been to give every vertically integrated team ownership over its own metrics, tables, and data dictionaries, and the processes for generating those artifacts. These things should be written mostly in code, with some generated documentation.

But sometimes vision clashes with reality. Our old system grew organically, and every team had a separate set of metrics and queries for which it was responsible. We used some standard technologies to extract and load data from sources like in-house databases and cloud applications into Amazon Redshift, along with a Jenkins job that ran transformation queries in serial. In theory, this should have empowered every team to write its own transformation queries and add its own replication sources.

But because the system had been built piecemeal and ad hoc, it was a massive mess. By early 2021, the number of queries in the Jenkins job had ballooned to more than 200, providing at least seven different measures of user engagement, while configuration management for the upstream pipelines languished with no ownership or improvement. From an engineering perspective, the data platform was one of the most unpleasant parts of the ClassDojo codebase to work on. Though individual queries were straightforward and easy to debug, the platform itself had a number of performance issues that were difficult to understand.

We knew the old system was unsustainable, but we limped along until a catastrophic outage forced us to plan a major redesign. Though we still hadn’t crystallized around the term DataOps, we knew the system we planned had to fulfill the vision of a self-serve platform that empowered engineers and analysts to make changes without hand-holding.

Thus, a team of interested parties coalesced into a dedicated data team with engineering resources, and we began our months-long journey towards building the ClassDojo DataOps platform.

Find a Partner

To build a platform that actually conformed to our vision, we needed to completely redo the foundation. Unfortunately, the foundation was also the part of the system we had the least expertise with.

We chose a two-pronged approach to solve this problem. First, we redefined some roles. We split the job of data engineer into two roles: a data infrastructure engineer and an analytics engineer. The former was in charge of maintaining the data platform; the latter was in charge of understanding and fulfilling the business use cases for data.

Second, we searched for a partner that was aligned with our vision and had the track record needed to build the platform. We found it in Mutt Data, a small team of data experts that specialize in building out DataOps and ML pipelines for larger companies. Though we don’t take advantage of their ML expertise, we have been able to lean on their vast knowledge of how to build data tooling.

Together we were able to mold our vision into something actionable.

Define Requirements

The Mutt team were the ones who introduced us to the term DataOps. Defining what DataOps meant let us create requirements for what our system should be: a platform that includes standard data technologies with proven records of reliability and performance, where the most common use cases should be written in as little code as possible.

The outcome of our talks was a roadmap with concrete milestones and tasks.

Pay Down Technical Debt

First, though, we had to pay down our technical debt. As a general rule, startups lack the luxury of sitting down and planning to build something “right” from the start. Whether the pressure comes from finding product-market fit, rushing to get something out to market, or just a shrinking runway, it doesn’t make sense for a growing, evolving company like ours to plan for a future that may not exist.

Unfortunately for us, data was a major debt item. The old system grew to meet needs instead of being built with specific requirements, and was developed only to fulfill the bare minimum of enabling reporting. Yet despite its flaws, and as much as we wanted to rid ourselves of the whole mess, it was our only reporting system, and thus had to function even as we rewrote the platform underneath.

Thus we spent the first few months of the rebuild dealing with the performance of the old system and picking the pieces of old code to migrate.

Migrate Workflow Management to Airflow

As part of the migration, we set a goal of moving the transformation pipeline off of Jenkins and onto Airflow. That Amazon had a hosted Airflow service at the time was a huge bonus. While Jenkins is a competent cron runner with a log, Airflow is considered a data engineering standard. It offers a lot of flexibility, and new data hires are able to quickly be productive in its ecosystem.

We marked a number of queries for migration from Jenkins while axing some lesser-used ones to free up time for the mission-critical jobs. Most of these queries were pretty straightforward; others needed more attention.

Build a Data Lake

Despite the fact that our stabilization work had caused our transformation pipeline to finish in record time, some long-running queries were still taking longer than two hours to execute. We targeted the top 10 longest-running queries and migrated the input and result tables from Redshift into a data lake consisting of Amazon S3, Amazon Athena, and AWS Glue. The results were dramatic. Two-hour runtimes were cut to five minutes.

We were then able to take advantage of Glue and Amazon Redshift Spectrum to use the data as though it were in native Redshift tables. Though there was a bit of a performance hit, it was good enough for most of our use cases.

Create Anomaly Detection

As with most companies, we have a product event stream that’s used to monitor feature usage and general business health. This event stream is the bedrock for all our major KPIs and downstream tables. For such a mission-critical piece of our business, we had shockingly little validation to be confident in its accuracy.

To validate our event stream, we added anomaly detection monitors to detect breakages in upstream pipelines. These alarms forecast row counts using a FOSS project created by our partners called SoaM (Son of a Mutt). It’s especially useful for Dojo since our event patterns are very seasonal.

Once we had confidence in our event streams, we were able to move on to augmenting our downstream processes.

Add dbt

Dbt is a popular tool for data analysts. It functions like a souped-up version of the data transformation pipeline that we had in Jenkins in that it allows users to write SQL queries without having to worry about the more technical details underneath. This is really useful for our PMs and analysts who don’t (and shouldn’t) write Python.

But dbt has a lot of additional benefits for power users, like snapshotting and built-in incremental loads. On top of that, engineers get to take advantage of built-in unit tests, and the organization as a whole gets to take advantage of auto-generated documentation.

Augment with Great Expectations

Dbt unit tests are great, but we also wanted the option to add more complex validation where we could write simple assertions that are hard to translate into SQL. We got this with Great Expectations, a tool for validating, documenting, and profiling data. We found that we could hang a Great Expectations operator off of a dbt operator and gain both quick unit tests and more complex assertions. We could then upload the validation results to S3 and view them on a monitoring dashboard.

Migrate to Airbyte

We briefly touched on our upstream extraction pipelines. The old setup used data pipelines and some home-rolled technologies to replicate data from production databases into Redshift. Though the solutions worked well enough, they had no maintainers and a bit of stigma surrounding them.

The 2020 project Airbyte has been making a splash in data engineering circles. It promises easy loading between different data sources with a GUI and easy Airflow integration. Since it’s a newer project, we’ve been having some trouble integrating it with our existing technology stack, but the vision of a world where all upstream pipelines would be in the same place using a well-supported technology was too tantalizing to pass up.

We’ve tested output from Airbyte and are in the process of migrating existing pipelines off of AWS Data Pipeline and onto Airbyte.

Throw in Some Easy Rollback

One of our core values here at ClassDojo is that failure recovery is more important than failure prevention. We hold this value to allow us to move fast without fear of failure. This means that building robust disaster recovery mechanisms for all of our major processes is a requirement for our platforms.

While we needed to build a few extra disaster prevention tools and processes as is natural with a stateful system, we’ve hewn to this value by building CI/CD tools that allow us to delete entire date ranges of data and backfill.

Tie It All Together

While most of these technologies and techniques are standard, each needs to be configured and toggled. To make a self-serve platform for both engineers and non-engineers, there needs to be some connective tissue that covers the most important use cases and allows for them to occur with as little code as possible.

Our final contribution to our DataOps platform was to build a Python layer that would detect and parse a short YAML configuration file and translate it into an Airflow DAG that has input sensors, a dbt transformation process, and optional tests and expectations. If a user doesn’t want to do anything complicated, they never need to write a line of Python.

Looking Forward

We’re proud of our new platform, but world-class data infrastructure means nothing if the data it manipulates isn’t leveraged. To make sure that happens, our data infrastructure engineering team hands off responsibility to our analytics engineering team. Their job is to mold our terabytes of raw data into a properly modeled star schema that gives the business a standard set of tables they can draw from for their reporting needs, which in turn aids us in our mission of creating a world-class educational experience that’s also loved by kids.

There has never been a more exciting time to be a part of the ClassDojo data organization. The problems are challenging, but there’s a clear path forward and plenty of support along the way. If you find the prospect of building the foundation for a business exciting, then join us by checking our jobs page and applying!

Editing 200 Files with Bash and Perl

Andrew Burgess

andrew8088

2022-01-13

I recently had to change 189 files in our code base, all in almost the same way. Rather than doing it manually, I decided to brush up on my command-line text manipulation ... and ended up taking it further than I expected.

The Mission

The changes were pretty simple. In our API code, we have TypeScript definitions for every endpoint. They look something like this:

interface API {
    "/api/widget/:widgetId": {
        GET: {
            params: {
                widgetId: MongoId;
            };
            response: WidgetResponse;
        }
    }
}

You'll notice the params are defined twice: once in the URL key string (as :widgetId) and again in the GET attribute (under params). We are moving to a TypeScript template literal string parser to get the type information out of the URL key string itself, so I wanted to remove the params key from these definitions. But with 189 files to change, the usual manual approach wasn't so inviting.

So, I set myself the challenge of doing it via the command line.

Step 1: Remove the lines

I'll be honest: when I started, this was the only step I had in mind. I needed to do a multi-line find-and-replace to remove params: { ... }. A quick grep showed me that this pattern was unique to the places I wanted to change, though I could have narrowed the search to just our endpoints in src/resources if necessary. For the replacement, I thought sed might be the right tool, but newlines can be challenging to work with ... so I ended up learning my first bit of perl to make this work.

Here's what I ended up doing (I've added line breaks for readability):

grep -r --files-with-matches "params: {" ./src | while read file;
    do
        perl -0777 -pi -e 's/ *params: {[^}]*};\n//igs' "$file";
    done

This one-liner uses grep to recursively search my src directory to find all the files that have the pattern I want to remove. Actually, I usually reach for ag (the silver searcher) or ripgrep, but grep is already available pretty much everywhere. Then, we'll loop over the files and use perl to replace that content.

Like I said, this was my first line of perl, but I'm fairly sure it won't be my last. This technique of using perl for find-and-replace logic is called a perl pie. Here's what it does:

-0777 puts perl in "slurp" mode, so it reads in the entire file as a single string instead of line by line
-p wraps the one-liner in a loop that reads the input, runs the program on it, and prints the result
-i means that perl will change the file in place; if you aren't making this change in a git repo like I am, you can do something like -i.backup and perl will create a copy of the original file, so you aren't making an irreversible change
-e expects an argument that is your one-line program

Oh, and the program itself:

s/ *params: {[^}]*};\n//igs

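This is typical 's/find/replace/flags' syntax, and you know how regexes work. The flags are case-insensitive (i), global (g), and single-line (s, where . will also match newlines). Strictly speaking, the negated character class [^}]* already matches newlines on its own, so the s flag isn't doing any work in this particular pattern.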

So, this changed the 189 files, in exactly the way I wanted. At this point, I was feeling great about my change. I reviewed the changes, committed them, and started the git push.

Step 2: Remove unused imports

Not so fast. Our pre-push hooks caught a TypeScript linting issue:

error TS6133: 'MongoId' is declared but its value is never read.

5 import { MongoId } from "our-types";
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ah, yeah, that makes sense. URL parameters are strings, but we have a MongoId type that's a branded string. I forgot about this step, but that's why we have pre-push checks! We'll need to remove those imports.

How can we do this? Well, let's get a list of the files we changed in our most recent commit:

 git show --name-only | grep ^src

We add the grep to only find the files within our top-level src directory (and to remove the commit information).

Then, we need to find all the files that include MongoId only once. If a file references MongoId multiple times, then we don't want to remove the import, because clearly we're still using it. If the file only references MongoId once, we can remove the import ... but we have to consider that it might not be the only thing we're importing on that line. For starters, we can use grep's -c flag to count the number of lines that mention MongoId in each file (-c counts matching lines, not individual occurrences).

for file in $(git show --name-only | grep ^src)
    do
        grep -c MongoId "$file"
    done

A simple for loop works here, because I know the only whitespace is the linebreaks between the file names. Once we have the count, we can check to see that there's only 1 match:

for file in $(git show --name-only | grep ^src)
    do
        if [ "$(grep -c MongoId "$file")" = 1 ]; then echo "..."; fi
    done
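To see the counting check in action without a git repository, here is a small sketch on two made-up files. The file names and contents are assumptions for illustration only. One file mentions MongoId only in its import, and the other still uses the type.

```shell
# Made-up files standing in for the ones changed in the last commit.
workdir=$(mktemp -d)
printf '%s\n' 'import { MongoId } from "our-types";' \
              'export const x = 1;' > "$workdir/only-import.ts"
printf '%s\n' 'import { MongoId } from "our-types";' \
              'export type WidgetId = MongoId;' > "$workdir/still-used.ts"

# Collect the files where MongoId appears on exactly one line.
candidates=""
for file in "$workdir"/*.ts
do
    # grep -c prints the number of matching lines, not individual matches.
    if [ "$(grep -c MongoId "$file")" = 1 ]; then
        candidates="$candidates$(basename "$file") "
    fi
done

echo "safe to edit: $candidates"
```

Only only-import.ts passes the check. still-used.ts has two matching lines, so its import is left alone.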

We're using an if statement here, to check that the occurrence count is 1. If it is, we want to do something. But what? Remember, we might be importing multiple things on that line, so that leaves us with three possible actions:

Remove the whole line when MongoId is the only item imported.
Remove MongoId, when it's the first item imported on that line. Don't miss that following comma!
Remove , MongoId when it's not the first item on that line. Don't miss the preceding comma!

There are many ways we could do this, so let's have some fun with reading input from the command line! To be clear, this isn't the best way to do it. We could easily match our three cases above with perl or sed. But we've already used that pattern in this project, and reading input in a shell script is an incredibly useful tool to have in your toolbox.

At this point, we probably want to move this into an actual shell script, instead of running it like a one-off on the command line:

#!/bin/bash

for file in $(git show --name-only | grep ^src)
do
    # Only prompt for files where MongoId appears on exactly one line.
    if [ "$(grep -c MongoId "$file")" = 1 ]
    then
        echo ""
        echo "====================="
        echo "1 - remove whole line"
        echo "2 - remove first import"
        echo "3 - remove other import"
        echo ""
        echo "file: $file"
        echo "line: $(grep MongoId "$file" | grep -v "^//")"
        echo -n "> "

        # Pause and wait for a choice from the keyboard.
        read -r choice

        echo "your choice: $choice"

        case "$choice" in
            1)
                # BSD/macOS sed syntax; GNU sed takes -i with no '' argument.
                sed -i '' "/MongoId/d" "$file"
                ;;
            2)
                perl -i -pe "s/MongoId, ?//" "$file"
                ;;
            3)
                perl -i -pe "s/, ?MongoId//" "$file"
                ;;
            *)
                echo "nothing, skipping line"
                ;;
        esac
    fi
done

Don't be intimidated by this; it's mostly echo statements. But we're doing some pretty cool stuff here.

Inside our if statement, we start by echoing some instructions, as well as the file name and the line that we're about to operate on. Then, we read an input from the command line. At this point, the script will pause and wait for us to type some input. Once we hit <enter> the script will resume and assign the value we entered to our choice variable.

Once we have determined our choice, we can do the correct replacement using the bash equivalent of a switch/case statement. For case 1, we're using sed's delete line command d. For cases 2 and 3, we'll use a perl substitution instead, because it will operate only on the matched text, and not on the whole line. Finally, the default case will do nothing.

Running this script, we can now walk through the files, one by one, and review each change. It reduces our work to one keystroke per file, which is way less than opening each file, finding the line, and removing the right stuff.

And that's it! While we don't use command-line editing commands every day, keeping these skills sharp will speed up your workflow when the right task comes along.

Our Approach to Mob Programming

Melissa Dirdo

melissayu

2021-12-06

Our teams at ClassDojo have the freedom to choose how they want to work. Many of our teams have started spending a few hours each day mobbing because we've found it to be an effective form of collaboration. Here's how we do it!

What is Mob Programming?

Mob programming is similar to pair programming, but with more than two people working together. One person, the driver, does the actual typing but everyone is involved in the problem solving. Mob programming is often defined as “All the brilliant minds working on the same thing, at the same time, in the same space, and at the same computer.” We don’t follow the strict definition of mobbing, especially since we are a fully remote team, but we are continuously iterating on an approach that works for us.

Why do we mob?

Woody Zuill has a great writeup about how a whole range of issues, including communication problems and decision-making problems, just faded away once his teams started mobbing, without anyone trying to address those issues directly. We’ve found similar benefits, and I’ll call out just a few:

Focus

When the team is working together on a single task, it means we’re focused on the top priority for our team. Although it may sound more productive to have multiple engineers working in parallel on separate tasks, that often means that the top priority is delayed when waiting for answers to questions. Having the whole team focused on the same thing greatly decreases the amount of context switching we need to do.

Knowledge Sharing

Without mobbing, it’s easy to develop silos of knowledge as individuals become experts in specific areas. Others might gain context through code reviews or knowledge sharing meetings. However, when the whole team works together on a piece of code, it almost eliminates the need for code reviews since all the reviewers were involved in writing it, and everyone already has shared knowledge. Mobbing is also really useful for onboarding new teammates and getting them up to speed.

Quality

More time is spent debugging and refactoring code than writing it. If you mob, you have more eyes on the code while it’s being written, rather than during code review or later when it needs to be updated or refactored. You increase the quality of your output, and that quality increase leads to long-term speed.

Collaboration

Especially with a fully remote engineering team, it can be isolating to only work on individual tasks. There is also the challenge of communication and having to wait for answers to blocking questions. By having everyone attend the mob, we eliminate that waiting time. Questions can be answered immediately and decisions are made as a group.

What does remote mobbing look like at ClassDojo?

Who: Most often, we have all the engineers of the same function (e.g. all the full-stack engineers) on a team join a mob. Depending on the task it can be helpful to have other functions like client engineers or product managers join as well, to quickly answer questions and unblock. The group will naturally include engineers of varying skill levels, which is a good thing! We rotate drivers often, but like to have the less experienced engineers drive as it keeps them engaged and learning.

When: This depends on the team’s preference and availability as well as the nature of the task, but we may schedule mobbing time for anywhere from an hour to almost the entire day, most days of the week. It’s important to block the same time off on each person’s calendar and protect that time from other meetings. During longer sessions, we set a timer to remind ourselves to take breaks often. We generally take a 10-15 minute break after every 45 minutes of focused mobbing.

What: We pick one task to focus on, and it should be the highest priority task for the team. It’s easy to get derailed by PRs that need reviewing, bugs that get reported, questions on Slack, etc., but we make a conscious effort to avoid starting anything new until we finish the current task. The one exception we have is for P-now bugs, which we drop everything else for.

How: No special tools or complex setup required! We simply hop on a Zoom call and the driver shares their screen. If we’re coding, the driver will use their own IDE and when it’s time to switch drivers, the driver pushes the changes to a branch so the next driver can pull the latest. There are tools for collaborative coding, but we’ve found that they don’t offer much benefit over simply having someone share their screen. If we’re in a design phase, we often use Miro as a collaborative whiteboard.

As with everything we do, we have frequent retrospectives to reflect on what’s going well and what could be improved with how we mob, and we are open to trying new ideas. If you have any thoughts, we’d love to hear from you!
