How We Built a DataOps Platform

Felix Yuan

sticksman

2022-02-07

DataOps has always been the vision for the ClassDojo team, even if we didn’t think of it using that term. Our goal has been to give every vertically integrated team ownership over its own metrics, tables, and data dictionaries, and the processes for generating those artifacts. These things should be written mostly in code, with some generated documentation.

But sometimes vision clashes with reality. Our old system grew organically, and every team had a separate set of metrics and queries for which it was responsible. We used some standard technologies to extract and load data from sources like in-house databases and cloud applications into Amazon Redshift and a Jenkins job that ran transformation queries in serial. In theory this should have empowered every team to write their own transformation queries and add their own replication sources.

But because the system had been built piecemeal and ad hoc, it was a massive mess. By early 2021, the number of queries in the Jenkins job had ballooned to more than 200, providing at least seven different measures of user engagement, while configuration management for the upstream pipelines languished with no ownership or improvement. From an engineering perspective, the data platform was some of the most unpleasant pieces of the ClassDojo codebase to work on. Though individual queries were straightforward and easy to debug, the platform itself had a number of performance issues that were difficult to understand.

We knew the old system was unsustainable, but we limped along until a catastrophic outage forced us to plan a major redesign. Though we still hadn’t crystallized around the term DataOps, we knew the system we planned had to fulfill the vision of a self-serve platform that empowered engineers and analysts to make changes without hand-holding.

Thus, a team of interested parties coalesced into a dedicated data team with engineering resources, and we began our months-long journey towards building the ClassDojo DataOps platform.

Find a Partner

To build a platform that actually conformed to our vision, we needed to completely redo the foundation. Unfortunately, the foundation was also the part of the system we had the least expertise with.

We chose a two-pronged approach to solve this problem. First, we redefined some roles. We split the job of data engineer into two roles — a data infrastructure engineer and an analytics engineer. One was in charge of maintaining the data platform, the other was in charge of understanding and fulfilling the business use cases for data.

Second, we searched for a partner that was aligned with our vision and had the track record needed to build the platform. We found it in Mutt Data, a small team of data experts that specialize in building out DataOps and ML pipelines for larger companies. Though we don’t take advantage of their ML expertise, we have been able to lean on their vast knowledge of how to build data tooling.

Together we were able to mold our vision into something actionable.

Define Requirements

The Mutt team were the ones who introduced us to the term DataOps. Defining what DataOps meant let us create requirements for what our system should be: a platform that includes standard data technologies with proven records of reliability and performance, where the most common use cases should be written in as little code as possible.

The outcome of our talks was a roadmap with concrete milestones and tasks.

Pay Down Technical Debt

First, though, we had to pay down our technical debt. As a general rule, startups lack the luxury of sitting down and planning to build something “right” from the start. For the sake of finding product market fit, rushing to get something out to market, or just a shrinking runway, it just doesn’t make sense for a growing, evolving company like ours to plan for a future that may not exist.

Unfortunately for us, data was a major debt item. The old system grew to meet needs instead of being built with specific requirements, and was developed only to fulfill the bare minimum of enabling reporting. Yet despite its flaws, and as much as we wanted to rid ourselves of the whole mess, it was our only reporting system, and thus had to function even as we rewrote the platform underneath.

Thus we spent the first few months of the rebuild dealing with performance of the old system and picking the pieces of old code to migrate.

Migrate Workflow Management to Airflow

As part of the migration, we set a goal of moving the transformation pipeline off of Jenkins and onto Airflow. That Amazon had a hosted Airflow service at the time was a huge bonus. While Jenkins is a competent cron runner with a log, Airflow is considered a data engineering standard. It offers a lot of flexibility, and new data hires are able to quickly be productive in its ecosystem.

We marked a number of queries for migration from Jenkins while axing some lesser-used ones to free up time for the mission-critical jobs. Most of these queries were pretty straightforward; others needed more attention.

Build a Data Lake

Despite the fact that our stabilization work had caused our transformation pipeline to finish in record time, some long-running queries were still taking longer than two hours to execute. We targeted the top 10 longest-running queries and migrated the input and result tables from Redshift into a data lake consisting of Amazon S3, Amazon Athena, and AWS Glue. The results were dramatic. Two-hour runtimes were cut to five minutes.

We were then able to take advantage of Glue and Amazon Redshift Spectrum to use the data as though it was in native Redshift tables. Though there was a bit of performance hit, it was good enough for most of our use cases.

Create Anomaly Detection

As with most companies, we have a product event stream that’s used to monitor feature usage and general business health. This event stream is the bedrock for all our major KPIs and downstream tables. For such a mission-critical piece of our business, we had shockingly little validation to be confident in its accuracy.

To validate our event stream, we added anomaly detection monitors to detect breakages in upstream-pipelines. These alarms forecast row counts using a FOSS project created by our partners called SoaM (Son of a Mutt). It’s especially useful for Dojo since our event patterns are very seasonal.

Once we had confidence in our event streams, we were able to move on to augmenting our downstream processes.

Add dbt

Dbt is a popular tool for data analysts. It functions like a souped-up version of the data transformation pipeline that we had in Jenkins in that it allows users to write SQL queries without having to worry about the more technical details underneath. This is really useful for our PMs and analysts who don’t (and shouldn’t) write Python.

But dbt has a lot of additional benefits for power users, like snapshotting and built-in incremental loads. On top of that, engineers get to take advantage of built-in unit tests, and the organization as a whole gets to take advantage of auto-generated documentation.

Augment with Great Expectations

Dbt unit tests are great, but we also wanted the option to add more complex validation where we could write simple assertions that are hard to translate into SQL. We got this with Great Expectations, a tool for validating, documenting, and profiling data. We found that we could hang a Great Expectations operator off of a dbt operator and gain both quick unit tests and more complex assertions. We could then upload the validation results to S3 and view them on a monitoring dashboard.

Migrate to Airbyte

We briefly touched on our upstream extraction pipelines. The old setup used data pipelines and some home-rolled technologies to replicate data from production databases into Redshift. Though the solutions worked well enough, they had no maintainers and a bit of stigma surrounding them.

The 2020 project Airbyte has been making a splash in data engineering circles. It promises easy loading between different data sources with a GUI and easy Airflow integration. Since it’s a newer project, we’ve been having some trouble integrating it with our existing technology stack, but the vision of a world where all upstream pipelines would be in the same place using a well-supported technology was too tantalizing to pass up.

We’ve tested output from Airbyte and are in the process of migrating existing pipelines off of an AWS data pipeline and onto Airbyte.

Throw in Some Easy Rollback

One of our core values here at ClassDojo is that failure recovery is more important than failure prevention. We hold this value to allow us to move fast without fear of failure. This means that building robust disaster recovery mechanisms for all of our major processes is a requirement for our platforms.

While we needed to build a few extra disaster prevention tools and processes as is natural with a stateful system, we’ve hewn to this value by building CI/CD tools that allow us to delete entire date ranges of data and backfill.

Tie It All Together

While most of these technologies and techniques are standard, each needs to be configured and toggled. To make a self-serve platform for both engineers and non-engineers, there needs to be some connective tissue that covers the most important use cases and allows for them to occur with as little code as possible.

Our final contribution to our DataOps platform was to build a Python layer that would detect and parse a short YAML configuration file and translate it into an Airflow DAG that has input sensors, a dbt transformation process, and optional tests and expectations. If a user doesn’t want to do anything complicated, they never need to write a line of Python.

Looking Forward

We’re proud of our new platform, but world-class data infrastructure means nothing if the data it manipulates isn’t leveraged. To make sure that happens, our data infrastructure engineering team hands off responsibility to our analytics engineering team. Their job is to mold our terabytes of raw data into a properly modeled star schema that gives the business a standard set of tables they can draw from for their reporting needs, which in turn aids us in our mission of creating a world-class educational experience that’s also loved by kids.

There has never been a more exciting time to be a part of the ClassDojo data organization. The problems are challenging, but there’s a clear path forward and plenty of support along the way. If you find the prospect of building the foundation for a business exciting, then join us by checking our jobs page and applying!

Felix Yuan

Felix Yuan is a data infrastructure engineer at ClassDojo. He enjoys Python and all its foibles, overly complicated DAGs, and Killer Queen.