A/B Testing From Scratch at ClassDojo

At ClassDojo, we're working on building amazing classroom communities at scale. To succeed in this mission, we have to run lots of product experiments and, just as importantly, accurately measure their results. In this post, I'll outline how we built our A/B testing system and how we used it to run one of those experiments.

The Problem

In order to build classroom communities, we need to make it easy for teachers to connect with every parent in their class. A few months ago, we set about redesigning our code sheets — pieces of paper that teachers print out and send home with students to give to their parents. The sheets give each parent instructions on how to join ClassDojo as well as a special code that will automatically associate them with their own child. (Side note: it might seem odd that pieces of paper are better than email or SMS signups for driving growth. But in fact, teachers are used to sending paper forms home all the time, so this method has proven surprisingly effective!)

We had a couple of designs for alternate versions of the code sheet, but we soon realized that we didn't have the infrastructure in place to measure which version actually improved the metrics we cared about. We wanted to build both new versions, compare each of them to the current version, and assess the statistical significance of any differences we saw. In short, we lacked the ability to properly instrument A/B tests and make data-informed decisions based on them.

Building a Solution

We first looked at third party solutions to minimize engineering costs, but found their offerings prohibitively expensive. Instead, we realized we could build an in-house system that would meet all of our needs with just a few weeks of engineering time.

First, we created a new experiments table in our database, consisting of the following fields (an example record is sketched after the list):

  • Audience: the set of users who are part of this experiment, defined through a series of subfields.
  • Variants: a list of possible experiences someone in the experiment might see, as well as how much of the audience (defined above) will see them. Control should be one of these variants.
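To make this concrete, here's a sketch of what one record in that table might look like. The field names and shape are illustrative assumptions, not our actual schema; the values come from the experiment described below.

// Hypothetical experiments record; field names are illustrative only.
const experiment = {
  name: "digitalSignupLinks",
  audience: {
    userType: "teacher",
    language: "en",
    signedUpAfter: "2017-01-06",
  },
  variants: [
    { name: "control", percentage: 20 },
    { name: "class_link_with_sheet", percentage: 20 },
    { name: "individual_link", percentage: 20 },
  ],
};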

We built a basic UI in our internal site for creating and modifying experiments:

[Screenshot: the experiment creation UI on our internal site]

In this case, we defined our audience as English-speaking teachers who signed up after January 6, 2017. We also defined three variants: "control", "class_link_with_sheet", and "individual_link" (the second and third variants refer to new versions of the code sheets we were trying to test).

Each of the three variants will be launched to 20% of users. The remaining 40% is not considered to be part of the experiment. What's the difference between being in the control group and not being in the experiment? The user experience is the same, but we only send exposure logs for the control group! This allows us to enforce equal sizes across all of our experiment groups.
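The post doesn't show how users get assigned to these buckets, but a common approach is to hash the user ID deterministically into a number from 0 to 99 and carve that range up according to the variant percentages. Here's a minimal sketch along those lines, reusing the hypothetical record shape from earlier; the function names and hashing scheme are assumptions, not necessarily what our system does:

import { createHash } from "crypto";

// Deterministically map a user to a bucket in [0, 100) for a given experiment.
// Hashing on both the experiment name and the user ID keeps assignments stable
// across sessions while staying independent across experiments.
function bucketFor(experimentName, userId) {
  const hash = createHash("md5").update(`${experimentName}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100;
}

// Walk the variant list and return the variant whose slice of the range contains
// this user's bucket, or null for the remaining 40% who aren't in the experiment.
function assignVariant(experiment, userId) {
  const bucket = bucketFor(experiment.name, userId);
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.percentage;
    if (bucket < cumulative) {
      return variant.name;
    }
  }
  return null;
}

With an assignment like this, a user who lands in the control slice sees exactly what a user outside the experiment sees; the only difference is the exposure log.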

Then, we wrote an API for fetching which variant a given user should see. Here's how that check looks on our React web client:

import getVariant from "app/utils/get_variant";

const variant = getVariant("digitalSignupLinks");

if (variant === "class_link_with_sheet") {
  renderClassLinkWithSheet();
} else if (variant === "individual_link") {
  renderIndividualLink();
} else { // "control" or null
  renderClassicCodeSheet();
}

Next, we built exposure logging: recording when someone actually sees the experience in question. Logging exposures helps us measure user behavior more accurately by ignoring anything that happens to users who were never exposed to the experiment. We already had a service for logging product events, so we piggybacked on it, sending a request to that service whenever a user was exposed to either the control or a treatment variant of an experiment. Here's the code that does this:

// variant is null for users outside the experiment, so nothing is logged for them.
if (variant) {
  logClient.logExposure({
    experimentName: "digitalSignupLinks",
    variant,
  });
}

Finally, we built a SQL templating system to easily compare the different variants and assess statistical significance. It requires the experimenter to define a simple query in which each row contains the time of a user's exposure to the experiment, whether that user is in the control group or one of the treatment groups, and whether or not they "converted", i.e., whether they ultimately did what the experiment was designed to help them do.

Since Redshift stores not only our exposure logs but also the rest of our analytics data, it's easy to define conversion events as anything from "clicked on this button" to "posted five or more times on Class Story." This flexibility was a major advantage over using a third-party service.

For our code sheets experiment, which was released to teachers, we decided to define our conversion metric as "connected at least five parents". Because we had two treatment groups in addition to control, we ran two separate analyses, each comparing one treatment group to the control group. We used this SQL for one of our results queries:

SELECT
  MAX(createdAt) AS "time",
  COUNT(parentId) >= 5 AS converted,
  BOOL_OR(eventValue = 'variant_class_link_with_sheet') AS treatment
FROM logs.product_event
LEFT JOIN production.teacher_parent_connections ON teacherId = product_event.entityId
WHERE eventName = 'sawExperiment_digitalSignupLinks'
AND (eventValue = 'variant_class_link_with_sheet' OR eventValue = 'variant_control')
GROUP BY entityId
ORDER BY "time";

Our templating system takes this query and wraps it in additional SQL to produce an analysis table, where each row contains the following data: date of measurement, control population size, treatment population size, control conversion rate, treatment conversion rate, % change between conversion rates, and p-value. Typically, we use a p-value threshold of 0.05 to assess statistical significance. Here's a sample of the analysis table from the code sheets experiment (for privacy reasons, this is mock data):

[Table: sample analysis results from the code sheets experiment (mock data)]

Here, note that the p-value starts high, at 0.27, but as more and more users are exposed to the experiment, the p-value converges to a value safely below the 0.05 threshold, meaning (based on this mock data) we can say with confidence that this variation on our code sheets leads to a positive change in the number of teachers connecting 5+ parents!
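The post doesn't say which statistical test the templating system runs; a two-proportion z-test is a standard choice for comparing conversion rates like these, and the sketch below shows roughly how that p-value could be computed. This is an illustration, not our production SQL or code:

// Two-sided p-value from a two-proportion z-test, given conversion counts and
// group sizes for the control and treatment groups.
function twoProportionPValue(controlConversions, controlSize, treatmentConversions, treatmentSize) {
  const p1 = controlConversions / controlSize;
  const p2 = treatmentConversions / treatmentSize;
  const pooled = (controlConversions + treatmentConversions) / (controlSize + treatmentSize);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / controlSize + 1 / treatmentSize));
  const z = (p2 - p1) / standardError;
  return 2 * (1 - standardNormalCdf(Math.abs(z)));
}

// Standard normal CDF via the Abramowitz & Stegun approximation of the upper tail.
function standardNormalCdf(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const density = Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI);
  const tail = density * t * (0.31938153 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return x >= 0 ? 1 - tail : tail;
}

// Example: twoProportionPValue(500, 4000, 560, 4000) is roughly 0.048.

The same inputs also give the other columns of the analysis table: the two conversion rates are p1 and p2, and the % change column would be (p2 - p1) / p1.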

Finally, if we want to represent our results visually, charting p-value and delta over time, we can easily plug our analysis table into any analytics software (e.g. Metabase).

Conclusion

Our new system makes it much easier to set up experiments: all the steps described in the example above can be completed in just an hour or two!

In the end, it took three engineers about two weeks to implement this system. Along the way, we learned a ton about A/B testing and statistical analysis, and hope to better understand and serve our users as a result.
