Creating Photo Collages With Node-canvas

Class Story is one of our core product features at ClassDojo. It allows teachers to post photos and videos that parents can view, like, and comment on, and teachers around the world use it to share cute, funny, or exciting moments from their classrooms. Since we’re approaching the end of the school year, we wanted to provide teachers with something memorable and engaging to send home. But with a ton of new projects we’d like to ship before back-to-school in August, we also wanted something we could build quickly.

We settled on a photo collage of a classroom’s top posts throughout the year. We generated collages for all classrooms with at least 9 posts this year and surfaced them in-app for teachers to share with their parents. We hope parents, teachers, and students everywhere will enjoy seeing a year’s worth of memories!

Here’s an example of one of the collages, using posts from our internal Class Story:

[Image: an example collage from the ClassDojo team]

In the process, we wrote a simple npm module, photo-collage, which generates photo collages. You can find it on npm and browse the source code in its repository.

Getting the data

We pulled the list of eligible classrooms and photos from AWS's Redshift, which we use as our main data store for offline processing and analytics. The desired result from this step was a CSV file, with each row containing a classroom id and nine photo URLs. Here's a slightly simplified version of our query:

select
  *
from (
  select
    classroom_id,
    max(case when rank = 1 then photo_url end) as photo_url_1,
    max(case when rank = 2 then photo_url end) as photo_url_2,
    max(case when rank = 3 then photo_url end) as photo_url_3,
    max(case when rank = 4 then photo_url end) as photo_url_4,
    max(case when rank = 5 then photo_url end) as photo_url_5,
    max(case when rank = 6 then photo_url end) as photo_url_6,
    max(case when rank = 7 then photo_url end) as photo_url_7,
    max(case when rank = 8 then photo_url end) as photo_url_8,
    max(case when rank = 9 then photo_url end) as photo_url_9
  from (
    select
      classroom_id,
      photo_url,
      row_number() over (
        partition by classroom_id order by (
          10 * like_count + 5 * comment_count + read_count
        ) desc
      ) as rank
    from story_posts
    where photo_url is not null
  ) as ranked_photos
  where rank <= 9
  group by 1
) as classroom_photos
where photo_url_9 is not null;

Let's break that down a little. Our innermost query is:

select
  classroom_id,
  photo_url,
  row_number() over (
    partition by classroom_id order by (
      10 * like_count + 5 * comment_count + read_count
    ) desc
  ) as rank
from story_posts
where photo_url is not null

We used row_number and partition by to rank all posts in a classroom by an engagement score. The rank column takes on integer values (beginning with 1) based on this score: the post with the highest score in each classroom gets rank 1, the second highest gets rank 2, and so on. After this step, our rows look something like this:

classroom_id_1 photo_1_a.jpg 1
classroom_id_1 photo_1_b.jpg 2
classroom_id_1 photo_1_c.jpg 3
...
classroom_id_2 photo_2_a.jpg 1
classroom_id_2 photo_2_b.jpg 2
classroom_id_2 photo_2_c.jpg 3
...
classroom_id_3 photo_3_a.jpg 1
classroom_id_3 photo_3_b.jpg 2
...

We now want to aggregate these rows into a single row with the top 9 photos for each classroom. To do this, we'll use this query:

select
  classroom_id,
  max(case when rank = 1 then photo_url end) as photo_url_1,
  max(case when rank = 2 then photo_url end) as photo_url_2,
  max(case when rank = 3 then photo_url end) as photo_url_3,
  max(case when rank = 4 then photo_url end) as photo_url_4,
  max(case when rank = 5 then photo_url end) as photo_url_5,
  max(case when rank = 6 then photo_url end) as photo_url_6,
  max(case when rank = 7 then photo_url end) as photo_url_7,
  max(case when rank = 8 then photo_url end) as photo_url_8,
  max(case when rank = 9 then photo_url end) as photo_url_9
from (
  -- Our query above, which returns (classroom_id, photo_url, rank)
) as ranked_photos
where rank <= 9
group by classroom_id

The expression case when rank = 1 then photo_url end picks out the photo_url for the row with rank 1, and returns null for every other row. So if a classroom has 20 photo rows, this expression returns one photo_url and 19 nulls. Without group by and max, our output would look like:

classroom_id_1 photo_1_a.jpg null null ...
classroom_id_1 null photo_1_b.jpg null ...
classroom_id_1 null null photo_1_c.jpg ...
...
classroom_id_2 photo_2_a.jpg null null ...
classroom_id_2 null photo_2_b.jpg null ...
classroom_id_2 null null photo_2_c.jpg ...
...
classroom_id_3 photo_3_a.jpg null null ...
classroom_id_3 null photo_3_b.jpg null ...
...

But we actually want one row per classroom, which means using a group by clause. We then need to tell SQL how to aggregate the rows within each group, and in this case, max is the correct aggregate. Each photo column has exactly one non-null value per classroom, and since max ignores nulls, it returns that single value, transforming our rows into:

classroom_id_1 photo_1_a.jpg photo_1_b.jpg photo_1_c.jpg ...
classroom_id_2 photo_2_a.jpg photo_2_b.jpg photo_2_c.jpg ...
classroom_id_3 photo_3_a.jpg photo_3_b.jpg null ...
...

Finally, we need to filter out classrooms that don't have 9 photos, since we won't be able to generate a complete collage in these cases. This is easy:

select
  *
from (
  -- Our query above, which returns (classroom_id, photo_url_1, photo_url_2, ...)
) as classroom_photos
where photo_url_9 is not null;

Making the collages

With our list of classrooms and photos in hand, we needed to generate the collages. Our first decision was to pre-compute the collages instead of creating them on demand, since this would greatly increase reliability. Since we'd be releasing these collages only once, it was important that they worked for everyone on the first try. Having the collages pre-generated also made it easy to determine which teachers should receive a notification to share their collage: we only needed to check whether a teacher taught a classroom we'd generated one for. However, given the non-trivial size of the data set, our generation script needed to be parallelizable and couldn't use a huge amount of disk space.

To aid in creating the collages, we wrote the photo-collage library. It takes a set of image sources and returns a canvas object with the collage. The sources can be buffers, URLs, or local files. Each source is loaded and the images are placed onto the canvas with the desired spacing and sizing. We were then able to use this library in a script that processes our CSV, parses the rows, creates the collages, and saves them to disk. Here's the code (excluding imports and constants):

highland(byline(fs.createReadStream(SOURCE_FILE)))
  .map((line) => {
    const elems = String(line).split(",");
    const classroomId = elems[0];
    const photoURLs = elems.slice(1);
    return {classroomId, photoURLs};
  })
  .map(({classroomId, photoURLs}) => {
    const collagePromise = photoCollage({
      sources: photoURLs,
      width: 3,
      height: 3,
      imageHeight: IMAGE_HEIGHT,
      imageWidth: IMAGE_WIDTH,
      spacing: SPACING,
      backgroundColor: BACKGROUND_COLOR,
    })
      .then((canvas) => {
        canvas.getContext("2d").drawImage(FOOTER_IMAGE, 0, 700, 1000, 300);
        return {classroomId, canvas};
      });
    return highland(collagePromise);
  })
  .mergeWithLimit(NUM_PARALLEL)
  .map(({classroomId, canvas}) => {
    const sourceStream = canvas.jpegStream();
    const destinationStream = fs.createWriteStream("./output/" + classroomId + ".jpg");

    return highland((push) =>
      sourceStream
        .on("error", (err) => push(err))
        .pipe(destinationStream)
        .on("error", (err) => push(err))
        .on("finish", () => {
          push(null, classroomId);
          push(null, highland.nil);
        })
    );
  })
  .merge()
  .errors((err) => console.error("Error generating collage", err))
  .each((classroomId) => console.log("Generated collage for", classroomId))
  .done(() => console.log("All done!"));

Let's break this script down as well.

highland(byline(fs.createReadStream(SOURCE_FILE)))

We first generate a stream of lines from our CSV. Streams are great for processing data without buffering it all into memory, which is not always possible if the data set is large. Highland is an excellent library that provides elegant abstractions for working with streams of data. It wraps streams from node's standard library (along with arrays, promises, and more) and manipulates them, much like lodash or underscore do for regular collections.
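To make that concrete, here's a tiny sketch (not from our script) of Highland wrapping a plain array, lodash-style:

const highland = require("highland");

highland([1, 2, 3])
  .map((n) => n * 2)
  .toArray((result) => console.log(result)); // prints [ 2, 4, 6 ]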

We also used byline, which splits a raw file stream into lines. The output of this stream will therefore be the individual lines of our CSV.
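On its own, byline usage looks something like this (the filename here is hypothetical):

const fs = require("fs");
const byline = require("byline");

// byline re-chunks a readable stream so each "data" event is one line.
byline(fs.createReadStream("photos.csv"))
  .on("data", (line) => console.log(String(line)));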

.map((line) => {
  const elems = String(line).split(",");
  const classroomId = elems[0];
  const photoURLs = elems.slice(1);
  return {classroomId, photoURLs};
})

Next, we parse each row of our CSV. Highland's map function transforms each item of a stream into a new item, much like the map method of an array. We could have used a more fully-featured CSV parser for this, but for our purposes, a simple split was fine.

.map(({classroomId, photoURLs}) => {
  const collagePromise = photoCollage({
    sources: photoURLs,
    width: 3, // number of photos per row
    height: 3, // number of photos per column
    imageHeight: IMAGE_HEIGHT,
    imageWidth: IMAGE_WIDTH,
    spacing: SPACING, // pixels between each photo
    backgroundColor: BACKGROUND_COLOR,
  })
    .then((canvas) => {
      canvas.getContext("2d").drawImage(FOOTER_IMAGE, 0, 700, 1000, 300);
      return {classroomId, canvas};
    });
  return highland(collagePromise);
})
.mergeWithLimit(NUM_PARALLEL)

Now we're going to generate a collage canvas for each parsed row. This is an async operation, because the photoCollage library needs to download our photos, so it returns a promise of a canvas. Once that promise resolves, we can then add our custom footer on top of that canvas - that's the image of Mojo and the text that says "What a wonderful year!".

However, we can't just return a promise of a canvas. What we really want to do is to emit the canvas itself once it finishes, so that we can save it to disk. We can do this in two steps. First, calling the Highland constructor on a promise turns that promise into a stream. The stream will eventually emit the resolved value of the promise, and then end. So now we've created a stream of streams (one per classroom), each of which will emit a {canvas, classroomId} pair (we're including the classroomId so we know which file to write the collage to, and for logging).
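As a standalone illustration (again, not from our script), wrapping a promise looks like this:

const highland = require("highland");

// A one-item stream: it emits the promise's resolved value, then ends.
highland(Promise.resolve("hello"))
  .each((value) => console.log(value)); // prints "hello"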

Then, we can use one of several methods to join a stream of streams into a single stream of values, presented here in order of increasing speed:

  • sequence() will process the streams one at a time, emitting all of the items in one stream before beginning to process the next one.

  • parallel(n) will process up to n streams at once, but will emit all items on the i'th stream before beginning to emit any items from the i+1'th stream.

  • mergeWithLimit(n) will, like parallel, process up to n streams at once, but will emit their items as they are processed. Items that appear earlier in the i'th stream will appear before later items in the same stream, but no guarantees are made about the order of items in different streams.

  • merge() is like mergeWithLimit, but without any limit on the number of streams that can be processed at once. Like mergeWithLimit, order is preserved within a stream, but not between streams.

Since we don't care about the order in which our collages are written to disk, there was no reason to use sequence or parallel. However, since our images were being downloaded from an image proxy service, and this service had limited capacity, we chose to use mergeWithLimit, with a configurable degree of parallelism.
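To make the difference concrete, here's a sketch (with made-up values and delays) of mergeWithLimit interleaving the results of slow inner streams:

const highland = require("highland");

// A one-item stream that does its "work" (a timeout) only once consumed.
const delayed = (value, ms) =>
  highland((push) => {
    setTimeout(() => {
      push(null, value);
      push(null, highland.nil);
    }, ms);
  });

highland([delayed("a", 300), delayed("b", 100), delayed("c", 150)])
  .mergeWithLimit(2) // at most two inner streams run at once
  .each(console.log); // likely prints b, c, a: completion order, not input order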

.map(({classroomId, canvas}) => {
  const sourceStream = canvas.jpegStream();
  const destinationStream = fs.createWriteStream("./output/" + classroomId + ".jpg");

  return highland((push) =>
    sourceStream
      .on("error", (err) => push(err))
      .pipe(destinationStream)
      .on("error", (err) => push(err))
      .on("finish", () => {
        push(null, classroomId);
        push(null, highland.nil);
      })
  );
})
.merge()

Now we need to write our canvases to disk. Canvas objects can create a stream with the JPEG data of their contents. We can then pipe this stream into a file. However, we can't just return this writable stream to Highland. Highland only works with streams that emit data, and node writable streams don't. Besides, we don't actually care about the bytes we're writing to a file. We just need to know which classrooms we've finished with, so we can log that information. So, let's create a new stream that will emit a single classroom id when we finish piping our JPEG data to disk, and then end.

To do this, we can construct a Highland stream from a function. Our function is called with a push function as an argument. push works a lot like a node-style callback (though you're allowed to call it multiple times): if it's called with a non-null first argument, that's treated as an error, and otherwise the second argument is treated as data. If the second argument is highland.nil, the stream ends. You can read more about constructing Highland streams from functions in the documentation.
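Here's a minimal standalone example of the same pattern:

const highland = require("highland");

highland((push) => {
  push(null, "first"); // a non-null second argument is data
  push(null, "second");
  push(null, highland.nil); // highland.nil ends the stream
})
  .each((value) => console.log(value)); // prints "first", then "second"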

In our case, we want to push an error if either of the source or destination streams error (note that calling .on on a stream returns the stream, so you can chain calls). Otherwise, once the file has been written, we can push the classroom id and then end the stream.

Finally, since we're mapping to a stream of streams, we can use merge to combine them into a single stream of the classroom ids we've successfully generated collages for. We don't need to use mergeWithLimit here, since writing to disk is a pretty cheap operation.

.errors((err) => console.error("Error generating collage", err))
.each((classroomId) => console.log("Generated collage for", classroomId))
.done(() => console.log("All done!"))

The last step is to start consuming the stream. Highland streams are lazy, which means that they won't start pulling data from the source (in this case, the lines of our CSV) until you call one of a set of consumption functions. This allows you to define a pipeline of arbitrary length and complexity before any data starts passing through it.

In this case, we used each, which will call the provided function on each item. We also added a done handler, which is called after all items have been processed. Finally, we want some visibility into any errors. By attaching an errors handler, we can log these errors instead of having them crash the program.
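As a final, hypothetical illustration of this laziness:

const highland = require("highland");

const pipeline = highland(["a", "b"]).map((s) => {
  console.log("transforming", s); // nothing prints yet...
  return s.toUpperCase();
});

// ...until a consumption method like each starts pulling data:
pipeline.each((s) => console.log("consumed", s));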

Wrapping up

With the script now complete, we ran it on the CSV generated by Redshift. Finally, we synced the output directory to a bucket on AWS's S3 and created notifications to surface the collages to teachers, who were prompted to share them with their class's parents.

If you received or sent a collage this year, we hope you enjoyed it! And if you're a developer, please check out photo-collage on npm!
