Fs-Tail

Fs-Tail is a simple library we wrote that provides tail -f-like functionality using Node’s v0.10+ streams and fs modules. Install via npm:

npm install fs-tail

As ClassDojo has grown, we’ve spent a lot of time improving our analytics infrastructure. An important part of those efforts has been building a pipeline that can quickly transfer hundreds of gigabytes of data from our primary Mongo cluster to Redshift.

One component of this pipeline uses a temporary file to share information between processes. We wanted to read that file as it was being written, rather than waiting for all of the data to come through. Most OSes already come with a nice utility to do this called tail, which outputs the last part of a file. For instance:

tail -n 100 /tmp/some_file.txt

outputs the last 100 lines of some_file.txt. One really helpful tail option is -f, which keeps tail running so that it watches the file and writes any newly appended data to stdout. We didn’t want to duplicate built-in Unix functionality if we didn’t have to, so our first attempt looked something like this:

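var spawn = require("child_process").spawn;

// Let the system's tail follow the file for us.
var tail = spawn("tail", ["-f", "someFile.txt"]);

// anotherStream stands for any writable stream that consumes the data.
tail.stdout
  .pipe(anotherStream);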

At first glance this looks like it would perform well. We're not duplicating any functionality, and almost everything is deferred to Node and the operating system. Unfortunately, we started to uncover some issues as we ran this system in production.

  • Node's child_process.spawn is known to leave around zombie processes. We saw this happening on our production server when we discovered the current user exceeded its ulimit. A quick ps -A | grep tail confirmed all the orphaned tail processes. While it's not difficult to correctly shut down spawned processes in Node, there are some scenarios under which spawn makes it impossible to prevent orphaning child processes.

  • Listening for large amounts of data from a child process's stdout has some corner cases.
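As a rough illustration of the first point, handling the common case of shutting down a spawned process might look like the sketch below. The helper name spawnWithCleanup is made up for illustration. It only covers a normal parent exit or an explicit stop() call; if the parent itself is killed with SIGKILL, no handler runs and the child can still be orphaned.

```javascript
const { spawn } = require("child_process");

// Hypothetical helper: spawn a command and make sure it is signalled
// when the parent exits normally or when the caller is done with it.
function spawnWithCleanup(command, args) {
  const child = spawn(command, args, { stdio: ["ignore", "pipe", "inherit"] });

  function stop() {
    // exitCode and signalCode stay null while the child is still running.
    if (child.exitCode === null && child.signalCode === null) {
      child.kill("SIGTERM");
    }
  }

  // "exit" handlers must be synchronous; child.kill() is.
  process.once("exit", stop);
  // Once the child is gone there is nothing left to clean up.
  child.once("exit", () => process.removeListener("exit", stop));

  return { child, stop };
}

// Usage (not executed here):
//   const tail = spawnWithCleanup("tail", ["-f", "someFile.txt"]);
//   tail.child.stdout.pipe(anotherStream);
//   ...later: tail.stop();
```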

To prevent situations like the ones described above, we decided that it would be better to avoid the added complexity of non-Node external processes and write our own implementation in Node. This turned out not to be that difficult: conceptually, tail -f is very simple, and fs gives us all the tools we need to implement it ourselves.
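To make "conceptually simple" concrete, here is a minimal sketch of the core idea using only fs: remember how far you have read, and whenever the file has grown past that point, read the new bytes. This is not Fs-Tail's actual implementation; readAppended and follow are hypothetical names, the polling interval is an arbitrary choice, and the sketch ignores truncation and file rotation.

```javascript
const fs = require("fs");

const MAX_CHUNK = 64 * 1024; // cap each read so large files are not loaded at once

// Hypothetical helper: return the bytes appended to `fd` after `position`,
// together with the new position to resume from.
function readAppended(fd, position) {
  const size = fs.fstatSync(fd).size;
  if (size <= position) {
    return { data: Buffer.alloc(0), position: position };
  }
  const buffer = Buffer.alloc(Math.min(size - position, MAX_CHUNK));
  const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, position);
  return { data: buffer.subarray(0, bytesRead), position: position + bytesRead };
}

// Hypothetical polling loop: call onData with every new chunk until stopped.
// Note this pushes data at the callback with no backpressure.
function follow(file, onData, intervalMs = 250) {
  const fd = fs.openSync(file, "r");
  let position = 0;
  const timer = setInterval(() => {
    let result = readAppended(fd, position);
    while (result.data.length > 0) {
      position = result.position;
      onData(result.data);
      result = readAppended(fd, position);
    }
  }, intervalMs);
  return function stop() {
    clearInterval(timer);
    fs.closeSync(fd);
  };
}
```

A caller would use it as `const stop = follow("./someFile.txt", chunk => process.stdout.write(chunk));` and call `stop()` when finished.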

To our surprise, all the current tail packages on npm were either untested or used the old push-based (pre-v0.10) streams API. Since the files we were processing could be several gigabytes in size, and downstream processing could be slow, it was crucial for us to use pull streams instead, so that data is read only as fast as the consumer can handle it rather than piling up in memory.
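The difference is easiest to see in code. The sketch below is a pull-style Readable that touches the file only when the stream machinery asks for more data through _read. It is an illustration of the technique rather than Fs-Tail's source: the class name OnDemandFileReader is hypothetical, and unlike a real tail it ends the stream at end of file instead of waiting for more data.

```javascript
const fs = require("fs");
const { Readable } = require("stream");

// Hypothetical pull-style reader: the file is read only when a consumer
// (or the stream's internal buffer) asks for more via _read().
class OnDemandFileReader extends Readable {
  constructor(file, options) {
    super(options);
    this.fd = fs.openSync(file, "r");
    this.position = 0;
  }

  _read(size) {
    if (this.fd === null) {
      return; // already closed
    }
    const buffer = Buffer.alloc(size);
    const bytesRead = fs.readSync(this.fd, buffer, 0, size, this.position);
    if (bytesRead === 0) {
      // End of file: a real tail would wait for new data here instead.
      fs.closeSync(this.fd);
      this.fd = null;
      this.push(null);
      return;
    }
    this.position += bytesRead;
    this.push(buffer.subarray(0, bytesRead));
  }

  _destroy(err, callback) {
    // Release the descriptor if the stream is torn down early.
    if (this.fd !== null) {
      fs.closeSync(this.fd);
      this.fd = null;
    }
    callback(err);
  }
}
```

When such a stream is piped into a slow writable, pipe() stops calling for data once the internal buffer reaches highWaterMark, so a multi-gigabyte file is never buffered faster than it is consumed.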

Fs-Tail is pretty simple to use:

var FsTail = require("fs-tail");

var tail = FsTail("./someFile.txt");
tail.on("EOF", function() {
  console.log("Reached end of file");
});
tail.pipe(anotherStream);

If you’re dealing with files and streams in Node.js, be sure to check out Fs-Tail on GitHub.
