Simulating network problems with docker network

TL;DR: see how your code responds to not being able to reach a dependency

docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>

We recently ran into a production issue where a few of our main API containers were briefly unable to reach our RabbitMQ instance, and this surfaced a latent problem with our reconnection logic. The messages that go into Rabbit are exclusively side effects — things like a push notification when a user sends another user a message. If something briefly interrupts Rabbit connectivity, the API server should still be able to mostly function normally (unlike if it were unable to reach our primary AWS RDS cluster). The API-side Rabbit code was designed to reconnect during a network interruption, buffering messages in memory in the meantime. Our logs showed this process wasn't working as intended — the buffer filled up and we never reconnected to Rabbit.
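The intended behavior is easier to see in code. Here's a minimal sketch of that buffer-and-flush publishing path; the class and names are made up for illustration (not our production client), and sendFn stands in for the real AMQP publish call.

// Minimal sketch of the intended buffer-and-reconnect behavior; illustrative only.
class BufferedPublisher {
  constructor(sendFn, maxBuffered = 1000) {
    this.sendFn = sendFn;        // stands in for the real AMQP publish call
    this.maxBuffered = maxBuffered;
    this.buffer = [];
    this.connected = false;
  }

  publish(job) {
    if (this.connected) {
      this.sendFn(job);                          // normal path: straight to Rabbit
    } else if (this.buffer.length < this.maxBuffered) {
      this.buffer.push(job);                     // hold side-effect jobs while Rabbit is away
    } else {
      console.warn('buffer full, dropping job'); // side effects are lossy past this point
    }
  }

  onConnectionLost() {
    this.connected = false;                      // reconnect attempts happen elsewhere
  }

  onReconnected() {
    this.connected = true;
    while (this.buffer.length > 0) {
      this.sendFn(this.buffer.shift());          // flush everything we held onto
    }
  }
}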

After a couple of quick fixes didn't resolve the issue, we realized we didn't have a clear picture of what the underlying library, node-amqp, was doing when it hit a connection problem. What we found relates to node-amqp specifically and to the Node event emitter model more generally[^1], but the docker network commands we used should be useful for any dockerized service.

We were working with two different objects, a node-amqp Connection and an associated Channel. When we tried docker killing the Rabbit container, the API-side Rabbit client closed gracefully, which didn't match what our logging showed in production. We needed to understand the events these two objects emit during an unexpected network problem.
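One way to get that picture is to log every event the objects emit. Something like the following sketch works on any Node EventEmitter, including node-amqp's Connection; the broker URL below is a placeholder.

// Sketch: wrap emit() so every event the node-amqp Connection fires gets logged,
// which makes it easy to watch what happens during a network partition.
const amqp = require('amqp'); // node-amqp

function logAllEvents(name, emitter) {
  const originalEmit = emitter.emit.bind(emitter);
  emitter.emit = function (event, ...args) {
    console.log(`[${name}] event: ${event}`, args);
    return originalEmit(event, ...args);
  };
}

const connection = amqp.createConnection({ url: 'amqp://guest:guest@rabbitmq:5672' }); // placeholder URL
logAllEvents('connection', connection);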

You can simulate a network partition in any number of ways; after a quick Google search we came across the docker network suite of commands. We took a stab at docker network disconnect and immediately saw the same behavior we saw in production.

docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>
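Putting it together, a typical session looks something like this; the network and container names are examples from a docker-compose setup, not our real ones.

# Find the shared network (docker-compose usually creates one per project).
docker network ls
docker network inspect myapp_default

# Cut the API container off, watch how it behaves, then restore connectivity.
docker network disconnect myapp_default myapp_api_1
docker logs -f myapp_api_1
docker network connect myapp_default myapp_api_1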

Our specific issue ended up being that the close event on the AMQP connection had an error payload when the connection was not closed cleanly, and no payload when it was. The fix was pretty easy, and we confirmed it worked as intended by running a quick docker network connect and watching the debug logs as the client reconnected and flushed its buffer of jobs to the Rabbit server.
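Continuing the event-logging sketch above, the fix amounts to branching on that payload; scheduleReconnect is a hypothetical stand-in for the retry logic.

// Sketch of the distinction described above; not our actual handler.
connection.on('close', (err) => {
  if (err) {
    // Unclean close: the network went away underneath us, so start the
    // reconnect-and-buffer behavior. scheduleReconnect() is hypothetical.
    scheduleReconnect();
  } else {
    // Clean close: we shut the connection down ourselves, so nothing to do.
  }
});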

We don't yet have an automated test verifying that the reconnection logic works, but we plan to add one soon. That's what's most exciting to me about docker network: automated testing of how a service behaves during network issues, all inside docker. We want our main API service to respond differently when specific services are unavailable. If a critical dependency like our main AWS RDS cluster is unreachable, we need to shut down, and that's pretty easy to test. Testing nuanced behavior with subcritical dependencies, like reconnecting a fixed number of times (or until a fixed-size buffer is full), is trickier, but docker network provides an easy way to do just that!
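A first pass at that test could drive docker network straight from the test process. This is a rough sketch assuming a mocha-style runner; the network and container names, along with the publishTestJob, bufferedJobCount, and waitForFlush helpers, are all hypothetical.

// Rough sketch of an automated reconnection test; names and helpers are hypothetical.
const { execSync } = require('child_process');
const assert = require('assert');

const NETWORK = 'myapp_default';    // placeholder network name
const RABBIT = 'myapp_rabbitmq_1';  // placeholder RabbitMQ container name

describe('rabbit reconnection', function () {
  this.timeout(60000); // docker operations and reconnect backoff take a while

  it('buffers jobs while disconnected and flushes them after reconnecting', async function () {
    execSync(`docker network disconnect ${NETWORK} ${RABBIT}`);
    await publishTestJob();                          // hypothetical helper
    assert.strictEqual(await bufferedJobCount(), 1); // hypothetical helper

    execSync(`docker network connect ${NETWORK} ${RABBIT}`);
    await waitForFlush();                            // hypothetical helper
    assert.strictEqual(await bufferedJobCount(), 0);
  });
});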

[^1]: We have a small collection of Go services reading from and writing to RabbitMQ. The error-handling model there is more natural: there's a channel of jobs coming from the guts of the service that need to be sent to the Rabbit server, along with a channel of errors produced by Rabbit. Since jobs and errors both arrive on the same kind of source, a Go channel, handling them is as simple as selecting on the two channels.

One aspect of our workflow that people often find odd or unfamiliar is that we don't require an up-front code review or pull request with signoff from another engineer before pushing to production. This often comes up in interviews or conversations with engineers who are interested in jobs at ClassDojo, with reactions ranging from curiosity to aversion.

The reactions are understandable. Many organizations enforce blocking pull requests as a requirement, and many developers work in environments with fewer safety nets (test automation, linting, and other code quality tools) and less trust. Don't get me wrong: code reviews are a good tool for getting feedback, sharing context, and spreading useful patterns, and we do create pull requests and ask for input all the time. They're just not required.

Code reviews often happen too late for certain kinds of feedback, and both research and experience[1] show that mandatory external code reviews don't catch issues at a higher rate than other techniques and don't reduce production outages. Having pull requests and feature branches hanging around waiting for review can also lead to work going stale and falling out of date with the main branch, requiring additional effort that slows down the time to release. The only real, predictable outcome of forced code reviews is slower velocity.

In general, no single silver-bullet tool or process guarantees quality. You need to employ a combination of techniques to lower the number of defects and generate stable releases. Even in circumstances where bugs or bad code could be outright dangerous, resulting in the loss of life or irreversible harm, a code inspection process coupled with other programming techniques, such as defensive programming, could be a better fit than a blocking code review.

Code reviews have their place as a collaboration tool, but depending on what you intend to get out of collaboration, there are other, more effective techniques to reach for. Mandated blocking code reviews that are external to the team are often motivated by an underlying need for control, a lack of trust in systems or people, or an enforced form of collaboration.

Trust is important, and at ClassDojo we strive for a high-trust, low-control engineering environment. If you can't trust someone to make good changes without careful review, you can try several things to set them up for success. Options range from upskilling through training, mentoring, and pairing, to changing the way that you're working. For instance, you can promote safer changes by adding tests, instrumentation, and alerting. Ultimately, if you really don't feel you can trust a person to write good code, you might consider parting ways with them.

What do we do?

At ClassDojo we're strong advocates of pair and mob programming as collaboration practices. These both help to drive attention and energy toward problem solving, and provide live, in-context feedback on code quality and technical decisions. Fast feedback is one of the most important contributors to velocity, and getting that feedback at the time of writing the code, when you have the mental model in your head, is by far preferable to feedback at a later date.

Careful scope management helps make this kind of collaboration work for us. As a team picks up a project or idea, we investigate what the smallest possible first implementation could be and start there. Smaller changes are easier to reason about as you work on them, and easier to roll back if necessary.

For larger changes or changes with a lot of uncertainty, we have upfront design conversations, sometimes in writing but often as simple as saying "I'm planning on sticking X in Y for Z. Does that sound crazy to anyone?" That can be a better fit for up-front feedback on direction, and is something we encourage engineers and teams to do informally and often.

Additionally, we're very invested in test automation and instrumentation to make going to production safe. We want to prioritize making it easy to fix or roll back changes over the idea that we can predict the future and never be wrong. We also encourage post-merge code reviews. After the code is live and hopefully contributing user and business value, we expect to iterate on the code. We have processes and systems that make it easy to do that, which rely on all of the practices we've talked about.

There are of course drawbacks to this approach. Services like GitHub lack good tools or notifications for post-merge code reviews, and it can be hard to remember to follow up and act on new suggestions or issues. This way of working might also not be a good fit if you don't already have a good level of monitoring, alerting, and test automation that enables you to move swiftly with confidence.

While this way of working might not appeal to everyone, we have found that it works well in our engineering culture. Engineers at ClassDojo tend to be very collaborative, often using pairing or mob programming to problem solve together, and also tend to be motivated by the craft of achieving a high level of quality in code and products. Our practices as a team are enabled by our team values, so we work hard at maintaining both.

If our development practices sound like an interesting way of working, and you're excited about working in a team that values trust and collaboration, have a look at our open roles and get in touch!

[1] McConnell, S. (2004). Relative Effectiveness of Quality Techniques (Section 20.3). In Code Complete (2nd ed., pp. 469–472). Microsoft Press.

One of the main engineering backbones here at ClassDojo is our CI/CD pipeline. By having a pipeline that automates tests, migrations, rolling out to production, and so much more, we allow the engineering team to focus on building great products. No managing a release train or babysitting a deployment!

In this episode, I chat with Dom and Gregg about what CI/CD means to us, how we do it, and what steps you can take to start your own journey to continuous deployment.

Listen to Episode 2, How and Why we use CI/CD, on the Engineering Dojo Podcast.