Posts By: Nick Bottomley

Simulating network problems with docker network

TL;DR: see how your code responds to not being able to reach a dependency

docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>

We recently ran into a production issue where a few of our main API containers were briefly unable to reach our RabbitMQ instance, and this surfaced a latent problem with our reconnection logic. The messages that go into Rabbit are exclusively side effects — things like a push notification when a user sends another user a message. If something briefly interrupts Rabbit connectivity, the API server should still be able to mostly function normally (unlike if it were unable to reach our primary AWS RDS cluster). The API-side Rabbit code was designed to reconnect during a network interruption, buffering messages in memory in the meantime. Our logs showed this process wasn't working as intended — the buffer filled up and we never reconnected to Rabbit.
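The buffering behavior described above can be sketched roughly as follows. This is a minimal illustration, not our actual API code: the class name `BufferedPublisher`, the `send` callback, and the `maxBuffered` limit are all hypothetical, and nothing here uses node-amqp's real API.

```javascript
// Hypothetical sketch of "buffer in memory while disconnected, flush on reconnect".
// `send` stands in for whatever function actually publishes a message to Rabbit.
class BufferedPublisher {
  constructor(send, maxBuffered = 1000) {
    this.send = send;
    this.maxBuffered = maxBuffered; // assumed fixed-size buffer limit
    this.connected = true;
    this.buffer = [];
    this.dropped = 0;
  }

  // Publish immediately when connected; otherwise queue the message.
  // Returns false when the buffer is full and the message was dropped.
  publish(message) {
    if (this.connected) {
      this.send(message);
      return true;
    }
    if (this.buffer.length >= this.maxBuffered) {
      this.dropped += 1;
      return false;
    }
    this.buffer.push(message);
    return true;
  }

  onDisconnect() {
    this.connected = false;
  }

  // Mark the connection as healthy again and flush queued messages in order.
  onReconnect() {
    this.connected = true;
    while (this.buffer.length > 0) {
      this.send(this.buffer.shift());
    }
  }
}
```

The failure mode we hit corresponds to `onReconnect` never being called: the buffer fills, every later `publish` returns false, and nothing is ever flushed.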

After a couple of quick fixes didn't resolve the issue, we realized we didn't have a clear picture of what the underlying library, node-amqp, was doing when it encountered a connection issue. What our investigation found is related to node-amqp specifically, and the Node event emitter model more generally [1], but the docker network commands we used should be useful for any dockerized service.

We were working with two different objects, a node-amqp Connection and an associated Channel. When we tried docker killing the Rabbit container, the API-side Rabbit client closed gracefully, which did not match what we saw in our logs. We needed to understand the events emitted by these two objects during an unexpected network problem.

You can simulate a network partition in any number of ways; after a quick Google search we came across the docker network suite of commands. We took a stab at docker network disconnect and immediately saw the same behavior we saw in production.

docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>

Our specific issue ended up being that the close event on the AMQP connection had an error payload when the connection was not closed cleanly, and no payload when it was. The fix was pretty easy, and we determined it would work as intended by doing a quick docker network connect and watching the debug logging reconnect and flush its buffer of jobs to the Rabbit server.
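To make the payload distinction concrete, here is a small sketch of a handler that branches on whether the close event carries an error. It uses a plain Node `EventEmitter` as a stand-in for the connection object; the `watchClose` helper and its callback names are hypothetical, and this is not necessarily the shape of our actual fix.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: treat a close event with an error payload as an
// unclean close, and a close event with no payload as a deliberate shutdown.
function watchClose(connection, { onCleanClose, onUncleanClose }) {
  connection.on('close', (err) => {
    if (err) {
      onUncleanClose(err); // e.g. start buffering and schedule a reconnect
    } else {
      onCleanClose(); // e.g. nothing to do, we closed it ourselves
    }
  });
}

// A bare EventEmitter stands in for the real connection (an assumption).
const fakeConnection = new EventEmitter();
const seen = [];
watchClose(fakeConnection, {
  onCleanClose: () => seen.push('clean'),
  onUncleanClose: (err) => seen.push(`unclean: ${err.message}`),
});

fakeConnection.emit('close'); // clean close: no payload
fakeConnection.emit('close', new Error('socket hang up')); // unclean close
console.log(seen);
```

A handler that ignores the argument entirely cannot tell these two cases apart, which is the kind of gap that docker kill alone did not expose.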

We don't yet have an automated test verifying that the reconnection logic works but we plan to soon. This is what's most exciting to me about docker network — automated testing of service behavior in the case of network issues, all inside docker. We want our main API service to respond differently when specific services are unavailable. If a critical dependency like our main AWS RDS cluster is unreachable, we need to shut down, and that's pretty easy to test. Testing nuanced behavior with subcritical dependencies like reconnecting a fixed number of times (or until a fixed-size buffer is full) is trickier, but docker network provides an easy way to do just that!
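As a rough idea of what such a test helper could look like, the sketch below wraps a test body in a disconnect/connect pair. The helper name `withNetworkPartition` and the injectable `run` parameter are hypothetical; the default runner simply shells out to the docker CLI using the two commands from this post.

```javascript
const { execFileSync } = require('node:child_process');

// Default runner: invoke the docker CLI with the given argument list.
function dockerRun(args) {
  return execFileSync('docker', args, { encoding: 'utf8' });
}

// Hypothetical helper: detach `container` from `network`, run `body`,
// then always reattach the container, even if `body` throws.
function withNetworkPartition(network, container, body, run = dockerRun) {
  run(['network', 'disconnect', network, container]);
  try {
    return body();
  } finally {
    run(['network', 'connect', network, container]);
  }
}
```

Passing a fake `run` function makes the helper itself testable without a Docker daemon. A real reconnection test would likely need an asynchronous variant that waits for the client to notice the outage before asserting anything.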

  1. We have a small collection of Go services reading from and writing to RabbitMQ. The error-handling model there is more natural: there's a channel of jobs coming from the guts of the service that need to be sent to the Rabbit server, along with a channel of errors produced by Rabbit. Since jobs and errors both come from the same kind of source, a Go channel, dealing with jobs and errors is as simple as selecting on the two channels.