TL;DR: see how your code responds when it can't reach a dependency:
```
docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>
```
We recently ran into a production issue where a few of our main API containers were briefly unable to reach our RabbitMQ instance, and this surfaced a latent problem with our reconnection logic. The messages that go into Rabbit are exclusively side effects — things like a push notification when a user sends another user a message. If something briefly interrupts Rabbit connectivity, the API server should still be able to mostly function normally (unlike if it were unable to reach our primary AWS RDS cluster). The API-side Rabbit code was designed to reconnect during a network interruption, buffering messages in memory in the meantime. Our logs showed this process wasn't working as intended — the buffer filled up and we never reconnected to Rabbit.
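To make that design concrete, here's a minimal sketch of the intended buffering behavior. It isn't our production code; `publishToRabbit` and `MAX_BUFFER` are hypothetical stand-ins for our publish wrapper and buffer cap.

```js
// Minimal sketch of the intended behavior: while the Rabbit connection is
// down, jobs accumulate in an in-memory buffer instead of being published.
// `publishToRabbit` and `MAX_BUFFER` are hypothetical stand-ins.
var buffer = [];
var connected = false;

function enqueue(job) {
  if (connected) {
    publishToRabbit(job);
  } else if (buffer.length < MAX_BUFFER) {
    buffer.push(job);
  } else {
    // This is the state we ended up in: the buffer filled up and we never
    // reconnected to drain it.
    console.log('rabbit buffer full');
  }
}
```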
After a couple of quick fixes didn't resolve the issue, we realized we didn't have a clear picture of what the underlying library, `node-amqp`, was doing when it hit a connection problem. What we found relates to `node-amqp` specifically, and to the Node event-emitter model more generally[^1], but the `docker network` commands we used should be useful for any Dockerized service.
We were working with two different objects: a `node-amqp` `Connection` and an associated `Channel`. When we tried `docker kill`ing the Rabbit container, the API-side Rabbit client closed gracefully, which didn't match what our logs showed in production. We needed to understand which events these two objects emit during an unexpected network problem.
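One way to do that is to hang logging off every lifecycle event the connection emits and watch what fires during a simulated partition. This is a minimal sketch along those lines, assuming `node-amqp`'s documented `createConnection` options; `RABBIT_HOST` is a placeholder, and the shape of the `close` payload follows what we describe below.

```js
// Minimal sketch: log the node-amqp Connection's lifecycle events so we can
// watch what actually fires during a simulated partition.
var amqp = require('amqp');

var connection = amqp.createConnection({ host: process.env.RABBIT_HOST || 'localhost' });

connection.on('ready', function () {
  console.log('[amqp] ready');
});

connection.on('error', function (err) {
  console.log('[amqp] error:', err.message);
});

connection.on('close', function (err) {
  // Whether `err` is present turns out to be the interesting part (see below).
  console.log('[amqp] close', err ? '(with error: ' + err.message + ')' : '(clean)');
});
```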
You can simulate a network partition in any number of ways; after a quick Google search we came across the `docker network` suite of commands. We took a stab at `docker network disconnect` and immediately saw the same behavior we had seen in production.
```
docker network disconnect <network-id / network-name> <container-id / container-name>
docker network connect <network-id / network-name> <container-id / container-name>
```
Our specific issue turned out to be that the `close` event on the AMQP connection carries an error payload when the connection was not closed cleanly, and no payload when it was. The fix was pretty easy, and we confirmed it worked as intended by running a quick `docker network connect` and watching the debug logging as the client reconnected and flushed its buffer of jobs to the Rabbit server.
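In code, the shape of the fix looks roughly like the following. This is a sketch rather than our exact implementation; `reconnectWithBackoff` and `flushBuffer` are hypothetical helpers standing in for our reconnection and buffer-draining logic.

```js
// Sketch of the fix: a `close` that carries an error payload means the
// connection dropped unexpectedly, so reconnect; a payload-free `close`
// means we shut the connection down deliberately.
connection.on('close', function (err) {
  if (err) {
    reconnectWithBackoff();   // hypothetical helper: retry with backoff
  }
  // No payload: clean shutdown, nothing to do.
});

connection.on('ready', function () {
  flushBuffer();              // hypothetical helper: drain the in-memory buffer
});
```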
We don't yet have an automated test verifying that the reconnection logic works, but we plan to add one soon. This is what's most exciting to me about `docker network`: automated testing of service behavior in the face of network issues, all inside Docker. We want our main API service to respond differently when specific services are unavailable. If a critical dependency like our main AWS RDS cluster is unreachable, we need to shut down, and that's pretty easy to test. Testing more nuanced behavior around subcritical dependencies, like reconnecting a fixed number of times (or until a fixed-size buffer is full), is trickier, but `docker network` provides an easy way to do just that!
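As a sketch of what such a test might look like, the test runner only needs to shell out to the docker CLI. The network and container names below are placeholders for your own setup, and the actual assertion is left to whatever test framework you use.

```js
// Sketch of an automated partition test: disconnect the RabbitMQ container
// from the shared docker network, wait, reconnect, then assert on recovery.
// NETWORK and CONTAINER are placeholder names.
var execSync = require('child_process').execSync;

var NETWORK = 'app_default';
var CONTAINER = 'rabbitmq';

function partition(seconds, done) {
  execSync('docker network disconnect ' + NETWORK + ' ' + CONTAINER);
  setTimeout(function () {
    execSync('docker network connect ' + NETWORK + ' ' + CONTAINER);
    done();
  }, seconds * 1000);
}

// Example: cut Rabbit off for 30 seconds, then check that the API
// reconnected and flushed its buffer (e.g. by polling a health endpoint).
partition(30, function () {
  // assertions go here
});
```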
[^1]: We have a small collection of Go services reading from and writing to RabbitMQ. The error-handling model there is more natural: there's a channel of jobs coming from the guts of the service that need to be sent to the Rabbit server, along with a channel of errors produced by Rabbit. Since jobs and errors both come from the same kind of source, a Go channel, handling them is as simple as `select`ing on the two channels.