ClassDojo runs a fleet of services that work together to serve user requests and handle other data-processing tasks. Most of these services auto-scale so that we make better use of our resources. Keeping this working involves synchronizing a lot of metadata, such as tags and service ports, so that our configurations stay up to date. We started thinking about how we could better distribute runtime changes to our infrastructure-wide config and service discovery as servers join and leave the fleet.
We currently use HAProxy to manage traffic to our backend services. It lets us distribute the workload across our servers and route requests meant for specific services to the right clusters. To make sure HAProxy has an up-to-date view of which service clusters are available to receive traffic, we run a Jenkins job every minute that polls AWS by security group or autoscaling group name and regenerates the list.
We have a handful of services in our backend, but our largest and most complex is our API backend, so we are very dependent on its availability. We have been caught out a few times when, for one reason or another, the Jenkins job that generates the configuration file for HAProxy has failed or gotten stuck.
This has been particularly problematic during periods of frequent load fluctuation. As load patterns change, we autoscale to add or remove servers from this cluster. However, because this is a pull rather than a push model for updating service membership, any lag slows down how quickly we can add capacity to our load balancer; in the worst case, we can be delayed by up to a minute. On the flip side, old servers can hang around in our load balancer config for up to a minute as well. This lag also puts more pressure on our auto-scaler, because we cannot relieve CPU pressure on our existing servers fast enough. Alternatively, the auto-scaler may end up over-provisioning in the next scale-up window because requests are still queued up.
After our recent encounters with these issues, we decided to investigate ways we might improve our current system, given its lack of reliability.
Since we were already using HashiCorp's Terraform for our infrastructure, we decided to look into their service discovery product, Consul. The main features we were interested in were service discovery, health checks and failure detection, and the key-value store.
An Intro to Consul
Consul has two kinds of components, servers and clients, which together form a single Consul cluster. Each client node in the cluster runs a Consul agent whose purpose is to monitor the health of the services running on that node as well as the node itself. These agents communicate with one or more Consul servers.
The difference between servers and clients is that servers store and replicate data. More specifically, they are responsible for maintaining the state of the clients in the cluster, ensuring that membership is accurately tracked via health monitoring, and managing the key-value store. Consul achieves this using the Raft consensus protocol. The Secret Lives of Data provides a great interactive explanation of Raft, here. It isn't necessary for understanding the basics of Consul, but it is nonetheless interesting.
The servers must successfully elect a leader to provide a fault-tolerant cluster. To avoid split-brain scenarios, where an equal number of servers end up on each side of a network partition, or situations where too few servers are left to form a quorum, we decided to bootstrap our cluster with 5 server nodes, as recommended by Consul. This table from the Consul documentation shows the trade-off between the number of servers in the cluster and the fault tolerance.
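For quick reference, a condensed version of that trade-off looks like this, where the quorum is a simple majority of the servers:

Servers   Quorum   Failure tolerance
1         1        0
3         2        1
5         3        2
7         4        3

With 5 servers, the cluster can keep operating after losing any 2 of them.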
Members of the cluster are able to automatically discover each other as long as they know the address of at least one other member. This auto-discovery feature means that addresses don’t need to be hardcoded. We leverage this Cloud Auto-Joining mechanism to bootstrap the cluster. It’s quite important to understand how cluster bootstrapping works in order to run a successful production cluster. Fortunately, Consul provides really useful bootstrapping documentation, here.
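As a concrete illustration, cloud auto-join on AWS is driven by the retry_join setting in the agent configuration; the tag key and value below are placeholders rather than our actual tags:

{
  "retry_join": [
    "provider=aws tag_key=consul-cluster tag_value=classdojo-prod"
  ]
}

With this in place, a freshly launched node discovers the other cluster members through the AWS API instead of relying on hardcoded addresses.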
Consul agents run health checks on the nodes hosting service instances, which are themselves running as Consul clients. This lets us define service checks that run right next to the services, with the results exposed via the Consul API. Read more about Service Definitions, here.
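For example, a service definition with an HTTP health check looks roughly like the following; the port and endpoint here are illustrative rather than our real values:

{
  "service": {
    "name": "api",
    "port": 8080,
    "check": {
      "name": "api HTTP health check",
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}

The agent runs this check locally and reports the result to the cluster, so an unhealthy instance is automatically excluded from DNS and API query results.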
Our Approach
Initially, we experimented with the technology to decide whether or not we wanted to go ahead with a full rollout of Consul across our infrastructure.
This mostly involved playing with bootstrapping the Consul servers and understanding how leader elections worked, and, more importantly, how to recover if and when they failed.
Once we transitioned to running this in production, we needed to ensure that the cluster was fault-tolerant; otherwise, we ran the risk of a full-service outage. Hopefully, our configuration of 2n + 1 Consul servers, n being the fault tolerance level, helps us avoid reaching a state where our entire cluster goes down. Again, the Consul docs provide a good outline of how to handle an outage, here.
Cluster Configuration
Consul Servers
We configured the Consul servers separately using Terraform. The snippet below shows the basics of the configuration file used for bootstrapping the servers.
{
  "ui": true,
  "leave_on_terminate": true,
  "bootstrap_expect": 5,
  "server": true
}
Here are some important tips to consider when configuring your servers:
bootstrap vs bootstrap_expect
Either set bootstrap_expect to your desired number of servers, as we did above; this value must be consistent across all the servers in the cluster. Alternatively, set bootstrap to true, but on only one server. One of these must be specified in order for a successful leader election to occur.
leave_on_terminate
When nodes are terminated, they remain on the list of servers returned when consul members is called, preventing a successful election. It is therefore important to set leave_on_terminate to true in the configuration. This also applies to Consul clients.
consul members vs consul operator raft list-peers
When debugging an election failure, keep in mind that consul members and consul operator raft list-peers do not necessarily return the same results. The consul members command simply outputs the current list of members that the Consul agents are aware of, along with their state, whereas the latter retrieves the list of peers in the Raft subsystem used for leader elections. Both failed members and invalid peers can contribute to a failure to elect a leader and, once identified, should be removed.
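As a rough sketch of what that debugging session can look like from one of the servers (the node name and address below are placeholders):

# Compare the gossip view of the cluster with the Raft peer set
consul members
consul operator raft list-peers

# Remove a node stuck in the "failed" state from the member list
consul force-leave ip-10-0-1-23

# Remove a stale peer from the Raft configuration by its address
consul operator raft remove-peer -address="10.0.1.23:8300"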
Consul Clients
We use Ansible to configure our AMIs, and for Consul we decided to include the Consul agent as part of our base AMI. Here is a basic version of the configuration file we use for our clients.
{
  "server": false,
  "ui": false,
  "leave_on_terminate": true,
  "service": {
    "name": "api",
    "tags": [
      "api",
      "prod"
    ]
  }
}
To integrate this with our existing infrastructure, we simply had to change how we populate the services seen by HAProxy. HAProxy can be set up to query the Consul DNS interface, and since its DNS resolver can be tuned to re-query every few milliseconds, this cuts the lag in HAProxy config updates from minutes down to milliseconds. We used this in conjunction with Consul Template to replace our old approach of querying by AWS autoscaling or security group; instead, we can now query by service.
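As a minimal sketch of the DNS side of this setup (the backend name, server count, and address are illustrative, and assume a local Consul agent answering DNS on its default port, 8600):

resolvers consul
    nameserver consul 127.0.0.1:8600
    accepted_payload_size 8192
    hold valid 5s

backend api
    balance roundrobin
    # Provision up to 10 server slots, filled from the Consul SRV record for the api service
    server-template api 10 _api._tcp.service.consul resolvers consul resolve-opts allow-dup-ip resolve-prefer ipv4 check

Consul Template can then render the remaining parts of the HAProxy configuration from the Consul catalog whenever service membership changes.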
At first, we only sent a filtered selection of requests from our existing API cluster to the new Consul-managed cluster, but we have since expanded it to receive all API traffic and have been running it for a few months. It has definitely improved our infrastructure and made it easier for us to transition to a more service-oriented architecture. In fact, we have already added a few other services to Consul.
Overall, Consul has been a pretty solid addition to our infrastructure. We are exploring more ways to improve how we distribute configuration at runtime, as well as integrating with Nomad.
If you have cool ideas about working with distributed infrastructure or questions about setting up your own Consul cluster in production, we’d be happy to hear from you!
Reach out to us at engineering@classdojo.com 😊