Getting just the right amount of concurrency in Ansible
July 10, 2017
This is something I’ve been working on for a while, and since I figure others might be struggling with it too, I thought I’d share:
Configuration management is this incredibly powerful thing where servers “take care of themselves”, or at least that’s the idea. Real life is often far messier. Automated configuration management also means that changes might occur on several of your servers at the same time, and that might not always be a good thing. We’re a (relatively) small shop without the luxury/burden of having our services spread out across thousands of servers; a lot of our stuff runs on as few as 2 VMs. Most of our stack is still on classic .NET, so for the moment we’re not able to run it in Kubernetes or anything like that either. That means we had to figure out a way to manage our VMs so that we never bring down 2 servers running the same thing at the same time, while still automating their config management as much as possible. “Bring down” in this context could be as little as changing a setting inside IIS, as some of those changes may cause the AppPool running our services to restart. Ultimately, we want to get to a point where our config management tool is free to perform reboots or whatever it needs to, all while the service itself is still up and running as far as customers are concerned.
As you may know, we’ve been running Ansible for more than a year now, and we’ve changed how we execute Ansible against our VMs a lot during the last year. This blog post explains some of the thinking we’ve done about this, and what we’ve come up with (for now).
Even though we’re a fairly small shop, we did something smart from the get-go: we organized our servers into a structure we call servicefamilies/servicegroups. These two attributes get tagged on a server wherever it is: in Datadog, Elasticsearch, AWS, and inside the VM using envvars. In short, we stick those attributes wherever we can. They are used for a host of different things, but in this context the important thing is that they indicate a relationship between servers: if two servers share the same family and group, they do the same job, which means that we typically shouldn’t incur downtime on more than one VM inside a family/group combination at the same time. As we grow we may adjust this to something like max 10% instead, but for now it’s fine since our numbers are small. So, if server1 is going through a reboot, server2 better be up. Simple enough.
It’s probably also worth mentioning that Ansible has some built-in smarts to take care of rolling updates, as it has the ability to perform serial execution on playbooks. The reason we’re not using those options is that I want to make it as easy as possible for devs to write playbooks — I simply don’t want them to have to concern themselves with the right amount of serial-ness inside playbooks they write.
We’ve been executing Ansible inside Docker for a little while already, and the plan all along was to end up in a place where we could use this to our advantage. A playbook is kicked off not by running ansible-playbook locally on a workstation or by ssh-ing into a server, but simply by sending a job request to a custom REST API we built. This job looks at the requested playbook and the family/group of the servers to run it against, and performs some smart grouping of them according to their family/group membership. So, a single “job request” is translated into a single “ansible run” for every server that gets “hit” by that playbook. This info gets spread into a set of SQS queues, where a number of messagehandlers (also running in containers) will pick them up one by one. This is essentially how we ensure serialness. So, if 4 servers get “hit” by a job, and they belong to two different family/groups, we will send jobs to 2 queues, each containing the jobs for the “grouped” servers.
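To make the grouping step concrete, here is a minimal Python sketch of how one job request could be expanded into per-server jobs bucketed by family/group. It is not our actual code: the function name, the dictionary keys, and the server names are all made up for illustration.

```python
from collections import defaultdict


def build_jobs(playbook, servers):
    """Return {(family, group): [job, ...]} with one job per server.

    Hypothetical data shapes: each server is a dict with "name",
    "family" and "group" keys; each job carries the playbook and the
    single host it should be limited to.
    """
    jobs_by_group = defaultdict(list)
    for server in servers:
        key = (server["family"], server["group"])
        jobs_by_group[key].append({"playbook": playbook, "limit": server["name"]})
    return dict(jobs_by_group)


# Four servers in two family/group combinations (made-up names).
servers = [
    {"name": "web1", "family": "web", "group": "frontend"},
    {"name": "api1", "family": "api", "group": "orders"},
    {"name": "web2", "family": "web", "group": "frontend"},
    {"name": "api2", "family": "api", "group": "orders"},
]

jobs = build_jobs("site.yml", servers)
for key, group_jobs in jobs.items():
    # Each bucket would go to one queue, so its servers run one at a time.
    print(key, [job["limit"] for job in group_jobs])
```

The point of keying on the (family, group) pair is that everything in one bucket ends up on the same queue, so two servers doing the same job are never worked on simultaneously.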
So, the magic here is that each “queue handler” runs in parallel with the other handlers, and each family/group’s servers get handled by a single handler at any given time. We can scale the number of queues up or down simply by adjusting a config file, and the “job controller” will make sure that an SQS queue and a messagehandler container exist for every requested queue. We also use the approximate number of messages attribute of each queue (ApproximateNumberOfMessages in SQS) to make sure we always send jobs to the least busy queue, ensuring that jobs get sent through the pipe as quickly as possible. Each job sent through the queue will cause the messagehandler to invoke an Ansible job targeting a single server (using the --limit parameter) and wait until it’s either done or failed. That’s how we ensure that each group gets executed in a serial fashion.
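Here is a rough Python sketch of those two pieces: picking the least busy queue and running one Ansible job per server, one at a time. It assumes a boto3-style SQS client is passed in (get_queue_attributes with ApproximateNumberOfMessages is a real SQS call); the function names and the job dictionary shape are invented for the example and are not our real handler.

```python
import subprocess


def queue_depth(sqs_client, queue_url):
    """Ask SQS roughly how many messages are waiting on a queue."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def least_busy_queue(sqs_client, queue_urls):
    """Pick the queue with the fewest waiting messages."""
    return min(queue_urls, key=lambda url: queue_depth(sqs_client, url))


def ansible_command(playbook, host):
    """Build an ansible-playbook call restricted to a single host."""
    return ["ansible-playbook", playbook, "--limit", host]


def run_jobs_serially(jobs, run=subprocess.run):
    """Run jobs one by one; each call blocks until Ansible exits.

    Returns a list of (host, succeeded) tuples. The `run` parameter is
    only there so the runner can be swapped out in tests.
    """
    results = []
    for job in jobs:
        completed = run(ansible_command(job["playbook"], job["limit"]))
        results.append((job["limit"], completed.returncode == 0))
    return results
```

Because subprocess.run does not return until the process has finished, a handler built this way cannot start on the next server in a group before the previous one is done or has failed.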
Each job also posts a truckload of status back to a DynamoDB table that keeps track of all jobs in flight, and the entire output log gets tagged with the job guid and uploaded to S3. This will allow us to do stuff like “if the first server failed executing its job, cancel all the other corresponding jobs for the same group/family”. There’s tons of fancy add-ons we can build around this as we progress.
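As an illustration of that “cancel the rest of the group” idea, here is a small Python sketch. A plain dictionary stands in for the DynamoDB table, and the field names (request_id, family_group, status) and status values are assumptions made for the example, not our actual schema.

```python
def cancel_pending_for_group(jobs, failed_job_id):
    """Mark a job as failed and cancel its still-queued siblings.

    `jobs` maps job id -> record and stands in for the job table.
    Only jobs from the same request and the same family/group that
    have not started yet are cancelled. Returns the cancelled ids.
    """
    failed = jobs[failed_job_id]
    failed["status"] = "failed"
    cancelled = []
    for job_id, job in jobs.items():
        same_request = job["request_id"] == failed["request_id"]
        same_group = job["family_group"] == failed["family_group"]
        if same_request and same_group and job["status"] == "queued":
            job["status"] = "cancelled"
            cancelled.append(job_id)
    return cancelled
```

Jobs for other family/groups in the same request are left alone here, since a failure on one kind of server says nothing about the others.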
It’s pretty cool to see a single request getting translated into a bunch of jobs, spread across all available queues and executed as quickly as possible. As we gain experience and confidence in this way of running Ansible, we can start performing reboots and other potentially “dangerous” actions in the middle of a playbook, knowing that this system will ensure that we’ll never bring down servers in such a way that it causes problems with the services we provide our customers.
Hi, I'm Trond Hindenes, SRE lead at RiksTV. Fan of Python, drones, cloud and snowboarding. I'm on twitter.