Orchestration with MCollective, Part II

In my last article, I introduced how MCollective could be used for general orchestration tasks. Configuration management like Puppet and Chef can help you bootstrap a server from scratch and push new versions of configuration files, but normally, configuration management scripts run at particular times in no particular order. Orchestration comes in when you need to perform some kind of task, specifically something like a software upgrade, in a certain order and stop the upgrade if there's some kind of problem. With orchestration software like MCollective, Ansible or even an SSH for loop, you can launch commands from a central location and have them run on specific sets of servers.

Although I favor MCollective because of its improved security model compared to the alternatives and its integration with Puppet, everything I discuss here should be something you can adapt to any decent orchestration tool.

So in this article, I expand on the previous one on MCollective and describe how you can use it to stage all of the commands you'd normally run by hand to deploy an internal software update to an application server.

I ended part one on MCollective with describing how you could use it to push an OpenSSL update to your environment and then restart nginx:

mco package openssl update
mco service nginx restart

In this example, I ran the commands against every server in my environment; however, you'd probably want to use some kind of MCollective filter to restart nginx on only part of your infrastructure at a time. In my case, I've created a custom Puppet fact called hagroup and divided my servers into three different groups labeled a, b and c, split along fault-tolerance lines. With that custom fact in place, I can restart nginx on only one group of servers at a time:

mco service nginx restart -W hagroup=c

This approach is very useful for deploying OpenSSL updates, but hopefully those occur only a few times a year if you are lucky. What you more likely will run into as a common task ideal for orchestration is deploying your own in-house software to application servers. Although everyone does this in a slightly different way, the following pattern is pretty common. This pattern is based on the assumption that you have a redundant, fault-tolerant application and can take any individual server offline for software updates. This means you use some kind of load balancer that checks the health of your application servers and moves unhealthy servers out of rotation. In this kind of environment, a simple, serial approach to updates might look something like this:

  • Get a list of all of the servers running the application.

  • Start with the first server on the list.

  • Set a short maintenance window for the server in your monitoring system.

  • Tell your load balancers to drain any existing sessions to this server.

  • Update the list of available packages for the server.

  • Stop the service on that server.

  • Update the software on that server.

  • Start the service on that server.

  • Make sure the service started successfully.

  • Perform a health check to make sure the service is healthy.

  • Add the server back to the load balancer rotation.

  • Repeat for the rest of the servers on the list.

If any of those steps fails, the administrator would stop the update and go investigate and fix the problem. Often if there is going to be a failure, it will be at the software-update or health-check phase, and the point of this process is to make sure that if an upgrade doesn't go well, you stop at the first server before pushing broken software to the rest of the environment.


Kyle Rankin is SVP of Security and Infrastructure at Zero, the author of many books including Linux Hardening in Hostile Networks, DevOps Troubleshooting and The Official Ubuntu Server Book, and a columnist for Linux Journal. Follow him @kylerankin