My Favorite Infrastructure
Take a tour through the best infrastructure I ever built with stops in architecture, disaster recovery, configuration management, orchestration and security.
Working at a startup has many pros and cons, but one of the main benefits over a traditional established company is that a startup often gives you an opportunity to build a completely new infrastructure from the ground up. When you work on a new project at an established company, you typically have to account for legacy systems and design choices that were made for you, often before you even got to the company. But at a startup, you often are presented with a truly blank slate: no pre-existing infrastructure and no existing design choices to factor in.
Brand-new, from-scratch infrastructure is a particularly appealing prospect if you are at the systems architect level. One of the distinctions between a senior-level systems administrator and a systems architect is that an architect has operated at a senior level long enough to have managed a number of different high-level projects personally and to have seen which approaches work and which don't. When you are at this level, it's very exciting to be able to build a brand-new infrastructure from scratch, applying all of the lessons you've learned from past efforts, without having to support any legacy infrastructure.
During the past decade, I've worked at a few different startups where I was asked to develop new infrastructure completely from scratch, but with high security, uptime and compliance requirements, so there was no pressure to cut corners for speed like you might normally face at a startup. I've not only gotten to experience the joy of designing new infrastructure, I've been able to do it multiple times. Each time, I've been able to bring along the past designs that worked, leave behind the bits that didn't, and update the tools to take advantage of new features. This series of infrastructure designs culminated in what I realize, looking back, is my favorite infrastructure—the gold standard against which I will judge all future attempts.
In this article, I dig into some of the details of my favorite infrastructure. I describe some of the constraints around the design and explore how each part of the infrastructure fits together, why I made the design decisions I did, and how it all worked. I'm not saying that what worked for me will work for you, but hopefully you can take some inspiration from my approach and adapt it for your needs.
Whenever you describe a solution you think works well, it's important to preface it with your design constraints. Often when people are looking for infrastructure cues, the first place they look is how "big tech companies" do it. The problem with that approach is that unless you also are a big tech company (and even if you are), your constraints are likely very different from theirs. What works for them with their budget, human resources and the problems they are trying to solve likely won't work for you, unless you are very much like them.
Also, the larger an organization gets, the more likely it is going to solve problems in-house instead of using off-the-shelf solutions. There is a certain stage in the growth of a tech company, when it has enough developers on staff, that when it has a new problem to solve, it likely will use its army of developers to create custom, proprietary tools just for itself instead of using something off the shelf—even if an off-the-shelf solution would get the company 90% there. This is a shame, because if all of these large tech companies put that effort into improving existing tools and sharing their changes, we would all spend less time reinventing wheels. If you've ever interviewed people who have spent a long time at a large tech company, you quickly realize that they are really well trained to administer that specific infrastructure, but without those custom tools, they may have a hard time working anywhere else.
Startup constraints also are very different from large-company constraints, so it equally might be a mistake to apply solutions that work for a small startup to a large-scale company. Startups typically have very small teams that need to build infrastructure very quickly. Mistakes that make their way to production often have a low impact on a startup, whose main concern is getting some kind of functioning product out the door to attract more investment before the money runs out. This means that startups are more likely not only to favor off-the-shelf solutions, but also to cut corners.
All that is to say, what worked for me under my constraints might not work for you under your constraints. So before I go into the details, you should understand the constraints I was working under.
Constraint 1: Seed Round Financial Startup
This infrastructure was built for a startup that was developing a web application in the financial space. We had limitations both on the amount of time we could spend building the infrastructure and on the size of the team available to build it; in many cases, a "team" was a single person. In previous iterations of building my ideal infrastructure, I had at least one other person, if not several, to help me build out the infrastructure, but here I was on my own.
The combination of a time constraint along with the fact that I was doing this alone meant I was much more likely to pick stable solutions that worked for me in the past using technologies I was deeply familiar with. In particular, I put heavy emphasis on automation so I could multiply my efforts. There is a kind of momentum you can build when you use configuration management and orchestration in the right way.
Constraint 2: Non-Sysadmin Emergency Escalation
I was largely on my own not just to build the infrastructure, but also when it came to managing emergencies. Normally I try to stick to a rule that limits production access to system administrators, but in this case, that would mean we would have no redundancy if I was unavailable. This constraint meant that if I was unavailable for whatever reason, alerts needed to escalate up to someone who primarily had a developer background with only some Linux server experience. Because of this, I had to make sure that it was relatively straightforward to respond to the most common types of emergencies.
Constraint 3: PCI Compliance
I love the combination of from-scratch infrastructure development you get to do in a startup with tight security constraints that prevent you from cutting corners. A lot of people in the security space look down a bit on PCI compliance, because so many companies think of it as a box to check and hire firms known for checking that box with minimal fuss. However, there are a lot of good practices within PCI-DSS if you treat them as a minimum security bar to manage honestly, instead of a maximum security bar to skirt by. We had a hard dependency on PCI compliance, so meeting and exceeding that policy had some of the greatest impact on the design.
Constraint 4: Custom Rails Web Applications
The development team had a strong background in Rails, so most of the in-house software development was for custom middleware applications based on a standard database-backed Rails application stack. A number of different approaches exist for packaging and distributing this kind of application, so this also factored into the design.
Constraint 5: Minimal Vendor Lock-in
It's somewhat common for venture-capital-backed startups to receive credits from cloud providers to help them get started. These credits not only help startups manage costs while they figure out their infrastructure, but from the provider's point of view, they have a side benefit: if a startup builds on cloud-specific features, it becomes much harder for it to move to a different provider down the road once it has larger cloud bills.
Our startup had credits with more than one cloud provider, so we wanted the option to switch to another provider in case we were cash-strapped when we ran out of credits. This meant our infrastructure had to be designed for portability and use as few cloud-specific features as possible. The cloud-specific features we did use needed to be abstracted away and easily identified, so we could port them to another provider more easily later.
PCI policy pays a lot of attention to systems that manage sensitive cardholder data. These systems are labeled as "in scope", which means they must comply with PCI-DSS standards. This scope extends to systems that interact with those sensitive systems, and there is a strong emphasis on compartmentalization—separating and isolating the in-scope systems from the rest, so you can put tight controls on their network access, including which administrators can access them and how.
Our architecture started with a strict separation between development and production environments. In a traditional data center, you might accomplish this by using separate physical network and server equipment (or using abstractions to virtualize the separation). In the case of cloud providers, one of the easiest, safest and most portable ways to do it is by using completely separate accounts for each environment. In this way, there's no risk that a misconfiguration would expose production to development, and it has a side benefit of making it easy to calculate how much each environment is costing you per month.
When it came to the actual server architecture, we divided servers into individual roles and gave them generic role-based names. We then took advantage of the Virtual Private Cloud feature in Amazon Web Services to isolate each of these roles into its own subnet, so we could isolate each type of server from others and tightly control access between them.
By default, Virtual Private Cloud servers are either in the DMZ and have public IP addresses, or they have only internal addresses. We opted to put as few servers as possible in the DMZ, so most servers in the environment only had a private IP address. We intentionally did not set up a gateway server that routed all of these servers' traffic to the internet—their isolation from the internet was a feature!
Of course, some internal servers did need some internet access. For those servers, it was only to talk to a small number of external web services. We set up a series of HTTP proxies in the DMZ that handled different use cases and had strict whitelists in place. That way we could restrict internet access from outside the host itself to just the sites it needed, while also not having to worry about collecting lists of IP blocks for a particular service (particularly challenging these days since everyone uses cloud servers).
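As a rough sketch of that kind of whitelisting proxy (we weren't tied to any particular proxy software; Squid is one common choice, and the network range and domains below are hypothetical):

```squid
# squid.conf (sketch): only internal hosts may connect, and only
# to an explicit whitelist of external service domains.
acl internal_nets src 10.0.0.0/16
acl allowed_sites dstdomain .example-payment-gateway.com .example-partner-api.com

# Both ACLs on one line means both must match (internal AND whitelisted).
http_access allow internal_nets allowed_sites
http_access deny all

http_port 3128
```

Internal hosts then point their HTTP client's proxy setting at the DMZ proxy, so egress policy lives in one place instead of in per-service IP block lists.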
Cloud services often are unreliable, but it was critical that our services could scale and survive an outage of any one particular server. We started with a minimum of three servers for every service, because fault-tolerance schemes designed around two nodes tend to fall into a traditional primary/failover architecture that doesn't scale well past two. A design that accounts for three servers can probably also accommodate four, six or more.
Cloud systems rely on virtualization to get the most out of bare metal, so any servers you use aren't real physical machines, but instead some kind of virtual machine running alongside others on physical hardware. This presents a problem for fault tolerance: what happens if all of your redundant virtual machines end up on the same physical machine, and that machine goes down?
To address this concern, some cloud vendors separate a particular site into multiple standalone data centers, each with its own hardware, power and network that are independent from the others. In the case of Amazon, these are called Availability Zones, and it's considered a best practice to spread your redundant servers across Availability Zones. We decided to use three Availability Zones and divided our redundant servers across them.
In our case, we wanted to spread out the servers consistently and automatically, so we divided our servers into threes based on the number at the end of their hostname. The software we used to spawn instances would look at the number in the hostname, apply a modulo three to it, and then use that to decide which Availability Zone a host would go to. Hosts like web1, web4 and web7 would be in one group; web2, web5 and web8 in another; and web3, web6 and web9 in a third.
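A minimal sketch of that spawning logic (the zone names are hypothetical):

```python
# Sketch: pick an Availability Zone from the trailing number in a
# hostname, so redundant hosts spread evenly across three zones.
import re

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical zone names

def zone_for(hostname):
    """Return the Availability Zone for a host based on hostname number."""
    match = re.search(r"(\d+)$", hostname)
    if not match:
        raise ValueError("hostname has no trailing number: %s" % hostname)
    return ZONES[int(match.group(1)) % 3]
```

With this scheme, web1, web4 and web7 always land together in one zone, web2, web5 and web8 in another, and so on.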
When you have multiple servers, you also need some way for machines to fail over to a different server if one goes down. Some cloud providers offer in-house load balancing, but because we needed portability, we didn't want to rely on any cloud-specific features. Although we could have added custom load-balancing logic to our applications, instead we went with a more generic approach using the lightweight and fast HAProxy service.
One approach to using HAProxy would be to set up a load-balancing server running HAProxy and have applications talk to it on various ports. This would behave a lot like some of the cloud-provided load-balancing services (or a load-balancing appliance in a traditional data center). Of course, if you use that approach, you have another problem: what happens when the load balancer fails? For true fault tolerance, you'd need to set up multiple load balancers and then configure the hosts with their own load-balancing logic so they could fail over to the redundant load balancer in the case of a fault, or otherwise rely on a traditional primary/secondary load-balancer failover with a floating IP that would be assigned to whichever load balancer was active.
This traditional approach didn't work for us, because we realized that there might be cases where one entire Availability Zone might be segregated from the rest of the network. We also didn't want to add additional failover logic to account for a load-balancer outage. Instead, we realized that because HAProxy was so lightweight (especially compared to the regular applications on the servers), we could just embed an HAProxy instance on every server that needed to talk to another service redundantly. That HAProxy instance would be aware of any downstream service that local server needed to talk to and present ports on the localhost that represented each downstream service.
Here's how this worked in practice: if webappA needed to talk to middlewareB, it would just connect to localhost port 8001. HAProxy would take care of health checks for downstream services, and if a service went down, it would automatically connect to another. In that circumstance, webappA might see that its connection dropped and would just need to reconnect. This meant that the only fault-tolerance logic our applications needed was the ability to detect when a connection dropped and retry.
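A minimal sketch of that application-side logic, assuming a local HAProxy listening on port 8001 (the port, timeout and retry counts are illustrative):

```python
# Sketch: the only fault-tolerance logic the application needs is
# "reconnect to localhost on failure"; HAProxy handles real failover.
import socket
import time

def connect_with_retry(host="127.0.0.1", port=8001, attempts=5, delay=1.0):
    """Connect to the local HAProxy port, retrying briefly on failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError as err:
            last_error = err
            time.sleep(delay)
    raise last_error
```

When a downstream service dies, HAProxy re-routes new connections to a healthy backend, so the app's dropped connection simply gets re-established by this loop.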
We also organized the HAProxy configuration so that each host favored talking to a host within its own Availability Zone. Hosts in other zones were designated as "backup" hosts in HAProxy, so it would use those hosts only if the primary host was down. This helped optimize network traffic as it stayed within the Availability Zone it started with under normal circumstances. It also made analyzing traffic flows through the network much easier, as we could assume that traffic that entered through frontend2 would be directed to middleware2, which would access database2. Since we made sure that traffic entering our network was distributed across our front-end servers, we could be assured that load was relatively evenly distributed, yet individual connections would tend to stick on the same set of servers throughout a particular request.
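A hedged sketch of one of those embedded HAProxy stanzas as it might have looked on web2 (the service name, addresses and ports are hypothetical):

```haproxy
# Sketch: local HAProxy stanza on web2 for the middlewareB service.
listen middlewareB
    bind 127.0.0.1:8001
    mode tcp
    # Prefer the middleware host in our own Availability Zone...
    server middleware2 10.0.2.20:9000 check
    # ...and use hosts in the other zones only if it goes down.
    server middleware1 10.0.1.20:9000 check backup
    server middleware3 10.0.3.20:9000 check backup
```

The `backup` keyword is what keeps traffic inside the local zone under normal circumstances while still providing automatic cross-zone failover.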
Finally, we needed to factor disaster recovery into our plans. To do this, we created a complete disaster recovery environment in a totally separate geographic region from production that otherwise mimicked the servers and configuration in production. Based on our recovery time lines, we could get away with syncing our databases every few hours, and because these environments were independent of each other, we could test our disaster recovery procedure without impacting production.
One of the most important things to get right in this infrastructure was configuration management. Because I was building and maintaining everything largely by myself on tight time lines, the very first thing I focused on was a strong configuration-management foundation using Puppet. I had years of experience with Puppet, dating back to before it was the mature and robust product it is today, and this time I could take advantage of all the high-quality modules the Puppet community has written for common tasks to get a head start. Why reinvent an nginx configuration when the main Puppetlabs module already did everything I needed? One of the keys to this approach was starting from a basic vanilla image with no custom configuration and ensuring that every change that turned a vanilla server into, say, a middleware app server was done through Puppet.
Another critical reason I chose Puppet was precisely the reason many people avoid it: the Puppetmaster signs Puppet clients using TLS certificates. Many people hit a big roadblock when they try to set up Puppetmasters to sign clients and opt for a masterless setup instead, but in my case, that would have meant missing a great opportunity. I had a hard requirement that all communication over the cloud network be protected with TLS, and by having a Puppetmaster that signed hosts, I got a trusted local Certificate Authority (the Puppetmaster) and valid, locally signed certificates on every host in my network for free!
Many people open themselves up to vulnerabilities when they enable autosigning on Puppet clients, yet signing new Puppet clients manually, particularly in a cloud environment, can be cumbersome. I took advantage of a Puppet feature that lets you embed custom attributes into the Certificate Signing Request (CSR) the Puppet client generates; in particular, I used an x509 attribute designed to carry a pre-shared key. Puppet also lets you specify a custom autosigning script, which is passed each client CSR and decides whether to sign it. My script inspected the CSR for the client's name and the pre-shared key; if they matched the hostname/pre-shared key pair stored on the Puppetmaster, it signed the request, and otherwise it didn't.
This method worked because we spawned new hosts from the Puppetmaster itself. When spawning the host, the spawning script would generate a random value and store it in the Puppet client's configuration as a pre-shared key. It would also store a copy of that value in a local file named after the client hostname for the Puppetmaster autosign script to read. Since each pre-shared key was unique and used only for a particular host, once it was used, we deleted that file.
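A simplified sketch of the autosigning decision (the real script also has to extract the pre-shared key attribute from the CSR itself, which I elide here; the directory path and names are hypothetical):

```python
# Sketch: decide whether to autosign a Puppet client based on a
# single-use pre-shared key stored per hostname at spawn time.
import hmac
import os

PSK_DIR = "/etc/puppetlabs/autosign-psks"  # hypothetical location

def should_sign(certname, csr_psk, psk_dir=PSK_DIR):
    """Return True only if the CSR's pre-shared key matches the one
    stored for this hostname when the instance was spawned."""
    psk_file = os.path.join(psk_dir, certname)
    if not os.path.isfile(psk_file):
        return False
    with open(psk_file) as f:
        expected = f.read().strip()
    if not hmac.compare_digest(expected, csr_psk.strip()):
        return False
    # Each key is single-use: remove it once it has signed a host.
    os.remove(psk_file)
    return True
```

Deleting the key file on success is what enforces the one-key-one-host rule described above.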
To make configuring TLS on each server simple, I added a simple in-house Puppet module that let me copy the local Puppet client certificate and local Certificate Authority certificate wherever I needed it for a particular service, whether it was nginx, HAProxy, a local webapp or Postgres. Then I could enable TLS for all of my internal services knowing that they all had valid certificates they could use to trust each other.
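At its core, such a module boils down to a few file resources. A sketch (the define's name, parameters and destination paths are hypothetical; the ssldir paths match common Puppet defaults but may differ per install):

```puppet
# Sketch of an in-house define that copies the local Puppet client
# certificate, private key and CA cert into a service's config dir.
define tlscert::deploy($dest, $owner = 'root', $group = 'root') {
  file { "${dest}/cert.pem":
    source => "/etc/puppetlabs/puppet/ssl/certs/${::clientcert}.pem",
    owner  => $owner,
    group  => $group,
    mode   => '0644',
  }
  file { "${dest}/key.pem":
    source => "/etc/puppetlabs/puppet/ssl/private_keys/${::clientcert}.pem",
    owner  => $owner,
    group  => $group,
    mode   => '0600',
  }
  file { "${dest}/ca.pem":
    source => '/etc/puppetlabs/puppet/ssl/certs/ca.pem',
    owner  => $owner,
    group  => $group,
    mode   => '0644',
  }
}
```

A service profile could then declare something like `tlscert::deploy { 'nginx': dest => '/etc/nginx/ssl' }` and point its TLS configuration at those files.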
I used the standard role/profile pattern to organize my Puppet modules and made sure that whenever I had a Puppet configuration that was based on AWS-specific features, I split that off into an AWS-specific module. That way, if I needed to migrate to another cloud platform, I easily could identify which modules I'd need to rewrite.
All Puppet changes were stored in Git with the master branch acting as the production configuration and with additional branches for the other environments. In the development environment, the Puppetmaster would apply any changes that got pushed automatically, but since that Git repository was hosted out of the development environment, we had a standing rule that no one should be able to change production directly from development. To enforce this rule, changes to the master branch would get synced to production Puppetmasters but never automatically applied—a sysadmin would need to log in to production and explicitly push the change using our orchestration tool.
Puppet is great when you want to make sure that a certain set of servers all have the same changes, as long as you don't want to apply changes in a particular order. Unfortunately, a lot of changes you'll want to make to a system follow a certain order. In particular, when you perform software updates, you generally don't want them to arrive across your servers in a random order over 30 minutes. If there is a problem with the update, you want the ability to stop the update process and (in some environments) roll back to the previous version. When people try to use Puppet for something it's not meant to do, they often get frustrated and blame Puppet, when really they should be using Puppet for configuration management and some other tool for orchestration.
In the era when I was building this environment, MCollective was the most popular orchestration tool to pair with Puppet. Unlike orchestration tools that are little more than the SSH for-loop scripts everyone used a few decades ago, MCollective has a strong security model in which sysadmins are restricted to a limited set of commands within modules they have enabled ahead of time. Every command runs in parallel across the environment, so it's very fast to push changes, whether to one host or every host.
The MCollective client doesn't have SSH access to hosts; instead, it signs each command it issues and pushes it to a job queue. Each server checks that queue for commands intended for it and validates the signature before it executes it. In this way, compromising the host on which the MCollective client runs doesn't give you remote SSH root access to the rest of the environment—it gives you access only to the restricted set of commands you have enabled.
We used our bastion host as command central for MCollective, and the goal was to reduce the need for sysadmins to log in to individual servers to an absolute minimum. To start, we made sure that all of the common sysadmin tasks could be performed using MCollective on the bastion host. MCollective already includes modules that let you query the hosts on your network that match particular patterns and pull down facts about them, such as the version of a particular software package.
The great thing about MCollective commands is that they let you build a library of individual modules for particular purposes that you then can chain together in scripts for common workflows. I've written in the past about how you can use MCollective to write effective orchestration scripts, and this was an environment where it really shined. Let's take one of the most common sysadmin tasks: updating software. Because MCollective already had modules in place to query and update packages using the native package manager, we packaged all of our in-house tools as Debian packages as well and put them in internal package repositories. To update an in-house middleware package, a sysadmin would normally perform the following series of steps by hand:
- Get a list of servers that run that software.
- Start with the first server on the list.
- Set a maintenance mode in monitoring for that server.
- Tell any load balancers to move traffic away from the server.
- Stop the service.
- Update the software.
- Confirm the software is at the correct version.
- Start the service.
- Test the service.
- Tell any load balancers to move traffic back to the server.
- End the maintenance mode.
- Repeat for the rest of the hosts.
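The steps above can be sketched as a single driver function, where each helper stands in for an MCollective call (all of the helper names below are hypothetical stubs) and any failure stops the rollout:

```python
# Sketch of a rolling-update driver. In the real environment each
# helper wrapped an MCollective command; here they are stubs on `mco`.

def deploy_package(package, version, servers, mco):
    """Update `package` to `version` one server at a time,
    aborting on the first failure."""
    for server in servers:
        mco.set_maintenance(server)
        mco.drain_traffic(server)            # HAProxy peers stop sending traffic
        mco.stop_service(server, package)
        mco.update_package(server, package, version)
        installed = mco.package_version(server, package)
        if installed != version:
            raise RuntimeError("%s: expected %s %s, found %s"
                               % (server, package, version, installed))
        mco.start_service(server, package)
        if not mco.health_check(server, package):
            raise RuntimeError("%s: %s failed its health check"
                               % (server, package))
        mco.restore_traffic(server)
        mco.end_maintenance(server)
```

Because each step raises on failure, a bad update stops on the first affected server instead of rolling out everywhere.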
All I did was take each of the above steps and make sure there was a corresponding MCollective command for it. Most of the steps already had built-in MCollective plugins for them, but in a few cases, such as for the load balancers, I wrote a simple MCollective plugin for HAProxy that would control the load balancers. Remember, many of the servers in the environment had their own embedded HAProxy instance, but because MCollective runs in parallel, I could tell them all to redirect traffic at the same time.
Once each of these steps could be done with MCollective, the next step was to combine them all into a single generic script to deploy an application. I also added appropriate checks at each of the stages, so in the event of an error, the script would stop and exit with a descriptive error. In the development environment, we automatically pushed out updates once they passed all of their tests, so I also made sure that our continuous integration server (we used Jenkins) used this same script to deploy our app updates for dev. That way I could be sure the script was being tested all the time and could stage improvements there first.
Having a single script that would automate all of these steps for a single app was great, but the reality is that a modern service-oriented architecture has many of these little apps. You rarely deploy one at a time; instead, you have a production release that might contain five or more apps, each with their own versions. After doing this by hand a few times, I realized there was room to automate this as well.
The first step in automating production releases was to provide a production manifest my script could use to tell it what to do. A production manifest lists all of the different software a particular release will have and which versions you will use. In well-organized companies, this sort of thing is tracked in the ticketing system, so you have proper approval and visibility into what software went to production when. This is especially handy if you have a problem later, because you can more easily answer the question "what changed?"
I decided to make the right approach the easy approach and use our actual production manifest ticket as input for the script. That meant if you wanted an automated production release, the first step was to create a properly formatted ticket with an appropriate title, containing a bulleted list of each piece of software you want to deploy and the version you intend to deploy, in the order you want them deployed. You then would log in to production (thereby proving you were authorized to perform production changes) and run the production deploy script, which would take as input the specific ticket number it should read. It would perform the following steps:
- Parse the ticket and prompt the sysadmin with the list of packages it will deploy as a sanity check and not proceed until the sysadmin says "yes".
- Post a message in group chat alerting the team that a production release is starting, using the ticket title as a description.
- Update the local package repository mirrors so they have the latest version of the software.
- For each app: 1) notify group chat that the app is being updated, 2) run the app deployment automation script and 3) notify group chat that the app updated successfully.
- Once all apps have been updated successfully, notify group chat.
- Email the log of all updates to a sysadmin alias and also as a comment to the ticket.
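The ticket-parsing step can be as simple as pulling ordered (package, version) pairs out of the ticket's bulleted list. A sketch (the exact ticket markup here is hypothetical):

```python
# Sketch: parse a production-manifest ticket body into an ordered
# list of (package, version) pairs to deploy.
import re

def parse_manifest(ticket_body):
    """Return (package, version) tuples in the order they appear."""
    releases = []
    for line in ticket_body.splitlines():
        match = re.match(r"^\s*[-*]\s+(\S+)\s+([\w.~+-]+)\s*$", line)
        if match:
            releases.append((match.group(1), match.group(2)))
    return releases
```

Preserving the list order matters, because the ticket defines the order in which apps get deployed.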
As with the individual app deploy script, if there were any errors, we'd immediately abort and send alerts with full logs to email, chat and the ticket itself, so we could investigate what went wrong. We performed each deployment first in a hot disaster recovery environment located in a separate region and, if it succeeded, in production as well. Once the script worked successfully in production, it was smart enough to close the ticket. In the end, performing a production deployment, whether you wanted to update one app or ten, involved the following steps:
- Create a properly formatted ticket.
- Log in to the disaster recovery environment and run the production deploy script.
- Log in to the production environment and run the production deploy script.
The automation made the process so easy, production deploys were relatively painless while still following all of our best practices. This meant when I went on vacation or was otherwise unavailable, even though I was the only sysadmin on the team, my boss with a strong development background easily could take over production deployments. The consistent logging and notifications also made it so that everyone was on the same page, and we had a nice audit trail for every software change in production.
I also automated the disaster recovery procedure. You've only really backed something up if you've tested recovery. I set a goal of testing our disaster recovery procedure quarterly, although in practice I did it monthly, because fresh data in the disaster recovery environment helped us catch data-driven bugs in our software updates before they hit production. This is a much more frequent test than in many environments, but I could afford it because I wrote MCollective modules that restored the disaster recovery databases from backup, then wrapped the whole thing in a master script that turned it into a single command and logged the results to a ticket, so I could keep track of each time I restored the environment.
We had very tight security requirements for our environment that started (but didn't end) with PCI-DSS compliance. This meant that all network communication between services was encrypted using TLS (and the handy internal certificate authority Puppet provided), and all sensitive data was stored on disks that were encrypted at rest. It also meant that each server generally performed only one role.
Most of the environment was isolated from the internet, and we went further, defining ingress and egress firewall rules both on each host and in Amazon's security groups. We started with a "deny by default" approach and opened ports between services only when absolutely necessary. We also employed the principle of least privilege: only a few employees had production access, and developers did not have access to the bastion host.
Each environment had its own VPN, so to access anything but public-facing services, you started by connecting to a VPN that was protected with two-factor authentication. From there, you could access the web interfaces for our log aggregation server and other monitoring and trending dashboards. To log in to any particular server, you first had to ssh in to a bastion host, which only accepted SSH keys and also required its own two-factor authentication. It was the only host that was allowed access to the SSH ports on other machines, but generally, we used orchestration scripts whenever possible, so we didn't have to go further than the bastion host to administer production.
Each host had its own Host-based Intrusion Detection System (HIDS) using OSSEC, which not only would alert on suspicious activity on a server, but also would parse through logs looking for suspicious patterns. We also used OpenVAS to perform routine network vulnerability scans across the environment.
To manage secrets, we used Puppet's hiera-eyaml module that allows you to store a hierarchy of key:value pairs in encrypted form. Each environment's Puppetmaster had its own GPG key that it could use to decrypt these secrets, so we could push development or production secrets to the same Git repository, but because these files were encrypted for different recipients, development Puppetmasters couldn't view production secrets, and production Puppetmasters couldn't view development secrets. The nice thing about hiera is that it allowed you to combine plain text and encrypted configuration files and very carefully define which secrets would be available to which class of hosts. The clients would never be able to access secrets unless the Puppetmaster allowed them.
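A sketch of what this looks like in a hiera data file: plain values and encrypted values live side by side, and the ENC[GPG,...] blob can be decrypted only by that environment's Puppetmaster (the key names are hypothetical and the ciphertext is truncated for illustration):

```yaml
# common.eyaml (sketch; ciphertext truncated for illustration)
nginx::worker_processes: 4
webapp::db_password: ENC[GPG,hQEMA5aaaaaaaaaaAQf9truncated==]
```

Because the same file layout works in every environment, the Git repository can hold both development and production secrets without either Puppetmaster being able to read the other's.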
Data that was sent between production and the disaster recovery environment was GPG-encrypted with a key in the disaster recovery environment and also used an encrypted transport between the environments. The disaster recovery test script did all the heavy lifting required to decrypt backups and apply them, so the administrator didn't have to deal with them. All of these keys were stored in Puppet's hiera-eyaml module, so we didn't have to worry about losing them in the event a host went down.
Although I covered a lot of ground in this infrastructure write-up, I still only touched on the higher-level details. For instance, deploying a fault-tolerant, scalable Postgres database could be an article all by itself. I also didn't talk much about the extensive documentation I wrote that, much like my articles in Linux Journal, walks the reader through how to use all of these tools we built.
As I mentioned at the beginning of this article, this is only an example of an infrastructure design that I found worked well for me with my constraints. Your constraints might be different and might lead to a different design. The goal here is to provide you with one successful approach, so you might be inspired to adapt it to your own needs.
Resources

- "Orchestration with MCollective" by Kyle Rankin, LJ, December 2016
- "Orchestration with MCollective, Part II" by Kyle Rankin, LJ, January 2017
- "Using Hiera with Puppet" by Scott Lackey, LJ, March 2015
- Official PCI Security Standards Council Site
- HAProxy: the Reliable, High Performance TCP/HTTP Load Balancer
- "Puppet Redefines Infrastructure Automation" by Petros Koutoupis