Postmortem: What to Do After a Security Incident

What Actually Works

See failures as information.

Every failure, including not having enough information to do a proper postmortem, is itself information. Do not lose sight of this. If you find yourself at a loss in a postmortem, start by listing what you would have needed to conduct one but did not have. That is your first lesson learned.

Treat "root cause" as an adjective.

There's never only one root cause, because if there is only one root cause, the other root cause is "we failed to practice fault tolerance by implementing defense in depth". Root cause analysis is the act of finding root causes, plural, not the search for a single root cause.

Go back to first principles.

In my day job at Indiana University's Center for Applied Cybersecurity Research, we've been working on a set of seven principles from which cybersecurity in general can be derived. First principles work in reverse as well: they are not only a tool for performing information security, but also for figuring out how information security failed.

  • Comprehensivity. Was there a system no one knew about? Was a risk being ignored? Comprehensivity failures tend to be failures of scope.

  • Opportunity. Did something go unmaintained because the burden was placed on under-resourced in-house staff instead of using well-maintained common tools? Were staff under-trained, so that they didn't recognize something they should have? Was no one staying abreast of current threats?

  • Rigor. Was the organization caught out by assumptions that weren't being verified? Did monitoring fail? Was something not specified clearly enough to ensure that everyone was on the same page? Was automation not put in place to ensure that repetitive tasks were done precisely and consistently across time and space?

  • Minimization. Was something a bigger target than it needed to be? Were there more ways in, or more moving parts, than there needed to be? Could something become easier to protect by eliminating or shrinking some part of it?

  • Compartmentation. Did someone or something have access that it didn't absolutely need? Did isolation fail? Was cryptography not implemented appropriately? Were monolithic systems and processes used when things could have been segmented from one another? Were interfaces between systems or components of systems unclear or overly complex?

  • Fault tolerance. Was there a single point of failure? Was there a credential that wasn't cheap and easy enough to revoke, so it wasn't replaced when it should have been? Was something built or configured with the assumption that bad things wouldn't happen to it?

  • Proportionality. Was a security decision, or any systems or software decision, made in isolation, without considering the environment as a whole? This one can be a killer: when security interferes with getting the job done, people will circumvent it. When security is too expensive, no one will implement it. When a business case hasn't been made relative to other risks, the organization won't know what security to invest in and may invest in none at all, because implementing every information security control is untenable.

It takes time to work with and learn to use the principles for analysis, but it's worth doing so. They are invaluable in flexing one's brain around whatever problem comes along, instead of learning types of problems one at a time. Each principle has much more to it than these brief examples, but the examples here should provide a starting point for how they may crop up in an incident postmortem.
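
One way to make the principles concrete during a postmortem is to tag each root cause you find with the principle it falls under, then look at which principles you have not yet examined. The following is only a minimal sketch of that bookkeeping, not a prescribed method: the `RootCause` type, the helper functions, and the example incident are all hypothetical and exist purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The seven principles named in the list above.
PRINCIPLES = (
    "Comprehensivity",
    "Opportunity",
    "Rigor",
    "Minimization",
    "Compartmentation",
    "Fault tolerance",
    "Proportionality",
)


@dataclass(frozen=True)
class RootCause:
    """One of (usually several) root causes, tagged with a principle."""
    description: str
    principle: str

    def __post_init__(self):
        # Reject typos so that every finding maps to a known principle.
        if self.principle not in PRINCIPLES:
            raise ValueError(f"unknown principle: {self.principle!r}")


def group_by_principle(causes):
    """Return a dict mapping each principle to its list of descriptions."""
    grouped = defaultdict(list)
    for cause in causes:
        grouped[cause.principle].append(cause.description)
    return dict(grouped)


def unexamined_principles(causes):
    """Return the principles with no findings yet, in list order."""
    seen = {cause.principle for cause in causes}
    return [p for p in PRINCIPLES if p not in seen]


# Hypothetical incident, invented for illustration only.
causes = [
    RootCause("Forgotten test server was never inventoried", "Comprehensivity"),
    RootCause("Shared admin credential was hard to rotate", "Fault tolerance"),
    RootCause("Test server could reach the production database", "Compartmentation"),
]

for principle, descriptions in group_by_principle(causes).items():
    print(f"{principle}: {descriptions}")
print("Not yet examined:", unexamined_principles(causes))
```

A principle with no findings does not mean that principle held up; it may only mean no one has asked its questions yet. The list of unexamined principles is a prompt for where to look next, in keeping with the idea that root causes are plural.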