We had a rather serious outage this past week affecting several services, which put us out of SLA with our customers. Now that everything has been resolved, I am conducting a post-mortem review.
From this review, I would like to come up with an internal document that describes the outage, its effects, our response and the resolution. I want to come up with a fairly standard form for future reuse. I have included my thoughts below, but what other items should be included? If this were a security-related incident, what would you add?
- Summary: Executive-level summary of the event.
- Affected Services
- Impact: What was the impact on our users and SLAs? Was there a cost in dollar terms, missed transactions, lost customers, etc.?
- Outage Duration: For each affected service, if there were variances.
- Cause: Including primary and secondary causes.
- Resolution
- Timeline of events: Notifications, contact with external vendors, customer notifications, responses, etc.
- Problems with our response: Did things not go as planned with our response to the outage? Were the correct people notified? Did vendors meet their contracted obligations?
- Preventative measures to take: How do we prevent this outage from occurring again or reduce its impact?
- Detection Method: How well did we detect this outage and how do we improve detection in the future?
- Changes to make in future outage responses
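If the form is going to be reused, it may help to keep it as a structured record rather than free text. The following is only a rough sketch of the items above in Python; the class and field names (`ServiceOutage`, `PostMortem`, and so on) are my own invention for illustration, not an established format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ServiceOutage:
    """Outage window for one affected service (hypothetical structure)."""
    service: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        # Recorded per service because durations may vary between services.
        return self.end - self.start


@dataclass
class PostMortem:
    """One record per incident; fields mirror the list items above."""
    summary: str
    outages: List[ServiceOutage] = field(default_factory=list)
    impact: str = ""
    causes: List[str] = field(default_factory=list)  # primary cause first
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)
    response_problems: List[str] = field(default_factory=list)
    preventative_measures: List[str] = field(default_factory=list)
    detection_method: str = ""
    response_changes: List[str] = field(default_factory=list)

    def affected_services(self) -> List[str]:
        return [o.service for o in self.outages]

    def longest_outage(self) -> timedelta:
        # Zero if no outage windows have been recorded yet.
        return max((o.duration for o in self.outages), default=timedelta(0))
```

Filling in one `ServiceOutage` per affected service gives you the "Affected Services" and "Outage Duration" items for free, and the remaining fields can be rendered into whatever document format management prefers.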
Please keep each answer to one item with an explanation; I will update this post with the top-voted answers.
Although it could be covered in the Preventative measures to take, I would recommend having a Detection method section that you could use to note what the true symptoms were and how you could detect the problem (faster) if it happens again, ideally using automation.
Looks good. I would only add the following:
Effects/Consequences: What is the consequence of the outage - who was affected, which SLAs were violated (if any), were there any knock-on effects?
Affected services and outage duration only tell you part of how bad an outage was. You also want to know what the impact on the business was.
Impact: What effect did this have on users, and how was it perceived? How much money did this cost us (through missed SLAs, lost orders, etc.)?
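For the SLA part of the impact, the arithmetic is simple enough to script. This is a minimal sketch, assuming the SLA is expressed as an availability fraction (for example 0.999) over a fixed measurement period; the function names are made up for illustration, and your contract may define downtime or the measurement window differently.

```python
from datetime import timedelta


def availability(downtime: timedelta, period: timedelta) -> float:
    """Fraction of the measurement period during which the service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    return 1.0 - downtime / period


def allowed_downtime(period: timedelta, target: float) -> timedelta:
    """Downtime budget for the period at the given availability target."""
    return period * (1.0 - target)


def sla_breached(downtime: timedelta, period: timedelta, target: float) -> bool:
    """True if measured availability fell below the target."""
    return availability(downtime, period) < target
```

As a sanity check, a 99.9% target over a 30-day period leaves a budget of roughly 43 minutes, so a two-hour outage in that period would breach it while a 30-minute one would not.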
Public release & internal release
This is more something for management to decide, but you might want to include what should be released to customers about it, or at least your recommendation. Either way, get sign-off from management on the exact wording before releasing anything to customers.
The public release should be included in this document so that anyone in the company knows what they are allowed to tell customers.