Outages are something we try to avoid, but they're inevitable: they happen (very rarely, we hope) and we have to know how to deal with them (and learn from them).
So, what's the biggest outage you've been part of? How did you and your team deal with the problem? What did you learn for the future? Please share your thoughts :)
A heating steam pipe that ran through our data center ruptured: very hot, with condensation and asbestos insulation all over the place. Power was cut for weeks during the cleanup.
OK, my group's stuff was BGP-paired and load balanced between multiple data centers, so some fraction of our users saw a 30-second freeze before their current transaction was transferred. Many of the other projects saw outages of up to several days, with everyone putting in lots of overtime to help everyone else.
Lessons learned: do your continuity planning first, then build your system to support your conclusions.
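To make the point concrete, here's a toy sketch of the sort of client-side behaviour that turns a data-center failure into a short freeze rather than an outage: time out, then retry against the other site. The endpoints and the 30-second timeout are made-up placeholders, not our actual setup.

    #!/usr/bin/env python3
    """Toy illustration of client-side failover between data centers (sketch only)."""
    import socket

    ENDPOINTS = [("dc1.example.com", 443), ("dc2.example.com", 443)]  # placeholder endpoints
    TIMEOUT = 30  # seconds; roughly the freeze a user would notice during failover

    def connect_with_failover():
        """Try each data center in turn; return the first socket that connects."""
        last_error = None
        for host, port in ENDPOINTS:
            try:
                return socket.create_connection((host, port), timeout=TIMEOUT)
            except OSError as err:      # timed out, refused, or unresolvable: try the next DC
                last_error = err
        raise last_error

    if __name__ == "__main__":
        conn = connect_with_failover()
        print("connected to", conn.getpeername())
        conn.close()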
I'm 'part of' outages almost every single day (I monitor WAN links for 44 sites). The 'little ones' are the ones that last less than 5 minutes and usually go 'unnoticed' (the NOC only monitors outages longer than 5 minutes, for some reason). I try to communicate with the site to see whether it was an internal issue, and I check router logs whenever the cause is 'unknown'.
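To give an idea of what I mean, here's a minimal sketch of the sort of per-site outage logger I'm talking about, using plain ICMP probes (Linux-style ping flags). The site addresses, probe interval, and the 5-minute threshold are placeholders, not our real configuration.

    #!/usr/bin/env python3
    """Minimal WAN-link outage logger (illustrative sketch only)."""
    import subprocess
    import time
    from datetime import datetime, timedelta

    SITES = {"site-01": "10.0.1.1", "site-02": "10.0.2.1"}  # placeholder router addresses
    THRESHOLD = timedelta(minutes=5)   # the NOC only cares about outages longer than this
    INTERVAL = 30                      # seconds between probes

    def is_up(address: str) -> bool:
        """One ICMP echo; any non-zero exit status counts as 'down'."""
        return subprocess.run(
            ["ping", "-c", "1", "-W", "2", address],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    down_since: dict[str, datetime] = {}

    while True:
        for site, address in SITES.items():
            if is_up(address):
                started = down_since.pop(site, None)
                if started:
                    length = datetime.now() - started
                    flag = "REPORT TO NOC" if length > THRESHOLD else "blip"
                    print(f"{site} back up after {length} ({flag})")
            else:
                # Record when the outage started, but only on the first failed probe.
                down_since.setdefault(site, datetime.now())
        time.sleep(INTERVAL)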
I find communication is key (and that's an understatement!) when dealing with outages. DO NOT WAIT TO BE CALLED while you're troubleshooting or trying to find out exactly what happened. Make sure you communicate that you know they're down and that you're working on it. Give them a time frame for when you will get back to them with updates on the situation (an ETR). Don't leave them hanging, wondering whether you've forgotten about them; make sure they KNOW someone is looking at their problem. You call them, so they don't have to call you.
Thankfully, the longest a site has been down on my watch is 7 hours (within a 10am-5pm work day). It should have been shorter by a few hours, had it not been for the lack of good communication between all of the parties involved. In short, the issue wasn't escalated properly, and because everyone assumed 'someone was working on it', it took (for the site, at least) forever to get resolved.
I was attending a job interview at a company that happened to be in the middle of a complete network outage in their 50+ user office. I solved it within minutes, and got to meet their current sysadmin and the IT support company they'd called in because he couldn't solve it; they'd spent all morning trying to work out what was going wrong.
The previous guy had installed two wireless routers in bridge mode and plugged them both into the wired network. They were barely in range of each other, so the network had a loop that came and went as the reception varied.
Needless to say I got the job and then implemented some change management logging as soon as I started.
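For what it's worth, a quick way to confirm an intermittent layer-2 loop like that is to send a uniquely tagged broadcast frame and count how many copies come back in on the same interface. A rough sketch follows; it assumes Linux, root, and Scapy, and the interface name is a placeholder.

    #!/usr/bin/env python3
    """Crude layer-2 loop probe (illustrative sketch, not production tooling)."""
    import time
    import uuid
    from scapy.all import AsyncSniffer, Ether, Raw, sendp

    IFACE = "eth0"                   # placeholder interface name
    marker = uuid.uuid4().bytes      # unique payload so we only match our own probe

    # Capture anything carrying our marker on the same interface we send from.
    sniffer = AsyncSniffer(
        iface=IFACE,
        lfilter=lambda p: p.haslayer(Raw) and marker in bytes(p[Raw].load),
    )
    sniffer.start()

    # One broadcast frame on an experimental EtherType; a sane switched network
    # should never forward a frame back toward the port it arrived on.
    sendp(Ether(dst="ff:ff:ff:ff:ff:ff", type=0x88B5) / Raw(load=marker),
          iface=IFACE, verbose=False)

    time.sleep(5)
    sniffer.stop()

    copies = len(sniffer.results)
    # We usually capture our own outgoing frame once; anything beyond that means
    # the broadcast was looped back to us.
    print(f"captured {copies} copies of the probe:",
          "possible layer-2 loop" if copies > 1 else "no loop seen")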
I experienced a week-long outage of our entire server network. We dealt with it afterwards by building a redundant network to prevent that same problem in the future, but while the outage was occurring we ran off an old server we had set up in a remote location. We've learned to always have a backup plan.
Probably the biggest one was a 4-day all-HQ network outage caused by a major network upgrade.
The biggest tip I have is to have an established, robust incident management process. There's a brilliant presentation I saw at the Velocity 2008 conference about adapting the general Incident Command System used by emergency personnel (http://en.wikipedia.org/wiki/Incident_Command_System) to IT-type incidents as well: http://en.oreilly.com/velocity2008/public/schedule/detail/1525
We cribbed extensively from this when developing our own internal "Sev1" incident process. It stresses communication, unity of command, clear handoff of responsibilities, and other great stuff.
I'll also put in a plug for the Transparent Uptime blog: http://www.transparentuptime.com/ - it's focused on online services, but his general rules on how and what to communicate during an outage apply to internal IT stuff as well. http://www.transparentuptime.com/2010/03/guideline-for-postmortem-communication.html specifically - we had a manager crib from that and start sending out communications in that format, and you wouldn't believe the positive response.
How well timed. I just got back from an emergency trip to one of the sites we support.
As far as users go it wasn't a major impact, but it had the potential to be. As part of an ongoing project to migrate some sites off of our support, we created a new trusted domain. After extensive testing we prepped for the first site to migrate to the new domain, which we would still manage.

The night of the migration comes along and we start by migrating one of the two DCs to the new domain. That goes fine. We migrate the security groups and user accounts; that goes fine as well, and group membership is updated properly. We migrate the file server and run security translation to update the ACLs. Again, all goes well. We migrate the app servers and update IAS for VPN with no problems. We then migrate a test user PC, and the user retains their profile settings and can access all network resources perfectly. We then migrate the other DC.

Then we go to migrate the remaining computers and half of them fail. We find that the local XP firewall is on. I immediately push a GPO to the site to turn it off, but have to wait for the computers to refresh. That doesn't happen quickly enough, users start arriving, and they can't log into the original domain because both DCs are now on the new one.
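In hindsight, a simple pre-flight check would have caught the firewall problem before migration night: probe each target PC on TCP 445, which the XP firewall blocks by default. A minimal sketch of the idea, with placeholder hostnames rather than the real machines:

    #!/usr/bin/env python3
    """Pre-flight reachability check before a domain migration (rough sketch)."""
    import socket

    TARGETS = ["pc-001", "pc-002", "pc-003"]   # hypothetical computer names
    PORT = 445                                 # SMB; blocked when the XP firewall is on

    blocked = []
    for host in TARGETS:
        try:
            with socket.create_connection((host, PORT), timeout=3):
                pass                           # connected fine; the firewall is not in the way
        except OSError:
            blocked.append(host)               # firewall on, offline, or unresolvable

    if blocked:
        print("Fix these before migrating:", ", ".join(blocked))
    else:
        print("All targets reachable on TCP 445.")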
Rather than try re-adding one DC back to the original domain, we update firewall rules to allow access to other remote DCs for the original domain and take the 3-hour drive to the site.
Going on little sleep: the GPO to disable the local firewall has now pushed out. Without thinking, I grab all the computer objects and push the migration, forgetting that this RESETs the computer objects. So now all the previously successful PCs are cut off from the domain.
To make matters worse, the local admin password we roll out with our image doesn't work, because a long-gone on-site tech had reset them all.
I spent the weekend manually adding all the PCs to the new domain, after using a boot disk to wipe the local admin password on each one.
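The other thing I'd script next time is a guard so that any second push only touches the machines that actually failed, instead of resetting the ones that had already moved. A trivial sketch, with made-up machine lists standing in for the migration tool's logs:

    #!/usr/bin/env python3
    """Only re-push the machines that failed (tiny illustrative sketch)."""
    migrated = {"pc-001", "pc-003", "pc-005"}          # already on the new domain
    all_targets = {"pc-001", "pc-002", "pc-003",
                   "pc-004", "pc-005", "pc-006"}       # everything we meant to move

    retry = sorted(all_targets - migrated)             # the only machines to touch again
    print("Re-push migration for:", ", ".join(retry))
    print("Leave these alone (already migrated):", ", ".join(sorted(migrated)))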
Lessons Learned: check prerequisites like the local firewall state on every machine before migration night; never re-push a migration at machines that have already moved (it resets their computer objects); keep local admin passwords documented and under control; and don't make bulk changes on too little sleep.
Sorry that was long-winded.