One of my client's sites received a direct lightning hit last week (coincidentally on Friday the 13th!).
I was remote to the site, but working with someone onsite, I discovered a strange pattern of damage. Both internet links were down and most servers were inaccessible. Much of the damage occurred in the MDF, but one fiber-connected IDF also lost 90% of the ports on a switch stack member. There were enough spare switch ports to redistribute the cabling and reprogram them, but there was downtime while we chased down the affected devices.
This was a new building/warehousing facility, and a lot of planning went into the design of the server room. The main server room runs off an APC SmartUPS RT 8000VA double-conversion online UPS, backed by a generator. There was proper power distribution to all connected equipment, and offsite data replication and systems backups were in place.
In all, the damage (that I'm aware of) was:
- Failed 48-port line card on a Cisco 4507R-E chassis switch.
- Failed Cisco 2960 switch in a 4-member stack (oops... loose stacking cable).
- Several flaky ports on a Cisco 2960 switch.
- HP ProLiant DL360 G7 motherboard and power supply.
- Elfiq WAN link balancer.
- One Multitech fax modem.
- WiMax/Fixed-wireless internet antenna and power-injector.
- Numerous PoE-connected devices (VoIP phones, Cisco Aironet access points, IP security cameras).
Most of the issues were tied to losing an entire switch blade in the Cisco 4507R-E, which carried some of the VMware NFS networking and the uplink to the site's firewall. A VMware host failed, but HA took care of the VMs once storage networking connectivity was restored. I was forced to reboot/power cycle a number of devices to clear funky power states. So the time to recovery was short, but I'm curious about what lessons should be learned...
- What additional protections should be implemented to protect equipment in the future?
- How should I approach warranty and replacement? Cisco and HP are replacing items under contract. The expensive Elfiq WAN link balancer has a blurb on its website that basically says "too bad, use a network surge protector" (it seems they expect this type of failure).
- I've been in IT long enough to have encountered electrical storm damage in the past, but with very limited impact; e.g. a cheap PC's network interface or the destruction of mini switches.
- Is there anything else I can do to detect potentially flaky equipment, or do I simply have to wait for odd behavior to surface?
- Was this all just bad luck, or something that should really be accounted for in disaster recovery?
With enough $$$, it's possible to build all sorts of redundancies into an environment, but what's a reasonable balance of preventative/thoughtful design and effective use of resources here?
A couple of jobs ago, one of the datacenters at the place I was working for was one floor below a very large aerial. This large, thin, metal item was the tallest thing in the area and was hit by lightning every 18 months or so. The datacenter itself was built around 1980, so I wouldn't call it the most modern thing around, but they had long experience dealing with lightning damage (the serial-comms boards had to be replaced every time, which is a trial when the comms boards are in a system that hasn't had any new parts made in 10 years).
One thing the old hands brought up is that all that spurious current can find a way around anything and can spread through a common ground once it bridges in; it can even bridge in across air gaps. Lightning is an exceptional case where normal safety standards aren't good enough to prevent arcing: it will go as far as its energy lets it, and it has a lot of energy. If there is enough of it, lightning can arc from a suspended-ceiling grid (perhaps one of the suspension wires hangs from a loop connected to a building girder in the cement) to the top of a 2-post rack and from there into the networking goodies.
As with hackers, there is only so much you can do. Your power feeds all have protection on them that clamps spurious voltages, but your low-voltage networking gear almost never does, and it represents a common path along which an extremely energetic current can route.
Detecting potentially flaky kit is something I know how to do in theory, but not in practice. Probably your best bet is to put the suspect gear into one area, deliberately bring the room temperature up to the high end of the equipment's operating range, and see what happens. Run some tests and load the heck out of it. Leave it there for a couple of days. The added thermal stress on top of any pre-existing electrical damage may weed out some of the time bombs.
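If you want something watching during that soak, a quick script can log reachability for the suspect devices over the whole burn-in window, so intermittent drop-outs don't go unnoticed. This is only a minimal sketch: the host list, polling interval, and log file are placeholders, and it assumes a Unix-like box with iputils `ping` available.

```python
#!/usr/bin/env python3
"""Minimal soak-test logger: ping suspect devices on an interval and record
up/down status plus a coarse latency figure, so intermittent failures show
up over a multi-day burn-in."""
import csv
import subprocess
import time
from datetime import datetime

# Placeholders -- substitute your own suspect devices, interval, and log path.
HOSTS = ["10.0.10.21", "10.0.10.22", "10.0.20.5"]
INTERVAL_SECONDS = 60
LOGFILE = "burnin_ping.csv"


def ping_once(host: str, timeout_s: int = 2):
    """Return an approximate round-trip time in ms, or None on no reply.
    Uses Linux iputils ping; -W is the reply timeout in seconds."""
    start = time.monotonic()
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None
    # Coarse figure: includes process start-up overhead, fine for trending.
    return (time.monotonic() - start) * 1000.0


def main():
    with open(LOGFILE, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            for host in HOSTS:
                rtt = ping_once(host)
                writer.writerow([
                    datetime.now().isoformat(),
                    host,
                    "ok" if rtt is not None else "DOWN",
                    f"{rtt:.1f}" if rtt is not None else "",
                ])
            f.flush()
            time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

The same idea extends to polling the switches' interface error counters during the soak; CRC and input errors creeping up on one port is usually the first sign of hardware the strike left marginal.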
It definitely did shorten the lifespan of some of your devices, but finding out which ones is hard. Power-conditioning circuitry inside power supplies may have compromised components and be delivering dirty power to the server, something you could only detect with specialized devices designed to test power supplies.
Lightning strikes are not something I've considered for DR outside of having a DC in a facility with a giant lightning rod on the roof. Generically, a strike is one of those things that happen so infrequently it's shuffled under 'act of god' and moved along.
But... you've had one now, which shows your facility had the right conditions at least once. It's time to get an assessment of how prone your facility is, given the right conditions, and plan accordingly. If you're only now thinking about the DR impact of lightning, I think that's appropriate.
I've been thinking about this question since it recently got edited back to the top of the front page.
I freely stipulate that, for people like sysadmin1138 who have to deal with installations that are highly-attractive to large lightning strikes on the DC roof, specific contingency planning for a big strike makes sense. But for most of us, this is a one-off circumstance, and I thought an answer more generally suited to the rest of us might have some value.
It is possible to imagine all kinds of movie-plot threats: scenarios that could definitely happen and would unquestionably take down your business operations if they did, but that there is no reason to think have any elevated likelihood of happening. You know the sort of thing: airplane strike, lightning bolt, a nearby oil depot exploding, any other plausible-but-background-risk scenario.
Each of these has a specific mitigation plan that could be put in place, but I would suggest that - modulo my stipulation above - it makes no business sense to do so. As Schneier is trying to point out in the above-linked competition, just because you can imagine something dreadful happening doesn't make it a threat against which specific planning is worthwhile, or even desirable. What does make good business sense is a general-purpose, well-documented, tested business continuity plan.
You should ask yourself what the business costs are of a complete site loss for various periods of time (e.g. 24 hours, 96 hours, one week, one month) and attempt to quantify the likelihood of each occurrence. It must be an honest business cost analysis, bought into by all levels of the business. I've worked at a site where the generally accepted figure for downtime was £5.5 million/hour (and that was 20 years ago, when five million quid was a lot of money); having that figure generally agreed made so many decisions so much easier, because they just became a matter of simple maths.
Your budget is the projected loss multiplied by the annual chance of that loss; now see what you can do to mitigate that threat for the budget.
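To make that arithmetic concrete, here's a toy version of the calculation; every figure below is invented for illustration, not taken from the incident above.

```python
# Back-of-the-envelope annualized-loss budgeting. All numbers are
# illustrative placeholders, not figures from the question.
downtime_cost_per_hour = 20_000   # agreed business cost of downtime, per hour
expected_outage_hours = 24        # outage length for the scenario being planned
annual_probability = 0.05         # estimated chance of that scenario in any given year

projected_loss = downtime_cost_per_hour * expected_outage_hours
annualized_loss = projected_loss * annual_probability  # roughly your yearly mitigation budget

print(f"Projected loss per event: {projected_loss:,}")                 # 480,000
print(f"Annualized loss (mitigation budget): {annualized_loss:,.0f}")  # 24,000
```

If a surge-protection retrofit or an upgraded maintenance contract comes in well under that annualized figure, it is easy to justify; if it costs several times more, it probably isn't worth doing for this scenario alone.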
In some cases, this will run to a full standby data centre, with cold equipment, ready to go 24x7. It may mean a small standby data centre, so that customer interaction can continue with a very-reduced number of telephone operatives, and a placeholder website warning of disruption. It may mean a second, redundantly-routed internet connection at your main site, lying cold until needed. It may mean, as Mark Henderson notes above, insurance (but insurance that covers the business losses as well as the actual costs of recovery); if you can spend your BC budget on a single piece of paper that will cover all your expected costs in the event of disaster, it may make sense to buy that piece of paper - but don't forget to factor failure of underwriter into your business risk plan. It may mean upgrading the maintenance contracts on certain core equipment to extremely expensive four-hour-to-fix ones. Only you can know what makes sense for your business.
And once you have this plan, you really need to test it (with the possible exception of insurance-based ones). I've worked at a site where we had a complete small-scale-operation cold site, ready to cut over to, 45 minutes' drive from our main facility. When we had a problem that shut the core network down, we ended up trying to fix it live instead of cutting over to the cold site, then fixing the core and cutting back. One of the reasons behind the failure to cut over was that we had no real idea of how long it would take to cut over and to cut back. Therefore, no one really knew how long things should be allowed to run without cutover before making the decision to cut, so, quite understandably, there was reluctance to decide to cut over. Heads rolled after we came back online 14 hours later; not because of the outage per se, but because a lot of money had been spent on a facility to mitigate a day-plus outage, and it had lain unused during just such an outage.
As a final point, note that outsourced components of your business plan are not guaranteed to work. Your senior management may be sitting there thinking "if we put the servers in the cloud, they'll just always be there, and we can fire the sysadmins". Not so. Clouds can fail like anything else; if you've outsourced critical components to a provider, all you've done is remove your ability to estimate the chances of those components failing. SLAs are all very well, but unless they're backed by substantial non-performance penalties, they're meaningless: why would your provider spend extra money on staying available when they could just trouser the money and refund your service charges for the period of unavailability? To be reliable, your SLAs need to come with penalties that approximate the cost to your business of the outage. Yes, that will greatly increase the cost of outsourcing; and yes, that is entirely to be expected.
It always comes down to how much you want to spend. I don't have deep enough knowledge to speak at length about this, but I've been in a big-pharma datacenter that took a lightning strike which blew through something that was supposed to be a multiply-redundant spike arrester (it was designed correctly, but implemented wrong, so something got through).
What was the maximum spike that your UPS could have prevented? It should have a rating. Apparently the strike was direct enough to exceed that, or something leaked in around the UPS feed, like a bad ground. So maybe you review your power design, determine how likely another strike is, compare the cost of downtime × likelihood against the cost of remediation, and have an electrician give the facility a good survey to make sure that everything is grounded properly. Some quick reading shows that grounding for safety/code is not as intensive as grounding to prevent damage from lightning.
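If you do go back over the power design, it can also be worth logging what the UPS itself sees, so that sags, surges, and transfers to battery are on record rather than guessed at afterwards. Below is a minimal sketch, assuming the SmartUPS is monitored through apcupsd; the voltage limits and poll interval are placeholders for a 120 V feed, so adjust them for your supply.

```python
#!/usr/bin/env python3
"""Rough power-quality watcher: poll apcupsd's `apcaccess status` output and
flag line-voltage excursions or new transfers to battery. Assumes apcupsd is
installed and talking to the UPS; thresholds below are placeholders."""
import subprocess
import time

LOW_VOLTS, HIGH_VOLTS = 108.0, 128.0  # placeholder limits for a 120 V feed
POLL_SECONDS = 30


def read_status():
    """Parse `apcaccess status` key/value output into a dict."""
    out = subprocess.run(["apcaccess", "status"], capture_output=True,
                         text=True, check=True).stdout
    status = {}
    for line in out.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def main():
    last_xfers = None
    while True:
        s = read_status()
        if "LINEV" in s:
            linev = float(s["LINEV"].split()[0])  # e.g. "121.0 Volts"
            if not (LOW_VOLTS <= linev <= HIGH_VOLTS):
                print(f"Line voltage out of range: {linev} V "
                      f"(STATUS {s.get('STATUS')})")
        xfers = int(s.get("NUMXFERS", "0"))
        if last_xfers is not None and xfers > last_xfers:
            print(f"UPS transferred to battery {xfers - last_xfers} "
                  f"time(s) since the last poll")
        last_xfers = xfers
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

A network management card sending SNMP traps, or apcupsd's own event hooks, would do the same job without a polling loop; the point is simply to have a record of what the feed was doing when equipment starts misbehaving.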