We have several hosts for which we keep an identical hot spare host, which is patched and updated so that it stays very close to the primary in software and config. In case of failure, the network cable is switched over and the DHCP server is updated with the new MAC address. That is the best case; usually there is a bit more that needs modification.
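For context, the DHCP change is just swapping the MAC in the host's static reservation; a minimal sketch of that step, assuming ISC dhcpd on Linux (the MACs, address and host name here are made up):

    # The reservation in /etc/dhcp/dhcpd.conf looks roughly like this:
    #
    #   host appserver {
    #       hardware ethernet 00:11:22:33:44:55;   # primary host's NIC
    #       fixed-address 192.168.1.50;
    #   }
    #
    # Failing over means swapping in the spare's MAC and restarting the service:
    sed -i 's/00:11:22:33:44:55/66:77:88:99:aa:bb/' /etc/dhcp/dhcpd.conf
    systemctl restart isc-dhcp-server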
I feel it is a waste of electricity to keep a hot spare host running and a waste of time to maintain it, and since config modifications are needed on failover anyway, I'd like to ask the following:
Are hot spare hosts old school, and are there better ways now?
Instead of having a hot spare host, would it make sense to make it a cold spare: take its hard drives, put them in the primary host, and change the RAID from 1 to 1+1? In case of failure, all I would have to do is change the network cables, update the DHCP server, move the hard drives into the cold spare and power it on. The benefit, as I see it, is that the 2x2 disks are always in sync, so there is only one host to maintain and no config changes are needed when failing over.
Is that a good idea?
Yes, it's a bit old school. Modern hardware just doesn't fail that often. Focus either on making your applications more highly available (not always possible), or on the items needed to make your individual hosts more resilient...
For hosts:
In order of decreasing failure frequency, I most often see: disks, RAM, power supplies, fans... Sometimes the system board or CPU. But those last two are where your support contract should kick in.
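Most of those components can be watched before they take the host down; a rough health-check sketch, assuming smartmontools, edac-utils and ipmitool are installed (device names are examples):

    smartctl -H /dev/sda                 # disk SMART health summary
    edac-util -v                         # ECC RAM error counters
    ipmitool sdr type Fan                # fan sensor readings
    ipmitool sdr type "Power Supply"     # PSU sensor readings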
It's rather inefficient - not least because of the dependency on manual intervention to make the switch.
I have worked at places that run a hot DR site - literally identical servers to the primary, ready to go instantly. However, the DR switchover is an automated process - we're not talking cabling, a bit of fiddling and a switch, but a process where we press the button and everything flips from one site to the other.
This approach is sickeningly expensive, but that's a business decision - acceptable risk vs. the money needed to deliver on the objective. As a rule, there's an exponential curve on recovery time objective - the nearer to zero it gets, the more it costs.
But that's what your question is really about: what is your recovery time objective, and what is the most effective way of achieving it? Waiting for a server to boot will take a few minutes. How long does it take someone to do the adjustment and 'recovery tasks' when it goes pop at 4am?
And how long is an acceptable outage?
I would suggest that if you're doing 'hot recovery' you want to think clustering. You can do clustering fairly cheaply with good use of VMware - 'failing over' to a VM - even from a physical host - means you're not running redundant hardware. (Well, N+1 rather than 2N.)
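As one possible sketch of that kind of 'hot recovery': a floating service IP under a cluster manager means clients never see a MAC or DHCP change at all. Assuming a Pacemaker/Corosync cluster managed with pcs (the resource name and address are made up):

    # The cluster moves this IP to whichever node is healthy, so no DHCP edit is needed.
    pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
        ip=192.168.1.50 cidr_netmask=24 op monitor interval=30s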
If your RTO is long enough, then switch the box off. You may find that your RTO is long enough that a cold rebuild from backup is OK.
Sobrique explains how the manual intervention makes your proposed solution sub-optimal, and ewwhite talks about the probability of failure of various components. Both of those IMO make very good points and should be strongly considered.
There is however one issue that nobody seems to have commented on at all so far, which surprises me a little. You propose to "take the hard drives and put them in the primary host and change the RAID from 1 to 1+1".
This doesn't protect you against anything the OS does on disk.
It only really protects you against disk failure, the impact of which you already greatly reduce by moving from mirrors (RAID 1) to mirrors of mirrors (RAID 1+1). You could get the same result by increasing the number of disks in each mirror set (going from a 2-disk RAID 1 to a 4-disk RAID 1, for example), which would quite likely also improve read performance during ordinary operations.
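For illustration, a four-disk mirror is a one-liner with Linux md, assuming mdadm and these made-up device names:

    # Every member holds a full copy; any three disks can fail without data loss.
    mdadm --create /dev/md0 --level=1 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1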
Well then, let's look at some ways this could fail.
Maybe someone accidentally runs rm -rf ../* or rm -rf /* instead of rm -rf ./*. Maybe, maybe, maybe... (and I'm sure there are plenty more ways your proposed approach could fail.) However, in the end this boils down to your "the two sets are always in sync" "advantage". Sometimes you don't want them to be perfectly in sync.
Depending on what exactly has happened, that's when you want either a hot or cold standby ready to be switched on and over to, or proper backups. Either way, RAID mirrors of mirrors (or RAID mirrors) don't help you if the failure mode involves much of anything aside from hardware storage device failure (disk crash). Something like ZFS' raidzN can likely do a little better in some regards but not at all better in others.
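For what it's worth, point-in-time copies are what give you that "not perfectly in sync" protection; a minimal sketch with ZFS snapshots (pool, dataset and host names are made up):

    # A snapshot preserves the state before a risky change; mirrors alone cannot do this.
    zfs snapshot tank/data@before-change
    # Optionally replicate it to another machine, which doubles as an off-host backup:
    zfs send tank/data@before-change | ssh backuphost zfs receive backup/data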
To me, this would make your proposed approach a no-go from the beginning if the intent is any sort of disaster failover.
The fact that it is old school doesn't necessarily make the use of a hot spare a bad idea.
Your main concern should be the rationale: what risks do you run, and how does running a hot spare mitigate them? Because in my perception your hot spare only addresses hardware failure, which, although not uncommon, is neither the only operational risk you run nor the most likely one. The second concern is whether alternative strategies provide more risk reduction or significant savings.
Running a hot spare with multiple manual fail-over steps will take a long time and is likely to go wrong, but I've also seen automated failover with HA cluster suites turn into major cluster f*cks.
Another thing is that hot or cold standby in the same location doesn't provide business continuity in case of local disaster.
The concept of having a hot or even cold spare depends on how the application(s) are built in the first place.
What I mean is that if the application has been built in such a way that the data and service load are spread across multiple machines, then the risk of any single machine taking the system down should go away. In that situation you don't need a hot spare; instead you need enough excess capacity to handle it when an individual machine/component dies.
For example, a standard web application generally requires a web server and a database server. For the web servers, just load balance 2 or more. If one dies, no biggie. The database is usually more difficult as it has to be architected to be multi-master with all data sync'd across the participating machines. So instead of a single DB server you end up with 2 (or more) that are both servicing your data needs. Large service providers such as Google, Amazon, Facebook, etc have gone this route. There is more upfront cost in development time, but it pays dividends if you need to scale out.
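For the web tier, the load-balancing part really can be that simple; a minimal sketch assuming HAProxy in front of two web servers (names and addresses are made up):

    # Relevant section of /etc/haproxy/haproxy.cfg:
    #
    #   frontend www
    #       bind *:80
    #       default_backend webfarm
    #
    #   backend webfarm
    #       balance roundrobin
    #       server web1 10.0.0.11:80 check
    #       server web2 10.0.0.12:80 check
    #
    # Validate the config and reload without dropping the surviving node:
    haproxy -c -f /etc/haproxy/haproxy.cfg
    systemctl reload haproxy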
Now, if your application isn't structured in such a way, or it's simply prohibitive to retrofit the app, then yes, you will likely want a hot spare.