After having a bit of a scare with a server that wouldn't come up one morning, the higher-ups have decided that the business needs a high-availability / failover setup.
We have 5 main servers (4x Linux, 1x OpenBSD) all of which need to be running for the company to operate. Three of the servers are fairly standard (Files/Web/Database), the fourth handles most network routing and web proxies, while the fifth supports our phone system and has non-standard hardware.
My boss has stated that turnaround time for a server failure should be under 30 minutes.
My experience in this field is non-existent (I'm just a programmer who was 'promoted'), so I guess my question really boils down to:
- Is this something that should even be attempted by someone with average server-admin skills? If so, what should I read, and who should I talk to?
Thanks.
I think you should start by getting numbers together to describe the cost associated with fulfilling the stated "requirement" to see if it even falls within the budget. If you're not comfortable with all of the "normal" methods that would be used to fulfill the requirement (failover clustering, hypervisors with "hot migration" capability, etc), then you'd probably do well to find a consultant who can help out.
There's going to be some cost associated with the feasibility study, but it's going to cost a lot less to discover that a good solution won't fit within the stated requirement (meaning that expectations need to be set more realistically by management-- or they need to pony up more money) than it will cost to do something half-assed that ends up not fulfilling the requirement at all and blowing a ton of money in the process.
It sounds like your boss just pulled that number out of the air. Perhaps he's done some analysis and knows the cost-per-hour of downtime for the various systems, but I doubt it. It sounds like a pie-in-the-sky number that isn't tied to reality. I'd be surprised if all your systems need that kind of availability. You may discover, in the course of studying the business, that only a subset of functionality needs such a degree of uptime and fault tolerance (and, thus, that the solution will ultimately cost less). I'm sure the phones and the line-of-business application are up there, but you may have some tolerance for downtime on the other systems.
My gut says that you're probably going to find a win in using virtualization technologies to create a failover system based on migration of virtual machines between redundant hardware. Whether it'll fit your budget or not will depend on your business, since you'll definitely need some type of SAN to make that work effectively.
Don't discount "traditional" failover clustering, though. There are definitely "wins" there, too, if your applications are well suited to such a configuration.
I wonder if your boss has thought about catastrophic failure scenarios (building burns, flood, tornado, theft, etc). If that's not already planned-for, this would be a golden opportunity to work in some general business continuity planning and disaster recovery contingency.
Get some help from somebody who can come in and study your business and make recommendations. You won't regret it.
"This road leads to much pain and hurt..."
So, what is your business continuity plan? Your disaster recovery plan?
Have you discussed them? Written them down? TESTED THEM?
You need to have a proper conversation with the higher-ups and really get to the bottom of the requirements for high availability, because they are different for different services.
So what really was the "pain point" that they felt that morning?
What, exactly, was it?
I assume you have bought high-quality hardware for your main systems? Good, because cheaping out on hardware is a false economy; quality servers come with "dual" everything in the box.
I will also assume you know HOW to rebuild a server, swap fans and power supplies, rack a server, and configure dual-path networks into redundant switches? That you've done this enough times to understand what works and what doesn't, what is normal and what is erroneous? If not, then get help and training (or at least practice and experience).
Maybe a lot of the problem was FEAR. They did not have a clue that such a problem could happen (or how important the servers were to their business), and you didn't really know what you were doing(?). A confidence issue?
You need to get all the above right BEFORE going down the very expensive HA route. Can the business afford that expensive equipment, most of which, by definition, will only ever be used during a failure (and often never used at all)?
Evan hits on some good points, but here are some specific, cost-effective ways to get sub-1-hour recovery time in the face of failures.
Small business likely means small hardware, so it may not cost a lot to do some simple things that add a significant amount of resiliency in the face of problems. The main idea is to just have extra hardware ready to go.
First, get comfortable with the idea of a virtual IP (VIP). That's the IP address your users and applications talk to, but it can reside on whichever server you assign it to, and it'll be helpful for just about any solution you end up going with. Having a VIP means you shouldn't have to reconfigure any of the applications when failing over. Also keep in mind that redundant hardware increases administration overhead: you're doing two configuration updates instead of one.
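For concreteness, here's a minimal sketch of moving a VIP by hand with iproute2. The interface name and addresses are placeholders, not anything from your network:

```bash
# On the server that should own the VIP right now (eth0 and
# 192.168.0.1 are placeholders for your interface and chosen VIP):
ip addr add 192.168.0.1/24 dev eth0

# Send gratuitous ARP so LAN clients learn the new MAC behind the VIP
# instead of continuing to talk to the dead box:
arping -U -I eth0 -c 3 192.168.0.1

# On the old owner (if it's still reachable), release the address:
ip addr del 192.168.0.1/24 dev eth0
```

Every failover described below is ultimately some variation on those three commands.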
If we start with your routing / web proxy server, it's probably the easiest, since there won't be any real state that needs to be stored on the box itself. So just get a duplicate of the same box and configure it identically. I'd keep both plugged into the LAN segment and, assuming your internet connection is on another interface, swap cables if there's a failure. From a routing perspective, set all your LAN clients to target the .1 address (the VIP) for their default route and proxy server; give server A the .2 address and server B the .3 address. That way both can be managed for config updates (apply every change to both). All you have to do to fail over is remove the .1 assignment from .2, add it to .3, and move the internet connection to the other box. It's not very complicated, it's easy to do and understand, and it costs only the extra hardware of a second box. If you can get redundancy on the internet side, you could add some complexity and get automatic failover using something like VRRP.
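If you do get to automatic failover, VRRP on Linux is commonly done with something like keepalived. A sketch of what that config looks like, with placeholder interface, router id, and addresses (server B gets the same block with `state BACKUP` and a lower priority):

```
# /etc/keepalived/keepalived.conf on server A (the .2 box)
vrrp_instance GW {
    state MASTER            # the peer uses "state BACKUP"
    interface eth0
    virtual_router_id 51    # must match on both boxes
    priority 150            # peer uses something lower, e.g. 100
    advert_int 1            # seconds between advertisements
    virtual_ipaddress {
        192.168.0.1/24      # the .1 VIP your clients point at
    }
}
```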
Without specifics it's hard to say, but your web server may be just as simple. Add a second server with identical configuration, create a VIP between the two, and move the VIP to the backup in the event of a failure. I generally don't mind if session state is lost on a failover (it took a critical problem to cause the failover in the first place), so if users have to log in again, no big deal. Again, VRRP can probably be used for automatic failover.
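One refinement worth thinking about here: have the failover track the web service itself, not just whether the box is alive. If you went with keepalived as sketched above, that's a `vrrp_script` health check (the check command, weights, and addresses are just illustrative):

```
vrrp_script chk_httpd {
    script "/usr/bin/curl -fsS -o /dev/null http://localhost/"
    interval 5      # check every 5 seconds
    fall 2          # 2 consecutive failures marks the service down
    rise 2          # 2 successes marks it back up
    weight -60      # failure drops priority below the peer's
}

vrrp_instance WEB {
    state MASTER
    interface eth0
    virtual_router_id 52
    priority 150
    advert_int 1
    virtual_ipaddress {
        192.168.0.4/24      # placeholder web VIP
    }
    track_script {
        chk_httpd
    }
}
```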
Moving on to your DB, this is significantly more complex. Most DBs have some sort of primary/secondary model, where you back up the original DB to the secondary and then ship all transaction logs or DB changes to it. Again, you can combine this with a VIP for the applications and users actually accessing the DB. However, failover is more complicated: depending on how the primary failed, you may need to actually get its drives up and running to copy over any leftover transaction logs before bringing the secondary active. If you can tolerate some lost data, you can bring the secondary active right away. After the failover, server B is now your primary, and your work is to restore server A and turn it into the new backup, so it's ready to be failed to when server B eventually has problems.
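What the failover moment actually looks like depends entirely on your database. Purely for illustration, promoting a MySQL-style replica by hand goes roughly like this (command names vary by product and version; the VIP move is the same trick as before, with placeholder addresses):

```bash
# On the secondary, after the primary has died:

# 1. Check how far replication got; wait until any remaining relay
#    log has been applied:
mysql -e "SHOW SLAVE STATUS\G"

# 2. Promote: stop replicating and make the box writable:
mysql -e "STOP SLAVE; RESET SLAVE ALL;"
mysql -e "SET GLOBAL read_only = OFF;"

# 3. Move the DB VIP here so applications follow without being
#    reconfigured:
ip addr add 192.168.0.5/24 dev eth0
arping -U -I eth0 -c 3 192.168.0.5
```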
File servers are always the hardest part, since unlike DBs, it's a lot harder to get this as a built-in feature of the filesystem. However, some level of resiliency can be achieved by having a second server and simply writing a script that scans the filesystem for changes and copies any new files to your secondary; you can basically run rsync from cron to do this. Again, use a VIP that you give to users and move over if you do a failover. In your script, I would highly recommend checking that the system owns the VIP before transferring files. You really, really, really don't want the rsync to execute in the wrong direction and overwrite changes your users are making. This approach can lose some files if there is a failure, and it also won't protect against users wiping out files themselves.
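A sketch of that ownership guard, with made-up VIP, hostname, and paths:

```bash
#!/bin/sh
# sync-to-standby.sh -- run from cron on BOTH file servers; only the
# current VIP owner pushes, so a failed-over secondary can never rsync
# stale data back over the live copy.

VIP="192.168.0.6"          # placeholder file-server VIP
PEER="files-b"             # placeholder: the *other* box's hostname
SRC="/srv/share/"

# Refuse to run unless this box currently holds the VIP:
ip addr show | grep -q "inet ${VIP}/" || exit 0

# Mirror to the standby. --delete keeps it a true mirror, which is
# exactly why the ownership check above has to be trustworthy.
rsync -a --delete "${SRC}" "${PEER}:${SRC}"
```

Then something like `*/10 * * * * /usr/local/sbin/sync-to-standby.sh` in root's crontab on both machines, with each box pointing PEER at the other.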
I have no idea what you could do about your phone system; it really depends on the vendor and how it's set up. The vendor may have an off-the-shelf solution for resiliency.
Some final words of warning: make sure you thoroughly test any setup you go with. Make sure you know how to fail over without losing critical information. Test, test, test, to make sure it will work when you need it to. Make sure you have processes in place so that configuration changes, software updates, etc. get applied properly to both primary and backup. The good news is you can probably do controlled failovers whenever you want to bring a server down for an upgrade. Since it's not an active-active setup, you otherwise have no idea whether the secondary will work when you need it.
I work in telecom, and our equipment is very highly redundant, including in most cases geographic redundancy. Our number one cause of failure is redundancy that isn't re-tested after changes, and changes made by people who don't understand how the redundancy model works. However, we have the added problem that all our equipment needs to support automatic failover in no more than several seconds. You can tolerate manual intervention in your failovers if you only need to be up and running within 30-60 minutes. You just need to be prepared. Good luck.
Everyone else's points are great, so just a couple of comments.
30 minutes is impossible to guarantee, especially for everything. You can say it's a target, but there is no way it can be a guarantee, because there is always the X factor. One example: you could have 2 ISP lines, and a truck crashes into the building and takes them both out, because you didn't think it mattered to have them routed in at opposite ends of the building.
As a start for costing, double everything. You have 5 servers, so you need to double that. It doesn't all need to be separate hardware (you can virtualize), but you see what I mean. On top of that, everything must be HA-aware, which will also add to the cost; you may find you're going to have to replace your router with a new one, and oh, you need 2 of them. Don't forget to double the power feeds and get the generator, because you can't guarantee the power company will be back up within 30 minutes.
These examples assume it's more or less a hot-standby setup, which is what I suspect your boss is thinking of.
What I find works better for a small business is to design a recovery plan and classify everything.
Figure out which services are:
- critical (business stops)
- important (business slows down)
- routine (business can make do without it for a while)
For instance, your call center phones are critical, so maybe that one is worth buying a second server and a second ISP; and your average power outage is about 15 minutes, so we'll get a UPS that will last 60 minutes (don't forget the workstations, either). Now let's say the ERP is only important, meaning you can function without it for a bit. Maybe your call center people use it, but if it's down, they can revert to pen and paper, or Notepad, and then update the ERP afterward. The procedure to handle it being down may be cheaper than trying to make it a critical service. And the routine ones might be something like printers: OK, it's a pain, but we can make do for a couple of days if they all go down.
That also gives you the order in which to fix stuff if the s**t really hits the fan one day :)
Is it possible? Sure. Is it affordable? Probably not for a "small business", especially if you have a boss giving you arbitrary numbers to work to and demanding high availability from an IT department that consists of one deputized programmer (I've seen it many times in other places, and it's never pretty for your stress levels if your situation is like theirs).
Failover is possible, but it usually requires redundant hardware, SANs to share data among servers, etc. In other words, good luck getting it funded if they won't hire a dedicated administrator to take care of it.
The call-system hardware you mentioned is specialized, and you alluded to being a call center. You should talk to the vendor about options for making it redundant; goofing with it yourself could void support in the first place.
For the other systems you could most likely gain some redundancy by investing in VMware-type solutions (or Hyper-V or XenServer, but I'd look at VMware and XenServer first). Then you can look at getting a SAN and a couple of beefy servers with fast network switches, and use VMotion (VMware's live-migration feature) to migrate virtualized servers between hardware servers if there's a failure, as well as balance some of the load between servers as needs come up.
You mentioned you're running Linux on those systems. With money for multiple servers, you could look instead at setting up DRBD with Heartbeat and STONITH to replicate data between servers and take over when one becomes unavailable. You'd be looking at literally duplicating each server, as well as doubling your power consumption and heat dissipation in the server room (if you have a server room). That can be done for the cost of the hardware and your sanity: you'd have to test it, you'd have downtime while configuring it, and there's still the possibility that it won't work at times, since issues can crop up that have to be taken care of (split brain, for example).
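For a sense of scale, the DRBD resource definition itself is only a handful of lines (hostnames, devices, and addresses below are made up); the Heartbeat and STONITH configuration layered on top is where the real complexity and testing effort lives:

```
# /etc/drbd.d/r0.res -- one replicated volume between two boxes
resource r0 {
  protocol C;                   # synchronous: writes complete on both
  on server-a {
    device    /dev/drbd0;
    disk      /dev/sdb1;        # backing partition on server A
    address   192.168.0.11:7788;
    meta-disk internal;
  }
  on server-b {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.12:7788;
    meta-disk internal;
  }
}
```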
Last is a plan for getting a couple of systems to act as blank-slate spares, plus a really good backup plan, so you can restore data to one of the "blank" systems if a server dies. Having hardware onsite will give you some options if/when a server dies, but you'll still have some downtime while restoring data, and you need instructions on how to properly install your applications on the new server. Depending on how fast you work and how big the data is, you may have downtime lasting from a few hours to a day or two. You do have a working, known-good backup of your servers, with a recovery plan in place, yeah?
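If you go that route, script the restore drill so it actually happens. Even something this crude, run monthly against the spare, will tell you whether the backups are real (every path and check here is a placeholder for your own tooling):

```bash
#!/bin/sh
# restore-drill.sh -- unpack last night's backup onto the spare and
# sanity-check the result. Placeholder paths throughout.

BACKUP="/backups/latest.tar.gz"
TARGET="/srv/restore-test"

rm -rf "${TARGET}" && mkdir -p "${TARGET}"
tar -xzf "${BACKUP}" -C "${TARGET}" || { echo "restore FAILED"; exit 1; }

# Spot-check that something you care about actually came back:
test -s "${TARGET}/etc/fstab" || { echo "restore looks empty"; exit 1; }
echo "restore drill OK: $(du -sh "${TARGET}" | cut -f1) restored"
```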
Should you attempt it? My first reaction is that if you're scratching your head at any of the suggestions, or feeling a pit in your stomach trying to think this stuff through, then you shouldn't. You'd need a consulting company to come in, look at the issue, work out the costs, and implement it, or you need to hire a dedicated sysadmin to do it for your company.
The fact that they're telling you to do it, that you're "just a programmer who was promoted", and that you have a PHB demanding redundancy with a maximum failure time of 30 minutes means you're kind of up a creek.