We received an interesting "requirement" from a client today.
They want 100% uptime with off-site failover on a web application. From our web application's viewpoint, this isn't an issue. It was designed to be able to scale out across multiple database servers, etc.
However, from a networking standpoint, I just can't figure out how to make it work.
In a nutshell, the application will live on servers within the client's network. It is accessed by both internal and external people. They want us to maintain an off-site copy of the system that in the event of a serious failure at their premises would immediately pick up and take over.
Now we know there is absolutely no way to resolve it for internal people (carrier pigeon?), but they want the external users to not even notice.
Quite frankly, I haven't the foggiest idea of how this might be possible. It seems that if they lose Internet connectivity, we would have to make a DNS change to direct traffic to the off-site machines... which, of course, takes time.
Ideas?
UPDATE
I had a discussion with the client today and they clarified the issue.
They stuck by the 100% number, saying the application should stay active even in the event of a flood. However, that requirement only kicks in if we host it for them. They said they would handle the uptime requirement if the application lives entirely on their servers. You can guess my response.
Wikipedia has a handy chart of the pursuit of nines, mapping each availability level to the downtime it allows per year.
Interestingly, only 3 of the top 20 websites managed to achieve the mythical five nines (99.999%) uptime in 2007: Yahoo, AOL, and Comcast. In the first four months of 2008, some of the most popular social networks didn't even come close.
From the chart, it should be evident how ridiculous the pursuit of 100% uptime is...
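If you'd rather derive those numbers than take the chart on faith, the arithmetic is a few lines (plain Python, no dependencies):

```python
# Downtime allowed per year at each level of "nines" availability.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for nines in range(1, 7):
    availability = 1 - 10 ** -nines            # e.g. 3 nines -> 0.999
    downtime = SECONDS_PER_YEAR * (1 - availability)
    print(f"{availability:.4%} uptime allows {downtime:,.1f} seconds "
          f"({downtime / 60:,.1f} minutes) of downtime per year")
```

Note that 100% never shows up in that loop: there is no finite downtime budget to print for it.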
Ask them to define 100%: how will it be measured, and over what time period? They probably mean "as close to 100% as they can afford". Give them the costings.
To elaborate: I've been in discussions with clients over the years about supposedly ludicrous requirements. In every case, they were actually just using language that wasn't precise enough.
Quite often they frame things in ways that appear absolute, like 100%, but on deeper investigation they are reasonable enough to do the cost/benefit analysis required, once presented with costings and risk-mitigation data. Asking them how they will measure availability is a crucial question. If they don't know, then you are in the position of having to suggest that this needs to be defined first.
I would ask the client to define what would happen, in terms of business impact and cost, if the site went down under different circumstances, and also how they would measure that downtime.
In this way you can work with them to determine the right level of "100%". I suspect that by asking these kinds of questions they will be better able to prioritise their other requirements. For example, they may want to pay for a certain level of SLA and compromise on other functionality in order to achieve it.
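As a small illustration of why the measurement question matters, here is a sketch (plain Python, with made-up outage figures) of how the same incidents score very differently depending on the measurement window:

```python
# Hypothetical outage log: durations in minutes, all within one month.
outages_minutes = [12, 3, 45]          # made-up numbers for illustration

minutes_per_month = 30 * 24 * 60
minutes_per_year = 365 * 24 * 60

downtime = sum(outages_minutes)
monthly = 1 - downtime / minutes_per_month
yearly = 1 - downtime / minutes_per_year

print(f"Measured over the month: {monthly:.4%}")   # 99.8611%
print(f"Measured over the year:  {yearly:.4%}")    # 99.9886%
```

The same hour of outages is comfortably above three nines when averaged over a year, and well short of it when measured against the month in which it happened; that's exactly the ambiguity the contract has to pin down.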
Your clients are crazy. 100% uptime is impossible no matter how much money you spend on it. Plain and simple: impossible. Look at Google, Amazon, etc. They have nearly endless amounts of money to throw at their infrastructure and they still manage to have downtime. You need to deliver that message to them, and if they keep insisting on 100%, push them toward a demand that's actually reasonable. If they don't recognize that some amount of downtime is inevitable, then ditch 'em.
That said, you seem to have the mechanics of scaling and distributing the application itself covered. The networking portion will need to involve redundant uplinks to different ISPs, getting an ASN and your own IP allocation, and getting neck-deep in BGP and real routing gear so that your address space can move between ISPs if need be.
This is, quite obviously, a very terse answer. You haven't had experience with applications requiring this degree of uptime, so you really need to get a professional involved if you want to get anywhere close to the mythical 100% uptime.
Well, that's definitely an interesting one. I'm not sure I would want to get myself contractually obligated to 100% uptime, but if I had to I think it would look something like this:
Start with the public IP on a load balancer sitting completely outside the client's network, and build at least two of them so that one can fail over to the other. A program like Heartbeat can handle the automatic failover between them.
Varnish is primarily known as a caching solution, but it does very decent load balancing as well, so it might be a good choice here. It can be set up with 1 to n backends, optionally grouped in directors, which it will balance either randomly or round-robin. Varnish can be made smart enough to check the health of every backend and drop unhealthy backends out of the loop until they come back online. The backends do not have to be on the same network.
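This isn't Varnish configuration, just a rough sketch in Python of the behaviour being described: health checks plus round-robin selection across backends that need not share a network (the hostnames and the /health URL are placeholders):

```python
import itertools
import urllib.request

# Placeholder backends; in reality these would be the client's servers
# plus one or more off-site copies.
BACKENDS = ["http://app1.example.com", "http://app2.example.com",
            "http://offsite.example.net"]

_round_robin = itertools.cycle(BACKENDS)

def healthy(backend, timeout=2):
    """Probe a health-check URL; any HTTP error or timeout marks the backend down."""
    try:
        with urllib.request.urlopen(backend + "/health", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def pick_backend():
    """Take the next backend in round-robin order, skipping unhealthy ones."""
    for _ in range(len(BACKENDS)):
        backend = next(_round_robin)
        if healthy(backend):
            return backend
    raise RuntimeError("no healthy backends left")
```

Varnish does all of this natively with backend probes and directors; the sketch is only meant to show the shape of it.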
I'm kind of in love with Elastic IPs in Amazon EC2 these days, so I would probably build my load balancers in EC2 in different regions, or at least in different availability zones in the same region. That would give you the option of manually (God forbid) spinning up a new load balancer if you had to and moving the existing A-record IP to the new box.
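For what it's worth, re-pointing an Elastic IP at a replacement instance is a single API call. Here's a sketch using boto3 with placeholder IDs; note that an Elastic IP can only be re-associated within its own region, so failing over between regions would still mean a DNS change:

```python
import boto3

# Placeholder identifiers; an Elastic IP can only be re-associated with an
# instance in the same region, so cross-region failover still means DNS.
ALLOCATION_ID = "eipalloc-0123456789abcdef0"
STANDBY_INSTANCE_ID = "i-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-1")

# Re-point the Elastic IP at the standby load balancer instance.
ec2.associate_address(
    AllocationId=ALLOCATION_ID,
    InstanceId=STANDBY_INSTANCE_ID,
    AllowReassociation=True,   # move it even if it's currently attached elsewhere
)
```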
Varnish cannot terminate SSL, though, so if that is a concern you may want to look at something like Nginx instead.
You could have most of your backends in your client's network and one or more outside it. I believe, but am not 100% sure, that you can prioritise the backends so that the machines in your client's network receive traffic until all of them become unhealthy.
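Building on the earlier sketch, that prioritisation might look like the following (placeholder hostnames again): try the client's own backends first and only fall back to the off-site ones when none of them pass a health check. In Varnish itself this would be done with its director configuration rather than anything like this.

```python
# Backends grouped by preference: the client's network first, off-site last.
BACKEND_GROUPS = [
    ["http://app1.client.example.com", "http://app2.client.example.com"],
    ["http://offsite1.example.net"],
]

def pick_prioritised_backend():
    """Return the first healthy backend from the highest-priority group."""
    for group in BACKEND_GROUPS:
        for backend in group:
            if healthy(backend):          # healthy() from the sketch above
                return backend
    raise RuntimeError("no healthy backends in any group")
```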
That's where I would start if I had this task and undoubtedly refine it as I go along.
However, as @ErikA states, it's the Internet, and there will always be parts of the network outside your control. You'll want to make sure your legal agreement only holds you responsible for things that are under your control.
No problem, though I'd want slightly revised contract wording.
If Facebook and Amazon can't do it, then you can't. It's as simple as that.
To add oconnore's answer from Hacker News:
I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% sounds reasonable to them. The engineer, as engineers are prone to do, remembered his first day of prob & stat 101 without considering that the client might not have. When they say this, they aren't thinking about nuclear winter; they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down. Furthermore, you can accomplish this. With geographically distinct, independent, self-monitoring servers, you will have essentially no downtime. With three servers operating at independent(1) three-nines reliability, with good failover modes, your expected downtime is under a second per year(2). Even if it all happens at once, you are still within a reasonable SLA for web connections, so the downtime practically does not exist. The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.
(1) A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.
(2) DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
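The arithmetic behind that "under a second per year" figure, assuming truly independent failures, is a few lines:

```python
# Three geographically independent servers, each at "three nines".
per_server_unavailability = 1 - 0.999           # down 0.1% of the time
all_down = per_server_unavailability ** 3        # assumes independent failures

seconds_per_year = 365.25 * 24 * 3600
expected_downtime = all_down * seconds_per_year

print(f"P(all three down) = {all_down:.1e}")               # 1.0e-09
print(f"Expected downtime = {expected_downtime:.3f} s/yr")  # ~0.032 s/yr
```

That number only covers overlapping outright failures; the few seconds of DNS failover mentioned in (2) sit on top of it.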
You are being asked for something impossible.
Review the other answers here, sit down with your client, and explain WHY it's impossible, and gauge their response.
If they still insist on 100% uptime, politely inform them that it cannot be done and decline the contract. You will never meet their demand, and if the contract doesn't totally suck you'll get skewered with penalties.
Price accordingly, and then stipulate in the contract that any downtime past the SLA will be refunded at the rate they are paying.
The ISP at my last job did exactly that. We had the choice of a "regular" DSL line with 99.9% uptime for $40/mo, or a bonded trio of T1s with 99.99% uptime for $1,100/mo. There were frequent outages of 10+ hours per month, which brought their uptime well below that of the $40/mo DSL, yet we were only refunded around $15 or so, because that's what the hourly rate times the hours of downtime came to. They made out like bandits on the deal.
If you bill $450,000 a month for 100% uptime and only hit 99.999%, the missing 0.001% of the month works out to a refund of about $4.50. I'm willing to bet the infrastructure costs to hit 99.999% are in the neighborhood of $45,000 a month, assuming fully distributed colos, multiple tier-1 uplinks, fancypants hardware, and so on.
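To make that arithmetic explicit, assuming the refund is simply the downtime's share of the bill at the rate being charged:

```python
def refund(monthly_bill, measured_uptime):
    """Refund the downtime fraction of the month at the rate being billed."""
    return monthly_bill * (1 - measured_uptime)

# The bonded-T1 example: ~$1,100/month, 10 hours down in a ~730-hour month.
print(f"${refund(1100, 1 - 10 / 730):.2f}")    # $15.07

# The hypothetical $450,000/month contract that hits 99.999% instead of 100%.
print(f"${refund(450_000, 0.99999):.2f}")      # $4.50
```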
If professionals question whether 99.999% availability is ever a practical or financially viable possibility, then 99.9999% availability is even less practical, let alone 100%.
You will not meet a 100% availability goal over any extended period of time. You may get away with it for a week or a year, but then something will happen and you will be held responsible. The fallout can range from a damaged reputation (you promised, you didn't deliver) to bankruptcy from contractual penalties.