This question is more of a math question than a server question, but it is strongly server related.
If I have a server for which I can guarantee 95% uptime, and I put that server in a cluster of 2, what would the uptime be then? Now, let's say I do the same, but I make it a cluster of 3?
Let's not consider things like a single point of failure, but purely focus on the math here. One of the things that makes this a bit complicated is that if, for example, I have 2 servers, the chance that they are both off is 2^2, so that's 1/4th; or for 3 that's 2^3, so 1/8. Considering I have a 5% downtime for each of these servers, would the total average then be 1/8th of that 5%?
How would you calculate something like this?
Uptime is a slippery thing... If you want to calculate the availability of a service then it is simply
If you have a cluster providing the service, then the likelihood that the service becomes unavailable goes down, but the availability (uptime) calculation for the service stays the same.
The chance of one server being offline is (1 - 0.95) = 0.05. The chance of both servers being offline is (1 - 0.95) * (1 - 0.95) = 0.0025, and so on.
So using your model, and from a purely mathematical point of view, one or both of the servers should be up 99.75% of the time.
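The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the same assumption of independent failures; the function name `at_least_one_up` is made up for this example.

```python
def at_least_one_up(availability: float, servers: int) -> float:
    """Probability that at least one of `servers` machines is up,
    assuming every machine fails independently of the others."""
    all_down = (1.0 - availability) ** servers  # every machine offline at once
    return 1.0 - all_down


for n in (1, 2, 3):
    # Rounded for display: 0.95, 0.9975, 0.999875
    print(n, round(at_least_one_up(0.95, n), 6))
```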
However, I'm not sure that using such a mathematical model is the correct way to work out your potential uptime, as there are other factors that may affect it which are common to both servers. For example, the 95% might be because 5% of the time there is a power cut, which would affect BOTH servers, so having a cluster would make no difference.
This depends on why your servers are down 5% of the time. If you have power 95% of the time, but your servers are otherwise flawless, then a second server at the same location does not increase your uptime at all: if one goes down, both go down. This is an example of the failures being correlated. It's likely that at least some of your downtime is due to errors that affect all servers together (power...). But some of the downtime will be independent between servers. If you want to do it properly, you ought to deal with these things separately. So you want to work out the probability that server 1 does not have an independent error (p), that server 2 does not have an independent error (q), and that there is no systemic error that kills both (r). It would be relatively safe to assume that these errors are independent of each other, and thus you could just multiply the probabilities together: p*q*r is the probability of both servers being up, and r * (1 - (1-p)*(1-q)) is the probability of at least one server being up.
The problem is, you can't use actual uptime data to give you values for p, q, and r, except that if you have just server 1 and it is up 95% of the time, then p*r = 0.95.
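To see how much the split between shared and independent failures matters, here is a rough Python sketch of the p, q, r model: the cluster is up when there is no systemic failure and the two servers are not both hit by an independent failure. The 0.97 figure for r in the middle case is a made-up assumption for illustration only, not measured data.

```python
def cluster_up(p: float, q: float, r: float) -> float:
    """At least one of two servers is up.

    p, q: probability that server 1 / server 2 has no independent failure.
    r:    probability that no shared (systemic) failure is in progress.
    All three are assumed independent of each other.
    """
    return r * (1.0 - (1.0 - p) * (1.0 - q))


# Each case keeps a single server at p * r = 0.95.
all_shared = cluster_up(1.0, 1.0, 0.95)       # only power cuts: no gain from clustering
all_independent = cluster_up(0.95, 0.95, 1.0) # no shared cause: the 99.75% figure
mixed = cluster_up(0.95 / 0.97, 0.95 / 0.97, 0.97)  # hypothetical split, r = 0.97

print(all_shared, mixed, all_independent)
```

The mixed case lands between the two extremes, and it can never exceed r, so the shared failures cap what clustering can buy.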
First of all, the total availability or uptime of a cluster depends on how large a part of the cluster is needed to be active for the whole cluster to be considered 'up'.
As you found out, the first two cases are quite simple to calculate. Let the probability of a single server being online at any given time be p = 0.95. Now, for three servers, the probability that they are all online at the same time is p^3 = 0.857375.
For the opposite case, where at least one machine should be active at a given time, it's easier to calculate by inverting the problem and looking at the probabilities of the machines being offline. The probability that a single machine is offline is q = 1-p = 0.05, and hence the probability that they are all down at the same time is q^3 = 0.000125, giving probability 1-q^3 = 1-(1-p)^3 = 0.999875 that at least one is up.
The 2 out of 3 case is slightly harder to calculate. There are four possible situations where at least two out of three servers are up: 1) ABC are up, 2) AB are up, 3) AC are up, 4) BC are up. The probabilities for these are, respectively, p*p*p, p*p*q, p*q*p and q*p*p. Since the cases are disjoint, the probabilities can be added together, giving a total A = p^3 + 3*p^2*q = 0.992750.
(This can be expanded to more machines. The factors are the well known binomial coefficients, so counting the different cases by hand works mostly as an exercise.)
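For larger clusters the same sum can be written with binomial coefficients. The following is a small Python sketch, again assuming independent servers with identical availability; `at_least_k_up` is a name chosen for this example, and `math.comb` needs Python 3.8 or newer.

```python
from math import comb


def at_least_k_up(p: float, n: int, k: int) -> float:
    """Probability that at least k of n independent servers are up,
    each being up with probability p (binomial distribution)."""
    q = 1.0 - p
    # Add the disjoint cases "exactly i servers up" for i = k..n.
    return sum(comb(n, i) * p**i * q ** (n - i) for i in range(k, n + 1))


print(round(at_least_k_up(0.95, 3, 3), 6))  # all three up:      0.857375
print(round(at_least_k_up(0.95, 3, 2), 6))  # at least two up:   0.99275
print(round(at_least_k_up(0.95, 3, 1), 6))  # at least one up:   0.999875
```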
Of course, calculations like this are much easier to deal with by using a ready-made computer program... At least one online calculator can be found here:
http://stattrek.com/online-calculator/binomial.aspx
Entering the input values: probability of success = 0.95, number of trials = 3, number of successes = 2, we get the result "Cumulative Probability: P(X ≥ 2) = 0.99275". Some other related values are also given, and the online tool makes it easy to play with other numbers too.
And yes, all of the above assumes that the servers fail independently, that is a) I ignored any problems affecting the cluster as a whole, b) there isn't anything like component aging that would make it likely for the servers to fail at or nearly at the same time.
You have 5% downtime for each server, so you multiply the downtimes: 0.05 * 0.05 = 0.0025, giving you 1 - 0.0025 = 0.9975, i.e. >99% uptime. With 3 servers you have 1 - 0.000125 = 0.999875, i.e. >99.9% uptime.
I normally account for 97% availability for a standalone host (with redundant HDD and PSU), giving >99.9% for 2N and >99.99% for 3N redundancy.
I have done some more digging and found this piece of the puzzle.
Using the example of a server with an availability of 95%, then adding a second server would increase the availability to: 95% + (1-95%)*95% = 99.75%. The logic behind this is that when the 1st server is down (5% of the time), the second server is still up 95% of the time.
Adding a 3rd server would iterate through this the same way. The first 2 together are already 99.75% available, so adding the 3rd one would be: 99.75% + (1-99.75%)*95% = 99.9875%. And so on and so forth. This is close to Phil's answer, but still a bit different since you need to take the result of the previous iteration and use that in the next one.
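The iteration can be written as a short loop. This Python sketch (the helper name `add_server` is invented for the example) also prints 1 - 0.05^n next to each step, which gives the same figures, since both compute the chance that not every server is down at once.

```python
def add_server(current: float, server: float) -> float:
    """Availability after adding one more independent server:
    the new server only matters while the existing cluster is down."""
    return current + (1.0 - current) * server


total = 0.0  # an empty cluster is never up
for n in range(1, 4):
    total = add_server(total, 0.95)
    closed_form = 1.0 - 0.05**n
    # Rounded for display: 0.95, 0.9975, 0.999875 in both columns
    print(n, round(total, 6), round(closed_form, 6))
```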
For components that are dependent on each other you simply multiply the availability percentages, so if you have 2 components that are 50% available you have 25% total availability (i.e. the system works only when both components work).
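As a minimal Python sketch of that rule, assuming the components fail independently and all of them are required (`series_availability` is a name made up here; `math.prod` needs Python 3.8 or newer):

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be up,
    assuming the components fail independently."""
    return prod(components)


print(series_availability([0.5, 0.5]))  # 0.25
```

The same function can be fed the result of a redundant cluster as one of the components, which is how a mixed design (for example a clustered web tier in front of a single database) would be combined.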
Assuming the uptime of each server is independent of the others, the total uptime is
1 - 0.05^n
where n is the number of servers and 0.05 is the downtime probability of one server.