We're using a SAN on a project at work, and there's a bit of debate around the fact that technically it's a single point of failure. No one seems to have any hard data.
The SAN in question is a single physical box, but with internal redundant components (sorry - not sure what level of RAID it has, but I can find out).
What's the typical MTBF for a SAN? The PM has it down on the project's risk register as "Quite Common" - I've never heard of a SAN going down, but I don't have any stats to show how likely it really is.
Does anyone have any helpful info?
It's really not common at all; in fact, I'd say it's almost exactly as common as losing power to the whole room, because if they're configured and maintained correctly, power loss is the only real way of losing a complete SAN box.
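If you want to put an actual number against that "Quite Common" rating, here's a minimal sketch of the standard exponential failure model. The MTBF and project-lifetime figures below are purely illustrative, not vendor data - substitute the real spec-sheet number for your array:

```python
import math

# Illustrative figures only -- replace with the vendor-quoted MTBF
# for the actual array and your real project lifetime.
mtbf_hours = 500_000          # assumed whole-array MTBF
project_years = 3             # assumed project lifetime

hours_at_risk = project_years * 365 * 24

# Exponential failure model: P(failure within t) = 1 - e^(-t / MTBF)
p_failure = 1 - math.exp(-hours_at_risk / mtbf_hours)
print(f"P(total failure within {project_years} years) = {p_failure:.1%}")
# With these numbers: ~5.1% over the whole project -- hardly "Quite Common"
```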
That said, you need to ensure that they're powered from two separate UPSes, have dual controllers, dual switches, and diversely routed fibres, and that you plan your shelf/array layout to cater for whole-shelf loss. If you do that, you're about as well covered as you can be without a second site.
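To see why each of those duplications matters, here's a rough back-of-the-envelope calculation. It assumes independent failures (which dual controllers sharing a backplane only approximate), and the per-component availability figures are invented for illustration:

```python
def pair_availability(single: float) -> float:
    """Availability of a redundant pair where either unit alone suffices."""
    return 1 - (1 - single) ** 2

# Invented per-component availabilities, for illustration only.
controller = 0.999
ups_feed = 0.999
fabric_switch = 0.999

# The array stays up only if every redundant pair keeps one survivor,
# so the pair availabilities multiply together in series.
redundant = (pair_availability(controller)
             * pair_availability(ups_feed)
             * pair_availability(fabric_switch))
non_redundant = controller * ups_feed * fabric_switch

print(f"no redundancy:   {non_redundant:.6f}")  # ~0.997, roughly a day of outage a year
print(f"dual everything: {redundant:.6f}")      # ~0.999997, roughly 2 minutes a year
```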
Without knowing the exact SAN in question and how it is configured and managed, any answer to this question is a guess. I say this for 2 reasons:
Some SANs are better than others. We have an ancient EMC CX500 that has been in production for 7 years without a single hiccup. We have a Dell MD3000i that has had constant trouble. You get what you pay for.
Even the best SAN, when managed or configured poorly, can have poor uptime. I've seen a foolish admin cause a $2 million EMC Symmetrix to fail twice in one month. Before we hired him, it had been up for nearly four years straight with no issues.
Since the beginning of the year, we've had all kinds of trouble, to the point where 'next available maintenance window' was a euphemism for the SAN being down. If you listen to sales, they're all kinds of solid. In practice, you don't have the expertise to torture-test the SAN before going into production, so it's up to the arrows of fate to expose your configuration problems at times of high demand.
The incredibly complicated SAN software or configuration failing is an unknown quantity compared to actual disk drives and other hardware. What this ultimately means is that you can add as much physical redundancy as you want, but since it's all running the same broken software, you've still got a single point of failure.
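To make that concrete, here's a toy calculation (all probabilities invented for illustration): once a common-mode software fault is in the picture, adding more hardware redundancy quickly stops moving the needle.

```python
# Invented annual failure probabilities, for illustration only.
p_hw = 0.02    # one controller/shelf dying outright
p_sw = 0.01    # a firmware or config bug taking out the whole array at once

def p_total_outage(n_way_redundant: int) -> float:
    """P(total outage) with n-way redundant hardware that all runs
    the same software -- the software is a common-mode failure."""
    p_all_hw_dead = p_hw ** n_way_redundant
    # Outage if the shared software breaks OR every hardware copy fails.
    return 1 - (1 - p_sw) * (1 - p_all_hw_dead)

for n in (1, 2, 3):
    print(f"{n}-way hardware redundancy: P(outage) ~= {p_total_outage(n):.4f}")
# 1-way: ~0.0298, 2-way: ~0.0104, 3-way: ~0.0100 -- it flattens at p_sw.
```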
That said, we seem to be running much smoother since we took the whole thing down for a firmware patch. Even so, the summary report of our SAN repair leaves me worried that a bit too much magical thinking still gets attributed to the SAN.
As others have pointed out, it's not common for a properly configured and spec'd storage backend (redundant controllers, power, switches, etc.) to go down. I'd seriously ask the PM to discuss, at length, the thinking behind rating it a common risk.
Technically, it is always worth documenting a single point of failure as part of a risk assessment, but there's a serious discussion to be had about whether fully redundant storage in an HA configuration really represents a "single point of failure." It may or may not, depending on your org and the app. If it is a single point of failure, it's also worth discussing failure scenarios for loss of service to the whole datacenter, since a total failure of a redundant, HA SAN that leaves everything else up and available is unlikely.
Dealing with those kinds of scenarios is pretty expensive: redundant datacenters to start with, plus things like geographically stretched fabrics, multiple fully redundant SANs, and "real-time replication" for the storage portion. The scenarios and apps that genuinely require these things are not all that common.
Just my personal experience: I've run into firmware and controller bugs that cause isolated problems. On one rare occasion, I even ran into a bug that caused one controller in an active-active pair to take a dump and trigger failover. This did not cause downtime.
I have heard of nightmare scenarios such as controller split-brain or whatnot that lead to total array collapse, but they're rare, and it's never definitive that human error or misconfiguration wasn't the real cause. (Human error and misconfiguration are huge issues...I don't mean to downplay them...but they aren't SPOFs in the same sense that a single SAN is.)