I have been looking into MTTF, MTBF and MTBR for our HP Gen9 servers running in our production environment.
The root of my question: should I be worried or not?
I cannot seem to get any good data as each server has a mix of hardware.
At my last company we ran about 2,000 Dell servers (R210, R410, R710). I would say that on average about 5 servers a day had some sort of failure, so about 0.25% of the servers per day went hard down and needed a part replaced before they could be used again.
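To make the arithmetic behind that 0.25% explicit, here is a rough sketch. The function names are made up for illustration, and the extrapolation to 9 servers assumes the old fleet's failure rate carries over unchanged to different hardware, which may well not hold.

```python
def daily_failure_rate(failures_per_day: float, fleet_size: int) -> float:
    """Fraction of the fleet that fails on an average day."""
    return failures_per_day / fleet_size


def expected_failures_per_year(rate_per_day: float, servers: int, days: int = 365) -> float:
    """Naive extrapolation: assumes every server fails at the same constant rate."""
    return rate_per_day * servers * days


rate = daily_failure_rate(5, 2000)
print(f"{rate:.2%} of the fleet per day")  # 0.25% of the fleet per day

# ASSUMPTION: the old Dell fleet's rate applies to the 9 HP Gen9 hosts.
print(f"{expected_failures_per_year(rate, 9):.1f} failures per year across 9 servers")
```

If that rate did carry over, it would mean roughly 8 hardware incidents a year across 9 hosts, which is the number I have in the back of my mind when I ask whether to worry.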
At my last company everything was set up in HA pairs on N+2 infrastructure, so there was no impact to production. We were able to replace the servers and keep going.
At my current office we run 9 servers (HP Gen9, hosting 56 Hyper-V VMs). We do not keep many replacement parts on hand, and our datacenter is not managed, so if something dies we have to drive about 45 minutes to replace anything.
Neither my CTO nor my IT manager seems worried. They had about 2.5 days of downtime last year. I have been arguing that we need to cluster the servers, but they do not see a need.
Is there a wrong or right here? Not sure what to do.
I know it's not my responsibility; if something happens, it's on the CTO. This is a very small company: only the CTO, the IT manager, myself (DevOps) and one help desk guy.
Overall, experience in running a production environment is very limited here. I would call the way a lot of things are set up very junior level, and neither my CTO nor my IT manager knew a lot about clustering before I got there. They were in the middle of a project to set up DR without HA, which I argued against but lost.
Don't worry about the MTTF, MTBF and MTBR figures... why would those apply to the specifics of your environment?
The servers have internal redundancies and can be extremely stable in production. But that depends on your environment, the disk array/composition, types of disks, RAM quantity, CPU configuration, thermal characteristics, power, etc.
Employing some form of high availability can reduce the potential for downtime and gives you a place to shift your workloads in the event of a failure.
This is a financial and operational risk question.
Perhaps the incremental cost of going from standalone to cluster is high enough that it doesn't make business sense? Perhaps the 2.5 days of downtime (~99.3% availability) is good enough for your operation. You should focus on offsite protection and good backups. All of your HP Gen9 systems are under manufacturer warranty today, so you do have access to parts. If you have RAID, redundant power supplies/fans and stable power, you've covered the most critical areas.
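For reference, the ~99.3% figure comes straight from the downtime number in the question. A minimal sketch, assuming the 2.5 days are full 24-hour calendar days measured over a 365-day year (the helper names are just for illustration):

```python
def availability(downtime_days: float, period_days: float = 365) -> float:
    """Fraction of the period during which the service was up."""
    return 1 - downtime_days / period_days


def allowed_downtime_hours(target: float, period_days: float = 365) -> float:
    """Downtime budget in hours for a given availability target."""
    return (1 - target) * period_days * 24


print(f"{availability(2.5):.2%}")  # 99.32%

# How much downtime per year a few common targets would allow.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_hours(target):.2f} hours/year")
```

Comparing last year's 60 hours against the budget for whatever target the business actually needs is usually a more useful conversation than component MTBF numbers.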
Think of this from a financial perspective and outline the risks, associated costs and try to make a compelling business case for what you want.
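One way to frame that business case is to put a price on an hour of downtime and compare it with what clustering would cost. The sketch below is only an illustration: the cost per hour, the yearly cluster cost and the share of downtime a cluster would avoid are all placeholder numbers I made up, and you would need to substitute your own.

```python
def annual_downtime_cost(downtime_hours: float, cost_per_hour: float) -> float:
    """Money lost per year to downtime at a flat hourly cost."""
    return downtime_hours * cost_per_hour


def break_even_cost_per_hour(cluster_annual_cost: float, hours_avoided: float) -> float:
    """Hourly downtime cost above which the cluster pays for itself."""
    return cluster_annual_cost / hours_avoided


downtime_hours = 2.5 * 24      # from the question: about 2.5 days last year
cost_per_hour = 500.0          # HYPOTHETICAL: lost revenue and staff time per hour
cluster_annual_cost = 20_000.0 # HYPOTHETICAL: extra hardware, licences, admin time
fraction_avoided = 0.8         # HYPOTHETICAL: share of downtime HA would prevent

loss = annual_downtime_cost(downtime_hours, cost_per_hour)
hours_avoided = downtime_hours * fraction_avoided
threshold = break_even_cost_per_hour(cluster_annual_cost, hours_avoided)

print(f"Current yearly loss: ${loss:,.0f}")
print(f"Cluster breaks even if downtime costs more than ${threshold:,.2f}/hour")
```

If your real cost per hour sits well above the break-even threshold, you have a compelling case. If it sits below, your CTO's position may be the rational one.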