I have been looking into MTTF, MTBF and MTBR for our HP Gen9 servers running in our production environment.
The root of my question: should I be worried or not?
I cannot seem to get any good data as each server has a mix of hardware.
At my last company we ran about 2,000 Dell servers (R210, R410, R710). I would say on average about 5 servers a day had some sort of failure, so about 0.25% of the servers per day went hard down and needed a part replaced before they could be used again.
At my last company everything was set up in HA pairs on an N+2 infrastructure, so there was no impact to production. We were able to replace the servers and keep going.
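To put rough numbers on the failure rate from my last company, here is a quick back-of-envelope sketch. It assumes the failures were spread evenly over the fleet and over time (a constant failure rate), which is a simplification; the variable names are just mine for illustration.

```python
# Back-of-envelope failure-rate arithmetic from my last company's numbers.
# ASSUMPTION: failures are independent and evenly spread (constant rate).
FLEET_SIZE = 2000       # approximate number of Dell servers
FAILURES_PER_DAY = 5    # rough average of hard-down servers per day

# Fraction of the fleet that fails on a given day
daily_failure_rate = FAILURES_PER_DAY / FLEET_SIZE

# Expected number of failures for a single server over one year
failures_per_server_year = daily_failure_rate * 365

# Implied mean number of days between failures for a single server
mtbf_days = 1 / daily_failure_rate

print(f"Daily failure rate: {daily_failure_rate:.2%}")
print(f"Failures per server per year: {failures_per_server_year:.2f}")
print(f"Implied MTBF per server: {mtbf_days:.0f} days")
```

Under those assumptions each server needed a part replaced a bit less than once a year on average, which is only as reliable as my rough estimate of 5 failures a day.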
At my current office we run 9 servers (HP Gen9, 56 VMs on Hyper-V). We do not keep a lot of replacement parts on hand, and our datacenter is not managed, so if something dies we have to drive about 45 minutes to replace anything.
Neither my CTO nor my IT manager seems to be worried. They had about 2.5 days of downtime last year. I have been arguing that we need to cluster the servers, but they do not see a need.
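For context, this is roughly what those numbers look like as availability, plus what I would expect if the failure rate from my old Dell fleet (about 0.25% of servers per day) applied here. That second part is purely an assumption on my side: I have no real failure data for the HP Gen9 hardware, and I am treating the 2.5 days as downtime of the whole environment.

```python
# Rough availability and expected-failure estimate for the current site.
SERVERS = 9                  # current number of HP Gen9 hosts
DOWNTIME_DAYS = 2.5          # approximate downtime last year
# ASSUMPTION: daily failure rate borrowed from my old Dell fleet;
# it may not apply to HP Gen9 hardware at all.
ASSUMED_DAILY_RATE = 0.0025

# Fraction of the year the environment was up
availability = 1 - DOWNTIME_DAYS / 365

# Expected number of hardware failures across all hosts in one year
expected_failures_per_year = SERVERS * ASSUMED_DAILY_RATE * 365

print(f"Availability last year: {availability:.2%}")
print(f"Expected hardware failures per year: {expected_failures_per_year:.1f}")
```

With these inputs it comes out to about 99.32% availability and roughly 8 hardware failures a year across the 9 hosts, each of which would mean a 45 minute drive if the old rate held.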
Is there a wrong or right here? Not sure what to do.
I know it's not my responsibility; if something happens, it's on the CTO. This is a very small company: only the CTO, the IT manager, myself (DevOps) and 1 help desk guy.
Overall, the experience here in running a production environment is very limited. The way a lot of things are set up I would call very junior level; neither my CTO nor IT manager knew a lot about clustering before I got there. They were in the middle of a project to set up DR without HA, which I argued against but lost.