We have 6 ESX servers running 150+ VMs. Currently our vCenter server is one of these VMs. The other day we had a hardware failure in our DC (caused by a naughty UPS) which took out two of these servers. The first server it took out was running our primary vCenter server, the second was running our HA/heartbeat vCenter server, so none of the VMs on the two failed hosts were brought back up on the 4 working ones and we lost most of our VM management (our users all connect through vSphere). This is a very unfortunate set of circumstances and hopefully shouldn't happen too often, but it got me wondering: is it a good idea to run our primary vCenter server on a separate box in a different datacenter*/redundant block dedicated just to vCenter, with the backup being a VM? Is it even possible? (All we have is the virtual appliance, though if it's available I wouldn't have thought it would be too hard to track down.)
*I'm ashamed to say we run all our VMware servers in a single DC. We mirror the SAN to a second DC but we have no servers there. They're only development/non-critical servers, but people still shout if they're down.
There's no reason why not. I'm not aware of VMware specifically directing you to run vCenter on either physical or virtualised hardware; I believe it's supported on both.
Depending on what sorts of failures you're trying to protect against, it's usually a good idea to separate your redundant / standby instances from the primary / live instances as much as possible. Separate networks, cabs, power supplies and even buildings, cities and countries are all good ideas - they just cost different amounts and come with their own unique set of challenges.
In this particular case, it sounds like you had one of those outages which you hadn't designed or accounted for, or had knowingly chosen not to design around. Putting your management servers on the same infrastructure as the ESXi hosts, sharing the same power supplies, networks etc., runs the same risk of a single event taking everything out.
You have a choice: either don't change anything and live with the outages that result the next time this type of event happens, or spend some money to mitigate the risk. Either is a valid approach; it entirely depends on how much the outages will cost you versus how much it'll cost you to change.
I don't believe it makes a difference whether vCenter is installed on bare metal or virtualised. The only limitation I've seen with the current High Availability setup is that it requires less than 10 ms of latency between nodes. For us, this limits us to a single datacenter - I don't have any other datacenters close enough for 10 ms access.
Here's the Best Practices Guide for vCenter 6.5 High Availability.
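If you're wondering whether a second site could ever meet that latency budget, a rough first pass is just to measure the round-trip time from the box that would run the Active node to the would-be Passive/Witness addresses. The sketch below does that with plain ICMP ping on Linux; the peer IPs are placeholders, and ICMP RTT is only a proxy, so treat it as a sanity check rather than a substitute for the requirements in the guide.

```python
#!/usr/bin/env python3
"""Quick-and-dirty RTT check against the vCenter HA <10 ms latency requirement."""
import re
import subprocess

# Placeholder addresses for the would-be Passive and Witness nodes -- replace with yours.
PEERS = ["10.0.1.12", "10.0.1.13"]
RTT_LIMIT_MS = 10.0

def average_rtt_ms(host: str, count: int = 20) -> float:
    """Ping the host and parse the average RTT from the Linux ping summary line."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True, check=True,
    )
    # Summary line looks like: rtt min/avg/max/mdev = 0.312/0.418/0.611/0.087 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    if match is None:
        raise RuntimeError(f"couldn't parse ping output for {host}")
    return float(match.group(1))

for peer in PEERS:
    avg = average_rtt_ms(peer)
    verdict = "within budget" if avg < RTT_LIMIT_MS else "too slow for vCenter HA"
    print(f"{peer}: {avg:.2f} ms average -> {verdict}")
```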
Since you're probably stuck in a single datacenter with the 3 nodes the vCenter HA configuration requires, you'll want to eliminate as many other commonalities between them as you can: separate servers, separate racks, different parts of the room or building, different storage, etc. The more you can separate each node from the others, the better the chance that a single failure somewhere doesn't take all of them down.
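If the HA nodes live as VMs inside a DRS cluster, one way to enforce the "separate servers" part automatically is a VM-VM anti-affinity rule, so DRS never places two of the nodes on the same host. Here's a minimal pyVmomi sketch; the vCenter address, credentials, cluster name and node VM names are all placeholders for whatever your environment uses, and it assumes the nodes are manageable as ordinary VMs in that cluster.

```python
#!/usr/bin/env python3
"""Sketch: keep the vCenter HA node VMs on separate ESXi hosts via a DRS anti-affinity rule."""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details and object names -- substitute your own.
VCENTER = "vcenter.example.local"
USER = "administrator@vsphere.local"
PASSWORD = "********"
CLUSTER_NAME = "Cluster01"
NODE_VM_NAMES = ["vcsa-active", "vcsa-passive", "vcsa-witness"]

def find_objects(content, vim_type, names):
    """Return managed objects of the given type whose names are in `names`."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
    try:
        return [obj for obj in view.view if obj.name in names]
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab-style; use proper certificates in production
si = SmartConnect(host=VCENTER, user=USER, pwd=PASSWORD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    cluster = find_objects(content, vim.ClusterComputeResource, [CLUSTER_NAME])[0]
    node_vms = find_objects(content, vim.VirtualMachine, NODE_VM_NAMES)

    # Build a mandatory "keep these VMs apart" rule and add it to the cluster config.
    rule = vim.cluster.AntiAffinityRuleSpec(
        name="vcha-node-separation", enabled=True, mandatory=True, vm=node_vms
    )
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)]
    )
    task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
    print(f"Submitted anti-affinity rule task: {task.info.key}")
finally:
    Disconnect(si)
```

That only covers host-level separation, of course - racks, storage and power still have to be separated by hand.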