We recently had two systems, one ESX and one ESXi, where the boot partition became degraded due to a failed disk. The only alert we managed to capture was the visual alert on the Dell servers. We failed to receive any electronic alerts regarding the failed or degraded array.
Does anyone have any experience with monitoring for these types of failures? In both cases, the servers were running in a RAID 5 SCSI configuration (5 disks on one system, 3 disks on the other). If we had been running a Windows Server OS, an alert would have been created in the Event Viewer. Where would I begin to look for a solution? Can it be configured in vCenter or vFoglight?
We use DRACs in our servers to alert on hardware issues on the host. Alternatively, you could install OpenManage Server Administrator, provided you're running a supported version of ESXi.
I use 3 products to monitor my hosts: vCenter, vFoglight, and of course the Dell OpenManage/DRAC card software on my Dell boxes.
Of all of these, I find the Dell software is the best for alerting me to actual hardware problems. I prefer vFoglight for performance monitoring of the guests/hosts and for letting me know about available resources. If I were you, I would set up Dell OpenManage and the DRAC card.
1) Go to support.dell.com. Select your server, or enter the service tag. Select your OS in the drop-down. Download Dell OpenManage Server Administrator (stand alone).
2) Install by mounting the ISO from vCenter, or burn it to a physical disc and connect the DVD drive in vCenter.
3) Do the express install. You may have to restart the ESX management services and/or the Dell services.
4) Connect to OpenManage via a browser at https://<your server IP or DNS name>:1311
5) Configure OpenManage (set up alarms, set up the SMTP server).
6) Set up the DRAC in OpenManage: give it an IP and change the default/root password.
7) Connect to the DRAC side in a web browser and check the configuration.
Alerts will now come off the box from your DRAC IP. After all the mail is set up, do a test by pulling 1 power cable and reconnecting it. You should see 4 mails come off the box (power redundancy degraded, power redundancy lost, then it coming back). It's just a simple test to make sure the mail is getting through.
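If the OpenManage page in step 4 won't load, it can help to first check whether port 1311 is reachable at all from your workstation. Here is a minimal Python sketch for that. The function name `port_is_open` and the host name in the comment are my own placeholders, and it only tests that a TCP connection opens; it does not confirm that OpenManage itself is healthy.

```python
import socket


def port_is_open(host, port=1311, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    1311 is the OpenManage web port mentioned in step 4 above.
    """
    try:
        # create_connection resolves the name and opens a TCP socket;
        # the with-block closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example call (hypothetical host name, replace with your own):
#   port_is_open("esx01.example.com")
```

A False result usually points at a firewall rule or at the Dell services not running, which is where the service restart in step 3 comes in.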
I monitor my ESX servers through CIM using a Python script.
I'm not familiar with vFoglight.
In vCenter you can set up an alert based on hardware status. I can't find a good link for you at the moment.
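For anyone wondering what such a script might look like, here is a rough sketch rather than my exact script. It assumes the third-party pywbem package, the standard WBEM HTTPS port 5989, the root/cimv2 namespace, and the CIM_StorageVolume class. Which classes actually return data depends on the CIM providers installed on the host, so treat the class name as a placeholder. The helper names `unhealthy` and `query_host` are made up for illustration.

```python
# HealthState codes as defined in the DMTF CIM schema.
HEALTH_STATES = {
    0: "Unknown",
    5: "OK",
    10: "Degraded/Warning",
    15: "Minor failure",
    20: "Major failure",
    25: "Critical failure",
    30: "Non-recoverable error",
}


def unhealthy(elements):
    """Return (name, state label) for each element reporting HealthState >= 10.

    elements: iterable of mapping-like objects exposing 'ElementName' and
    'HealthState' via .get(). Missing, Unknown (0) and OK (5) are skipped
    to keep the noise down.
    """
    problems = []
    for el in elements:
        state = el.get("HealthState")
        if state is None or int(state) < 10:
            continue
        code = int(state)
        label = HEALTH_STATES.get(code, "Unrecognized (%d)" % code)
        problems.append((el.get("ElementName"), label))
    return problems


def query_host(host, user, password, classname="CIM_StorageVolume"):
    """Fetch CIM instances from one host (assumes pywbem is installed)."""
    import pywbem  # third-party; imported here so unhealthy() works without it

    conn = pywbem.WBEMConnection(
        "https://%s:5989" % host,
        (user, password),
        default_namespace="root/cimv2",
        no_verification=True,  # assumption: host uses a self-signed certificate
    )
    return conn.EnumerateInstances(classname)
```

You would run something like `unhealthy(query_host(host, user, password))` from cron and mail yourself whenever the returned list is not empty.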