Today I have had disks on two separate HP ProLiant servers go into Predictive Failure. One of these servers runs Windows Server 2008 R2 and one runs Oracle Enterprise Linux 5 (a RHEL5-based distro).
If I look in the Integrated Management Logs for these servers, the Windows server has a 'Caution' entry announcing the Predictive Failure, but the OEL server does not have the same.
We have some existing business process around the IML (ticket integration, reporting, etc.), hence the preference to have these messages there. All the right bells and whistles sounded for the Windows box, but nothing from the OEL server.
I've gone back through my monitoring system's alert history and it shows that this has always been the case -- the Windows server reports its disk failures (predictive and actual), while the OEL server does not.
SNMP trap alerts appear to be working; these are logged in root's mail file and are captured in the /var/log/messages file. Interestingly, the IML on the OEL server does appear to be showing me Repaired entries for previous disk failures. It is just the initial Caution or Failure entry that appears to be missing from the log.
The Windows server has all the HP Management agents installed as part of the Intelligent Provisioning/Smart Start install of the OS. The OEL server has the RHEL5 HP yum repo enabled, and has the hpsmh, hpilo, hp-health and hp-snmp-agents packages installed.
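For anyone comparing against their own box, this is a minimal sketch of how the package set above could be confirmed on an RPM-based system. The function name check_hp_pkgs is made up for illustration, and the package names are simply the four listed above; they may differ on other repo versions.

```shell
#!/bin/sh
# Report which of the four HP packages named above are installed.
# check_hp_pkgs is a hypothetical helper name, not part of any HP tooling.
check_hp_pkgs() {
    for pkg in hpsmh hpilo hp-health hp-snmp-agents; do
        # rpm -q exits non-zero when the package is not installed
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg: installed"
        else
            echo "$pkg: MISSING"
        fi
    done
}

check_hp_pkgs
```

If hp-health is present, I believe running hpasmcli -s "show iml" as root should print the IML from the OS side, which would be a quick way to compare against what iLO shows.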
The Windows server is a DL380p Gen8, while the OEL server is a DL380 G7. I have no other server generations running OEL to compare (although it does appear to be common to the three DL380 G7 servers I have running OEL). Further checking shows IML-logged drive errors on other Windows servers, at least as far back as G5 (so I don't think it is a generation issue).
I've also looked at the startup/config scripts in /opt/hp/hp-snmp-agents/storage/etc/cma* but can't see anything that pertains to the IML (not that I really know what I am looking for here).
Is it a missing package or config statement (i.e. something readily rectifiable) that is preventing these messages reaching the IML?
Or is it a known issue (leaving me no choice but to hack something else into the business process)?
I don't think you should rely on the HP IML log alone. Not everything is reported there, and the log can be cleared. I don't look at it as an authoritative source of system health status. Plus items get marked as repaired, depending on the event.
If you need a comparison of what a busy EL5 system's IML log should look like, see this pastebin. But most of my IML logs have been cleared at some point... E.g.:
The HP management agents in Linux can easily be set to send SNMP traps and also email.
Typical config in /etc/snmp/snmpd.conf:
And for /opt/hp/hp-snmp-agents/cma.conf:
The HP management agents for Linux should be straightforward. You'll want the following packages: