The situation
- Recent upgrade of the UCS firmware from 2.2 to 3.1(1e).
- Since the upgrade, at 6:51am (UTC+1) every day, between zero and three of the ~60 B200-series blades in my installation fail.
- It's always the same three blades, all in different chassis.
- The failures manifest as a hard hang, with 'CPU predictive failure' and 'CATERR_N' messages in the SEL.
- Power-cycling the blade restores it to service (at least until the next failure).
- There are no one-time or recurring schedules in the UCSM that are anywhere near this time of day.
- Cisco TAC is investigating but hasn't shed any light on why the failures happen at the same time every day.
My research and suspicions
- I have a working theory that these are real hardware problems which have somehow been exposed by the firmware upgrade.
- There's a brief mention of something called the 'sensor scanning manager' in the troubleshooting guide, but I can't find any detail as to what it does or how to monitor it.
- I've all but ruled out an environmental cause. Our power and temperature monitors show nothing unusual at that time. We are not in an earthquake zone :-)
The question
Why are the failures happening at precisely the same time every day?
This turned out to be a bug in firmware version 3.1(1e) (Cisco account required for that link). It's described as a 'rare event' involving the VIC 1340 and a debug interrupt.
The reason this was happening at the same time every day is that it was being triggered by `lspci`, and that is exactly what Puppet was doing each morning (we only run it once per day).
It's unclear why only certain blades were affected by this bug, but upgrading to version 3.1(1h) solved the problem.
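For anyone trying to confirm (or rule out) the same trigger on their own systems, one way to catch whatever is invoking `lspci` around the failure time is an audit watch on the binary. This is just a sketch, not what I originally did; the `lspci` path and the audit key name are assumptions, so adjust them for your distro.

```sh
# Watch for any execution of lspci and tag matching events with a key
# (the binary is /usr/sbin/lspci on RHEL-family hosts; it may be /usr/bin/lspci elsewhere)
auditctl -w /usr/sbin/lspci -p x -k lspci-exec

# Later, list the recorded executions with timestamps, PIDs and parent PIDs,
# so they can be correlated with the 6:51am hangs
ausearch -k lspci-exec -i
```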