The situation
- Recent upgrade of the UCS firmware from 2.2 to 3.1(1e).
- Since the upgrade, at 6:51am (UTC+1) every day, between zero and three of the ~60 B200-series blades in my installation fail.
- It's always the same three blades, all in different chassis.
- The failures manifest as a hard hang, with 'CPU predictive failure' and 'CATERR_N' messages in the SEL.
- Power-cycling the blade restores it to service (at least until the next failure).
- There are no one-time or recurring schedules in the UCSM that are anywhere near this time of day.
- Cisco TAC is investigating but hasn't shed any light on why the failures happen at the same time every day.
My research and suspicions
- I have a working theory that these are real hardware problems which have somehow been exposed by the firmware upgrade.
- There's a brief mention of something called the 'sensor scanning manager' in the troubleshooting guide, but I can't find any detail as to what it does or how to monitor it.
- I've all but ruled out an environmental cause. Our power and temperature monitors show nothing unusual at that time. We are not in an earthquake zone :-)
The question
Why are the failures happening at precisely the same time every day?
This turned out to be a bug in firmware version 3.1(1e) (Cisco account required for that link). It's described as a 'rare event' involving the VIC 1340 and a debug interrupt.
The reason this was happening at the same time every day is that it was being triggered by `lspci`, and that is exactly what Puppet was doing each morning (we only run it once per day).
It's unclear why only certain blades were affected by this bug, but upgrading to version 3.1(1h) solved the problem.
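For anyone trying to confirm (or rule out) the same trigger on their own systems, one way to catch whatever is invoking `lspci` around the failure time is an audit watch on the binary. This is just a sketch, not what I originally did; the `lspci` path and the audit key name are assumptions, so adjust them for your distro.

```sh
# Watch for any execution of lspci and tag matching events with a key
# (the binary is /usr/sbin/lspci on RHEL-family hosts; it may be /usr/bin/lspci elsewhere)
auditctl -w /usr/sbin/lspci -p x -k lspci-exec

# Later, list the recorded executions with timestamps, PIDs and parent PIDs,
# so they can be correlated with the 6:51am hangs
ausearch -k lspci-exec -i
```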