I have a system (centos 7.8) with k8s installed on top.
After a few days of normal operations (system load is ~30% and disk activity is around 60 iops, not saturated), the system goes to an unstable state where nothing is commited to disk anymore. As can be seen from an iostat -x 5
, avgqu-sz freezes no more r/w are accepted.
The question is where should I look next in order to identify the root cause.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 169,00 0,00 0,00 0,00 0,00 100,00
dm-0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 186,00 0,00 0,00 0,00 0,00 100,00
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 7,00 0,00 0,00 0,00 0,00 100,00
avg-cpu: %user %nice %system %iowait %steal %idle
11,36 0,00 1,71 86,94 0,00 0,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 169,00 0,00 0,00 0,00 0,00 100,00
dm-0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 186,00 0,00 0,00 0,00 0,00 100,00
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 7,00 0,00 0,00 0,00 0,00 100,00
avg-cpu: %user %nice %system %iowait %steal %idle
11,41 0,00 1,58 87,01 0,00 0,00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 169,00 0,00 0,00 0,00 0,00 100,00
dm-0 0,00 0,00 0,00 0,00 0,00 0,00 0,00 186,00 0,00 0,00 0,00 0,00 100,00
dm-1 0,00 0,00 0,00 0,00 0,00 0,00 0,00 7,00 0,00 0,00 0,00 0,00 100,00
The last log lines from /var/log/messages before a force system reset were:
kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20130517/exfield-389)
kernel: ACPI Error: Method parse/execution failed [\_SB_.PMI0._PMM] (Node ffff99c2ba2513c0), AE_AML_BUFFER_LIMIT (20130517/psparse-536)
kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20130517/power_meter-339)
although, according to this RedHat thread, it should not be an issue https://access.redhat.com/discussions/3871951
LE 1: Occasionally I get similar freezes over small periods of time (less than a minute) and then it recovers. In the dmesg output I have:
[Lu aug 17 21:04:07 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Lu aug 17 21:04:15 2020] hpsa 0000:06:00.0: device is ready.
[Lu aug 17 21:04:15 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
LE 2: Managed to save a dmesg output when the disk doesn't recover anymore and a reset is required.
[Lu aug 24 13:00:18 2020] hpsa 0000:06:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Lu aug 24 13:03:20 2020] INFO: task scsi_eh_0:332 blocked for more than 120 seconds.
[Lu aug 24 13:03:20 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Lu aug 24 13:03:20 2020] scsi_eh_0 D ffff8c603fc9acc0 0 332 2 0x00000000
[Lu aug 24 13:03:20 2020] Call Trace:
So hpsa enters this resetting logical disk-access procedure that never completes.
Last time I seen such symptom for disk IO stoping/pausing it was a bad disk issue. The disk controller on it were probably starting to malfunction, but the platten were ok.
I would check to be sure you have good backup, and as it's a system in raid check if the scsi controller is up to date as it didnt flagged the disk as bad yet.