I have an AMD EPYC 7502P 32-core Linux server (kernel 6.10.6) with 6 NVMe drives, where I/O performance suddenly dropped. All operations take far too long; installing package updates takes hours instead of seconds (or at most minutes).
I've tried running fio on a filesystem on the RAID5 array. There's a huge spread in the clat metric:
clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
The stdev value is extreme.
Full output:
$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=53.3MiB/s][w=13.6k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=48391: Wed Sep 25 09:17:02 2024
write: IOPS=45.5k, BW=178MiB/s (186MB/s)(10.6GiB/61165msec); 0 zone resets
slat (nsec): min=552, max=123137, avg=2016.89, stdev=468.03
clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
lat (usec): min=10, max=359716, avg=18.13, stdev=592.03
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 14], 20.00th=[ 15],
| 30.00th=[ 15], 40.00th=[ 15], 50.00th=[ 15], 60.00th=[ 16],
| 70.00th=[ 16], 80.00th=[ 16], 90.00th=[ 17], 95.00th=[ 18],
| 99.00th=[ 20], 99.50th=[ 22], 99.90th=[ 42], 99.95th=[ 119],
| 99.99th=[ 186]
bw ( KiB/s): min=42592, max=290232, per=100.00%, avg=209653.41, stdev=46502.99, samples=105
iops : min=10648, max=72558, avg=52413.32, stdev=11625.75, samples=105
lat (nsec) : 250=0.01%, 500=0.01%, 1000=0.01%
lat (usec) : 10=0.01%, 20=99.15%, 50=0.76%, 100=0.03%, 250=0.06%
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 500=0.01%
cpu : usr=12.62%, sys=30.97%, ctx=2800981, majf=0, minf=28
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2784519,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=178MiB/s (186MB/s), 178MiB/s-178MiB/s (186MB/s-186MB/s), io=10.6GiB (11.4GB), run=61165-61165msec
Disk stats (read/write):
md1: ios=0/710496, merge=0/0, ticks=0/12788992, in_queue=12788992, util=23.31%, aggrios=319833/649980, aggrmerge=0/0, aggrticks=118293/136983, aggrin_queue=255276, aggrutil=14.78%
nvme1n1: ios=318781/638009, merge=0/0, ticks=118546/131154, in_queue=249701, util=14.71%
nvme5n1: ios=321508/659460, merge=0/0, ticks=118683/138996, in_queue=257679, util=14.77%
nvme2n1: ios=320523/647922, merge=0/0, ticks=120634/134284, in_queue=254918, util=14.71%
nvme3n1: ios=320809/651642, merge=0/0, ticks=118823/135985, in_queue=254808, util=14.73%
nvme0n1: ios=316267/642934, merge=0/0, ticks=116772/143909, in_queue=260681, util=14.75%
nvme4n1: ios=321110/659918, merge=0/0, ticks=116300/137570, in_queue=253870, util=14.78%
Probably one disk is faulty. Is there a way to determine which disk is the slow one?
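I considered benchmarking each member drive individually to compare latencies. A minimal read-only sketch (so it should be safe against the raw members of the live array; device names taken from the output above):

$ for d in /dev/nvme{0..5}n1; do
    echo "== $d"
    fio --name=probe --filename=$d --readonly --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --iodepth=1 \
        --runtime=15 --time_based | grep clat
  done

A drive whose avg or max clat stands well apart from its peers would be the prime suspect.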
All disks have similar SMART attributes, nothing outstanding. From one of the Samsung 7.68 TB drives:
Model Number: SAMSUNG MZQL27T6HBLA-00A07
Firmware Version: GDC5902Q
Data Units Read: 2,121,457,831 [1.08 PB]
Data Units Written: 939,728,748 [481 TB]
Controller Busy Time: 40,224
Power Cycles: 5
Power On Hours: 6,913
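For comparison I pulled the same attributes from all six drives in one pass (a sketch assuming smartmontools is installed; the fields match the excerpt above):

$ for d in /dev/nvme{0..5}n1; do
    echo "== $d"
    smartctl -a $d | grep -E 'Data Units|Controller Busy|Power On'
  done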
Write performance appears to be very similar across the drives:
$ iostat -xh
Linux 6.10.6+bpo-amd64 (ts01b) 25/09/24 _x86_64_ (64 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5.0% 0.0% 4.3% 0.6% 0.0% 90.2%
r/s rkB/s rrqm/s %rrqm r_await rareq-sz Device
0.12 7.3k 0.00 0.0% 0.43 62.9k md0
6461.73 548.7M 0.00 0.0% 0.22 87.0k md1
3583.93 99.9M 9.60 0.3% 1.13 28.5k nvme0n1
3562.77 98.9M 0.80 0.0% 1.15 28.4k nvme1n1
3584.54 99.8M 9.74 0.3% 1.18 28.5k nvme2n1
3565.96 98.8M 1.06 0.0% 1.16 28.4k nvme3n1
3585.04 99.9M 9.78 0.3% 1.16 28.5k nvme4n1
3577.56 99.0M 0.86 0.0% 1.17 28.3k nvme5n1
w/s wkB/s wrqm/s %wrqm w_await wareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 4.0k md0
366.41 146.5M 0.00 0.0% 14.28 409.4k md1
8369.26 32.7M 1.18 0.0% 3.73 4.0k nvme0n1
8364.63 32.7M 1.12 0.0% 3.63 4.0k nvme1n1
8355.48 32.6M 1.10 0.0% 3.56 4.0k nvme2n1
8365.23 32.7M 1.10 0.0% 3.46 4.0k nvme3n1
8365.37 32.7M 1.25 0.0% 3.37 4.0k nvme4n1
8356.70 32.6M 1.06 0.0% 3.29 4.0k nvme5n1
d/s dkB/s drqm/s %drqm d_await dareq-sz Device
0.00 0.0k 0.00 0.0% 0.00 0.0k md0
0.00 0.0k 0.00 0.0% 0.00 0.0k md1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme0n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme1n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme2n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme3n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme4n1
0.00 0.0k 0.00 0.0% 0.00 0.0k nvme5n1
f/s f_await aqu-sz %util Device
0.00 0.00 0.00 0.0% md0
0.00 0.00 6.68 46.8% md1
0.00 0.00 35.24 14.9% nvme0n1
0.00 0.00 34.50 14.6% nvme1n1
0.00 0.00 33.98 14.9% nvme2n1
0.00 0.00 33.06 14.6% nvme3n1
0.00 0.00 32.33 14.8% nvme4n1
0.00 0.00 31.72 14.6% nvme5n1
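To see whether one member drifts apart only under load, the extended report can be repeated during a write test and filtered to the member drives (the second report shows the current interval; the first is the since-boot average):

$ iostat -x 5 2 | grep -E '^(Device|nvme)'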
What does appear somewhat problematic are the interrupts:
$ dstat -tf --int24 60
----system---- -------------------------------interrupts------------------------------
time | 120 128 165 199 213 342 LOC PMI IWI RES CAL TLB
25-09 10:53:45|2602 2620 2688 2695 2649 2725 136k 36 1245 2739 167k 795
25-09 10:54:45| 64 64 65 64 66 65 2235 1 26 16 2156 3
25-09 10:55:45| 33 31 32 32 32 30 2050 1 24 10 2162 20
25-09 10:56:45| 31 31 30 35 30 33 2303 1 26 63 2245 9
25-09 10:57:45| 36 29 27 34 35 35 2016 1 23 72 2645 10
25-09 10:58:45| 9 8 9 8 7 8 1766 0 27 4 1892 15
25-09 10:59:45| 59 62 59 58 60 60 1585 1 22 20 1704 9
25-09 11:00:45| 25 21 21 26 26 26 1605 0 26 10 1862 10
25-09 11:01:45| 34 32 32 33 36 31 1515 0 23 24 1948 10
25-09 11:02:45| 21 23 23 25 22 24 1772 0 27 27 1781 9
The fields with increased interrupt counts map to the 9-edge interrupt (nvme[0-5]q9) of every drive, e.g.:
$ cat /proc/interrupts | grep 120:
IR-PCI-MSIX-0000:01:00.0 9-edge nvme2q9
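The remaining IRQ numbers from the dstat header can be mapped the same way in one pass (in /proc/interrupts, $1 is the IRQ number and $NF the device name):

$ for irq in 120 128 165 199 213 342; do
    awk -v irq="$irq:" '$1 == irq {print irq, $NF}' /proc/interrupts
  done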
EDIT: The 9-edge interrupts probably belong to the Metadisk (software RAID) devices.
The issue was probably caused by a malfunctioning connector. After reconnecting all drives and checking the cables, the random-write benchmark looks OK again; the clat max value is back in the "normal" range (5723.2k).
Unfortunately, mdadm still cannot handle NVMe RAID5/6 writes efficiently. There are four other options for building a RAID over NVMe drives inside a host with good write performance: use ZFS, use an NVMe hardware RAID controller, use GRAID (NVIDIA GPU-based), or use Xinnor software RAID.
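After reseating the drives it is also worth confirming that the array itself came back clean (assuming /dev/md1, as in the fio disk stats above):

$ cat /proc/mdstat
$ mdadm --detail /dev/md1 | grep -E 'State|Failed'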