Some time (or some amount of data written) after a server reboot, the write speed drops to less than 1 MB/s. This happens regardless of the filesystem (also on a raw partition) and regardless of whether the target is HDD (HW RAID) or SSD (SW RAID, with the SSDs connected to the motherboard AHCI ports, not the RAID board). We are testing with the command dd if=/dev/zero of=tst bs=1M oflag=dsync (I have also tried bs=1k and running without dsync, but performance was not better).
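The variants tested were along these lines (invocations reconstructed from the description above):

dd if=/dev/zero of=tst bs=1M oflag=dsync   # baseline test
dd if=/dev/zero of=tst bs=1k oflag=dsync   # smaller block size
dd if=/dev/zero of=tst bs=1M               # without dsync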
The only odd thing I noticed is that avgrq-sz is only 8 in the iostat output (on other tested servers it was more than 600) and that req/s is about 100 (also on the SSDs). Running several dd processes in parallel gave each of them 1 MB/s and about 100 req/s.
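The parallel runs were started roughly like this (a sketch; the exact command line and output file naming are assumptions):

for i in $(seq 8); do
    dd if=/dev/zero of=tst$i bs=1M oflag=dsync &   # one writer per file
done
wait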
Sample iostat -xN 1 output:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 0.00 125.00 0.00 500.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 124.00 0.00 496.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 3.00 0.00 128.00 0.00 524.00 8.19 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 6.00 0.00 124.00 0.00 728.00 11.74 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 125.00 0.00 500.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 3.00 0.00 128.00 0.00 524.00 8.19 0.00 0.00 0.00 0.00 0.00 0.00
Iostat output with 8x dd running:
sdc 0.00 64.00 0.00 959.00 0.00 7560.00 15.77 0.00 0.00 0.00 0.00 0.00 0.00
lsblk -O output is consistent with other servers that do not have this problem (e.g. the MIN-IO, RQ-SIZE and LOG-SEC columns). The current kernel is 4.9.16-gentoo, but the issue started with an older kernel. Running dd with oflag=direct is fast.
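The MIN-IO, RQ-SIZE and LOG-SEC columns mentioned above can be compared between servers with something like the following (sdc being the affected device):

lsblk -o NAME,MIN-IO,RQ-SIZE,LOG-SEC /dev/sdc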
EDIT: Based on shodanshok's answer I now see that the requests are indeed small, but the question is why the I/O scheduler does not merge them into larger requests. I have tried both the cfq and deadline schedulers. Is there anything I can check (or compare against other servers)?
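For reference, switching between the schedulers can be done at runtime via sysfs, along these lines (a sketch; merging is also governed by the nomerges knob shown further below):

cat /sys/block/sdc/queue/scheduler              # active scheduler is shown in brackets
echo cfq > /sys/block/sdc/queue/scheduler       # switch to cfq, then repeat the dd test
echo deadline > /sys/block/sdc/queue/scheduler  # switch back to deadline
cat /sys/block/sdc/queue/nomerges               # 0 means request merging is allowed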
Output when running with oflag=direct (speed is OK):
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 2.00 649.00 8.00 141312.00 434.16 2.78 4.26 2.00 4.27 1.53 99.60
sdc 0.00 0.00 0.00 681.00 0.00 143088.00 420.23 2.71 3.99 0.00 3.99 1.46 99.20
sdc 0.00 0.00 2.00 693.00 8.00 146160.00 420.63 2.58 3.71 0.00 3.72 1.42 98.80
sdc 0.00 49.00 2.00 710.00 8.00 146928.00 412.74 2.68 3.76 22.00 3.71 1.39 99.20
sdc 0.00 0.00 1.00 642.00 4.00 136696.00 425.19 2.43 3.79 60.00 3.71 1.42 91.60
The server is a Dell PowerEdge R330 with 32 GB RAM, an Intel E3-1270 CPU, an LSI MegaRAID 3108 controller with HDDs, and SSDs connected to onboard SATA. The filesystem is ext3, but the same happens with dd to a raw partition.
Output of lsblk (sdc is the HW RAID HDD, sda/sdb are the SW RAID SSDs):
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 223.6G 0 disk
|-sdb4 8:20 0 1K 0 part
|-sdb2 8:18 0 140G 0 part
| `-md1 9:1 0 140G 0 raid1 /
|-sdb5 8:21 0 25G 0 part
| `-md2 9:2 0 25G 0 raid1 /var/log
|-sdb3 8:19 0 1G 0 part
|-sdb1 8:17 0 70M 0 part
| `-md0 9:0 0 70M 0 raid1 /boot
`-sdb6 8:22 0 57.5G 0 part
`-md3 9:3 0 57.5G 0 raid1 /tmp
sr0 11:0 1 1024M 0 rom
sdc 8:32 0 3.7T 0 disk
`-sdc1 8:33 0 3.7T 0 part /home
sda 8:0 0 223.6G 0 disk
|-sda4 8:4 0 1K 0 part
|-sda2 8:2 0 140G 0 part
| `-md1 9:1 0 140G 0 raid1 /
|-sda5 8:5 0 25G 0 part
| `-md2 9:2 0 25G 0 raid1 /var/log
|-sda3 8:3 0 1G 0 part
|-sda1 8:1 0 70M 0 part
| `-md0 9:0 0 70M 0 raid1 /boot
`-sda6 8:6 0 57.5G 0 part
`-md3 9:3 0 57.5G 0 raid1 /tmp
With oflag=direct the speed is OK, but the issue is that applications don't use direct I/O, so even a plain cp is slow.
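As an additional check (not part of the original tests; file names are placeholders), one can confirm that ordinary tools never request direct I/O:

strace -e trace=open,openat cp /home/bigfile /home/bigfile.copy 2>&1 | grep O_DIRECT
# empty output: cp does not open with O_DIRECT, so its writes go through the page cache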
/sys/block/sdc/queue/hw_sector_size : 512
/sys/block/sdc/queue/max_segment_size : 65536
/sys/block/sdc/queue/physical_block_size : 512
/sys/block/sdc/queue/discard_max_bytes : 0
/sys/block/sdc/queue/rotational : 1
/sys/block/sdc/queue/iosched/fifo_batch : 16
/sys/block/sdc/queue/iosched/read_expire : 500
/sys/block/sdc/queue/iosched/writes_starved : 2
/sys/block/sdc/queue/iosched/write_expire : 5000
/sys/block/sdc/queue/iosched/front_merges : 1
/sys/block/sdc/queue/write_same_max_bytes : 0
/sys/block/sdc/queue/max_sectors_kb : 256
/sys/block/sdc/queue/discard_zeroes_data : 0
/sys/block/sdc/queue/read_ahead_kb : 128
/sys/block/sdc/queue/discard_max_hw_bytes : 0
/sys/block/sdc/queue/nomerges : 0
/sys/block/sdc/queue/max_segments : 64
/sys/block/sdc/queue/rq_affinity : 1
/sys/block/sdc/queue/iostats : 1
/sys/block/sdc/queue/dax : 0
/sys/block/sdc/queue/minimum_io_size : 512
/sys/block/sdc/queue/io_poll : 0
/sys/block/sdc/queue/max_hw_sectors_kb : 256
/sys/block/sdc/queue/add_random : 1
/sys/block/sdc/queue/optimal_io_size : 0
/sys/block/sdc/queue/nr_requests : 128
/sys/block/sdc/queue/scheduler : noop [deadline] cfq
/sys/block/sdc/queue/discard_granularity : 0
/sys/block/sdc/queue/logical_block_size : 512
/sys/block/sdc/queue/max_integrity_segments : 0
/sys/block/sdc/queue/write_cache : write through
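For reference, a listing in this form can be produced with a small loop over the queue attributes (a sketch, not necessarily the exact command used):

for f in /sys/block/sdc/queue/* /sys/block/sdc/queue/iosched/*; do
    [ -f "$f" ] && echo "$f : $(cat "$f")"
done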
The request size (avgrq-sz) is small because you are issuing small write requests. Your dd command, albeit specifying a 1 MB block size, is hitting the page cache, so each 1 MB write really is a collection of 256 requests of 4 KB each. This is reflected by avgrq-sz which, being expressed in 512-byte units, aligns perfectly with the 4 KB page size (8 * 512 B = 4 KB). Moreover, even an SSD can have bad performance with synchronized writes, as requested by oflag=dsync. That said, please note that the I/O scheduler should merge these small 4 KB requests into bigger ones, but this is not happening.

Some things to check:
- What does cat /sys/block/sdc/queue/scheduler report? If noop is the selected scheduler, try selecting deadline.
- Is /sys/block/sdc/queue/max_sectors_kb at least 1024?
- Run dd if=/dev/zero of=tst bs=1M oflag=direct: I/O performance should be much higher, right?
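A minimal sketch of these checks (device and test file names taken from the question):

cat /sys/block/sdc/queue/scheduler          # noop should not be the one in brackets
cat /sys/block/sdc/queue/max_sectors_kb     # 256 here; raising it is capped by max_hw_sectors_kb (also 256)
dd if=/dev/zero of=tst bs=1M oflag=direct   # bypasses the page cache; should be much faster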