Some time (or some amount of data written) after a server reboot, the write speed drops to less than 1 MB/s. This happens regardless of the filesystem (also on a raw partition) and regardless of whether the target is HDD (HW RAID) or SSD (SW RAID, with the SSDs connected to the motherboard AHCI ports, not the RAID board). We are testing with the command dd if=/dev/zero of=tst bs=1M oflag=dsync (I have also tried bs=1k and running without dsync, but performance was not better).
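The variants tested were along these lines (invocations reconstructed from the description above):

dd if=/dev/zero of=tst bs=1M oflag=dsync   # baseline test
dd if=/dev/zero of=tst bs=1k oflag=dsync   # smaller block size
dd if=/dev/zero of=tst bs=1M               # without dsync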
The only odd thing I noticed is that avgrq-sz is only 8 in the iostat output (on other tested servers it was more than 600) and that req/s is about 100 (also on the SSDs). Running several dd processes in parallel gave each of them 1 MB/s and about 100 req/s.
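The parallel runs were started roughly like this (a sketch; the exact command line and output file naming are assumptions):

for i in $(seq 8); do
    dd if=/dev/zero of=tst$i bs=1M oflag=dsync &   # one writer per file
done
wait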
Sample iostat -xN 1 output:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 0.00 125.00 0.00 500.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 124.00 0.00 496.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 3.00 0.00 128.00 0.00 524.00 8.19 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 6.00 0.00 124.00 0.00 728.00 11.74 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 125.00 0.00 500.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 3.00 0.00 128.00 0.00 524.00 8.19 0.00 0.00 0.00 0.00 0.00 0.00
Iostat output with 8x dd running:
sdc 0.00 64.00 0.00 959.00 0.00 7560.00 15.77 0.00 0.00 0.00 0.00 0.00 0.00
lsblk -O output is consistent with other servers that do not have this problem (e.g. the MIN-IO, RQ-SIZE and LOG-SEC columns). The current kernel is 4.9.16-gentoo, but the issue started with an older kernel. Running dd with oflag=direct is fast.
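The MIN-IO, RQ-SIZE and LOG-SEC columns mentioned above can be compared between servers with something like the following (sdc being the affected device):

lsblk -o NAME,MIN-IO,RQ-SIZE,LOG-SEC /dev/sdc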
EDIT: Based on shodanshok's answer I now see that the requests are indeed small, but the question is why the I/O scheduler does not merge them into larger requests. I have tried both the cfq and deadline schedulers. Is there anything I can check (or compare against other servers)?
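For reference, switching between the schedulers can be done at runtime via sysfs, along these lines (a sketch; merging is also governed by the nomerges knob shown further below):

cat /sys/block/sdc/queue/scheduler              # active scheduler is shown in brackets
echo cfq > /sys/block/sdc/queue/scheduler       # switch to cfq, then repeat the dd test
echo deadline > /sys/block/sdc/queue/scheduler  # switch back to deadline
cat /sys/block/sdc/queue/nomerges               # 0 means request merging is allowed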
Output when running with oflag=direct (speed is OK):
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 2.00 649.00 8.00 141312.00 434.16 2.78 4.26 2.00 4.27 1.53 99.60
sdc 0.00 0.00 0.00 681.00 0.00 143088.00 420.23 2.71 3.99 0.00 3.99 1.46 99.20
sdc 0.00 0.00 2.00 693.00 8.00 146160.00 420.63 2.58 3.71 0.00 3.72 1.42 98.80
sdc 0.00 49.00 2.00 710.00 8.00 146928.00 412.74 2.68 3.76 22.00 3.71 1.39 99.20
sdc 0.00 0.00 1.00 642.00 4.00 136696.00 425.19 2.43 3.79 60.00 3.71 1.42 91.60
The server is a Dell PowerEdge R330 with 32 GB RAM, an Intel E3-1270 CPU, an LSI MegaRAID 3108 controller with HDDs, and SSDs connected to onboard SATA. The filesystem is ext3, but the same happens with dd to a raw partition.
Output of lsblk (sdc is the HW RAID HDD, sda/sdb are the SW RAID SSDs):
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 223.6G 0 disk
|-sdb4 8:20 0 1K 0 part
|-sdb2 8:18 0 140G 0 part
| `-md1 9:1 0 140G 0 raid1 /
|-sdb5 8:21 0 25G 0 part
| `-md2 9:2 0 25G 0 raid1 /var/log
|-sdb3 8:19 0 1G 0 part
|-sdb1 8:17 0 70M 0 part
| `-md0 9:0 0 70M 0 raid1 /boot
`-sdb6 8:22 0 57.5G 0 part
`-md3 9:3 0 57.5G 0 raid1 /tmp
sr0 11:0 1 1024M 0 rom
sdc 8:32 0 3.7T 0 disk
`-sdc1 8:33 0 3.7T 0 part /home
sda 8:0 0 223.6G 0 disk
|-sda4 8:4 0 1K 0 part
|-sda2 8:2 0 140G 0 part
| `-md1 9:1 0 140G 0 raid1 /
|-sda5 8:5 0 25G 0 part
| `-md2 9:2 0 25G 0 raid1 /var/log
|-sda3 8:3 0 1G 0 part
|-sda1 8:1 0 70M 0 part
| `-md0 9:0 0 70M 0 raid1 /boot
`-sda6 8:6 0 57.5G 0 part
`-md3 9:3 0 57.5G 0 raid1 /tmp
With oflag=direct the speed is OK, but the issue is that applications don't use direct I/O, so even a plain cp is slow.
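As an additional check (not part of the original tests; file names are placeholders), one can confirm that ordinary tools never request direct I/O:

strace -e trace=open,openat cp /home/bigfile /home/bigfile.copy 2>&1 | grep O_DIRECT
# empty output: cp does not open with O_DIRECT, so its writes go through the page cache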
/sys/block/sdc/queue/hw_sector_size : 512
/sys/block/sdc/queue/max_segment_size : 65536
/sys/block/sdc/queue/physical_block_size : 512
/sys/block/sdc/queue/discard_max_bytes : 0
/sys/block/sdc/queue/rotational : 1
/sys/block/sdc/queue/iosched/fifo_batch : 16
/sys/block/sdc/queue/iosched/read_expire : 500
/sys/block/sdc/queue/iosched/writes_starved : 2
/sys/block/sdc/queue/iosched/write_expire : 5000
/sys/block/sdc/queue/iosched/front_merges : 1
/sys/block/sdc/queue/write_same_max_bytes : 0
/sys/block/sdc/queue/max_sectors_kb : 256
/sys/block/sdc/queue/discard_zeroes_data : 0
/sys/block/sdc/queue/read_ahead_kb : 128
/sys/block/sdc/queue/discard_max_hw_bytes : 0
/sys/block/sdc/queue/nomerges : 0
/sys/block/sdc/queue/max_segments : 64
/sys/block/sdc/queue/rq_affinity : 1
/sys/block/sdc/queue/iostats : 1
/sys/block/sdc/queue/dax : 0
/sys/block/sdc/queue/minimum_io_size : 512
/sys/block/sdc/queue/io_poll : 0
/sys/block/sdc/queue/max_hw_sectors_kb : 256
/sys/block/sdc/queue/add_random : 1
/sys/block/sdc/queue/optimal_io_size : 0
/sys/block/sdc/queue/nr_requests : 128
/sys/block/sdc/queue/scheduler : noop [deadline] cfq
/sys/block/sdc/queue/discard_granularity : 0
/sys/block/sdc/queue/logical_block_size : 512
/sys/block/sdc/queue/max_integrity_segments : 0
/sys/block/sdc/queue/write_cache : write through
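For reference, a listing in this form can be produced with a small loop over the queue attributes (a sketch, not necessarily the exact command used):

for f in /sys/block/sdc/queue/* /sys/block/sdc/queue/iosched/*; do
    [ -f "$f" ] && echo "$f : $(cat "$f")"
done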
The request size (avgrq-sz) is small because you are issuing small write requests. Your dd command, albeit specifying a 1 MB block size, is hitting the page cache, so each 1 MB write really is a collection of 256 requests of 4 KB each. This is reflected by avgrq-sz which, being expressed in 512-byte units, aligns perfectly with the 4 KB page size (8 * 512 B = 4 KB). Moreover, even an SSD can have bad performance with synchronized writes, as requested by oflag=dsync. That said, please note that the I/O scheduler should merge these small 4 KB requests into bigger ones, but this is not happening.

Some things to check:
- What does cat /sys/block/sdc/queue/scheduler report? If noop is the selected scheduler, try selecting deadline.
- Is /sys/block/sdc/queue/max_sectors_kb at least 1024?
- Run dd if=/dev/zero of=tst bs=1M oflag=direct: I/O performance should be much higher, right?
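A minimal sketch of these checks (device and test file names taken from the question):

cat /sys/block/sdc/queue/scheduler          # noop should not be the one in brackets
cat /sys/block/sdc/queue/max_sectors_kb     # 256 here; raising it is capped by max_hw_sectors_kb (also 256)
dd if=/dev/zero of=tst bs=1M oflag=direct   # bypasses the page cache; should be much faster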