I am reading a large file sequentially from the disk and trying to understand the iostat output while the reading is taking place.
- Size of the file: 10 GB
- Read buffer: 4 KB
- Read ahead (/sys/block/sda/queue/read_ahead_kb): 128 KB
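For reference, the read loop is roughly the following (simplified, error handling trimmed; the file path is a placeholder):

    /* Sequential read of the test file in 4 KB chunks until EOF. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];                           /* 4 KB read buffer */
        int fd = open("/data/bigfile", O_RDONLY); /* the 10 GB file   */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;                                     /* consume the data */

        if (n < 0) perror("read");
        close(fd);
        return 0;
    }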
The iostat output is as follows:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 833.00 14.00 103.88 0.05 251.30 6.07 5.69 2.33 205.71 1.18 100.00
Computing the average size of an I/O request (rMB/s divided by r/s) gives ~128 KB, which is the read ahead value. This seems to indicate that although the read system call asks for only 4 KB at a time, the actual disk I/O is issued according to the read ahead value.
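Spelling out the arithmetic from the first run:

    103.88 MB/s ÷ 833 r/s ≈ 0.125 MB ≈ 128 KB per read request

The avgrq-sz column agrees: 251.30 sectors × 512 bytes ≈ 126 KB; it sits slightly below 128 KB because that average also includes the handful of small writes.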
When I increased the read ahead value to 256 KB, the iostat output was as follows:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 28.00 412.00 12.00 102.50 0.05 495.32 10.78 12.15 4.76 265.83 2.36 100.00
Again the average I/O request size (102.50 MB/s ÷ 412 r/s ≈ 0.25 MB) was ~256 KB, matching the read ahead value.
This pattern held when I raised the read ahead value to 512 KB, but did not hold when I moved up to 1024 KB - the average size of an I/O request was still 512 KB. Increasing max_sectors_kb (the maximum amount of data per I/O request) from its default of 512 KB to 1024 KB did not help either.
Why is this happening? Ideally I would like to minimize my read IOPS and read a larger amount of data per I/O request (more than 512 KB per request). Additionally, I am hitting 100% disk utilization in all cases; I would like to throttle myself to 50-60% disk utilization while keeping good sequential throughput. In short, what are the optimal application/kernel settings for sequential read I/O?
You say that you want to minimize read IOPS and maximize the size of each IO request. I suspect you wouldn't really benefit from this, though. Normally I'd care about maximizing throughput while minimizing latency, and about finding a good balance of the two for the particular application.
Note that when you went from a 128 kB readahead to a 256 kB readahead, read throughput actually dropped, from 103.88 MB/s to 102.50 MB/s. I wouldn't expect that trend to reverse at higher readahead sizes. A higher readahead also brings a risk of more wasted IO if the data is not purely sequential, which would reduce the performance of useful IO.
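If you still want to experiment, one application-side knob is posix_fadvise() with POSIX_FADV_SEQUENTIAL, which tells the kernel the descriptor will be read sequentially and lets it use a larger readahead window for that file. Whether that actually changes your iostat numbers is something you'd have to measure; a rough sketch (the path is a placeholder):

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/data/bigfile", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        /* Hint that the whole file (offset 0, length 0 = to EOF) will be
         * read sequentially, so the kernel can read ahead more aggressively. */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        /* ... sequential read loop as in the question ... */

        close(fd);
        return 0;
    }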
If you're interested, the 512kB limit probably comes from another layer in the storage stack such as the SCSI driver, the controller firmware, or the bus.
To throttle IO you could look at the following: How to Throttle per process I/O to a max limit?
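One way to do that kind of throttling on a current kernel is a cgroup v2 io.max limit. A rough sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup with the io controller enabled, that sda is device 8:0, and that this runs as root; the 60 MiB/s cap is just an example, roughly 60% of the ~103 MB/s you measured:

    /* Sketch: cap this process's reads from sda (8:0) at ~60 MiB/s using
     * a cgroup v2 io.max limit, then run the sequential read from here. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *s)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s", s);
        return fclose(f);
    }

    int main(void)
    {
        /* Create a child cgroup (ignore EEXIST) and set a read-bandwidth
         * limit on device 8:0; 62914560 bytes = 60 MiB/s. */
        mkdir("/sys/fs/cgroup/seqread", 0755);
        if (write_str("/sys/fs/cgroup/seqread/io.max", "8:0 rbps=62914560\n"))
            return 1;

        /* Move this process into the cgroup; from here on its reads from
         * sda are throttled to the limit above. */
        char pid[32];
        snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
        if (write_str("/sys/fs/cgroup/seqread/cgroup.procs", pid))
            return 1;

        /* ... do the sequential read here ... */
        return 0;
    }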
If you are reading from a filesystem on top of an LVM volume, this seems to be the expected behavior. I also wrote to the LVM mailing list, but no one replied to me.
I suspect the LVM code internally manages blocks/requests of 512 KB maximum, so increasing the max_sectors_kb parameter over this hard limit has no effect.