I've got a server from hetzner.de (EQ4) with two SAMSUNG HD753LJ drives (750 GB, 32 MB cache).
The OS is CentOS 5 (x86_64). The drives are combined into two RAID1 arrays:
- /dev/md0, which is 512 MB and holds only the /boot partition
- /dev/md1, which is over 700 GB and is one big LVM volume group hosting all the other partitions
Now, I've been running some benchmarks, and even though the drives are exactly the same model, their speeds differ a bit:
# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   25612 MB in  1.99 seconds = 12860.70 MB/sec
 Timing buffered disk reads:  352 MB in  3.01 seconds = 116.80 MB/sec

# hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   25524 MB in  1.99 seconds = 12815.99 MB/sec
 Timing buffered disk reads:  342 MB in  3.01 seconds = 113.64 MB/sec
Also, when I run e.g. pgbench, which stresses I/O quite heavily, I see the following in the iostat output:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 231.40 0.00 298.00 0.00 9683.20 32.49 0.17 0.58 0.34 10.24
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 231.40 0.00 298.00 0.00 9683.20 32.49 0.17 0.58 0.34 10.24
sdb 0.00 231.40 0.00 301.80 0.00 9740.80 32.28 14.19 51.17 3.10 93.68
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb2 0.00 231.40 0.00 301.80 0.00 9740.80 32.28 14.19 51.17 3.10 93.68
md1 0.00 0.00 0.00 529.60 0.00 9692.80 18.30 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 529.00 0.00 9688.00 18.31 24.51 49.91 1.81 95.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 152.40 0.00 330.60 0.00 5176.00 15.66 0.19 0.57 0.19 6.24
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 152.40 0.00 330.60 0.00 5176.00 15.66 0.19 0.57 0.19 6.24
sdb 0.00 152.40 0.00 326.20 0.00 5118.40 15.69 19.96 55.36 3.01 98.16
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb2 0.00 152.40 0.00 326.20 0.00 5118.40 15.69 19.96 55.36 3.01 98.16
md1 0.00 0.00 0.00 482.80 0.00 5166.40 10.70 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 482.80 0.00 5166.40 10.70 30.19 56.92 2.05 99.04
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 181.64 0.00 324.55 0.00 5445.11 16.78 0.15 0.45 0.21 6.87
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 181.64 0.00 324.55 0.00 5445.11 16.78 0.15 0.45 0.21 6.87
sdb 0.00 181.84 0.00 328.54 0.00 5493.01 16.72 18.34 61.57 3.01 99.00
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb2 0.00 181.84 0.00 328.54 0.00 5493.01 16.72 18.34 61.57 3.01 99.00
md1 0.00 0.00 0.00 506.39 0.00 5477.05 10.82 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 506.39 0.00 5477.05 10.82 28.77 62.15 1.96 99.00
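(For anyone who wants to reproduce this: the output above is iostat's extended device statistics, gathered while pgbench was running; something along these lines, with the database name and the pgbench parameters only as placeholders, produces a similar picture:)

# create a throwaway pgbench database ("bench" and the scale factor are just placeholders)
pgbench -i -s 50 bench
# drive the write-heavy workload: 8 clients, 10000 transactions each
pgbench -c 8 -t 10000 bench
# meanwhile, in another terminal: extended per-device statistics every 5 seconds
iostat -dx 5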
And this is what completely confuses me: how can two identically specced drives show such a difference in write speed (see the %util column)? I haven't really paid attention to those numbers before, so perhaps it's normal -- if someone could confirm that, I would be really grateful.
Otherwise, if someone has seen this behavior before or knows what is causing it, I would really appreciate an answer.
I'll also add that the output of both "smartctl -a" and "hdparm -I" is exactly the same for the two drives and indicates no hardware problems. The slower drive has already been replaced twice (with new drives). I also asked for the drives to be swapped between their slots, after which sda was the slower one and sdb the quicker one (i.e. the slow one was still the same physical drive). The SATA cables have been changed twice as well.
Could you please try the bonnie++ benchmark tool? You should run it with a test file twice the size of your RAM. Your problem description makes me think it is the controller that can't easily handle the parallel writes that software RAID1 does. To check whether this hypothesis is true, please run the benchmark in the following situations (a sample invocation for a machine with 1 GB of RAM is sketched after the list):
1) A separate benchmark on each hard disk. The hypothesis says the results will be similar.
2) A benchmark on the RAID1 array.
3) Simultaneous benchmarks on the two separate disks. The hypothesis says this should look more like 2) than 1).
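A rough sketch of what I mean, assuming 1 GB of RAM (so a 2048 MB test file) and that /mnt/sda-test, /mnt/sdb-test and /mnt/raid-test are filesystems mounted on a plain sda partition, a plain sdb partition and the RAID1 respectively (those paths are only placeholders; adjust sizes and paths to your machine):

# 1) each disk on its own -- the hypothesis says the results will be similar
bonnie++ -d /mnt/sda-test -s 2048 -u root
bonnie++ -d /mnt/sdb-test -s 2048 -u root

# 2) the RAID1 array
bonnie++ -d /mnt/raid-test -s 2048 -u root

# 3) both disks at the same time -- the hypothesis says this looks more like 2) than 1)
bonnie++ -d /mnt/sda-test -s 2048 -u root &
bonnie++ -d /mnt/sdb-test -s 2048 -u root &
wait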
Good luck,
João Miguel Neves
I agree that you're getting a performance disparity between the disks: just look at the disparity in queue sizes. However, we don't yet know whether to blame the disks themselves or something higher up the stack. A couple of experiments:
Make an md3 with a partition from sdb as the first element of the mirror and a partition from sda as the second, and see whether the poor performance follows the disk or the software RAID (a rough mdadm sketch is below). This would surprise me, but it might be worth doing before the second experiment, which requires (ugh) physical access.
Physically swap the connections to sda and sdb. If performance changes now, you should blame your disk controller.
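If you can carve out a pair of spare test partitions -- say sda3 and sdb3, which don't exist in your current layout and are purely placeholders here -- the first experiment might look roughly like this:

# build a test mirror with sdb's partition as the first element
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdb3 /dev/sda3
mkfs.ext3 /dev/md3
mkdir -p /mnt/md3-test && mount /dev/md3 /mnt/md3-test
# ...benchmark against /mnt/md3-test, then tear the test array down...
umount /mnt/md3-test
mdadm --stop /dev/md3
mdadm --zero-superblock /dev/sda3 /dev/sdb3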
I think your data are normal; that is, the drives differ, but only by a very few percent. I've seen values of this kind on many other pairs of identical drives.
Andrea
Sorry, my eyes crossed at the end of the rows, and I also misread what you wrote about the difference in %util between the two drives.
No, it's not normal, and after what you have said I think the problem is likely the controller. Are the two channels configured in the same way?
I would suspect the RAID as part of this observation. The drives show almost identical w/s and wsec/s. Since the md RAID replicates every write to two drives attached to the same controller, it might be that the data is transferred over the bus only once, so one drive's utilization goes through the roof while the other merely writes out the block that is already sitting at the controller. Have you tried to reproduce the behaviour without the md RAID?
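One read-only way to check (this is only a sketch, and it exercises reads rather than the writes you are graphing, but it bypasses md entirely): stream from both raw disks in parallel and watch iostat to see whether one of them lags:

# read 4 GB straight off each disk at the same time, bypassing md and the page cache
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct &
dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct &
wait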
Maybe it is because of the write-intent bitmap? It can slow down RAID1 writes.
Turning off the write-intent bitmap will increase write speed to the RAID1, but it will also increase the time needed to rebuild the array after a failure.
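You can check whether md1 has a bitmap and, as an experiment, remove it (it can be put back afterwards); roughly:

# does md1 currently use a write-intent bitmap? (look for a "bitmap:" line)
cat /proc/mdstat
mdadm --detail /dev/md1 | grep -i bitmap
# remove the bitmap for a test run...
mdadm --grow /dev/md1 --bitmap=none
# ...and add an internal bitmap back afterwards
mdadm --grow /dev/md1 --bitmap=internal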
I work at a hosting facility, and I've seen similar issues in the past: a huge drive %util at data rates that shouldn't max out the drive. Typically, when we see that, we swap the drive with another one and RMA the old one.
Contact your hosting provider and let them know that you're having this problem, and see if they can swap it with another drive to see if that fixes the problem. If not, it may be a drive controller issue.
Which I/O elevator (scheduler) are you using? Based on your workload, you might try deadline instead of CFQ. Depending on the kernel, it may have shipped with the anticipatory scheduler enabled by default, which I've found to have problems with md-constructed sets.
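For example (sda shown, the same applies to sdb; these are the standard sysfs paths on CentOS 5 kernels):

# show the current elevator -- the active one is printed in square brackets
cat /sys/block/sda/queue/scheduler
# switch to deadline at runtime
echo deadline > /sys/block/sda/queue/scheduler
# or make it the default by adding elevator=deadline to the kernel command line in grub.conf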
It could be an iostat problem or a problem with the kernel's internal statistics. The hdparm numbers seem a bit high; one review found this drive to reach about 88 MB/s at most for writes. It could also be that NCQ is switched off on one of the drives; check the dmesg output.
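A quick way to check NCQ on both drives (a queue_depth of 1 usually means NCQ is off, 31 means it is on):

# NCQ-related messages from the libata driver at boot
dmesg | grep -i ncq
# per-drive queue depth: 1 = NCQ disabled, 31 = NCQ enabled
cat /sys/block/sda/device/queue_depth
cat /sys/block/sdb/device/queue_depth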
The real test is to take the drives out of the RAID and run the same benchmarks, such as bonnie++, on each of them separately.