I have a 6 disk raid6 mdadm array I'd like to benchmark writes to:
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sda[0] sdf[5] sde[4] sdd[3] sdc[2] sdb[1]
1953545984 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
Benchmarks can be inaccurate because of cache - for example, notice the write speed here is higher than it should be:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.276026 s, 380 MB/s
Now we can disable each disk's write cache pretty easily:
root@ubuntu:~# hdparm -W0 /dev/sd*
/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdc:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdd:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sde:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdf:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
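(Afterwards the write caches can be re-enabled the same way, flipping -W0 to -W1; this just mirrors the command above and assumes the same six whole-disk devices:)
hdparm -W1 /dev/sd*   # turn drive write caching back on once the benchmark is done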
But the Linux page cache is still in play:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00566339 s, 1.9 GB/s
To disable Linux caching, we can mount the filesystem synchronously:
mount -o remount,sync /mnt/raid6
But after this, writes become far slower than they should be:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 23.3311 s, 449 kB/s
It's as if mdadm requires async mounts in order to function. What's going on here?
That's not quite right... sync doesn't simply disable caching like you want in a benchmark. It makes every write result in a "sync" command, which means flushing cache all the way to the disk.
Here is an example from a server of mine, to explain it better:
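The command in question looks roughly like this; the file name and size here are just placeholders:
dd if=/dev/zero of=testfile bs=1M count=500 conv=fdatasync   # placeholder name/size; reported time includes the final flush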
conv=fdatasync simply means flush after the write, and report the time including that flush. Alternatively, you can do:
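Something along these lines, again with a placeholder file name and size:
time (dd if=/dev/zero of=testfile bs=1M count=500; sync)   # time the write plus an explicit sync at the end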
And then calculate MB/s from the 2.95 s real time rather than from the 0.2 s that dd itself reported. But that is uglier and more work, since the stats printed by dd do not include the sync.
If you used "sync" you would flush after every write... maybe that means every block, which would run very slowly. "sync" should only be used on very strict systems, e.g. databases where the loss of a single transaction due to a crash is unacceptable (e.g. if I transfer a billion bucks from my bank account to yours, and the system crashes, and suddenly you have the money but so do I).
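If you want to see that per-write flushing effect without remounting anything, dd can ask for it itself; a small sketch (the size is only illustrative) that should crawl in much the same way:
dd if=/dev/zero of=testfile bs=1M count=10 oflag=sync   # O_SYNC: each write is flushed to disk before the next one starts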
Here is another explanation with additional options, which I read about long ago: http://romanrm.ru/en/dd-benchmark
And one more note: the benchmark you are doing this way is totally valid in my opinion, although not valid in many others' opinions. But it is not a real-life test: it is a single-threaded sequential write. If your real-life use case is like that, e.g. sending some big files over the network, then it may be a good benchmark. If your use case is different, e.g. an FTP server with 500 people uploading small files at the same time, then it is not very good.
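If that many-writers workload is what you actually care about, a load generator such as fio can approximate it; this is only a rough sketch, and every job parameter here is a made-up example rather than a recommended setting:
fio --name=manywriters --directory=/mnt/raid6 --rw=randwrite --bs=4k --size=256M \
    --numjobs=16 --ioengine=libaio --direct=1 --group_reporting   # 16 parallel small-block writers, bypassing the page cache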
And also, you should use a randomly generated file held in RAM for best results. It should be random data because some file systems are too smart when you feed them zeros. And it should come from a RAM file system (on Linux, tmpfs, which is mounted at /dev/shm) rather than from /dev/urandom directly, because /dev/random is really slow and /dev/urandom is faster (e.g. 75 MB/s) but still slower than the hdd.
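A minimal sketch of that idea, assuming tmpfs is available at /dev/shm and using placeholder names and sizes:
dd if=/dev/urandom of=/dev/shm/testsource bs=1M count=500            # build a random source file in RAM once (slow-ish, but only done once)
dd if=/dev/shm/testsource of=/mnt/raid6/delme bs=1M conv=fdatasync   # benchmark the array reading from RAM, flushing at the end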
Performance is dramatically worse because synchronous writing forces all of the parity read-modify-write work to hit the disks on every write.
In general, computing and writing parity is a relatively slow process, especially with RAID 6--in your case, not only does md have to fragment the data into four chunks, it then computes two chunks of parity for each stripe. In order to improve performance, RAID implementations (including md) will cache recently-used stripes in memory in order to compare the data to be written with the existing data and quickly recompute parity on write. If new data is written to a cached stripe, it can compare, fragment, and recompute parity without ever touching the disk, then flush it later. You've created a situation where md always misses the cache, in which case it has to read the stripe from disk, compare the data, fragment the new data, recompute parity, then flush the new stripe directly to disk. What would require zero reads and writes from/to disk on a cache hit becomes six reads and six writes for every stripe written.
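Incidentally, md exposes the size of that stripe cache through sysfs, so you can check it and, carefully, enlarge it; the value below is only an example, and a bigger cache costs RAM:
cat /sys/block/md0/md/stripe_cache_size           # number of cached stripe entries (the default is small)
echo 4096 > /sys/block/md0/md/stripe_cache_size   # example value only; larger caches use more memory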
Granted, the difference in performance you've observed is enormous (1.9GB/s versus 449KB/s), but I think it's all accounted for in how much work md is doing to maintain the integrity of the data.
This performance hit may be compounded by how you have the disks arranged... if you have them all on one controller, that much extra reading and writing will bring performance to a standstill.
Can you tell us how your 6 disks are set up? It sounds to me as if they might be SAN/DAS targets of some kind that probably sit on the same physical disks (so if all 6 reside on the same disk, that alone will make the array roughly 6 times slower than a single disk).
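A quick way to show how the disks are attached, assuming a reasonably recent util-linux with lsblk:
lsblk -o NAME,HCTL,TRAN,SIZE,MODEL /dev/sd[a-f]   # controller address, transport type, and model per disk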
Have a look at this link to anwerleaks.com.
So how did you set up your bitmap?
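Whether a write-intent bitmap is configured at all can be checked with mdadm; device names here follow the question:
mdadm --detail /dev/md0 | grep -i bitmap   # a bitmap line appears here if one is configured
mdadm --examine-bitmap /dev/sda            # dump bitmap details from one member disk (only meaningful if an internal bitmap exists)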