I have a 6 disk raid6 mdadm array I'd like to benchmark writes to:
root@ubuntu:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sda[0] sdf[5] sde[4] sdd[3] sdc[2] sdb[1]
1953545984 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
Benchmarks can be inaccurate because of cache - for example, notice the write speed here is higher than it should be:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.276026 s, 380 MB/s
Now we can disable each disk's write cache pretty easily:
root@ubuntu:~# hdparm -W0 /dev/sd*
/dev/sda:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdc:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdd:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sde:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
/dev/sdf:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
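(Afterwards the write caches can be re-enabled the same way, flipping -W0 to -W1; this just mirrors the command above and assumes the same six whole-disk devices:)
hdparm -W1 /dev/sd*   # turn drive write caching back on once the benchmark is done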
But the Linux page cache is still in play:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00566339 s, 1.9 GB/s
To disable Linux caching, we can mount the filesystem synchronously:
mount -o remount,sync /mnt/raid6
But after this, writes become far slower than they should be:
root@ubuntu:/mnt/raid6# dd if=/dev/zero of=delme bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 23.3311 s, 449 kB/s
It's as if mdadm requires async mounts in order to function. What's going on here?
That's not quite right... sync doesn't simply disable caching like you want in a benchmark. It makes every write result in a "sync" command, which means flushing cache all the way to the disk.
Here is an example from a server of mine, to explain it better:
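The command in question looks roughly like this; the file name and size here are just placeholders:
dd if=/dev/zero of=testfile bs=1M count=500 conv=fdatasync   # placeholder name/size; reported time includes the final flush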
conv=fdatasync simply means flush after the write, and report the time including that flush. Alternatively, you can do:
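Something along these lines, again with a placeholder file name and size:
time (dd if=/dev/zero of=testfile bs=1M count=500; sync)   # time the write plus an explicit sync at the end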
And then calculate MB/s from the 2.95 s real time rather than from the 0.2 s that dd itself reported. But that is uglier and more work, since the stats printed by dd do not include the sync.
If you used "sync" you would flush after every write... maybe that means every block, which would run very slowly. "sync" should only be used on very strict systems, e.g. databases where the loss of a single transaction due to a crash is unacceptable (e.g. if I transfer a billion bucks from my bank account to yours, and the system crashes, and suddenly you have the money but so do I).
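If you want to see that per-write flushing effect without remounting anything, dd can ask for it itself; a small sketch (the size is only illustrative) that should crawl in much the same way:
dd if=/dev/zero of=testfile bs=1M count=10 oflag=sync   # O_SYNC: each write is flushed to disk before the next one starts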
Here is another explanation with additional options, which I read about long ago: http://romanrm.ru/en/dd-benchmark
And one more note: the benchmark you are doing this way is totally valid in my opinion, although not valid in many others' opinions. But it is not a real-life test: it is a single-threaded sequential write. If your real-life use case is like that, e.g. sending some big files over the network, then it may be a good benchmark. If your use case is different, e.g. an FTP server with 500 people uploading small files at the same time, then it is not very good.
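If that many-writers workload is what you actually care about, a load generator such as fio can approximate it; this is only a rough sketch, and every job parameter here is a made-up example rather than a recommended setting:
fio --name=manywriters --directory=/mnt/raid6 --rw=randwrite --bs=4k --size=256M \
    --numjobs=16 --ioengine=libaio --direct=1 --group_reporting   # 16 parallel small-block writers, bypassing the page cache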
And also, you should use a randomly generated file held in RAM for best results. It should be random data because some file systems are too smart when you feed them zeros. And it should come from a RAM file system (on Linux, tmpfs, which is mounted at /dev/shm) rather than from /dev/urandom directly, because /dev/random is really slow and /dev/urandom is faster (e.g. 75 MB/s) but still slower than the hdd.
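A minimal sketch of that idea, assuming tmpfs is available at /dev/shm and using placeholder names and sizes:
dd if=/dev/urandom of=/dev/shm/testsource bs=1M count=500            # build a random source file in RAM once (slow-ish, but only done once)
dd if=/dev/shm/testsource of=/mnt/raid6/delme bs=1M conv=fdatasync   # benchmark the array reading from RAM, flushing at the end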
Performance is dramatically worse because synchronous writing forces all of the parity read-modify-write work to hit the disks on every write.
In general, computing and writing parity is a relatively slow process, especially with RAID 6--in your case, not only does md have to fragment the data into four chunks, it then computes two chunks of parity for each stripe. In order to improve performance, RAID implementations (including md) will cache recently-used stripes in memory in order to compare the data to be written with the existing data and quickly recompute parity on write. If new data is written to a cached stripe, it can compare, fragment, and recompute parity without ever touching the disk, then flush it later. You've created a situation where md always misses the cache, in which case it has to read the stripe from disk, compare the data, fragment the new data, recompute parity, then flush the new stripe directly to disk. What would require zero reads and writes from/to disk on a cache hit becomes six reads and six writes for every stripe written.
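Incidentally, md exposes the size of that stripe cache through sysfs, so you can check it and, carefully, enlarge it; the value below is only an example, and a bigger cache costs RAM:
cat /sys/block/md0/md/stripe_cache_size           # number of cached stripe entries (the default is small)
echo 4096 > /sys/block/md0/md/stripe_cache_size   # example value only; larger caches use more memory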
Granted, the difference in performance you've observed is enormous (1.9GB/s versus 449KB/s), but I think it's all accounted for in how much work md is doing to maintain the integrity of the data.
This performance hit may be compounded by how you have the disks arranged... if you have them all on one controller, that much extra reading and writing will bring performance to a standstill.
Can you tell us how your 6 disks are set up? It sounds to me as if they might be SAN/DAS targets of some kind that probably sit on the same physical disks (so if all 6 reside on the same disk, that alone will make the array roughly 6 times slower than a single disk).
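A quick way to show how the disks are attached, assuming a reasonably recent util-linux with lsblk:
lsblk -o NAME,HCTL,TRAN,SIZE,MODEL /dev/sd[a-f]   # controller address, transport type, and model per disk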
Have a look at this link to anwerleaks.com.
So how did you set up your bitmap?
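Whether a write-intent bitmap is configured at all can be checked with mdadm; device names here follow the question:
mdadm --detail /dev/md0 | grep -i bitmap   # a bitmap line appears here if one is configured
mdadm --examine-bitmap /dev/sda            # dump bitmap details from one member disk (only meaningful if an internal bitmap exists)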