I'm trying to configure MD RAID1 (using mdadm) with the --write-mostly
option so that a network (EBS) volume and a local drive mirror one another. The idea is that the local drive is ephemeral to my instance but has much better performance.
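For context, here is a minimal sketch of the array I have in mind. The device paths are hypothetical (local NVMe as /dev/nvme1n1, EBS volume as /dev/xvdf); --write-mostly marks the EBS member so that reads are served from the local drive when possible.
# Hypothetical device names; substitute the actual local and EBS devices.
# Devices listed after --write-mostly are flagged write-mostly, so MD
# prefers the local drive for reads.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
/dev/nvme1n1 --write-mostly /dev/xvdf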
To vet this idea, I first get a baseline performance estimate of each drive using the following two scripts.
# Write performance
fio -name=RandWrite -group_reporting -allow_file_create=0 \
-direct=1 -iodepth=128 -rw=randwrite -ioengine=io_uring -bs=32k \
-time_based=1 -ramp_time=10 -runtime 10 -numjobs=8 \
-randrepeat=0 -norandommap=1 -filename=$BENCHMARK_TARGET
# Read performance
fio -name=RandRead -group_reporting -allow_file_create=0 \
-direct=1 -iodepth=128 -rw=randread -ioengine=io_uring -bs=32k \
-time_based=1 -ramp_time=10 -runtime 10 -numjobs=8 \
-randrepeat=0 -norandommap=1 -filename=$BENCHMARK_TARGET
Results:
- Network drive: 117 MiB/s write, 117 MiB/s read
- Local drive: 862 MiB/s write, 665 MiB/s read
The problem comes when I introduce mdadm. Even with a trivial single-device "RAID1" (no mirror at all), write performance is severely worse on the network drive.
mdadm --build /dev/md0 --verbose --level=1 --force --raid-devices=1 "$TARGET"
# mdadm --detail /dev/md0
/dev/md0:
Version :
Creation Time : Mon Sep 30 14:22:41 2024
Raid Level : raid1
Array Size : 10485760 (10.00 GiB 10.74 GB)
Used Dev Size : 10485760 (10.00 GiB 10.74 GB)
Raid Devices : 1
Total Devices : 1
State : clean
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : resync
Number Major Minor RaidDevice State
0 8 16 0 active sync /dev/sdb
- 0-mirror RAID1 array backed by network drive: 69.9 MiB/s write, 118 MiB/s read
- 0-mirror RAID1 array backed by local drive: 868 MiB/s write, 665 MiB/s read
As we can see, write performance barely changed for the local drive (MD RAID vs. raw access), but it is severely impaired when the network drive is accessed via MD RAID. Why does this happen?
Without knowing the exact mdadm implementation, here is my educated guess.
I think that in a RAID 1 setup, the RAID subsystem waits for every member drive to acknowledge a write before completing it and moving on to the next request. On top of that, the mismatch in performance between the drives may introduce additional delays, which would contribute to the 69.9 MiB/s vs. 117 MiB/s write speeds.
I don't think it is feasible to create RAID arrays from devices whose access speeds are vastly different. RAID wasn't designed for this use case.
You might want to look at a cluster filesystem such as GFS2 or OCFS2; those might be better suited to your use case.
As near as I can tell, this is a failure mode caused by overloading the MD kernel module with IOPS.
When I modify my scripts to use iodepth=64 numjobs=1, I see no loss in performance on the raw drives, and the RAID1 write-performance penalty disappears.
(Final scripts and the numjobs=8 vs. numjobs=1 result tables omitted; a sketch of the modified write benchmark follows.)
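As a reconstruction (not the original script verbatim), the modified write benchmark would look like this, assuming the only changes from the script above are iodepth 128 → 64 and numjobs 8 → 1; the read script is changed the same way:
# Write performance, reduced queue pressure (single job, iodepth=64)
fio -name=RandWrite -group_reporting -allow_file_create=0 \
-direct=1 -iodepth=64 -rw=randwrite -ioengine=io_uring -bs=32k \
-time_based=1 -ramp_time=10 -runtime 10 -numjobs=1 \
-randrepeat=0 -norandommap=1 -filename=$BENCHMARK_TARGET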
I am guessing that too many IOPS combined with the slower drive lead to excessive queue length, which in turn causes some sort of lock contention in the kernel module, but I don't know enough of the details to be sure. What I have learned is that I'll need a more accurate benchmark to decide properly whether this approach is viable for my use case.
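One way to probe the queue-length guess would be to watch queue size and write latency on the MD device and its member while fio runs, e.g. with iostat (a sketch; I haven't done this systematically):
# Extended device stats, refreshed every second while fio runs elsewhere.
# aqu-sz is the average queue length, w_await the average write latency in ms.
iostat -xm 1 md0 sdb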
You are probably bound by the MD write-intent bitmap. You can try disabling it (via --bitmap=none at creation time, or later with --grow), but be aware that an unclean shutdown of a bitmap-less array means a full resync after restart.
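A sketch of how that could look on a live array (assuming the array is /dev/md0):
# Check whether the array currently has a write-intent bitmap
mdadm --detail /dev/md0 | grep -i bitmap
cat /proc/mdstat

# Remove the bitmap; an unclean shutdown will now force a full resync
mdadm --grow /dev/md0 --bitmap=none

# Re-add an internal bitmap later if you want the crash safety back
mdadm --grow /dev/md0 --bitmap=internal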