We have 4 Redhat Boxes Dell PowerEdge R630 (say a,b,c,d) having the following OS/packages.
RedHat EL 6.5 MySql Enterprise 5.6 DRBD 8.4 Corosync 1.4.7
We have setup 4-way stacked drbd resources as below:
Cluster Cluster-1: servers a and b connected to each other local lan Cluster Cluster-2: servers c and d
Cluster Cluster-1 and Cluster-2 are attached via stacked drbd via virtual IPs and are part of different data centres.
The drbd0 disks have been created locally on each servers 1GB, and are further attached to drbd10.
Overall setup comprises of 4 layers: Tomcat frontend application -> rabbitmq -> memcache -> mysql/drbd
We are experiencing very high Disk IO, even the activity is not must as of now. But the traffic/activity will be increased in couple of weeks, so we are worried that it will pose a very bad impact on performance. The I/o Useage is going high only on the stacked sites(90% and above sometimes). The secondary sites are not having this issue.At times the usage is going high when the application is ideal.
So kindly share some advice/tuning guidelines that will help us in resolving the issue;
resource clusterdb {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notifyemergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notifyemergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergencyshutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
}
startup {
degr-wfc-timeout 120; # 2 minutes.
outdated-wfc-timeout 2; # 2 seconds.
}
disk {
on-io-error detach;
no-disk-barrier;
no-md-flushes;
}
net {
cram-hmac-alg "sha1";
shared-secret "clusterdb";
after-sb-0pri disconnect;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
syncer {
rate 10M;
al-extents 257;
on-no-data-accessible io-error;
}
on sever-1 {
device /dev/drbd0;
disk /dev/sda2;
address 10.170.26.28:7788;
meta-disk internal;
}
on ever-2 {
device /dev/drbd0;
disk /dev/sda2;
address 10.170.26.27:7788;
meta-disk internal;
}
}
The stacked Configuration:-
resource clusterdb_stacked {
protocol A;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notifyemergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notifyemergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergencyshutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
}
startup {
degr-wfc-timeout 120; # 2 minutes.
outdated-wfc-timeout 2; # 2 seconds.
}
disk {
on-io-error detach;
no-disk-barrier;
no-md-flushes;
}
net {
cram-hmac-alg "sha1";
shared-secret "clusterdb";
after-sb-0pri disconnect;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
syncer {
rate 10M;
al-extents 257;
on-no-data-accessible io-error;
}
stacked-on-top-of clusterdb {
device /dev/drbd10;
address 10.170.26.28:7788;
}
stacked-on-top-of clusterdb_DR {
device /dev/drbd10;
address 10.170.26.60:7788;
}
}
The Requested data:-
Date || svctm(w_wait)|| %util
10:32:01 3.07 55.23 94.11
10:33:01 3.29 50.75 96.27
10:34:01 2.82 41.44 96.15
10:35:01 3.01 72.30 96.86
10:36:01 4.52 40.41 94.24
10:37:01 3.80 50.42 83.86
10:38:01 3.03 72.54 97.17
10:39:01 4.96 37.08 89.45
10:41:01 3.55 66.48 70.19
10:45:01 2.91 53.70 89.57
10:46:01 2.98 49.49 94.73
10:55:01 3.01 48.38 93.70
10:56:01 2.98 43.47 97.26
11:05:01 2.80 61.84 86.93
11:06:01 2.67 43.35 96.89
11:07:01 2.68 37.67 95.41
Update to question as per the comments:-
It is high actually comparing local with stacked.
Between local servers
[root@pri-site-valsql-a]#ping pri-site-valsql-b
PING pri-site-valsql-b.csn.infra.sm (10.170.24.23) 56(84) bytes of data.
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=1 ttl=64 time=0.143 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=2 ttl=64 time=0.145 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=3 ttl=64 time=0.132 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=4 ttl=64 time=0.145 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=5 ttl=64 time=0.150 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=6 ttl=64 time=0.145 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=7 ttl=64 time=0.132 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=8 ttl=64 time=0.127 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=9 ttl=64 time=0.134 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=10 ttl=64 time=0.149 ms
64 bytes from pri-site-valsql-b.csn.infra.sm (10.170.24.23): icmp_seq=11 ttl=64 time=0.147 ms
^C
--- pri-site-valsql-b.csn.infra.sm ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10323ms
rtt min/avg/max/mdev = 0.127/0.140/0.150/0.016 ms
Between two stacked servers
[root@pri-site-valsql-a]#ping dr-site-valsql-b
PING dr-site-valsql-b.csn.infra.sm (10.170.24.48) 56(84) bytes of data.
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=1 ttl=64 time=9.68 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=2 ttl=64 time=4.51 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=3 ttl=64 time=4.53 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=4 ttl=64 time=4.51 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=5 ttl=64 time=4.51 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=6 ttl=64 time=4.52 ms
64 bytes from dr-site-valsql-b.csn.infra.sm (10.170.24.48): icmp_seq=7 ttl=64 time=4.52 ms
^C
--- dr-site-valsql-b.csn.infra.sm ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6654ms
rtt min/avg/max/mdev = 4.510/5.258/9.686/1.808 ms
[root@pri-site-valsql-a]#
The output showing high I/O:-
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
drbd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.06 0.00 0.00 99.94
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
drbd0 0.00 0.00 0.00 2.00 0.00 16.00 8.00 0.90 1.50 452.25 90.45
avg-cpu: %user %nice %system %iowait %steal %idle
0.25 0.00 0.13 0.50 0.00 99.12
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
drbd0 0.00 0.00 1.00 44.00 8.00 352.00 8.00 1.07 2.90 18.48 83.15
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 0.06 0.25 0.00 99.56
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
drbd0 0.00 0.00 0.00 31.00 0.00 248.00 8.00 1.01 2.42 27.00 83.70
avg-cpu: %user %nice %system %iowait %steal %idle
0.19 0.00 0.06 0.00 0.00 99.75
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
drbd0 0.00 0.00 0.00 2.00 0.00 16.00 8.00 0.32 1.50 162.25 32.45
Edited properties file.But still no luck
disk {
on-io-error detach;
no-disk-barrier;
no-disk-flushes;
no-md-flushes;
c-plan-ahead 0;
c-fill-target 24M;
c-min-rate 80M;
c-max-rate 300M;
al-extents 3833;
}
net {
cram-hmac-alg "sha1";
shared-secret "clusterdb";
after-sb-0pri disconnect;
after-sb-1pri disconnect;
after-sb-2pri disconnect;
rr-conflict disconnect;
max-epoch-size 20000;
max-buffers 20000;
unplug-watermark 16;
}
syncer {
rate 100M;
on-no-data-accessible io-error;
}
I don't see the stacked resource in your configuration. You also didn't mention any version numbers, but seeing al-extents set so low makes me think you're running something ancient (8.3.x) or followed some very old instructions.
Regardless, assuming you're using protocol A for your stacked device's replication (asynchronous) you're still going to quickly fill your TCP send buffers when IO spikes and consequently hit IO wait while the buffer flushes; DRBD needs to put its replicated writes somewhere and can only have so many unacknowledged replicated writes in flight.
IO wait contributes to system load. If you temporarily disconnect the stacked resource does the system load settle? That'd be one way to verify that this is the issue. You could also look at your TCP buffers with something like netstat or ss to see how full they are when your load is high.
Unless the latency and throughput of the connection between your sites is amazing (dark fiber, or something), you probably need/want to look into using DRBD Proxy from LINBIT; it let's you use system memory to buffer writes.