I have a FreeBSD 10.0 system running MySQL with its InnoDB database files stored on ZFS. It had seemed fine for months, but lately, and seemingly suddenly, performance has bottomed out. After spending a while debugging around MySQL, I finally decided to check whether simply reading the database files from the filesystem was slow.
I picked one table whose database file was about 16GB, ran:
time cat table.ibd > /dev/null
and got:
cat table.ibd > /dev/null 1.24s user 64.35s system 0% cpu 1:00:34.65 total
In comparison, a copy of the file (with some changes, I'm sure: the database is live) living on UFS on the same system gives me:
cat table.ibd > /dev/null 0.20s user 9.34s system 5% cpu 9.550 total
Here are non-default ZFS options on this system:
/boot/loader.conf:
vfs.zfs.arc_max=17179869184
-
/etc/sysctl.conf:
vfs.zfs.prefetch_disable=1
-
zfs get:
recordsize 16K
compression on
atime off
primarycache metadata
zfs:zfs_nocacheflush 1
There are 12 snapshots on this filesystem right now. (Which seems excessive; I'm going to see if any of them can be deleted.)
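For reference, they can be enumerated like this (and removed with zfs destroy once confirmed unneeded):
zfs list -rt snapshot mysqlrot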
There's no L2ARC for this pool (or any other pool on the system).
I've tried all three values for primarycache, and I've tried re-enabling prefetch, none of which had any significant effect.
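That experimenting amounted to commands like these (the dataset name mysqlrot/mysql is illustrative; the actual dataset name isn't shown above):
zfs set primarycache=all mysqlrot/mysql
sysctl vfs.zfs.prefetch_disable=0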
The zpool is four 2-disk mirrors:
% zpool status mysqlrot
pool: mysqlrot
state: ONLINE
scan: scrub repaired 0 in 9h45m with 0 errors on Fri Jun 26 12:46:33 2015
config:
NAME STATE READ WRITE CKSUM
mysqlrot ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
mfid9 ONLINE 0 0 0
mfid10 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
mfid11 ONLINE 0 0 0
mfid12 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
mfid13 ONLINE 0 0 0
mfid14 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
mfid15 ONLINE 0 0 0
mfid16 ONLINE 0 0 0
spares
mfid19 AVAIL
errors: No known data errors
One oddity is how the raw devices are set up. The storage controller is an MFI controller, and each raw disk is actually configured in the MFI controller as a 1-disk RAID0 volume:
% sudo mfiutil show volumes
mfi0 Volumes:
Id Size Level Stripe State Cache Name
mfid0 ( 185G) RAID-1 64K OPTIMAL Disabled <OS>
mfid1 ( 558G) RAID-0 64K OPTIMAL Disabled <DB0A>
mfid2 ( 558G) RAID-0 64K OPTIMAL Disabled <DB0B>
mfid3 ( 558G) RAID-0 64K OPTIMAL Disabled <DB1A>
mfid4 ( 558G) RAID-0 64K OPTIMAL Disabled <DB1B>
mfid5 ( 558G) RAID-0 64K OPTIMAL Disabled <DB2A>
mfid6 ( 558G) RAID-0 64K OPTIMAL Disabled <DB2B>
mfid7 ( 558G) RAID-0 64K OPTIMAL Disabled <DB3A>
mfid8 ( 558G) RAID-0 64K OPTIMAL Disabled <DB3B>
mfid9 ( 558G) RAID-0 64K OPTIMAL Disabled <DB4A>
mfid10 ( 558G) RAID-0 64K OPTIMAL Disabled <DB4B>
mfid11 ( 558G) RAID-0 64K OPTIMAL Disabled <DB5A>
mfid12 ( 558G) RAID-0 64K OPTIMAL Disabled <DB5B>
mfid13 ( 558G) RAID-0 64K OPTIMAL Disabled <DB6A>
mfid14 ( 558G) RAID-0 64K OPTIMAL Disabled <DB6B>
mfid15 ( 558G) RAID-0 64K OPTIMAL Disabled <DB7A>
mfid16 ( 558G) RAID-0 64K OPTIMAL Disabled <DB7B>
mfid17 ( 558G) RAID-0 64K OPTIMAL Disabled <DB8A>
mfid18 ( 558G) RAID-0 64K OPTIMAL Disabled <DB8B>
mfid19 ( 558G) RAID-0 64K OPTIMAL Disabled <SPARE0>
No errors that I've been able to find in any log file or utility.
Anyone have any ideas where to look?
Various data:
% zpool list mysqlrot
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
mysqlrot 2.17T 1.49T 701G 68% 1.00x ONLINE -
Copying a 15.3 GB file to the affected filesystem (from a UFS filesystem):
% time sudo cp test.file /var/lib/mysql/mysqlrot/test.file
sudo cp test.file /var/lib/mysql/mysqlrot/test.file 0.02s user 44.23s system 1% cpu 1:06.93 total
(That's 66.93 seconds.)
Reading same file from the affected filesystem:
# time cat test.file > /dev/null
cat test.file > /dev/null 4.23s user 268.50s system 0% cpu 25:29.27 total
(That's 1529.27 seconds: almost 23 times longer than the write.)
Interestingly, cp performs far better than cat:
% time sudo cp /var/lib/mysql/v4netrot/test.file /dev/null
sudo cp /var/lib/mysql/v4netrot/test.file /dev/null 0.03s user 33.63s system 0% cpu 3:05.99 total
(185.99 seconds)
iostat data during the copy to the affected filesystem:
extended device statistics
device r/s w/s kr/s kw/s qlen svc_t %b
mfid0 0.0 0.6 0.0 8.8 0 0.6 0
mfid1 0.4 12.6 2.7 70.5 0 0.5 0
mfid2 1.8 12.8 10.8 70.5 0 1.6 2
mfid3 0.2 10.0 1.5 92.4 0 0.7 0
mfid4 0.0 10.4 0.0 92.4 0 0.5 0
mfid5 0.2 9.4 1.2 39.7 0 0.5 0
mfid6 0.6 9.8 3.9 39.7 0 0.6 0
mfid7 0.6 4.8 0.3 46.1 0 0.9 0
mfid8 1.8 4.8 11.4 46.1 0 0.8 0
mfid9 0.4 1327.2 2.9 26686.5 0 0.5 23
mfid10 0.8 1328.2 1.8 26686.5 0 0.5 20
mfid11 1.4 1304.8 8.4 26357.6 0 0.5 23
mfid12 1.4 1304.6 2.6 26357.6 0 0.6 31
mfid13 1.6 1120.6 3.5 26194.2 0 0.6 25
mfid14 0.4 1122.6 2.7 26194.2 0 0.5 22
mfid15 0.8 1406.6 5.5 26188.5 0 0.5 22
mfid16 1.0 1174.6 2.0 21534.3 10 4.9 74
mfid17 5.8 24.2 152.9 300.6 0 0.3 0
mfid18 4.0 23.6 76.7 300.6 0 0.3 0
mfid19 0.0 0.0 0.0 0.0 0 0.0 0
The kw/s data ranged from about 17k to about 25k, and was pretty consistent between drives.
iostat when catting from the affected filesystem:
extended device statistics
device r/s w/s kr/s kw/s qlen svc_t %b
mfid0 0.0 0.0 0.0 0.0 0 0.0 0
mfid1 0.4 17.2 11.4 63.4 0 0.5 0
mfid2 0.0 17.0 0.0 63.4 0 0.4 0
mfid3 0.0 14.0 0.0 56.4 0 0.4 0
mfid4 0.4 13.6 0.2 56.4 0 0.4 0
mfid5 0.8 9.6 4.8 37.3 0 0.8 0
mfid6 0.0 9.8 0.0 37.3 0 0.4 0
mfid7 0.2 3.8 17.2 11.9 0 0.6 0
mfid8 0.2 3.8 1.4 11.9 0 0.5 0
mfid9 1208.8 0.0 6831.4 0.0 0 0.1 11
mfid10 129.4 0.0 780.7 0.0 0 0.2 2
mfid11 906.4 0.0 5858.5 0.0 0 0.1 10
mfid12 600.5 0.0 2673.0 0.0 0 0.1 5
mfid13 136.2 0.0 803.9 0.0 0 0.2 3
mfid14 316.1 0.0 1895.3 0.0 0 0.1 4
mfid15 243.6 0.0 1414.5 0.0 0 0.1 2
mfid16 129.0 0.0 768.8 0.0 0 0.2 2
mfid17 3.8 25.8 29.8 274.1 0 0.2 0
mfid18 6.0 25.6 96.6 274.1 0 0.2 0
mfid19 0.0 0.0 0.0 0.0 0 0.0 0
The kr/s numbers were wildly inconsistent, but those are representative.
iostat while cping from the affected filesystem (to /dev/null):
extended device statistics
device r/s w/s kr/s kw/s qlen svc_t %b
mfid0 0.0 0.0 0.0 0.0 0 0.0 0
mfid1 21.0 66.6 107.6 2351.7 0 0.9 6
mfid2 17.6 66.8 106.8 2351.7 0 1.0 6
mfid3 17.6 39.0 116.9 2111.3 0 1.1 6
mfid4 18.8 39.6 99.8 2111.3 0 1.3 7
mfid5 23.2 62.4 172.2 2076.1 0 1.1 7
mfid6 23.0 62.0 130.0 2076.1 0 1.4 9
mfid7 16.2 62.6 112.6 2125.3 0 1.0 6
mfid8 17.4 63.0 107.6 2125.3 0 0.7 4
mfid9 237.5 44.6 5140.6 807.0 0 3.1 22
mfid10 263.7 43.6 5530.5 807.0 0 1.5 14
mfid11 252.7 55.8 5297.6 802.4 0 2.6 20
mfid12 298.1 55.6 5361.9 802.4 0 2.5 21
mfid13 275.3 46.2 5116.4 801.4 0 2.8 22
mfid14 252.9 42.4 5107.7 801.4 2 3.1 21
mfid15 270.9 43.8 4546.5 943.7 0 1.2 12
mfid16 257.7 44.0 5642.5 943.7 0 2.5 19
mfid17 7.8 23.0 73.1 244.9 0 0.3 0
mfid18 0.8 24.2 44.4 244.9 0 0.2 0
mfid19 0.0 0.0 0.0 0.0 0 0.0 0
It does not seem CPU-bound. top shows that a cp from the affected filesystem to /dev/null consumes about 18% of one core (out of 48), and the rest of the cores show over 95% idle.
last pid: 12474; load averages: 1.65, 1.26, 1.14 up 39+05:42:19 14:29:08
147 processes: 1 running, 146 sleeping
CPU 0: 0.0% user, 0.0% nice, 1.6% system, 0.0% interrupt, 98.4% idle
CPU 1: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 2: 0.4% user, 0.0% nice, 0.8% system, 0.0% interrupt, 98.8% idle
CPU 3: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 4: 0.0% user, 0.0% nice, 2.3% system, 0.0% interrupt, 97.7% idle
CPU 5: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 6: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 7: 0.0% user, 0.0% nice, 2.7% system, 0.0% interrupt, 97.3% idle
CPU 8: 0.4% user, 0.0% nice, 1.9% system, 0.0% interrupt, 97.7% idle
CPU 9: 0.0% user, 0.0% nice, 2.3% system, 0.0% interrupt, 97.7% idle
CPU 10: 0.4% user, 0.0% nice, 1.9% system, 1.2% interrupt, 96.5% idle
CPU 11: 0.0% user, 0.0% nice, 3.9% system, 0.0% interrupt, 96.1% idle
CPU 12: 0.0% user, 0.0% nice, 1.6% system, 0.8% interrupt, 97.7% idle
CPU 13: 0.0% user, 0.0% nice, 2.7% system, 0.0% interrupt, 97.3% idle
CPU 14: 0.0% user, 0.0% nice, 1.6% system, 0.0% interrupt, 98.4% idle
CPU 15: 0.4% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.4% idle
CPU 16: 0.0% user, 0.0% nice, 1.6% system, 0.0% interrupt, 98.4% idle
CPU 17: 0.4% user, 0.0% nice, 3.1% system, 0.0% interrupt, 96.5% idle
CPU 18: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 19: 0.8% user, 0.0% nice, 0.8% system, 0.0% interrupt, 98.4% idle
CPU 20: 0.0% user, 0.0% nice, 2.7% system, 0.0% interrupt, 97.3% idle
CPU 21: 0.4% user, 0.0% nice, 1.9% system, 0.0% interrupt, 97.7% idle
CPU 22: 0.0% user, 0.0% nice, 2.3% system, 0.0% interrupt, 97.7% idle
CPU 23: 0.4% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.4% idle
CPU 24: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 25: 0.8% user, 0.0% nice, 0.8% system, 0.0% interrupt, 98.4% idle
CPU 26: 0.4% user, 0.0% nice, 0.8% system, 0.0% interrupt, 98.8% idle
CPU 27: 0.0% user, 0.0% nice, 4.7% system, 0.0% interrupt, 95.3% idle
CPU 28: 0.0% user, 0.0% nice, 0.8% system, 0.0% interrupt, 99.2% idle
CPU 29: 0.4% user, 0.0% nice, 0.8% system, 0.0% interrupt, 98.8% idle
CPU 30: 0.0% user, 0.0% nice, 2.7% system, 0.0% interrupt, 97.3% idle
CPU 31: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 32: 0.0% user, 0.0% nice, 0.8% system, 0.0% interrupt, 99.2% idle
CPU 33: 0.0% user, 0.0% nice, 3.5% system, 0.0% interrupt, 96.5% idle
CPU 34: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 35: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
CPU 36: 1.2% user, 0.0% nice, 1.6% system, 0.0% interrupt, 97.3% idle
CPU 37: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 38: 0.4% user, 0.0% nice, 1.2% system, 0.4% interrupt, 98.1% idle
CPU 39: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
CPU 40: 0.0% user, 0.0% nice, 1.9% system, 0.0% interrupt, 98.1% idle
CPU 41: 0.0% user, 0.0% nice, 1.2% system, 0.0% interrupt, 98.8% idle
CPU 42: 0.4% user, 0.0% nice, 1.9% system, 0.0% interrupt, 97.7% idle
CPU 43: 0.0% user, 0.0% nice, 4.7% system, 0.0% interrupt, 95.3% idle
CPU 44: 0.0% user, 0.0% nice, 2.3% system, 0.0% interrupt, 97.7% idle
CPU 45: 0.4% user, 0.0% nice, 2.7% system, 0.0% interrupt, 96.9% idle
CPU 46: 0.4% user, 0.0% nice, 3.5% system, 0.0% interrupt, 96.1% idle
CPU 47: 0.4% user, 0.0% nice, 1.6% system, 0.0% interrupt, 98.1% idle
Mem: 82G Active, 23G Inact, 15G Wired, 3340K Cache, 1655M Buf, 4858M Free
ARC: 12G Total, 527M MFU, 11G MRU, 4375K Anon, 377M Header, 89M Other
Swap: 4096M Total, 12M Used, 4084M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12461 root 31 0 12268K 2552K zio->i 30 0:07 18.16% cp
3151 mysql 24 0 92309M 92073M uwait 3 22:06 7.47% mysqld{mysqld}
3151 mysql 22 0 92309M 92073M select 33 15:30 4.79% mysqld{mysqld}
3151 mysql 20 0 92309M 92073M uwait 46 895:41 0.49% mysqld{mysqld}
12175 wfaulk 20 0 23864K 6404K CPU25 25 0:03 0.29% top
6074 root 20 0 84348K 40372K kqread 25 0:11 0.20% vc-aggregator{vc-aggregator}
(The rest of the processes show 0.00% CPU utilization.)
Did you try restarting the machine? Did it make any difference?
If a restart did not help, what you are facing may be a problem with excessive fragmentation, which is the #1 enemy of rotating media (read: HDDs). The high number of snapshots can exacerbate the situation.
To confirm the issue, try the following:
fallocate -l <size> testfile.raw
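Note that fallocate is a Linux utility and is not in the FreeBSD base system. A rough equivalent sketch for this box: write a fresh (and therefore presumably unfragmented) file of comparable size and time a sequential read of it. The path and size below are illustrative, and /dev/urandom is used so that compression=on doesn't shrink the reads away:
dd if=/dev/urandom of=/var/lib/mysql/mysqlrot/testfile.raw bs=1m count=16384
time cat /var/lib/mysql/mysqlrot/testfile.raw > /dev/null
If the freshly written file reads at full speed while table.ibd does not, fragmentation is the likely culprit.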
If you confirm that this is a fragmentation issue, follow these steps (a fuller sketch follows below):
mv table.ibd table.ibd.old
cp -a table.ibd.old table.ibd
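A fuller sketch of that copy-to-defragment sequence, assuming the stock FreeBSD mysql-server rc script and an illustrative dataset path (the table must be quiesced, i.e. the database stopped, so the file doesn't change mid-copy):
service mysql-server stop
cd /var/lib/mysql/mysqlrot/<database>
mv table.ibd table.ibd.old
cp -a table.ibd.old table.ibd
service mysql-server start
rm table.ibd.old   (only after verifying the table is healthy)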
EDIT after iostat update:
Thank you for the iostat numbers.
You moved a 15 GB file to the ZFS array in about 67 s, which means an ingestion rate of about 223 MB/s, or 55 MB/s per mirror pair. On the other hand, your iostat shows only about 25 MB/s written per disk, so I attribute the discrepancy to a compression ratio of about 2:1.
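Spelled out (using 15 GB ≈ 15,000 MB and the 67 s copy time):
15,000 MB / 67 s ≈ 224 MB/s into the pool
224 MB/s / 4 mirror vdevs ≈ 56 MB/s per vdev
iostat shows ≈ 26 MB/s written to each disk, roughly half of that, which is consistent with ~2:1 compression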
OK, this is good. However, during read strange things happen...
Discarding the cat result (cat uses a very small buffer by default, and with the prefetcher disabled it is surely much slower than cp), the cp command is way too slow: you copied a 15 GB file in about 1530 s, which means an extraction rate of only 10 MB/s, and that is already factoring in the compression advantage. On the other hand, your iostat numbers show over 5 MB/s of reads per disk, which adds up to about 40 MB/s across the array. Factoring in the 2:1 compression ratio, that should give you a transfer rate north of 80 MB/s. This means you are at about 1/8 of your read potential.
The question is: why? It really seems as if the CPU were maxed out during the transfer. Can you run a top and a dstat session while reading the affected file? If possible, configure top to show the per-CPU load.
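For instance (a sketch using FreeBSD tools, since dstat is a Linux utility; gstat and iostat -x give similar per-device detail, and top -P shows per-CPU load):
top -P
gstat -I 1s
iostat -x 5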
Follow-up: The MFI controller completely failed yesterday. I'm guessing this slowness was just an early symptom of that impending hardware failure.