I have huge performance issues using MongoDB (I believe it is an mmapped DB) with ZFS on Linux.
Our MongoDB workload is almost only writes. On replicas without ZFS, the disk is completely busy in ~5s spikes when the app writes into the DB every 30s, with no disk activity in between, so I take that as the baseline behaviour to compare against.
On replicas with ZFS, the disk is completely busy all the time, with the replicas struggling to keep up to date with the MongoDB primary. I have lz4 compression enabled on all replicas, and the space savings are great, so there should be much less data hitting the disk.
So on these ZFS servers, I first had the default recordsize=128k. Then I wiped the data and set recordsize=8k before resyncing the Mongo data. Then I wiped again and tried recordsize=1k. I also tried recordsize=8k without checksums.
Nevertheless, none of it solved anything; the disk was always kept 100% busy. Only once, on one server with recordsize=8k, was the disk much less busy than on any non-ZFS replica, but after trying different settings and going back to recordsize=8k, the disk was at 100% again; I could not reproduce the previous good behaviour, and could not see it on any other replica either.
Moreover, there should be almost only writes, but I see that on all replicas, under different settings, the disk is completely busy with 75% reads and only 25% writes.
(Note: I believe MongoDB is an mmapped DB. I was told to try MongoDB in AIO mode, but I did not find how to set it, and on another server running MySQL InnoDB I realised that ZFS on Linux did not support AIO anyway.)
My servers run CentOS 6.5, kernel 2.6.32-431.5.1.el6.x86_64, spl-0.6.2-1.el6.x86_64, zfs-0.6.2-1.el6.x86_64.
#PROD 13:44:55 root@rum-mongo-backup-1:~: zfs list
NAME USED AVAIL REFER MOUNTPOINT
zfs 216G 1.56T 32K /zfs
zfs/mongo_data-rum_a 49.5G 1.56T 49.5G /zfs/mongo_data-rum_a
zfs/mongo_data-rum_old 166G 1.56T 166G /zfs/mongo_data-rum_old
#PROD 13:45:20 root@rum-mongo-backup-1:~: zfs list -t snapshot
no datasets available
#PROD 13:45:29 root@rum-mongo-backup-1:~: zfs list -o atime,devices,compression,copies,dedup,mountpoint,recordsize,casesensitivity,xattr,checksum
ATIME DEVICES COMPRESS COPIES DEDUP MOUNTPOINT RECSIZE CASE XATTR CHECKSUM
off on lz4 1 off /zfs 128K sensitive sa off
off on lz4 1 off /zfs/mongo_data-rum_a 8K sensitive sa off
off on lz4 1 off /zfs/mongo_data-rum_old 8K sensitive sa off
What could be going on there? What should I look at to figure out what ZFS is doing or which setting is badly set?
EDIT1:
Hardware: these are rented servers with 8 vcores on a Xeon 1230 or 1240, 16 or 32GB RAM, zfs_arc_max=2147483648, and HP hardware RAID1. So the ZFS zpool is on /dev/sda2 and does not know that there is an underlying RAID1. Even though this is a suboptimal setup for ZFS, I still do not understand why the disk is choking on reads while the DB does only writes.
I understand the many reasons, which we do not need to go over again here, why this is bad for ZFS, and I will soon have a JBOD/NORAID server on which I can run the same tests with ZFS's own RAID1 implementation on the sda2 partitions, with the /, /boot and swap partitions doing software RAID1 with mdadm.
First off, it's worth stating that ZFS is not a supported filesystem for MongoDB on Linux - the recommended filesystems are ext4 or XFS. Because ZFS is not even checked for on Linux (see SERVER-13223 for example), it will not use sparse files, instead attempting to pre-allocate (fill with zeroes), and that will mean horrendous performance on a COW filesystem. Until that is fixed, adding new data files will be a massive performance hit on ZFS (which you will be trying to do frequently with your writes). While you are not doing that, performance should improve, but if you are adding data fast enough you may never recover between allocation hits.
Additionally, ZFS does not support Direct IO, so you will be copying data multiple times into memory (mmap, ARC, etc.) - I suspect that this is the source of your reads, but I would have to test to be sure. The last time I saw any testing with MongoDB/ZFS on Linux the performance was poor, even with the ARC on an SSD - ext4 and XFS were massively faster. ZFS might be viable for MongoDB production usage on Linux in the future, but it's not ready right now.
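If you want to sanity-check that yourself, one quick (non-authoritative) comparison is to watch what the pool reports against what the kernel block layer reports during one of the write bursts. The pool name zfs and device sda below are taken from your output; adjust as needed:
zpool iostat -v zfs 5        # reads/writes as seen by ZFS, per vdev, every 5 seconds
iostat -xm sda 5             # reads/writes as seen by the block layer for the underlying disk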
This may sound a bit crazy, but I support another application that benefits from ZFS volume management attributes, but does not perform well on the native ZFS filesystem.
My solution?!?
XFS on top of ZFS zvols.
Why?!?
Because XFS performs well and eliminates the application-specific issues I was facing with native ZFS. ZFS zvols allow me to thin-provision volumes, add compression, enable snapshots and make efficient use of the storage pool. More important for my app, the ARC caching of the zvol reduced the I/O load on the disks.
See if you can follow this output:
ZFS zvol, created with:
zfs create -o volblocksize=128K -s -V 800G vol0/pprovol
(note that auto-snapshots are enabled). Properties of the ZFS zvol block device, a 900GB volume (143GB actual size on disk):
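The original property listing isn't reproduced here; if you want to pull the same information on a setup like this, something along these lines works (standard ZFS properties, dataset name from the create command above):
zfs get volsize,volblocksize,compression,compressratio,used vol0/pprovol
ls -l /dev/zvol/vol0/pprovol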
XFS information on ZFS block device:
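The original xfs_info output isn't shown; as a rough sketch, the filesystem goes directly onto the zvol's block device and its geometry can be read back once it is mounted (mkfs options and the /ppro mount point are illustrative placeholders, not necessarily what was used):
mkfs.xfs -f /dev/zvol/vol0/pprovol
xfs_info /ppro       # /ppro = placeholder mount point, after mounting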
XFS mount options:
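The exact options from the original setup aren't listed; typical options for this kind of workload would look something like the line below (again /ppro is a placeholder, and nobarrier is only sensible when the underlying write cache is protected, e.g. a battery-backed controller):
mount -o noatime,nobarrier,logbufs=8,logbsize=256k /dev/zvol/vol0/pprovol /ppro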
Note: I also do this on top of HP Smart Array hardware RAID in some cases.
The pool creation looks like:
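The actual command isn't reproduced above; a minimal sketch, assuming a pool named vol0 with lz4 compression on a hypothetical device:
zpool create -o ashift=12 -O compression=lz4 vol0 /dev/sdb    # /dev/sdb is a placeholder device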
With the result looking like:
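The resulting listing isn't reproduced either; on any pool set up this way, the equivalent view comes from:
zpool status vol0
zfs list -r -o name,used,refer,compressratio vol0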
We were looking into running Mongo on ZFS and saw that this post raised major concerns about the performance available. Two years on, we wanted to see how newer releases of Mongo, which use WiredTiger rather than mmap, performed on the now officially supported ZFS that ships with the latest Ubuntu Xenial release.
In summary, it was clear that ZFS doesn't perform quite as well as EXT4 or XFS; however, the performance gap isn't that significant, especially when you consider the extra features that ZFS offers.
I've made a blog post about our findings and methodology. I hope you find it useful!
I believe your disk is busy doing reads because of the zfs_arc_max=2147483648 setting. Here you are explicitly limiting the ARC to 2GB, even though you have 16-32GB of RAM. ZFS is extremely memory-hungry and zealous when it comes to the ARC. If you have non-ZFS replicas identical to the ZFS replicas (HW RAID1 underneath), doing some rough maths on the size of your write bursts against that 2GB ARC suggests you are probably invalidating the whole ARC cache in about 5 seconds. The ARC is (to some degree) "intelligent" and will try to retain both the most recently written blocks and the most used ones, so your ZFS volume may well be trying to provide you a decent data cache with the limited space it has. Try raising zfs_arc_max to half of your RAM (or even more) and using arc_shrink_shift to more aggressively evict ARC cache data.
Here you can find a 17-part blog series on tuning and understanding ZFS filesystems.
Here you can find an explanation of the ARC shrink shift setting (first paragraph), which will allow you to reclaim more ARC RAM upon eviction and keep it under control.
I'm unsure of the reliability of the XFS-on-zvol solution. Even though ZFS is COW, XFS is not. Suppose that XFS is updating its metadata and the machine loses power: ZFS will read the last good copy of the data thanks to the COW feature, but XFS won't know of that change. Your XFS volume may end up "snapshotted" to the version before the power failure for one half and to the version after the power failure for the other half (because ZFS doesn't know that the whole 8MB write has to be atomic and contains only inodes).
[EDIT] arc_shrink_shift and other parameters are available as module parameters for ZFS on Linux. Try modinfo zfs to get all the supported ones for your configuration.
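For example, a persistent setting along the lines suggested above could go in a modprobe config file; the values below are illustrative (8589934592 is 8GB) and the settings can also be changed at runtime through /sys:
# /etc/modprobe.d/zfs.conf -- example values, adjust to your RAM
options zfs zfs_arc_max=8589934592 zfs_arc_shrink_shift=5
# or at runtime, without reloading the module:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
echo 5 > /sys/module/zfs/parameters/zfs_arc_shrink_shift    # smaller shift = larger fraction of ARC freed per reclaim pass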
What are your ZFS settings, particularly primarycache, logbias and sync?
Make sure primarycache=all, logbias=throughput.
sync=disabled will give you a significant speedup in writes but may lose you up to 5 seconds of most recently written data if a crash occurs. Given the symptoms you are describing, you may also want to disable ZFS prefetch.
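A minimal sketch of checking and applying those settings, using the dataset name from the question (sync=disabled is left commented out because of the data-loss caveat, and prefetch is a module parameter rather than a dataset property):
zfs get primarycache,logbias,sync zfs/mongo_data-rum_a
zfs set primarycache=all zfs/mongo_data-rum_a
zfs set logbias=throughput zfs/mongo_data-rum_a
# zfs set sync=disabled zfs/mongo_data-rum_a    # only if losing up to ~5s of recent writes is acceptable
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable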
I wrote an article based on a talk I gave a while back about running MySQL on ZFS which you may find helpful.