Context: I'm on a Toshiba 512 GB NVMe SSD (model KXG50ZNV512G).
I'm seeing weird behaviour while benchmarking Postgres on ZFS-on-Linux (via pgbench): the second and third runs of a benchmark are progressively slower than the first run.
Here is what is happening:
client=1  |  770 =>  697 | 10% reduction in TPS
client=4  | 2717 => 2180 | 24% reduction in TPS
client=8  | 4579 => 3339 | 37% reduction in TPS
client=12 | 4219 => 4175 |  1% reduction in TPS
client=48 | 5902 => 5623 |  5% reduction in TPS
client=96 | 7094 => 6739 |  5% reduction in TPS
I'm re-running these tests, and the early numbers indicate that the 3rd run is slower than the 1st and the 4th is slower than the 3rd.
Could the lack of TRIM support in ZFS-on-Linux (https://github.com/zfsonlinux/zfs/pull/8255) be causing this?
Rather than missing TRIM support (whose performance deficit you can often avoid simply by leaving ~10% of unpartitioned space at the end of the disk), what is probably hitting you is ZFS's copy-on-write (CoW) behavior.
Basically, when running on an empty dataset you can write without incurring read/modify/write cycles because, well, you have not written much yet. When you actually rewrite existing data (as in the subsequent benchmark runs), you progressively hit more and more read/modify/write cycles, leading to both read and write amplification (and slower performance).
To check whether this is the case, use

zpool iostat

to record the total reads/writes for the first three runs: if the second and third show an increased amount of transferred bytes, you have confirmation of what was written above.

You should also check whether autotrim is enabled on that pool:
zpool get autotrim [poolname]
If it is off, turning it on may help performance:
zpool set autotrim=on [poolname]
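To make the zpool iostat comparison concrete, here is a sketch; the pool name "tank" and the pgbench invocation are placeholders for whatever your setup actually uses:

```shell
#!/bin/sh
# Sketch: record how many bytes the pool actually transfers during one
# benchmark run. Pool name "tank" and the pgbench line are placeholders.
POOL=tank
LOG=/tmp/zpool-iostat.log

# -H: scripted mode (tab-separated, no headers); -p: exact byte counts.
# Sample once per second for the duration of the benchmark.
zpool iostat -Hp "$POOL" 1 > "$LOG" &
SAMPLER=$!

pgbench -c 8 -T 60 bench    # your benchmark here

kill "$SAMPLER"

# Skip the first sample (it reports averages since pool import), then
# sum the per-interval read/write bandwidth columns (fields 6 and 7).
awk 'NR > 1 { r += $6; w += $7 }
     END { printf "read %d bytes, wrote %d bytes\n", r, w }' "$LOG"
```

If these totals grow from run to run while pgbench performs the same amount of work, that is the read/write amplification described above.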
Leaving 10% of the disk empty can also help. However, if the SSD is not brand new, you have to shrink the existing partition to free up that 10% at the end of the disk, and then issue blkdiscard against the freed range so the drive knows it is unused. Note that blkdiscard is a dangerous command that will wipe existing data if you give it the wrong offsets, so doing this on an SSD that already holds data is not recommended.
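If you do go the over-provisioning route, here is a sketch of the discard step. The device name is hypothetical and the offset math assumes the unpartitioned gap is exactly the last 10% of the disk; verify both against your actual partition table before running anything destructive:

```shell
#!/bin/sh
# DANGEROUS: blkdiscard irreversibly erases whatever range you point it
# at. /dev/nvme0n1 is a placeholder, and the math below assumes the
# unpartitioned gap is exactly the last 10% of the disk.
DISK=/dev/nvme0n1

DISK_BYTES=$(blockdev --getsize64 "$DISK")
OFFSET=$((DISK_BYTES / 10 * 9))        # start of the last 10%
LENGTH=$((DISK_BYTES - OFFSET))

echo "would discard $LENGTH bytes starting at offset $OFFSET on $DISK"

# Uncomment only after double-checking the numbers above against your
# partition table:
# blkdiscard --offset "$OFFSET" --length "$LENGTH" "$DISK"
```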