I'm currently experimenting with different ways of improving write speeds to a fairly large software-RAID (mdadm) array of rotating disks on Debian, using fast NVMe devices.
I found that using a pair of such devices (RAID1, mirrored) to store the filesystem's journal yields interesting performance benefits. The mount options I am using to achieve this are `noatime,journal_async_commit,data=journal`.
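For reference, here is a minimal sketch of how such an external-journal setup can be created. The device names (`/dev/md/journal` for the NVMe mirror, `/dev/md/data` for the rotating-disk array, `/mnt/array` as the mount point) are assumptions; adjust them to your layout:

```shell
# Format the NVMe mirror as a dedicated external journal device
# (block size must match the main filesystem's block size):
mke2fs -O journal_dev -b 4096 /dev/md/journal

# Create the main filesystem on the data array, pointing it at the journal:
mke2fs -t ext4 -b 4096 -J device=/dev/md/journal /dev/md/data

# Mount with the options discussed above:
mount -o noatime,journal_async_commit,data=journal /dev/md/data /mnt/array
```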
In my tests, I've also discovered that adding the `barrier=0` option offers significant benefits in terms of write performance. However, I'm not certain that this option is safe to use in my particular filesystem configuration. This is what the kernel documentation says about ext4 write barriers:
> Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.
The specific NVMe device I'm using is an Intel DC P3700, which has built-in power-loss protection: in the event of an unexpected shutdown, any data still present in temporary buffers is safely committed to NAND storage thanks to reserve energy storage.
So my question is: can I safely disable ext4 write barriers if the journal is stored on a device with battery-backed cache, while the filesystem itself sits on disks which lack this feature?
I'm writing a new answer because after further analysis, I don't think the previous answer is correct.
If we look at the `write_dirty_buffers` function, it issues a write request with the `REQ_SYNC` flag, but it doesn't cause a cache flush, or barrier, to be issued. That is accomplished by the `blkdev_issue_flush` call, which is appropriately gated by a check of the `JBD2_BARRIER` flag, which itself is only present when the filesystem is mounted with barriers enabled.

So if we look back at `checkpoint.c`, barriers only matter when a transaction is dropped from the journal. The comments in the code are informative here, telling us that this write barrier is unlikely to be necessary, but is there anyway as a safeguard. I think the assumption is that by the time a transaction is dropped from the journal, the data itself is unlikely to still be lingering in the drive's cache, not yet committed to permanent storage. But since it's only an assumption, the write barrier is issued anyway.

So why aren't barriers used when writing data to the main filesystem? I think the key is that as long as the journal is coherent, metadata that's missing from the filesystem (e.g. because it was lost in a power-loss event) is normally recovered during the journal replay, thus avoiding filesystem corruption. Furthermore, the use of
`data=journal` should also guarantee consistency of the actual filesystem data because, as I understand it, the recovery process will also write out data blocks that were committed to the journal as part of its replay mechanism.

So while ext4 does not actually flush disk caches at the end of a checkpoint, some steps should be taken to maximize recoverability in case of a power loss:
1. The filesystem should be mounted with `data=journal`, and not `data=writeback` (`data=ordered` is unavailable when using an external journal). This one should be obvious: we want a copy of all incoming data blocks inside the journal, since those are the ones likely to be lost in a power-loss event. This isn't expensive performance-wise, since NVMe devices are very fast.
2. The maximum journal size of 102400 blocks (400 MB when using 4K filesystem blocks) should be used, so as to maximize the amount of data that's recoverable in a journal replay. This shouldn't be an issue, since NVMe devices are at least several gigabytes in size.
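One way to get a journal of exactly that size is to cap the journal device itself at 400 MB, since an external journal occupies the whole device. The sketch below assumes an LVM volume group named `vg_nvme` on the NVMe mirror and a data array at `/dev/md/data` — both hypothetical names:

```shell
# Carve out a 400 MB volume so the journal hits the 102400-block
# maximum with 4K blocks (vg_nvme is an assumed volume group name):
lvcreate -L 400M -n extjournal vg_nvme

# Format it as an external journal device with 4K blocks:
mke2fs -O journal_dev -b 4096 /dev/vg_nvme/extjournal

# dumpe2fs on the journal device should report the resulting
# journal parameters (block size, journal length):
dumpe2fs -h /dev/vg_nvme/extjournal
```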
Problems may still arise if an unexpected shutdown happens during a write-intensive operation. If transactions are dropped from the journal device faster than the data drives can flush their caches on their own, unrecoverable data loss or filesystem corruption could occur.

So the bottom line, in my view, is that it's not 100% safe to disable write barriers, although some precautions (#1 and #2 above) can be implemented to make this setup a little safer.
Another way to put your question is this: when doing a checkpoint, i.e. when writing the data in the journal to the actual filesystem, does ext4 flush out the cache (of the rotating disks, in your case) before marking the transaction as completed and updating the journal accordingly?
If we look at the source code of jbd2 (which is responsible for handling the journalling) in `checkpoint.c`, we see that `jbd2_log_do_checkpoint()` ends by calling `jbd2_cleanup_journal_tail()`, which in turn calls `blkdev_issue_flush()` on the filesystem device.
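The relevant code looks roughly like this (an abridged sketch of `fs/jbd2/checkpoint.c` from 3.x-era kernels; exact signatures and details vary by kernel version):

```c
/* Abridged excerpt; not standalone-compilable kernel code. */

int jbd2_log_do_checkpoint(journal_t *journal)
{
	/* ... write out and wait on the checkpoint buffers ... */
	result = jbd2_cleanup_journal_tail(journal);
	/* ... */
}

int jbd2_cleanup_journal_tail(journal_t *journal)
{
	/* ... compute the new journal tail ... */

	/*
	 * We need to make sure that any blocks that were recently
	 * written out --- perhaps by jbd2_log_do_checkpoint() --- are
	 * flushed out before we drop the transactions from the journal.
	 * It's unlikely this will be necessary, but we need this to
	 * guarantee correctness.
	 */
	if (journal->j_flags & JBD2_BARRIER)
		blkdev_issue_flush(journal->j_fs_dev, GFP_NOIO, NULL);

	/* ... update the journal superblock with the new tail ... */
}
```

Note that the flush targets `j_fs_dev` (the device holding the main filesystem), and that it is gated by the `JBD2_BARRIER` flag.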
So it seems like it should be safe.
Related: in the past, a patch to use `WRITE_SYNC` in the journal checkpoint was also proposed. The reason was that writing the buffers had too low a priority and caused the journal to fill up while waiting for the writes to complete.
If disabling write barriers significantly improves performance, that means you shouldn't disable write barriers and that your data is at risk. See this part of the XFS FAQ for an explanation.