Background: I'm planning to use ZFS and I need to find the correct ashift parameter for my hard drive, which should be log2(sector_size), e.g. 9 for 512-byte sectors.
My hard drive reports a physical and logical sector size of 512 bytes. I have read that some hard drives report wrong information to avoid compatibility problems with operating systems that assume 512-byte sectors. I'm not sure whether that's the case with my hard drive.
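(For completeness, this is roughly how the reported values can be read programmatically on Linux - a minimal, untested sketch using the BLKSSZGET/BLKPBSZGET ioctls; /dev/sdX is a placeholder:)
// Sketch: ask the Linux kernel which sector sizes the drive reports.
// /dev/sdX is a placeholder; requires read permission on the device.
#include <cmath>
#include <cstdio>
#include <fcntl.h>
#include <iostream>
#include <linux/fs.h>    // BLKSSZGET, BLKPBSZGET
#include <sys/ioctl.h>
#include <unistd.h>

int main()
{
    int fd = open("/dev/sdX", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logicalSize = 0;
    unsigned int physicalSize = 0;
    ioctl(fd, BLKSSZGET, &logicalSize);     // logical sector size in bytes
    ioctl(fd, BLKPBSZGET, &physicalSize);   // physical sector size in bytes
    close(fd);

    std::cout << "logical: " << logicalSize << " bytes, physical: " << physicalSize
              << " bytes, ashift would be " << std::log2(physicalSize) << std::endl;
    return 0;
}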
So I wrote a small program to help me determine the true physical sector size. The program opens an empty partition on my hard drive and writes blocks of 4096 bytes at 1000 randomly selected locations spread within 1 GiB. The random locations are first aligned to 4096 bytes, then an offset is added. The program performs these 1000 random writes for different offsets and measures how long the writes take for each offset. The first offset is zero; it is then increased in steps of 256 bytes.
When opening the partition for writing, I use the O_WRONLY | O_SYNC | O_DIRECT flags to get as close to the hardware as I can, i.e. to circumvent as many caches as possible. I also make sure that my buffer is properly aligned in memory.
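(Related side note: as far as I understand, O_DIRECT wants the buffer address - and normally also the file offset and transfer size - aligned to the logical block size. Here's a minimal sketch of how such a buffer could also be allocated on the heap with posix_memalign; the 4096-byte alignment is an assumption that covers both 512 and 4096 byte sectors:)
// Sketch: heap-allocate a buffer whose address is aligned for O_DIRECT I/O.
// The alignment that actually matters is the device's logical block size;
// 4096 is a safe choice for 512-byte as well as 4096-byte-sector drives.
#include <cstring>    // std::memset
#include <stdlib.h>   // posix_memalign, free

char* allocateAlignedBuffer(size_t size, size_t alignment = 4096)
{
    void* p = nullptr;
    if (posix_memalign(&p, alignment, size) != 0)
        return nullptr;                  // allocation failed
    std::memset(p, 0, size);             // touch the pages so they are backed
    return static_cast<char*>(p);        // caller releases with free()
}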
Here's what I would expect:
- For non-zero offsets, the addresses I'm writing to are not aligned to the hard drive's physical sectors (regardless of whether it has 512 or 4096 byte physical sectors). There is at least one sector that has to be modified only partially, so the drive has to read that sector, update parts of it and then write it back. That should be the slower case because a read is involved (read-modify-write).
- For zero offset, regardless of whether the hard drive has 512 or 4096 byte physical sectors, the write operations should not require reading any sectors. All sectors affected by the writes are simply overwritten. This should be the faster case.
But in fact, I cannot notice any difference. The 1000 writes always take around 8.5 seconds. The offset doesn't seem to have any influence:
Offset Time (ms) for 1000 random writes
------ --------------------------------
0 8459.11
256 8450.69
512 8633.82
768 8533.94
1024 8467.36
1280 8450.63
1536 8525.72
1792 8533.96
2048 8450.64
2304 8450.79
2560 8442.37
2816 8442.38
3072 8442.28
3328 8450.82
3584 8442.27
3840 8450.81
Additional observations/remarks:
- Writing units of 512 bytes results in similar numbers (i.e. no noticeable influence of the offset).
- Just in case my partition itself is not aligned to a physical sector boundary, I also tried increasing the offset in 1 byte steps. That way, the "ideal" offset would be found eventually - but still, I couldn't identify any difference.
Can anyone explain this?
For the sake of completeness, here's my program (in case anyone wants to run it, insert the path to an empty block device into the open call):
#include <chrono>
#include <cstdlib>   // exit
#include <fcntl.h>
#include <iostream>
#include <random>
#include <unistd.h>

int main()
{
    const int bufferSize = 4096;
    // O_DIRECT needs a buffer that is aligned in memory
    char buffer[bufferSize] __attribute__((aligned(4096)));

    // the first pass (offset -256) is only a warm-up; its timing is discarded
    for (int offset = -256; offset < 4096; offset += 256)
    {
        std::mt19937 generator;
        std::uniform_int_distribution<int> distribution(0, 1024 * 1024 * 1024 / 4096);
        if (offset >= 0) std::cout << offset << "\t";
        else std::cout << "Warming up ..." << std::endl;

        int f = open("PATH_TO_EMPTY_BLOCK_DEVICE", O_WRONLY | O_SYNC | O_DIRECT);
        if (f < 0) exit(1);

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 1000; ++i)
        {
            // seek to a random 4096-byte-aligned position within 1 GiB, plus the offset
            lseek(f, 4096 * distribution(generator) + offset, SEEK_SET);
            if (write(f, buffer, bufferSize) != bufferSize) exit(1);
        }
        auto t1 = std::chrono::high_resolution_clock::now();
        close(f);

        if (offset >= 0) std::cout << (1000 * std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count()) << std::endl;
    }
    return 0;
}
4096 bytes x 1000 writes = 4 MB of data. Chances are that your hard drive has 64 MB of cache, if not more; 256 MB is not uncommon on modern drives.
Your methodology will work better if you increase the write size significantly, maybe 64 times, in order to actually see the physical drive's characteristics.
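An untested sketch of that change, keeping the structure of your program (64 x 4096 bytes = 256 KiB per write, so the 1000 writes move 256 MB in total):
// Sketch (untested): same measurement, but 256 KiB per write so the total
// volume (256 MB) exceeds typical on-drive cache sizes.
#include <chrono>
#include <cstdlib>
#include <fcntl.h>
#include <iostream>
#include <random>
#include <unistd.h>

int main()
{
    const int bufferSize = 64 * 4096;                        // 256 KiB per write
    static char buffer[bufferSize] __attribute__((aligned(4096)));

    for (int offset = 0; offset < 4096; offset += 256)
    {
        std::mt19937 generator;
        // keep every write inside the first 1 GiB of the device
        std::uniform_int_distribution<int> distribution(0, (1024 * 1024 * 1024 - bufferSize) / 4096);

        int f = open("PATH_TO_EMPTY_BLOCK_DEVICE", O_WRONLY | O_SYNC | O_DIRECT);
        if (f < 0) exit(1);

        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 1000; ++i)
        {
            lseek(f, 4096 * distribution(generator) + offset, SEEK_SET);
            if (write(f, buffer, bufferSize) != bufferSize) exit(1);
        }
        auto t1 = std::chrono::high_resolution_clock::now();
        close(f);

        std::cout << offset << "\t"
                  << 1000 * std::chrono::duration<double>(t1 - t0).count() << std::endl;
    }
    return 0;
}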
Which OS are you using? If it's Linux, how were you able to write at a starting offset that wasn't a multiple of 512 bytes while using O_DIRECT against an underlying block device?
Aligning to the "true" sector size should be less of a hit, but just how much better is highly device, data and pattern dependent (Toshiba claim the performance decrease due to misalignment can be as high as 20%). SSDs (which are not what you're asking about, but which may have to do large erases before laying down the data) are an excellent example, because bad write alignment can lead to unnecessary write amplification. Having said that, I'm told that modern devices internally have sectors much larger than 4kbytes but almost never expose this to higher levels.
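Coming back to the O_DIRECT point: I would expect a direct write at an offset that isn't a multiple of the logical sector size to be rejected with EINVAL rather than to complete and be timed. A minimal, untested sketch (the device path is a placeholder):
// Sketch: attempt an O_DIRECT write at a deliberately unaligned offset.
// On Linux this is expected to fail with EINVAL, because O_DIRECT requires
// the offset, length and buffer address to be aligned to the logical block size.
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <iostream>
#include <unistd.h>

int main()
{
    alignas(4096) static char buffer[4096] = {};

    int fd = open("/dev/sdX", O_WRONLY | O_SYNC | O_DIRECT);   // placeholder device
    if (fd < 0) { std::cerr << "open: " << std::strerror(errno) << "\n"; return 1; }

    ssize_t n = pwrite(fd, buffer, sizeof buffer, 4096 + 256);  // offset not sector-aligned
    if (n < 0)
        std::cerr << "pwrite failed: " << std::strerror(errno) << "\n";  // typically EINVAL
    else
        std::cout << "wrote " << n << " bytes\n";

    close(fd);
    return 0;
}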
Well, you're most likely to see the impact of a read-modify-write (RMW) when you are in the fastest possible situation that triggers it (as the difference will be largest). Because you're doing random writes that force the OS to wait for true completion, you are likely in a slower situation where the performance hit is simply lost in the noise. As others have stated, you also have to defeat any cache that may mask the problem - if you have somehow populated the cache with the sectors that are going to be used by the RMW process, then again the hit could be entirely masked. It could also be that your example program is flawed. Have you considered using fio?
If the disk wants to lie to that extent (not indicating a better physical size), trying to second-guess its behaviour beyond aligning partitions to 4kbytes is going to be challenging. OpenZFS does contain a list of drives whose fake block size it will try to compensate for, though.
The main reason I've read for people using a non-default ashift with ZFS is to be able to add disks which have a 4kbyte native block size into the mix at a later stage.