I'm looking for recommendations for ext3 filesystem block size for MySQL 5.6 using InnoDB.
Running a CentOS 5.4 VM in VMware ESXi 5, VMFS 5 datastore on NetApp FibreChannel LUNs (which have 4k chunk size). Using O_DIRECT, innodb_flush_log_at_trx_commit = 2, 14G buffer pool, and the db does OLTP with some occasional large queries crunching a lot of data. Some tables are several GB or more, others very small. Table and ibdata files are on one filesystem, binlogs and ib_logfiles on another, so they can have different block sizes.
I understand InnoDB uses 16k block size, which is not user-configurable, so I'm wondering if it's worth it to set the ext3 block size to match, instead of the 4k default.
thanks!
Filesystem block size should not have bad impact on InnoDB. I'm not speaking about tiny bits of cpu-bound performance, since filesystem overhead for it is vanishingly small. What you should worry about is IO performance.
When mysql needs to read InnodDB page from disks it accesses inode structure for the file. ext3 inode contains references to 15 blocks. First 12 points to data blocks directly. The rest 3 points to blocks, containing other blocks references, which also may be direct or indirect.
So if InnoDB page lays in first (12*4)=48KB of the file - it will be fetched in 2 IO operations: 1 for inode, second for data block, if it lays in first (12*4 + 1024)*4=4.2MB in 3 ops, (12+1024+1024^2)*4=4GB - 4 ops, (12*bs+1024+1024^2+1024^3)*4=4TB - 5 ops.
1024 is the number of 4byte block reference in 4k block.
Readahead(preallocation for writes) and caching will reduce this count, allowing to read/write several blocks at once.
Block size of 4k is the same as linux memory page size, making page caching easier to code.
When Innodb page will be written first time, ext3 will preallocate 8 sequential blocks (32kb) and write 4 of them, other 4 will be discarded (or used for one more page). All changes to this page will be stored on the same blocks.
Reducing blocksize has only benefit in saving disk space, since 1 block is minimal unit of data to store on disk.
Increasing it (there's some kernel patches to do it) will improve performance for very large files, but not that much, as you may think. Matching it to InnoDB page size makes no sense, since in the vast majority of cases data blocks for one InnoDB page will lay sequentially on disk and will be read/written in single operation.
Nevermind, the only available block sizes for ext2/3 appear to be 1K, 2K, 4K.
from mke3fs(8) man page: