I have been asked to come up with fio benchmark results for this test dataset: 1048576x1MiB. So the overall size is 1 TiB, and the set contains 2^20 1MiB files. The server runs CentOS Linux release 7.8.2003 (Core). It has sufficient RAM:
[root@tbn-6 src]# free -g
              total        used        free      shared  buff/cache   available
Mem:            376           8         365           0           2         365
Swap:             3           2           1
It's actually not a physical server. Instead, it's a Docker container with the following CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
[...]
Why Docker? We are working on a project that evaluates the appropriateness of using containers instead of physical servers. Back to the fio issue.
I remember having trouble before with fio dealing with a dataset consisting of many small files. So I did the following checks:
[root@tbn-6 src]# ulimit -Hn
8388608
[root@tbn-6 src]# ulimit -Sn
8388608
[root@tbn-6 src]# cat /proc/sys/kernel/shmmax
18446744073692774399
That all looked OK to me. I also compiled the latest fio (3.23 as of this writing) with GCC 9.
[root@tbn-6 src]# fio --version
fio-3.23
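For anyone who wants to reproduce the setup on CentOS 7, the build would look roughly like this (a sketch: it assumes GCC 9 comes from the devtoolset-9 software collection, and the source directory name is illustrative):
# get a GCC 9 toolchain into the current shell (devtoolset-9 provides GCC 9 on CentOS 7)
scl enable devtoolset-9 bash
# build fio from its source tree (directory name illustrative)
cd fio-3.23
./configure
make -j"$(nproc)"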
Here is the job file:
[root@tbn-6 src]# cat testfio.ini
[writetest]
thread=1
blocksize=2m
rw=randwrite
direct=1
buffered=0
ioengine=psync
gtod_reduce=1
numjobs=12
iodepth=1
runtime=180
group_reporting=1
percentage_random=90
opendir=./1048576x1MiB
Note: of the above, the following can be taken out:
[...]
gtod_reduce=1
[...]
runtime=180
group_reporting=1
[...]
The rest MUST be kept. This is because, in our view, the fio job file should be set up to emulate the application's interactions with storage as closely as possible, even knowing that fio != the application.
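For reference, with those optional lines taken out, the job file would look like this:
[writetest]
thread=1
blocksize=2m
rw=randwrite
direct=1
buffered=0
ioengine=psync
numjobs=12
iodepth=1
percentage_random=90
opendir=./1048576x1MiB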
I did the first run like so:
[root@tbn-6 src]# fio testfio.ini
smalloc: OOM. Consider using --alloc-size to increase the shared memory available.
smalloc: size = 368, alloc_size = 388, blocks = 13
smalloc: pool 0, free/total blocks 1/524320
smalloc: pool 1, free/total blocks 8/524320
smalloc: pool 2, free/total blocks 10/524320
smalloc: pool 3, free/total blocks 10/524320
smalloc: pool 4, free/total blocks 10/524320
smalloc: pool 5, free/total blocks 10/524320
smalloc: pool 6, free/total blocks 10/524320
smalloc: pool 7, free/total blocks 10/524320
fio: filesetup.c:1613: alloc_new_file: Assertion `0' failed.
Aborted (core dumped)
OK, so time to use --alloc-size:
[root@tbn-6 src]# fio --alloc-size=776 testfio.ini
smalloc: OOM. Consider using --alloc-size to increase the shared memory available.
smalloc: size = 368, alloc_size = 388, blocks = 13
smalloc: pool 0, free/total blocks 1/524320
smalloc: pool 1, free/total blocks 8/524320
smalloc: pool 2, free/total blocks 10/524320
smalloc: pool 3, free/total blocks 10/524320
smalloc: pool 4, free/total blocks 10/524320
smalloc: pool 5, free/total blocks 10/524320
smalloc: pool 6, free/total blocks 10/524320
smalloc: pool 7, free/total blocks 10/524320
smalloc: pool 8, free/total blocks 8/524288
smalloc: pool 9, free/total blocks 8/524288
smalloc: pool 10, free/total blocks 8/524288
smalloc: pool 11, free/total blocks 8/524288
smalloc: pool 12, free/total blocks 8/524288
smalloc: pool 13, free/total blocks 8/524288
smalloc: pool 14, free/total blocks 8/524288
smalloc: pool 15, free/total blocks 8/524288
fio: filesetup.c:1613: alloc_new_file: Assertion `0' failed.
Aborted (core dumped)
Back to square one :(
I must be missing something. Any help would be much appreciated.
(TL;DR: setting --alloc-size to a big number helps.)

I bet you can simplify this job down and still reproduce the problem (which will be helpful for whoever looks at this, because there are fewer places to look). I'd guess the crux is that opendir option and the fact that you say the directory contains "2^20 1MiB files"... If you read the documentation of --alloc-size you will notice it mentions fio running out of memory on large jobs with randommap enabled. By default fio evenly distributes random I/O across a file (each block is written once per pass), but to do so it needs to keep track of the areas it has written, which means it has to keep a data structure per file. OK, you can see where this is going...
Memory pools are set aside for certain data structures (because they have to be shared between jobs). Initially there are 8 pools (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L22) and by default each pool is 16 megabytes in size (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L21).
Each file that does random I/O requires a data structure to go with it. Based on your output, let's guess that each file forces the allocation of a data structure of 368 bytes + a header (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L434), which combined comes to 388 bytes. Because the pool works in blocks of 32 bytes (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L70), this means we actually take a bite of 13 blocks (416 bytes) out of a pool per file.
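A rough back-of-the-envelope calculation from those numbers (mine, so please double check):
files=1048576                               # 2^20 files in the dataset
per_file=$(( 13 * 32 ))                     # 13 x 32-byte blocks = 416 bytes of pool space per file
echo $(( per_file * files / 1024 / 1024 ))  # ~416 MiB of pool space needed in total
echo $(( 8 * 16 ))                          # versus 128 MiB across the default 8 x 16 MiB pools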
Out of curiosity: how big can /tmp be? (While fio is running, its smalloc pools are backed by .fio_smalloc.* files in /tmp.) I don't think this is germane to your issue, but it would be good to rule it out.
Update: by default, Docker limits the amount of IPC shared memory (also see its --shm-size option). It's unclear whether that was a factor in this particular case, but see the "original job only stopped at 8 pools" comment below.
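If you want to rule that out, something like the following would show and, if necessary, raise the limit (illustrative - the image name and the size are placeholders):
df -h /dev/shm                           # the container's IPC shared memory limit (Docker defaults to 64 MiB)
docker run --shm-size=1g <your-image>    # start the container with a larger allowance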
So why didn't setting --alloc-size=776 help? Looking at what you wrote, it seems odd that your blocks per pool didn't increase, right? I notice your pools grew to the maximum of 16 (https://github.com/axboe/fio/blob/fio-3.23/smalloc.c#L24) the second time around. The documentation for --alloc-size notes that its value is given in KiB. You used --alloc-size=776... isn't 776 KiB smaller than 16 MiB? That would make each pool smaller than the default, which may explain why it tried to grow the number of pools to the maximum of 16 before giving up in your second run. The above arithmetic suggests you want each pool to be approximately 52 megabytes in size if you are going to have 8 of them, for a sum total of approximately 416 megabytes of RAM. What happens when you use --alloc-size=53248?
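In other words, roughly (again, my arithmetic):
echo $(( 416 / 8 ))      # ~52 MiB per pool if ~416 MiB is spread over 8 pools
echo $(( 52 * 1024 ))    # 53248 - the value to try, since --alloc-size takes KiB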
Update: the calculated number above was too low. In a comment the question asker reports that a much higher setting of --alloc-size=1048576 was required. (I'm a little concerned that the original job only stopped at 8 pools (128 MiB), though. Doesn't that suggest that trying to grow to a ninth 16 MiB pool was problematic?)
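That is, an invocation along the lines of (using the value the asker reported working):
fio --alloc-size=1048576 testfio.ini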
Finally, the fio documentation seems to be hinting that these data structures are allocated when you ask for a particular distribution of random I/O. This suggests that if the I/O is sequential, or if the I/O uses random offsets but DOESN'T have to adhere to a distribution, then maybe those data structures don't have to be allocated... What happens if you use norandommap?
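e.g. a line like the following added to the [writetest] section (a sketch - norandommap stops fio from tracking which blocks it has already covered, at the cost of possibly touching some blocks more than once and others not at all):
norandommap=1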
(Aside: blocksize=2m but your files are 1MiB big - is that correct?)

This question feels too big and too specialist for a casual Server Fault answer, and may get a better response from the fio project itself (see https://github.com/axboe/fio/blob/fio-3.23/REPORTING-BUGS and https://github.com/axboe/fio/blob/fio-3.23/README#L58).
Good luck!