Nginx setup for serving large static files (100 MB-16 GB) on CentOS 7.7 with a bonded 2x10 Gbps network. ZFS on Linux is used:
- Pool size 50 TB on 8x8 TB disks
- Max ARC size 65 GB
- L2ARC: 1 TB NVMe
- recordsize=16M
- ashift=12
- nginx: sendfile off
- nginx: aio on
- nginx: output_buffers 1 128k
The system has been up for a few days. Too much CPU is being spent filling the ARC; the disks are busy at 600 MB/s, yet nginx throughput stays under 2 Gbps and the L2ARC hit ratio is very low. Any ideas?
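For reference, the nginx side boils down to something like this (a minimal sketch; the location and root path are illustrative, the three directives are the ones listed above):

    location /files/ {
        root /tank;              # illustrative path
        sendfile       off;
        aio            on;
        output_buffers 1 128k;   # nginx reads the file in 128 KB chunks
    }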
Here is the arc_summary output and a perf report.
ZFS Subsystem Report Wed May 20 12:27:46 2020
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 1.84m
Mutex Misses: 157.78k
Evict Skips: 157.78k
ARC Size: 102.54% 66.97 GiB
Target Size: (Adaptive) 100.00% 65.32 GiB
Min Size (Hard Limit): 92.87% 60.66 GiB
Max Size (High Water): 1:1 65.32 GiB
ARC Size Breakdown:
Recently Used Cache Size: 46.89% 31.40 GiB
Frequently Used Cache Size: 53.11% 35.57 GiB
ARC Hash Breakdown:
Elements Max: 159.31k
Elements Current: 97.44% 155.23k
Collisions: 11.76k
Chain Max: 2
Chains: 779
ARC Total accesses: 446.46m
Cache Hit Ratio: 99.29% 443.29m
Cache Miss Ratio: 0.71% 3.17m
Actual Hit Ratio: 99.29% 443.29m
Data Demand Efficiency: 99.28% 402.73m
CACHE HITS BY CACHE LIST:
Most Recently Used: 5.99% 26.57m
Most Frequently Used: 94.01% 416.71m
Most Recently Used Ghost: 0.00% 9.65k
Most Frequently Used Ghost: 0.28% 1.26m
CACHE HITS BY DATA TYPE:
Demand Data: 90.19% 399.81m
Prefetch Data: 0.00% 0
Demand Metadata: 9.81% 43.47m
Prefetch Metadata: 0.00% 1.82k
CACHE MISSES BY DATA TYPE:
Demand Data: 91.77% 2.91m
Prefetch Data: 0.00% 0
Demand Metadata: 7.85% 249.26k
Prefetch Metadata: 0.38% 12.12k
L2 ARC Summary: (HEALTHY)
Low Memory Aborts: 0
Free on Write: 3
R/W Clashes: 0
Bad Checksums: 0
IO Errors: 0
L2 ARC Size: (Adaptive) 458.07 GiB
Compressed: 99.60% 456.23 GiB
Header Size: 0.00% 5.34 MiB
L2 ARC Breakdown: 3.17m
Hit Ratio: 15.02% 476.70k
Miss Ratio: 84.98% 2.70m
Feeds: 55.31k
L2 ARC Writes:
Writes Sent: 100.00% 55.27k
ZFS Tunable:
metaslab_debug_load 0
zfs_multihost_interval 1000
zfs_vdev_default_ms_count 200
zfetch_max_streams 8
zfs_nopwrite_enabled 1
zfetch_min_sec_reap 2
zfs_dbgmsg_enable 1
zfs_dirty_data_max_max_percent 25
zfs_abd_scatter_enabled 1
zfs_remove_max_segment 16777216
zfs_deadman_ziotime_ms 300000
spa_load_verify_data 1
zfs_zevent_cols 80
zfs_obsolete_min_time_ms 500
zfs_dirty_data_max_percent 40
zfs_vdev_mirror_non_rotating_inc 0
zfs_resilver_disable_defer 0
zfs_sync_pass_dont_compress 8
zvol_volmode 1
l2arc_write_max 8388608
zfs_disable_ivset_guid_check 0
zfs_vdev_scrub_max_active 128
zfs_vdev_sync_write_min_active 64
zvol_prefetch_bytes 131072
zfs_send_unmodified_spill_blocks 1
metaslab_aliquot 524288
zfs_no_scrub_prefetch 0
zfs_abd_scatter_max_order 10
zfs_arc_shrink_shift 0
zfs_vdev_queue_depth_pct 1000
zfs_txg_history 100
zfs_vdev_removal_max_active 2
zil_maxblocksize 131072
metaslab_force_ganging 16777217
zfs_delay_scale 500000
zfs_free_bpobj_enabled 1
zfs_vdev_async_write_active_min_dirty_percent 30
metaslab_debug_unload 1
zfs_read_history 0
zfs_vdev_initializing_max_active 1
zvol_max_discard_blocks 16384
zfs_recover 0
zfs_scan_fill_weight 3
spa_load_print_vdev_tree 0
zfs_key_max_salt_uses 400000000
zfs_metaslab_segment_weight_enabled 1
zfs_dmu_offset_next_sync 0
l2arc_headroom 2
zfs_deadman_synctime_ms 600000
zfs_dirty_data_sync_percent 20
zfs_free_min_time_ms 1000
zfs_dirty_data_max 4294967296
zfs_vdev_async_read_min_active 64
dbuf_metadata_cache_max_bytes 314572800
zfs_mg_noalloc_threshold 0
zfs_dedup_prefetch 0
dbuf_cache_lowater_pct 10
zfs_slow_io_events_per_second 20
zfs_vdev_max_active 1000
l2arc_write_boost 8388608
zfs_resilver_min_time_ms 3000
zfs_max_missing_tvds 0
zfs_vdev_async_write_max_active 10
zvol_request_sync 0
zfs_async_block_max_blocks 100000
metaslab_df_max_search 16777216
zfs_prefetch_disable 1
metaslab_lba_weighting_enabled 1
zio_dva_throttle_enabled 1
metaslab_df_use_largest_segment 0
zfs_vdev_trim_max_active 2
zfs_unlink_suspend_progress 0
zfs_sync_taskq_batch_pct 75
zfs_arc_min_prescient_prefetch_ms 0
zfs_scan_max_ext_gap 2097152
zfs_initialize_value 16045690984833335022
zfs_mg_fragmentation_threshold 95
zil_nocacheflush 0
l2arc_feed_again 1
zfs_trim_metaslab_skip 0
zfs_zevent_console 0
zfs_immediate_write_sz 32768
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_dbgmsg_maxsize 4194304
zfs_trim_extent_bytes_max 134217728
zfs_trim_extent_bytes_min 32768
zfs_user_indirect_is_special 1
zfs_lua_max_instrlimit 100000000
zfs_free_leak_on_eio 0
zfs_special_class_metadata_reserve_pct 25
zfs_deadman_enabled 1
dmu_object_alloc_chunk_shift 7
vdev_validate_skip 0
zfs_commit_timeout_pct 5
zfs_arc_meta_limit_percent 75
metaslab_bias_enabled 1
zfs_send_queue_length 16777216
zfs_arc_p_dampener_disable 1
zfs_object_mutex_size 64
zfs_metaslab_fragmentation_threshold 70
zfs_delete_blocks 20480
zfs_arc_dnode_limit_percent 10
zfs_no_scrub_io 0
zfs_dbuf_state_index 0
zio_deadman_log_all 0
zfs_vdev_sync_read_min_active 64
zfs_deadman_checktime_ms 60000
metaslab_fragmentation_factor_enabled 1
zfs_override_estimate_recordsize 0
zfs_multilist_num_sublists 0
zvol_inhibit_dev 0
zfs_scan_legacy 0
zfetch_max_distance 16777216
zap_iterate_prefetch 1
zfs_scan_strict_mem_lim 0
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_scan_checkpoint_intval 7200
dmu_prefetch_max 134217728
zfs_recv_queue_length 16777216
zfs_vdev_mirror_rotating_seek_inc 5
dbuf_cache_shift 5
dbuf_metadata_cache_shift 6
zfs_condense_min_mapping_bytes 131072
zfs_vdev_cache_size 0
spa_config_path /etc/zfs/zpool.cache
zfs_dirty_data_max_max 4294967296
zfs_arc_lotsfree_percent 10
zfs_vdev_ms_count_limit 131072
zfs_zevent_len_max 1024
zfs_checksum_events_per_second 20
zfs_arc_sys_free 0
zfs_scan_issue_strategy 0
zfs_arc_meta_strategy 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_vdev_cache_bshift 16
zfs_compressed_arc_enabled 1
zfs_arc_meta_adjust_restarts 4096
zfs_max_recordsize 16777216
zfs_vdev_scrub_min_active 48
zfs_zil_clean_taskq_maxalloc 1048576
zfs_lua_max_memlimit 104857600
zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3
zfs_per_txg_dirty_frees_percent 5
zfs_vdev_read_gap_limit 32768
zfs_scan_vdev_limit 4194304
zfs_zil_clean_taskq_minalloc 1024
zfs_multihost_history 0
zfs_scan_mem_lim_fact 20
zfs_arc_meta_limit 0
spa_load_verify_shift 4
zfs_vdev_sync_write_max_active 128
l2arc_norw 0
zfs_arc_meta_prune 10000
zfs_vdev_removal_min_active 1
metaslab_preload_enabled 1
dbuf_cache_max_bytes 629145600
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_spa_discard_memory_limit 16777216
zfs_vdev_initializing_min_active 1
zvol_major 230
zfs_vdev_aggregation_limit 1048576
zfs_flags 0
zfs_vdev_mirror_rotating_seek_offset 1048576
spa_asize_inflation 24
zfs_admin_snapshot 0
l2arc_feed_secs 1
vdev_removal_max_span 32768
zfs_trim_txg_batch 32
zfs_multihost_fail_intervals 10
zfs_abd_scatter_min_size 1536
zio_taskq_batch_pct 75
zfs_sync_pass_deferred_free 2
zfs_arc_min_prefetch_ms 0
zvol_threads 32
zfs_condense_indirect_vdevs_enable 1
zfs_arc_grow_retry 0
zfs_multihost_import_intervals 20
zfs_read_history_hits 0
zfs_vdev_min_ms_count 16
zfs_zil_clean_taskq_nthr_pct 100
zfs_vdev_async_write_min_active 2
zfs_vdev_async_read_max_active 128
zfs_vdev_aggregate_trim 0
zfs_delay_min_dirty_percent 60
zfs_vdev_cache_max 16384
zfs_removal_suspend_progress 0
zfs_vdev_trim_min_active 1
zfs_scan_mem_lim_soft_fact 20
ignore_hole_birth 1
spa_slop_shift 5
zfs_vdev_write_gap_limit 4096
dbuf_cache_hiwater_pct 10
spa_load_verify_metadata 1
l2arc_noprefetch 1
send_holes_without_birth_time 1
zfs_vdev_mirror_rotating_inc 0
zfs_arc_dnode_reduce_percent 10
zfs_arc_pc_percent 0
zfs_metaslab_switch_threshold 2
zfs_vdev_scheduler deadline
zil_slog_bulk 786432
zfs_expire_snapshot 300
zfs_sync_pass_rewrite 2
zil_replay_disable 0
zfs_nocacheflush 0
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_arc_max 70132659200
zfs_arc_min 65132659200
zfs_read_chunk_size 1048576
zfs_txg_timeout 5
zfs_trim_queue_limit 10
zfs_arc_dnode_limit 0
zfs_scan_ignore_errors 0
zfs_pd_bytes_max 52428800
zfs_scrub_min_time_ms 1000
l2arc_headroom_boost 200
zfs_send_corrupt_data 0
l2arc_feed_min_ms 200
zfs_arc_meta_min 0
zfs_arc_average_blocksize 8192
zfetch_array_rd_sz 1048576
zfs_autoimport_disable 1
zio_slow_io_ms 30000
zfs_arc_p_min_shift 0
zio_requeue_io_start_cut_in_line 1
zfs_removal_ignore_errors 0
zfs_scan_suspend_progress 0
zfs_vdev_sync_read_max_active 128
zfs_deadman_failmode wait
zfs_reconstruct_indirect_combinations_max 4096
zfs_ddt_data_is_special 1
What size I/O is the client requesting? My hunch is that your recordsize is far too big and is causing huge read amplification. If the client fetches smaller chunks, ZFS still has to read the full 16 MB record to verify the checksum. On top of that, ARC and L2ARC are designed to resist caching sequential I/O, because the benefit of caching sequential reads is smaller. And since whole records are cached, a ~65 GiB ARC can hold only about 4,000 16 MB records (65 GiB / 16 MiB ≈ 4,180).
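You can see the request sizes the pool is actually servicing with the request-size histogram (a quick check, assuming ZFS on Linux 0.8+, which your tunable list suggests; replace `tank` with your pool name):

    # histogram of individual and aggregated I/O sizes per vdev,
    # refreshed every 5 seconds
    zpool iostat -r tank 5

If disk-side reads cluster at 16M while nginx is handing out 128 KB buffers, that is the amplification.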
Reduce the recordsize to 1M, then cp + mv a few heavily used files back into place (recordsize only applies to newly written blocks, so existing files must be rewritten), and see whether disk I/O and network I/O become more similar.
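Something like this (a sketch; `tank/files` and the file name are placeholders for your dataset and one of the hot files):

    zfs set recordsize=1M tank/files
    # the new recordsize only affects blocks written from now on,
    # so rewrite a hot file to re-block it at 1M:
    cp /tank/files/big.iso /tank/files/big.iso.tmp && \
        mv /tank/files/big.iso.tmp /tank/files/big.iso

With 1M records the same 65 GiB ARC holds roughly 65,000 records instead of ~4,000, and each 128 KB nginx read drags in 1 MB from disk instead of 16 MB.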