We have two similar X10QBI machines (4 CPUs, 1 TB RAM, Ubuntu 18.04) that are facing soft lockups and freezes. We already disabled ACPI in the BIOS, but it did not solve the problem; ASP has been off since the initial setup.
Both machines also had issues with RAID5 and RAID10 via an Adaptec controller (8 disks × 3 TB SAS 7200 RPM).
watchdog: BUG: soft lockup - CPU#51 stuck for 22s! [pacd:140354]
Modules linked in: macvlan cfg80211 ceph libceph fscache aufs overlay rdma_ucm(OE) ib_ucm(OE) ib_ipoib(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 xfrm_algo mlx5_fpga_
mdev vfio_iommu_type1 vfio mdev(OE) mlx4_en(OE) bonding xfs nls_iso8859_1 intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif i
k(OE) intel_rapl_perf mei_me lpc_ich mei shpchp ioatdma ipmi_si mac_hid ipmi_devintf ipmi_msghandler acpi_pad sch_fq_codel nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_co
dma_cm(OE) iw_cm(OE) ib_cm(OE) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables x_tables autofs4
btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx4_ib(OE) ib_uverbs(OE) ib
f_pclmul crc32_pclmul ghash_clmulni_intel ast hid_generic i2c_algo_bit pcbc ttm usbhid aesni_intel hid drm_kms_helper aes_x86_64 crypto_simd syscopyarea glue_helper ixgbe sysf
sys_fops ptp nvme_core devlink aacraid ahci pps_core drm mlx_compat(OE) libahci mdio wmi
CPU: 51 PID: 140354 Comm: pacd Tainted: G OE 4.15.0-29-generic #31-Ubuntu
Hardware name: Supermicro PIO-848B-TRF4T-ST031/X10QBI, BIOS 3.2a 08/08/2019
RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
RSP: 0018:ffff8d09bf2c3d68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
RAX: 00000000ff5e4a5d RBX: ffff8d49ab049b68 RCX: ffff8d09bf2c3d98
RDX: ffff8d49ab049b70 RSI: 0000000000000202 RDI: 0000000000000202
RBP: ffff8d09bf2c3d68 R08: 000000000000000e R09: 0000000000000000
R10: ffff8d09bf2c3c70 R11: 0000000000000073 R12: 00000000ff5e4a5d
R13: 0000000000000202 R14: 0000000000000000 R15: 0000000000000000
FS: 00007ff6c3e9e700(0000) GS:ffff8d09bf2c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000152b4038 CR3: 0000003dc5766001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
__wake_up_common_lock+0x8e/0xc0
__wake_up+0x13/0x20
rwb_wake_all+0x2c/0x60
scale_up.part.20+0x28/0x40
wb_timer_fn+0x21b/0x3d0
? blk_mq_tag_update_depth+0x110/0x110
blk_stat_timer_fn+0x147/0x150
call_timer_fn+0x30/0x130
run_timer_softirq+0x3fb/0x450
? ktime_get+0x43/0xa0
? lapic_next_deadline+0x26/0x30
__do_softirq+0xdf/0x2b2
irq_exit+0xb6/0xc0
smp_apic_timer_interrupt+0x71/0x130
apic_timer_interrupt+0x84/0x90
</IRQ>
RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
RSP: 0018:ffff9a6f5aa7fa68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
RAX: 0000000000000202 RBX: ffff9a6f5aa7fac0 RCX: ffff8c8805690000
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000202
RBP: ffff9a6f5aa7fa68 R08: 0000000000d00000 R09: ffff8d09bf663440
R10: 0000000000000000 R11: 0000000000000082 R12: ffff8d49ab049b68
R13: 0000000000000002 R14: 0000000000000060 R15: ffff8d49ab049b00
prepare_to_wait_exclusive+0x72/0x80
wbt_wait+0x137/0x350
? wait_woken+0x80/0x80
blk_mq_make_request+0xe0/0x570
generic_make_request+0x124/0x300
submit_bio+0x73/0x150
? submit_bio+0x73/0x150
? xfs_setfilesize_trans_alloc.isra.13+0x3e/0x90 [xfs]
xfs_submit_ioend+0x87/0x1c0 [xfs]
xfs_vm_writepages+0xd1/0xf0 [xfs]
do_writepages+0x4b/0xe0
? iomap_write_begin.constprop.18+0x140/0x140
? iomap_file_buffered_write+0x6e/0xa0
? iomap_write_begin.constprop.18+0x140/0x140
? xfs_iunlock+0xf8/0x100 [xfs]
__filemap_fdatawrite_range+0xc1/0x100
? __filemap_fdatawrite_range+0xc1/0x100
file_write_and_wait_range+0x5a/0xb0
xfs_file_fsync+0x5f/0x230 [xfs]
vfs_fsync_range+0x51/0xb0
do_fsync+0x3d/0x70
SyS_fdatasync+0x13/0x20
do_syscall_64+0x73/0x130
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7ff729d273e7
RSP: 002b:00007ff6c3d9ec50 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
RAX: ffffffffffffffda RBX: 0000000000000027 RCX: 00007ff729d273e7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000027
RBP: 00007ff6c3d9ed30 R08: 0000000000000000 R09: 000000000000002c
R10: 00007ff6bc0008d0 R11: 0000000000000293 R12: 00007ff6c3d9ece0
R13: 00007ff6bc010965 R14: 00007ff6c3d9ed60 R15: 00007ff6bc00e3f0
Code: 47 76 ff ff ff 7f 89 d0 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f
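Note that the trace above goes through the block layer's writeback throttling (wbt_wait, rwb_wake_all, wb_timer_fn). As a diagnostic step, WBT can be inspected and turned off per device through sysfs. A sketch, assuming the affected device is nvme0n1 (adjust to your device; requires root, and the setting resets on reboot):

```shell
# Show the current writeback-throttling latency target in microseconds
# (0 means WBT is disabled, a negative value means kernel default)
cat /sys/block/nvme0n1/queue/wb_lat_usec

# Disable writeback throttling for this device
echo 0 > /sys/block/nvme0n1/queue/wb_lat_usec
```

This only takes the throttling code out of the I/O path for testing; it does not fix the underlying bug.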
For everyone facing a similar issue:
The RAID card, an Adaptec 72405, had outdated firmware. We flashed it, but that alone did not solve the issue (mentioning it because it may still be part of the solution): after working only with the Intel 3606 (no other disk), we hit the hard lock again.
We also added some throttling points to our routines, increased vm.min_free_kbytes, and changed the I/O scheduler to noop, but all of that only seemed to delay the hard lock. Since we have ten other machines with the same Intel NVME 3605 that have shown no issue so far (they have 30% less RAM and a quarter of the CPU power of the affected machines), we did not expect the problem to be related to the Intel NVMe hardware. But as it was the only remaining direction, after some searching we found a known, already-fixed bug about "rigorous write to NVME".
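For reference, the tunables we touched can be applied roughly like this (the value and device name below are examples, not necessarily the ones we used):

```shell
# Increase the kernel's free-memory reserve (example value: 1 GiB)
sysctl -w vm.min_free_kbytes=1048576
# Persist it across reboots
echo 'vm.min_free_kbytes = 1048576' > /etc/sysctl.d/90-min-free.conf

# Check the active I/O scheduler (the bracketed entry is the current one)
cat /sys/block/nvme0n1/queue/scheduler
# Switch schedulers; on blk-mq kernels the no-op scheduler is called "none"
echo noop > /sys/block/nvme0n1/queue/scheduler
```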
After installing Ubuntu 20.04, we did not face the hard lockup again in 60 hours of testing. We must disclose that we did not do proper isolation by downgrading the adapter firmware and removing the throttling points from our routines, but we believe that the extra RAM and CPU enabled a throughput that can push the NVMe to its limit and culminate in the hard lockup described in bug 1810998. We can also confirm that changing the scheduler made no difference after Ubuntu 20.