I have a remote Linux system that became super slow yesterday. Since the remote LUKS unlocking I've set up doesn't seem to work reliably, and I won't be able to physically access the machine within the next 10 days, I'm trying to debug this instead of rebooting.
The system status tools I'm used to are htop and dstat, and since I had dstat running in an SSH session I can see that since yesterday, 2021-09-09 08:51:42, one CPU core is always fully used by "sys", which I guess means the kernel? I can't see any culprit process or thread in htop.
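To cross-check what dstat reports as "sys", the per-core counters in /proc/stat can be sampled twice and compared. This is only a minimal sketch of that idea, not something I took from dstat: the function names (`parse_proc_stat`, `sys_share`, `report`) and the 5-second interval are my own assumptions.

```python
import time

def parse_proc_stat(text):
    """Return {core_name: [jiffies counters]} for the per-core lines of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait irq softirq steal ..."
        # The aggregate "cpu" line (no number) is skipped.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cores[fields[0]] = [int(v) for v in fields[1:]]
    return cores

def sys_share(before, after):
    """Fraction of elapsed jiffies each core spent in kernel ("system") mode."""
    shares = {}
    for name, new in after.items():
        old = before[name]
        # Only the first 8 columns are summed; guest time is already part of user/nice.
        deltas = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(deltas)
        # Index 2 is the "system" column (user, nice, system, ...).
        shares[name] = deltas[2] / total if total else 0.0
    return shares

def report(interval=5.0, path="/proc/stat"):
    """Print the per-core share of kernel time over `interval` seconds."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    for name, share in sys_share(first, second).items():
        print(f"{name}: {share:.0%} sys")
```

Calling `report()` on the affected machine should show one core close to 100% sys if the dstat reading is accurate.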
I've stopped all user services and unmounted everything non-essential, which made the system respond a bit better again, but still not nearly as fast as it should (it has an Intel i7 CPU and an SSD).
I've found https://tanelpoder.com/posts/high-system-load-low-cpu-utilization-on-linux/ and installed the referenced https://0x.tools/ to get this result for psn -G syscall,wchan:
=== Active Threads ========================================================================================
samples | avg_threads | comm | state | syscall | wchan
-----------------------------------------------------------------------------------------------------------
100 | 1.00 | (btrfs-cleaner) | Running (ON CPU) | [running] | 0
100 | 1.00 | (dpkg) | Disk (Uninterruptible) | fsync | btrfs_commit_transaction
100 | 1.00 | (systemd-journal) | Disk (Uninterruptible) | ftruncate | wait_current_trans
1 | 0.01 | (sshd) | Running (ON CPU) | [running] | 0
1 | 0.01 | (thermald) | Disk (Uninterruptible) | [running] | ec_guard
1 | 0.01 | (thermald) | Running (ON CPU) | [running] | 0
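As far as I understand it, this kind of output comes from repeatedly sampling thread states under /proc. The following is a minimal sketch of a single sample under that assumption; it is not psn's actual implementation, and the function names are made up for illustration. It counts threads that are running (R) or in uninterruptible sleep (D), together with the kernel function they are waiting in.

```python
import glob
from collections import Counter

def parse_stat_line(line):
    """Extract (comm, state) from a /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split on the first "(" and the last ")".
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return comm, state

def sample_active_threads():
    """Return a Counter of (comm, state, wchan) for threads in state R or D."""
    seen = Counter()
    for stat_path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(stat_path) as f:
                comm, state = parse_stat_line(f.read())
            if state not in ("R", "D"):
                continue
            # The wchan file sits next to stat in the same task directory.
            with open(stat_path[:-len("stat")] + "wchan") as f:
                wchan = f.read().strip() or "0"
        except (OSError, ValueError):
            continue  # thread exited between listing and reading
        seen[(comm, state, wchan)] += 1
    return seen
```

Calling `sample_active_threads()` in a loop and summing the counters gives a rough equivalent of the "samples" column. Because it walks every task directory, kernel threads are included as well.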
The dpkg process can be explained by me trying to run apt upgrade, which runs at around 1/1000th of the speed you'd normally expect (just a feeling, I didn't measure it).
Maybe there's a problem with my btrfs root file system? I can't find btrfs-cleaner in htop, so I guess I'm going to research some more on what that is.
I did run a btrfs scrub last night, which completed super fast and didn't find any problems:
# btrfs scrub status /
UUID: 2f38e0ad-7f16-4a36-8096-b7981d47b4ff
Scrub started: Thu Sep 9 23:59:00 2021
Status: finished
Duration: 0:00:24
Total to scrub: 53.09GiB
Rate: 1.78GiB/s
Error summary: no errors found
But when I used nano to modify a config file on the root partition just now, loading and saving it was super slow.
I just stumbled upon this: https://www.reddit.com/r/btrfs/comments/fmucrq/btrfs_snapshots_make_entire_system_lag_cpu_usage/ which has a comment that sounds similar to my problem:
every time on boot and after a snapshot btrfs-transacti and btrfs-cleaner would use up a core completely causing immense lag
However, that comment says it only lasts a few minutes after boot and snapshot creation, and I disabled my btrbk backup setup on this system a few days ago when one of the attached disks started to show problems.
I'm not sure if my btrfs root filesystem was using qgroups, but I just ran btrfs quota disable / which took around 10 seconds and didn't give any feedback.
Does anybody have any other hints for how I can debug or solve this problem?