Every once in a while, our Linux LAMP server (PHP-FPM, XFS on thin LVM on HW RAID, CentOS 8) becomes inaccessible and stops responding to HTTP(S) requests.
Via centralized logging we found that in those cases the load average quickly shoots up to the hundreds, while more and more processes (systemd-journald, PHP processes, kernel xfs/dm threads...) get stuck in the D state. According to iostat and pidstat, neither CPU nor disk is loaded much while the load average hovers around 170, which is quite strange. From htop/ps output there is no single rogue process, or group of processes, that would explain this behaviour. It's just ordinary processes that seem to hit some kind of "road block".
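For reference, this is roughly how I pick out the blocked processes and check where in the kernel they are stuck during an event (the PID is of course illustrative):

    # list processes in uninterruptible sleep together with their kernel wait channel
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
    # for one stuck PID, dump its in-kernel stack (needs root)
    cat /proc/12345/stack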
The only other odd thing in the disk monitoring is that during those overload events, iostat intermittently reports quite high w_await for the /var partition (2500-5000 ms, while other partitions such as /var/log and /var/lib/mysql mostly stay under 10 ms). This partition should be quiet most of the time, so it is not clear why iostat reports such large w_await values there.
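For completeness, this is how I watch that partition's device during an event (the mapper name is illustrative for our thin-LVM layout):

    # find the backing device for /var, then watch its extended stats
    findmnt -n -o SOURCE /var    # e.g. /dev/mapper/vg-var
    iostat -dxNt 5               # -N shows device-mapper names; w_await is the column in question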
At that point the only remedy is to power-cycle the server.
This happens on two servers of the same kind, never on the others. It looks like some kind of FS/block layer/controller/disk malfunction: lots of processes suddenly start waiting on the disk, or on something else in the kernel, yet according to iotop/iostat the disk is not doing much.
Is there a way to ask the Linux kernel's FS/block layer/controller driver what exactly it is doing with the storage, and on behalf of which processes? Standard tools like iotop/iostat only give me the names of I/O-active processes and per-partition activity, but not which process is accessing which partition or what exactly it is doing there.
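To illustrate the gap, these are the standard views I'm working from; each shows one side (processes or partitions) but not the link between them (intervals are arbitrary):

    # per-process view: only I/O-active processes, accumulated, batch output
    iotop -obPa -d 5
    # per-device/per-partition view: extended stats including await/util
    iostat -xp ALL 5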
In situations like this, I find that it helps to throttle the number of connections higher up the stack.
When more than, say, 100 active processes are running, they stumble over each other, vying for resources (CPU, I/O, etc.). The net effect is that every process runs slower, sometimes to the point where you feel the only solution is to reboot the server.
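On this stack, one place to apply such a cap is the PHP-FPM pool; a minimal sketch, assuming the default pool file and purely illustrative limits:

    # In /etc/php-fpm.d/www.conf, cap the number of concurrent PHP workers
    # (directive names are standard PHP-FPM; the numbers are only examples):
    #
    #   pm              = dynamic
    #   pm.max_children = 100
    #
    # then apply the change:
    systemctl reload php-fpm

With a hard cap in place, excess requests queue at the web server instead of piling up as dozens of competing PHP processes.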
In the case of MariaDB, I recommend turning on the slowlog so that you can identify the query that is having the most impact on the system, then speed it up. If you want help, provide the query, its EXPLAIN, and SHOW CREATE TABLE. More: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
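A minimal sketch of enabling it at runtime (the threshold and log path are illustrative; put the same settings in my.cnf so they survive a restart):

    mysql -e "SET GLOBAL slow_query_log = ON"
    mysql -e "SET GLOBAL long_query_time = 1"      # seconds
    mysql -e "SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log'"
    # summarize the worst offenders afterwards, e.g. sorted by total time:
    mysqldumpslow -s t /var/lib/mysql/slow.log | head -50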
Speeding up a few queries is likely to bring down the load average of 170 and the I/O, thereby relieving the bottleneck.