JamesApril

Asked: 2020-05-22 07:52:14 +0800 CST

Process blocking or disk corruption? Very high load + wait times, but low CPU/memory use

I am using a vServer that, since yesterday, has suddenly been showing very high wait times (10, 20, or 30 seconds) or even timeouts on basic requests, after running for over a year without any problems. This is my configuration:

8 CPU vCores, 32 GB memory, 800 GB SSD
Standard Plesk Obsidian with latest updates

The server runs a couple of websites with PHP and MariaDB via Apache: nothing too fancy, not much incoming or outgoing traffic, and not much processing on the server. The average load on this vServer has usually been between 1 and 3, but now it suddenly jumps to 20-100 as soon as I start either the Apache or the MariaDB service.

Via htop I can see:

up to 30 processes in the "D" state (Uninterruptible Sleep)
very low CPU use (<5% or even 0% on most cores)
plenty of free memory available (disk space is available as well)
no unknown/unusual processes (mostly Plesk-related, MariaDB and Apache)
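
To see which processes are stuck in the "D" state and where in the kernel they are sleeping, something like the following could be used. This is only a minimal sketch, assuming a Linux /proc filesystem; the function names parse_stat and blocked_processes are made up for illustration, and inside some virtualized containers the wchan file may be empty or unreadable.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state

def blocked_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every process currently in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            # wchan names the kernel function the task is sleeping in; it can
            # be "0", empty, or unreadable depending on kernel and privileges.
            try:
                with open(os.path.join(proc_root, entry, "wchan")) as f:
                    wchan = f.read().strip() or "?"
            except OSError:
                wchan = "?"
            yield pid, comm, wchan
        except (OSError, ValueError, IndexError):
            continue  # process exited while being read, or unexpected format

if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(f"{pid:>7}  {comm:<20}  {wchan}")
```

If most blocked processes report the same wait channel, that at least hints at which subsystem (for example a filesystem or block-device wait) they are all queued behind.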

Via iotop I can see:

very limited disk activity, both read/write are 0 or close to 0 most of the time

And vmstat 1 5 gives me the following output

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 13      0 1055820      0 25834552    2    2   128    59    0   93  8  1 90  0  0
 2 14      0 975180      0 25870484    0    0    16     0    0  586 31  4 65  0  0
 1 16      0 910584      0 25873184    0    0   100    48    0  374  7  2 92  0  0
 0 16      0 920048      0 25883484    0    0    16    64    0  415  8  1 91  0  0
 0 15      0 954344      0 25883472    0    0    96  1432    0  383  1  0 99  0  0
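
On Linux, the load average counts tasks in uninterruptible sleep (the b column) in addition to runnable tasks (the r column), which is why the load can be high while the CPUs are almost idle. The sketch below simply re-reads the five sample rows above and adds the two columns per row; the helper name parse_vmstat is hypothetical, and the column names are the ones vmstat normally prints in lowercase.

```python
VMSTAT_SAMPLE = """\
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 13      0 1055820      0 25834552    2    2   128    59    0   93  8  1 90  0  0
 2 14      0 975180      0 25870484    0    0    16     0    0  586 31  4 65  0  0
 1 16      0 910584      0 25873184    0    0   100    48    0  374  7  2 92  0  0
 0 16      0 920048      0 25883484    0    0    16    64    0  415  8  1 91  0  0
 0 15      0 954344      0 25883472    0    0    96  1432    0  383  1  0 99  0  0
"""

def parse_vmstat(text):
    """Turn vmstat output (column header line + data lines) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        values = [int(v) for v in ln.split()]
        rows.append(dict(zip(header, values)))
    return rows

rows = parse_vmstat(VMSTAT_SAMPLE)
for row in rows:
    # r = runnable tasks, b = tasks in uninterruptible sleep
    waiting = row["r"] + row["b"]
    print(f"r={row['r']:>2}  b={row['b']:>2}  r+b={waiting:>2}  "
          f"idle={row['id']}%  iowait={row['wa']}%")
```

For the sample above, r+b stays between 14 and 17 while idle CPU is 65-99%, so nearly all of the load comes from blocked tasks rather than from computation.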

So it looks like something is blocking these processes from running, until roughly once a minute they do get executed. I can then load a few pages on one of the websites and these processes no longer show up in htop, but a few more clicks later the same situation is back.

Interactions with this vServer via e.g. SFTP or SSH are also considerably slower than before due to the high average load. I have already checked the health of the MariaDB databases and couldn't find any problems, and the load issue also occurs when the MariaDB service isn't running.

My questions:

What can I do or use to find the specific reason why these processes cannot be executed / what is blocking them?
Is it possible that either the memory or disk has a problem? Should I run fsck (this would require taking the server offline)?

Anything that helps to document, for example, hardware-related problems would be really helpful. I have checked other posts about a high load average but couldn't find a solution for my problem.

UPDATE

I've noticed that both buff and swpd in the vmstat output above are always 0. Here is the output of cat /proc/meminfo; could this be the reason?

MemTotal:       33554432 kB
MemFree:          639036 kB
MemAvailable:   25227064 kB
Cached:         24259912 kB
Buffers:               0 kB
Active:         19847944 kB
Inactive:       12315884 kB
Active(anon):    7664604 kB
Inactive(anon):   572316 kB
Active(file):   12183340 kB
Inactive(file): 11743568 kB
Unevictable:       11228 kB
Mlocked:           28388 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:           8485104 kB
Writeback:             8 kB
AnonPages:       8236920 kB
Shmem:            328712 kB
Slab:             683440 kB
SReclaimable:     661120 kB
SUnreclaim:        22320 kB
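
One figure in this output that may be worth a closer look is Dirty (data waiting to be written to disk) next to Writeback (data currently being written). The sketch below just does the arithmetic on the values shown above; parse_meminfo and dirty_summary are illustrative names, and reading the result as a sign of stalled writes is an assumption rather than something this output proves.

```python
MEMINFO_SAMPLE = """\
MemTotal:       33554432 kB
MemAvailable:   25227064 kB
Dirty:           8485104 kB
Writeback:             8 kB
"""

def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def dirty_summary(info):
    """Return (dirty size in GiB, dirty size as a percentage of MemTotal)."""
    dirty_gib = info["Dirty"] / (1024 * 1024)
    dirty_pct = 100.0 * info["Dirty"] / info["MemTotal"]
    return dirty_gib, dirty_pct

info = parse_meminfo(MEMINFO_SAMPLE)
gib, pct = dirty_summary(info)
print(f"Dirty: {gib:.2f} GiB ({pct:.1f}% of MemTotal), "
      f"Writeback: {info['Writeback']} kB")
# prints: Dirty: 8.09 GiB (25.3% of MemTotal), Writeback: 8 kB
```

To run it against the live system instead of the pasted sample, the contents of /proc/meminfo can be passed to parse_meminfo. Roughly 8 GiB of dirty pages with almost nothing in writeback would be consistent with writes not reaching the storage backend, but other explanations are possible.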

Output for iostat -d (but this shows "40 CPU", so probably the whole server?):

Device       tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
somename     13805,99     41065,30   2557318,07 2443426623 152162982648

UPDATE

Sample of blocked processes:

UPDATE

Ongoing discussions with other customers of this hoster indicate widespread problems with the virtualization platform used for these vServers (e.g. resources not being made available to the vServers). Some customers have had problems for days now. I'll update once more information is available.

UPDATE

Here is a news report in German about the ongoing problems with this hoster: Lang anhaltende Störung bei Stratos V-Servern

FINAL UPDATE

The problem with this vServer seems to have been resolved by the hoster now, after a week, or even almost 2 weeks for some customers with other vServers. Main reason: communication issues between switches leading to delays in I/O operations. Details can be found here: Strato: Massive V-Server-Störung bald behoben

0 Answers