I have a forum with a lot of visitors. Some days the load average climbs to 40 without any increase in the number of visitors. As you can see from the output below, the I/O wait time is high (57%). How do I find the reason for that?
The server software is Apache, MySQL and PHP.
root@server:~# top
top - 13:22:08 up 283 days, 22:06, 1 user, load average: 13.84, 24.75, 22.79
Tasks: 333 total, 1 running, 331 sleeping, 0 stopped, 1 zombie
Cpu(s): 20.6%us, 7.9%sy, 0.0%ni, 13.4%id, 57.1%wa, 0.1%hi, 0.9%si, 0.0%st
Mem: 4053180k total, 3868680k used, 184500k free, 136380k buffers
Swap: 9936160k total, 12144k used, 9924016k free, 2166552k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23930 mysql 20 0 549m 122m 6580 S 90 3.1 4449:04 mysqld
17422 www-data 20 0 223m 20m 10m S 2 0.5 0:00.21 apache2
17555 www-data 20 0 222m 19m 9968 S 2 0.5 0:00.13 apache2
17264 www-data 20 0 225m 19m 8972 S 1 0.5 0:00.17 apache2
17251 www-data 20 0 220m 12m 4912 S 1 0.3 0:00.12 apache2
root@server:~# top
top - 13:39:59 up 283 days, 22:24, 1 user, load average: 6.66, 10.39, 13.95
Tasks: 318 total, 1 running, 317 sleeping, 0 stopped, 0 zombie
Cpu(s): 13.6%us, 4.2%sy, 0.0%ni, 40.5%id, 40.6%wa, 0.2%hi, 0.8%si, 0.0%st
Mem: 4053180k total, 4010992k used, 42188k free, 119544k buffers
Swap: 9936160k total, 12160k used, 9924000k free, 2290716k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23930 mysql 20 0 549m 122m 6580 S 44 3.1 4457:30 mysqld
19946 www-data 20 0 223m 21m 10m S 5 0.6 0:00.77 apache2
17316 www-data 20 0 226m 23m 11m S 1 0.6 0:01.76 apache2
17333 www-data 20 0 222m 21m 11m S 1 0.5 0:01.55 apache2
18212 www-data 20 0 225m 22m 11m S 1 0.6 0:01.58 apache2
19528 www-data 20 0 220m 13m 5480 S 1 0.3 0:00.63 apache2
19600 www-data 20 0 224m 20m 11m S 1 0.5 0:00.73 apache2
19942 www-data 20 0 225m 21m 10m S 1 0.5 0:00.82 apache2
20232 www-data 20 0 222m 16m 8760 S 1 0.4 0:00.65 apache2
20243 www-data 20 0 223m 21m 11m S 1 0.5 0:00.57 apache2
20299 www-data 20 0 225m 20m 9m S 1 0.5 0:00.67 apache2
20441 www-data 20 0 225m 21m 10m S 1 0.5 0:00.57 apache2
21201 www-data 20 0 220m 12m 5148 S 1 0.3 0:00.19 apache2
21362 www-data 20 0 220m 12m 5032 S 1 0.3 0:00.17 apache2
21364 www-data 20 0 220m 12m 4916 S 1 0.3 0:00.14 apache2
21366 www-data 20 0 220m 12m 5124 S 1 0.3 0:00.22 apache2
21373 www-data 20 0 222m 14m 7060 S 1 0.4 0:00.26 apache2
Here are a few tools to find disk activity:
iotop
vmstat 1
iostat 1
lsof
strace -e trace=open <application>
strace -e trace=open -p <pid>
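For context, the %wa figure that top reports comes from the kernel's iowait counter. A minimal sketch that reads it straight from /proc/stat (note this is cumulative since boot, not the instantaneous per-interval figure top computes):

```shell
# Read cumulative CPU counters from the "cpu" line of /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
# (all in clock ticks since boot)
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "iowait: $((100 * iowait / total))% of CPU time since boot"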
In the output of ps auxf you'll also see which processes are in uninterruptible disk sleep (state D) because they are waiting for I/O.

You may also want to create a backup and check whether the hard drive is slowly failing. A hard drive generally starts to slow down before it dies, which could also explain the high load.
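To capture just those directly, a one-liner along these lines works (state, pid and comm are standard ps output specifiers):

```shell
# List tasks currently in uninterruptible disk sleep (state "D").
# A persistently long list here points at the I/O bottleneck.
dtasks=$(ps -eo state,pid,comm | awk '$1 ~ /^D/ { print $2, $3 }')
echo "${dtasks:-no tasks currently in D state}"
```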
The output from top suggests that the DBMS is experiencing most of the I/O waits, so database tuning issues are an obvious candidate to investigate.
I/O waiting on a database server, particularly during load spikes, is a clue that your DBMS is either disk-bound (i.e. you need a faster disk subsystem) or has a tuning issue. You should probably also look into profiling your database server, i.e. get a trace of what it's doing and which queries are taking the time.
Some starting points for diagnosing database tuning issues:
Find the queries that take up the most time, and look at the query plans. See if any have odd query plans such as a table scan where it shouldn't be. Maybe the database needs an index added.
Long resource wait times may mean that some key resource pool needs to be expanded.
Long I/O wait times may mean that you need a faster disk subsystem.
Are your log and data volumes on separate drives? Database logs consist of many small sequential writes (essentially they behave like a ring buffer). If a busy random-access workload shares the same disks as your logs, it will disproportionately affect logging throughput. For a database transaction to commit, the log entries must be written out to disk, so this places a bottleneck on the whole system.
Note that some MySQL storage engines don't use logs so this may not be an issue in your case.
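As an illustration of the first point, MySQL's EXPLAIN shows the query plan; the table and column names here are hypothetical:

```sql
-- A full table scan shows up as type=ALL with key=NULL:
EXPLAIN SELECT * FROM posts WHERE topic_id = 123;

-- If that column is filtered on frequently, adding an index
-- usually turns the scan into a cheap ref lookup:
ALTER TABLE posts ADD INDEX idx_topic_id (topic_id);
```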
Footnote: Queuing systems
Queuing systems (a statistical model for throughput) get hyperbolically slower as the system approaches saturation. As a high-level approximation, response times stretch by a factor of 1/(1-utilisation): a system that is 50% saturated responds about 2x slower than an idle one, a 90% saturated system about 10x slower, and a 99% saturated system about 100x slower.
Thus, on a system that is close to saturation, small changes in load can result in large changes to wait times, in this case manifesting as time spent waiting on I/O. If the I/O capacity of your disk subsystem is nearly saturated then small changes in load can result in significant changes in response times.
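The approximation above is the 1/(1-utilisation) stretch factor from queueing theory; a quick tabulation with awk:

```shell
# Response-time stretch factor 1/(1-u) for a few utilisation levels.
for u in 0.50 0.90 0.99; do
  awk -v u="$u" 'BEGIN { printf "%.0f%% utilised -> response times ~x%.0f\n", u * 100, 1 / (1 - u) }'
done
```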
Run iotop, or atop -dD, to see which processes are doing I/O. Use strace if you need a closer look.

What the users are doing could be as significant as how many of them are actually there. Operations like searching the forum are more demanding than simply loading and viewing individual threads or lists of threads.
Also: are you running on a dedicated server or a VPS? If your service is not on a dedicated server, the activity of applications running on the same host will have an effect, since the VMs sharing the host with yours compete for their share of the I/O resource.
As others have pointed out, tools like iotop will help you look deeper into which tasks are sitting waiting for I/O responses and which files they are accessing at the time.

In both screens it sure looks like mysqld is responsible. You need to see what that daemon is doing, i.e. what queries are running.
As Flip says, it looks like the problem is around what MySQL is doing.

Around half of your physical memory is currently being used for I/O caching, and forum software usually generates lots of quick queries returning small numbers of rows, with highly skewed hot areas of disk, so something is definitely screwy if the system is spending this much time in I/O wait. I only ever see CPU/disk usage like that when running queries that update millions of rows.

The high load average is a direct consequence of the I/O.

Crank up your MySQL logging to see if there's bad code in there or whether changing indexes would help. Analyzing your tables may also help (but probably not much).
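For MySQL, the slow query log is the usual way to do that; a sketch of the relevant my.cnf settings (paths and thresholds are illustrative, and older MySQL versions spell the option log_slow_queries):

```ini
# my.cnf: log statements that take longer than 2 seconds
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2
log_queries_not_using_indexes = 1
```

The mysqldumpslow tool that ships with MySQL can then summarise the resulting log by query pattern.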
C.
After checking iotop and the other tools, also check the dmesg output; you might see the root cause of the issue there. In my case it was "CIFS VFS: Server file.core.windows.net has not responded in 120 seconds. Reconnecting..."
I got this very high wa CPU usage on a server. It turned out it didn't have enough available memory, and the kswapd0 process was causing the high wa CPU usage. The server didn't have any swap space, so I created some (1 GB) on the Ubuntu server.
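The standard Ubuntu sequence for creating and enabling a 1 GB swap file is roughly the following (a sketch, run as root; the /swapfile path is conventional, not mandatory):

```shell
# Create and enable a 1 GB swap file.
fallocate -l 1G /swapfile          # or: dd if=/dev/zero of=/swapfile bs=1M count=1024
chmod 600 /swapfile                # swap files must not be world-readable
mkswap /swapfile                   # format it as swap
swapon /swapfile                   # enable it immediately
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # persist across reboots
```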
The wa CPU usage is now very low, at 0% most of the time.