Hello superior server gurus!
I'm running an Ubuntu server that hosts an Apache Tomcat service along with a MySQL database. The server load is always close to zero, even during the busiest hours of the week. Despite that, I am experiencing random hangups 1-2 times per week, where the entire server stops responding.
An interesting effect of this lockup is that all cron jobs seem to be executed later than scheduled, at least that is what the timestamps in various system logs indicate. Thus it appears to me that it is indeed the entire server that freezes, not only the custom software running as part of the Tomcat service. The hangup normally lasts for about 3-5 minutes, and afterwards everything jumps back to normal.
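To illustrate how I compare scheduled and actual run times (Ubuntu logs cron activity to /var/log/syslog; the script in the comment is just a placeholder for one of my own jobs):
grep CRON /var/log/syslog | tail -n 50    # actual execution timestamps logged by cron
crontab -l                                # scheduled times, e.g. "*/5 * * * * /usr/local/bin/sync.sh"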
Hardware:
Model: Dell PowerEdge R720, 16 cores, 16 GB RAM
HDD configuration: RAID-1 (mirror)
Main services:
Apache Tomcat, MySQL, SSH/SFTP
# uname -a
Linux es2 2.6.24-24-server #1 SMP Tue Jul 7 19:39:36 UTC 2009 x86_64 GNU/Linux
Running sysstat I can see huge peaks in both average load and disk block waits that correspond in time exactly to when customers have reported problems with the backend system. Below is a plot of the disk usage from sar, with a very obvious peak around 12:30 PM.
My sincere apologies for putting this on an external server, but my rep is too low to include files here directly. I also had to combine them into one image since I can only post one link :S
Sar plots: http://213.115.101.5/abba/tmpdata/sardata_es.jpg
Graph 1: Block wait, notice how the util% goes up to 100% at approximately 12:58
Graph 2: Block transfer, nothing unusual here.
Graph 3: Average load, peaks together with graph 1
Graph 4: CPU usage, still close to 0%.
Graph 5: Memory, nothing unusual here
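For reference, these are roughly the sar invocations behind the graphs (the archive path /var/log/sysstat/sa15 is just an example; substitute the day in question):
sar -d -f /var/log/sysstat/sa15    # per-device block I/O and %util (graphs 1-2)
sar -q -f /var/log/sysstat/sa15    # run queue length and load averages (graph 3)
sar -u -f /var/log/sysstat/sa15    # CPU usage (graph 4)
sar -r -f /var/log/sysstat/sa15    # memory usage (graph 5)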
Does anyone have any clue what could cause this effect on a system? As I explained earlier, the only software running on the server is a Tomcat server with a SOAP interface that allows users to connect to the database. Remote applications also connect to the server via SSH to pull and upload files. At busy times I'm guessing that we have about 50 concurrent SSH/SFTP connections and no more than 100-200 connections over HTTP (SOAP/Tomcat).
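For what it's worth, this is roughly how I estimate those counts (port 8080 for Tomcat is an assumption; adjust to whatever connector port you use):
netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:22$/'   | wc -l    # concurrent SSH/SFTP sessions
netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:8080$/' | wc -l    # concurrent HTTP/SOAP connections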
Googling around, I found discussions about file handle and inode handle limits, but I think these values are normal for 2.6.x kernels. Does anyone disagree?
cat /proc/sys/fs/file-nr
1152 0 1588671
cat /proc/sys/fs/inode-state
11392 236 0 0 0 0 0
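For reference, this is how I read those numbers (on 2.6 the file-nr columns are allocated handles, unused-but-allocated handles, and the maximum, so usage can be compared directly against fs/file-max):
cat /proc/sys/fs/file-max                                                        # system-wide file handle limit
awk '{printf "handles in use: %d of %d\n", $1 - $2, $3}' /proc/sys/fs/file-nr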
For the same time window, "sar -v" shows the values below around the hangup; note that the inode-nr reported here is always much higher than the inode-state value above.
12:40:01 dentunusd file-nr inode-nr pty-nr
12:40:01 40542 1024 15316 0
12:45:01 40568 1152 15349 0
12:50:01 40587 768 15365 0
12:55:01 40631 1024 15422 0
13:01:02 40648 896 15482 0
13:05:01 40595 768 15430 0
13:10:01 40637 1024 15465 0
I have seen this on two independent servers running the same setup of hardware, OS, software, RAID configuration, etc. Thus I want to believe that it is more software/configuration dependent than hardware dependent.
Big thanks for your time
/Ebbe
The problems were related to an incompatibility between Ubuntu 8.04 LTS (Hardy) and the Dell PERC 6/i RAID controller, as reported in this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607167. Upgrading to Ubuntu 10.04 LTS (Lucid, kernel 2.6.32) resolved the issue.
Posting this in case anyone else runs into the same problem.
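For anyone checking whether they are on the same combination, roughly (the lspci wording varies; a PERC 6/i should show up as an LSI MegaRAID SAS device):
lspci | grep -i raid               # should list the PERC 6/i (LSI MegaRAID SAS based)
uname -r                           # 2.6.24-24-server on the affected machines
sudo apt-get install update-manager-core
sudo do-release-upgrade            # LTS-to-LTS upgrade to 10.04, which brings kernel 2.6.32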
Maybe you are running some heavy query that is doing a full table scan. Have you checked your slow query log?
If that's the case, just add proper indexes.
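A minimal sketch of that workflow, assuming MySQL 5.x-era option names and a purely hypothetical orders table with a customer_id column:
# in /etc/mysql/my.cnf, under [mysqld]:
#   log_slow_queries = /var/log/mysql/mysql-slow.log
#   long_query_time  = 2
sudo /etc/init.d/mysql restart
mysqldumpslow -s t /var/log/mysql/mysql-slow.log | head -20    # worst offenders by total time
mysql -u root -p -e "EXPLAIN SELECT * FROM orders WHERE customer_id = 42\G"
mysql -u root -p -e "ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);"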
PS: Sorry if you have done this already.