I've got an Ubuntu 9.10 64-bit server that seems to use up all available memory. According to my munin graphs, almost all of the memory used up is in the swap cache, cache, and slab cache. (I take this to mean virtual memory caches, am I right in assuming this?)
Once memory usage approaches 100%, some (although not all) system services such as SSH become sluggish and unresponsive. After rebooting the system, performance and memory usage become normal for a time.
Some interesting tidbits:
- The system runs Apache 2, MySQL, Munin, and sshd.
- The memory usage spikes happen at the same time every night (10 PM sharp).
- There appears to be nothing in the crontab for any of the users, and nothing in /etc/cron.d/* out of the ordinary, let alone something that would occur at 10 PM.
My question is: how do I figure out what is causing the memory suckage? I've tried the usual utilities (e.g. ps, top, etc.) but I can't seem to find anything unusual.
Any ideas? Thanks in advance!
Are you sure it's memory-related? Caches shouldn't be sucking up all the memory; they're temporary and dynamic, and they're reclaimed as active memory is needed. Caches are just there to speed things up and make use of memory that would otherwise go to waste.
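A quick sanity check: the "-/+ buffers/cache" line of free shows how much memory is genuinely in use once reclaimable cache is excluded. If that "used" figure is modest, the box isn't really out of memory.

    free -m    # the "-/+ buffers/cache" line shows usage excluding reclaimable cache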
I'd probably see if you have something else bogging down the system. When you run top, are you seeing a high system load? What is it at those times compared to "normal" times? Have you tried sorting top by CPU usage and by memory usage?
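For example, with nothing more exotic than the standard tools:

    uptime    # load averages over the last 1, 5 and 15 minutes
    top       # once inside, press P to sort by CPU and M to sort by memory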
Did you try running iotop to watch disk I/O and see whether something is hammering the drive?
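Something along these lines (iotop needs root):

    sudo iotop -o    # only list processes that are actually doing I/O right now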
What do all the crontabs look like?
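One way to dump everything cron-related in one go (run as root; this just walks every account in /etc/passwd):

    for u in $(cut -d: -f1 /etc/passwd); do
        echo "== $u =="
        crontab -l -u "$u" 2>/dev/null
    done
    cat /etc/crontab
    ls /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly /etc/cron.monthly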
Have you taken a few snapshots of ps during the day and compared them with one taken around 10 PM to see which processes have appeared?
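A crude but effective way to do that is to drop timestamped snapshots to disk and diff them afterwards (the paths and timestamps below are just examples):

    ps auxww > /tmp/ps.$(date +%H%M)    # run this a few times during the day and again around 10 PM
    diff /tmp/ps.2150 /tmp/ps.2205      # compare a "normal" snapshot with one from the bad window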
As a long shot, how about network connections with netstat? Anything unusual going in or out of the system at that time?
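For example:

    sudo netstat -tunp    # established TCP/UDP connections with the owning process
    sudo netstat -tlnp    # listening sockets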
It sounds like it may be rebuilding or indexing a system database such as the locate database, though that alone shouldn't slow the system to a halt.
The use of memory and the sluggishness are symptoms of the same problem. Something happens at that time that causes the system's disk cache to thrash. The system uses every drop of memory to avoid excess disk I/O, but still fails.
This is most common when a lot of disk reads hit areas of the disk that haven't been read recently. Recently-used data that may be needed again gets pushed out of the cache, and when it has to be read back in, it competes with the existing flow of reads.
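If that's what's going on, vmstat should show it while the problem is happening: sustained heavy block I/O in the bi/bo columns and, if things get bad enough, swap activity in si/so.

    vmstat 5    # print memory, swap and block-I/O counters every 5 seconds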
Check for some process that does lots of disk I/O at that time, for example updatedb or some kind of backup script. If it turns out to be a homemade tool, it may help to change it to bypass the disk cache so it doesn't force other stuff out of the cache. It may also help to run it under ionice so it doesn't hurt time-sensitive reads as much.
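For instance, assuming the culprit turns out to be updatedb or a nightly backup script (the script path below is just a made-up example), something like this keeps the heavy job out of the way of interactive I/O:

    ionice -c 3 /usr/local/bin/nightly-backup.sh    # idle class: only gets disk time when nothing else wants it
    ionice -c 2 -n 7 updatedb                       # best-effort class at its lowest priority (7)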