I'm going crazy trying to find a memory leak on one of our main boxes. It runs CentOS, kernel 2.6.18, x86-64. The box (actually a VM on Xen) had been running great with no problems since it was created about 6 months ago. It was created to replace an older physical box and was configured the same way. The VM is a web server and runs just Tomcat and Apache. It was solid, with no problems and no memory leaks.
About two weeks ago, we had an issue where two of the four physical servers in our Xen setup rebooted (for some reason). We got back up after a little bit, and didn't have many problems (we had to reload one MySQL database that missed some records during replication due to the outage).
Since then, we've had memory problems on this VM. None of our other VMs has had a problem; only this one. Memory usage would increase, at up to 200 MB/h, until the box ran out. It would chew through swap, then the OOM killer would start causing problems until we rebooted the VM.
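For what it's worth, here is roughly how I've been measuring the leak rate (a simplified sketch in Python, not the exact script I used; it just samples /proc/meminfo periodically):

```python
#!/usr/bin/env python
# Simplified sketch (not the exact script): sample /proc/meminfo every few
# minutes and log how much memory is effectively free (MemFree + Buffers +
# Cached) plus free swap, so the leak rate can be read off the log.
import time

def meminfo():
    """Return /proc/meminfo as a dict of kB values."""
    info = {}
    for line in open('/proc/meminfo'):
        parts = line.split()
        info[parts[0].rstrip(':')] = int(parts[1])
    return info

if __name__ == '__main__':
    while True:
        m = meminfo()
        free = m['MemFree'] + m.get('Buffers', 0) + m.get('Cached', 0)
        print('%s free=%d kB swap_free=%d kB' % (
            time.strftime('%Y-%m-%d %H:%M:%S'), free, m.get('SwapFree', 0)))
        time.sleep(300)  # sample every 5 minutes
```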
After trying other things (rebooting the VM, rebooting the physical server, migrating the VM to a different physical server), I used RPM to verify all the files on disk to try to find corruption. I found a few corrupted files, in packages I'm not sure we even use, so I reinstalled those packages so they're clean again.
That slowed the leak, but it's still there. We're now leaking at 10-50 MB/h, but it seems to accelerate near the end. Yesterday, when the server was nearly idle, memory climbed fast, going up 2.5 gigs in under 12 hours for some reason.
Interestingly, after running rpm to verify everything, just before the process exited it grabbed almost all free physical memory, and the VM had to be rebooted after that. The only configuration change that has been made is to up the VM's memory from 2 GB to 4 GB, so that it takes longer to run out of memory and we have to reboot it less often.
I've tried tracking memory. It seems to be anonymous pages we're losing, and since the box doesn't really use its disk I'm not surprised that the pages we're losing are not backed by disk. Tomcat/Java has 2 gigs of virtual memory and hangs out around 1 gig resident (it's allotted up to 1.5 gigs). Like I said earlier, this is the configuration it's had for 6+ months, and the configuration the physical box it replaced used for years before that.
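In case it helps, here's the kind of check that led me to blame anonymous pages (a rough sketch; shared pages get counted more than once in the per-process total, so the two numbers will never match exactly, but the trend is what matters):

```python
#!/usr/bin/env python
# Rough sketch: compare the kernel's AnonPages counter against the sum of
# every process's resident size. If AnonPages keeps climbing while the
# per-process total stays flat, the "lost" anonymous memory isn't
# attributable to any visible process.
import os

def meminfo_kb(key):
    """Return a single /proc/meminfo counter in kB (0 if missing)."""
    for line in open('/proc/meminfo'):
        if line.startswith(key + ':'):
            return int(line.split()[1])
    return 0

def total_process_rss_kb():
    """Sum VmRSS over every process we can read."""
    total = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            for line in open('/proc/%s/status' % pid):
                if line.startswith('VmRSS:'):
                    total += int(line.split()[1])
                    break
        except IOError:  # process exited while we were reading it
            pass
    return total

if __name__ == '__main__':
    print('AnonPages:          %8d kB' % meminfo_kb('AnonPages'))
    print('Sum of process RSS: %8d kB' % total_process_rss_kb())
```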
Our software hadn't been updated on the box in about a week before the incident, so that wasn't it. We've rebuilt and updated it since then, but that hasn't solved the problem.
We've tried updating everything else on the system using yum, but that didn't make a difference. The only software installed outside of yum was Java and our own software, both of which I've updated.
I wrote a little program to track the total virtual size, resident size, and data segment size of each process on the VM by totaling numbers from the /proc filesystem. After letting it run for a day, you could see Apache's virtual size bounce up and down with load, but its resident size basically never changed. Java crept up very slowly, to the tune of maybe 50 MB all day, which is in line with what we would expect. Yet during that time we lost 500+ MB of memory. Top doesn't show anything using more memory than Java. My program found that every process on the server (with the exception of Java and Apache) changed by no more than a few kilobytes over the day.
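For reference, here is a simplified sketch of what that tracking program does (the real one runs continuously and logs to a file, but the idea is the same):

```python
#!/usr/bin/env python
# Simplified sketch of the per-process tracker: take two snapshots of
# VmSize (virtual), VmRSS (resident) and VmData (data segment) from
# /proc/<pid>/status and print the change for each surviving process, so
# the per-process deltas can be compared against the box-wide memory loss.
import os
import time

FIELDS = ('VmSize', 'VmRSS', 'VmData')

def snapshot():
    """Return {pid: {field: kB}} for every process we can read."""
    procs = {}
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        stats = {}
        try:
            for line in open('/proc/%s/status' % pid):
                key = line.split(':')[0]
                if key in FIELDS:
                    stats[key] = int(line.split()[1])  # value is in kB
        except IOError:  # process went away mid-read
            continue
        if stats:
            procs[pid] = stats
    return procs

if __name__ == '__main__':
    before = snapshot()
    time.sleep(3600)  # one hour between snapshots
    after = snapshot()
    for pid in sorted(after, key=int):
        if pid not in before:
            continue
        deltas = ', '.join('%s %+d kB' % (f, after[pid].get(f, 0) - before[pid].get(f, 0))
                           for f in FIELDS)
        print('pid %s: %s' % (pid, deltas))
```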
Basically, something is eating our memory, but I'm at a total loss to figure out what. The kernel is my best guess, but even when memory usage was high, the kernel's memory size (a listing in /proc/vmstat that I don't remember off the top of my head) was only about 200 megabytes.
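I can't remember exactly which field I was looking at, but the kind of kernel-side tally I mean is roughly this (a sketch using the usual /proc/meminfo counters; it won't cover everything the kernel can allocate, and Xen-specific allocations may not show up here at all):

```python
#!/usr/bin/env python
# Rough tally of kernel-side memory consumers from /proc/meminfo (a sketch;
# these counters don't account for every kernel allocation).
KERNEL_KEYS = ('Slab', 'PageTables', 'VmallocUsed')

def meminfo_kb():
    """Return /proc/meminfo as a dict of kB values."""
    info = {}
    for line in open('/proc/meminfo'):
        parts = line.split()
        info[parts[0].rstrip(':')] = int(parts[1])
    return info

if __name__ == '__main__':
    m = meminfo_kb()
    total = 0
    for key in KERNEL_KEYS:
        kb = m.get(key, 0)
        total += kb
        print('%-12s %8d kB' % (key, kb))
    print('%-12s %8d kB' % ('total', total))
```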
At this point, we're about ready to rebuild the VM from scratch. I figure that's the eventual conclusion.
How do you track down what is leaking memory when something like this happens? I've never seen a memory leak like this (that doesn't show up in top), but my experience is quite limited. Can anyone suggest something I can look at or a tool I can use in cases like this?
Is it possible the VM has been hacked? Are the tools you're using to monitor the list of processes and memory from a known-good, read-only medium? You could have a rootkit installed which hides the leaky processes.
We have something similar with CentOS 5 and the same kernel. RAM usage continuously increases, and we can't explain it. We think it might be related to some library/program we installed. Do you have HDF5 installed, and which version? At the moment, that is our prime suspect. OpenMPI/MPICH2 was our first suspect, but those seem OK. Not sure, though. If I remember correctly, the problem was also occurring with a previous kernel.
I had issues with Xen hosts rebooting; it was really bad on later kernels. I found one VM which, when I moved it from a rebooting server to another server, caused that one to crash too, so I realised it was one of the guests. I was using Citrix Xen, which used the 2.6.18 kernel; I rebuilt the guest and it had been rock solid. I've since moved from Citrix to normal Xen with the 2.6.18 kernel, with a rebuilt guest again, and it's still fine. I later started having problems with yet another Xen guest; it didn't take out the host, but it did cause the guest to crash so badly that I couldn't get into the console, so I just updated all the components to the unstable release. Freakily enough, it's actually stable now :)