Note that inode_cache & ext3_inode_cache slabs are very small compared to dentry_cache.
What happens is that, slowly and steadily over the course of a week, the dentry_cache grows from 1M to ~5-6G.
Then I need to run
echo 2 > /proc/sys/vm/drop_caches && echo 0 > /proc/sys/vm/drop_caches
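For reference, I'm watching the slab sizes with something along these lines (slabtop ships with procps on CentOS 5; adjust as needed):

slabtop -o -s c | head -20                      # top slab consumers by cache size
grep -E 'dentry|inode_cache' /proc/slabinfo     # just the caches in question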
This started happening one day on all servers hosting some web code. The developers say they have not changed anything in the filesystem access patterns around the time the problem started.
The system is CentOS 5 with a 2.6.18 kernel, so I don't have any of the instrumentation features available in newer kernels. Any idea how I can debug the problem? Maybe with SystemTap? This is an EC2 instance, so I'm not even sure SystemTap will work there.
Thanks Alex
Late, but maybe useful for others who come upon this.
If you are using the AWS SDK on that EC2 instance, it is highly likely that curl is causing the dentry bloat. While I haven't seen this trigger the OOM killer, it is known to hurt server performance because of the additional work the OS has to do to reclaim SLAB memory.
If you can confirm that curl is being used by your developers to hit https URLs (many of the AWS SDKs do this), then the solution is to upgrade the nss-softokn library to at least v3.16.0 and set the environment variable NSS_SDB_USE_CACHE (YES and NO are both valid values; you may have to benchmark to see which performs curl requests more efficiently) for the process that is using libcurl.
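As a rough sketch of what that looks like on a CentOS/RHEL-style box (the package name and where you export the variable are assumptions — check your distro and the service's init script):

yum update -y nss-softokn                       # needs to end up >= 3.16.0
export NSS_SDB_USE_CACHE=YES                    # or NO; benchmark both for your workload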
I recently ran into this myself and wrote a blog entry (old blog entry link and upstream bug report) with some diagnostics & more detailed information, in case that helps.
You have a few options. If I were in this situation I would start tracking the dentry stats over time to see how fast the cache is growing. If the growth rate is fairly regular, I think you could identify possible culprits in two ways. First, the output of lsof might indicate that some process is leaving deleted file handles around. Second, you could strace the main resource-using applications and look for an excessive number of filesystem-related calls (like open(), stat(), etc.). A sketch of all three is below.
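Something along these lines, assuming a stock kernel and standard tooling:

watch -n 60 'cat /proc/sys/fs/dentry-state'          # nr_dentry, nr_unused, ...
lsof | grep '(deleted)'                               # processes holding deleted file handles
strace -f -e trace=open,stat,lstat,unlink -p <PID>    # fs-related syscalls of a suspect process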
I am also curious about @David Schwartz's comment. I haven't seen issues where the dentry cache causes the OOM killer to kick in, but maybe that happens if the entries are all still referenced and active? If that is the case, I'm pretty confident lsof would expose the issue.
In our case we were able to roughly identify the offending process by looking at minflt/s:
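For example (pidstat is part of the sysstat package; the exact invocation and log path are just an illustration):

pidstat -r 60 >> /var/log/pidstat.log                # the minflt/s column pointed at the culprit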