Can anyone tell me where the memory has gone? (No, this time it is neither buffers nor cache.)
# free
             total       used       free     shared    buffers     cached
Mem:       3928200    3868560      59640          0       2888      92924
-/+ buffers/cache:    3772748     155452
Swap:      4192956     226352    3966604
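The "-/+ buffers/cache" row is plain arithmetic on the "Mem:" row, which is why the output above rules out buffers and cache as the explanation. A minimal Python sketch that recomputes it from the figures shown (the function name is made up for illustration):

```python
def buffers_cache_row(used_kb, free_kb, buffers_kb, cached_kb):
    """Recompute the '-/+ buffers/cache' row of old-style free output (values in kB)."""
    # Memory in use once buffers and page cache are discounted.
    used_without_cache = used_kb - buffers_kb - cached_kb
    # Memory that is free or could be freed by dropping buffers and cache.
    free_with_cache = free_kb + buffers_kb + cached_kb
    return used_without_cache, free_with_cache


# Figures copied from the "Mem:" row above.
used, free = buffers_cache_row(3868560, 59640, 2888, 92924)
print(used, free)  # 3772748 155452
```

Buffers and cache together are under 100 MB here, so nearly all of the "used" figure has to be something else.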
top, sorted by memory, descending:
top - 13:42:06 up 1 day, 3:47, 2 users, load average: 0.08, 0.12, 0.36
Tasks: 228 total, 1 running, 227 sleeping, 0 stopped, 0 zombie
Cpu0 : 2.0%us, 4.0%sy, 0.0%ni, 90.1%id, 0.0%wa, 0.0%hi, 4.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id,100.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3928200k total, 3868020k used, 60180k free, 2896k buffers
Swap: 4192956k total, 226048k used, 3966908k free, 82068k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3863 root      20   0  902m 199m 3296 S    7  5.2  99:08.77 ndsd
21906 root      20   0  138m 9076 2988 S    0  0.2   0:00.02 sfcbd
 2332 root      20   0  126m 4660 1332 S    0  0.1   0:17.72 mono
 4243 wwwrun    20   0  683m 4468  668 S    0  0.1   0:07.38 java
 2994 root      20   0  202m 2288 1660 S    0  0.1   6:10.02 httpstkd
 4338 root      20   0  184m 2240 1112 S    0  0.1   0:00.52 namcd
21898 root      20   0 32368 1832 1256 R    1  0.0   0:00.08 top
In fact, some time ago the OOM killer kicked in and crashed the system (kernel panic), and I'm afraid we are not far from that point again.
UPDATE
# cat /proc/meminfo
MemTotal: 3928200 kB
MemFree: 51336 kB
Buffers: 2964 kB
Cached: 72876 kB
SwapCached: 29128 kB
Active: 233440 kB
Inactive: 88040 kB
Active(anon): 188920 kB
Inactive(anon): 56752 kB
Active(file): 44520 kB
Inactive(file): 31288 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4192956 kB
SwapFree: 3966824 kB
Dirty: 32 kB
Writeback: 0 kB
AnonPages: 225112 kB
Mapped: 11356 kB
Shmem: 32 kB
Slab: 1624080 kB
SReclaimable: 13740 kB
SUnreclaim: 1610340 kB
KernelStack: 4176 kB
PageTables: 10500 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6157056 kB
Committed_AS: 2397684 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 441372 kB
VmallocChunk: 34359246755 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 10240 kB
DirectMap2M: 4184064 kB
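To see at a glance which of these fields is large, the dump can be expressed as a share of MemTotal. A rough Python sketch, assuming the usual "Name: value kB" layout; the function names and the choice of fields are mine, and the sample text is an excerpt of the output above:

```python
def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <number> [kB]" lines; skip anything else.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def share_of_total(info, names):
    """Percentage of MemTotal taken by each named field, rounded to 0.1."""
    total = info["MemTotal"]
    return {n: round(100.0 * info[n] / total, 1) for n in names if n in info}


# Excerpt of the /proc/meminfo output above.
SAMPLE = """\
MemTotal: 3928200 kB
MemFree: 51336 kB
Buffers: 2964 kB
Cached: 72876 kB
AnonPages: 225112 kB
Slab: 1624080 kB
SReclaimable: 13740 kB
SUnreclaim: 1610340 kB
"""

info = parse_meminfo(SAMPLE)
shares = share_of_total(info, ["AnonPages", "Cached", "Slab", "SUnreclaim"])
for name, pct in shares.items():
    print(f"{name:<12} {info[name]:>9} kB {pct:5.1f}%")
```

On a live system the same functions can be fed `open("/proc/meminfo").read()`. With the numbers above, anonymous process memory is under 6% of RAM while unreclaimable slab is about 41%.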
slabtop
Active / Total Objects (% used) : 9041019 / 9207548 (98.2%)
Active / Total Slabs (% used) : 401132 / 401156 (100.0%)
Active / Total Caches (% used) : 91 / 159 (57.2%)
Active / Total Size (% used) : 1491537.88K / 1519791.56K (98.1%)
Minimum / Average / Maximum Object : 0.02K / 0.17K / 4096.00K
   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4240470 4240319  99%    0.12K 141349       30    565396K pid
2245140 2219675  98%    0.25K 149676       15    598704K size-256
2238090 2210087  98%    0.12K  74603       30    298412K size-128
...
If the OOM killer is firing, you almost certainly have an application with a memory leak. Often the offender is the process the kernel selects to kill, but not always.
Have you tried something like memtop?
You can execute
and check which application is the candidate for the OOM kill; usually it is the one consuming the most memory. It looks to me like an application running wild: either it allocates too many descriptors, or some of its threads are not ending properly.
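slabtop is showing at least 1.3 GB of memory used in the slab: the three caches listed (pid, size-256 and size-128) alone hold 1462512K, and /proc/meminfo reports almost all of the slab as unreclaimable (SUnreclaim: 1610340 kB).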
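As a quick check of that figure, the CACHE SIZE column of the rows posted in the question can be summed. A small Python sketch, assuming slabtop's usual eight-column data rows; the helper name is hypothetical:

```python
def cache_sizes_kb(slabtop_rows):
    """Map cache name -> CACHE SIZE (in K) for each slabtop data row."""
    sizes = {}
    for row in slabtop_rows.splitlines():
        cols = row.split()
        # Data rows: OBJS ACTIVE USE OBJ-SIZE SLABS OBJ/SLAB CACHE-SIZE NAME
        if len(cols) == 8 and cols[0].isdigit() and cols[6].endswith("K"):
            sizes[cols[7]] = int(cols[6][:-1])
    return sizes


# The three rows visible in the question; the rest of the output was cut off.
ROWS = """\
4240470 4240319 99% 0.12K 141349 30 565396K pid
2245140 2219675 98% 0.25K 149676 15 598704K size-256
2238090 2210087 98% 0.12K 74603 30 298412K size-128
"""

sizes = cache_sizes_kb(ROWS)
total_kb = sum(sizes.values())
print(total_kb)                         # 1462512
print(round(total_kb / 1024 ** 2, 2))   # 1.39 (GiB)
```

These three caches account for most of the 1519791.56K total that slabtop reports.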
Without seeing the rest of the slabtop output it's hard to tell what is wrong, but if the growth is in the inode or directory entry (dentry) caches, these articles may help:
http://rackerhacker.com/2008/12/03/reducing-inode-and-dentry-caches-to-keep-oom-killer-at-bay/
http://people.arsc.edu/~kcarlson/software/man/drop_caches.html
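For reference, the mechanism those articles describe is a write to /proc/sys/vm/drop_caches, where the value 2 asks the kernel to free reclaimable dentries and inodes. A minimal Python sketch of that write; the function name is made up, the path is a parameter so it can be tried against an ordinary file, and on a real system it needs root:

```python
import os


def drop_reclaimable_slab(path="/proc/sys/vm/drop_caches"):
    """Write 2 to drop_caches: free reclaimable dentries and inodes."""
    # Flush dirty data first so that more cached objects become freeable.
    os.sync()
    with open(path, "w") as f:
        f.write("2\n")


# Usage on a real system, as root:
#     drop_reclaimable_slab()
```

This only touches reclaimable slab. In the /proc/meminfo output from the question, SReclaimable is just 13740 kB, so it would likely free very little in that particular case.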