I am a programmer for a popular website hosted on two web servers running Apache. To give you an idea of scale, we're in the ~1000 Alexa rank range. I'm not a server administrator.
Only one of them (always the same one) has a problem that first appeared last week: every 2-3 days it grinds to a complete halt, timing out all HTTP connections and SSH connections as well. If you are already SSH'd in, typing commands is either extremely slow (20-30 minutes before they appear) or doesn't work at all.
The only temporary fix we have found is rebooting the server.
I noticed this in our graphs:
Every peak corresponds to the server halting, and every drop corresponds to the reboot we were forced to do.
How can I debug this further? Is there a way to see which process is using the most inodes? What would you do?
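
For the per-process part of the question, here is a rough sketch of one possible starting point, not a definitive diagnosis. It assumes a Linux server with /proc mounted and Python 3 available, and it assumes that the number of open file descriptors per process is a useful proxy for inode use, which may not match what the graph actually measures. The function names (count_open_fds, top_fd_users) are made up for illustration. It needs to run as root to see other users' processes, and it has to be run before the server becomes unresponsive, for example from cron with the output appended to a log.

```python
import os
from collections import Counter


def count_open_fds(proc_root="/proc"):
    """Map (pid, command name) to the number of open file descriptors."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        # Per-process directories under /proc are named by numeric PID.
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            open_fds = len(os.listdir(fd_dir))
        except OSError:
            # The process exited meanwhile, or we lack permission to look.
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                name = handle.read().strip()
        except OSError:
            name = "?"
        counts[(int(entry), name)] = open_fds
    return counts


def top_fd_users(limit=10, proc_root="/proc"):
    """Return the `limit` processes holding the most open descriptors."""
    return count_open_fds(proc_root).most_common(limit)


if __name__ == "__main__":
    # Columns: open descriptors, PID, command name.
    for (pid, name), open_fds in top_fd_users():
        print(f"{open_fds:8d}  {pid:7d}  {name}")
```

If one process climbs steadily between reboots, that is a candidate for a leak. If none does, the growth is probably in something this proxy does not capture, such as kernel-side caches.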