My Rails installation was chugging along nicely. Last night we had to perform a hot-patch, which was really just a standard deploy of some exception code. Once Capistrano finished the operation, one of our admins discovered two long-running Passenger processes. While we have deployed releases over the past two weeks, it would appear that these processes have been alive the whole time. Granted, they could have been zombies or some other artifact; at this point we do not know what state they were in.
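For anyone looking at the same thing, something like this shows how long the Passenger workers have been up (a sketch, assuming the Passenger utilities are installed and on the PATH):

    # Show Passenger-related processes with their elapsed run time
    ps -eo pid,etime,args | grep -i '[p]assenger'

    # Passenger's own inspection tools, if available
    passenger-status
    passenger-memory-stats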
Which leads me to my question: there are so many moving parts between the Rails application and the OS/hardware that being an SME on the whole stack is probably no longer possible. So how does a sysadmin perform root-cause analysis with any certainty?
And: When do I just start rebooting servers?
Do your developers use some kind of performance-monitoring tool like NewRelic RPM or Scout? Or they may be using one of the performance-monitoring plugins. All of these tools let you profile your production application in near real-time and see which parts of the code take the longest to execute, so you can locate the problem and fix it.
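If you are not sure whether anything like that is already wired in, a quick look from the application root usually tells you (a rough check; the paths assume a conventional Rails app layout):

    # Is an APM agent already configured in this app?
    ls config/newrelic.yml 2>/dev/null
    grep -rilE 'newrelic|scout' config/ 2>/dev/null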
If the developers don't use a monitoring tool, you should dig through the logs; they contain useful information, including the execution time of each request.
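As a rough example, with default Rails request logging you can pull the slowest requests straight out of the log (this assumes the Rails 2.2-style "Completed in 123ms ..." lines; adjust the pattern to whatever your log format actually is):

    # List the 20 slowest requests recorded in production.log
    grep "Completed in" log/production.log \
      | sed 's/.*Completed in \([0-9]*\)ms.*/\1 &/' \
      | sort -rn \
      | head -20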
Also, it would be wise to roll back the production code to the previous version to see whether the drop in performance was caused by the most recent updates. If it was, then update production file by file and test performance after each update to isolate the problem.
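With a standard Capistrano setup the rollback itself is usually a one-liner (assuming the default deploy recipe, which keeps previous releases on the server):

    # Point the current symlink back at the previous release and restart
    cap deploy:rollback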