Forgive me for the length of this question... it is mostly details... only attempt to follow if you also enjoy reading log files... or drinking coffee.
I'll state the questions first:
1) How the heck did a nano process fire off, given what I've stated below?
2) How did nano manage to consume so many resources?
3) Surely it's no coincidence that this happened while working with the ossec restarts, so is that related?
This is a Red Hat 4.1.2-46 Xen environment, three cluster members. We updated our Hurricane monitoring code manually on Jan 17 at 11:34am. Two files were changed (using nano) while ossec was running:
preloaded-vars.conf
ossec.conf
ossec was then restarted, and the root user logged off.
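For context, the update was roughly the following sequence (the preloaded-vars.conf path is approximate since that file came out of the install tarball; the ossec.conf path matches the ps output further down, and ossec-control is the stock OSSEC control script):

# nano preloaded-vars.conf
# nano /opt/ossec/etc/ossec.conf
# /opt/ossec/bin/ossec-control restart
# exit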
Unfortunately the three servers went offline (ssh still worked) because a nano process ran away (I imagine this would also have happened had I used vi, so the editor type is not in question). Oddly, no cron job started nano, no one was logged into the server at the time, and I'm sure that I properly closed out of nano. Before I killed the PID, top provided me with the following insight:
Mem: 28359680k total, 28325064k used, 34616k free, 3424k buffers
Swap: 4194296k total, 4194296k used, 0k free, 70208k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26351 root 18 0 29.7g 25g 784 R 100.1 95.6 4424:38 nano
Note: the nano editor took up roughly 28 GB of RAM.
It took just over three days for this to take our servers down. I found something else before I killed the process. Notice that the nano process began about two hours after the file was first edited and root logged off, and notice that the TTY = ?.
# ps -ef | grep nano
root 7836 7689 0 13:19 pts/5 00:00:00 grep nano
root 26351 1 99 Jan17 ? 3-01:44:46 nano /opt/ossec/etc/ossec.conf
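For anyone chasing something similar, the same details (reparented to init, no controlling terminal, real start time, and which files the process actually has open) can be pulled with standard tools; the PID here is just the one from the output above:

# ps -o pid,ppid,lstart,tty,cmd -p 26351
# ls -l /proc/26351/fd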
Thankfully after I killed the PID I had:
Mem: 28359680k total, 1189924k used, 27169756k free, 4584k buffers
Swap: 4194296k total, 260284k used, 3934012k free, 104352k cached
I first expected to find that the process status would be stopped or traced, but it was running (see the R before the %CPU usage stat in the top output above).
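The state can also be read without top; a quick check using the same PID (R = running, S = sleeping, D = uninterruptible wait, T = stopped or traced, Z = zombie):

# ps -o pid,stat,cmd -p 26351
# grep ^State /proc/26351/status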
Additional Notes: The preloaded-vars.conf file was created from a .tar file (hence the 1000:1000 ownership). It was edited by root. The .save file was created when I killed nano (and it's smaller than the original file). On two of the Xen servers nano was stuck editing preloaded-vars.conf, and on the third nano was stuck editing ossec.conf; no ossec.conf.save was created when that nano was killed.
-rwxr-xr-x 1 1000 1000 2918 Jan 17 11:04 preloaded-vars.conf
-rw------- 1 root root 2909 Jan 20 13:13 preloaded-vars.conf.save
Further Findings: I've discovered that if I open the preloaded-vars.conf file in nano and then kill the PID from another terminal, nano's default behavior is to create a preloaded-vars.conf.save file when it receives SIGHUP or SIGTERM. I still don't understand what caused it to go off the rails to begin with.
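That behavior is easy to reproduce on a scratch copy (the file names here are just examples; run the nano and the pkill from two separate terminals):

# cp preloaded-vars.conf /tmp/nano-test.conf
# nano /tmp/nano-test.conf
# pkill -TERM -f '/tmp/nano-test.conf'
# ls -l /tmp/nano-test.conf.save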
Well, the answer to (2) is probably "You don't have any resource limits configured" - check out ulimit to solve that one.
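To make that concrete, a minimal sketch (the 4 GB figure is only a placeholder, and both values are in kB):

# ulimit -v 4194304

caps the virtual memory of anything started from that shell, and a line such as

*    hard    as    4194304

in /etc/security/limits.conf makes a per-user cap persistent (applied via PAM at login).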
No clue on the others though.