I run the perl script in screen (I can log in and check debug output). Nothing in the logic of the script should be capable of killing it quite this dead.
I'm one of only two people with access to the server, and the other guy swears that it isn't him (and we both have quite a bit of money riding on it continuing to run without a hitch). I have no reason to believe that some hacker has managed to get a shell or anything like that. I have very little reason to suspect the admins of the host operation (bandwidth/cpu-wise, this script is pretty lightweight).
Screen continues to run, but at the end of the output of the perl script I see "Killed" and it has dropped back to a prompt. How do I go about testing what is whacking the damn thing?
I've checked crontab; nothing in there that would kill random or non-random processes. Nothing in any of the log files gives any hint. It will run from 2 to 8 hours, it would seem (and on my Mac at home, it will run well over 24 hours without a problem). The server is running Ubuntu, version something or other; I can look that up if it matters.
Put in signal handlers for all the catchable signals (TERM, SEGV, INT, HUP, etc.) and have them log whenever they are hit. It won't tell you what is sending the signal, but it will let you see which signal it is, and perhaps ignore it.
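A minimal sketch of such handlers (the log path and the exact signal list are just illustrative; adjust to taste):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Install a logging handler for each catchable signal of interest.
# SIGKILL and SIGSTOP cannot be caught, so they are not listed here.
foreach my $sig (qw(TERM INT HUP QUIT SEGV)) {
    $SIG{$sig} = sub {
        my ($name) = @_;    # Perl passes the signal name to the handler
        open my $log, '>>', '/tmp/signal.log' or return;
        print $log scalar(localtime) . ": caught SIG$name\n";
        close $log;
        # Returning from the handler resumes the program where it left off.
    };
}
```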
That would print out when it caught a SIGTERM or SIGINT and then return control to the program. Of course, with all those signals being caught, the only way to kill it would be for the program itself to exit, or to send it a SIGKILL, which can't be caught.
I realize this isn't exactly an answer to the question you asked, so I apologize if it's somewhat off-topic, but: does your app really need to run continuously, forever? Perl is not the most resource-thrifty environment in the world, and while interpreter start-up overhead is a real cost, extremely long-running scripts have troubles of their own. Memory leaks, often at a level below your control, are the bane of the vanilla-perl developer's existence. Folks often mitigate those issues either by running in a more formally resource-conservationist sub-environment like Perl::POE, or by handing the long-running listener part of the job over to a front-end service like xinetd and only executing the perl component when work needs to be done.
I run several perl scripts which run continuously reading and processing the output of our (considerably large) central syslog stream; they suffer from terrible, inexplicable "didn't free up memory despite pruning hash keys" problems at all times, and are on the block to be front-ended by something better suited to continuous high-volume input (an event queue like Gearman, for example), so we can leave perl to the data-munging tasks it does best.
That went on a bit; I do apologize. I hope it's at least somewhat helpful!
Without much in the way of actual knowledge, I'd start by looking in dmesg output and the assorted syslogs to see if the OOM killer is running. If so, that's probably it.
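For instance (the log paths are Ubuntu's defaults; other distros vary):

```shell
# The OOM killer logs its victims to the kernel ring buffer.
# grep exits non-zero when nothing matches, so don't treat that as an error.
dmesg | grep -i 'killed process' || true

# Check the persistent logs too, in case the ring buffer has already wrapped.
grep -i 'out of memory' /var/log/syslog /var/log/kern.log 2>/dev/null || true
```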
Syslog is the first thing to consult. If it isn't sufficient…

You can't determine who sent a signal to a process. It could be another process, it could be the kernel, etc. Short of involving the very recent perf framework, some guesswork is involved.

However, you can set up some better monitoring. The atop package, on Debian/Ubuntu, sets up a service that logs system load and per-process activity (disk, memory, CPU). You can then consult those logs and get a feel for what was happening at the time the process crashed. Crash course: run "sudo atop -r", navigate with the "t" and "T" keys, and type "h" to get help about the various visualisations.

Also consider adding a signal handler that dumps the output of pstree to a temporary file.

Likely you are running into resource limits, for example CPU time. Try "ulimit -a" to check. If it's only a soft limit, set in a login script, then you can fix it with, e.g., "ulimit -t unlimited". If it's a hard limit, as is set for example for regular users on OpenBSD and other OSs, then you'll have to override it.

Until you nail the issue, running the script with nohup can help. If it still crashes, examine the nohup.out file.
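A quick sketch of those checks (the script name is just a placeholder):

```shell
# Soft limits for the current shell; look at the 'cpu time' line.
ulimit -a

# Hard ceiling on CPU seconds; 'unlimited' means there is no cap to hit.
ulimit -Ht

# Relaunch detached from the terminal; stdout/stderr accumulate in nohup.out.
nohup perl yourscript.pl &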
And if nothing mentioned here helps, I'd try strace/ltrace to see what system or library calls the script was making before the failure, though be warned that they generate a LOT of output.
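A sketch of the strace variant, attaching to the already-running process (the script name in the pgrep pattern is a placeholder):

```shell
# -f follows forks, -tt timestamps each call, -o diverts the flood to a file.
# Attach only if we actually find the process.
if pid=$(pgrep -f yourscript.pl); then
    strace -f -tt -o /tmp/script.trace -p "$pid"
    # After the process dies, the tail of the trace shows its final syscalls.
    tail -n 50 /tmp/script.trace
fi
```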
In a previous life I found a DEC Ultrix box that had a very clever cron job which looked for all processes with more than 1 CPU hour and killed them. Which was why the nightly batch report job died every night.
Any clever cron jobs/scripts that might be killing it? Or it might be another performance-tuning parameter, something like the ulimit answer already given.
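To sweep all the usual cron locations at once, something like this works (listing other users' crontabs requires root):

```shell
# System-wide table plus the package drop-ins.
cat /etc/crontab /etc/cron.d/* 2>/dev/null || true

# Every user's personal crontab.
for u in $(cut -d: -f1 /etc/passwd); do
    crontab -l -u "$u" 2>/dev/null || true
done
```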