I have a process - a perl script - that does:
while true
    check a POP account on a server on the LAN
    process any email found
    write logs - messages found, actions taken, errors
    sleep for 15 seconds
It's running on a Red Hat 7.3 server (I inherited it, and I'm not happy about the age of that box). It's run out of /etc/inittab like:
spop:2345:respawn:/usr/local/gw/bin/popdmn
If it dies, init restarts it.
In the last couple of days, the process has stopped working unless it's being straced. When it's just running on its own, it never logs into the POP server. As soon as it's straced (via "strace -Ff -p `cat /usr/local/gw/var/popdmn.pid`"), it works flawlessly.
As a workaround, I'm running screen on the server with an strace running. Obviously this is less than ideal.
Why would a process do this? I haven't seen this happen before.
I think I've been bitten by an ancient strace bug:
https://bugzilla.redhat.com/show_bug.cgi?id=64303
https://bugzilla.redhat.com/show_bug.cgi?id=75709
This box has strace-4.4-4 on it, so it's plausible that this is that bug. It looks like this one is self-inflicted: we were stracing the process while trying to debug it, and made things worse.
kill -CONT
works to resume the process. Definitely time to upgrade this box.
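For anyone hitting the same thing: a stopped process shows state T in /proc/<pid>/status, which is a quick way to confirm that the daemon is merely stopped rather than hung. Below is a minimal sketch of that check followed by a SIGCONT, assuming a Linux /proc and a modern Python 3 (which a Red Hat 7.3 box won't have, so treat it as illustration only). The function names parse_state and resume_if_stopped are made up for this example.

```python
import os
import signal


def parse_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def resume_if_stopped(pid):
    """Send SIGCONT to pid if the kernel reports it as stopped ('T').

    Returns True if the signal was sent. Linux-only, since it reads /proc.
    """
    with open(f"/proc/{pid}/status") as fh:
        state = parse_state(fh.read())
    if state == "T":
        os.kill(pid, signal.SIGCONT)
        return True
    return False
```

With the pid file from the question, the call would be something like resume_if_stopped(int(open("/usr/local/gw/var/popdmn.pid").read())). It has the same effect as running kill -CONT by hand, just guarded by the state check.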
The biggest differences when running under strace are speed and signal handling, I suppose.
Regarding speed, if the process is multithreaded, then strace will be altering the timing, which may change behaviour around race conditions, or any timing that the protocol exchange depends on.
For example, let's say the POP server has been upgraded and is now more careful in ensuring that a peer hasn't sent multiple POP commands at a time. This is more useful in an SMTP server as a means of spam prevention.
Does your process observe correct POP behaviour, in that it waits for a response from the server after each and every POP command? Or does it assume success, or just wait some period of time between commands?
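To make "waits for a response after each and every command" concrete, here is a rough sketch of a lock-step POP3 login in Python. It is not taken from the script in question. The helper names pop3_command and login are hypothetical, and it only handles single-line replies (multi-line responses such as LIST or RETR would need extra handling).

```python
import socket


def pop3_command(sock_file, command):
    """Send one POP3 command, then block until its status line arrives."""
    sock_file.write(command.encode("ascii") + b"\r\n")
    sock_file.flush()
    reply = sock_file.readline().decode("ascii", "replace").rstrip("\r\n")
    if not reply.startswith("+OK"):
        # Report only the verb so a password never ends up in a log.
        raise RuntimeError(f"{command.split()[0]} failed: {reply!r}")
    return reply


def login(host, port, user, password):
    """Connect and log in, never sending a command before the previous reply."""
    sock = socket.create_connection((host, port), timeout=30)
    sock_file = sock.makefile("rwb")
    greeting = sock_file.readline()  # the server speaks first
    if not greeting.startswith(b"+OK"):
        raise RuntimeError(f"unexpected greeting: {greeting!r}")
    pop3_command(sock_file, f"USER {user}")
    pop3_command(sock_file, f"PASS {password}")
    return sock, sock_file
```

A client that instead writes USER and PASS back to back and only then reads, or sleeps a fixed interval instead of reading, is exactly the kind of thing whose behaviour can change when strace slows it down.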
If you capture the actual protocol traffic in a passing and failing case, is there any sign of a protocol violation?
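If the capture is reduced to an ordered list of lines tagged by direction, spotting that kind of violation can be automated. The sketch below assumes a made-up transcript format of (direction, line) tuples, with "C" for client and "S" for server. It flags any client line sent before the server has answered the previous one (or before the greeting). It does not detect a client talking in the middle of a multi-line server response.

```python
def find_pipelined_commands(transcript):
    """Return indexes of client lines sent before the previous one was answered.

    transcript: list of (direction, line) tuples in capture order,
    where direction is "C" (client) or "S" (server).
    """
    violations = []
    awaiting_reply = True  # the server greets first, so the client must wait
    for index, (direction, _line) in enumerate(transcript):
        if direction == "C":
            if awaiting_reply:
                violations.append(index)
            awaiting_reply = True
        elif direction == "S":
            awaiting_reply = False
    return violations
```

Running this over the transcript from the working (straced) case and the failing case should show whether the only difference is timing or whether commands really are going out early.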