I have a long-running CLI cron job written in PHP. After an inconsistent period of time, the script stops writing output to its log file, indicating that it has stopped executing.
However, the script is still running, as confirmed with ps aux.
The script isn't throwing any exceptions or outputting any errors. When it stalls, it doesn't seem to consume any more CPU.
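(For reference, a check along these lines shows the process state and what it is currently blocked on; <PID> is a placeholder for the cron job's process ID:)
ps -o pid,stat,wchan:32,etime,cmd -p <PID>  # process state plus the kernel wait channel
cat /proc/<PID>/wchan; echo                 # the same wait channel, straight from /proc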
Memory isn't an issue - the script uses ini_set("memory_limit", "1G"); and seems to freeze when it is only using about 73 MB. The script is running on a capable dedicated machine (8 GB RAM, 8 CPUs, PHP 5.6.25) with plenty of free memory, disk space, inodes, /tmp space, available CPU, etc.
Using xdebug's function trace logging, I can see that the script simply stops at an unremarkable function call. On a "normal" execution, xdebug writes something like TRACE END [2019-03-26 13:02:55] to the end of the log file and the script then exits. However, when the script freezes, this line does not appear and the script does not exit. For example:
"Stalled" xdebug function log, Example A:
1481.7659 73187512 -> in_array() /home/xyz/inc/util/email.inc:156
1481.7659 73187056 -> strpos() /home/xyz/inc/util/email.inc:170
1481.7659 73187208 -> mail() /home/xyz/inc/util/email.inc:181
The "stall" does not always happen at the same function.
"Stalled" xdebug function log, Example B:
1422.5710 73306784 -> strrpos() /home/xyz/inc/util/external-common.inc:106
1422.5710 73306840 -> substr() /home/xyz/inc/util/external-common.inc:108
1422.5710 73307008 -> shell_exec() /home/xyz/inc/util/external-common.inc:112
"Normal" xdebug function log:
55190.4859 35183552 -> WP_Object_Cache->__destruct() /home/xyz/public_html/wp/wp-includes/cache.php:0
55190.8278 34830224 -> wpdb->__destruct() /home/xyz/public_html/wp/wp-includes/wp-db.php:0
55193.1823 8672
TRACE END [2019-03-26 13:02:55]
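(For context, the traces above come from Xdebug 2's function tracing, enabled roughly like this; the script path is a placeholder:)
php -d xdebug.auto_trace=1 \
    -d xdebug.trace_output_dir=/tmp \
    -d xdebug.trace_output_name=cron-trace.%t \
    -d xdebug.collect_params=0 \
    /home/xyz/<script>.php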
This might be mere coincidence, but the only common thread I can see is that it might have something to do with C's popen() - it (or a relative) seems to be used by both PHP's mail() and shell_exec(). I don't think I have too many open processes (verified recently; I don't have the command I used on-hand). The script executes at most 5 shell_exec() calls per second, and each call takes less than a second to execute. (Plus, if this were a problem, wouldn't I receive some sort of error instead of this "stalling" behavior?)
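(A check along these lines should confirm that; <PID> is a placeholder, and this isn't necessarily the exact command I used before:)
ps --ppid <PID> -o pid,stat,etime,cmd  # child processes of the cron script, including any defunct entries
ls /proc/<PID>/fd | wc -l              # open file descriptors held by the script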
What could be going on?
What should my next steps be?
Trace it further, down to whatever user- or kernel-level function is taking a long time.
Use ltrace to watch system calls and C library calls, and get an idea of what it is doing.
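A sketch of attaching it to the already-running process (the PID and output path are placeholders):
ltrace -p <PID> -f -S -tt -o /tmp/cron-ltrace.out
# -f follows child processes, -S shows system calls alongside library calls,
# -tt prefixes each call with a microsecond timestamp, -o writes the trace to a file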
To track down the exec-a-program theory, try watching execs with execsnoop. In that same script collection, check the popular meme that it's always DNS with gethostlatency.
Both of these are eBPF tools, and so will not be available with the CentOS kernel. However, you may find or write variants of these implemented with available tools like ftrace.
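Where eBPF is available, the invocations are simply (bcc tool locations vary by distribution; these paths are one common layout):
/usr/share/bcc/tools/execsnoop       # prints every new process exec'd system-wide as it happens
/usr/share/bcc/tools/gethostlatency  # prints the latency of host name lookups, per process
On older kernels, the ftrace-based execsnoop in Brendan Gregg's perf-tools collection is one such variant.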