We need to run pt-stalk on a handful of servers to keep an eye on MySQL, and I was sick of manually starting it every time a server rebooted. A little googling turned up an init script for pt-stalk, and it seemed to work just fine. [My slightly modified version is included at the bottom of this post.]
It was taking too long to figure out how to push the script and config out via ssh [long story, please don't ask], so I decided to just log into the 20-odd servers and set everything up manually, and everything worked.
A couple days later my coworker commented that he was getting the emails, but I clearly wasn't, and it looked like I had put the wrong email in the config. This time I had figured out how to push the change via ssh, and finished everything off with:
for server in `cat serverlist.txt`; do
ssh -t $server sudo -i service pt-stalk restart
done
And this is the point where pt-stalk stopped working on every single server with:
2013_08_23_11_43_20 Caught signal, exiting
2013_08_23_11_43_20 Exiting because OKTORUN is false
2013_08_23_11_43_20 /usr/bin/pt-stalk exit status 1
2013_08_23_11_43_22 Starting /usr/bin/pt-stalk --function=status --variable=Threads_connected --threshold=100 --match= --cycles=5 --interval=1 --iterations= --run-time=30 --sleep=300 --dest=/var/lib/pt-stalk --prefix= [email protected] --log=/var/log/pt-stalk.log --pid=/var/run/pt-stalk.pid
2013_08_23_11_43_22 Caught signal, exiting
Through yesterday's testing I've deciphered that 'Caught signal, exiting' means it's caught a HUP/TERM/KILL. The first one is from service pt-stalk restart, and the second one, immediately after the successful start, is from when the ssh session closes. wat.jpg
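To illustrate how a shell script ends up logging a line like that, here is a minimal sketch. The fake_daemon function is a hypothetical stand-in and not pt-stalk's actual code; it only assumes that the script installs a trap for HUP and TERM and exits non-zero when the trap fires. The TERM it sends itself plays the role of the signal sent by service pt-stalk restart.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a daemon script; this is NOT pt-stalk's real code.
fake_daemon() {
    bash -c '
        # Log and exit with status 1 when a HUP or TERM arrives.
        trap "echo \"Caught signal, exiting\"; exit 1" HUP TERM
        kill -TERM $$          # simulate the TERM sent by the init script
        echo "still running"   # never reached once the trap has fired
    '
}

output=$(fake_daemon)
exit_status=$?
echo "$output (exit status $exit_status)"
```

A HUP delivered when a session closes would take the same path through the trap, which is why the log line alone does not say which signal arrived.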
If I simply ssh to the server and enter sudo -i service pt-stalk start or restart, I can log out and it continues happily. However, if I just feed a command to ssh, as in the loop above, pt-stalk catches a signal and exits. Sometimes it catches two signals before it exits.
What the hell is going on?
My /etc/init.d/pt-stalk for reference:
#!/usr/bin/env bash
# chkconfig: 2345 20 80
# description: pt-stalk
### BEGIN INIT INFO
# Provides: pt-stalk
# Required-Start: $network $named $remote_fs $syslog
# Required-Stop: $network $named $remote_fs $syslog
# Should-Start: pt-stalk
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
DAEMON="/usr/bin/pt-stalk"
DAEMON_OPTS="--config /etc/pt-stalk.conf"
NAME="pt-stalk"
DESC="pt-stalk"
PIDFILE="/var/run/${NAME}.pid"
STALKHOME="/var/lib/pt-stalk"
test -x $DAEMON || exit 1
[ -r /etc/default/pt-stalk ] && . /etc/default/pt-stalk
#. /lib/lsb/init-functions
sig () {
    # Send signal $1 to the PID stored in $PIDFILE; fails if the file is missing or empty.
    test -s "$PIDFILE" && kill -"$1" "$(cat "$PIDFILE")"
}

start() {
    # $DAEMON_OPTS and $MYSQL_OPTS are left unquoted on purpose so they split into arguments.
    if [[ -z $MYSQL_OPTS ]]; then
        HOME=$STALKHOME $DAEMON $DAEMON_OPTS
    else
        HOME=$STALKHOME $DAEMON $DAEMON_OPTS -- $MYSQL_OPTS
    fi
    return $?
}

stop() {
    if sig TERM; then
        # Signal 0 only checks that the process still exists.
        while sig 0 ; do
            echo -n "."
            sleep 1
        done
        return 0
    else
        echo "$DESC is not running."
        return 1
    fi
}

status() {
    if sig 0 ; then
        echo "$DESC ($(cat "$PIDFILE")) is running."
        return 0
    else
        echo "$DESC is stopped."
        return 1
    fi
}

log_begin_msg() {
    echo "$1"
}

log_end_msg() {
    if [ "$1" -eq 0 ]; then
        echo "Success"
    else
        echo "Failure"
    fi
}

case "$1" in
    start)
        log_begin_msg "Starting $DESC"
        start
        log_end_msg $?
        ;;
    stop)
        log_begin_msg "Stopping $DESC"
        stop
        log_end_msg $?
        ;;
    status)
        status
        ;;
    restart)
        log_begin_msg "Restarting $DESC"
        stop
        sleep 1
        start
        log_end_msg $?
        ;;
    *)
        echo "Usage: $0 {start|stop|status|restart}" >&2
        exit 1
        ;;
esac
Since your daemon is terminated at once, my guess is that /usr/bin/pt-stalk, when given the --daemonize option, either does not close one of the file descriptors stdin, stdout or stderr properly and early enough, or does not handle the SIGHUP signal correctly (or both).
To test which of my assumptions is correct, modify your init script so that the input and output of start are redirected from and to /dev/null. If this removes the early termination problem, narrow it down by removing these redirections one after the other again. It might be that pt-stalk simply forks too early. In this case, inserting another sleep 1 after the call to start might also be able to work around this. If it comes down to the handling of the SIGHUP signal, then it might also be a workaround to modify your init script so that it ignores SIGHUP before the call to start and restores the default handling right after the call to start.
I did not download pt-stalk, did not look into it, and did not test the theory described above. This is all from my experience with other daemons.
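The three workarounds described above could look roughly like the sketch below. It is untested against pt-stalk itself: DAEMON is set to the harmless command true so the block runs anywhere, the function names start_redirected, start_then_sleep and start_ignoring_hup are made up for illustration, and using trap '' HUP and trap - HUP to ignore and restore SIGHUP is my assumption about how to apply the idea in bash.

```shell
#!/usr/bin/env bash
# Stand-ins so the sketch is runnable; the real init script would use
# DAEMON="/usr/bin/pt-stalk" and DAEMON_OPTS="--config /etc/pt-stalk.conf".
DAEMON="true"
DAEMON_OPTS=""

# Variant 1: detach stdin, stdout and stderr from the terminal.
start_redirected() {
    $DAEMON $DAEMON_OPTS < /dev/null > /dev/null 2>&1
}

# Variant 2: give the daemon a moment to finish forking before returning.
start_then_sleep() {
    $DAEMON $DAEMON_OPTS
    local rc=$?
    sleep 1
    return "$rc"
}

# Variant 3: ignore SIGHUP around the call; ignored signals are inherited
# by child processes, so the daemon starts with SIGHUP ignored.
start_ignoring_hup() {
    trap '' HUP
    $DAEMON $DAEMON_OPTS
    local rc=$?
    trap - HUP          # restore default SIGHUP handling in this script
    return "$rc"
}

start_redirected;   redirected_status=$?
start_then_sleep;   sleep_status=$?
start_ignoring_hup; hup_status=$?

# Sanity check of the inheritance claim: the child sends itself a SIGHUP
# and only prints the message if the signal was ignored.
inherit_check=$(trap '' HUP; bash -c 'kill -HUP $$; echo "survived SIGHUP"')

echo "statuses: $redirected_status $sleep_status $hup_status; child: $inherit_check"
```

To narrow things down as suggested, try one variant at a time in place of the script's start function, and in the first variant drop the redirections one by one.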