Issue Summary:
Jenkins LTS with the Durable Task plugin does not properly resume a pipeline job if the Jenkins service is restarted while the task is running.
This is a regression in Jenkins 2.3x that seems to coincide with the migration to systemd (it worked perfectly fine in 2.2x).
Steps to reproduce the issue:
- Start with a single-node Jenkins host with the Durable Task plugin installed.
- Start a pipeline job on the host. I've included a sample pipeline file at the bottom of this question.
- While the job is running, restart the Jenkins service with "service jenkins restart" (or use jenkins-cli.jar to restart).
- After Jenkins starts, the job attempts to resume but eventually fails (log below).
Resuming build at Tue Jul 19 23:26:56 UTC 2022 after Jenkins restart
Waiting to resume part of test-job #5: Waiting for next available executor
Ready to run at Tue Jul 19 23:27:01 UTC 2022
wrapper script does not seem to be touching the log file in /data/jenkins_home/workspace/test-job@tmp/durable-b0167617
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
After the above message is logged, the job goes into a "failed" state.
- Manually touching/writing to the mentioned log file does not resolve the problem.
- The issue is neither the filesystem nor available memory, as other solutions in related tickets/posts have suggested. (This is a regression in the latest versions of Jenkins.)
- There are no available plugin updates (fully up to date).
- This started when we upgraded to version 2.332, which also included the migration to systemd. So it is possible that restarting the service via systemd (versus the old init system used before 2.332) is breaking the durable tasks.
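For anyone trying to narrow this down, here is a rough sketch of how one might check whether the wrapper script is still touching its log file after the restart. The function names are made up for illustration, and it assumes GNU coreutils "stat" (Linux). It only reports how stale a file is; it does not fix anything.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Jenkins or the Durable Task plugin).
# Assumes GNU coreutils `stat`; BSD/macOS would need `stat -f %m` instead.

# Print the age, in seconds, of the last modification of file $1.
log_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

# Exit 0 if file $1 was modified within the last $2 seconds, non-zero otherwise.
is_heartbeat_fresh() {
    age=$(log_age_seconds "$1") || return 1
    [ "$age" -le "$2" ]
}
```

Usage would be something like: is_heartbeat_fresh "/data/jenkins_home/workspace/test-job@tmp/durable-b0167617/jenkins-log.txt" 30 && echo fresh || echo stale. The file name jenkins-log.txt inside the durable directory and the 30-second threshold are assumptions here; adjust them to whatever is actually in that directory. If the timestamp stops advancing as soon as the service is restarted, that would suggest the wrapper process is no longer running after the restart, which would fit the systemd suspicion above.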
This issue has been filed on the Jenkins official tracker: https://issues.jenkins.io/browse/JENKINS-69061
However, nobody has responded to that report in over 2 months, so I'm asking whether anyone here has any idea what the issue could be, both to find potential workarounds and to increase visibility/traction on the problem.
Example minimal/simple pipeline used in testing this issue:
pipeline {
    agent any
    stages {
        stage("Sleep for 60 seconds") {
            steps {
                echo "Go restart jenkins service now and see that this job wont resume"
                sh "sleep 60"
                echo "The job will never get this far"
            }
        }
    }
}