Within three weeks, `systemd` has suddenly become unresponsive on two of my Ubuntu 20.04 LTS servers.
Symptoms:
- All `systemctl` commands for controlling services or accessing logs fail with error messages:

        Failed to retrieve unit state: Connection timed out
        Failed to get properties: Connection timed out

- `systemd` does not heed the signal from `logrotate` for reopening its log; it keeps writing to the renamed log file `/var/log/syslog.1` while the newly created `/var/log/syslog` remains empty.
- Lots of zombie processes accumulate from cron jobs and system management tasks, i.e. PID 1 (`systemd`) neglects its duty of reaping orphaned processes.
- Running services continue to run normally, but starting or stopping services is no longer possible, as even the legacy scripts in `/etc/init.d` redirect to the non-functional `systemctl`.
- Nothing unusual in the logs except the `Connection timed out` messages from attempted interactions with `systemd`.
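For reference, this is roughly how the symptoms show up from a shell. This is only a minimal sketch on a stock Ubuntu layout; the timeout value and paths are illustrative, not taken from the affected machines:

    # Overall manager state; on the affected servers this hangs and
    # eventually fails with "Connection timed out".
    timeout 30 systemctl is-system-running

    # Zombie (defunct) processes left unreaped by PID 1.
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

    # Log rotation symptom: the renamed file keeps growing while the
    # freshly created /var/log/syslog stays empty.
    ls -l /var/log/syslog /var/log/syslog.1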
The commonly proposed corrective measures:

- `systemctl daemon-reexec`
- `kill -TERM 1`
- removing `/run/systemd/system/session-*.scope.d`

do not fix the problem. The only remedy is to reboot the entire system, which is of course both disruptive and problematic for a server on the other side of the globe.
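For completeness, the attempted remedies as shell commands (a sketch of the measures listed above, run as root; the `rm` pattern is my reading of "removing `/run/systemd/system/session-*.scope.d`"):

    # Ask systemd to serialize its state and re-execute itself.
    systemctl daemon-reexec

    # Equivalent attempt via signal: SIGTERM to PID 1 makes systemd
    # re-execute rather than terminate.
    kill -TERM 1

    # Workaround sometimes suggested for stale session scopes:
    # drop the leftover transient scope drop-ins.
    rm -rf /run/systemd/system/session-*.scope.d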
The same problem occurred with Ubuntu 16.04 LTS about once per month in a population of about 100 servers. It is much less frequent since the upgrade to 20.04 LTS, but not completely gone. Of the two servers that have been hit since 20.04 LTS, one had already been hit when it was still running 16.04 LTS.
Questions:

- What are possible causes for that sort of `systemd` malfunction?
- How can I diagnose this further?
- Is there a less disruptive way to recover from an unresponsive `systemd` than to reboot?
This is a very old question, but I hope this can save someone else some time.
I had an identical problem: some zombie processes, and `systemctl` answered every request with a timeout. As expected, the fix was to remove the stuck daemons. At least in our case, the solution was: