Firstly, I've recently taken on the management of a Proxmox cluster, something I have no previous experience with (I'm completely new to cluster management, but not too bad with Linux).
pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)
I have two nodes which run a number of containers and VMs between them. Yesterday, a container on the xen2 node, which runs a MySQL database, stopped responding. I was able to log in to the container via SSH and attempted to restart MySQL, only to receive an error along the lines that it was unable to connect to mysql.sock. So I decided to simply shut the container down and start it back up. I chose 'Shutdown' in the Proxmox UI for the container, which shut it down cleanly. Then I clicked 'Start', at which point the Proxmox logs recorded:
CT 110 - Start ERROR: command 'systemctl start pve-container@110' failed: exit code 1
So I tried running 'systemctl start ...' via SSH. It takes a while, and then I get the following:
Job for [email protected] failed because a timeout was exceeded.
See "systemctl status [email protected]" and "journalctl -xe" for details.
Here is the output of 'systemctl status ...':
● [email protected] - PVE LXC Container: 110
Loaded: loaded (/lib/systemd/system/[email protected]; static; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
Docs: man:lxc-start
man:lxc
man:pct
Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/system-pve\x2dcontainer.slice/[email protected]
└─1532500 [lxc monitor] /var/lib/lxc 110
Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Failed with result 'timeout'.
and 'journalctl -xe':
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit [email protected] has failed
-- Defined-By: systemd
--
-- Unit [email protected] has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Failed with result 'timeout'.
Shortly after attempting to restart the container the first time, the entire xen2 node started displaying grey question marks alongside all of its VMs/containers, and they lost their labels (see screenshot).
Despite this, all the other VMs/containers on xen2 are still functioning fine. I then decided to run the following commands to see what would happen:

service pvedaemon restart (nothing changed)
service pveproxy restart (nothing changed)
service pvestatd restart (the VMs started showing names in the Proxmox UI again, but not the containers, and even that only lasted 10-15 minutes)
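Since pvestatd looks like the service that feeds status data to the UI, my plan is to also check its unit status and journal next (again, not done yet; adjust the time window as needed):

systemctl status pvestatd
journalctl -u pvestatd --since "1 hour ago"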
I'm hesitant to upgrade or restart the entire xen2 node because I don't yet know the configuration well, there may be pitfalls ahead, and it's business critical to keep at least something running. Furthermore, I've been through /var/log/syslog and didn't see anything that indicated why the container crashed.
Ideally, I want to achieve the following:
- Determine why the database container (110) crashed
- Successfully start the database container up again
- Determine why the xen2 node isn't reporting data to the UI about its VMs/containers
- Fix the reporting data in the UI for the node

Again, please appreciate I'm new to Proxmox, but I do know my way around Linux.
Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.
Cheers, David
I've also suffered from a problem with similar symptoms (all nodes, VMs, and CTs go into an "unknown" status). From the command line everything seemed fine, so it was more of a nuisance than anything, but it did mean I had to migrate everything and reboot each node individually before I could use the web UI again. I eventually figured out that restarting the following services on each node fixes the problem:
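Roughly like this (I'm listing the pvestatd/pvedaemon/pveproxy trio that the web UI relies on; adjust if your setup differs):

systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy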
I recommend dropping these in a script and running it with
./script.sh &
to fork it off if you plan on using the web UI, since the restart will disconnect your console session.

I run the same commands over SSH to solve the same problem on my server, although I did not use ./script.sh.
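Either way, the script itself can be as simple as something like this (same three services as above; nohup is optional but helps if your session drops):

#!/bin/sh
# restart the services that feed the Proxmox web UI
systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy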
Just stumbled over the same problem (one cluster node only showed grey question marks and the containers lost their labels). In my case this happened shortly after a Proxmox update (from 5.3 to 5.4). After trying similar things to the OP, I finally figured out that my sshd was no longer listening on port 22. After restarting sshd it was not OK immediately but needed about 15 minutes or so; then everything was fine again.
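For anyone else hitting this, the check and fix were roughly the following (from memory, so adapt as needed):

ss -tlnp | grep ':22'
systemctl restart sshd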