Firstly, I've recently taken on the management of a Proxmox cluster, something I have no previous experience with (I'm completely new to cluster management, but not too bad with Linux).
pve-manager/5.1-46/ae8241d4 (running kernel: 4.13.13-6-pve)
I have two nodes which run a number of containers and VMs between them. Yesterday, a container on the xen2 node, which runs a MySQL database, stopped responding. I was able to log in to the container via SSH and attempted to restart MySQL, only to receive an error along the lines that it was unable to connect to mysql.sock. So I decided to simply shut the container down and start it back up. I chose 'Shutdown' in the Proxmox UI for the container, which shut it down cleanly. Then I clicked 'Start', at which point the Proxmox logs recorded:
CT 110 - Start ERROR: command 'systemctl start pve-container@110' failed: exit code 1
So I tried running 'systemctl start ...' via SSH. It takes a while, and then I get the following:
Job for [email protected] failed because a timeout was exceeded.
See "systemctl status [email protected]" and "journalctl -xe" for details.
Here is the output of 'systemctl status ...':
● [email protected] - PVE LXC Container: 110
Loaded: loaded (/lib/systemd/system/[email protected]; static; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2018-06-07 08:35:22 BST; 43s ago
Docs: man:lxc-start
man:lxc
man:pct
Process: 1603366 ExecStart=/usr/bin/lxc-start -n 110 (code=killed, signal=TERM)
Tasks: 1 (limit: 4915)
CGroup: /system.slice/system-pve\x2dcontainer.slice/[email protected]
└─1532500 [lxc monitor] /var/lib/lxc 110
Jun 07 08:33:52 xen2 systemd[1]: Starting PVE LXC Container: 110...
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Failed with result 'timeout'.
and 'journalctl -xe':
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Start operation timed out. Terminating.
Jun 07 08:35:22 xen2 systemd[1]: Failed to start PVE LXC Container: 110.
-- Subject: Unit [email protected] has failed
-- Defined-By: systemd
--
-- Unit [email protected] has failed.
--
-- The result is failed.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Unit entered failed state.
Jun 07 08:35:22 xen2 systemd[1]: [email protected]: Failed with result 'timeout'.
Shortly after attempting to restart the container the first time, the entire xen2 node started displaying grey question marks alongside all of its VMs/containers, and they lost their labels (see screenshot).
Despite this, all the other VMs/containers on xen2 are still functioning fine. I then decided to run the following commands to see what would happen:

service pvedaemon restart (nothing changed)
service pveproxy restart (nothing changed)
service pvestatd restart (the VMs started showing names in the Proxmox UI again, but not the containers, and even that only lasted 10-15 minutes)
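Since pvestatd looks like the service that feeds status data to the UI, my plan is to also check its unit status and journal next (again, not done yet; adjust the time window as needed):

systemctl status pvestatd
journalctl -u pvestatd --since "1 hour ago"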
I'm hesitant to upgrade or restart the entire xen2 node because I don't yet know the configuration well, there may be pitfalls ahead, and it's business critical to keep at least something running. Furthermore, I've been through /var/log/syslog and didn't see anything that indicated why the container crashed.
Ideally, I want to achieve the following:
- Determine why the database container (110) crashed
- Successfully start the database container up again
- Determine why the xen2 node isn't reporting data to the UI about its VMs/containers
- Fix the reporting data in the UI for the node

Again, please appreciate I'm new to Proxmox, but I do know my way around Linux.
Thank you for any tips/knowledge on troubleshooting this problem. If there is any other info you'd like me to share, please let me know.
Cheers, David
I've also suffered from a problem with similar symptoms (all nodes, VMs, and CTs go into an "unknown" status). From the command line everything seemed fine, so it was more of a nuisance than anything, but it did mean I had to migrate everything and reboot each node individually before I could use the web UI again. I eventually figured out that restarting the following services on each node fixes the problem:
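Roughly like this (I'm listing the pvestatd/pvedaemon/pveproxy trio that the web UI relies on; adjust if your setup differs):

systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy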
I recommend dropping these in a script and running it with
./script.sh &
to fork it off if you plan on using the web UI, since the restart will disconnect your console session.

I run the same commands over SSH to solve the same problem on my server, although I did not use ./script.sh.
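Either way, the script itself can be as simple as something like this (same three services as above; nohup is optional but helps if your session drops):

#!/bin/sh
# restart the services that feed the Proxmox web UI
systemctl restart pvestatd
systemctl restart pvedaemon
systemctl restart pveproxy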
Just stumbled over the same problem (one cluster node only showed grey question marks and the containers lost their labels). In my case this happened shortly after a Proxmox update (from 5.3 to 5.4). After trying similar things to the OP, I finally figured out that my sshd was no longer listening on port 22. After restarting sshd it was not OK immediately but needed about 15 minutes or so; then everything was fine again.
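For anyone else hitting this, the check and fix were roughly the following (from memory, so adapt as needed):

ss -tlnp | grep ':22'
systemctl restart sshd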