I am having what appears to be a DNS-related issue that I would appreciate some assistance resolving.
I'm using Ansible to provision a Kubernetes cluster on my Proxmox server. The project works in two ways: by modifying site.yml, the user can deploy either with Linux Containers (LXC) or with virtual machines built from a CentOS 7 qcow2 image.
When deploying with LXC, the project experiences no issues and correctly bootstraps a Kubernetes cluster. However, when using the qcow2 image, I encounter what appears to be a DNS-related issue. It occurs at the changeover between the playbook that provisions my virtual machines and the one that connects to them for the first time to prepare them.
What happens is that the Gathering Facts stage eventually times out and Ansible throws the following error:
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************************
fatal: [pluto.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host pluto.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [ceres.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host ceres.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [eris.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host eris.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [haumea.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host haumea.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
If, after this occurs, I try to SSH into the servers manually, I can verify that SSH is taking a very long time to connect. I would like to note at this point that this does NOT occur with the LXC instances, which use the exact same hostnames, IP addresses, and name servers.
The issue can then be resolved by setting the UseDNS no directive in the sshd_config file on each of the servers, restarting sshd.service, and running the playbook again.
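For reference, applying that workaround from Ansible only takes a couple of tasks. This is a minimal sketch, assuming the VMs live in an inventory group I'm calling k8s_nodes here (the group name is a placeholder):

- hosts: k8s_nodes
  become: yes
  tasks:
    - name: Disable reverse DNS lookups in sshd (workaround, not a root-cause fix)
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?UseDNS'
        line: 'UseDNS no'

    - name: Restart sshd so the new setting takes effect
      service:
        name: sshd
        state: restarted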
So, naturally, this looks like a DNS issue. However, since it doesn't occur with LXC, I'm skeptical. Here are a few more data points about my DNS configuration.
1) The DNS server that they're all using is BIND, installed on a server named IO.Sol.Milkyway at 192.168.1.10. There are no VNets or subnets or anything in my homelab, and the gateway is correctly set to my router, 192.168.1.1, so there are no routing issues to this server.
2) Here are the relevant parts of the DNS zones on my BIND server.
3) Here are some nslookups performed from the Proxmox server, run under the time command, to demonstrate that my BIND server responds correctly in a few hundredths of a second:
$> time nslookup pluto.sol.milkyway
Server: 192.168.1.100
Address: 192.168.1.100#53
Name: pluto.sol.milkyway
Address: 192.168.1.170
nslookup pluto.sol.milkyway 0.00s user 0.02s system 39% cpu 0.042 total
-and-
$> time nslookup 192.168.1.170
Server: 192.168.1.100
Address: 192.168.1.100#53
170.1.168.192.in-addr.arpa name = pluto.sol.milkyway.
nslookup 192.168.1.170 0.01s user 0.01s system 96% cpu 0.013 total
4) And, lastly, you can see that my nameservers are correctly configured on the VMs via cloud-init on lines 104, 115, 126, & 137 here, which reference the variables defined here.
-----EDITS BELOW-----
5) I'm able to successfully perform a forward and reverse nslookup from the following. Each response takes < 1.5 seconds:
- My personal workstation (Executes Ansible)
- My Proxmox Server (Runs Ansible Commands & VMs)
- The 4 Virtual Machines
Here is an example from what would be the Kubernetes Master server.
I have found the problem. It appears that my resulting VMs contained an additional nameserver that was introduced automatically by qemu. This happens when a VM is created without a network device specified for it, as described in the Proxmox documentation for qm.

My procedure was as follows:
1) Create VM using Proxmox API through the Proxmox_KVM Ansible Module.
2) Clone four Kubernetes VMs from this VM.
3) Configure each of the Kubernetes VMs in turn.
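To make Step 2) concrete, the clone step boils down to something like the following shell task run against the Proxmox node. This is only a sketch; the template VMID and the per-VM IDs are placeholders, and the hostnames are the ones used above:

- name: Clone the four Kubernetes VMs from the template VM (illustrative sketch)
  shell: "qm clone {{ template_vmid }} {{ item.vmid }} --name {{ item.name }} --full"
  loop:
    - { vmid: 101, name: pluto }
    - { vmid: 102, name: ceres }
    - { vmid: 103, name: eris }
    - { vmid: 104, name: haumea }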
During Step 1) I did, in fact, declare a bridge. However, in Step 2) I did not, as it is a simple qm clone, which, according to the documentation, does not support a net[n] flag being passed. It was at this point that the random nameserver was introduced. Then, when Step 3) came around and I set a nameserver through cloud-init, it was appended to my /etc/resolv.conf file as the second nameserver.

I'm currently reworking my playbook to try to get around this by running the following task between Step 1) and Step 2):
Crossing my fingers that this will resolve the issue.
-----EDIT-----
It did not. And it does not appear to be possible to provision a network adapter when doing a qm clone, meaning I will have to rework my playbook to provision four individual instances rather than cloning from a template.
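As a rough sketch of what that rework looks like, each VM gets created directly with its network device declared up front, e.g. via the proxmox_kvm module. The API host, node name, and sizing values below are placeholders, and the exact parameter names can vary between module versions:

- name: Create a VM with its net0 device declared at creation time (sketch)
  proxmox_kvm:
    api_host: "{{ proxmox_api_host }}"
    api_user: root@pam
    api_password: "{{ proxmox_api_password }}"
    node: proxmox
    name: pluto
    net: '{"net0":"virtio,bridge=vmbr0"}'
    cores: 2
    memory: 2048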
-----EDIT 2-----

It also does not appear that the crappy Proxmox_kvm Ansible module supports cloud-init-related API stuff, meaning I'm going to have to do everything through shell commands and leverage qm instead. :(
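In practice that means wrapping qm set in shell tasks. A minimal sketch for one VM, using the addresses already mentioned above (the vmid variable and the delegation target are placeholders):

- name: Configure cloud-init networking with qm instead of the module (sketch)
  shell: >
    qm set {{ vmid }}
    --ipconfig0 ip=192.168.1.170/24,gw=192.168.1.1
    --nameserver 192.168.1.10
    --searchdomain sol.milkyway
  delegate_to: proxmox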
-----EDIT 3-----

Looks like that nameserver is actually IN THE BASE IMAGE BY DEFAULT. WTF CENTOS?
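For what it's worth, one way to scrub that stale resolv.conf out of the base image before any VMs are built from it is virt-customize from libguestfs-tools. A sketch, with the image path as a placeholder:

- name: Blank out the baked-in resolv.conf in the CentOS cloud image (sketch)
  shell: >
    virt-customize
    -a /var/lib/vz/template/qcow/CentOS-7-x86_64-GenericCloud.qcow2
    --run-command 'truncate -s 0 /etc/resolv.conf'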