I am having what appears to be a DNS-related issue that I would appreciate some assistance resolving.
I'm using Ansible to provision a Kubernetes cluster on my Proxmox server. The project works in two ways: by modifying site.yml, the user can deploy either with Linux Containers (LXC) or with virtual machines built from a CentOS 7 qcow2 image.
When deploying with LXC, the project experiences no issues and correctly bootstraps a Kubernetes cluster. However, when using the qcow2 image, I encounter what appears to be a DNS-related issue. It occurs at the changeover between the playbook that provisions my virtual machines and the one that connects to them for the first time to prepare them.
What happens is that the Gathering Facts stage eventually times out and Ansible throws the following error:
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************************************
fatal: [pluto.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host pluto.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [ceres.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host ceres.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [eris.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host eris.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
fatal: [haumea.sol.milkyway]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host haumea.sol.milkyway port 22: Operation timed out\r\n", "unreachable": true}
If, after this occurs, I try to SSH into the servers manually, I can verify that SSH is taking a very long time to connect. I would like to note at this point that this does NOT occur with the LXC instances, which use the exact same hostnames, IP addresses, and name servers.
The issue can then be resolved by setting the UseDNS no directive in the sshd_config file on each of the servers, restarting sshd.service, and running the playbook again.
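For reference, applying that workaround from Ansible only takes a couple of tasks. This is a minimal sketch, assuming the VMs live in an inventory group I'm calling k8s_nodes here (the group name is a placeholder):

- hosts: k8s_nodes
  become: yes
  tasks:
    - name: Disable reverse DNS lookups in sshd (workaround, not a root-cause fix)
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?UseDNS'
        line: 'UseDNS no'

    - name: Restart sshd so the new setting takes effect
      service:
        name: sshd
        state: restarted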
So, naturally, this looks like a DNS issue. However, since it doesn't occur with LXC, I'm skeptical. Here are a few more data points about my DNS configuration.
1) The DNS server that they're all using is BIND, installed on a server named IO.Sol.Milkyway at 192.168.1.10. There are no VNets or subnets or anything in my homelab, and the gateway is correctly set to my router, 192.168.1.1, so there are no routing issues to this server.
2) Here are the relevant parts of the DNS zones on my BIND server.
3) Here are some nslookups performed from the Proxmox server, run under the time command, to demonstrate that my BIND server responds correctly in a few hundredths of a second:
$> time nslookup pluto.sol.milkyway
Server: 192.168.1.100
Address: 192.168.1.100#53
Name: pluto.sol.milkyway
Address: 192.168.1.170
nslookup pluto.sol.milkyway 0.00s user 0.02s system 39% cpu 0.042 total
-and-
$> time nslookup 192.168.1.170
Server: 192.168.1.100
Address: 192.168.1.100#53
170.1.168.192.in-addr.arpa name = pluto.sol.milkyway.
nslookup 192.168.1.170 0.01s user 0.01s system 96% cpu 0.013 total
4) And, lastly, you can see that my nameservers are correctly configured on the VMs via cloud-init on lines 104, 115, 126, & 137 here, which reference the variables defined here.
-----EDITS BELOW-----
5) I'm able to successfully perform a forward and reverse nslookup from the following. Each response takes < 1.5 seconds:
- My personal workstation (Executes Ansible)
- My Proxmox Server (Runs Ansible Commands & VMs)
- The 4 Virtual Machines
Here is an example from what would be the Kubernetes Master server.
I have found the problem. It appears that my resulting VMs contained an additional nameserver that was introduced automatically by qemu. This happens when a VM is created without a network device specified for it, as described in the Proxmox documentation for qm.

My procedure was as follows:
1) Create VM using Proxmox API through the Proxmox_KVM Ansible Module.
2) Clone four Kubernetes VMs from this VM.
3) Configure each of the Kubernetes VMs in turn.
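To make Step 2) concrete, the clone step boils down to something like the following shell task run against the Proxmox node. This is only a sketch; the template VMID and the per-VM IDs are placeholders, and the hostnames are the ones used above:

- name: Clone the four Kubernetes VMs from the template VM (illustrative sketch)
  shell: "qm clone {{ template_vmid }} {{ item.vmid }} --name {{ item.name }} --full"
  loop:
    - { vmid: 101, name: pluto }
    - { vmid: 102, name: ceres }
    - { vmid: 103, name: eris }
    - { vmid: 104, name: haumea }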
During Step 1) I did, in fact, declare a bridge. However, in Step 2) I did not, as it is a simple qm clone, which, according to the documentation, does not support a net[n] flag being passed. It was at this point that the random nameserver was introduced. Then, when Step 3) came around and I set a nameserver through cloud-init, it was appended to my /etc/resolv.conf file as the second nameserver.

I'm currently reworking my playbook to try to get around this by running the following task between Step 1) and Step 2):
Crossing my fingers that this will resolve the issue.
-----EDIT-----
It did not. And it does not appear to be possible to provision a network adapter when doing a qm clone, meaning I will have to rework my playbook to provision four individual instances rather than cloning from a template.
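As a rough sketch of what that rework looks like, each VM gets created directly with its network device declared up front, e.g. via the proxmox_kvm module. The API host, node name, and sizing values below are placeholders, and the exact parameter names can vary between module versions:

- name: Create a VM with its net0 device declared at creation time (sketch)
  proxmox_kvm:
    api_host: "{{ proxmox_api_host }}"
    api_user: root@pam
    api_password: "{{ proxmox_api_password }}"
    node: proxmox
    name: pluto
    net: '{"net0":"virtio,bridge=vmbr0"}'
    cores: 2
    memory: 2048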
-----EDIT 2-----

It also does not appear that the crappy Proxmox_kvm Ansible module supports cloud-init-related API stuff, meaning I'm going to have to do everything through shell commands and leverage qm instead. :(
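In practice that means wrapping qm set in shell tasks. A minimal sketch for one VM, using the addresses already mentioned above (the vmid variable and the delegation target are placeholders):

- name: Configure cloud-init networking with qm instead of the module (sketch)
  shell: >
    qm set {{ vmid }}
    --ipconfig0 ip=192.168.1.170/24,gw=192.168.1.1
    --nameserver 192.168.1.10
    --searchdomain sol.milkyway
  delegate_to: proxmox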
-----EDIT 3-----

Looks like that nameserver is actually IN THE BASE IMAGE BY DEFAULT. WTF CENTOS?
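For what it's worth, one way to scrub that stale resolv.conf out of the base image before any VMs are built from it is virt-customize from libguestfs-tools. A sketch, with the image path as a placeholder:

- name: Blank out the baked-in resolv.conf in the CentOS cloud image (sketch)
  shell: >
    virt-customize
    -a /var/lib/vz/template/qcow/CentOS-7-x86_64-GenericCloud.qcow2
    --run-command 'truncate -s 0 /etc/resolv.conf'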