I have a problem where the DNS entry for an external domain broke. The exact nature of the breakage at the time is unknown.
That domain was queried from a pod in a Google Kubernetes Engine cluster while the entry was broken. The problem still persists when querying that domain from the cluster, even though the incident happened over 2 months ago.
The cluster DNS resolver uses metadata.google.internal for DNS resolution, and from the cluster, dig queries behave as follows:
dig problematic.external.domain @169.254.169.254
# does not resolve and takes over 2 seconds
dig problematic.external.domain @1.1.1.1
# resolves correctly under 200ms
A new VM created in the same project and zone resolves the problematic domain correctly. This affects only the metadata server DNS resolver of the active cluster.
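The comparison above can be repeated from any pod or VM with a small helper. This is only a sketch, assuming dig is installed where it runs; the function name compare_resolvers is made up for illustration, and the parsing relies on dig's usual "status:" and "Query time:" lines.

```shell
# compare_resolvers DOMAIN RESOLVER...
# Hypothetical helper: queries DOMAIN against each resolver with dig and
# prints the response status and the query time reported by dig.
compare_resolvers() {
  domain="$1"
  shift
  for resolver in "$@"; do
    # One attempt with a 5 second timeout so a slow resolver is visible.
    out="$(dig +tries=1 +time=5 "$domain" "@$resolver" 2>&1)"
    # Extract e.g. NOERROR, NXDOMAIN or SERVFAIL from the header line.
    status="$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)"
    # Extract the number of milliseconds from ";; Query time: N msec".
    qtime="$(printf '%s\n' "$out" | sed -n 's/^;; Query time: \([0-9]*\) msec.*/\1/p' | head -n 1)"
    printf '%s status=%s time_ms=%s\n' "$resolver" "${status:-none}" "${qtime:-n/a}"
  done
}

# Example call (not executed here):
#   compare_resolvers problematic.external.domain 169.254.169.254 1.1.1.1
```

Running it both from a cluster pod and from the fresh VM makes it easier to show that only the cluster's view of 169.254.169.254 is affected.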
Is there a way to flush the DNS caches, or are there any other suggestions?
In general I'm trying to avoid editing in-cluster DNS settings and would prefer some other means to fix it.
Edit, more info:
NodeLocal DNSCache is already active in the cluster and, referencing its documentation (https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache), the problem is the metadata DNS server.
This excerpt from the benefits list:
DNS queries for external URLs (URLs that don't refer to cluster resources) are forwarded directly to the local Cloud DNS metadata server, bypassing kube-dns.
That metadata server is the IP 169.254.169.254.
Although there is no specific way to flush the cache of Cloud DNS's metadata server, each record has a TTL and GCE DNS mostly respects it, so a cached entry expires and is invalidated after a certain time.
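To see which TTL a resolver is currently handing out, something like the following could be used. It is a rough sketch, assuming dig is available; record_ttls is a made-up name. It also prints the authority section, because for a negative answer such as NXDOMAIN the SOA record there governs how long the negative result may be cached.

```shell
# record_ttls DOMAIN RESOLVER
# Hypothetical helper: prints "TYPE TTL" for every record in the answer and
# authority sections returned by RESOLVER for DOMAIN.
record_ttls() {
  dig +noall +answer +authority "$1" "@$2" |
    awk '$3 == "IN" { print $4, $2 }'
}

# Example call (not executed here):
#   record_ttls problematic.external.domain 169.254.169.254
```

If the same call is repeated a few seconds apart and the TTL counts down, the answer is coming from a cache. A TTL that never reaches zero, or no output at all for months, suggests the problem is not ordinary TTL-based caching.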
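Nevertheless, if the problem is with the cache, it should be fixed by cordoning the GKE node using the
kubectl cordon $NODENAME
command. Note that cordoning only marks the node as unschedulable; pods already running on it stay there until they are drained or recreated. Furthermore, you can bypass GCE DNS by specifying a stub DNS configuration. Check out this link for details.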
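As an illustration of the stub DNS idea, the sketch below prints a kube-dns ConfigMap that sends one domain to a different resolver. It assumes the cluster uses kube-dns with its ConfigMap in kube-system, and that 1.1.1.1 is an acceptable upstream; stub_domain_manifest is a made-up helper name. Applying it replaces whatever data the existing kube-dns ConfigMap holds, so check the current contents first.

```shell
# stub_domain_manifest DOMAIN RESOLVER
# Hypothetical helper: prints a kube-dns ConfigMap manifest whose stubDomains
# entry forwards queries for DOMAIN to RESOLVER instead of the default upstream.
stub_domain_manifest() {
  domain="$1"
  resolver="$2"
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"${domain}": ["${resolver}"]}
EOF
}

# Review the output, then apply it (not executed here):
#   stub_domain_manifest problematic.external.domain 1.1.1.1 | kubectl apply -f -
```

Printing the manifest first, rather than applying it directly, leaves room to merge it with any stub domains that are already configured.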
The NodeLocal DNSCache add-on can help resolve the mentioned domain in your case, because it forwards DNS queries for external URLs directly to the local Cloud DNS metadata server, bypassing kube-dns. Since your Compute Engine VM can resolve the domain (using local Cloud DNS), your cluster should be able to do so as well.
Refer to this documentation for detailed instructions on how to configure NodeLocal DNSCache on a GKE cluster.