I am managing a couple of web proxies running Squid 4.10 on Ubuntu 20.04 LTS in several locations distributed worldwide. One of them has developed a nasty habit of occasionally failing to access a web page. Instead, the user receives an error page saying:
Hmmm... can't reach this page
It looks like the webpage at <URL> might be having issues,
or it may have moved permanently to a new web address.
ERR_TUNNEL_CONNECTION_FAILED
After adding %err_code/%err_detail to the end of the relevant logformat, as recommended on this mailing list post, Squid access.log entries for the failing accesses look like this:
1635169354.239 171 10.72.1.103 NONE/503 0 CONNECT ad.360yield.com:443 - HIER_NONE/- - ERR_DNS_FAIL/-
The Squid status is always NONE/503, and the error code and detail are always ERR_DNS_FAIL/-.
The timestamp, client IP address and requested URL vary of course.
Each occurrence of the problem affects a single FQDN or a very small number of FQDNs, often all from the same organisation (e.g. lm.licenses.adobe.com and cc-api-data.adobe.io, both from Adobe). All other accesses continue to work normally. An occurrence typically lasts between five and ten minutes. During that time, all clients trying to access that FQDN are affected. Before and after that, the same FQDN works without a problem. There is no discernible regularity in the affected FQDNs.
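To make the pattern described above visible, the failing entries can be grouped into per-FQDN episodes. The following is only a rough sketch: it assumes the access.log layout of the sample entry (the default squid format with %err_code/%err_detail appended as the last field), and the function names and the 300-second gap threshold are my own choices for illustration, not anything Squid provides.

```python
from collections import defaultdict


def parse_dns_failures(lines):
    """Yield (timestamp, host) for access.log lines whose last field is ERR_DNS_FAIL/..."""
    for line in lines:
        fields = line.split()
        # Assumed layout: 11 whitespace-separated fields, with
        # %err_code/%err_detail as the final one (as in the sample entry).
        if len(fields) < 11 or not fields[-1].startswith("ERR_DNS_FAIL/"):
            continue
        timestamp = float(fields[0])
        target = fields[6]
        # CONNECT targets look like host:port; other URLs are kept whole.
        host = target.rsplit(":", 1)[0] if fields[5] == "CONNECT" else target
        yield timestamp, host


def group_episodes(failures, max_gap=300.0):
    """Group failures per host into episodes split by gaps longer than max_gap seconds.

    Returns a list of (host, first_timestamp, last_timestamp, count),
    sorted by the start of each episode.
    """
    by_host = defaultdict(list)
    for timestamp, host in failures:
        by_host[host].append(timestamp)

    episodes = []
    for host, stamps in by_host.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > max_gap:
                # Gap too long: close the current episode, start a new one.
                episodes.append((host, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        episodes.append((host, start, prev, count))
    return sorted(episodes, key=lambda episode: episode[1])
```

Feeding it the lines of /var/log/squid/access.log (for example via open(...)) gives one tuple per episode, which makes it easier to see how long each outage lasted and which FQDNs failed together.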
Some of the occurrences are accompanied by a message like the following in /var/log/squid/cache.log, but in the majority of cases nothing is logged there:
2021/10/25 15:42:34 kid1| ipcacheParse No Address records in response to 'ad.360yield.com'
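For correlating these cache.log messages with the access.log failures, something like the sketch below could pull out the time and hostname of each message. The regular expression is modelled only on the single message quoted above, so other Squid versions may word it differently, and the function name is made up for this example. I am also assuming that cache.log timestamps are in local time, whereas access.log uses epoch seconds, so they have to be converted before comparing.

```python
import re
from datetime import datetime

# Pattern modelled on the one cache.log message quoted above (an assumption:
# other Squid builds may format this message differently).
NO_ADDRESS = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \S+ "
    r"ipcacheParse:? No Address records in response to '([^']+)'"
)


def find_no_address_events(lines):
    """Return a list of (datetime, hostname) for matching cache.log lines."""
    events = []
    for line in lines:
        match = NO_ADDRESS.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            events.append((when, match.group(2)))
    return events
```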
How can I find out what goes wrong there?
Increasing the log level for DNS lookups to 6, by putting a debug_options directive for the DNS debug section into /etc/squid/squid.conf, makes Squid log to /var/log/squid/cache.log which nameserver was used for the failed queries. The failures can then be further investigated on that nameserver.
In my case, this pointed to a dnsmasq DNS proxy server running on the same machine. Enabling query logging on dnsmasq revealed that one of the four configured external nameservers was responsible for the failures. Queries sent to that nameserver failed, while queries sent to one of the other three succeeded. So the solution was to drop the faulty external nameserver from the configuration.
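If you want to confirm which upstream nameserver misbehaves without waiting for dnsmasq to pick it, each one can be queried directly while a failure is ongoing. This is not what I did at the time (I used the dnsmasq query log), just a minimal sketch using only the Python standard library. It assumes IPv4 nameservers reachable over UDP port 53, and all function names are invented for this example.

```python
import socket
import struct


def build_query(name, query_id=0x1234, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_header(response):
    """Return (query_id, rcode, answer_count) from a DNS response header."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return query_id, flags & 0x000F, ancount


def probe(server, name, timeout=3.0):
    """Send one A query for name to server over UDP and describe the outcome."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            response, _ = sock.recvfrom(4096)
        except OSError as exc:  # covers timeouts and unreachable servers
            return f"no reply ({exc})"
    if len(response) < 12:
        return "short reply"
    _query_id, rcode, answers = parse_header(response)
    # rcode 0 = no error, 2 = server failure, 3 = name does not exist.
    return f"rcode={rcode} answers={answers}"
```

Calling probe() for the failing FQDN against each configured external nameserver in turn (placeholder addresses such as 192.0.2.1 stand in for the real ones) should show the faulty one returning an error code, zero answers, or no reply while the others answer normally.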