For the last few years we have been running Varnish as a cache and load balancer in front of several Apache servers that host several thousand websites.
We also use monit to ensure that if varnish ever dies it gets restarted. The varnish section in monitrc looks like this:
    # Check varnish on port 80
    check process varnish with pidfile /var/run/varnishd.pid
      start program = "/etc/init.d/varnish start"
      stop program = "/etc/init.d/varnish stop"
      if failed host 127.0.0.1 port 80 protocol http
        and request "/monit-check-url"
      then restart
This has worked fine for at least 3 years. We get occasional failures of the port 80 check, but monit restarts varnish accordingly and it's generally unnoticeable to users.
However, over the last few weeks we have been seeing flurries of these failures, usually over a period of a couple of hours, and users are noticing connection failures. Today has been particularly bad.
There are no clues in syslog (it's a Debian box, by the way), which is where the "Varnish crashing" section at https://www.varnish-cache.org/docs/3.0/tutorial/troubleshooting.html suggests looking. All we see in there is monit failing its check on port 80, then stopping and starting varnish.
Additionally we are not seeing any spike in bandwidth or number of hits to the backend webservers that would suggest it's failing under higher than normal load.
We were running Varnish 3.0.3, which I upgraded to 3.0.7, but the problem has continued. No other changes have been made to this box that coincide with the problems starting, and the varnish configuration hasn't been changed in quite a long time.
Has anyone had any similar experiences with varnish or have any suggestions on troubleshooting this further? Could it be some sort of attack?
Any help or advice greatly appreciated!
Your approach here seems a little heavy-handed, as there are many reasons why a request could fail, not all of which are varnish problems (e.g. connectivity issues, failures on the backends, etc.). Restarting varnish will cause an outage while it starts up again, so it should only be used as a last resort.
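As a rough sketch of a less trigger-happy check, you could make monit tolerate a few failed probes before it restarts anything. This assumes your monit version supports the `with timeout` and `for N cycles` clauses; the 10 second timeout and the 3 cycle threshold are illustrative values, not something taken from your setup:

```
# Sketch only: same check as before, but a restart happens only after
# the probe has failed on 3 consecutive monit cycles.
check process varnish with pidfile /var/run/varnishd.pid
  start program = "/etc/init.d/varnish start"
  stop program = "/etc/init.d/varnish stop"
  if failed host 127.0.0.1 port 80 protocol http
    and request "/monit-check-url"
    with timeout 10 seconds   # assumed value, tune to your backends
    for 3 cycles              # assumed value, tune to your poll interval
  then restart
```

While you are still diagnosing, you could also swap `restart` for `alert` so that monit only notifies you and leaves varnish running for inspection.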
Before restarting anything, I'd recommend running
    varnishadm debug.health
on the varnish box to see what state varnish considers your backend to be in. Depending on the result, you can decide where to look further: