using: varnish-3.0.4
Can anyone suggest a potential cause of backend connection failures? They normally appear when n_worker_thread rises above the default of 100 worker threads (though not every time).
In one of several cases, at a peak of 491 worker threads Varnish was unable to connect to the backend, even though the backend servers were under no noticeable load. To narrow the issue down: it is not a problem with the backend servers themselves, as they stay healthy and reachable.
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
As I understand it, the "backend conn. failure" count is at odds with the configuration: 1) thread max is 1000 * 2 [pools], 2) server load is below 1.
Theoretically it should be able to handle spikes of that size, and I cannot see why backend connections would fail here.
[NOTE: due to the nature of the traffic, objects are cached for 1s to 5s at most.]
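For reference, that short TTL is capped in vcl_fetch, roughly like the sketch below; the 5s cap is real, but this is a simplified illustration rather than a copy of my VCL:

sub vcl_fetch {
    # keep cached objects no longer than 5 seconds
    if (beresp.ttl > 5s) {
        set beresp.ttl = 5s;
    }
}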
n_worker_thread = 100: all good
n_worker_thread = 491: 8 backend connection failures
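(I pulled those two counters together with varnishstat; if it helps anyone reproduce, something along these lines, assuming the comma-separated -f field list of 3.x:)

varnishstat -1 -f n_wrk,backend_fail
# or keep it on screen while a spike happens
watch -n 1 'varnishstat -1 -f n_wrk,backend_fail'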
varnishadm
thread_pool_add_delay 2 [milliseconds]
thread_pool_add_threshold 2 [requests]
thread_pool_fail_delay 200 [milliseconds]
thread_pool_max 1000 [threads]
thread_pool_min 50 [threads]
thread_pool_purge_delay 1000 [milliseconds]
thread_pool_stack unlimited [bytes]
thread_pool_timeout 120 [seconds]
thread_pool_workspace 65536 [bytes]
thread_pools 2 [pools]
thread_stats_rate 10 [requests]
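(All of the above can be inspected and changed at runtime through varnishadm param.show / param.set without a restart, e.g. the following, where 100 is just an example value:)

varnishadm param.show thread_pool_max
varnishadm param.set thread_pool_min 100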
varnishstat
32+03:45:05
Hitrate ratio: 2 2 2
Hitrate avg: 0.9404 0.9404 0.9404
backend_conn 4516262 1.63 Backend conn. success
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
backend_fail 9562 0.00 Backend conn. failures
backend_reuse 67350518 24.24 Backend conn. reuses
backend_toolate 361647 0.13 Backend conn. was closed
backend_recycle 67715544 24.38 Backend conn. recycles
backend_retry 5133 0.00 Backend conn. retry
n_backend 5 . N backends
backend_req 71855086 25.87 Backend requests made
LCK.backend.creat 5 0.00 Created locks
LCK.backend.destroy 0 0.00 Destroyed locks
LCK.backend.locks 149007648 53.64 Lock Operations
LCK.backend.colls 0 0.00 Collisions
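For scale, a back-of-the-envelope from the counters above: 9562 backend_fail against 71855086 backend_req is roughly 0.013% of backend requests over the uptime shown, so the failures are rare overall but arrive in bursts during the thread spikes described earlier.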
Hi Shane, thanks for the response.
I just managed to figure out that the backend communication issue was not due to any configuration failure, but to a hardware switch between the backend and Varnish.
This was difficult to analyse because the primary switch worked fine, whereas the secondary switch was the one causing problems during fail-over communication.
This makes it clear that backend conn. failures without accompanying backend_busy ("too many"), backend_unhealthy, or queueing counters are unlikely to be a Varnish configuration problem.
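One thing that would have surfaced this faster is a health probe on each backend, so a dead path marks the backend sick and gets counted under backend_unhealthy ("conn. not attempted") instead of producing silent conn. failures. A rough sketch in 3.x VCL (host, port and probe URL are placeholders, not my actual config):

backend app1 {
    .host = "10.0.0.10";          # placeholder address
    .port = "8080";
    .connect_timeout = 1s;
    .probe = {
        .url = "/health";         # placeholder health-check endpoint
        .interval = 2s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}

With a probe defined, "varnishadm debug.health" (on 3.x, if I remember the command right) reports each backend's state, which would have made a fail-over path problem like mine much easier to spot.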
Hope this will be useful to someone in the future.