Problem
Regular timeouts from the memcached backends in Nginx (see example errors below)
Things already tested:
- it occurs both on local and remote memcached servers so the network can't be the problem
- it occurs both with the memc and memcached backends
- it occurs both with and without the upstream keepalive module
- it occurs in nginx 0.8, 1.0.x and 1.1.x
- buffering logs appear to slightly decrease the amount of errors, but they are still there and this could simply be coincidence
- logging through udp (i.e. non-blocking) to another makes no difference
Other info:
- the memcached server handles ~500 requests per second from both this and other (remote) clients
- there are no errors/reports in the memcached logs
- memcached is started with the following settings:
- both tcp and udp are enabled (only tcp is used currently)
- 16GB of memory
- 16 threads (on a 32 core machine)
- limited to 8192 connections
- started in non-daemonic mode through supervisord
- The servers run FreeBSD 8.2
- The timeouts occur both during low and high loads (right now only about 50 request per second)
The question
So my question is... how can I debug these problems? Anyone has an idea where the underlying cause could be? The errors don't occur continuously but frequently enough to show up in the logs every couple of minutes.
Example errors from the logs:
2011/08/30 17:23:34 [error] 13921#0: *38602 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38591 upstream timed out (60: Operation timed out) while reading response header from u`enter code here`pstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38601 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:34 [error] 13921#0: *38918 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:23:35 [error] 13921#0: *38595 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:32 [error] 13921#0: *41768 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:24:36 [error] 13921#0: *38599 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:12 [error] 13921#0: *42489 upstream timed out (60: Operation timed out) while connecting to upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39444 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
2011/08/30 17:25:30 [error] 13922#0: *39452 upstream timed out (60: Operation timed out) while reading response header from upstream, client: x.x.x.x, server: some.host.com, request: "GET /some_random_file HTTP/1.1", subrequest: "/memc_get", upstream: "memcached://127.0.0.1:11211", host: "some.host.com"
Output from netstat -m
# netstat -m
3404/25531/28935 mbufs in use (current/cache/total)
3318/25004/28322/229376 mbuf clusters in use (current/cache/total/max)
1161/20599 mbuf+clusters out of packet secondary zone in use (current/cache)
26/4420/4446/192000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
7591K/74070K/81661K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
4925075 requests for I/O initiated by sendfile
0 calls to protocol drain routines
From the error log output it looks as if NGiNX is waiting for network communication both in attempts to setup new connections and get data back from old connections. You mentioned it happens both locally and remote so you say you ruled out networking. However even the local one is still using a TCP socket to connect right? Have you already checked that you aren't running out of mbufs on the BSD box:
If you see mbufs are the issue you need to update your loader.conf and reboot(sadly not a runtime tunable), from /boot/loader.conf:
If you don't see anything there that indicates total==max then you should probably start by confirming that nginx isn't to blame. I would probably do this by tcpduming the connection from nginx to memcached and verifying that nginx has indeed sent connection setups or requests when it claims its timing out waiting. If that works out then I'd probably start running ktrace on the memcached process for a few minutes while an error occurs and look through the kdump to see if there are system calls failing (like send or accepts).