Ubuntu Server 10.04.2
$ uname -a
Linux my.local 2.6.32-30-generic-pae #59-Ubuntu SMP Tue Mar 1 23:01:33 UTC 2011 i686 GNU/Linux
It seems that my domain socket queue is overflowing, but I can't prove it.
I've got this stack: nginx->[spawn-fcgi->multiwatch->]custom-fcgi-service
Nginx communicates with custom-fcgi-service over a Unix domain socket.
Today we got a slight increase in traffic, and suddenly my nginx error.log is full of errors like this:
2011/04/07 15:31:51 [error] 28187#0: *469350 connect() to unix:/tmp/my.socket failed (11: Resource temporarily unavailable) while connecting to upstream, client: [IP withheld], server: my.local, request: "GET /myurl HTTP/1.0", upstream: "fastcgi://unix:/tmp/my.socket:", host: "example.com"
Some requests make it through, but many return a 5xx error.
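For context on what errno 11 means here: on Linux, a non-blocking connect() to a Unix stream socket fails with EAGAIN when the listener's accept queue is already full. The sketch below reproduces that in isolation. It is only an illustration under assumptions: the socket path is a throwaway temp file, the function name `fill_backlog` is made up for this example, and the listener simply never calls accept() to stand in for workers that are too busy. The exact number of connections that fit before the failure depends on the kernel.

```python
import os
import socket
import tempfile


def fill_backlog(backlog=1, max_clients=64):
    """Connect non-blocking clients to a listener that never accepts.

    Returns (queued, err): how many connects succeeded, and the errno of
    the first connect that failed (0 if none failed).
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")  # hypothetical path, demo only
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(backlog)  # accept() is never called: simulates busy workers
    clients = []
    err = 0
    try:
        for _ in range(max_clients):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            client.setblocking(False)  # non-blocking, like nginx's upstream connect
            clients.append(client)
            err = client.connect_ex(path)  # returns an errno instead of raising
            if err != 0:
                break
        queued = len(clients) - (1 if err else 0)
        return queued, err
    finally:
        for client in clients:
            client.close()
        server.close()
        os.unlink(path)
        os.rmdir(tmpdir)
```

Calling `fill_backlog(backlog=1)` on a Linux box should return an `err` equal to `errno.EAGAIN` after only a few successful connects, which is the same error code nginx logs above.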
If I restart custom-fcgi-service, the error goes away, but soon enough it reappears. After inspecting the status of custom-fcgi-service, I'm reasonably sure that it works OK (though it may be too slow for this amount of traffic, but that is a mere hypothesis).
I've tried doing this:
echo 65535 > /proc/sys/net/unix/max_dgram_qlen
But it did not help much. (I'm not sure whether the time-to-error became longer; maybe it did, but not enough to fix the problem.)
If I increase the number of worker forks of custom-fcgi-service, the error does not appear for a longer time, but so far I have not been able to increase the number of workers high enough to fix it for good. CPU, memory and IO load on that machine are well within limits, so, again, I think that custom-fcgi-service is just being slow on some subsequent network call.
The question is: how do I debug this issue? And if it is indeed the socket queue length, how do I make a sensor that will warn us that we need to fork more custom-fcgi-service workers?
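One crude starting point for such a sensor, sketched under assumptions, is to count the EAGAIN connect failures that nginx itself logs. The regular expression below is derived only from the error line quoted above, so it may need adjusting for other nginx versions or log formats. The function names, the threshold and the log path in the usage note are all hypothetical.

```python
import re
from collections import Counter

# Matches the nginx error format quoted in the question. The socket path
# is captured from the line rather than hard-coded.
EAGAIN_CONNECT = re.compile(
    r"^(?P<minute>\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[error\] .*?"
    r"connect\(\) to unix:(?P<sock>\S+) failed \(11: "
)


def eagain_per_minute(lines):
    """Count EAGAIN connect failures per (minute, socket path)."""
    counts = Counter()
    for line in lines:
        match = EAGAIN_CONNECT.match(line)
        if match:
            counts[(match.group("minute"), match.group("sock"))] += 1
    return counts


def minutes_over(counts, threshold):
    """Return the (minute, socket) keys with at least `threshold` failures."""
    return sorted(key for key, n in counts.items() if n >= threshold)
```

A cron job or monitoring check could feed it the tail of the error log, for example `minutes_over(eagain_per_minute(open("/var/log/nginx/error.log")), 10)`, and alert when the result is non-empty. This only reports overflow after requests have already failed; it does not measure the queue depth directly.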
It seems like you have a problem with connect, not with send. Try to increase the kernel receiver backlog:
or
Have you checked system logs (e.g. dmesg)?
Try changing the backlog setting in spawn-fcgi's configuration file to 4096.
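One caveat worth checking when raising the backlog: on Linux the value passed to listen() is silently capped at net.core.somaxconn, which defaulted to 128 on kernels of this era, so a configured backlog of 4096 may not take effect unless that sysctl is raised as well. The helper below is a small sketch of that check. The function names are made up for illustration, and it assumes the usual /proc/sys layout.

```python
from pathlib import Path

# Standard location of the sysctl on Linux; assumed to be readable.
SOMAXCONN_PATH = Path("/proc/sys/net/core/somaxconn")


def read_somaxconn(path=SOMAXCONN_PATH):
    """Read the current net.core.somaxconn value from /proc."""
    return int(path.read_text().strip())


def effective_backlog(requested, somaxconn):
    """Backlog the kernel will actually use for listen(requested)."""
    return min(requested, somaxconn)


def backlog_warning(requested, somaxconn):
    """Return a warning string if the requested backlog would be capped."""
    actual = effective_backlog(requested, somaxconn)
    if actual < requested:
        return (
            f"requested backlog {requested} is capped to {actual}; "
            f"raise net.core.somaxconn to at least {requested}"
        )
    return None
```

For example, `backlog_warning(4096, read_somaxconn())` returns a message when the sysctl is lower than the backlog you configured, and `None` otherwise. The listening service has to be restarted after changing the sysctl, since the cap is applied when listen() is called.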