I have Apache acting as a load balancer in front of 3 Tomcat servers. Occasionally, Apache returns 503 responses, which I would like to eliminate completely. None of the 4 servers is under significant CPU, memory, or disk load, so I am a little unsure what is reaching its limits or why. 503s are returned when all workers are in error state - whatever that means. Here are the details:
Apache config:
<IfModule mpm_prefork_module>
StartServers 30
MinSpareServers 30
MaxSpareServers 60
MaxClients 200
MaxRequestsPerChild 1000
</IfModule>
...
<Proxy *>
AddDefaultCharset Off
Order deny,allow
Allow from all
</Proxy>
# Tomcat HA cluster
<Proxy balancer://mycluster>
BalancerMember ajp://10.176.201.9:8009 keepalive=On retry=1 timeout=1 ping=1
BalancerMember ajp://10.176.201.10:8009 keepalive=On retry=1 timeout=1 ping=1
BalancerMember ajp://10.176.219.168:8009 keepalive=On retry=1 timeout=1 ping=1
</Proxy>
# Passes thru track. or api.
ProxyPreserveHost On
ProxyStatus On
# Original tracker
ProxyPass /m balancer://mycluster/m
ProxyPassReverse /m balancer://mycluster/m
Tomcat config:
<Server port="8005" shutdown="SHUTDOWN">
<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
<Listener className="org.apache.catalina.core.JasperListener" />
<Listener className="org.apache.catalina.mbeans.ServerLifecycleListener" />
<Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener" />
<Service name="Catalina">
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" />
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
<Engine name="Catalina" defaultHost="localhost">
<Host name="localhost" appBase="webapps"
unpackWARs="true" autoDeploy="true"
xmlValidation="false" xmlNamespaceAware="false">
</Host>
</Engine>
</Service>
</Server>
Apache error log:
[Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.201.10:8009 (10.176.201.10) failed
[Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.201.10)
[Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.201.10
[Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.201.9:8009 (10.176.201.9) failed
[Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.201.9)
[Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.201.9
[Mon Mar 22 18:39:47 2010] [error] (70007)The timeout specified has expired: proxy: AJP: attempt to connect to 10.176.219.168:8009 (10.176.219.168) failed
[Mon Mar 22 18:39:47 2010] [error] ap_proxy_connect_backend disabling worker for (10.176.219.168)
[Mon Mar 22 18:39:47 2010] [error] proxy: AJP: failed to make connection to backend: 10.176.219.168
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
[Mon Mar 22 18:39:47 2010] [error] proxy: BALANCER: (balancer://mycluster). All workers are in error state
Load balancer top info:
top - 23:44:11 up 210 days, 4:32, 1 user, load average: 0.10, 0.11, 0.09
Tasks: 135 total, 2 running, 133 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.2%sy, 0.0%ni, 99.2%id, 0.1%wa, 0.0%hi, 0.1%si, 0.3%st
Mem: 524508k total, 517132k used, 7376k free, 9124k buffers
Swap: 1048568k total, 352k used, 1048216k free, 334720k cached
Tomcat top info:
top - 23:47:12 up 210 days, 3:07, 1 user, load average: 0.02, 0.04, 0.00
Tasks: 63 total, 1 running, 62 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 99.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2097372k total, 2080888k used, 16484k free, 21464k buffers
Swap: 4194296k total, 380k used, 4193916k free, 1520912k cached
Catalina.out does not have any error messages in it.
According to Apache's server-status page, it seems to max out at 143 requests/sec. I believe the servers can handle substantially more load than this, so any hints about low default limits or other reasons why this setup would be maxing out would be greatly appreciated.
The solution to this problem is pretty simple:
Add to the BalancerMember lines in your proxy config:
BalancerMember ajp://10.176.201.9:8009 keepalive=On ttl=60
Add to Tomcat's server.xml:
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" connectionTimeout="60000" />
After these changes, everything should work fine :-)
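A sketch of both fragments together, with the values from above (adjust to your environment):

```apache
# Apache side: close and re-open each pooled AJP connection
# after 60 seconds instead of keeping it forever
BalancerMember ajp://10.176.201.9:8009 keepalive=On ttl=60
BalancerMember ajp://10.176.201.10:8009 keepalive=On ttl=60
BalancerMember ajp://10.176.219.168:8009 keepalive=On ttl=60
```

```xml
<!-- Tomcat side: let idle AJP connections live for 60s so they are
     not dropped while Apache still considers them usable -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443"
           connectionTimeout="60000" />
```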
It looks like Apache is hitting a connection timeout when connecting to the servers in the pool, which leaves it unable to serve the request. Your timeout value looks VERY low; intermittent network latency, or even a page that takes a little extra time to generate, could cause a server to drop out of the pool. I would try higher timeout and retry values, and possibly a higher ping value.
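As a sketch, with illustrative (not tuned) numbers, the BalancerMember lines could be loosened to something like:

```apache
# 30s to get a response, 5s for the AJP CPING health check,
# and wait 60s before retrying a worker that has been marked dead
BalancerMember ajp://10.176.201.9:8009 keepalive=On retry=60 timeout=30 ping=5
BalancerMember ajp://10.176.201.10:8009 keepalive=On retry=60 timeout=30 ping=5
BalancerMember ajp://10.176.219.168:8009 keepalive=On retry=60 timeout=30 ping=5
```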
You might also consider switching to the worker or event MPM; the prefork MPM generally has the worst performance.
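For example, an mpm_worker block with the same total capacity as the prefork settings above might look like this (thread counts are illustrative):

```apache
<IfModule mpm_worker_module>
    StartServers          4
    MinSpareThreads      30
    MaxSpareThreads      75
    ThreadsPerChild      25
    MaxClients          200
    MaxRequestsPerChild   0
</IfModule>
```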
Dedicated proxy/balancer software, such as Squid, might also be a good option.
Given that your Apache error log shows it cannot connect to Tomcat, it would seem that it is the Tomcat application that cannot keep up.
When I was working as a sysadmin for a large-ish Tomcat web site, I noticed severe performance restrictions, and they weren't down to CPU but to synchronisation issues between threads or delays in querying a back-end web service.
The latter was a huge problem because the popular Java HTTP client library (Apache Commons HttpClient 3.x) limits the number of simultaneous connections to another web server to 2 by default (when I discovered this my jaw dropped). See http://hc.apache.org/httpclient-3.x/threading.html
Does your web app call any other web services?
Have your Tomcat instances deadlocked? I've witnessed two large corporate Tomcat projects (at different companies) suffer from deadlock - one was caused by an older version of a 3rd-party library.
Can you still connect directly to a Tomcat instance locally? That is, something like:
telnet localhost 8080
Then type:
GET / HTTP/1.0\n\n
(where \n refers to the <enter> key). If not, then it would appear that your Tomcat instance has died or deadlocked. If it has deadlocked then it is time to obtain a stack dump of your Tomcat Java instance using the jstack program (with the PID of the Tomcat Java process).
I did not see the timeout value in the Apache log you pasted. If it is 300, try changing it to 1200. We had the same problem, and changing the Timeout in Apache's httpd.conf from 300 to 1200 fixed it.
I faced exactly the same issue. Take a thread dump when the issue occurs; you will see which thread is blocked and is hence blocking the other threads too. Meanwhile all AJP connections get used up and eventually Apache dies. But this issue has nothing to do with the Apache settings; the issue is at the application (Tomcat) level.
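For example, with a JDK on the Tomcat box (the PID placeholder is illustrative):

```shell
# list running JVMs to find Tomcat's PID, then dump all thread stacks;
# threads stuck in BLOCKED state point at the offending lock
jps -l
jstack <tomcat-pid> > threaddump.txt
```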
Let's answer this question, 6 years later =D
The retry=1 timeout=1 ping=1 settings are the problem: the timeout and retry are way too short.
With timeout=1, a server is considered dead if it doesn't answer within 1 second, which is way too short for some requests to complete (especially if you're doing load testing at 500 req/s).
Note that once a server goes down, the 2 servers left receive +50% requests and their response time will increase significantly, to the point they'll probably instantly timeout as well. Typical cascading failure.
You get 503 "Service Unavailable" because all the servers are considered dead by Apache, because they don't answer fast enough under load, because your timeout is too short.
Remove both settings. Generally speaking, NEVER configure a timeout under 5 seconds anywhere.
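Applied to the config above, that means dropping back to mod_proxy's defaults (retry defaults to 60 seconds, timeout falls back to the server-wide Timeout, and no CPING probe is sent unless ping is set):

```apache
# Rely on mod_proxy's defaults instead of 1-second limits
BalancerMember ajp://10.176.201.9:8009 keepalive=On
BalancerMember ajp://10.176.201.10:8009 keepalive=On
BalancerMember ajp://10.176.219.168:8009 keepalive=On
```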