Setup:
Fedora 8
Apache 2.2.8
Tomcat 5.5.8
Apache is forwarding requests using AJP.
Problem:
After a certain period of time (there's no consistent interval; it can be anywhere from an hour or two to one or more days) Tomcat goes down. Either it stops responding, or it puts up the generic 'Service Temporarily Unavailable' page.
Diagnosis:
There are two servers with the same setup. One houses a higher traffic website (several requests per second), the other a low traffic one (a handful of requests every few minutes). Both websites are completely different codebases, but they exhibit similar issues.
On the first server, when the problem occurs, all threads are slowly taken up until the limit is reached (MaxThreads 200). At that point the server no longer responds (and eventually comes up with the service unavailable page after a long period of time).
On the second server, when the problem occurs, requests take a long time, and when they complete all you see is the service unavailable page.
Other than the MaxThreads message, the Tomcat logs do not indicate any specific issues that could be causing this.
In the Apache logs, however, we're seeing random messages referring to AJP. Here's a sample of the messages we see (in no particular order):
[error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[error] (104)Connection reset by peer: ajp_ilink_receive() can't receive header
[error] proxy: AJP: disabled connection for (localhost)
[error] ajp_read_header: ajp_ilink_receive failed
[error] (120006)APR does not understand this error code: proxy: read response failed from 127.0.0.1:8009 (localhost)
[error] ap_proxy_connect_backend disabling worker for (localhost)
The other odd thing we've noticed on the higher-traffic server is that right before the problem starts, database queries take much longer than before (2000-5000 ms versus the usual 5-50 ms). This lasts only 2-4 seconds before the MaxThreads message comes up. I'm assuming this is a result of the server suddenly dealing with too much data/traffic/threads.
Background Information:
These two servers had been running without a problem for quite some time. During that period, each system was set up with two NICs, separating internal and external traffic. After a network upgrade, we moved these servers to single NICs (this was recommended to us for security/simplicity reasons). After that change, the servers started having these problems.
Resolution:
The obvious solution would be to move back to two NICs. The problems with that are that it would complicate the network setup and that it feels like ignoring the underlying problem. We'd prefer to get it running on a single-NIC setup.
Googling the various error messages didn't turn up anything useful (only old solutions or results unrelated to our problem).
We've tried adjusting the various timeouts but that just made the server run slightly longer before dying.
We're not sure where to look to diagnose the problem further. We're still grasping at straws as to what the cause could be:
1) The setup with AJP and Tomcat is incorrect or outdated (i.e. are there known bugs?)
2) The network setup (two NICs versus one NIC) is causing confusion or throughput problems.
3) The websites themselves (there's no common code and no frameworks in use, just basic Java code with servlets and JSPs)
Update 1:
Following David Pashley's helpful advice, I took a thread dump while the issue was occurring. I found that all 200 threads were in one of the following states:
"TP-Processor200" daemon prio=1 tid=0x73a4dbf0 nid=0x70dd waiting for monitor entry [0x6d3ef000..0x6d3efeb0]
at oracle.jdbc.pool.OracleConnectionCacheImpl.getActiveSize(OracleConnectionCacheImpl.java:988)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]
"TP-Processor3" daemon prio=1 tid=0x08f142a8 nid=0x652a waiting for monitor entry [0x75c7d000..0x75c7ddb0]
at oracle.jdbc.pool.OracleConnectionCacheImpl.getConnection(OracleConnectionCacheImpl.java:268)
- waiting to lock <0x7e3455a0> (a oracle.jdbc.pool.OracleConnectionCacheImpl)
[further stack trace removed for brevity]
Curiously, only one thread out of all 200 was in this state:
"TP-Processor2" daemon prio=1 tid=0x08f135a8 nid=0x6529 runnable [0x75cfe000..0x75cfef30]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.DataPacket.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
[further stack trace removed for brevity]
It might be that the Oracle driver in this thread is forcing all the other threads to wait for it to complete. For some reason it must be stuck in this reading state (the server never recovers on its own; it requires a restart).
This suggests that the problem is related either to the network between the server and the database, or to the database itself. We're continuing our diagnosis efforts, but any tips would be helpful.
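For reference, a thread dump like the ones above can be captured while the hang is in progress. This is a minimal sketch, assuming the JDK's jstack tool is available and using a placeholder PID; with the standard Tomcat startup scripts the SIGQUIT output typically lands in catalina.out:
ps aux | grep java                        # find the Tomcat JVM's PID (12345 below is a placeholder)
jstack 12345 > /tmp/tomcat-threads.txt    # option 1: write the dump to a file with jstack
kill -3 12345                             # option 2: send SIGQUIT; the dump goes to catalina.out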
Update 2:
It turns out that this (quite old) version of the Oracle driver (classes12) had various bugs that caused a deadlock (as seen in the TP-Processor2 state quoted above). The deadlock didn't surface until we switched to the new environment. Upgrading to the latest version of the driver (ojdbc14) has resolved the issue on the primary server.
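For anyone repeating this, the upgrade itself was just a driver jar swap plus a restart. The paths and service name below are assumptions (Tomcat 5.5's shared common/lib layout), not an exact record of our commands:
cd /usr/share/tomcat5/common/lib          # assumed location of the shared JDBC driver
rm classes12.jar                          # remove the old driver
cp /tmp/ojdbc14.jar .                     # drop in the new driver downloaded from Oracle
service tomcat5 restart                   # restart Tomcat so the new driver is loaded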
From the description, I'd suggest the problem may be due to the database queries taking too long. If the queries take longer, each request takes longer, and therefore you'll have more of them running at once. As you're seeing, you're running out of Tomcat threads. When you solve the problem with the database you should be okay.
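As a rough illustration (the exact request rate is an assumption; the question only says 'several requests per second'): at 40 requests per second, responses that suddenly take 5 seconds each leave about 40 × 5 = 200 requests in flight at once, exactly the MaxThreads limit, whereas at the normal 5-50 ms response times the same load keeps only a thread or two busy.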
Add connectionTimeout and keepAliveTimeout to your AJP connector found in /etc/tomcat7/server.xml.
Info about the AJP connector at https://tomcat.apache.org/tomcat-7.0-doc/config/ajp.html
connectionTimeout = The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. The default value for AJP protocol connectors is -1 (i.e. infinite).
keepAliveTimeout = The number of milliseconds this Connector will wait for another AJP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute.
If connectionTimeout and keepAliveTimeout are not defined, AJP connections are kept alive indefinitely, causing too many threads to pile up (the default maxThreads is 200).
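As a minimal sketch of what that looks like in server.xml (the timeout values are illustrative, not recommendations; tune them to your own traffic):
<!-- AJP connector with explicit timeouts; values are examples only -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443"
           connectionTimeout="600000"
           keepAliveTimeout="300000"
           maxThreads="200" />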
I recommend installing psi-probe - an advanced manager and monitor for Apache Tomcat, forked from Lambda Probe. https://code.google.com/p/psi-probe/
Because of the way AJP works, the persistent connections between Apache (using either mod_proxy_ajp or mod_jk) and Tomcat can only be safely closed by the client. In this case, the client is the Apache worker that opens a connection to Tomcat and then holds it for the life of the worker process.
Because of this behavior, you cannot have more Apache workers than Tomcat worker threads. Doing so will cause additional HTTP workers to fail to connect to Tomcat (as the accept queue is full) and will mark your backend as DOWN!
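A sketch of keeping the two sides in line, assuming the prefork MPM; the numbers are illustrative and should match the maxThreads value on Tomcat's AJP connector (200 here):
# httpd.conf (prefork MPM): keep Apache's worker count at or below Tomcat's maxThreads
<IfModule prefork.c>
    MaxClients   200
    ServerLimit  200
</IfModule>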
I've had better results, in terms of stability, with plain HTTP proxying via mod_proxy (mod_proxy_http) instead of AJP (mod_proxy_ajp), so try that. It's non-invasive: at best it will solve the problem, and at worst it will rule AJP out.
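A minimal sketch of the switch, assuming Tomcat's HTTP connector is listening on its default port 8080 and the site is proxied at the root (both assumptions):
# Before (AJP):
#   ProxyPass        / ajp://localhost:8009/
# After (plain HTTP proxying):
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/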
Other than that, it sounds like your Tomcats stop responding and all request threads are tied up. Have your dev team look into what's going on; taking a thread dump and delivering it to them will be useful.
The first thing I think of when I hear that a server runs for a while, suddenly slows down, and then starts having service failures is that it's running out of RAM and thrashing swap. I'm not clear on whether the AJP failures you're seeing could be a consequence of timeouts, but it doesn't seem completely unreasonable; I don't see any obvious way it would connect to the NIC change, though. In any event, I recommend you get a picture of what's going on with your memory usage when these events occur.
If you are running out of RAM, you may need to turn down your Apache MaxClients and increase your ListenBacklog.
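A sketch of what checking and tuning this might look like; the directive values are illustrative only, and the monitoring commands are just standard tools, nothing specific to this setup:
free -m             # snapshot of RAM and swap usage
vmstat 5            # watch swap in/out (si/so) while the problem is happening
# httpd.conf: illustrative values if Apache's workers are exhausting RAM
MaxClients    150
ListenBacklog 1024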
By the way, thanks for making your question so well-organized and complete.
I had similar log errors in a Red Hat environment with proxy_ajp and Tomcat. It was resolved by updating the httpd package:
from:
to:
Then I restarted Apache, followed by Tomcat.
That fixed it for me!