I have an application which is running out of file descriptors, apparently by opening sockets, but I can't find out exactly what these sockets do. These appear in the lsof output as
java 9689 appuser 1010u sock 0,5 263746675 can't identify protocol
java 9689 appuser 1011u sock 0,5 263746676 can't identify protocol
java 9689 appuser 1012u sock 0,5 263746677 can't identify protocol
java 9689 appuser 1014u sock 0,5 263746678 can't identify protocol
java 9689 appuser 1015u sock 0,5 263746679 can't identify protocol
java 9689 appuser 1016u sock 0,5 263746681 can't identify protocol
and in /proc/$PID/fd as
lrwx------ 1 appuser appuser 64 Jun 23 11:49 990 -> socket:[263732085]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 991 -> socket:[263732086]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 992 -> socket:[263735307]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 993 -> socket:[263732088]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 995 -> socket:[263735308]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 996 -> socket:[263735309]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 997 -> socket:[263745434]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 998 -> socket:[263745435]
lrwx------ 1 appuser appuser 64 Jun 23 11:49 999 -> socket:[263745436]
but there is no similar output in netstat -a.
What are these sockets and how can I find out what they do?
Edit: I have tried running grep $SOCKET /proc/net, as recommended in the lsof FAQ, where $SOCKET is for instance 263746679, but that gave no results either.
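For reference, that check amounts to grepping the inode number reported by lsof (or by the socket:[...] links above) through the /proc/net tables; the exact file list below is my own expansion of the FAQ's suggestion, not what I originally typed:

```
# 263746679 is one of the inode numbers from the lsof output above
grep 263746679 /proc/net/tcp /proc/net/tcp6 /proc/net/udp /proc/net/udp6 /proc/net/unix
```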
As background, the application is a container for multiple tasks which, amongst other things, perform network calls. I need to single out the one that goes berserk, but until I find out who those sockets communicate with, I'm stuck.
This can occur if you create a socket, but never connect() or bind() with it. Your best bet may be to strace (-fF) the application, and then cross-reference with the output of lsof to determine which sockets are causing the issue. As a bonus method of debugging: if you wrap your socket calls with debugging information and write them out to /dev/null, it'll appear in strace without giving you hilariously-large log files.
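One way to do the /dev/null trick in a Java app is to funnel socket creation through a small wrapper; the class and method names below are made up for illustration, not taken from the original post:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: route socket creation through one place and write a
// marker line to /dev/null. The write() shows up in the strace output right
// next to the socket()/connect() syscalls, so you can match file descriptor
// numbers to the code that opened them.
public final class TracedSockets {
    private static final FileOutputStream TRACE = openTrace();

    private static FileOutputStream openTrace() {
        try {
            return new FileOutputStream("/dev/null");
        } catch (IOException e) {
            return null; // no /dev/null available? just skip the markers
        }
    }

    public static Socket open(String host, int port, String caller) throws IOException {
        mark("opening socket to " + host + ":" + port + " for " + caller);
        return new Socket(host, port);
    }

    private static void mark(String msg) {
        if (TRACE == null) {
            return;
        }
        try {
            TRACE.write((msg + "\n").getBytes(StandardCharsets.UTF_8));
        } catch (IOException ignored) {
            // markers are best-effort only
        }
    }
}
```

Run the application under strace -fF, grep the trace for the marker text, and the file descriptor returned by the nearby socket() call is the one to chase in lsof.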
Using Python, I have encountered the same problem with SSL sockets.
The solution was to unwrap the SSL layer before closing:
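A sketch of the pattern (the connection setup and names here are illustrative, not the original code):

```python
import socket
import ssl

# Illustrative setup; the point is the teardown at the bottom.
raw = socket.create_connection(("example.com", 443))
context = ssl.create_default_context()
tls = context.wrap_socket(raw, server_hostname="example.com")

# ... use the TLS connection ...

plain = tls.unwrap()   # shut down the SSL layer and get the plain socket back
plain.close()          # now the file descriptor is really released
```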
This closes the sockets properly in my app.
The first thing I would do is increase your file descriptor limit:
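For example (the numbers here are placeholders; pick what your workload actually needs):

```
# current soft limit for open files in this shell
ulimit -n

# raise it for the current session (the hard limit permitting)
ulimit -n 65535
```

To make it permanent for appuser, the usual place is the nofile entries in /etc/security/limits.conf. Keep in mind that raising the limit only buys time if the descriptors are actually leaking.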
Next, I would make sure your system is up to date, including all libraries and servers. It's possible that your Java application server is out of date (if you are using one). It's also possible that your application server is misconfigured; look at your configuration file and lower your connectionTimeout and/or your maxKeepAliveRequests (I'm not sure which application server you're using, or whether you're using one at all).

I'm not sure what this application does, but if you don't think it needs tens of thousands of sockets, then this is almost certainly a "file descriptor leak" in your Java application. You may have to send a bug report to the vendor, and the report should include information on how to recreate the issue.
Here are some ways to debug the issue.
Wireshark (or tshark for the CLI) is the best tool to see how these sockets are being used. Wireshark will give you a breakdown of the type of traffic being sent over the wire. It's likely that the first few connections will succeed and then the file descriptor limit will be hit; once that happens, Wireshark isn't going to pick up anything (and neither is netstat, for that matter), but it will help narrow down the issue. There may be a case where a lot of outgoing SYNs are sent but no SYN/ACKs come back, leaving a lot of TCP connections stuck in the SYN_SENT state.
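If you want to check for that from the command line as well, a capture filter along these lines is one way (interface name is illustrative):

```
# show outgoing connection attempts: packets with SYN set and ACK clear
tshark -n -i eth0 -f 'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn'
```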
If you have access to the source code and you know what kind of sockets are being created (for example from strace, or just by searching the code), you can open the project in Eclipse (or another IDE) and set a breakpoint at the function that creates these sockets. When the breakpoint is hit, look at the stack trace. The file descriptor leak may be a simple infinite loop, or perhaps the socket timeout is too large. Another possibility is that the Java app isn't calling socket.close() to clean up its connections. The close is commonly done in the finally block of a try/catch (yes, socket I/O in Java must handle or declare its checked exceptions or it won't compile :). At the end of the day, it's likely that the Java app isn't handling its IOExceptions properly.
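For reference, the close-in-finally pattern that paragraph describes looks roughly like this (host, port, and class name are illustrative):

```java
import java.io.IOException;
import java.net.Socket;

public class CloseInFinally {

    // Sketch of the pattern described above: close() sits in finally,
    // so the file descriptor is released even when the call fails.
    static void call(String host, int port) {
        Socket socket = null;
        try {
            socket = new Socket(host, port);
            // ... write the request, read the response ...
        } catch (IOException e) {
            // handle or at least log the failure instead of swallowing it
        } finally {
            if (socket != null) {
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // nothing useful left to do if close itself fails
                }
            }
        }
    }
}
```

On Java 7 and later, try-with-resources (try (Socket socket = new Socket(host, port)) { ... }) gives you the same cleanup with less boilerplate.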