OS: Windows Server 2008 SP2 (running on Amazon EC2).
The web app runs on Apache httpd and Tomcat 6.02, and the web server has keep-alive enabled.
There are around 69,250 TCP connections on HTTP port 80, plus roughly 15,000 on other ports, stuck in the TIME_WAIT state (observed with netstat and TCPView). These connections don't close even after stopping the web server (we waited 24 hours).
Performance monitor counters:
- TCPv4 Connections Active: 145K
- TCPv4 Connections Passive: 475K
- TCPv4 Connection Failures: 16K
- TCPv4 Connections Reset: 23K
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
does not have a TcpTimedWaitDelay value, so the default should apply (2*MSL, i.e. 4 minutes).
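For reference, here is a minimal sketch (assuming Python 3 is available on the box) of how the effective TcpTimedWaitDelay can be checked; the 240-second fallback reflects the documented 2*MSL default that applies when the value is absent.

```python
# Sketch: read TcpTimedWaitDelay from the registry, falling back to the
# documented default (2*MSL = 240 seconds) when the value is not present.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def get_tcp_timed_wait_delay(default_seconds=240):
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            return value
        except FileNotFoundError:
            # Value absent: Windows applies the built-in default.
            return default_seconds

if __name__ == "__main__":
    print("Effective TcpTimedWaitDelay: %d seconds" % get_tcp_timed_wait_delay())
```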
Even if thousands of connection requests arrive at the same time, why isn't Windows eventually able to clean these connections up?
What could be the reasons behind this situation?
Is there any way to forcefully close all these TIME_WAIT connections without restarting Windows?
After a few days the app stops accepting any new connections.
We've been dealing with this issue too. It looks like Amazon found the root cause and corrected it. Here is the info they gave me.
Ryan's answer is good general advice, except that it doesn't apply to the condition Ravi is experiencing on EC2. We have seen this problem too, and for whatever reason Windows completely ignores TcpTimedWaitDelay and never releases the socket from its TIME_WAIT state.
Waiting doesn't help... restarting the app doesn't help... the only remedy we've found is to restart the OS. Really ugly.
I found this thread completely by chance while debugging a separate issue, but this is a rarely raised yet well-known issue with Windows on EC2. We used to have premium support and discussed this with them privately via that channel, but this is a related issue that we did discuss in the public forums.
As others have mentioned, you do need to tune Windows Server out of the box. However, in the same way that StopWatch isn't working in the above thread, the TCP/IP stack also uses the QueryPerformanceCounter call to determine when the TIME_WAIT period should end. The problem is that on EC2 they have encountered, and know about, an issue in which QueryPerformanceCounter goes haywire and can return times far, far into the future; it's not that your TIME_WAIT state is being ignored, it's that the expiration time of TIME_WAIT is potentially years in the future. When running an httpd, you can see how quickly you accumulate these zombie sockets once the state is encountered (we generally see that this is a discrete event, not a slow accumulation of zombies).
What we do is run a service in the background that queries the number of sockets in the TIME_WAIT state and, once this hovers above a certain threshold, takes action (we reboot the server); a rough sketch of that monitor is shown below. Somehow, in the past 45 seconds, someone pointed out that you can stop/start the server to fix the issue; I suggest you couple these two approaches.
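For illustration only, here is a minimal sketch of that kind of background monitor, assuming Python 3 on the server and parsing netstat output. The threshold, polling interval, and the action taken once the threshold is crossed are placeholders you would tune yourself.

```python
# Sketch: periodically count sockets in TIME_WAIT and flag when the count
# crosses a threshold. Threshold, interval, and action are placeholders.
import subprocess
import time

THRESHOLD = 50000      # placeholder: tune to your own baseline
INTERVAL_SECONDS = 60  # placeholder polling interval

def count_time_wait():
    # netstat -ano -p tcp lists all TCP connections with their state.
    output = subprocess.run(
        ["netstat", "-ano", "-p", "tcp"],
        capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in output.splitlines() if "TIME_WAIT" in line)

def main():
    while True:
        zombies = count_time_wait()
        if zombies > THRESHOLD:
            # Placeholder action: alert an operator / schedule an instance reboot.
            print(f"WARNING: {zombies} sockets in TIME_WAIT (threshold {THRESHOLD})")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```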
The default settings for the TCP stack in Windows are, to say the least, not optimal for systems that are going to host an HTTP server.
To get the best out of your Windows machine when it is used as an HTTP server, there are a few parameters you would normally tweak, such as MaxUserPort, TcpTimedWaitDelay, TcpAckFrequency, EnableDynamicBacklog, KeepAliveInterval, etc.
I wrote a note-to-self on this a few years ago, just in case I ever needed some quick defaults to start from. Make sure you understand the parameters before you tweak them; a rough illustration of setting a couple of them follows below.
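I won't reproduce the note here, but purely as an illustration, here is a sketch of setting two of the values that live under Tcpip\Parameters from Python's winreg. The numbers are placeholders, not my recommended defaults; some of the other parameters listed above (e.g. TcpAckFrequency, EnableDynamicBacklog) live under different registry keys, so look each one up before touching it. Run elevated, and expect to reboot for the changes to take effect.

```python
# Sketch only: write MaxUserPort and TcpTimedWaitDelay under Tcpip\Parameters.
# Values are illustrative placeholders; run as Administrator and reboot afterwards.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

SETTINGS = {
    "MaxUserPort": 65534,     # widen the user port range (placeholder value)
    "TcpTimedWaitDelay": 30,  # shorten TIME_WAIT from the 240 s default (placeholder)
}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS, 0,
                    winreg.KEY_SET_VALUE) as key:
    for name, value in SETTINGS.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
        print(f"Set {name} = {value}")
```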
Unrelated to AWS, we just ran into this problem, seemingly as a result of the issue described in this KB article:
http://support.microsoft.com/kb/2553549/en-us
Basically, it kicks in if a system has been up for more than 497 days and the hotfix hasn't been applied. A reboot has, of course, cleared it down; we might not know for the next 16 months whether the hotfix worked, but this may help anyone out there with long-uptime servers. A quick uptime check is sketched below.
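If it helps anyone work out whether they're in that window, here is a quick sketch (assuming Python 3 and ctypes) that compares system uptime from GetTickCount64 against the ~497-day mark mentioned in the KB.

```python
# Sketch: compare system uptime against the ~497-day mark from KB2553549.
import ctypes

kernel32 = ctypes.WinDLL("kernel32")
kernel32.GetTickCount64.restype = ctypes.c_ulonglong  # milliseconds since boot

uptime_days = kernel32.GetTickCount64() / (1000 * 60 * 60 * 24)
print(f"System uptime: {uptime_days:.1f} days")
if uptime_days > 497:
    print("Past the ~497-day mark: TIME_WAIT sockets may never close without the hotfix.")
```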
I was experiencing almost exactly the same thing on a number of boxes running Windows Server 2008 R2 x64 with SP1, mostly with CLOSE_WAIT (which is somewhat different from TIME_WAIT). I bumped into this answer, which referenced a Microsoft KB article and a hotfix for servers running behind a load balancer (as mine are). After installing the hotfix and rebooting, all of the CLOSE_WAIT issues were resolved.