OS: Windows Server 2008 SP2 (running on Amazon EC2).
The web app runs on Apache httpd and Tomcat 6.02, and the web server has keep-alive enabled.
There are around 69,250 TCP connections on HTTP port 80, plus roughly 15,000 on other ports, stuck in the TIME_WAIT state (observed with netstat and TCPView). These connections don't close even after stopping the web server (we waited 24 hours).
Performance monitor counters:
- TCPv4 Connections Active: 145K
- TCPv4 Connections Passive: 475K
- TCPv4 Connection Failures: 16K
- TCPv4 Connections Reset: 23K
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
does not have a TcpTimedWaitDelay value, so the default should apply (2*MSL, i.e. 4 minutes).
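For reference, here is a minimal sketch (assuming Python 3 is available on the box) of how the effective TcpTimedWaitDelay can be checked; the 240-second fallback reflects the documented 2*MSL default that applies when the value is absent.

```python
# Sketch: read TcpTimedWaitDelay from the registry, falling back to the
# documented default (2*MSL = 240 seconds) when the value is not present.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def get_tcp_timed_wait_delay(default_seconds=240):
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "TcpTimedWaitDelay")
            return value
        except FileNotFoundError:
            # Value absent: Windows applies the built-in default.
            return default_seconds

if __name__ == "__main__":
    print("Effective TcpTimedWaitDelay: %d seconds" % get_tcp_timed_wait_delay())
```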
Even if thousands of connection requests arrive at the same time, why isn't Windows eventually able to clean these connections up?
What could be the reasons behind this situation?
Is there any way to forcefully close all these TIME_WAIT connections without restarting Windows?
After a few days the app stops accepting any new connections.
We've been dealing with this issue too. It looks like Amazon found the root cause and corrected it. Here is the info they gave me.
Ryan's answer is good general advice, except that it doesn't apply to the condition Ravi is experiencing on EC2. We have seen this problem too, and for whatever reason Windows completely ignores TcpTimedWaitDelay and never releases the socket from its TIME_WAIT state.
Waiting doesn't help... restarting the app doesn't help... the only remedy we've found is to restart the OS. Really ugly.
I found this thread completely by chance while debugging a separate issue, but this is a rarely raised yet well-known issue with Windows on EC2. We used to have premium support and discussed this with them privately via that channel, but this is a related issue that we did discuss in the public forums.
As others have mentioned, you do need to tune Windows Server out of the box. However, in the same way that StopWatch isn't working in the above thread, the TCP/IP stack also uses the QueryPerformanceCounter call to determine when the TIME_WAIT period should end. The problem is that on EC2 they have encountered, and know about, an issue in which QueryPerformanceCounter goes haywire and can return times far, far into the future; it's not that your TIME_WAIT state is being ignored, it's that the expiration time of TIME_WAIT is potentially years in the future. When running an httpd, you can see how quickly you accumulate these zombie sockets once the state is encountered (we generally see that this is a discrete event, not a slow accumulation of zombies).
What we do is run a service in the background that queries the number of sockets in the TIME_WAIT state and, once this hovers above a certain threshold, takes action (we reboot the server); a rough sketch of that monitor is shown below. Somehow, in the past 45 seconds, someone pointed out that you can stop/start the server to fix the issue; I suggest you couple these two approaches.
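For illustration only, here is a minimal sketch of that kind of background monitor, assuming Python 3 on the server and parsing netstat output. The threshold, polling interval, and the action taken once the threshold is crossed are placeholders you would tune yourself.

```python
# Sketch: periodically count sockets in TIME_WAIT and flag when the count
# crosses a threshold. Threshold, interval, and action are placeholders.
import subprocess
import time

THRESHOLD = 50000      # placeholder: tune to your own baseline
INTERVAL_SECONDS = 60  # placeholder polling interval

def count_time_wait():
    # netstat -ano -p tcp lists all TCP connections with their state.
    output = subprocess.run(
        ["netstat", "-ano", "-p", "tcp"],
        capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in output.splitlines() if "TIME_WAIT" in line)

def main():
    while True:
        zombies = count_time_wait()
        if zombies > THRESHOLD:
            # Placeholder action: alert an operator / schedule an instance reboot.
            print(f"WARNING: {zombies} sockets in TIME_WAIT (threshold {THRESHOLD})")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```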
The default settings for the TCP stack in Windows are, to say the least, not optimal for systems that are going to host an HTTP server.
To get the best out of your Windows machine when it is used as an HTTP server, there are a few parameters you would normally tweak, such as MaxUserPort, TcpTimedWaitDelay, TcpAckFrequency, EnableDynamicBacklog, KeepAliveInterval, etc.
I wrote a note-to-self on this a few years ago, just in case I ever needed some quick defaults to start from. Make sure you understand the parameters before you tweak them; a rough illustration of setting a couple of them follows below.
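I won't reproduce the note here, but purely as an illustration, here is a sketch of setting two of the values that live under Tcpip\Parameters from Python's winreg. The numbers are placeholders, not my recommended defaults; some of the other parameters listed above (e.g. TcpAckFrequency, EnableDynamicBacklog) live under different registry keys, so look each one up before touching it. Run elevated, and expect to reboot for the changes to take effect.

```python
# Sketch only: write MaxUserPort and TcpTimedWaitDelay under Tcpip\Parameters.
# Values are illustrative placeholders; run as Administrator and reboot afterwards.
import winreg

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

SETTINGS = {
    "MaxUserPort": 65534,     # widen the user port range (placeholder value)
    "TcpTimedWaitDelay": 30,  # shorten TIME_WAIT from the 240 s default (placeholder)
}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS, 0,
                    winreg.KEY_SET_VALUE) as key:
    for name, value in SETTINGS.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
        print(f"Set {name} = {value}")
```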
Unrelated to AWS, we just ran into this problem, seemingly as a result of the issue described in this KB article:
http://support.microsoft.com/kb/2553549/en-us
Basically, it kicks in if a system has been up for more than 497 days and the hotfix hasn't been applied. A reboot has, of course, cleared it down; we might not know for the next 16 months whether the hotfix worked, but this may help anyone out there with long-uptime servers. A quick uptime check is sketched below.
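If it helps anyone work out whether they're in that window, here is a quick sketch (assuming Python 3 and ctypes) that compares system uptime from GetTickCount64 against the ~497-day mark mentioned in the KB.

```python
# Sketch: compare system uptime against the ~497-day mark from KB2553549.
import ctypes

kernel32 = ctypes.WinDLL("kernel32")
kernel32.GetTickCount64.restype = ctypes.c_ulonglong  # milliseconds since boot

uptime_days = kernel32.GetTickCount64() / (1000 * 60 * 60 * 24)
print(f"System uptime: {uptime_days:.1f} days")
if uptime_days > 497:
    print("Past the ~497-day mark: TIME_WAIT sockets may never close without the hotfix.")
```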
I was experiencing almost exactly the same thing on a number of boxes running Windows Server 2008 R2 x64 with SP1, mostly with CLOSE_WAIT (which is somewhat different from TIME_WAIT). I bumped into this answer, which referenced a Microsoft KB article and a hotfix for servers running behind a load balancer (as mine are). After installing the hotfix and rebooting, all of the CLOSE_WAIT issues were resolved.