AT&T U-verse fiber 24Mbit down / 3Mbit up
2Wire Router Model 3800HGV-B
Software Version 6.1.9.24-enh.tm
Our speed is as advertised. The AT&T internet connection is fast. The problem isn't the speed.
The problem is that our IRC and SSH sessions with remote hosts on the public internet don't last more than a few seconds, or a few minutes at most. The TCP session timeout on the 2Wire is configured to 86400. SSH sessions with servers on our LAN behave as expected, so our LAN doesn't appear to be the issue. The issue appears to be the 2Wire router.
I can't get a shell on the 2Wire router, so I can't run tcpdump there. Tcpdump on the LAN shows that each dropped session is caused by a TCP Reset initiated by the remote server. It's my understanding, from googling, that the TCP Reset is being sent because the remote host has decided something has gone wrong with the TCP session, which again leads me to question what's happening on the 2Wire router.
IRC and SSH sessions to these same remote servers from other internet connections of many types (mobile tethering, Time Warner cable, our T1 at another office, etc.) behave as expected with no issues.
All of this was working fine until we switched to AT&T and started using the 2Wire. The entire time we've had AT&T, 2 weeks now, we've had this issue.
At peak times in our office we have about 50 devices (laptops, desktops, mobile devices) using this internet connection. On our LAN I've tried several managed switches that are known to work with other providers, among other things. I've also tried having everyone connect only to the 2Wire wireless SSID. None of these attempts to isolate the issue changed the behavior, which seems to point to the 2Wire router.
In general, when there are very few people in the office, our IRC and SSH sessions stay up longer, more than a few minutes. Sometimes the sessions will still drop in 5 seconds, but sometimes I can keep one open for 10 or more minutes if I'm the only one in the office.
If the issue is the 2Wire router I'm not sure what it is or how to solve it. I'm also not sure how to even troubleshoot it and figure out what it is.
tcpdump output captured on our LAN of an SSH session dropping, a TCP Reset having been sent from the remote server:
10:51:33.357748 IP (tos 0x10, ttl 63, id 11177, offset 0, flags [DF], proto TCP (6), length 52)
2wire.ip.53096 > remote.server.ip.22: Flags [.], cksum 0xd8bb (correct), seq 3878, ack 3193, win 65535, options [nop,nop,TS val 904726345 ecr 194200103], length 0
10:51:33.357757 IP (tos 0x10, ttl 63, id 54768, offset 0, flags [DF], proto TCP (6), length 52)
2wire.ip.53096 > remote.server.ip.22: Flags [.], cksum 0xd86b (correct), seq 3878, ack 3273, win 65535, options [nop,nop,TS val 904726345 ecr 194200103], length 0
10:51:33.456382 IP (tos 0x10, ttl 63, id 37832, offset 0, flags [DF], proto TCP (6), length 100)
2wire.ip.53096 > remote.server.ip.22: Flags [P.], seq 3878:3926, ack 3273, win 65535, options [nop,nop,TS val 904726346 ecr 194200103], length 48
10:51:33.493452 IP (tos 0x0, ttl 48, id 35965, offset 0, flags [DF], proto TCP (6), length 100)
remote.server.ip.22 > 2wire.ip.53096: Flags [P.], seq 3273:3321, ack 3926, win 157, options [nop,nop,TS val 194200137 ecr 904726346], length 48
10:51:33.493757 IP (tos 0x0, ttl 48, id 35966, offset 0, flags [DF], proto TCP (6), length 132)
remote.server.ip.22 > 2wire.ip.53096: Flags [P.], seq 3321:3401, ack 3926, win 157, options [nop,nop,TS val 194200137 ecr 904726346], length 80
10:51:33.494297 IP (tos 0x10, ttl 63, id 12429, offset 0, flags [DF], proto TCP (6), length 52)
2wire.ip.53096 > remote.server.ip.22: Flags [.], cksum 0xd7e7 (correct), seq 3926, ack 3321, win 65535, options [nop,nop,TS val 904726347 ecr 194200137], length 0
10:51:33.494485 IP (tos 0x10, ttl 63, id 28130, offset 0, flags [DF], proto TCP (6), length 52)
2wire.ip.53096 > remote.server.ip.22: Flags [.], cksum 0xd797 (correct), seq 3926, ack 3401, win 65535, options [nop,nop,TS val 904726347 ecr 194200137], length 0
10:53:04.123228 IP (tos 0x0, ttl 255, id 48599, offset 0, flags [DF], proto TCP (6), length 40)
remote.server.ip.22 > 2wire.ip.53096: Flags [R.], cksum 0x9bbf (correct), seq 3401, ack 3926, win 0, length 0
Has anyone else had this issue or solved it? Or does anyone have advice on troubleshooting, identifying, and solving it?
Update:
First of all, thanks a lot for reading this long question and for your replies. +1
I too was suspicious of the NAT translation table, but apparently not suspicious enough. I had guessed the 2Wire, or any device, could handle 2^16 sessions. I guessed wrong.
I hadn't seen the session table on the 2Wire before, but at your suggestion I went looking for it and it was easy enough to find:
session table 15/1024 available, 0/512 used in inbound sessions:
The session table details above are from a time in the afternoon when perhaps a quarter of our office wasn't at their desks using their computers, and we were already nearing the limit of 1024 concurrent sessions.
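As a rough back-of-the-envelope check on those numbers, here is a minimal Python sketch. It assumes "15/1024 available" means 15 free slots out of 1024 (my reading of the status page, not something the 2Wire documents), and it uses the roughly 50 devices we see at peak. The function name and constants are just for illustration.

```python
# Figures taken from the post. Reading "15/1024 available" as
# "15 free slots out of 1024" is an assumption.
TABLE_SIZE = 1024    # total session table slots reported by the 2Wire
FREE_SLOTS = 15      # slots reported as available at the time
PEAK_DEVICES = 50    # approximate number of devices in the office at peak


def session_budget(table_size, free_slots, devices):
    """Return (sessions in use, table utilization, average slots per device)."""
    in_use = table_size - free_slots
    utilization = in_use / table_size
    per_device = table_size / devices
    return in_use, utilization, per_device


in_use, utilization, per_device = session_budget(TABLE_SIZE, FREE_SLOTS, PEAK_DEVICES)
print(f"sessions in use: {in_use}")
print(f"table utilization: {utilization:.1%}")
print(f"average budget per device at peak: {per_device:.1f} sessions")
```

Under those assumptions the table is about 98.5% full, and at peak each device would get only about 20 sessions on average, which a single browser with a few tabs open can plausibly exceed.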
Also googling for "uverse session table" gave me some useful search results.
Since it's a residential piece of gear, my initial gut reaction was that it's not able to support all of the concurrent TCP connections and NAT translations being thrown at it (and that it's forging reset packets for those that go over the limit).
I'm having a hard time finding specs on that device to confirm my suspicion, but in looking for them, there seems to be a lot of anecdotal evidence out there supporting that theory.
Got any way to check how many connections it's running?
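One way to test the forged-reset idea against a capture like the one in the question is to compare the IP TTL of the RST with the TTL of earlier packets from the same host. In the posted capture the data packets from the remote server arrive with ttl 48, while the RST arrives with ttl 255. A mismatch like that is consistent with a device close to you generating the reset, though it isn't proof on its own. Below is a minimal Python sketch that does this comparison; it assumes `tcpdump -v` style output where each packet is a header line followed by a detail line, as in the question, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Header line, e.g. "10:51:33.493757 IP (tos 0x0, ttl 48, id 35966, ...)"
HEADER_RE = re.compile(r"^\S+ IP \(.*?ttl (\d+),")
# Detail line, e.g. "remote.server.ip.22 > 2wire.ip.53096: Flags [R.], ..."
PACKET_RE = re.compile(r"^\s*(\S+) > (\S+): Flags \[([^\]]*)\]")


def source_host(endpoint):
    """Strip the trailing '.port' from a tcpdump endpoint such as 'host.22'."""
    return endpoint.rsplit(".", 1)[0]


def find_suspect_resets(lines):
    """Return (host, rst_ttl, earlier_ttls) for RSTs whose TTL was not seen
    on earlier non-RST packets from the same source host."""
    ttls_seen = defaultdict(set)
    suspects = []
    ttl = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            ttl = int(header.group(1))
            continue
        packet = PACKET_RE.match(line)
        if not packet or ttl is None:
            continue
        src = source_host(packet.group(1))
        flags = packet.group(3)
        if "R" in flags:
            if ttls_seen[src] and ttl not in ttls_seen[src]:
                suspects.append((src, ttl, sorted(ttls_seen[src])))
        else:
            ttls_seen[src].add(ttl)
        ttl = None
    return suspects


# Two packets copied from the capture in the question.
SAMPLE = [
    "10:51:33.493757 IP (tos 0x0, ttl 48, id 35966, offset 0, flags [DF], proto TCP (6), length 132)",
    "remote.server.ip.22 > 2wire.ip.53096: Flags [P.], seq 3321:3401, ack 3926, win 157, options [nop,nop,TS val 194200137 ecr 904726346], length 80",
    "10:53:04.123228 IP (tos 0x0, ttl 255, id 48599, offset 0, flags [DF], proto TCP (6), length 40)",
    "remote.server.ip.22 > 2wire.ip.53096: Flags [R.], cksum 0x9bbf (correct), seq 3401, ack 3926, win 0, length 0",
]

for host, rst_ttl, earlier in find_suspect_resets(SAMPLE):
    print(f"RST from {host} has ttl {rst_ttl}, earlier packets had ttl {earlier}")
```

On the two sample packets this flags the RST, since its ttl of 255 differs from the ttl of 48 on the earlier data packet from the same address.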
Honestly, you've covered your bases with troubleshooting. I would call AT&T and have them run diagnostics on the connection, focusing on layer 1 and layer 2 issues. Do you have access to the gateway? Does it provide any kind of diagnostics for troubleshooting problems?
I know it's a different technology, but when I was supporting DSL, a client who was too far from the DSLAM and had a wiring issue causing attenuation would sometimes see something similar. I'd start at the gateway (plug directly into it, no wireless!) and work your way out. If this is a business-class line, AT&T should be able to troubleshoot everything from their front-line team all the way back to the NOC and see if there is an issue.