We have a LAN with ~40 workstations (mostly Windows) and a couple of servers. All of them use an internal DNS server (192.168.0.4, running BIND 9.5.0-P2) and a gateway (192.168.0.1, running OpenBSD Packet Filter), which is a local PC acting as a router.
For the last couple of months, at some points during the workday the network slows down to the extent that doing anything Internet-related is not possible. During those bad times, pinging 8.8.8.8 gives:
12:16:12.078: Timeout waiting for seq=11a1
12:16:13.484: From 8.8.8.8: bytes=60 SEQ=11a9 TTL=48 ID=0000 time=399.334ms
12:16:15.078: Timeout waiting for seq=11a4
12:16:15.437: From 8.8.8.8: bytes=60 SEQ=11ab TTL=48 ID=0000 time=355.409ms
12:16:18.078: Timeout waiting for seq=11a8
12:16:19.453: From 8.8.8.8: bytes=60 SEQ=11af TTL=48 ID=0000 time=376.317ms
12:16:21.078: Timeout waiting for seq=11aa
12:16:21.078: Timeout waiting for seq=11ac
12:16:21.390: From 8.8.8.8: bytes=60 SEQ=11b1 TTL=48 ID=0000 time=306.727ms
12:16:22.437: From 8.8.8.8: bytes=60 seq=11b2 TTL=48 ID=0000 time=364.351ms
12:16:23.453: From 8.8.8.8: bytes=60 seq=11b3 TTL=48 ID=0000 time=371.944ms
12:16:24.078: Timeout waiting for seq=11ad
12:16:24.078: Timeout waiting for seq=11ae
12:16:26.390: From 8.8.8.8: bytes=60 SEQ=11b6 TTL=48 ID=0000 time=307.729ms
12:16:27.078: Timeout waiting for seq=11b0
12:16:29.437: From 8.8.8.8: bytes=60 SEQ=11b9 TTL=48 ID=0000 time=361.575ms
12:16:30.078: Timeout waiting for seq=11b4
12:16:30.453: From 8.8.8.8: bytes=60 seq=11ba TTL=48 ID=0000 time=367.647ms
12:16:33.078: Timeout waiting for seq=11b5
12:16:33.078: Timeout waiting for seq=11b7
At that exact instant, if I turn the DNS server (at .0.4) off, then after a couple of seconds the network's health becomes very good again:
12:47:43.046: From 8.8.8.8: bytes=60 seq=190b TTL=48 ID=0000 time=70.555ms
12:47:44.046: From 8.8.8.8: bytes=60 seq=190c TTL=48 ID=0000 time=82.684ms
12:47:45.046: From 8.8.8.8: bytes=60 seq=190d TTL=48 ID=0000 time=72.368ms
12:47:46.062: From 8.8.8.8: bytes=60 seq=190e TTL=48 ID=0000 time=84.310ms
12:47:47.046: From 8.8.8.8: bytes=60 seq=190f TTL=48 ID=0000 time=75.137ms
12:47:48.046: From 8.8.8.8: bytes=60 seq=1910 TTL=48 ID=0000 time=75.791ms
12:47:49.062: From 8.8.8.8: bytes=60 seq=1911 TTL=48 ID=0000 time=94.252ms
12:47:50.046: From 8.8.8.8: bytes=60 seq=1912 TTL=48 ID=0000 time=76.547ms
12:47:51.046: From 8.8.8.8: bytes=60 seq=1913 TTL=48 ID=0000 time=70.251ms
12:47:52.046: From 8.8.8.8: bytes=60 seq=1914 TTL=48 ID=0000 time=83.033ms
12:47:53.046: From 8.8.8.8: bytes=60 seq=1915 TTL=48 ID=0000 time=76.589ms
12:47:54.046: From 8.8.8.8: bytes=60 seq=1916 TTL=48 ID=0000 time=82.060ms
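To put rough numbers on the difference between the two ping logs above, a small script can count replies and timeouts and average the round-trip times. This is only a sketch: the function name `summarize_ping_log` is made up for illustration, and it assumes the log lines look exactly like the ones pasted here (`time=...ms` for replies, `Timeout waiting for seq=` for losses).

```python
import re
from statistics import mean

# Patterns matched against the ping log format shown in this post (assumed).
REPLY = re.compile(r"time=([\d.]+)ms")
TIMEOUT = re.compile(r"Timeout waiting for seq=", re.IGNORECASE)


def summarize_ping_log(lines):
    """Return (replies, timeouts, loss_percent, mean_rtt_ms) for a ping log."""
    rtts = []
    timeouts = 0
    for line in lines:
        match = REPLY.search(line)
        if match:
            rtts.append(float(match.group(1)))
        elif TIMEOUT.search(line):
            timeouts += 1
    total = len(rtts) + timeouts
    loss_percent = 100.0 * timeouts / total if total else 0.0
    mean_rtt = mean(rtts) if rtts else None
    return len(rtts), timeouts, loss_percent, mean_rtt


# Four lines taken from the "bad" log above.
sample = [
    "12:16:12.078: Timeout waiting for seq=11a1",
    "12:16:13.484: From 8.8.8.8: bytes=60 SEQ=11a9 TTL=48 ID=0000 time=399.334ms",
    "12:16:15.078: Timeout waiting for seq=11a4",
    "12:16:15.437: From 8.8.8.8: bytes=60 SEQ=11ab TTL=48 ID=0000 time=355.409ms",
]
print(summarize_ping_log(sample))
```

Running it over a full capture from a bad period and from a good period gives a loss percentage and mean latency that can be logged over time instead of eyeballed.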
This is very consistent and reproducible. The fact that I ping 8.8.8.8 (Google's public DNS) is completely arbitrary and just a way I have to test Internet connectivity. I could just as well be pinging 206.190.36.45 (an IP of Yahoo's public website).
The DNS is not open to the outside world.
So I think that maybe one (or more) of the workstations is making very bad use of the DNS server (probably indirectly, via a virus) and flooding it with requests or something. The problem is that I cannot trace that back. On the 0.4 machine, top shows no suspicious CPU activity, and on 0.1 (the gateway), filtering with dst host 192.168.0.4 in pftop doesn't show any internal IP using the DNS server.
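One way to check the "a workstation is flooding the DNS" theory would be to count queries per client from a packet capture. The sketch below assumes the capture is taken on the DNS host itself (if the workstations and 192.168.0.4 share a subnet, their queries may never cross the gateway), using something like `tcpdump -n udp dst port 53`, and that the text output has the usual `IP src.port > dst.53:` shape. The function name `count_dns_clients` is hypothetical.

```python
import re
from collections import Counter

# Matches the address part of a tcpdump -n text line for traffic to port 53,
# e.g. "IP 192.168.0.23.51234 > 192.168.0.4.53: 4660+ A? example.com. (29)".
QUERY = re.compile(r"\bIP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.53: ")


def count_dns_clients(lines, dns_server="192.168.0.4"):
    """Count packets sent to dns_server port 53, grouped by source IP."""
    counts = Counter()
    for line in lines:
        match = QUERY.search(line)
        if match and match.group(2) == dns_server:
            counts[match.group(1)] += 1
    return counts


# Illustrative lines only (made-up clients and names, not real capture data).
sample = [
    "12:16:12.078123 IP 192.168.0.23.51234 > 192.168.0.4.53: 4660+ A? example.com. (29)",
    "12:16:12.079456 IP 192.168.0.23.51235 > 192.168.0.4.53: 4661+ A? example.org. (29)",
    "12:16:12.080789 IP 192.168.0.31.40000 > 192.168.0.4.53: 17+ AAAA? example.net. (29)",
    "12:16:12.081000 IP 192.168.0.4.53 > 192.168.0.23.51234: 4660 1/0/0 A 93.184.216.34 (45)",
]
for ip, n in count_dns_clients(sample).most_common(10):
    print(f"{n:8d}  {ip}")
```

In practice the `lines` argument could be a saved capture file opened for reading. A client whose count is orders of magnitude above the rest during a bad period would be the first one to inspect.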
I've tried unplugging the workstations' Ethernet cables one by one to find a possible offending workstation, but this process is neither fast nor accurate, and by the time the network stabilizes I'm not really sure whether it was due to the last workstation I unplugged or whether the network simply recovered on its own.
Any ideas on where to look next?
Based on the information provided, I personally would lean towards an L2 switching loop and/or misconfigured link aggregation on the DNS server. It could also be an L3 routing loop, but that seems less likely. However, I can't be at all certain without more information.
The catch-22 is that I don't have the reputation to comment on the question in order to clarify the problem and determine if this answer has any merit before I post it. Hopefully this will point you in the right direction and you find your answer soon.
I'm not sure the evidence points to DNS. It looks to me like your Internet connection is being overwhelmed, based on the long ping times and packet loss. I would suggest that disabling the DNS server is preventing one or more clients (possibly misbehaving due to a virus, as you suggested) from using the Internet connection because it can no longer look up hostnames. This reduces the traffic and the Internet connection begins to perform normally.
I would recommend monitoring the Internet connection with something that can report on the top talkers to help you find the offending machine.
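If no dedicated monitoring tool is at hand, a crude top-talkers report can be built from `tcpdump -n` text output captured on the gateway's internal interface. The sketch below is only an approximation under a few assumptions: the LAN is 192.168.0.x, the lines end in the `length N` field that tcpdump prints for TCP and UDP (payload bytes, not full frame size), and the name `top_talkers` is made up for this example.

```python
import re
from collections import Counter

# Matches tcpdump -n text lines for TCP/UDP packets that carry a "length N" field.
PACKET = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.\d+: .*\blength (\d+)"
)


def top_talkers(lines, lan_prefix="192.168.0."):
    """Sum payload bytes exchanged with non-LAN hosts, per LAN address."""
    totals = Counter()
    for line in lines:
        match = PACKET.search(line)
        if not match:
            continue  # ICMP, DNS summaries, ARP, etc. are ignored in this sketch
        src, dst, length = match.group(1), match.group(2), int(match.group(3))
        src_local = src.startswith(lan_prefix)
        dst_local = dst.startswith(lan_prefix)
        if src_local and not dst_local:
            totals[src] += length  # upload from a LAN host
        elif dst_local and not src_local:
            totals[dst] += length  # download to a LAN host
    return totals


# Illustrative lines only (made-up hosts, not real capture data).
sample = [
    "12:16:12.078123 IP 192.168.0.23.51234 > 93.184.216.34.80: Flags [P.], seq 1:513, ack 1, win 229, length 512",
    "12:16:12.079000 IP 93.184.216.34.80 > 192.168.0.23.51234: Flags [.], seq 1:1449, ack 513, win 229, length 1448",
    "12:16:12.080000 IP 192.168.0.31.40000 > 192.168.0.4.53: 4660+ A? example.com. (29)",
    "12:16:12.081000 IP 192.168.0.31.5000 > 198.51.100.7.6000: UDP, length 100",
]
for ip, total in top_talkers(sample).most_common(10):
    print(f"{total:10d}  {ip}")
```

Traffic between two LAN hosts is deliberately left out, since the question here is which machine is saturating the Internet link.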
If your DNS server is publicly accessible, you could be a pawn in a DNS amplification attack and the resulting outgoing traffic is overwhelming your available bandwidth.