I'm troubleshooting a customer who requires the ability to send 5000 pings from the router to their remote site over a satellite link with zero timeouts, yet they keep experiencing one to five packets lost per test.
Under ordinary circumstances, I'd be willing to chalk up such a low loss rate to the cost of a satellite link, but the drops only show up when pinging from the router to the remote site. To clarify, here are the network devices involved:
Outbound Traffic
- 192.1.1.51 Router (Hub)
- 192.1.1.52 TX Switch (Hub)
- 192.1.1.50 Encapsulator (Hub)
- 172.1.1.1 Remote Site (Remote)
Return Traffic
- 172.1.1.1 Remote Site (Remote)
- 192.1.1.28 Channel Unit (Hub)
- 192.1.1.53 RX Switch (Hub)
- 192.1.1.51 Router (Hub)
When pinging from the router to the remote site, the losses show up. When pinging from a Sun server attached to the TX switch (bypassing the router), the 5000 pings complete without a single loss. As far as I can tell, this verifies the entire satellite path and all equipment except the router.
Then I tried sending 5000 pings from the router to all of the other devices aside from the remote site...and I got back all 5000 almost instantaneously with no drops, so the connection from the router to everything else in the path is verified good.
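When comparing tests like these, it helps to look at the raw packet counts rather than the percentage, since the whole-number percentage in the ping summary can hide a loss of only a few packets out of 5000. Below is a minimal Python sketch that pulls the counts out of the summary line IOS prints at the end of a ping. The function name and the sample lines are made up for illustration and were not captured from the router in question.

```python
import re

# Matches the summary line printed at the end of a Cisco IOS ping, e.g.
# "Success rate is 99 percent (4997/5000), round-trip min/avg/max = ..."
SUMMARY_RE = re.compile(r"Success rate is \d+ percent \((\d+)/(\d+)\)")


def lost_packets(ping_output):
    """Return (lost, sent) from IOS ping output, or None if no summary line is found."""
    match = SUMMARY_RE.search(ping_output)
    if match is None:
        return None
    received, sent = int(match.group(1)), int(match.group(2))
    return sent - received, sent


# Hypothetical summary lines, one per test target (illustrative only).
results = {
    "remote site": "Success rate is 99 percent (4997/5000)",
    "TX switch": "Success rate is 100 percent (5000/5000)",
}
for target, output in results.items():
    lost, sent = lost_packets(output)
    print(f"{target}: {lost} of {sent} lost ({100 * lost / sent:.2f}%)")
```

Running the same check against each hop makes it easier to see at a glance which source and destination pairs actually lose packets.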
The router in question is a Cisco 7206VXR, and the CPU utilization doesn't appear to ever go above 50%. The highest process is only at 20%, so I'm not confident that it's simply a matter of the router dropping ICMP packets due to lower priority, particularly given that the router will send 5000 packets to local devices with no issues.
I also looked into the possibility of a null route, but the only possible culprit is an essential route for remote access, according to the customer, and I can't post their running config here to get a second opinion.
Any suggestions would be greatly appreciated. I have very little networking experience, and I'm beating my head against the wall to reconcile these seemingly contradictory symptoms.
Datagrams are a best-effort service. If you have a requirement that data be reliably delivered, you cannot use datagrams. It really is that simple. The entire design of the system, end to end, is not meant to meet this requirement. You can't just impose it on the system as a whole at the end, like putting a cherry on a sundae.
It turns out the problem was that CEF was enabled globally on the hub router, but explicitly disabled ("no ip route-cache cef") on the interface which connects to the hub LAN. Once the explicit disable statements were removed, the packet loss vanished.
I don't understand why that worked, given that there was no packet loss between the hub devices and the hub router, but I can't argue with the results.
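For anyone who wants to check a saved configuration for the same mismatch, here is a rough Python sketch that scans running-config text for interfaces carrying "no ip route-cache cef" while "ip cef" is enabled globally. It assumes the usual IOS layout, where interface sub-commands are indented under an "interface" line. The function name, interface names, and sample config are hypothetical and are not taken from the customer's router.

```python
def find_cef_mismatches(config_text):
    """Return interfaces that disable CEF even though 'ip cef' is enabled globally."""
    cef_global = False
    current_iface = None
    disabled = []
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        if not line.startswith(" "):
            # Unindented line: a global command or the start of a new block.
            current_iface = None
            if line == "ip cef" or line.startswith("ip cef "):
                cef_global = True
            elif line.startswith("interface "):
                current_iface = line.split(None, 1)[1]
        elif current_iface and line.strip() == "no ip route-cache cef":
            disabled.append(current_iface)
    return disabled if cef_global else []


# Hypothetical config excerpt (interface names are invented for illustration).
sample = """ip cef
!
interface FastEthernet0/0
 ip address 192.1.1.51 255.255.255.0
 no ip route-cache cef
!
interface Serial1/0
 ip address 10.0.0.1 255.255.255.252
!
"""
print(find_cef_mismatches(sample))
```

This only flags the configuration pattern described above. It does not explain why the mismatch caused loss, and it won't catch other switching-path differences.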
Hopefully, this can help anyone else who is stuck trying to isolate a very minor packet loss.
Thanks again to everyone who offered advice on this issue.