I am trying to duplicate UDP packets flowing to port 50007 on an Internet address from devices on a local NAT network (192.168.12.0/24) with the intent of processing them locally (on 192.168.12.1:50006).
On 38 of my 40 devices, the following iptables mangle and nat tables do the trick - port 50006 receives packets at the transmitted rate of 12 per minute - 1 per 5 seconds.
However, on two devices whose configuration is identical to that of the other 38, port 50006 receives packets at 1/11th of the transmitted rate, i.e. 1 packet every 55 seconds; the other 10 of every 11 packets are presumably being dropped.
Port 50006 is listened to by a socat script:
socat UDP-RECVFROM:50006,fork "EXEC:handler-script"
The handler script returns within 1 second and no change in receipt rate is observed when the script is changed to be a no-op.
One of the two malfunctioning devices spontaneously corrected itself and port 50006 started to receive packets at the transmitted rate.
The remaining device is still only receiving packets at 1/11th of the transmitted rate, even though tcpdump shows the original packets arriving at the full rate.
$ sudo iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
TEE udp -- 192.168.12.0/24 anywhere udp dpt:50007 TEE gw:10.0.0.1
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
$ sudo iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DNAT udp -- 192.168.12.0/24 anywhere udp dpt:50007 to:192.168.12.1:50006
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- 192.168.12.0/24 anywhere
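For readers who want to reproduce the setup, rules of roughly the following shape would produce the two listings above. This is a reconstruction from the listings, not the exact commands used on the devices. The `run` wrapper, the `emit_rules` function and the `DRY_RUN` variable are hypothetical helpers added so the sketch only prints the commands by default.

```shell
#!/bin/sh
# Reconstruction of the rules implied by the listings above (not the original commands).
# Prints the commands by default; set DRY_RUN=0 and run as root to apply them.
LAN=192.168.12.0/24        # downstream NAT network
PORT=50007                 # destination port of the packets being duplicated
TEE_GW=10.0.0.1            # gateway shown in the TEE rule
LOCAL=192.168.12.1:50006   # local listener (socat)

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

emit_rules() {
    # mangle/PREROUTING: clone matching packets and send the clone via the gateway
    run iptables -t mangle -A PREROUTING -s "$LAN" -p udp --dport "$PORT" -j TEE --gateway "$TEE_GW"
    # nat/PREROUTING: redirect the original packet to the local listener
    run iptables -t nat -A PREROUTING -s "$LAN" -p udp --dport "$PORT" -j DNAT --to-destination "$LOCAL"
    # nat/POSTROUTING: masquerade traffic from the LAN
    run iptables -t nat -A POSTROUTING -s "$LAN" -j MASQUERADE
}

emit_rules
```

Running the script unchanged prints the three iptables commands without touching the firewall.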
One oddity with the malfunctioning node is that tcpdump shows the packet twice: once with the original source address and once with the source address rewritten to the local node's IP on the uplink. On the working nodes, tcpdump only shows the original (pre-masqueraded) packet.
$ sudo tcpdump -i eth2 port 50007
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 65535 bytes
03:56:41.851719 IP 192.168.12.66.4097 > example.com.50007: UDP, length 216
03:56:41.851996 IP 10.0.0.8.4097 > example.com.50007: UDP, length 216
CPU on the malfunctioning node appears nominal and there are almost certainly no configuration differences between the functioning and malfunctioning nodes. The nodes are, however, deployed in different LANs.
So the question is: what could be causing the malfunctioning node to drop 10 out of every 11 packets? A second question is: why is tcpdump behaving differently on this node and showing the masqueraded packet as well as the original packet?
Any suggestions about how I might go about debugging this issue?
Ok, it turns out that the explanation was provided by this question and answer:
Why is iptables not dropping packets?
This installation had particularly good wifi connectivity between the downstream device and the router, so the ip_conntrack entry for the flow was created before the DNAT rule was installed. Because the nat table is only consulted for the first packet of a tracked connection, packets matching the existing entry never reached the DNAT rule, which is the circumstance described in the referenced Q&A.
I was able to confirm this by disabling the AP on the router for 185 seconds (5 seconds longer than the default conntrack entry timeout) and then re-enabling it. Once I did this, packets flowed at the expected rate.
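An untested alternative to waiting out the timeout would be to inspect and delete the conntrack entries for the flow directly. This is a minimal sketch, assuming the `conntrack` utility from conntrack-tools is installed and works with the kernel on the device. The `run` wrapper, the `flush_stale` function and the `DRY_RUN` variable are hypothetical helpers that make the sketch print the commands by default.

```shell
#!/bin/sh
# Sketch: inspect and delete conntrack entries for the duplicated flow.
# Prints the commands by default; set DRY_RUN=0 and run as root to execute them.
PORT=50007   # original destination port of the tracked flow

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

flush_stale() {
    # List tracked UDP flows to the port. An entry created before the DNAT rule
    # existed would show the remote address, not 192.168.12.1, in its reply tuple.
    run conntrack -L -p udp --dport "$PORT"
    # Delete those entries so the next packet is evaluated against the nat table again.
    run conntrack -D -p udp --dport "$PORT"
}

flush_stale
```

Note that the delete removes every tracked UDP flow to that port, including any that were already being redirected correctly; those should simply be re-created by the next packet.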
Also, the unexpected duplication of packets as visible in the tcpdump trace disappeared once this issue was resolved.