We have a pair of VMs running as virtual routers and BGP/TCP peering between the two virtual routers (running over QEMU/KVM). The VMs each have a tap interface that is connected to a Linux bridge that only has the two taps as members.
All works great, except that we see that conntrack seems to be reporting the TCP sessions between these two VMs. Initially we thought that the TCP sessions were leaking and that this was a security hole, but netstat reports nothing. So it seems we are not allocating a TCB for this on the host OS (which is correct); phew. The guest OS traffic should be transparent to the host OS which it seems it is; mostly.
The reason this conntrack behavior is an issue is that if both VMs are reset at the same time, then there is no one left running to send any traffic on the guest TCP sessions to cause a TCP reset, so we get a conntrack "leak" on the host OS. Over time this builds up and eventually the host OS runs out of resources. We have a lot of BGP sessions in this test. It seems this is a way for a guest OS to do a DoS on a host OS...
Is this valid behaviour for conntrack? This is private VM to VM communication over an L2 bridge. Why should Linux be snooping and recording such TCP sessions? Is this a bug or a feature?
Most approaches seem to involve iptables to stop this; we don't really want to have to ask the customer to do that. Any other suggestions?
Yes, this behavior is expected, though I don't know that it is the problem you anticipate. Both TCP and UDP entries in the conntrack table expire on their own over time. You can see the timeout values in /proc/sys/net/netfilter/*timeout* and adjust them through either /proc or sysctl. Note that the location may differ on older kernels, perhaps /proc/sys/net/ipv4/netfilter/.
If that isn't going to cut it and you are unsatisfied with the iptables -t raw -j NOTRACK solution, you can turn off iptables processing of bridged traffic by setting the relevant sysctl parameters at runtime, or by setting the same parameters in /etc/sysctl.conf. Both of these stop bridged traffic from being passed up to iptables, which should have the effect of bypassing conntrack.
Alternatively, if you aren't using it, you can disable ip_conntrack altogether, either by blacklisting the module or by otherwise disabling it in your kernel.
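To make the two sysctl-based options concrete, here is a rough sketch. The answer itself does not spell out the parameter names, so the net.bridge.bridge-nf-call-* keys below are my assumption about what is meant; on newer kernels they probably only appear once the br_netfilter module is loaded. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the bridge sysctl key
# names are an assumption, not something stated in the answer above.

# Print every conntrack timeout found under a directory as "name = value".
# The directory is a parameter so an older kernel's path can be passed in.
show_conntrack_timeouts() {
    dir="${1:-/proc/sys/net/netfilter}"
    for f in "$dir"/*timeout*; do
        # Skip unreadable files and the unexpanded glob when nothing matches.
        [ -r "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Emit an /etc/sysctl.conf-style fragment that stops bridged frames from
# being handed to iptables, ip6tables and arptables (assumed key names).
bridge_nf_fragment() {
    for key in iptables ip6tables arptables; do
        printf 'net.bridge.bridge-nf-call-%s = 0\n' "$key"
    done
}

show_conntrack_timeouts
bridge_nf_fragment
```

Nothing here changes the running system. To actually apply the settings you would append the output of bridge_nf_fragment to /etc/sysctl.conf as root and reload with sysctl -p, or pass each key=value pair to sysctl -w for a change that does not survive a reboot.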