Preface
We are testing some host based IPS. In this test case, our application is listening on the loopback and the application is receiving traffic in clear text. We are using either nginx or haproxy to terminate TLS on the public interface. Our IDP will be monitoring the loopback so that it can see unencrypted traffic.
Our IDP was seeing malformed / incorrect dates, so we started digging deeper.
[ Update 2 ] As @kasperd mentioned, tcpdump is getting the timestamps from the OS. That said, it turns out, this bug is actually tripping up the IDP in addition to tcpdump. It sees the connection_established, but failed to see a valid http session, as the syn-ack is not valid.
A bug has been filed on redhat.com and centos.org.
Observation
The first syn-ack on the loopback always have a date close to the start of epoch, or within 2 years of it on VM's. This varies wildly from Dec 1970 to Feb 1973 on VM's and far in the futre on bare metal Xeon servers. NTP is correct on all of our VM's and bare metal servers, less than 50ms drift.
This only happens on the loopback. We never see this on bond0 on the servers or eth0 on VM's.
Test Servers and laptops
OS: CentOS 7
Platforms:
Dell 20 Core Xeon servers ( bare metal host OS )
HP 20 Core Xeon servers ( bare metal host OS )
VirtualBox on MacOS
Hyper-V on Windows 10 Enterprise on Lenovo P50 with 6 virtual cores.
One Celeron 4 Core 1.6 GHz based router (can not reproduce on the Celeron)
Steps to reproduce
On each platform, we start up a web listener on port 80 on the loopback.
./simple_python 127.0.0.1 &
The code for above is here
Then we start up tcpdump
tcpdump -p -NNnn -XXxx -tttt -vv -s0 -c2 -i lo &
Then we curl to localhost
curl -s -o /dev/null http://127.0.0.1/
Output
2018-04-10 21:05:30.087769 IP (tos 0x0, ttl 127, id 49233, offset 0, flags [DF], proto TCP (6), length 60)
127.0.0.1.25134 > 127.0.0.1.80: Flags [S], cksum 0xfe30 (incorrect -> 0xce27), seq 4053136920, win 65495, options [mss 65495,sackOK,TS val 22951497 ecr 0,nop,wscale 13], length 0
0x0000: 0000 0000 0000 0000 0000 0000 0800 4500 ..............E.
0x0010: 003c c051 4000 7f06 3d68 7f00 0001 7f00 .<.Q@...=h......
0x0020: 0001 622e 0050 f195 f618 0000 0000 a002 ..b..P..........
0x0030: ffd7 fe30 0000 0204 ffd7 0402 080a 015e ...0...........^
0x0040: 3649 0000 0000 0103 030d 6I........
1973-02-14 22:12:10.785902 IP (tos 0x0, ttl 127, id 0, offset 0, flags [DF], proto TCP (6), length 60)
127.0.0.1.80 > 127.0.0.1.25134: Flags [S.], cksum 0xfe30 (incorrect -> 0x2f28), seq 3928063281, ack 4053136921, win 65483, options [mss 65495,sackOK,TS val 22951497 ecr 22951497,nop,wscale 13], length 0
0x0000: 0000 0000 0000 0000 0000 0000 0800 4500 ..............E.
0x0010: 003c 0000 4000 7f06 fdb9 7f00 0001 7f00 .<..@...........
0x0020: 0001 0050 622e ea21 7d31 f195 f619 a012 ...Pb..!}1......
0x0030: ffcb fe30 0000 0204 ffd7 0402 080a 015e ...0...........^
0x0040: 3649 015e 3649 0103 030d 6I.^6I....
In every case, the syn-ack is always some date between 1970 and 1973 on VM's and way in the future on the Xeons.
- I can reproduce this 100% of the time on each of the platforms, except for the Celeron. We don't use Celerons in the data-center. I was just trying to find something not affected.
What else have I tried to make this go away?
- I have tried pinning applications to a core using
taskset
. - I tried setting different variables that affect libc, such as TZ, LANG, LC_ALL, etc...
- I have tried disabling all offloading capabilities of the interface, despite it being a loopback and those shouldn't actually do anything.
- I have tried a few different sysctl settings.
- I tried using different snaplen in tcpdump. ( I am aware of some historical issues around snaplength )
- I verified the hardware clock is correct.
What I have not tried
- I have not tried to set up receive flow steering, since we would not do this in our data-centers without really good reason.
- There are probably a myriad of other things I could try, but this really looks like a libc / buffer / race condition bug of some kind.
Any thoughts on where in the Linux code this might be occurring? I am hesitant to dig into glibc as I am not a C developer.
[Update] It appears @jackthecoiner found where someone else is having this issue as well and has not received any feedback on the Redhat site as of yet.
I filed this in the CentOS bug tracker where it was quickly confirmed, and they suggested I move it to the Redhat kernel bug tracker
From there, Redhat devs indicated this bug was a dupe of bz1473533, which was fixed in kernel-3.10.0-703.el7. For CentOS, it was confirmed that this is fixed in the 7.5 beta kernel (830.el7).