Two machines mount the same NFS folder (which lives on a single third machine).
Files get uploaded from either client server to this mount, usually in chunks. Server A might handle one chunk, then server B the next, each with defined start and end offsets so that it all adds up in the end.
Still, there have been a few instances where running md5sum on the file from server A gives a different result than from server B.
In reality, the file is on the NFS server, and as far as I'm aware there should be only one version pushed to all clients.
And it's not fixing itself over time.
I'm currently assuming this is a race condition related to the chunks not being added in order, combined with NFS caching: one of the servers may think the file has a certain length when it doesn't, causing a lot of 0000 0000 padding to be added.
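The out-of-order part of this hypothesis can be reproduced on any local filesystem, which helps separate ordinary sparse-file behaviour from an NFS caching effect. This is a minimal sketch, assuming GNU dd (for oflag=seek_bytes and status=none). write_chunk is a hypothetical helper, not the actual upload code:

```shell
#!/bin/sh
# Hypothetical helper: write stdin into FILE starting at byte OFFSET,
# without truncating what is already there. Assumes GNU dd.
write_chunk() {
    dd of="$1" oflag=seek_bytes seek="$2" conv=notrunc status=none
}

demo=$(mktemp)

# Chunk 2 arrives first: the file grows to 8 bytes and bytes 0-3 read as zeros.
printf 'BBBB' | write_chunk "$demo" 4
od -An -tx1 "$demo"

# Chunk 1 arrives later and fills the hole.
printf 'AAAA' | write_chunk "$demo" 0
od -An -tx1 "$demo"

rm -f "$demo"
```

The first od shows four zero bytes followed by 42 42 42 42, and the second shows 41 41 41 41 42 42 42 42. A client that cached the zero-filled range between the two writes and never revalidated it would keep seeing the first picture, even though the file length is the same in both.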
So, how come this is happening? Is there a mount setting I need to use to prevent this? Is there a way to tell the NFS server to re-sync the file to all clients?
And just in general, how should this be dealt with?
EDIT: Mounting options on the clients:
machine1:~$ nfsstat -m
/mnt/dirA from <SERVER_IP>:/dirA
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
/mnt/dirB from <SERVER_IP>:/dirB
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
machine1:~$ cat /proc/mounts | grep <SERVER_IP>
<SERVER_IP>:/dirA /mnt/dirA nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
<SERVER_IP>:/dirB /mnt/dirB nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
machine2:~$ nfsstat -m
/mnt/dirA from <SERVER_IP>:/dirA
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
/mnt/dirB from <SERVER_IP>:/dirB
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
<SERVER_IP>:/dirA /mnt/dirA nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
<SERVER_IP>:/dirB /mnt/dirB nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
EDIT2: Both machines are fresh installs of Ubuntu 18.04, and the md5sum tool is version 8.28 on both.
EDIT3:
I found this note that I kept on the files. I ran xxd on both machines to get a hexdump, reading from the mount and writing to each machine's local filesystem, to be certain the dump was captured from each machine's own point of view.
As you can see, according to machine01, there's empty padding in the file, but not according to machine02.
This is the result:
root@machine01:/home/kdguser# grep -C 5 '2ddd5000' output01
2ddd4fb0: 0a78 95ff c53e e2c4 f79a db05 0a59 d7d1 .x...>.......Y..
2ddd4fc0: 85a8 1192 26a6 a25a d741 db3c a61f e72e ....&..Z.A.<....
2ddd4fd0: 4d0b 97b6 93cc 7845 6ef4 0cca f9aa 9390 M.....xEn.......
2ddd4fe0: 9f00 bacd 707f 2398 f419 e49e 8073 67fb ....p.#......sg.
2ddd4ff0: 89f5 9450 99f5 808f 4b21 3154 f97f 1271 ...P....K!1T...q
2ddd5000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5050: ba34 fb76 5af3 69d2 9af0 4711 8a0c eae8 .4.vZ.i...G.....
root@machine02:/home/kdguser# grep -C 5 '2ddd5000' output02
2ddd4fb0: 0a78 95ff c53e e2c4 f79a db05 0a59 d7d1 .x...>.......Y..
2ddd4fc0: 85a8 1192 26a6 a25a d741 db3c a61f e72e ....&..Z.A.<....
2ddd4fd0: 4d0b 97b6 93cc 7845 6ef4 0cca f9aa 9390 M.....xEn.......
2ddd4fe0: 9f00 bacd 707f 2398 f419 e49e 8073 67fb ....p.#......sg.
2ddd4ff0: 89f5 9450 99f5 808f 4b21 3154 f97f 1271 ...P....K!1T...q
2ddd5000: c969 a259 431e 2a17 12b4 8365 07cb 5e56 .i.YC.*....e..^V
2ddd5010: fa61 327f eb63 1b13 bc30 eb4b c8f0 af14 .a2..c...0.K....
2ddd5020: 6ebe 3f79 9012 7ece 1662 e104 be19 b249 n.?y..~..b.....I
2ddd5030: 9b9c f61d 180b e92a b93b 9980 aba4 ba41 .......*.;.....A
2ddd5040: 0929 fece fc8a 5309 3883 2562 fe2a 459a .)....S.8.%b.*E.
2ddd5050: ba34 fb76 5af3 69d2 9af0 4711 8a0c eae8 .4.vZ.i...G.....
The actual file is the one seen from machine02, yet machine01 shows something else.
EDIT4: Just to be clear, the length of the files is identical, the md5 is different on each client.
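Since the lengths match, the exact extent of the mismatch can be found without grepping hexdumps. This is a small sketch using cmp -l on two local copies, one taken on each client. diff_range and the file names copy01 and copy02 are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: summarise where two same-sized files differ.
# Prints "<first> <last> <count>" using 1-based decimal byte offsets,
# or nothing at all when the files are identical.
diff_range() {
    cmp -l "$1" "$2" 2>/dev/null |
        awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR) print first, last, NR }'
}

# Usage, with copy01 and copy02 being local copies taken on each client:
#   diff_range copy01 copy02
```

Note that cmp counts bytes from 1 in decimal, while xxd prints 0-based hex offsets. In the dump above the zeroed range starts at 0x2ddd5000, a multiple of 4096. That would be consistent with a stale cached page on machine01, but this is an interpretation rather than something the dump proves.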
I recommend reading the "Data And Metadata Coherence" section of the nfs man page.
In particular, you need to use noac.
Alas, that may not be sufficient, so you may need to open the file with the O_DIRECT flag if noac doesn't solve the problem for you.
Disclaimers: First, I do not use Ubuntu. Second, I am "old school." Third, the documentation will probably disagree with me (see the second disclaimer).
BLUF: This is probably a timing, caching, or buffering issue.
Explanation: In the Old Days, the program would not actually write to disk immediately. The OS would actually send the file data to a buffer. When the buffer was (nearly) full, the buffer would flush to disk. I.e. the contents of the buffer would then be physically written to the disk itself.
For disk arrays, sometimes the disk controller would also have a cache. Data could potentially arrive at the controller faster than the disk could write, so it would be cached in the controller until the disk could catch up.
For network traffic, data is generally transmitted in packets. With TCP/IP, there is no guarantee that the packets will arrive in the order they were sent, so there is a buffer which holds the packets and reassembles them in the correct order.
Today, the buffers are supposed to flush immediately. Back in the day, we would run the sync command to force buffer flushes.
The issues that I see here are:
Each server has a "Next Block Number" where it is supposed to start writing when its turn comes. This value could be out of sync between Server A and Server B.
The cache, or buffer, may not be writing quickly enough. E.g. Server A has to send its data to Server C. Server C has to physically write it to disk. Server B has to re-read the file from the disk before it can "see" it.
This means that Server B may have a hole in its data from the previous flush of Server A, and vice versa.
Server C, the NFS server, could be overloaded with read/write requests. Does Server C have yet another checksum?
Server A and Server B may not be re-reading fast enough.
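One way to test the "not re-reading" theory on a single client is to checksum the file once through the normal cached path and once with the page cache bypassed. This is a sketch that assumes GNU dd, whose iflag=direct opens the input with O_DIRECT. md5_direct is a hypothetical helper and somefile is a placeholder name:

```shell
#!/bin/sh
# Hypothetical helper: checksum FILE while asking the kernel to bypass the
# local page cache (O_DIRECT). Assumes GNU dd; dd fails with "Invalid
# argument" on filesystems that do not support O_DIRECT, in which case the
# printed value is just the MD5 of empty input and should be ignored.
md5_direct() {
    dd if="$1" iflag=direct bs=1M status=none | md5sum | cut -d' ' -f1
}

# Compare the cached and uncached views of the same file on one client:
#   md5sum < /mnt/dirA/somefile | cut -d' ' -f1
#   md5_direct /mnt/dirA/somefile
```

If the two values differ on the same machine, the client is serving stale cached data. If they agree but still differ from the other client, the problem is more likely elsewhere.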
Hopefully, this will give you some insight into where to look for answers.
Possible troubleshooting steps:
Is it possible to quiet the network, run a few sync commands on each server, and see if the checksums match?
Does the file eventually catch up? You mentioned a hole in the data.
After a time (TBD), does the padding fill in with the missing data? If so, you have a buffering or timing issue. If not, you have a much bigger problem with the entire system design.
Can you revisit the two-server design? Can you have only one of the servers take over all the writing and fail over to the other server if necessary?
Are there caching parameters or timing values in your configurations that you can tweak?
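On that last question: the mounts shown in the question cache attributes for one second (acregmin=1, acregmax=1, acdirmin=1, acdirmax=1) rather than not at all. The nfs man page describes noac as disabling attribute caching and making application writes synchronous, at a performance cost. Below is a sketch of switching one mount over. remount_noac is a hypothetical helper, 192.0.2.10 stands in for <SERVER_IP>, and a full unmount and mount is used on the assumption that attribute-cache options may not take effect on a live remount:

```shell
#!/bin/sh
# Hypothetical helper: remount_noac SERVER EXPORT MOUNTPOINT
# Dry run by default (commands are only printed); set RUN=sudo to execute.
remount_noac() {
    run=${RUN:-echo}
    $run umount "$3" &&
        $run mount -t nfs4 -o vers=4.2,hard,proto=tcp,noac "$1:$2" "$3"
}

# Example (placeholder address): prints the two commands without running them.
remount_noac 192.0.2.10 /dirA /mnt/dirA
```

The dry-run default lets the commands be reviewed before anything is unmounted. Whether noac alone is enough for this workload is not something the sketch can establish.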