This might be a bit esoteric, but I've got an odd problem with stat(). Specifically: on a Linux system with Isilon storage NFS-mounted, I have a process that gets different timestamps when running fstat() twice on the same file.
I can reproduce this with a bit of C that does nothing more complicated than open-write-stat-close; open-stat-close.
The two timestamps will - occasionally - differ by a few milliseconds, with the file appearing 'older' than it actually is.
I'm unclear whether 'stat' is reading the time from the host or from the server. I'm also not clear whether the problem is with the Isilon (it doesn't seem to happen on NetApp storage).
Can anyone give me a bit of insight into what might be going on?
As an example:
time mismatch: 1468936451.862865611 != 1468936451.860183107
The first is the time read while the (written and fsync'd) file handle is still open.
The second is the time read after closing and reopening the file.
As you can see - the file is a bit older 'on the server' than on the client.
The C code to test this is:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

int main (void)
{
  while (1)
    {
      struct stat sb[2];

      /*
       * Open file
       */
      int fd = open ("test.dat", O_CREAT | O_RDWR | O_APPEND, 0644);
      if (fd < 0)
        {
          perror ("failed to open file (1)");
          return EXIT_FAILURE;
        }

      /*
       * Write a single byte to it
       */
      if (write (fd, "", 1) != 1)
        {
          perror ("failed to write to file");
          return EXIT_FAILURE;
        }

      /*
       * Ensure it's sync'd to disk
       */
      if (fsync (fd) < 0)
        {
          perror ("failed to fsync file");
          return EXIT_FAILURE;
        }

      /*
       * Stat the file while the writing handle is still open
       */
      if (fstat (fd, &(sb[0])) < 0)
        {
          perror ("failed to fstat file (1)");
          return EXIT_FAILURE;
        }

      /*
       * Close it
       */
      close (fd);

      /*
       * Reopen the file
       */
      fd = open ("test.dat", O_RDONLY);
      if (fd < 0)
        {
          perror ("failed to open file (2)");
          return EXIT_FAILURE;
        }

      /*
       * Stat it again
       */
      if (fstat (fd, &(sb[1])) < 0)
        {
          perror ("failed to fstat file (2)");
          return EXIT_FAILURE;
        }

      /*
       * Close it
       */
      close (fd);

      /*
       * Check the two mtimes match, down to the nanosecond
       */
      if (sb[0].st_mtime != sb[1].st_mtime ||
          sb[0].st_mtim.tv_nsec != sb[1].st_mtim.tv_nsec)
        {
          printf ("time mismatch: %d.%09ld != %d.%09ld\n",
                  (int)(sb[0].st_mtime), sb[0].st_mtim.tv_nsec,
                  (int)(sb[1].st_mtime), sb[1].st_mtim.tv_nsec);
        }
    }
}
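For reference, a minimal build-and-run sequence like this should work (stat_test.c is just an assumed file name; run the binary from inside the NFS mount):
gcc -Wall -o stat_test stat_test.c
./stat_test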
And you can trigger it by running this in an NFS mount, then calling 'stat' (the shell command is fine) on the same file from another system, e.g. on the Isilon itself:
while true
do
stat test.dat
done
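As an aside, GNU stat already prints the mtime to nanosecond precision on its 'Modify:' line; to narrow the output to just that field, something like this should work:
stat -c '%y' test.dat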
This gets me errors like:
time mismatch: 1468936613.313045037 != 1468936613.310174576
time mismatch: 1468936613.547768543 != 1468936613.544878047
time mismatch: 1468936615.228495345 != 1468936615.225862947
time mismatch: 1468936619.036053897 != 1468936619.033362349
time mismatch: 1468936619.988151772 != 1468936619.985529620
time mismatch: 1468936664.541226541 != 1468936664.538549678
time mismatch: 1468936666.165674866 != 1468936666.163171366
time mismatch: 1468936666.787528027 != 1468936666.784797051
time mismatch: 1468936711.334376729 != 1468936711.331868927
And ... I don't think it should.
Edit:
The plot thickens - when running the above and capturing packets, the timestamp in the server's reply to the NFS GETATTR call is inconsistent: I get two different timestamps for the same file.
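For anyone wanting to reproduce the capture, something along these lines should be enough to grab the NFS traffic for inspection in Wireshark (the interface name is a placeholder; 2049 is the standard NFS port):
tcpdump -i eth0 -s 0 -w nfs-getattr.pcap port 2049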
Mount options don't seem to make a difference - we have tried setting noac, actimeo=0, sync, lookupcache=none, etc.
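For illustration, a mount with those options would look something like this (server name and paths are placeholders):
mount -t nfs -o noac,actimeo=0,sync,lookupcache=none server:/export /mnt/test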
Our Isilon nodes are all NTP-synced to the same sources; as usual, there's a small amount of drift between them.
We have found that we can apply a workaround in the form of: isi export modify --zone <zoneid> <exportnumber> --time-delta 1
This sets the resolution of the timestamp to 1s rather than 1ns. Whilst it reduces the frequency of the problem occurring, it will still happen - just less often, because the mismatch can now only show up at a '1s' boundary.