I need to deploy an automated process (via 1 min cron script) that looks for tar files in a specific directory. If a tar file is found, it is untarred to the appropriate location and then the tar file is deleted.
The tar files are automatically copied to this server over SSH from another server. In some cases, the tar files are extremely large, with lots of files.
The problem that I am expecting to run into: if it takes more than 1 minute for the tar file to be copied to the server, and the cron script runs once every minute, it's going to see the .tar.gz file and try to untar it, even though the tar file is still being written to.
Is there any way (via bash commands) to test if a file is currently being written to, or if it's only a partial file, etc?
One alternative I was thinking of was to have the file be copied with a different file extension (like `.tar.gz.part`) and then renamed to `.tar.gz` after the transfer is complete. But I figured I'd try to figure out if there is simply a way to determine if the file is whole at the command line first... Any clues?
Your best bet is to use `lsof` to determine if a file has been opened by any process.
You can't easily tell if it's in the process of being written to, but if it is being written to, it MUST be open.
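A minimal sketch of that check, assuming `lsof` is installed and the cron job has enough privileges to see the process doing the writing. `is_open` is a hypothetical helper name and the path in the usage comment is a placeholder:

```shell
# Succeeds (exit 0) if at least one process has the file open.
# lsof exits non-zero when nothing has the named file open.
is_open() {
    lsof -- "$1" >/dev/null 2>&1
}

# Usage inside the cron script (placeholder path):
#   is_open /data/incoming/big.tar.gz || echo "nobody has it open"
```

This only says whether something has the file open right now, not whether the contents are complete.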
Edit: let's solve the actual problem here rather than try to implement the proposed solution!
Use rsync to transfer the file:
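Something along these lines on the sending side, as a sketch only. `send_archive` is a hypothetical wrapper, and the source path and `user@host:/data/incoming/` target in the usage comment are placeholders, not values from the question:

```shell
# Copy one archive with rsync's default behaviour.
# Do not add --inplace here: that option makes rsync write straight
# into the destination file instead of a temporary one.
send_archive() {
    rsync -a "$1" "$2"
}

# Usage (placeholder paths):
#   send_archive /backups/big.tar.gz user@host:/data/incoming/
```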
This way, the file won't be copied over top of the existing one but copied into a temporary file (`.big.tar.gz.XXXXXX`) until the transfer is complete, then moved into place.
You are on the right track: renaming the file is an atomic operation, so performing the rename after upload is simple, elegant and not error-prone. Another approach I can think of is to use `lsof | grep filename.tar.gz` to check if the file is being accessed by another process.
A bit old, but most of the answers completely miss the point of the question:
In general, there isn't. You simply don't have enough information to determine that.
Because determining that the file is closed is not the same as determining if the file is whole. For example, a file will get "closed" if the connection is lost partway through the transfer.
Only @Alex's answer got this right. And even he fell for using `lsof` somewhat.
Determining whether the file has been fully, successfully transferred requires more data. Such as:
That's a perfectly fine way to communicate that the file has been fully and successfully transferred. You can also move files from one directory to another as long as you stay within the same filesystem. Or have the sender send an empty `filename.done` file to signal completion.
But all methods have to rely on the sender somehow signalling that the transfer has completed successfully. Because only the sender has that information.
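On the receiving side, the marker-file variant could look roughly like this. It is a sketch under assumptions: the sender is taken to create `NAME.tar.gz.done` only after `NAME.tar.gz` has arrived intact, and `ready_archives` is a hypothetical helper name:

```shell
# Print every archive in the given directory that has a matching
# ".done" marker next to it. Archives without a marker are skipped,
# and names such as "x.tar.gz.part" never match the *.tar.gz glob.
ready_archives() {
    for archive in "$1"/*.tar.gz; do
        [ -f "$archive" ] || continue      # glob matched nothing
        if [ -e "$archive.done" ]; then
            printf '%s\n' "$archive"
        fi
    done
}
```

The cron script would then untar only what `ready_archives` prints, and remove the marker together with the archive afterwards.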
Some file formats (such as PDFs) have data in them that allow you to determine if the file is complete. But you have to open and read pretty much the entire file to find out.
`lsof` will just tell you the file is no longer open - it won't tell you why it's no longer open. Nor will it tell you how big the file is supposed to be.
The best way to do this is to use incron ("inotify cron system"). It allows you to set an inotify watch on a directory which will then notify you of file operations. In this case, you should watch the dir for a close_write. That'll allow you to then run your command once the file was closed after a write.
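A sketch of how that might be wired up. The incrontab line is shown as a comment, and every path in it, the script name, and the `DEST` default are placeholders rather than anything from the question:

```shell
#!/bin/sh
# Hypothetical incrontab entry (edit with `incrontab -e`):
#   /data/incoming IN_CLOSE_WRITE /usr/local/bin/unpack-on-close.sh $@/$#
# incron replaces $@ with the watched directory and $# with the file name.

DEST=${DEST:-/data/unpacked}    # assumed extraction target

# Unpack the archive named in $1, then delete it; ignore other files.
handle_event() {
    case "$1" in
        *.tar.gz) tar -xzf "$1" -C "$DEST" && rm -f -- "$1" ;;
        *)        return 0 ;;
    esac
}

# Only act when incron passes a path as the first argument.
if [ "$#" -gt 0 ]; then
    handle_event "$1"
fi
```

If the sender uploads under a temporary name and renames afterwards, the final name shows up through a move rather than a close, so the mask would probably need `IN_MOVED_TO` as well.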
It seems like lsof can detect what mode a file is open under:
See where it says 1w? That means that the file descriptor number is 1 and the mode is w, or write.
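For use in a script, the same information can be read from lsof's field output instead of the FD column. This is a sketch that assumes `-F a` prints one access-mode line per open descriptor (`ar`, `aw` or `au`); `is_open_for_write` is a hypothetical helper name:

```shell
# Succeeds if some process holds the file open for writing
# (mode "w") or for reading and writing (mode "u").
is_open_for_write() {
    lsof -Fa -- "$1" 2>/dev/null | grep -q '^a[wu]'
}
```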
Using `inotifywait` can achieve what you're after - it has the capability to wait until a file write has finished before executing a command.
The following will continuously watch a folder for new files and execute the command in the loop when writing to the file has finished.
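A minimal sketch of such a loop, assuming the inotify-tools package is installed. `watch_closed` is a hypothetical wrapper, and the directory and handler in the usage comment are placeholders:

```shell
# Watch a directory and run the given command once for every file
# that is closed after having been opened for writing.
#   $1      directory to watch
#   $2...   command to run; the file path is appended as last argument
watch_closed() {
    watch_dir=$1
    shift
    inotifywait -m -q -e close_write --format '%w%f' "$watch_dir" |
    while IFS= read -r file; do
        # Keep the handler from swallowing the event stream on stdin.
        "$@" "$file" </dev/null
    done
}

# Usage (placeholder paths):
#   watch_closed /data/incoming tar -xz -C /data/unpacked -f
```

Unlike the cron job, this runs as a long-lived process, so it needs to be started and supervised separately.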
For more configuration options see https://linux.die.net/man/1/inotifywatch
I use a Python script that repeatedly checks the file size until it is the same on two consecutive checks (in my case, with 0.05 s between checks, that does the job).
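The same idea as a shell sketch, not the Python script itself. `size_stable` is a hypothetical helper, the default 1-second interval is an assumption, and a fractional interval such as 0.05 relies on a `sleep` that accepts fractions:

```shell
# Succeeds if the file's size is unchanged across one interval.
#   $1  file to check
#   $2  seconds to wait between the two measurements (default 1)
# Returns 2 if the file does not exist.
size_stable() {
    [ -f "$1" ] || return 2
    first=$(wc -c < "$1" | tr -d '[:space:]')
    sleep "${2:-1}"
    [ -f "$1" ] || return 2
    second=$(wc -c < "$1" | tr -d '[:space:]')
    [ "$first" -eq "$second" ]
}

# Usage (placeholder path):
#   size_stable /data/incoming/big.tar.gz 0.05 && echo "size settled"
```

As noted earlier in the thread, a stalled or dropped transfer also produces a stable size, so this is a heuristic rather than proof that the file is whole.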