What's the best way of comparing two directory structures and deleting extraneous files and directories in the target location?
I have a small web photo gallery app that I'm developing. Users add and remove images using FTP. The web gallery software I've written creates new thumbnails on the fly, but it doesn't deal with deletions. What I would like to do is schedule a command or bash script to take care of this at predefined intervals.
Original images are stored in /home/gallery/images/ and are organised in albums, using subdirectories. The thumbnails are cached in /home/gallery/thumbs/, using the same directory structure and filenames as the images directory.
I've tried using the following to achieve this:
rsync -r --delete --ignore-existing /home/gallery/images /home/gallery/thumbs
This would work fine if all the thumbnails had already been cached, but there is no guarantee that this is the case. When a thumbnail has not been cached yet, the original full-size image is copied into the thumbs directory.
How can I best achieve what I'm trying to do?
You need --existing too. From the manpage:
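The rsync man page notes that when --existing is combined with --ignore-existing, no files will be updated, which can be useful if all you want to do is delete extraneous files. A minimal sketch of that combination follows. The function name prune_thumbs is illustrative, and the trailing slashes are an assumption added so that rsync compares the contents of the two directories rather than nesting images inside thumbs.

```shell
# Sketch: delete thumbnails whose original is gone, without copying anything.
# prune_thumbs is an illustrative name, not part of rsync.
# Usage: prune_thumbs IMAGES_DIR THUMBS_DIR [extra rsync flags...]
prune_thumbs() (
    src=$1
    dst=$2
    shift 2
    # --existing        : do not create files that are missing on the receiver
    # --ignore-existing : do not update files that already exist on the receiver
    # --delete          : remove receiver files with no counterpart in the source
    rsync -r --delete --existing --ignore-existing "$@" "$src/" "$dst/"
)

# Preview with rsync's dry-run flag first, then run for real:
#   prune_thumbs /home/gallery/images /home/gallery/thumbs -n -v
#   prune_thumbs /home/gallery/images /home/gallery/thumbs
```

Because nothing is ever created or updated on the receiving side, thumbnails that have not been generated yet are simply left missing for the gallery software to create later.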
I don't think rsync is the best approach for this. I would use a bash one-liner like the following:
If this one-liner produces the right list of files, you can then modify it to run an rm command instead of an echo command.
I have to transfer a large amount of data and many files. I have used msrsync to parallelise the rsync streams, which works well, but you cannot use the rsync option '--delete' with msrsync, as the multiple streams will conflict and try to delete each other's files. So I started looking for a solution to delete files and found this question.
My final solution, using the original question as an example and building on a previous answer (Tom Shaw's), is to use:
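A minimal sketch of a pipeline of this kind, not a verbatim command, assuming a find and xargs that support -print0, -0 and -P (GNU and BSD versions do). The function name remove_stale_thumbs and the DRY_RUN variable are illustrative. By default it only echoes what it would remove.

```shell
# Sketch: remove thumbnails that have no matching original, in parallel.
# remove_stale_thumbs and DRY_RUN are illustrative names.
# Usage: remove_stale_thumbs IMAGES_DIR THUMBS_DIR [MAX_PROCS]
remove_stale_thumbs() (
    # Resolve the images root to an absolute path before changing directory.
    images=$(cd "$1" && pwd) || exit 1
    cd "$2" || exit 1
    procs=${3:-64}
    # List every thumbnail relative to the thumbs root, NUL-separated so that
    # spaces and newlines in filenames are handled safely.
    find . -type f -print0 |
        IMAGES=$images DRY_RUN=${DRY_RUN:-1} xargs -0 -n 100 -P "$procs" sh -c '
            for thumb in "$@"; do
                # Keep the thumbnail if the original still exists.
                [ -e "$IMAGES/$thumb" ] && continue
                if [ "$DRY_RUN" = 1 ]; then
                    echo "would remove: $thumb"
                else
                    rm -f -- "$thumb"
                fi
            done
        ' sh
)

# Dry run (default), then the real deletion:
#   remove_stale_thumbs /home/gallery/images /home/gallery/thumbs
#   DRY_RUN=0 remove_stale_thumbs /home/gallery/images /home/gallery/thumbs
```

Each xargs worker receives up to 100 paths at a time, so the cost of starting a shell is spread over many files.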
The intent here is to only remove files from thumbnails/ that do not exist in images/. This solution may leave empty directories in thumbnails that do not exist in images.
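The leftover directories could be cleaned up in a second pass. A minimal sketch, assuming a find that supports -mindepth (GNU and BSD versions do); the function name prune_stale_dirs is illustrative. It only removes directories that are already empty and have no counterpart under the images root.

```shell
# Sketch: remove empty thumbnail directories whose album no longer exists.
# prune_stale_dirs is an illustrative name.
# Usage: prune_stale_dirs IMAGES_DIR THUMBS_DIR
prune_stale_dirs() (
    images=$(cd "$1" && pwd) || exit 1
    cd "$2" || exit 1
    # -depth visits children before their parents, so nested stale
    # directories are removed innermost first.
    IMAGES=$images find . -mindepth 1 -depth -type d -exec sh -c '
        for dir in "$@"; do
            # Keep directories that still correspond to an album.
            [ -d "$IMAGES/$dir" ] && continue
            # rmdir refuses to remove non-empty directories, so anything
            # that still holds files is left in place.
            rmdir -- "$dir" 2>/dev/null || echo "not empty, kept: $dir" >&2
        done
    ' sh {} +
)

# Run after the stale files have been deleted:
#   prune_stale_dirs /home/gallery/images /home/gallery/thumbs
```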
Using xargs allows this to be parallelised: '-P64' runs up to 64 processes at a time.
As in Tom Shaw's solution, I have used echo so that you can check the outcome is as expected before making it actually delete files.
I post this alternate solution for those people who may have millions of files to deal with and have the resources to run many threads.