I have a file storage directory used together with a MySQL DB.
Some of the files in the directory are orphaned (i.e. created by mistake, deleted in the DB but not on disk, or otherwise unused).
I was able to generate a list of such files (without file extensions), but now what is the best way to move them out of the storage directory? The problem is that the storage is multi-level, so I have to find each file first somehow.
Sample of the orphan list content (200K lines in total):
10218
10219
10220
10221
10370
10371
10372
10373
10374
Directory structure (example):
If you wonder how I ended up with such a file (a rough sketch of these steps follows below):
- first, saved a list of files in the directory recursively to one file, per https://stackoverflow.com/a/5456136/505984
- second, dumped the DB table IDs to another file with the MySQL CLI (because each filename without the extension matches the DB record ID)
- diffed the two files as advised here: https://stackoverflow.com/a/25407317/505984
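
For illustration, a minimal sketch of those three steps; the storage path, database, table, and column names here are placeholders, and `comm` is used for the comparison:

```bash
# All names below (path/to/storage, mydb, files, id) are placeholders.
# 1. Basenames of stored files, with the extension stripped:
find path/to/storage -type f -printf '%f\n' | sed 's/\.[^.]*$//' | sort -u > disk-ids.txt

# 2. Record IDs from the database:
mysql -N -B -e 'SELECT id FROM files' mydb | sort -u > db-ids.txt

# 3. IDs present on disk but not in the DB -> the orphan list:
comm -23 disk-ids.txt db-ids.txt > tmp/orphans
```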
I will concentrate on making the selection because you said:
Assuming the bash shell:
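A minimal sketch of that selection, assuming the storage root is `path/to/storage`, the orphan list is `tmp/orphans`, and the stored files are PDFs (hence the `.pdf` suffix in the clarification below):

```bash
# List every path in the tree whose name matches an orphan ID followed by ".pdf"
find path/to/storage | grep -f <(awk '{print $0".pdf"}' tmp/orphans)
```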
Clarification:

- `find` makes a list of all files in the tree.
- `grep -f somefile` applies a filter to the piped output of `find`.
- `<( something )` is an ephemeral file.
- `awk '{print $0".pdf"}'` appends ".pdf" to every line in the orphans list, so `grep -f` does not match directory names.
- `tmp/orphans` is the input list of orphans.

Your question isn't exactly clear, but assuming you want to
- turn a list of files without extension into a list of patterns that will match files with arbitrary extensions
- pass that list of patterns efficiently to a sequence of `find` commands

then you could try something like this, using `sed` and GNU `parallel` (available from the Ubuntu universe repository):
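A sketch of that kind of invocation; `path/to/data` and `tmp/orphans` are placeholders, and the exact quoting in the original may differ:

```bash
# srcdir: top of the storage tree (placeholder); tmp/orphans: the ID list
export srcdir=path/to/data

# Turn each ID into a glob pattern (10218 -> 10218.*) and let parallel -X pack
# as many "-name <pattern> -o" triples as possible into each find invocation.
sed 's/$/.*/' tmp/orphans |
  parallel -X -I%% --env srcdir \
    find '"$srcdir"' '\(' %% -false '\)' -print \
    ::: -name :::: - ::: -o
```

Here `:::` and `::::` supply the three input sources described below, with `:::: -` reading the `sed` output from standard input.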
With the `-X` option, `parallel` will try to fit as many arguments as possible into each `find` invocation; the arguments are drawn from three input sources:

- the string `-name`
- a line from standard input
- the string `-o`

thus building up an argument list that looks like

`-name '1406.*' -o -name '6179.*' -o -name '17526.*' -o ...`

(which you can verify by adding parallel's `--dry-run` option). The final `-false` predicate mops up the trailing `-o`, and `-I%%` changes parallel's replacement string from the default `{}` so that it does not conflict with find's own use of `{}` in `-exec`.

This is certainly more efficient than running `find` 200,000 times, but it may not be as efficient as running a single `find` command and grepping the result, as suggested in this answer. On my 64-bit Ubuntu 24.04 VM it manages to cram a file of 200,000 integer IDs (generated from the bash `$RANDOM` variable) into approximately 100 invocations of `find`.
If that successfully identifies the files, you can change `-print` to something like

`-exec mv -vnt "$dstdir" {} +`

where, similarly to `srcdir`, you `export dstdir=path/to/newdata` and pass it to `parallel` via `--env dstdir`:
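Again only a sketch, with the same placeholder paths as above and `dstdir` assumed to already exist:

```bash
# dstdir: destination for the orphaned files (placeholder); srcdir exported as before
export dstdir=path/to/newdata

sed 's/$/.*/' tmp/orphans |
  parallel -X -I%% --env srcdir --env dstdir \
    find '"$srcdir"' '\(' %% -false '\)' -exec mv -vnt '"$dstdir"' {} + \
    ::: -name :::: - ::: -o
```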
(you can use `-execdir` in place of `-exec`, provided `$dstdir` is an absolute path).