This question is prompted by a short script I found in a Linux magazine. As evidence that I didn't make this up, here's a picture of it:
I would like to write a letter to the editor of this publication about what's wrong with this and how to write it better.
The script attempts to capture JPEG files into a variable, so that something (compression using lepton) can be done with them.
for jpeg in `echo "$(file $(find ./ ) |
grep JPEG | cut -f 1 -d ':')"`
do
/path/to/command "$jpeg"
...
Apparently in this instance we can't trust the files to be named with a .jpg extension, so we can't catch them with something like
for f in *.JPG *.jpg *.JPEG *.jpeg ; do ...
because the writer has used file to check their type, but if the filenames can't be trusted to have a sensible extension, then I don't see how we can trust them not to be -rf * or (; \ $!| or have newlines or whatever else.
How can I sanely capture files into a variable by type with for or while, or perhaps avoid doing so by using find with -exec, or some other method?
Bonus for insights into and demonstrations of what's wrong with the code in the picture.
I've tagged this question with [bash] since it's about a bash script, but if you feel like answering with a way to do this that doesn't use bash, then please feel free to do that.
0. The script wants to do something like this.
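A minimal sketch of the approach this answer arrives at, reassembled from the pieces explained in sections 3 to 7 (the original listing is not reproduced here, so treat it as a reconstruction). The function names process_table and compress_jpegs are made up for illustration, and lepton is assumed to be in $PATH:

```bash
#!/usr/bin/env bash

# Hypothetical helper: read "path NUL type newline" rows from stdin and
# run the given command once for each path whose type is image/jpeg.
process_table() {
    local mimetype
    while read -rd ''; do            # path -> $REPLY (may contain newlines)
        read -r mimetype || break    # type; default IFS strips the padding
        case "$mimetype" in
            image/jpeg) "$@" "$REPLY" ;;
        esac
    done
}

# Hypothetical wrapper: build the table with find + file, then compress.
compress_jpegs() {
    find . -exec file --mime-type -r0F '' {} + | process_table lepton
}
```

Nothing runs until compress_jpegs is called from the directory to be processed. Passing a different command to process_table (for example printf '%s\n') lists the JPEG paths instead of compressing them.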
The script shown in your question tries to enumerate files and check if they are JPEGs, but does neither reliably. It tries to pass all the paths to file in a single run and extract both filenames and types from the output of file, which is reasonable since it may be faster than running file again and again for each file. But to do it correctly, you need to be careful about how the paths are passed to file, how file delimits its output, and how you consume that output. You can use this:
That's one of several correct ways. (It does not need to set IFS=; see below.) find with + passes multiple path arguments to file and only runs it as many times as necessary to process them all, usually just once. Credit goes to αғsнιη for the idea of passing --mime-type to file to obtain the MIME type, which contains the information you actually want and is easy to parse.
A detailed explanation follows. I've used the specific task of JPEG compression as an example. That's what the script you showed is for, and lepton has some oddities that should be considered in deciding how to improve that script. If you just want to see a script that runs lepton on each JPEG file, you can skip to section 7. Putting It All Together.
1. Installing lepton
The script you showed is meant to traverse a directory hierarchy, find JPEG images, and process them with the lossless JPEG compressor lepton. For the main motivation of your question, the command may not really matter, but different commands have different syntax. Some commands accept multiple input filenames for a single run. Most accept -- to indicate the end of options. I'll use lepton as my example. The lepton command doesn't accept multiple input filenames and doesn't recognize --.
To use lepton, install it first. It's officially packaged for Ubuntu 17.04 and later (sudo apt install lepton). For earlier Ubuntu releases, or to use a newer version than is packaged for your release, clone its git repository (git clone https://github.com/dropbox/lepton.git) and build the source as instructed in the README. Or you might be able to find a PPA.
Depending how you install it, lepton may be in /usr/bin, /usr/local/bin, or elsewhere. Probably you will want it somewhere in $PATH; then you can run it as lepton. The script you showed uses absolute paths to lepton and the standard utilities mv and rm, but not to the other standard utilities file, find, grep and cut. (This is Bash, so echo--pointless in that script anyway--is a shell builtin. exit is always a builtin.) Though this isn't one of the script's serious flaws, there's no discernible reason for such inconsistency. Unless you're writing a script to tolerate not having $PATH set sensibly--in which case you must use absolute paths for all external commands--I suggest using plain command names, resolved through $PATH, for standard commands and those you've installed.
2. Running lepton
Cautions and General Information
I tested with lepton v1.0-1.2.1-104-g209463a (from Git). lepton was released back in July 2016 so I'd guess the current syntax will keep working. But future versions may add features. If you're reading this years from now, you might check if lepton has added support for tasks that once required scripting.
Please be careful what command-line arguments you pass. For example, I tried running lepton with -verbose as the first argument and art.jpg as the second. It interpreted -verbose as an input filename and quit with an error, but not before truncating art.jpg--which it interpreted as an output filename--down to zero bytes. Fortunately I had a backup!
You can pass zero, one, or two paths to lepton. In all cases, it examines its input file or stream to see if it contains JPEG or Lepton data. JPEG is compressed to Lepton; Lepton is decompressed to JPEG. lepton will remove and add file extensions but doesn't use them to decide what to do.
Zero Filenames: lepton - reads from stdin and writes to stdout.
Thus lepton - < infile > outfile is one way to read from infile and write to outfile, even if their names start with - (like options do). But the method I'll use passes paths that start with ., so I won't have to worry about this.
One Filename: lepton infile reads infile and names its own output file.
This is how the script you showed uses lepton.
If the content of infile looks like a JPEG, lepton outputs a Lepton file; if its content looks like a Lepton file, lepton outputs a JPEG. lepton decides how it wants to name its output file by stripping an extension from infile, if any, and adding either a .jpg or .lep extension depending on what kind of file it is creating. But it does not use the extension it is removing (if any) to infer the type of file it is operating on.
It considers the last . and anything after it as an extension. If infile is a.b.c, you get a.b.lep or a.b.jpg. If the filename starts with a . with no other .s, lepton still regards that as an extension: from a JPEG called .abc you get .lep. Only . in the filename--not directory names--triggers this, so from a Lepton file x/fo.o/abc you get x/fo.o/abc.jpg (which you want), not x/fo.jpg (which would be bad).
If the output filename obtained this way names an existing file, _s are added to the end, after the extension, until it doesn't, and the name with added underscores is used: abc.lep, abc.lep_, abc.lep__, etc., xyz.jpg, xyz.jpg_, xyz.jpg__, etc.
This works best when your files are named in a sensible way.
Automatically removing and adding extensions and adding underscores avoids a problem you'd otherwise have to manage yourself--preventing data loss when the output file already exists. But it also exposes what might be a deep design flaw in the script you showed. If your files are named sensibly, then all your JPEG files end in .jpg or .jpeg (maybe capitalized), and no non-JPEG files are so named. But then you don't have to examine the files with file to find out which ones are JPEGs!
Thus the premise of the script you showed is that files might not be named reasonably. It's always bad for a script to behave wrongly or unexpectedly on filenames containing spaces, *, and other special characters. So its behavior of splitting on whitespace and expanding globs (the outer unquoted command substitution, intended just to split separate filenames, does this) is especially bad. See Byte Commander's excellent answer for details. This is probably the worst flaw in the script you showed.
But it's also worth considering what happens to filenames whose last . doesn't conceptually begin a file extension. Suppose Pictures has four files, all JPEGs: 01. Milan wide-angle sunset, 01. Milan wide-angle sunset highres, 02. Kyle birthday party prep - blooper cakes, and 03. The subtle found art of unopened expired paint cans with peeling labels. Then for f in ~/Pictures/0*; do lepton "$f"; done creates 01.lep, 01.lep_, 02.lep, and 03.lep--probably not what you want.
If you have JPEGs not named .jpg or maybe .jpeg, the best general approach is to rename them that way and investigate any naming conflicts that arise while doing so. But that's beyond the scope of this answer.
Those renaming problems happen with JPEGs not named like JPEGs, not non-JPEGs named like JPEGs. Yet even then, there may be a better solution. If the problem is ._ files from macOS and you don't want to delete them, just exclude files with a leading ._ (or even a leading .). Still, passing just one path to lepton avoids data loss (due to its _ appending rules); if the main goal is to exclude non-JPEGs, the basic idea is sound even though the implementation needs fixing.
So I'll use the one-path lepton infile syntax. But anyone who considers automating lepton like this on strangely named files should remember the generated .lep files may be named in ways that don't reveal the input filenames.
Two Filenames: lepton infile outfile does exactly what you expect.
But just because you expect it doesn't make it the right thing to do.
As with the other ways to run lepton, lepton determines whether infile is a JPEG to be compressed or a Lepton file to be decompressed by examining its content. If infile is a JPEG, lepton writes a Lepton file named outfile; if infile is a Lepton file, lepton writes a JPEG named outfile. With this two-path syntax, lepton doesn't change your specified output filename in any way. It doesn't add or remove extensions or append _s to resolve naming conflicts. If outfile already exists, it is overwritten.
You may want that, but if not and you use this syntax then you have to solve the problem yourself by making your script adjust the output filenames. You may be able to do this in a way that serves you better than lepton's own scheme when run with just one path argument. But I won't try to guess your specific needs and preferences; I'll just use the one-path syntax.
3. Passing Multiple Paths From find to file
The script you showed tries to use file $(find ./ ) to pass one path per argument to file by running find in command substitution. This often won't work, because $(find ./ ) splits on whitespace, which filenames can contain. It is common for files--especially images!--and folders to have spaces in their names. The script you showed treats a path ./abc/foo bar.jpg as two paths, ./abc/foo and bar.jpg. In the best case, neither exists; if they do, you unintentionally operate on the wrong thing. And the original path won't be processed at all.
Although the breadth of this problem can be lessened by setting IFS=$'\n' so word splitting is only performed between lines (\n represents a newline character), this isn't a good solution. Besides being awkward, it can still fail, as file and directory names may contain newlines. I advise against naming files or directories with them except to test programs or scripts for bugs. But such names can be created, including by accident where you don't expect them. The only characters a filename cannot contain are the path separator / and the null character. The null character is thus the only one that can't appear in a path and the only safe choice to delimit lists of arbitrary paths. That's why find has a -print0 action and xargs has a -0 option.
This can be done correctly with find . -print0 | xargs -0 ... but you don't need a third utility to pass paths from find to file. find's -exec action is sufficient. Arguments after -exec build the command to run, until \; or +. find ... -exec ... \; runs a command once per file, while find ... -exec ... + passes the command as many paths as it can per run, which is usually faster. Typically all the arguments fit and the command runs just once. In rare cases the command line would be too long and find runs the command more than once. So the + form is only safe for running commands that (a) take their path arguments at the end and (b) work the same in one run with multiple filenames as they do in separate runs. lepton is an example of a command that must not be run using the + form of -exec because it does not accept multiple source filenames. The first would be the input, the second would be the output, and others would be excessive. But many commands do do the same thing when run once with several arguments as when run several times with one argument, and file is one of them.
This command will generate the table:
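Reassembled from the option notes in section 4, the command is presumably the one inside the function below. The wrapper name make_table is only for illustration, and the file options assume the file utility as shipped with Ubuntu:

```bash
#!/usr/bin/env bash

# Hypothetical wrapper around the table-generating command.
# $1 is the directory to scan (defaults to the current directory).
# Each row is: path, null character, padded MIME type, newline.
make_table() {
    find "${1:-.}" -exec file --mime-type -r0F '' {} +
}
```

To eyeball the rows in a terminal, make_table | tr '\0' '|' shows a | where each null character sits.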
find replaces the {} argument with a path when it invokes file, and replaces + with as many additional path arguments as will fit.
The options --mime-type -r0F '' passed to file are explained below.
Some people quote {}, e.g., '{}'. It's fine to do so, but neither Bash nor other Bourne-style shells require it. Bash and some other shells support brace expansion, but an empty pair of braces is not expanded. I choose not to quote {}, in light of the misconception that quoting {} prevents find from performing word splitting. Even if your shell required {} to be quoted, this would still have nothing to do with word splitting, because find never does that. (If you wanted word splitting, you'd have to tell find to -exec a shell.) And find can't tell if you've written {} or '{}'--the shell turns '{}' into {} (during quote removal) before passing it to find.
4. Emitting a Usable ⟨Path, File Type⟩ Table with file
The Problem
The reason I must pass some options to file--and can't just use find . -exec file {} +--is that the table file generates by default is ambiguous:
Those three rows look like four; one filename contains a newline. Filenames can also contain colons, so it won't always be clear where the filename ends. Way more confusing examples than shown above are possible.
The description column also has way more information than we need. Byte Commander explains one reason grepping for JPEG in each whole row returns wrong results: a non-JPEG file with JPEG in its name gives a false positive. (The point of checking the type is that you can't rely on the name, so this is quite a self-defeating bug in the script you showed.) But even when you know you're looking in the description column, it may still contain JPEG even if that's not the type:
Byte Commander's answer solved this by (a) passing the -b option to file, causing it to omit the paths, : separator, and spaces in front of the type, then (b) using grep to check if the description begins with JPEG (the ^ anchor in the pattern ^JPEG image data, does this). This works if you keep track of the paths passed to file--not a problem for Byte Commander's method, which ran file separately for each path anyway.
The Solution
I must use a different solution, because my goal is to parse both paths and types from file's output so that file needn't be run separately for each file. Fortunately file in Ubuntu has many options. I use file --mime-type -r0F '' paths:
--mime-type prints a MIME type rather than a detailed description. This is all I need, and then I can just perform an exact match against the whole thing. For a JPEG, file --mime-type shows image/jpeg in the description column. (See also αғsнιη's answer.)
According to man file, -r causes unprintable characters not to be replaced with octal escapes like \003. I believe I would otherwise need to add a step to convert such sequences back to the actual characters, which probably can't be done reliably--what if such a sequence appears literally in a filename? (file doesn't escape \ as \\.) I say "I believe" as I haven't managed to get file to print out such an escape sequence, and I'm not sure it really does so in the filename column. Either way, -r is safe here.
-0 is the key option here. Without it, this method couldn't work reliably. It makes file print a null character--the one character that is never allowed in paths because it is usually used to mark the ends of strings in C programs--immediately after the filename. This marks the break, in each row, between the two columns of the table.
-F '' makes file print nothing ('' is an empty argument) instead of :. The colon is unreliable (it can appear in filenames) and of no benefit here since a null character is already being printed to indicate the end of the path column and the start of the description column.
To make find run file --mime-type -r0F '' paths I use -exec file --mime-type -r0F '' {} +. find's -exec action replaces {} + with the paths.
5. Consuming the Table
I created the table with find . -exec file --mime-type -r0F '' {} +.
As detailed above, this places a null character after each path. It would be handy if the description were also null-terminated, but file won't do that--the description always ends with a newline. So I must alternately read until a null character, then assume there is more text and read it until a newline. I must do this for each file and stop when nothing is left.
Reading Each Row
That combination--read text that may contain a newline until a null character, then read text that can't contain a newline until a newline--isn't how any of the common Unix utilities are normally used. The approach I will take is to pipe the output of find to a loop. Each iteration of the loop reads a single row of the table by using the read shell builtin twice, with different options.
To read the path, I use:
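Judging from the option-by-option notes that follow, the command is read -rd ''. A small sketch of it on a hand-made row rather than real file output (the function name first_path and the sample path are invented for the demonstration):

```bash
#!/usr/bin/env bash

# Hypothetical demo: read one null-terminated path from stdin into $REPLY
# and print it. -r keeps backslashes; -d '' stops at the null character.
first_path() {
    read -rd '' && printf '%s\n' "$REPLY"
}

# The sample path contains a newline, which a plain `read` would stop at.
printf 'dir/a\nb.jpg\0   image/jpeg\n' | first_path
```

The path comes back whole, newline included, and the type column is left unread on the stream.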
-r is read's only standard option and you should almost always use it. Without it, read treats backslashes in the input as escape characters and removes them instead of keeping them literally. We don't want that.
Normally read reads until it sees a newline. To ignore newlines and stop at a null character instead, I use the -d option, which Bash provides, to specify a different character. For a null character, pass the empty argument ''.
This already depends on Bash (for the -d option), so I may as well avail myself of Bash's default behavior when no variable name is passed to read. It puts everything it read--except the terminating character--in the special variable $REPLY. Normally read strips whitespace ($IFS characters) from the beginning and end of the input, and it's a common idiom to write IFS= read ... to prevent that. When reading implicitly to $REPLY in Bash, this is not necessary.
To read the description, I use:
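From the notes that follow, that command is read -r mimetype. A sketch showing both reads on one hand-made row (the function name parse_row and the sample data are assumptions for illustration):

```bash
#!/usr/bin/env bash

# Hypothetical demo: parse one "path NUL padded-type newline" row from stdin.
parse_row() {
    local mimetype
    read -rd '' || return 1          # path, up to the null character
    read -r mimetype || return 1     # rest of the row; padding is stripped
    printf 'path=%s type=%s\n' "$REPLY" "$mimetype"
}

printf './my photo\0      image/jpeg\n' | parse_row
```

Because no IFS= prefix is used on the second read, the spaces in front of image/jpeg are dropped.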
As before, pass -r to read unless you want backslashes treated as escape characters.
This time I name a variable to read into, mimetype.
The absence of IFS= to prevent leading and trailing whitespace from being stripped is significant. I want it removed. This drops the spaces from the beginning of the description that file writes to make the table more human-readable when it is shown in a terminal.
Composing the Loop
The loop should continue as long as there is another path to be read. The read command returns true (in shell programming this is zero, unlike almost all other programming languages) when it successfully reads something, and false (in shell programming, any nonzero value) when it doesn't. So the common while read idiom is useful here. I pipe (|) the output of find--which is the output of one or (rarely) more file commands--to the while loop.
Inside the loop, I read the rest of the row to obtain the description (read -r mimetype). I don't bother checking if this succeeded. file should only ever output complete rows even if it encounters errors. (file sends error and warning messages to standard error, so they won't appear in the pipeline to corrupt the table.) You should be able to rely on this.
If you want to check if read -r mimetype succeeded anyway, you can use if. Or you can include it in the while loop condition:
You can see I also split the top line for readability. (No \ is required to split at |.)
Testing the Loop
If you want to test the loop before proceeding, you can put this command under (or instead of) the # Commands... comment:
The loop output looks something like this, depending on what you have in the directory (and I have left out most entries, for brevity):
This is just to see if the loop works right. Placing the table's entries in [ ] like this wouldn't help the script do what it needs to do, as paths may contain [, ], and consecutive newlines.
6. Using the Extracted Path and File Type
In each iteration of the loop, "$REPLY" contains the path and "$mimetype" contains the type description. To find out if "$REPLY" names a JPEG file, check if "$mimetype" is exactly image/jpeg.
You can compare strings using if and [/test (or [[) with =. But I prefer case:
If you just wanted to show the JPEGs' paths in the same format as above--to help test with paths containing newlines--the entire case...esac statement could be:
But the goal is to run lepton on each JPEG file. To do that, use:
7. Putting It All Together
Adding that lepton command, and a hashbang line to run it with Bash, here's the complete script:
lepton reports what it is doing but it doesn't show filenames. This alternative script prints a message with each path before running lepton on it:
I've printed the messages to standard error (>&2), since that's where lepton sends its own messages. That way, the output all stays together when piped or redirected. Running that script produces output like this (but more of it if you have more than two JPEGs):
The repetition in each stanza--which also appears when you run lepton without printing filenames--is because lepton checks that its output files can decompress correctly.
The script you showed had exit 0 at the end. You can do that if you like. It causes the script to always report success. Otherwise the script returns the exit status of the last command run--which is probably preferable. Either way, it may report success even if find, file, or lepton encountered problems, if the last lepton command succeeded. You can, of course, expand the script with more sophisticated error handling code.
8. Maybe You Want The Paths, Too
If you want to generate a list of paths separate from lepton's own output, you can take advantage of lepton's behavior of writing to standard error by printing the paths to standard output instead. In that case, you probably want to print just the paths and not a "Processing" message. You may optionally want to terminate the paths with null characters instead of newlines, as then you can process the list without breaking on paths that contain newlines.
When you run that script, you can pass the -0 flag to make it emit null characters instead of newlines. That script does not do proper Unix-style option processing: it only checks the first argument you pass; passing the flag repeatedly in the same argument (-00) doesn't work; and no option-related error messages are ever generated. This limitation is for brevity, and because you probably don't need anything more sophisticated, as the script doesn't support any non-option arguments and -0 is the only possible option.
On my system I called that script jpeg-lep3 and put it in ~/source, then ran ~/source/jpeg-lep3 -0 > out, which printed just lepton's output to my terminal. If you do something like that, you can test that null characters were properly written between paths using:
Code first:
Let's do this with Bash's special globs and a for loop:
Explanation:
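The listing itself is missing here. A sketch reconstructed from the explanation that follows might look like this; the function names and the use of printf as the action are placeholders, not the original code:

```bash
#!/usr/bin/env bash

# Succeeds if file's brief description of "$1" starts with "JPEG image data,".
is_jpeg() {
    file -b -- "$1" | grep -q '^JPEG image data,'
}

# Hypothetical wrapper: print every JPEG below the current directory.
# The body runs in a subshell so the shopt settings do not leak out.
list_jpegs() (
    shopt -s globstar dotglob    # ** recurses; globs also match dotfiles
    for f in ./**; do
        if is_jpeg "$f"; then
            printf '%s\n' "$f"   # replace with any command using "$f"
        fi
    done
)
```

Note that this runs file once per path, which is simpler than parsing a combined table but slower on large trees.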
First of all, we need to make the Bash globs more useful by enabling the globstar and dotglob shell options. Here is their description from man bash in the SHELL BUILTIN COMMANDS section about shopt:
Then we use this new "recursive glob" ./** in a for loop to iterate over all files and folders inside the current directory and all its subdirectories. Please always use absolute paths or explicit relative paths starting with a ./ or ../ in your globs, not just **, to prevent problems with special file names like ~.
Now we test each file (and folder) name with the file command for its contents. The -b option prevents it from printing the file name again before the content information string, which makes filtering safer.
Now we know that the content information of all valid JPG/JPEG files must start with the string JPEG image data, (with the trailing comma), which is what we test the output of file for with grep. We use the -q option to suppress any output, as we are only interested in grep's exit code, which indicates if the pattern matched or not.
If it matched, the code inside the if/then block will be executed. We can do anything we want in here. The current JPEG filename is available in the shell variable $f. We just have to make sure to always put it in double quotes to prevent the accidental evaluation of filenames with special characters like spaces, newlines, or symbols. It is also usually best to separate it from other arguments by placing it after --, which causes most commands to interpret it as a filename even if it's something like -v or --help that would otherwise be interpreted as an option.
Bonus question:
Time to blow up some code, for science! Here is the version from your question:
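For reference, the loop header from the question appears in the first comment below. The rest is a harmless stand-in, assuming a throwaway temporary directory and a made-up filename, that counts how many words the unquoted $(find ...) produces for a single file:

```bash
#!/usr/bin/env bash

# From the question:
#   for jpeg in `echo "$(file $(find ./ ) | grep JPEG | cut -f 1 -d ':')"`

# Count the words that an unquoted $(find ...) yields in directory "$1".
# The body runs in a subshell so cd and set -f stay local.
count_split_words() (
    cd -- "$1" || exit 1
    set -f                       # keep glob expansion out of this demo
    n=0
    for word in $(find . -type f); do
        n=$((n + 1))
    done
    printf '%s\n' "$n"
)

demo=$(mktemp -d)
: > "$demo/foo bar.jpg"          # one file, with a space in its name
count_split_words "$demo"        # prints 2: "./foo" and "bar.jpg"
rm -rf -- "$demo"
```

Without set -f the situation is worse still, since each resulting word would also be subject to glob expansion.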
First of all, allow me to mention how needlessly complex it is. We have 4 levels of nested subshells, using mixed command substitution syntaxes (`` and $()), which are only necessary because of the incorrect/suboptimal usage of find.
Here find just lists all files and prints their names, one per line. Then the full output is passed to file to examine each of them. But wait! One file name per line? What about file names containing newlines? Right, those will break it!
Actually even simple spaces break it too, because the shell's word splitting treats them as separators before file ever sees the arguments. You can't even quote the "$(find ./ )" here as a remedy, because that would then quote the whole multi-line output as one single filename argument.
Next step, the file output gets scanned with grep JPEG. Don't you think it's a bit easy to trick such a simple pattern, especially as the output of plain file always contains the file name as well? Basically everything with "JPEG" in its file name will trigger a match, no matter what it contains.
Okay, so we have the file output of all JPEG files (or those that pretend to be one), now they process all lines with cut to extract the original file name from the first column, separated by a colon... Guess what, let's try this on a file with a colon in its name:
So to conclude, the approach from the magazine works, but only if all files it checks do not contain any spaces, newlines, colons and probably other special characters and do not contain the string "JPEG" anywhere in their filenames. It is also kind of ugly, but as beauty lies in the eye of the beholder, I'm not going to ramble about that.
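The colon problem mentioned above can be reproduced without any real image. This sketch feeds cut a hand-written line in the shape of file's default output for a hypothetical file named my:photo.jpg:

```bash
#!/usr/bin/env bash

# Extract the "filename" column the way the magazine script does.
name_column() {
    cut -f 1 -d ':'
}

# A made-up row in file's default "name: description" layout.
printf '%s\n' './my:photo.jpg: JPEG image data, JFIF standard 1.01' | name_column
# prints ./my   (the rest of the name is lost)
```

The loop would then hand ./my to the compression command, a path that either does not exist or names the wrong file.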
You can use find and check each file's MIME type with the file command as well.
Or, to make it complete, as follows:
Or use the identify command from the ImageMagick package.