I need a bash command to delete the entire file if the file itself begins with <html>. I'm not sure of the best way to go about this...
Context: I download a series of files via curl requests. Most of the time the downloads and processing work fine, but sometimes a download request results in a 404 for whatever reason. When that happens, the contents of the downloaded file begin with an <html> tag. When the rest of my processing hits such a file, it hangs. So I want to run a command before my other processing to cat each of the files and delete the file if it starts with this <html> tag.
To address the question that prompted you to ask this one, rather than the one you actually asked:
curl can tell you the status code in addition to downloading the file. You do not need to check the file's contents for that. An example of how to check the status is
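A minimal sketch of such a status check, assuming a hypothetical helper named fetch_ok and that only a 200 response counts as success. curl's -w '%{http_code}' prints the response code after the transfer, while -o sends the body to the file:

```bash
#!/usr/bin/env bash
# fetch_ok URL OUTFILE  (hypothetical helper name, not from the original answer)
# Downloads URL to OUTFILE and keeps the file only if the status code is 200.
fetch_ok() {
    local url=$1 out=$2 status
    # -s: no progress output, -o: write the body to OUTFILE,
    # -w '%{http_code}': print the response status code on stdout
    status=$(curl -s -o "$out" -w '%{http_code}' "$url")
    if [ "$status" != "200" ]; then
        echo "skipping $url (status: ${status:-none})" >&2
        rm -f -- "$out"   # drop the error page so later processing never sees it
        return 1
    fi
}
```

Called as fetch_ok "$url" "$file" || continue inside the download loop, this skips failed downloads before the rest of the processing runs.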
The various options you can use with -w are documented in the manual. Depending on your needs, you may want to extend this to output more information and parse it, and/or change the check of the status code to accept more than just 200.
You could use a find command to delete all files containing the <html> pattern in the first line. I just tested this, it works.
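A minimal sketch of that idea, not necessarily the exact command that was tested. The function names remove_html_find and remove_html_loop are made up for illustration, and a file counts as bad when its first line starts with <html>. The first function assumes a find that supports -maxdepth and -delete (GNU and BSD find do). The second is a plain glob loop for comparison with the loop-based answer that follows; it assumes nullglob is the shell option that matters for empty directories:

```bash
#!/usr/bin/env bash

# Variant 1: let find test the first line of each regular file in DIR.
remove_html_find() {
    local dir=${1:-.}
    # -exec ... \; acts as a test here: -print and -delete only run
    # for files where grep matched the first line.
    find "$dir" -maxdepth 1 -type f \
        -exec sh -c 'head -n 1 "$1" | grep -q "^<html>"' sh {} \; \
        -print -delete
}

# Variant 2: a glob loop. The body runs in a subshell ( ... ) so that
# nullglob does not leak into the calling shell.
remove_html_loop() (
    dir=${1:-.}
    shopt -s nullglob   # assumption: an empty directory should yield zero iterations
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # read returns non-zero on a missing final newline but still fills $first
        IFS= read -r first < "$f" || true
        if [[ $first == '<html>'* ]]; then
            rm -- "$f"   # rm -i -- "$f" would ask before each removal
        fi
    done
)
```

Both take the directory to clean as their only argument, for example remove_html_find ./downloads. Running the find variant without -delete first prints what would be removed.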
Run shopt first because we don't want to parse ls, then use a simple bash for loop to find files that begin with <html> and remove them. It would be safer to use rm -i to have rm ask before removing any files, just in case.
Note that shopt isn't strictly needed, but it prevents certain issues from occurring if the directory is empty or there happens to be a file with an asterisk in its name.
Not every automation task should be done with shell. Here is a Python script instead
Maybe it is more verbose than the equivalent bash commands, but it is