I'm trying to parse the contents of an HTML file to scrape a download directory; I've reduced my script to a MWE that reproduces the issue:
sed -e 's|\(href\)|\1|' index.html
prints the entirety of index.html. I originally thought it was an issue with my expression, but this very basic expression proves that wrong.
The same happens if I remove `-e`, or if I add `g` at the end.
It's been a while since I've used sed. Am I doing something wrong here, or is sed getting confused by the characters in an HTML file?
You should use `grep` to find text in a file; `sed` is better for text substitutions. If you want to list the hypertext links, you can simply grep the file like this:
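For example (a sketch, assuming double-quoted `href` attributes and grep's `-o` option, which prints only the matching part of each line):

```shell
# Small sample page standing in for the real index.html
cat > index.html <<'EOF'
<li><a href="file1.zip">file1.zip</a></li>
<li><a href="file2.tar.gz">file2.tar.gz</a></li>
EOF

# -o prints each match on its own line instead of the whole line
grep -o 'href="[^"]*"' index.html
```

This prints `href="file1.zip"` and `href="file2.tar.gz"`, one per line.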
What you've described sounds like the normal behaviour of `sed` used with the command `s` (substitution): your expression replaces `href` with `href`, and sed prints every line of input by default, so the whole file comes out unchanged. I suppose you are looking for something like this:
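A command in that spirit (a sketch, assuming the links are double-quoted `http` URLs, and GNU sed for the `-r` option):

```shell
# Print only the captured URL from lines that contain an http link
printf '<li><a href="http://example.com/a.zip">a.zip</a></li>\n' |
  sed -n -r 's/^.*href="(http.*)".*$/\1/p'
```

This prints `http://example.com/a.zip`. On BSD/macOS sed, use `-E` instead of `-r`.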
Here `/` is used as the delimiter (you could also use `|`, `#`, etc.). The option `-n` (`--quiet`, `--silent`) suppresses automatic printing of the pattern space, so along with it we need an additional command to tell sed what to print. That additional command is the print command `p`, added to the end of the script. If sed isn't started with `-n`, the `p` command duplicates the matching lines in the output.
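That duplication is easy to demonstrate with a minimal sketch:

```shell
# Without -n, the p flag makes the substituted line print twice:
printf 'one href here\n' | sed 's/href/HREF/p'

# With -n, only the p output appears:
printf 'one href here\n' | sed -n 's/href/HREF/p'
```

The first command prints `one HREF here` twice; the second prints it once.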
The option `-r` enables extended regular expressions; without it, the group parentheses must be escaped. The command `s` means substitute: `s#<string-or-regexp>#<replacement>#`.
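Without `-r` the same sketch uses basic regular expressions, where the group parentheses are written `\(` and `\)`:

```shell
# BRE version: \( \) instead of ( ) for the capture group
printf '<li><a href="http://example.com/a.zip">a.zip</a></li>\n' |
  sed -n 's/^.*href="\(http.*\)".*$/\1/p'
```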
`^` matches the beginning of the line and `$` matches the end. Within the search pattern, the capture group `(http.*)` is available in the replacement as the backreference `\1`. Example of usage:
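For instance, against a small sample page (hypothetical filenames):

```shell
# Build a sample index.html, then extract just the link targets
cat > index.html <<'EOF'
<li><a href="http://example.com/a.zip">a.zip</a></li>
<li><a href="http://example.com/b.tar.gz">b.tar.gz</a></li>
EOF

sed -n -r 's/^.*href="(http.*)".*$/\1/p' index.html
```

This prints the two URLs, one per line.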
This may be overly cumbersome, but I think it would work for you, as long as your href contents contain no spaces.
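A sketch of such a pipeline (assuming space-separated attributes with double-quoted values):

```shell
# Sample input; in practice this would be the downloaded index.html
printf '<a href="file1.zip" class="dl">file1.zip</a>\n' > index.html

# grep the lines, split attributes onto lines, keep the href, take what follows "="
grep 'href' index.html | tr ' ' '\n' | grep 'href=' | cut -d'=' -f2
```

This prints `"file1.zip"`, quotes included; piping through `tr -d '"'` would strip them.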
The first `grep` singles out only the lines that contain an href. The `tr` converts spaces to newlines, putting each attribute on its own line. The second `grep` grabs just the `href` section you were interested in. Finally, the `cut` grabs everything after the `href=`.