I have a big HTML file on my desktop that looks like this:
src="http://images.alaablubnan.com/images/Balls/20.jpg"
alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/32.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/32.jpg"
alt="http://images.alaablubnan.com/images/Balls/32.jpg"/></a></td><td><a
href="http://images.alaablubnan.com/images/Balls/30.jpg"
target="_blank"><img
src="http://images.alaablubnan.com/images/Balls/30.jpg"
alt="http://images.alaablubnan.com/images/Balls/30.jpg"/></a></td></tr><tr><td><table><tr><td>webpage/url</td><td>http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.aspx?t=1405536730503&FromDraw=1&ToDraw=1213&Draw=0</td></tr></table></td><td>2</td><td>complete
lotto results</td><td>complete lotto results</td><td>2</td><td><a
href="http://www.playlebanon.com/webservices/website/lotto/PopUps/HistoryDetail.
If possible, I want to:
- extract all the .jpg file names and remove all the HTML code (the files are 1.jpg, 2.jpg ... up to 42.jpg)
- remove the .jpg extension
- put 7 numbers on each row, then insert a new line
This is not actually a particularly good job for `sed`, but here goes: running `sed -nr 's#.*/([^"]+).jpg.*#\1#p'` on the file will get you a list of numbers, one per line.

Now, it is actually possible to get all of these onto rows of 7 numbers each using `sed` alone, but it is really not worth the effort. Just pipe the list through standard *nix tools instead, removing duplicates along the way if you want to.
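As a minimal sketch of that pipeline, assuming GNU `sed` and a made-up sample file in place of the real one: the `sed` expression is the one dissected in the explanation, while `paste` and `uniq` are assumed here as the standard tools, and others would work just as well.

```shell
# Hypothetical stand-in for the real file: every number shows up three
# times (href, src, alt), as in the excerpt from the question.
sample=$(mktemp)
for n in 3 17 20 32 30 41 8; do
  printf 'href="http://images.alaablubnan.com/images/Balls/%s.jpg"\n' "$n"
  printf 'target="_blank"><img\n'
  printf 'src="http://images.alaablubnan.com/images/Balls/%s.jpg"\n' "$n"
  printf 'alt="http://images.alaablubnan.com/images/Balls/%s.jpg"/></a></td><td><a\n' "$n"
done > "$sample"

# One number per line: keep only the file name, minus the .jpg extension.
one_per_line=$(sed -nr 's#.*/([^"]+).jpg.*#\1#p' "$sample")

# Seven numbers per row: paste reads one line from stdin for each "-".
grouped=$(printf '%s\n' "$one_per_line" | paste -d' ' - - - - - - -)

# Same again, but collapsing adjacent repeats first.
deduped=$(printf '%s\n' "$one_per_line" | uniq | paste -d' ' - - - - - - -)

printf '%s\n' "$grouped"
printf '%s\n' "$deduped"
rm -f "$sample"
```

Note that `uniq` only removes adjacent repeats, which fits the href/src/alt triplets here, and that `paste` pads the last row with empty fields when the count is not a multiple of 7.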
Explanation
The `sed` command uses a few tricks:

- `-n`: don't print any lines by default.
- `-r`: enable extended regular expressions. This lets us use `( )` to capture groups without needing to escape the parentheses, and `+` for "one or more".
- `s#from#to#`: while the standard substitution operator in `sed` and other, similar tools is `s/from/to/`, you can use a non-standard delimiter so that you can include `/` in the pattern. In this case I am using `#`, but you could use something else, like `s|from|to|`, as well.
- `s#.*/([^"]+).jpg.*#\1#p`: this matches everything from the beginning of the line up to a `/` and then captures the longest stretch of non-`"` characters before `.jpg`. This is the file name minus its extension. The file name is captured in the parentheses, and the whole line (because of the `.*` on either side) is replaced with the captured pattern (`\1`). The `p` at the end means that it will print the lines where the substitution was successful.

Personally though, I would have done all of this with `perl` in the first place (with a separate variant for a larger file).
Or even `grep`. Or, stealing @Olli's clever `xargs` idea, let `xargs` do the grouping into rows of 7.

I assume you're trying to scrape some sort of result. In this example there are only three balls, and we can extract them by searching for `Balls/<one-or-many-digits>`, grouping (the `\(..\)` construct) around the number, and then replacing the whole lot with that group (the `\1` is a reference to the first group).

`sed` is going through this line by line. I'm asking it to match and replace everything on the line (which is why we cap each end with `.*`, meaning "any amount of anything") with whatever it matches in the group. The `-n` and `/p` are used together so that nothing is printed unless the line was a match. The `/g` means every match on a line is replaced rather than only the first, although with `.*` at both ends there is only one match per line anyway.

It's a fairly complicated example if you're new to regular expressions.
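A minimal sketch of that substitution in basic (escaped) syntax, run against a made-up sample file built from the excerpt in the question. The exact expression is an assumption pieced together from the description.

```shell
# Hypothetical sample: a few lines copied from the excerpt in the question.
sample=$(mktemp)
printf '%s\n' \
  'src="http://images.alaablubnan.com/images/Balls/20.jpg"' \
  'alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a' \
  'href="http://images.alaablubnan.com/images/Balls/32.jpg"' \
  'target="_blank"><img' \
  'src="http://images.alaablubnan.com/images/Balls/32.jpg"' > "$sample"

# \( \) marks the group and \1 refers back to it.
# -n together with the p flag prints only the lines where the substitution happened.
balls=$(sed -n 's/.*Balls\/\([0-9][0-9]*\).*/\1/gp' "$sample")

printf '%s\n' "$balls"
rm -f "$sample"
```

The `target=` line contains no `Balls/`, so it is silently skipped.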
I'm passing it through `uniq` because there's a lot of duplication going on there.

And I'm using `| xargs -n7 echo` on the end to group 7 arguments together and pass them all on to `echo`. There aren't 7 balls here, so it's only showing 3.

You can have a slightly more readable expression if you use the `-r` extended syntax for `sed`. It does the same thing, just without some of the confusing-looking escaping, and is probably ever-so-slightly slower.
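Put together, the whole pipeline with the extended syntax might look like the sketch below. It assumes GNU `sed` (for `-r`) and uses a made-up sample file with the three balls from the excerpt.

```shell
# Hypothetical sample: the three balls from the excerpt, each linked more than once.
sample=$(mktemp)
printf '%s\n' \
  'src="http://images.alaablubnan.com/images/Balls/20.jpg"' \
  'alt="http://images.alaablubnan.com/images/Balls/20.jpg"/></a></td><td><a' \
  'href="http://images.alaablubnan.com/images/Balls/32.jpg"' \
  'target="_blank"><img' \
  'src="http://images.alaablubnan.com/images/Balls/32.jpg"' \
  'alt="http://images.alaablubnan.com/images/Balls/32.jpg"/></a></td><td><a' \
  'href="http://images.alaablubnan.com/images/Balls/30.jpg"' \
  'target="_blank"><img' \
  'src="http://images.alaablubnan.com/images/Balls/30.jpg"' > "$sample"

# Extract the ball numbers, drop adjacent repeats, then print up to 7 per row.
result=$(sed -nr 's/.*Balls\/([0-9]+).*/\1/gp' "$sample" | uniq | xargs -n7 echo)

printf '%s\n' "$result"
rm -f "$sample"
```

With only three distinct balls in the sample, `xargs -n7` has fewer than 7 arguments to hand to `echo`, so everything ends up on a single row.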