I'm trying to parse the contents of an HTML file to scrape a download directory; I've reduced my script to a MWE that reproduces the issue:
sed -e 's|\(href\)|\1|' index.html
prints the entirety of index.html. I originally thought it was an issue with my expression, but this very basic expression proves that wrong.
The same happens if I remove `-e`, or if I add `g` at the end.
It's been a while since I've used sed. Am I doing something wrong here, or is sed getting confused by the characters in an HTML file?
You should use `grep` to find text in a file; `sed` is better for text substitutions. If you want to list the hypertext links, you can simply grep the file like this:
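For example (a sketch, assuming double-quoted `href` attributes and grep's `-o` option, which prints only the matching part of each line):

```shell
# Small sample page standing in for the real index.html
cat > index.html <<'EOF'
<li><a href="file1.zip">file1.zip</a></li>
<li><a href="file2.tar.gz">file2.tar.gz</a></li>
EOF

# -o prints each match on its own line instead of the whole line
grep -o 'href="[^"]*"' index.html
```

This prints `href="file1.zip"` and `href="file2.tar.gz"`, one per line.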
What you've described sounds like the normal behaviour of `sed` used with the command `s` (substitution): your expression replaces `href` with `href`, and sed prints every line of input by default, so the whole file comes out unchanged. I suppose you are looking for something like this:
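A command in that spirit (a sketch, assuming the links are double-quoted `http` URLs, and GNU sed for the `-r` option):

```shell
# Print only the captured URL from lines that contain an http link
printf '<li><a href="http://example.com/a.zip">a.zip</a></li>\n' |
  sed -n -r 's/^.*href="(http.*)".*$/\1/p'
```

This prints `http://example.com/a.zip`. On BSD/macOS sed, use `-E` instead of `-r`.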
Here `/` is used as the delimiter (you could also use `|`, `#`, etc.). The option `-n` (`--quiet`, `--silent`) suppresses automatic printing of the pattern space, so along with it we need an additional command to tell sed what to print. That additional command is the print command `p`, added to the end of the script. If sed isn't started with `-n`, the `p` command duplicates the matching lines in the output.
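That duplication is easy to demonstrate with a minimal sketch:

```shell
# Without -n, the p flag makes the substituted line print twice:
printf 'one href here\n' | sed 's/href/HREF/p'

# With -n, only the p output appears:
printf 'one href here\n' | sed -n 's/href/HREF/p'
```

The first command prints `one HREF here` twice; the second prints it once.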
The option `-r` enables extended regular expressions; without it, the group parentheses must be escaped. The command `s` means substitute: `s#<string-or-regexp>#<replacement>#`.
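Without `-r` the same sketch uses basic regular expressions, where the group parentheses are written `\(` and `\)`:

```shell
# BRE version: \( \) instead of ( ) for the capture group
printf '<li><a href="http://example.com/a.zip">a.zip</a></li>\n' |
  sed -n 's/^.*href="\(http.*\)".*$/\1/p'
```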
`^` matches the beginning of the line and `$` matches the end. Within the search pattern, the capture group `(http.*)` is available in the replacement as the backreference `\1`. Example of usage:
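For instance, against a small sample page (hypothetical filenames):

```shell
# Build a sample index.html, then extract just the link targets
cat > index.html <<'EOF'
<li><a href="http://example.com/a.zip">a.zip</a></li>
<li><a href="http://example.com/b.tar.gz">b.tar.gz</a></li>
EOF

sed -n -r 's/^.*href="(http.*)".*$/\1/p' index.html
```

This prints the two URLs, one per line.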
This may be overly cumbersome, but I think it would work for you, as long as your href contents contain no spaces.
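A sketch of such a pipeline (assuming space-separated attributes with double-quoted values):

```shell
# Sample input; in practice this would be the downloaded index.html
printf '<a href="file1.zip" class="dl">file1.zip</a>\n' > index.html

# grep the lines, split attributes onto lines, keep the href, take what follows "="
grep 'href' index.html | tr ' ' '\n' | grep 'href=' | cut -d'=' -f2
```

This prints `"file1.zip"`, quotes included; piping through `tr -d '"'` would strip them.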
The first `grep` singles out only the lines that contain an href. The `tr` converts spaces to newlines, putting each attribute on its own line. The second `grep` grabs just the `href` section you were interested in. Finally, the `cut` grabs everything after the `href=`.