I would like to have a script that downloads a web page with curl and pipes it to w3m, which strips out everything except the text and links.
Is it possible to specify more than one content type for w3m's -T option, and if so, how?
To clarify my question a bit more, here's an example:
curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html
which returns only the text of Ask Ubuntu's questions page, but without the links. If w3m cannot do this, is there any other tool capable of extracting text and links at the same time?
Well, after extensive research of my own, I guess there is no such tool...
However, for what it's worth, I did discover hxnormalize, which made writing a particular script I needed a relatively simple matter.
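For reference, here is a minimal sketch of the kind of pipeline hxnormalize (from the html-xml-utils package) makes possible; the URL is only an example, and hxwls is used here as one way of pulling out the links:
curl --user-agent "Mozilla/4.0" -s https://askubuntu.com/questions > page.html
hxnormalize -x page.html | hxwls       # print the page's link URLs, one per line
w3m -dump -T text/html < page.html     # print the page's plain text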
You can use lynx -dump. It will include a number like [16] before each link, and then a list of the URLs at the end of the document. For pipe usage, you can use lynx -dump -force_html -stdin. However, that will not handle relative links correctly, because lynx doesn't know the original URL. So the best way is to run lynx -dump http://.../ without a separate curl.
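As a sketch (the URL is only an example), the two invocations look like this; -listonly, if your lynx build supports it, restricts the output to the link list:
lynx -dump https://askubuntu.com/questions             # text with numbered links plus a URL list at the end
lynx -dump -listonly https://askubuntu.com/questions   # only the list of link URLs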