I would like to have a script that downloads a web page with curl and pipes it to w3m, which strips out everything except the text and links.
Is it possible to specify more than one content type for w3m's -T option, and if so, how?
To clarify my question a bit more, here's an example:
curl --user-agent "Mozilla/4.0" https://askubuntu.com/questions -s | w3m -dump -T text/html
which returns only the text of Ask Ubuntu's questions page, but without the links. If w3m cannot do this, is there any other tool capable of extracting text and links at the same time?
Well, after extensive research of my own, I guess there is no such tool...
However, for what it's worth, I did discover hxnormalize, which made writing a particular script I needed a relatively simple matter.
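For reference, here is a minimal sketch of the kind of pipeline hxnormalize (from the html-xml-utils package) makes possible; the URL is only an example, and hxwls is used here as one way of pulling out the links:
curl --user-agent "Mozilla/4.0" -s https://askubuntu.com/questions > page.html
hxnormalize -x page.html | hxwls       # print the page's link URLs, one per line
w3m -dump -T text/html < page.html     # print the page's plain text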
You can use lynx -dump. It will include a number like [16] before each link, and then a list of the URLs at the end of the document. For pipe usage, you can use lynx -dump -force_html -stdin. However, that will not handle relative links correctly, because lynx doesn't know the original URL. So the best way is to run lynx -dump http://.../ without a separate curl.
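As a sketch (the URL is only an example), the two invocations look like this; -listonly, if your lynx build supports it, restricts the output to the link list:
lynx -dump https://askubuntu.com/questions             # text with numbered links plus a URL list at the end
lynx -dump -listonly https://askubuntu.com/questions   # only the list of link URLs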