I'm using the wget program, but I want it not to save the HTML file I'm downloading. I want it to be discarded after it is received. How do I do that?
You can redirect the output of wget to /dev/null (or NUL on Windows):
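For example, something like this should work (the URL here is just a placeholder):
$ wget http://www.example.com -O /dev/null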
The file won't be written to disk, but it will be downloaded.
If you don't want to save the file, and you have accepted the solution of downloading the page to /dev/null, I suppose you are not using wget to fetch and parse the page contents. If your real need is to trigger some remote action, check that the page exists, and so on, then it would be better to avoid downloading the HTML body at all.
Play with wget options in order to retrieve only what you really need, i.e. HTTP headers, request status, etc. Assuming you need to check that the page is OK (i.e. the status returned is 200), you can do the following:
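Something along these lines (the host and path are placeholders):
$ wget --no-cache --spider http://your.server.tld/your/page.html
--spider makes wget behave like a web spider: it checks that the page is there without downloading the body, and the exit status tells you whether the request succeeded.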
If you want to parse the headers returned by the server, do the following:
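For example (again with a placeholder URL):
$ wget --no-cache -S --spider http://your.server.tld/your/page.html
-S (--server-response) prints the HTTP headers sent back by the server, so you can grep them for whatever you need.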
See the wget man page for further options to play with.
See lynx too, as an alternative to wget.

In case you also want to print the result in the console, you can do:
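A minimal example, with a placeholder URL:
$ wget -qO- http://www.example.com
-O- writes the downloaded page to standard output instead of a file, and -q suppresses wget's own progress output.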
$ wget http://www.somewebsite.com -O foo.html --delete-after
Another alternative is to use a tool like curl, which by default outputs the remote content to stdout instead of saving it to a file.

Check out the "--spider" option. I use it to make sure my web sites are up and to send me an email if they're not. This is a typical entry from my crontab:
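Something along these lines (the schedule, URL, and message are placeholders; cron mails a job's output to its owner, which is what produces the email):
46 */2 * * * if ! wget -q --spider http://www.example.com/ >/dev/null 2>&1; then echo "www.example.com is down"; fi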
If you need to crawl a website using wget and want to minimize disk churn...
For a *NIX box and using wget, I suggest skipping writing to a file. I noticed on my Ubuntu 10.04 box that wget -O /dev/null caused wget to abort downloads after the first download. I also noticed that wget -O real-file causes wget to forget the actual links on the page. It insists on an index.html being present on each page. Such pages may not always be present, and wget will not remember links it has seen previously.

For crawling without writing to disk, the best I came up with is the following:
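A sketch of that approach, assuming a Linux box where /dev/shm is a tmpfs mount (the directory name, crawl options, and URL are placeholders; add whatever recursion flags you actually need):
$ mkdir /dev/shm/1
$ cd /dev/shm/1
$ wget --recursive --no-parent http://www.example.com/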
Notice there is no -O file option. wget will write to the $PWD directory. In this case that is a RAM-only tmpfs file system. Writing here should bypass disk churn (depending upon swap space) AND keep track of all links. This should crawl the entire website successfully.

Afterward, of course, clean up by removing the tmpfs directory.
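For example, with the placeholder directory used above:
$ cd / && rm -rf /dev/shm/1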
Use the --delete-after option, which deletes the file after it is downloaded.
Edit: Oops, I just noticed that has already been answered.
According to the help doc (wget -h), you can use the --spider option to skip the download (version 1.14).
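For example (the URL is a placeholder; --no-check-certificate is only needed if the site's TLS certificate can't be verified):
$ wget --spider http://www.example.com
$ wget --spider --no-check-certificate https://www.example.com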