I'm trying to download two sites for inclusion on a CD:
http://boinc.berkeley.edu/trac/wiki
http://www.boinc-wiki.info
The problem I'm having is that these are both wikis. So when downloading with e.g.:
wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
I get a lot of unwanted files because it also follows links like ...?action=edit and ...?action=diff&version=...
Does somebody know a way to get around this?
I just want the current pages, without images, and without diffs etc.
P.S.:
wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble :/
P.P.S.:
I got what appears to be the most relevant pages with:
wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
(--reject-regex uses POSIX regular expressions by default; see the --regex-type option.) This works only for recent (>= 1.14) versions of wget, though, according to other comments. Beware that it seems you can use --reject-regex only once per wget call. That is, you have to join alternatives with | in a single regex if you want to filter on several patterns:
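For example, something like this should skip edit, diff, and history links in one call (a sketch, untested; adjust the action names to the wiki at hand):
wget -r -k -np -nv --reject-regex 'action=(edit|diff|history)' http://www.boinc-wiki.info/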
It looks like this functionality has been on the table for a while and nothing has been done with it.
I haven't used it, but httrack looks like it has a more robust filtering feature set than wget and may be a better fit for what you're looking for (read about filters here: http://www.httrack.com/html/fcguide.html).
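I haven't verified it, but judging from that filter guide, an equivalent invocation might look like this (the filters are my guess at the syntax):
httrack 'http://www.boinc-wiki.info/' -O ./boinc-wiki '-*action=*' '-*.jpg' '-*.jpeg' '-*.gif' '-*.png' '-*.tif'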
The new version of wget (v1.14) solves all these problems. You have to use the new option --reject-regex=... to handle query strings. Note that I couldn't find a manual that includes these new options, so you have to use the help command:
wget --help > help.txt
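From that help output, a minimal invocation that skips every URL carrying an action= query parameter might be (a sketch, untested):
wget -r -k -np -nv --reject-regex '[?&]action=' http://www.boinc-wiki.info/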
Pavuk should be able to do it:
http://pavuk.sourceforge.net/man.html#sect39
MediaWiki example:
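Going by the -skip_url_pattern option in the linked manual, such a command might look like this (a sketch, untested; the exact patterns are my assumption):
pavuk -skip_url_pattern '*action=*' -skip_url_pattern '*diff=*' http://www.boinc-wiki.info/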
It looks like you are trying to avoid downloading the special pages of MediaWiki. I solved this problem once by avoiding the index.php page (see the sketch below). However, that wiki used URLs as seen on Wikipedia (http://<wiki>/en/Theme) and not the pattern I have seen in other places (http://<wiki>/index.php?title=Theme). Since the link you gave uses URLs in the Wikipedia pattern, though, I think this solution can work for you too.
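A minimal sketch of that approach, using wget's pattern-based rejection described in the next answer (the exact pattern is my assumption):
wget -r -k -np -nv -R 'index.php*' http://<wiki>/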
The documentation for wget says:

‘-R rejlist --reject rejlist’
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

Patterns are probably what you want. I am not sure how sophisticated the patterns are, but you can either try to accept only certain files or block the unwanted ones.

Block:
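My guess at a suitable reject pattern (untested; per the quoted docs, the wildcard makes it a pattern rather than a suffix):
wget -r -k -np -nv -R '*action=*' http://www.boinc-wiki.info/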
Accept:
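Again a guess; accepting by suffix only helps if the pages you want share one, and wiki pages here mostly have no file extension, so this is probably the weaker option:
wget -r -k -np -nv -A '*.html,*.htm' http://www.boinc-wiki.info/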
Edit: never mind this, in light of the other post.