I'm trying to download two sites for inclusion on a CD:
http://boinc.berkeley.edu/trac/wiki
http://www.boinc-wiki.info
The problem I'm having is that these are both wikis. So when downloading with e.g.:
wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/
I get a lot of unwanted files because it also follows links like ...?action=edit and ...?action=diff&version=...
Does somebody know a way to get around this?
I just want the current pages, without images, and without diffs etc.
P.S.:
wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex
This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble :/
P.P.S.:
I got what appears to be the most relevant pages with:
wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info
(--reject-regex uses POSIX regular expressions by default; see the --regex-type option.) This works only for recent (>= 1.14) versions of wget, though, according to other comments. Beware that it seems you can use --reject-regex only once per wget call. That is, you have to join alternatives with | in a single regex if you want to filter on several patterns:
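For example, something like this should skip edit, diff, and history links in one call (a sketch, untested; adjust the action names to the wiki at hand):
wget -r -k -np -nv --reject-regex 'action=(edit|diff|history)' http://www.boinc-wiki.info/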
It looks like this functionality has been on the table for a while and nothing has been done with it.
I haven't used it, but httrack looks like it has a more robust filtering feature set than wget and may be a better fit for what you're looking for (read about filters here: http://www.httrack.com/html/fcguide.html).
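I haven't verified it, but judging from that filter guide, an equivalent invocation might look like this (the filters are my guess at the syntax):
httrack 'http://www.boinc-wiki.info/' -O ./boinc-wiki '-*action=*' '-*.jpg' '-*.jpeg' '-*.gif' '-*.png' '-*.tif'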
The new version of wget (v1.14) solves all these problems. You have to use the new option --reject-regex=... to handle query strings. Note that I couldn't find a manual that includes these new options, so you have to use the help command:
wget --help > help.txt
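From that help output, a minimal invocation that skips every URL carrying an action= query parameter might be (a sketch, untested):
wget -r -k -np -nv --reject-regex '[?&]action=' http://www.boinc-wiki.info/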
Pavuk should be able to do it:
http://pavuk.sourceforge.net/man.html#sect39
MediaWiki example:
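Going by the -skip_url_pattern option in the linked manual, such a command might look like this (a sketch, untested; the exact patterns are my assumption):
pavuk -skip_url_pattern '*action=*' -skip_url_pattern '*diff=*' http://www.boinc-wiki.info/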
It looks like you are trying to avoid downloading the special pages of MediaWiki. I solved this problem once by avoiding the index.php page (see the sketch below). However, that wiki used URLs as seen on Wikipedia (http://<wiki>/en/Theme) and not the pattern I have seen in other places (http://<wiki>/index.php?title=Theme). Since the link you gave uses URLs in the Wikipedia pattern, though, I think this solution can work for you too.
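A minimal sketch of that approach, using wget's pattern-based rejection described in the next answer (the exact pattern is my assumption):
wget -r -k -np -nv -R 'index.php*' http://<wiki>/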
The documentation for wget says:

‘-R rejlist --reject rejlist’
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

Patterns are probably what you want. I am not sure how sophisticated the patterns are, but you can either try to accept only certain files or block the unwanted ones.

Block:
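My guess at a suitable reject pattern (untested; per the quoted docs, the wildcard makes it a pattern rather than a suffix):
wget -r -k -np -nv -R '*action=*' http://www.boinc-wiki.info/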
Accept:
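Again a guess; accepting by suffix only helps if the pages you want share one, and wiki pages here mostly have no file extension, so this is probably the weaker option:
wget -r -k -np -nv -A '*.html,*.htm' http://www.boinc-wiki.info/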
Edit: never mind this, in light of the other post.