I made a ~/.bashrc function to save some web directories to my local disk. It works well, except that it also saves some unwanted index files that are not present on the website.
I use it like

```
crwl http://ioccc.org/2013/cable3/
```

but it also retrieves some files such as `index.html?C=D;O=A`, `index.html?C=D;O=D`, `index.html?C=M;O=A`, `index.html?C=M;O=D`, `index.html?C=N;O=A`, `index.html?C=N;O=D`, `index.html?C=S;O=A` and `index.html?C=S;O=D`.
Complete file list:

```
kenn@kenn:~/experiment/crwl/ioccc.org/2013/cable3$ ls
bios        index.html?C=D;O=A  index.html?C=S;O=A      screenshot_flightsim4.png
cable3.c    index.html?C=D;O=D  index.html?C=S;O=D      screenshot_lotus123.png
fd.img      index.html?C=M;O=A  Makefile                screenshot_qbasic.png
hint.html   index.html?C=M;O=D  runme                   screenshot_simcity.png
hint.text   index.html?C=N;O=A  sc-ioccc.terminal       screenshot_win3_on_macosx.png
index.html  index.html?C=N;O=D  screenshot_autocad.png
```
I want to exclude those files while cloning that directory with wget. Is there any wget switch or trick to clone a web directory exactly as it is?
My script function in .bashrc:

```
crwl() {
    wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent "$@"
}
```
EDIT: I found two possible workarounds:

1) Adding the `-R 'index.html?*'` flag

2) Adding the `-R '=A,=D'` flag

Both reject the `index.html?C=D;O=A`-style files while keeping `index.html`. I don't know which one is proper, but both of them seem unsafe.
To exclude index-sort files such as those with URL `index.html?C=...`, without excluding any other kind of `index.html*` files, there is indeed a more precise specification possible. Try:

```
-R '\?C='
```
Quick Demo
Set up a different, empty directory to test in (below, `~/experiment2`).
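A sketch of that setup step (the `~/experiment2` name matches the directory referenced just below; any empty directory will do):

```shell
# Create a fresh directory for the test and switch into it.
mkdir -p ~/experiment2
cd ~/experiment2
```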
Then run a shorter version of your command, without the recursion and levels, to do a quick one-page test.
After wget is done, `~/experiment2` will have no `index.html?C=...` files. So it has indeed excluded those redundant index-sort `index.html?C=...` files while keeping all other `index.html` files, in this case just `index.html`.
Implement

So just implement the `-R '\?C='` by updating your shell function in `~/.bashrc`.
Then remember to either test in a new terminal, or re-source your `.bashrc` to make it effective.
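Concretely, the function from the question with only the `-R '\?C='` flag added might look like this in `~/.bashrc`:

```shell
# ~/.bashrc: crwl with the reject pattern for the index-sort links.
crwl() {
    wget --tries=inf --timestamping --recursive --level=inf \
         --convert-links --page-requisites --no-parent \
         -R '\?C=' "$@"
}

# In an already-open terminal, re-source to pick up the change:
# . ~/.bashrc
```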
Then try it in a new directory, for comparison.
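For instance (assuming the updated `crwl` is loaded; the `~/experiment3` name is just an illustration):

```shell
# Clone into a fresh directory and compare against the first attempt.
mkdir -p ~/experiment3
cd ~/experiment3
crwl http://ioccc.org/2013/cable3/
```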
Warranty
If `wget -V` says it is 1.13, this may not work and you may need to actually delete those pesky `index.html?C=...` files yourself, or try to get a more recent version of wget.

`-R` means reject a pattern, in this case pages with the `?C=` pattern that is typical of the `index.html?C=...` versions of `index.html`. `?` happens to be a wget wildcard, so to match a literal `?` you need to escape it as `\?`.
Note that wget may still temporarily download the `index.html?C=...` files while crawling. Only if you let wget finish will it follow your `-R` specification and delete any temporarily downloaded `index.html?C=...` files for you.

Try the following after the download if you do not want to use wget's removal mechanism, or are on a system not supporting this option.
Command:
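One way to do the manual cleanup is with `find`; first preview what would be removed (the single quotes keep the shell from expanding the pattern):

```shell
# List the sort-link index files under the current directory.
find . -name 'index.html?C=*' -print
```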
When you are satisfied with the output, do the following:
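A matching deletion step, using the same pattern (a sketch; double-check the preview output before running it):

```shell
# Remove the sort-link index files, leaving index.html itself alone.
find . -name 'index.html?C=*' -delete
```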
(I'm not responsible if you delete your whole file system; that is why it is done this way, with a preview first.)
Hope this helps.