I made a ~/.bashrc function to save some web directories to my local disk. It works well, except that it also saves some unwanted index files that are not present on the website.
I use it like

```
crwl http://ioccc.org/2013/cable3/
```

but it also retrieves some files such as `index.html?C=D;O=A`, `index.html?C=D;O=D`, `index.html?C=M;O=A`, `index.html?C=M;O=D`, `index.html?C=N;O=A`, `index.html?C=N;O=D`, `index.html?C=S;O=A` and `index.html?C=S;O=D`.
Complete file list:

```
kenn@kenn:~/experiment/crwl/ioccc.org/2013/cable3$ ls
bios        index.html?C=D;O=A  index.html?C=S;O=A      screenshot_flightsim4.png
cable3.c    index.html?C=D;O=D  index.html?C=S;O=D      screenshot_lotus123.png
fd.img      index.html?C=M;O=A  Makefile                screenshot_qbasic.png
hint.html   index.html?C=M;O=D  runme                   screenshot_simcity.png
hint.text   index.html?C=N;O=A  sc-ioccc.terminal       screenshot_win3_on_macosx.png
index.html  index.html?C=N;O=D  screenshot_autocad.png
```
I want to exclude those files while cloning that directory with wget. Is there any wget switch or trick to clone a web directory exactly as it is?
My script function in .bashrc:

```
crwl() {
    wget --tries=inf --timestamping --recursive --level=inf --convert-links --page-requisites --no-parent "$@"
}
```
EDIT: I found two possible workarounds:

1) Adding the `-R 'index.html?*'` flag

2) Adding the `-R '=A,=D'` flag

Both reject the `index.html?C=D;O=A`-style files while keeping `index.html`. I don't know which one is proper, but both of them seem unsafe.
To exclude index-sort files such as those with URL `index.html?C=...`, without excluding any other kind of `index.html*` files, there is indeed a more precise specification possible. Try:

```
-R '\?C='
```
Quick Demo
Set up a different, empty directory to test in (below, `~/experiment2`).
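A sketch of that setup step (the `~/experiment2` name matches the directory referenced just below; any empty directory will do):

```shell
# Create a fresh directory for the test and switch into it.
mkdir -p ~/experiment2
cd ~/experiment2
```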
Then run a shorter version of your command, without the recursion and levels, to do a quick one-page test.
After wget is done, `~/experiment2` will have no `index.html?C=...` files. So it has indeed excluded those redundant index-sort `index.html?C=...` files while keeping all other `index.html` files, in this case just `index.html`.
Implement

So just implement the `-R '\?C='` by updating your shell function in `~/.bashrc`.
Then remember to either test in a new terminal, or re-source your `.bashrc` to make it effective.
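Concretely, the function from the question with only the `-R '\?C='` flag added might look like this in `~/.bashrc`:

```shell
# ~/.bashrc: crwl with the reject pattern for the index-sort links.
crwl() {
    wget --tries=inf --timestamping --recursive --level=inf \
         --convert-links --page-requisites --no-parent \
         -R '\?C=' "$@"
}

# In an already-open terminal, re-source to pick up the change:
# . ~/.bashrc
```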
Then try it in a new directory, for comparison.
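For instance (assuming the updated `crwl` is loaded; the `~/experiment3` name is just an illustration):

```shell
# Clone into a fresh directory and compare against the first attempt.
mkdir -p ~/experiment3
cd ~/experiment3
crwl http://ioccc.org/2013/cable3/
```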
Warranty
If `wget -V` says it is 1.13, this may not work and you may need to actually delete those pesky `index.html?C=...` files yourself, or try to get a more recent version of wget.

`-R` means reject a pattern, in this case pages with the `?C=` pattern that is typical of the `index.html?C=...` versions of `index.html`. `?` happens to be a wget wildcard, so to match a literal `?` you need to escape it as `\?`.
Note that wget may still temporarily download the `index.html?C=...` files while crawling. Only if you let wget finish will it follow your `-R` specification and delete any temporarily downloaded `index.html?C=...` files for you.

Try the following after the download if you do not want to use wget's removal mechanism, or are on a system not supporting this option.
Command:
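One way to do the manual cleanup is with `find`; first preview what would be removed (the single quotes keep the shell from expanding the pattern):

```shell
# List the sort-link index files under the current directory.
find . -name 'index.html?C=*' -print
```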
When you are satisfied with the output, do the following:
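A matching deletion step, using the same pattern (a sketch; double-check the preview output before running it):

```shell
# Remove the sort-link index files, leaving index.html itself alone.
find . -name 'index.html?C=*' -delete
```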
(I'm not responsible if you delete your whole file system; that is why it is done this way, with a preview first.)
Hope this helps.