I want to find my articles within the deprecated (obsolete) literature forum e-bane.net. Some of the forum modules are disabled, and I can't get a list of articles by their author. Also, the site is not indexed by search engines such as Google, Yandex, etc.
The only way to find all of my articles is to open the archive page of the site (fig.1). Then I must select a certain year and month - e.g. January 2013 (fig.1). And then I must inspect each article (fig.2) to check whether my nickname - pa4080 - is written at the beginning (fig.3). But there are a few thousand articles.
I've read a few topics such as the following, but none of the solutions fits my needs:
I will post my own solution. But I'm curious: is there a more elegant way to solve this task?
To solve this task I've created the following simple bash script, which mainly uses the CLI tool `wget`. The script has three functions:
The first function, `get_url_map()`, runs `wget` as `--spider` (which means that it will just check that the pages are there) and recursively (`-r`) creates a URL map `$MAP_FILE` of the `$TARGET_URL` with depth level `-l2`. (Another example can be found here: Convert Website to PDF.) In the current case the `$MAP_FILE` contains about 20,000 URLs.
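The full script isn't reproduced here, but a minimal sketch of what `get_url_map()` might look like follows; extracting the URLs from `wget`'s spider log with `grep`/`awk` is an assumption, not necessarily the author's exact implementation:

```bash
get_url_map() {
    # Spider the site recursively to depth 2; wget prints each URL it
    # checks on lines starting with "--<timestamp>--", so the URL is
    # the third whitespace-separated field.
    wget --spider -r -l2 "$TARGET_URL" 2>&1 \
        | grep '^--' | awk '{ print $3 }' | sort -u > "$MAP_FILE"
}
```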
The second function, `filter_url_map()`, will simplify the content of the `$MAP_FILE`. In this case we need only the lines (URLs) that contain the string `article&sid`, and there are about 3,000 of them. More ideas can be found here: How to remove particular words from lines of a text file?
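A plausible sketch of `filter_url_map()`, assuming a simple `grep` filter that rewrites the map file in place (the exact implementation may differ):

```bash
filter_url_map() {
    # Keep only the article URLs (those containing "article&sid")
    # and drop duplicates, then replace the original map file.
    grep 'article&sid' "$MAP_FILE" | sort -u > "$MAP_FILE.tmp"
    mv "$MAP_FILE.tmp" "$MAP_FILE"
}
```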
The third function, `get_key_urls()`, will use `wget -qO-` (which behaves like the command `curl` - examples) to output the content of each URL from the `$MAP_FILE` and will try to find any of the `$KEY_WORDS` within it. If any of the `$KEY_WORDS` is found within the content of a particular URL, that URL will be saved in the `$OUT_FILE`.
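A sketch of the `get_key_urls()` loop, assuming `$KEY_WORDS` is a space-separated list; fetching the page once per keyword is also an assumption, though it would be consistent with two keywords taking roughly half again as long as one:

```bash
get_key_urls() {
    # Fetch each URL quietly to stdout and grep it for each keyword;
    # a URL is recorded once, at the first keyword that matches.
    while read -r url; do
        for keyword in $KEY_WORDS; do
            if wget -qO- "$url" | grep -qi -- "$keyword"; then
                echo "$url" >> "$OUT_FILE"
                break
            fi
        done
    done < "$MAP_FILE"
}
```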
During the working process the output of the script looks as shown in the next image. It takes about 63 minutes to finish when there are two keywords, and 42 minutes when only one keyword is searched.
Here is a python3 version of the script (tested with python3.5 on Ubuntu 17.10).

How to use:

- Save the script as `script.py` and the package file as `requirement.txt`.
- Run `pip install -r requirement.txt`.
- Run `python3 script.py pa4080`.

It uses several libraries:

Things to know to develop the program further (other than the docs of the required packages):

How it works:

Some ideas so it can be developed further:
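For reference, the two commands from the list above can be run back to back from the directory containing both files; `pa4080` is the nickname to search for:

```bash
# Install the dependencies listed in requirement.txt,
# then run the script with the nickname as its argument.
pip install -r requirement.txt
python3 script.py pa4080
```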
This is not the most elegant answer, but I think it is better than the bash answer.
I recreated my script based on this answer provided by @karel. Now the script uses `lynx` instead of `wget`, and as a result it becomes significantly faster. The current version does the same job in 15 minutes when there are two searched keywords, and in only 8 minutes if we are searching for a single keyword. That is faster than the Python solution provided by @dan.
In addition, `lynx` provides better handling of non-Latin characters.
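A sketch of the same step with `lynx` in place of `wget`; `lynx -dump` renders each page as plain text, which is where the better character handling comes from (function and variable names follow the earlier sketch and are not the author's verbatim code):

```bash
get_key_urls() {
    # lynx -dump renders the page as decoded plain text,
    # which is then searched for each keyword in turn.
    while read -r url; do
        for keyword in $KEY_WORDS; do
            if lynx -dump "$url" | grep -qi -- "$keyword"; then
                echo "$url" >> "$OUT_FILE"
                break
            fi
        done
    done < "$MAP_FILE"
}
```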