I want to write a Python program to download the contents of a web page, and then download the contents of the web pages that the first page links to.
For example, this is the main web page: http://www.adobe.com/support/security/, and these are the pages I want to download: http://www.adobe.com/support/security/bulletins/apsb13-23.html and http://www.adobe.com/support/security/bulletins/apsb13-22.html
There is a condition I want to meet: it should download only the web pages under bulletins, not those under advisories (e.g. http://www.adobe.com/support/security/advisories/apsa13-02.html).
#!/usr/bin/env python
import urllib
import re
import sys
import os

# fetch the main security page
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()

# write every href found on the page to a file called 'content'
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()

# keep only the links under /support/security/bulletins/
os.system("grep -i '/support/security/bulletins/' content >> content1")
I've already extracted the links to the bulletins into content1, but I don't know how to download the contents of those web pages, using content1 as input.
The content1 file is as shown below:

/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-07.html
If I understood your question, the following script should be what you want:
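A minimal sketch of that idea, reading the relative paths from content1, turning them into absolute URLs, and saving each page to disk (the bulletins output directory and the per-file naming are assumptions, not part of your original setup):

#!/usr/bin/env python
import os
import urllib
import urlparse

base_url = "http://www.adobe.com/support/security/"
out_dir = 'bulletins'  # assumed output directory for the downloaded pages
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

# content1 holds one relative path per line, e.g. /support/security/bulletins/apsb13-23.html
with open('content1') as f:
    paths = set(line.strip() for line in f if line.strip())  # set() drops duplicate links

for path in paths:
    url = urlparse.urljoin(base_url, path)                    # build the absolute URL
    filename = os.path.join(out_dir, os.path.basename(path))  # e.g. bulletins/apsb13-23.html
    print('downloading %s -> %s' % (url, filename))
    urllib.urlretrieve(url, filename)                         # save the page to disk

If you would rather skip the grep step entirely, you can apply the same check ('/support/security/bulletins/' in the href) directly to the links you extract with re.findall and download them in the same loop.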
Probably this question is for Stack Overflow!
But anyway, you can look at HTTrack for this; it does a similar kind of operation, and moreover it's open source.