I have a list of web pages that I need to scrape, parse, and then store the resulting data in a database. The total is around 5,000,000 pages.
My current plan is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave that to run, and then merge the resulting databases once the process has completed. My assumption is that this would take around one day to run (600 ms to load, parse and save each page).
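For reference, the back-of-envelope arithmetic behind that estimate (assuming each instance works through its pages strictly one at a time):

    # Rough check of the one-day figure, assuming one sequential worker per instance
    pages_total = 5_000_000
    instances = 100
    seconds_per_page = 0.6                                # load + parse + save

    pages_per_instance = pages_total // instances         # 50,000 pages each
    hours_per_instance = pages_per_instance * seconds_per_page / 3600
    print(hours_per_instance)                             # ~8.3 hours of pure download time per instance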
Does anyone have experience with doing such a large volume of page scraping within a limited time? I've done large numbers before (1.5m), but that was from a single machine and took just over a week to complete.
The bottleneck in my situation is downloading the pages; parsing takes no more than 2 ms per page, so what I'm looking for is something that can streamline the downloading.
Working on the assumption that download time (and therefore bandwidth usage) is your limiting factor, I would make the following suggestions:
Firstly, choose m1.large instances. Of the three 'levels' of I/O performance (which includes bandwidth), the m1.large and m1.xlarge instances both offer 'high' I/O performance. Since your task is not CPU bound, the least expensive of these will be the preferable choice.
Secondly, your instance will be able to download far faster than any site can serve pages, so do not download a single page at a time on a given instance; run the downloads concurrently. You should be able to do at least 20 pages simultaneously (although I would guess you can probably do 50-100 without difficulty). (Take the example from your comment of downloading from a forum: that is a dynamic page that is going to take the server time to generate, and there are other users consuming that site's bandwidth, etc.) Continue to increase the concurrency until you reach the limits of the instance's bandwidth. (Of course, don't make multiple simultaneous requests to the same site.)
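As a rough illustration (not a prescription for your exact setup), a minimal Python sketch of running ~50 downloads concurrently on one instance might look like the following; the URL list and worker count are placeholders, and the per-site politeness logic mentioned above is omitted:

    # Minimal sketch: fetch many pages concurrently with a thread pool
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return url, resp.read()
        except Exception as exc:
            return url, exc                     # log/requeue failures in a real run

    urls = ["http://example.com/", "http://example.org/"]     # placeholder URL list
    with ThreadPoolExecutor(max_workers=50) as pool:          # ~50 concurrent downloads
        for url, result in pool.map(fetch, urls):
            pass                                # hand successful results to the parser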
If you really are trying to maximize performance, you may consider launching instances in geographically appropriate zones to minimize latency (but that would require geolocating all your URLs, which may not be practical).
One thing to note is that instance bandwidth is variable: at times you will get higher performance, and at other times lower. On the smaller instances, the variation is more significant because the physical links are shared by more servers, any of which can decrease your available bandwidth. Between m1.large instances within the EC2 network (same availability zone), you should get near-theoretical gigabit throughput.
In general, with AWS, it is almost always more efficient to go with a larger instance as opposed to multiple smaller instances (unless you are specifically looking at something such as failover, etc. where you need multiple instances).
I don't know what your setup entails, but when I previously attempted this (between 1 and 2 million links, updated periodically), my approach was to maintain a database of the links, adding new links as they were found, and to fork processes to scrape and parse the pages. A URL would be retrieved (at random) and marked as in progress in the database; the script would download the page and, if successful, mark the URL as downloaded in the database and send the content to another script that parsed the page, with new links added to the database as they were found. The advantage of the database here was centralization: multiple scripts could query the database simultaneously and (as long as transactions were atomic) one could be assured that each page would only be downloaded once.
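A minimal sketch of that pattern, using sqlite3 purely for illustration (the real setup used a proper database server, and the table and column names here are made up):

    # Central database acting as the work queue; many worker processes can share it
    import sqlite3

    conn = sqlite3.connect("crawl.db", isolation_level=None)   # autocommit; transactions managed by hand
    conn.execute("""CREATE TABLE IF NOT EXISTS urls (
        url    TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending'   -- pending / in_progress / done
    )""")

    def claim_url():
        """Atomically pick a random pending URL and mark it as in progress."""
        conn.execute("BEGIN IMMEDIATE")          # take the write lock so no other worker claims the same URL
        row = conn.execute("SELECT url FROM urls WHERE status = 'pending' "
                           "ORDER BY RANDOM() LIMIT 1").fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute("UPDATE urls SET status = 'in_progress' WHERE url = ?", (row[0],))
        conn.execute("COMMIT")
        return row[0]

    def mark_done(url, new_links):
        """Record a successful download and queue any newly discovered links."""
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("UPDATE urls SET status = 'done' WHERE url = ?", (url,))
        conn.executemany("INSERT OR IGNORE INTO urls (url) VALUES (?)",
                         [(link,) for link in new_links])
        conn.execute("COMMIT")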
A couple of additional points worth mentioning: there are limits (I believe 20) on the number of on-demand instances you can have running at one time; if you plan to exceed those limits, you will need to ask AWS to increase your account's limits. It would be much more economical for you to run spot instances and to scale up your numbers when the spot price is low (perhaps one on-demand instance to keep everything organized, and the remaining instances as spot instances).
If time is of higher priority than cost to you, the cluster compute instances offer 10 Gbps networking and should yield the greatest download throughput.
Recap: try a few large instances (instead of many small instances) and run multiple concurrent downloads on each instance; add more instances if you find yourself bandwidth limited, and move to larger instances if you find yourself CPU/memory bound.
We tried to do something similar, and here are my 5 cents:
Get 2-3 cheap unmetered servers, i.e. don't pay for the bandwidth.
Use Python with asyncore. asyncore is the old way to do things, but we found that it works faster than any other method. The downside is that DNS lookups are blocking, i.e. not "parallel". Using asyncore we managed to scrape 1M URLs in about 40 minutes, using a single 4-core Xeon with 8 GB RAM. Load average on the server was under 4 (which is excellent for 4 cores).
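A stripped-down version of the asyncore approach looks roughly like this (plain HTTP only, no error handling or redirects; note that asyncore was removed in Python 3.12):

    # One dispatcher per URL; asyncore.loop() multiplexes all the sockets in a single process
    import asyncore
    import socket
    from urllib.parse import urlparse

    class PageFetcher(asyncore.dispatcher):
        def __init__(self, url):
            asyncore.dispatcher.__init__(self)
            parsed = urlparse(url)
            self.request = ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n"
                            % (parsed.path or "/", parsed.hostname)).encode()
            self.response = b""
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((parsed.hostname, 80))   # the DNS lookup here is the blocking part

        def handle_connect(self):
            pass

        def writable(self):
            return len(self.request) > 0

        def handle_write(self):
            sent = self.send(self.request)
            self.request = self.request[sent:]

        def handle_read(self):
            self.response += self.recv(8192)

        def handle_close(self):
            self.close()
            # hand self.response (headers + body) to the parser here

    for url in ["http://example.com/", "http://example.org/"]:   # placeholder URLs
        PageFetcher(url)
    asyncore.loop()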
If you do not like asyncore, try gevent. It even does non-blocking DNS. Using gevent, 1M URLs were downloaded in about 50 minutes on the same hardware. Load average on the server was huge.
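The gevent equivalent is much shorter; a minimal sketch (URL list and pool size are placeholders):

    # monkey.patch_all() must run before the other imports; it makes sockets (and DNS) cooperative
    from gevent import monkey
    monkey.patch_all()

    import urllib.request
    from gevent.pool import Pool

    def fetch(url):
        try:
            return url, urllib.request.urlopen(url, timeout=30).read()
        except Exception as exc:
            return url, exc

    urls = ["http://example.com/", "http://example.org/"]   # placeholder URL list
    pool = Pool(200)                                        # cap on concurrent downloads
    for url, result in pool.imap_unordered(fetch, urls):
        pass                                                # hand successful results to the parser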
Note that we did test lots of Python libraries, such as grequests, curl and urllib/urllib2, but we did not test Twisted.