I have a list of web pages that I need to scrape, parse and then store the resulting data in a database. The list totals around 5,000,000 pages.
My current plan is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave them to run, and then merge the resulting databases once the process is complete. My assumption is that the whole run would take around one day: at roughly 600 ms to load, parse and save each page, 50,000 pages per instance works out to a bit over 8 hours of sequential work, which leaves headroom for overhead.
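For illustration, here's a minimal sketch of how I'd split the master list into per-instance batches (the urls.txt file name and the helper itself are just placeholders, not part of any actual setup):

```python
# Sketch only: split one big URL list into numbered 50,000-line batch files,
# one file per EC2 instance. "urls.txt" is an assumed input file name.
BATCH_SIZE = 50_000

def write_batches(source_path: str = "urls.txt") -> int:
    with open(source_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    for start in range(0, len(urls), BATCH_SIZE):
        batch_index = start // BATCH_SIZE
        with open(f"batch_{batch_index:03d}.txt", "w") as out:
            out.write("\n".join(urls[start:start + BATCH_SIZE]) + "\n")

    # Number of batch files written (~100 for 5,000,000 URLs).
    return (len(urls) + BATCH_SIZE - 1) // BATCH_SIZE

if __name__ == "__main__":
    print(f"wrote {write_batches()} batch files")
```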
Does anyone have experience with doing such a large volume of page scraping within a limited time? I've done large runs before (1.5 million pages), but that was from a single machine and took just over a week to complete.
The bottleneck in my situation is downloading the pages; the parsing takes no more than 2 ms per page. So what I'm looking for is anything that can streamline the process of downloading the pages.
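To make that concrete, this is roughly the kind of thing I mean by streamlining the downloads: keeping many requests in flight on each instance so the network latency overlaps rather than adding up. It's only a sketch, and the concurrency limit, timeout, sample URLs and the choice of aiohttp are all assumptions on my part, not something I've settled on:

```python
# Sketch of overlapping downloads with asyncio + aiohttp (one of several
# possible libraries). The concurrency limit and URLs are placeholders.
import asyncio
import aiohttp

CONCURRENCY = 50  # assumed per-instance limit; would need tuning

async def fetch(session, sem, url):
    # Limit the number of simultaneous requests with a semaphore.
    async with sem:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None  # the real run would log and retry here

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

if __name__ == "__main__":
    sample = ["https://example.com/page1", "https://example.com/page2"]
    pages = asyncio.run(crawl(sample))
    print(sum(p is not None for p in pages), "pages fetched")
```

Whether something along these lines (or an existing crawling framework) is the right way to hit that throughput is exactly what I'm asking about.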