Is there an official API for iplists.com from which I can get the list of spiders?
My intention is to whitelist these IPs for site scraping.
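I haven't found documentation of an official API, so as a fallback here is a minimal sketch that fetches one of the published plain-text lists and keeps only the IPv4 entries. The list URL is an assumption (iplists.com's actual list locations and formats should be checked before relying on this):

```python
# Minimal sketch: download a plain-text spider list and keep the IPv4 lines.
# LIST_URL is a hypothetical location -- there is no documented official API,
# so verify the real list URLs and formats on iplists.com first.
import re
import urllib.request

LIST_URL = "http://www.iplists.com/google.txt"  # assumed plain-text list

def fetch_spider_ips(url: str) -> set[str]:
    """Fetch a list and return only the lines that look like IPv4 addresses."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    ipv4 = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")
    return {line.strip() for line in text.splitlines() if ipv4.match(line.strip())}

if __name__ == "__main__":
    for ip in sorted(fetch_spider_ips(LIST_URL)):
        print(ip)
```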
I have a list of web pages that I need to scrape, parse, and then store in a database — around 5,000,000 in total.
My current assumption is that the best approach is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave them to run, and then merge the databases once the process completes. I estimate this would take around one day (600 ms to load, parse, and save each page).
Does anyone have experience scraping such a large volume of pages within a limited time? I've done large runs before (1.5 million pages), but that was from a single machine and took just over a week to complete.
The bottleneck in my situation is downloading the pages; parsing takes no more than 2 ms per page, so what I'm looking for is something that streamlines the downloading.
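Since the per-page time is dominated by network wait rather than CPU, each instance can fetch many pages concurrently instead of sequentially. A minimal sketch of one worker instance using a standard-library thread pool (the URL source, worker count, and the parse/store step are placeholders):

```python
# Sketch of one worker instance: download pages concurrently with a thread pool.
# The parse-and-store step is a placeholder; tune `workers` to what the target
# sites and the instance's bandwidth can tolerate.
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch(url: str, timeout: float = 15.0) -> bytes:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def scrape_batch(urls: list[str], workers: int = 50) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                html = fut.result()
            except Exception as exc:
                print(f"failed {url}: {exc}")
                continue
            # Parsing takes ~2 ms, so it can run inline here;
            # replace this with the real parse-and-store step.
            print(f"{url}: {len(html)} bytes")

if __name__ == "__main__":
    with open("urls.txt") as f:  # hypothetical file with one URL per line
        scrape_batch(f.read().split())
```

With downloads overlapped like this, the effective per-page cost drops well below 600 ms, which may also reduce how many instances are needed.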
I've been following a few spiders in our logs, and a traceroute on their IPs shows they are in fact EC2 instances. The user agents are listed as Googlebot and msnbot, but the IPs are not Google's or Microsoft's. Is there anything I can do? Is spoofing user agents a common practice? I'm guessing that if I ban their IPs (which I've done) they will just start a new instance and carry on, and I don't want to ban all EC2 instances.
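Spoofing Googlebot/msnbot user agents is common, and the usual way to tell a genuine crawler from an impostor is a reverse DNS lookup followed by a forward lookup that must map back to the same IP. A sketch of that check (the allowed hostname suffixes and the example IP are illustrative, and the result depends on live DNS):

```python
# Sketch: verify a claimed Googlebot/msnbot hit via reverse DNS plus a
# forward-confirming lookup. Crawlers running on EC2 IPs will fail this check.
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")  # illustrative

def is_genuine_bot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
    except socket.herror:
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_genuine_bot("66.249.66.1"))  # example Googlebot-range IP; depends on DNS at run time
```

You could run this check on hits claiming to be Googlebot or msnbot and block only the ones that fail, rather than banning EC2 ranges wholesale.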
Out of curiosity, is anyone here using a Google Mini or Google Search Appliance to provide intranet search? Was it easy to set up? What kind of prices do they charge (a ballpark figure; I'm sure it depends on the customer)?