Is there an official API for iplists.com from which I can get the list of spiders?
My intention is to whitelist these IPs for site scraping.
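I haven't found documentation of an official API, so as a fallback here is a minimal sketch that fetches one of the published plain-text lists and keeps only the IPv4 entries. The list URL is an assumption (iplists.com's actual list locations and formats should be checked before relying on this):

```python
# Minimal sketch: download a plain-text spider list and keep the IPv4 lines.
# LIST_URL is a hypothetical location -- there is no documented official API,
# so verify the real list URLs and formats on iplists.com first.
import re
import urllib.request

LIST_URL = "http://www.iplists.com/google.txt"  # assumed plain-text list

def fetch_spider_ips(url: str) -> set[str]:
    """Fetch a list and return only the lines that look like IPv4 addresses."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    ipv4 = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")
    return {line.strip() for line in text.splitlines() if ipv4.match(line.strip())}

if __name__ == "__main__":
    for ip in sorted(fetch_spider_ips(LIST_URL)):
        print(ip)
```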
I have a list of web pages that I need to scrape, parse, and then store in a database — around 5,000,000 in total.
My current assumption is that the best approach is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave them to run, and then merge the databases once the process completes. I estimate this would take around one day (600 ms to load, parse, and save each page).
Does anyone have experience scraping such a large volume of pages within a limited time? I've done large runs before (1.5 million pages), but that was from a single machine and took just over a week to complete.
The bottleneck in my situation is downloading the pages; parsing takes no more than 2 ms per page, so what I'm looking for is something that streamlines the downloading.
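Since the per-page time is dominated by network wait rather than CPU, each instance can fetch many pages concurrently instead of sequentially. A minimal sketch of one worker instance using a standard-library thread pool (the URL source, worker count, and the parse/store step are placeholders):

```python
# Sketch of one worker instance: download pages concurrently with a thread pool.
# The parse-and-store step is a placeholder; tune `workers` to what the target
# sites and the instance's bandwidth can tolerate.
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch(url: str, timeout: float = 15.0) -> bytes:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def scrape_batch(urls: list[str], workers: int = 50) -> None:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                html = fut.result()
            except Exception as exc:
                print(f"failed {url}: {exc}")
                continue
            # Parsing takes ~2 ms, so it can run inline here;
            # replace this with the real parse-and-store step.
            print(f"{url}: {len(html)} bytes")

if __name__ == "__main__":
    with open("urls.txt") as f:  # hypothetical file with one URL per line
        scrape_batch(f.read().split())
```

With downloads overlapped like this, the effective per-page cost drops well below 600 ms, which may also reduce how many instances are needed.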
I've been following a few spiders in our logs, and a traceroute on their IPs shows they are in fact EC2 instances. The user agents are listed as Googlebot and msnbot, but the IPs are not Google's or Microsoft's. Is there anything I can do? Is spoofing user agents a common practice? I'm guessing that if I ban their IPs (which I've done) they will just start a new instance and carry on, and I don't want to ban all EC2 instances.
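Spoofing Googlebot/msnbot user agents is common, and the usual way to tell a genuine crawler from an impostor is a reverse DNS lookup followed by a forward lookup that must map back to the same IP. A sketch of that check (the allowed hostname suffixes and the example IP are illustrative, and the result depends on live DNS):

```python
# Sketch: verify a claimed Googlebot/msnbot hit via reverse DNS plus a
# forward-confirming lookup. Crawlers running on EC2 IPs will fail this check.
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")  # illustrative

def is_genuine_bot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # reverse lookup
    except socket.herror:
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_genuine_bot("66.249.66.1"))  # example Googlebot-range IP; depends on DNS at run time
```

You could run this check on hits claiming to be Googlebot or msnbot and block only the ones that fail, rather than banning EC2 ranges wholesale.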
Out of curiosity, is anyone here using a Google Mini or Google Search Appliance to provide intranet search? Was it easy to set up? What kind of prices do they charge (a ballpark figure; I'm sure it depends on the customer)?