I have a list of web pages that I need to scrape, parse and then store the resulting data in a database. The list totals around 5,000,000 pages.
My current plan is to deploy ~100 EC2 instances, give each instance 50,000 pages to scrape, leave them to run, and then merge the resulting databases once the process is complete. My assumption is that the whole run would take around one day: at roughly 600 ms to load, parse and save each page, 50,000 pages per instance works out to a bit over 8 hours of sequential work, which leaves headroom for overhead.
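For illustration, here's a minimal sketch of how I'd split the master list into per-instance batches (the urls.txt file name and the helper itself are just placeholders, not part of any actual setup):

```python
# Sketch only: split one big URL list into numbered 50,000-line batch files,
# one file per EC2 instance. "urls.txt" is an assumed input file name.
BATCH_SIZE = 50_000

def write_batches(source_path: str = "urls.txt") -> int:
    with open(source_path) as f:
        urls = [line.strip() for line in f if line.strip()]

    for start in range(0, len(urls), BATCH_SIZE):
        batch_index = start // BATCH_SIZE
        with open(f"batch_{batch_index:03d}.txt", "w") as out:
            out.write("\n".join(urls[start:start + BATCH_SIZE]) + "\n")

    # Number of batch files written (~100 for 5,000,000 URLs).
    return (len(urls) + BATCH_SIZE - 1) // BATCH_SIZE

if __name__ == "__main__":
    print(f"wrote {write_batches()} batch files")
```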
Does anyone have experience with doing such a large volume of page scraping within a limited time? I've done large runs before (1.5 million pages), but that was from a single machine and took just over a week to complete.
The bottleneck in my situation is downloading the pages; the parsing takes no more than 2 ms per page. So what I'm looking for is anything that can streamline the process of downloading the pages.
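To make that concrete, this is roughly the kind of thing I mean by streamlining the downloads: keeping many requests in flight on each instance so the network latency overlaps rather than adding up. It's only a sketch, and the concurrency limit, timeout, sample URLs and the choice of aiohttp are all assumptions on my part, not something I've settled on:

```python
# Sketch of overlapping downloads with asyncio + aiohttp (one of several
# possible libraries). The concurrency limit and URLs are placeholders.
import asyncio
import aiohttp

CONCURRENCY = 50  # assumed per-instance limit; would need tuning

async def fetch(session, sem, url):
    # Limit the number of simultaneous requests with a semaphore.
    async with sem:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None  # the real run would log and retry here

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

if __name__ == "__main__":
    sample = ["https://example.com/page1", "https://example.com/page2"]
    pages = asyncio.run(crawl(sample))
    print(sum(p is not None for p in pages), "pages fetched")
```

Whether something along these lines (or an existing crawling framework) is the right way to hit that throughput is exactly what I'm asking about.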