Just a few hours after having made some changes in the HTML of my site, I found that Google had updated its search result against my website. The Internet is so huge, how did the Google crawler do that? Doesn't it use too much bandwidth?
Google's spiders are constantly crawling the web. Many machines fetch pages in parallel, adding new and updated pages to Google's massive index all the time.
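The core loop is simpler than it sounds. Here is a minimal sketch of a breadth-first crawler; Google's real pipeline is distributed across many machines, and the fetch/parse callbacks here are stand-ins for real HTTP requests and HTML parsing:

```python
from collections import deque
from urllib.parse import urljoin

# Illustrative sketch only: fetch a page, extract its links,
# queue any links you haven't seen yet, repeat.
def crawl(seeds, fetch, extract_links, max_pages=100):
    seen = set(seeds)
    queue = deque(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # network I/O in a real crawler
        pages[url] = html
        for link in extract_links(html):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Tiny in-memory "web" standing in for real HTTP and HTML parsing.
web = {
    "http://example.com/":  ["/a", "/b"],
    "http://example.com/a": ["/b"],
    "http://example.com/b": [],
}
crawled = crawl(["http://example.com/"],
                fetch=lambda u: web[u],
                extract_links=lambda links: links)
print(sorted(crawled))
```

The `seen` set is what keeps a real crawler from fetching the same URL twice, even when many pages link to it.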
Reasons it's fast:

...among many other factors.

Edit:
Google has an abundance of storage and bandwidth, so don't worry about them! As of January 2008, Google was sorting (on average) 20 PB a day. 20 PB (petabytes) is 20,000 terabytes, or 20 million gigabytes. And that's just sorting; it's only a fraction of all the data they handle.
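The unit conversion quoted above is easy to check (using decimal SI units, where each step is a factor of 1,000):

```python
# Quick sanity check of the figures quoted above (decimal SI units).
petabytes = 20
terabytes = petabytes * 1_000   # 1 PB = 1,000 TB
gigabytes = terabytes * 1_000   # 1 TB = 1,000 GB
print(terabytes, gigabytes)     # 20000 20000000
```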
Simply incredible.
I suspect Google uses a few extra signals to decide when to re-crawl: account activity in Google Analytics or Google Webmaster Tools, Twitter activity, search activity, toolbar activity, Chrome URL completion, and perhaps requests to their DNS service.
Then they need to look up when a listing page was last updated and, if it has changed, mine it for newly created pages. The sitemap is the preferred listing page (SuperUser has one), then feeds, then the home page, which tends to list recent pages and is therefore updated whenever another page is.
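The sitemap check described above is straightforward to sketch. This example follows the sitemaps.org protocol format and inlines a small sitemap rather than fetching one over HTTP; the URLs and dates are invented for illustration:

```python
# Illustrative sketch: read <lastmod> entries from a sitemap to find
# pages updated since a cutoff date. Sitemap content is inlined here;
# a crawler would fetch it from /sitemap.xml.
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2010-01-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2011-06-15</lastmod></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def pages_updated_since(xml_text, cutoff):
    root = ET.fromstring(xml_text)
    updated = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod and datetime.fromisoformat(lastmod) >= cutoff:
            updated.append(loc)
    return updated

print(pages_updated_since(SITEMAP_XML, datetime(2011, 1, 1)))
# → ['https://example.com/new']
```

Only the pages whose `<lastmod>` postdates the last visit need to be re-fetched, which is exactly why a sitemap saves the crawler so much bandwidth.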
Google's crawling frequency is determined by many factors, such as PageRank, links to a page, and crawling constraints like the number of parameters in a URL.
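As a hypothetical sketch of how such signals might combine: the weights and formula below are invented for illustration only (Google's actual scoring is not public); the point is just that high-value pages get crawled eagerly while parameter-heavy URLs get deprioritized:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical scoring function: weights are made up for illustration.
def crawl_priority(pagerank, inbound_links, url):
    n_params = len(parse_qs(urlparse(url).query))
    parameter_penalty = 0.5 ** n_params  # parameter-heavy URLs crawled less eagerly
    return (pagerank + 0.1 * inbound_links) * parameter_penalty

print(crawl_priority(0.8, 12, "http://example.com/page"))
print(crawl_priority(0.8, 12, "http://example.com/page?sort=asc&filter=new"))
```

The same page scores lower once query parameters appear, mirroring the constraint mentioned above.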
Here's an excellent article on how it is done:
The Anatomy of a Large-Scale Hypertextual Web Search Engine