Does there exist a forward proxy server that will look up and obey robots.txt
files on remote internet domains and enforce them on behalf of requesters going via the proxy?
E.g. imagine a website at www.example.com that has a robots.txt file that restricts certain URLs and applies Crawl-delays to others. Multiple automatic clients (e.g. crawlers, scrapers) could then, going via the proxy, access www.example.com without violating the robots.txt directives AND without having to fetch the file themselves (=> simpler clients and fewer requests for robots.txt).
(Specifically, I am looking at the "GYM2008" version of the spec - http://nikitathespider.com/python/rerp/#gym2008 - because it's in wide use)
I'm not sure why enforcing compliance with robots.txt would be the job of a proxy: the crawler (robot) is supposed to pull robots.txt and follow the instructions in that file. So as long as the proxy returns the correct robots.txt data, the crawler Does The Right Thing with that data, and the crawler supports using a proxy, you'll get all the benefits of a proxy with no extra work required.
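To illustrate the "fix the crawler" side of this, here's a minimal sketch using Python's standard urllib.robotparser, which understands Disallow rules and (since Python 3.6) Crawl-delay. The proxy address and User-Agent string are made up for the example; a GYM2008-complete parser such as the robotexclusionrulesparser module linked in the question could be swapped in instead:

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "MyCrawler/1.0"                    # hypothetical crawler UA
PROXY = {"http": "http://proxy.internal:3128"}  # hypothetical forward proxy

# Route all urllib traffic (including the robots.txt fetch below) via the proxy.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
)

# Fetch and parse the site's robots.txt once, up front.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

url = "http://www.example.com/some/page.html"
if rp.can_fetch(USER_AGENT, url):
    delay = rp.crawl_delay(USER_AGENT) or 0     # honour Crawl-delay if present
    time.sleep(delay)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
else:
    print("robots.txt disallows", url)
```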
That said, I don't know of any proxy that does what you seem to be asking for (parse robots.txt from a site and only return things that would be allowed by that file, presumably to control a crawler bot that doesn't respect robots.txt?). Writing a proxy that handles this would require a user-agent-to-robots.txt mapping/check for every request the proxy receives. That's certainly possible (you could do it in Squid, but you'd need to bang together a script to turn robots.txt into Squid config rules, as sketched below, and update that data periodically), but it would undoubtedly be an efficiency hit on the proxy.

Fixing the crawler is the better solution (it also avoids "stale" data being sent to the crawler by the proxy). Note that a good crawler bot will check update times in the HTTP headers and only fetch pages if they've changed...
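For completeness, a rough sketch of what such a Squid-feeding script could look like. It only handles Disallow lines in the "User-agent: *" record and ignores Allow and Crawl-delay; the site URL, crawler User-Agent pattern, and ACL names are invented for the example, while Squid's real url_regex and browser ACL types do the actual blocking:

```python
#!/usr/bin/env python3
"""Sketch: turn a site's robots.txt Disallow rules into Squid ACL lines."""
import re
import urllib.request

SITE = "http://www.example.com"   # site whose robots.txt we mirror
BOT_UA_PATTERN = "MyCrawler"      # hypothetical crawler User-Agent pattern

def wildcard_disallows(robots_txt: str) -> list[str]:
    """Collect Disallow paths from the 'User-agent: *' record."""
    paths, in_star_record = [], False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_star_record = (value == "*")
        elif field == "disallow" and in_star_record and value:
            paths.append(value)
    return paths

def squid_rules(paths: list[str]) -> str:
    """Emit acl/http_access lines denying the disallowed URL prefixes to the bot."""
    lines = [f"acl crawler_ua browser -i {BOT_UA_PATTERN}"]
    for i, path in enumerate(paths):
        acl = f"robots_disallow_{i}"
        lines.append(f"acl {acl} url_regex -i ^{re.escape(SITE + path)}")
        lines.append(f"http_access deny crawler_ua {acl}")
    return "\n".join(lines)

if __name__ == "__main__":
    with urllib.request.urlopen(SITE + "/robots.txt") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    print(squid_rules(wildcard_disallows(text)))
```

You'd have to re-run something like this on a schedule and reload Squid, which is exactly the staleness problem mentioned above, and it still only approximates the spec, so fixing the crawler remains the cleaner option.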