Does there exist a forward proxy server that will look up and obey robots.txt
files on remote internet domains and enforce them on behalf of requesters going via the proxy?
E.g. imagine a website at www.example.com that has a robots.txt file that restricts certain URLs and applies Crawl-delays to others. Multiple automatic clients (e.g. crawlers, scrapers) could then, going via the proxy, access www.example.com without violating the robots.txt directives AND without having to fetch the file themselves (=> simpler clients and fewer requests for robots.txt).
(Specifically, I am looking at the "GYM2008" version of the spec - http://nikitathespider.com/python/rerp/#gym2008 - because it's in wide use)
I'm not sure why enforcing compliance with robots.txt would be the job of a proxy: the crawler (robot) is supposed to pull robots.txt and follow the instructions in that file. So as long as the proxy returns the correct robots.txt data, the crawler Does The Right Thing with that data, and the crawler supports using a proxy, you'll get all the benefits of a proxy with no extra work required.
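To illustrate the "fix the crawler" side of this, here's a minimal sketch using Python's standard urllib.robotparser, which understands Disallow rules and (since Python 3.6) Crawl-delay. The proxy address and User-Agent string are made up for the example; a GYM2008-complete parser such as the robotexclusionrulesparser module linked in the question could be swapped in instead:

```python
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "MyCrawler/1.0"                    # hypothetical crawler UA
PROXY = {"http": "http://proxy.internal:3128"}  # hypothetical forward proxy

# Route all urllib traffic (including the robots.txt fetch below) via the proxy.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
)

# Fetch and parse the site's robots.txt once, up front.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

url = "http://www.example.com/some/page.html"
if rp.can_fetch(USER_AGENT, url):
    delay = rp.crawl_delay(USER_AGENT) or 0     # honour Crawl-delay if present
    time.sleep(delay)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
else:
    print("robots.txt disallows", url)
```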
That said, I don't know of any proxy that does what you seem to be asking for (parse robots.txt from a site and only return things that would be allowed by that file, presumably to control a crawler bot that doesn't respect robots.txt?). Writing a proxy that handles this would require a user-agent-to-robots.txt mapping/check for every request the proxy receives. That's certainly possible (you could do it in Squid, but you'd need to bang together a script to turn robots.txt into Squid config rules, as sketched below, and update that data periodically), but it would undoubtedly be an efficiency hit on the proxy.

Fixing the crawler is the better solution (it also avoids "stale" data being sent to the crawler by the proxy). Note that a good crawler bot will check update times in the HTTP headers and only fetch pages if they've changed...
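For completeness, a rough sketch of what such a Squid-feeding script could look like. It only handles Disallow lines in the "User-agent: *" record and ignores Allow and Crawl-delay; the site URL, crawler User-Agent pattern, and ACL names are invented for the example, while Squid's real url_regex and browser ACL types do the actual blocking:

```python
#!/usr/bin/env python3
"""Sketch: turn a site's robots.txt Disallow rules into Squid ACL lines."""
import re
import urllib.request

SITE = "http://www.example.com"   # site whose robots.txt we mirror
BOT_UA_PATTERN = "MyCrawler"      # hypothetical crawler User-Agent pattern

def wildcard_disallows(robots_txt: str) -> list[str]:
    """Collect Disallow paths from the 'User-agent: *' record."""
    paths, in_star_record = [], False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            in_star_record = (value == "*")
        elif field == "disallow" and in_star_record and value:
            paths.append(value)
    return paths

def squid_rules(paths: list[str]) -> str:
    """Emit acl/http_access lines denying the disallowed URL prefixes to the bot."""
    lines = [f"acl crawler_ua browser -i {BOT_UA_PATTERN}"]
    for i, path in enumerate(paths):
        acl = f"robots_disallow_{i}"
        lines.append(f"acl {acl} url_regex -i ^{re.escape(SITE + path)}")
        lines.append(f"http_access deny crawler_ua {acl}")
    return "\n".join(lines)

if __name__ == "__main__":
    with urllib.request.urlopen(SITE + "/robots.txt") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    print(squid_rules(wildcard_disallows(text)))
```

You'd have to re-run something like this on a schedule and reload Squid, which is exactly the staleness problem mentioned above, and it still only approximates the spec, so fixing the crawler remains the cleaner option.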