I have a "content" website that some leechers and 419 scammers love to crawl agressively which also generates costs and performance issue. :( I have no choice: I need to prevent them to access the sitemap files and index. :(
I am doing the same as Facebook: I generate a sitemap index on the fly (/sitemap.php). I whitelisted the "good" crawlers with a reverse DNS lookup (PHP) and a user-agent check (same as Stack Overflow). To prevent the whitelisted engines from making the sitemap index content public, I added these headers (Stack Overflow forgot them):
header('Content-type: application/xml; charset="UTF-8"', true); // serve as XML and replace any previously set Content-type header
header('Pragma: no-cache');                                     // legacy hint to discourage caching of the response
header('X-Robots-Tag: NOARCHIVE');                              // tell search engines not to keep a cached copy in their results
Question 1: Am I missing something to protect the sitemap index file?
Question 2: The problem comes from the static sitemap (.xml.gz) files that are generated. How can I protect them? Even if they have a "hard to guess" name, they can be found easily with a simple Google query (example: "site:stackoverflow.com filetype:xml"), and I have very limited access to .htaccess.
EDIT: This is not a server config issue. Preferred language is PHP.
EDIT 2: Sorry, this is a purely programmatic question, but it has been transferred from SO and I cannot close/delete it. :(
You could always use a URL for the sitemap that is not disclosed to anyone apart from the engines you explicitly submit it to.
Have a look at http://en.wikipedia.org/wiki/Sitemaps
You should use a whitelist and only allow good search engines, such as Google and Bing, to access these sitemap files.
This is a huge problem that I'm afraid most people don't even consider when submitting sitemap files to Google and Bing. I track every request to my XML sitemap files, and I've denied access to over 6,500 IPs since I started doing this (3 months ago). Only Google, Bing, and a few others ever get to view these files now.
Since you are using a whitelist and not a blacklist, they can buy all the proxies they want and they will never get through. Also, you should perform a reverse DNS lookup before you whitelist an IP to make sure it really belongs to Google or Bing. As for how to do this in PHP, I have no idea, as we are a Microsoft shop and only do ASP.NET development. I would start by getting the range of IPs that Google and Bing run their bots out of. Then, when a request comes in from one of those IPs, perform a reverse DNS lookup and make sure "googlebot" or "msnbot" is in the DNS name; if it is, perform a forward DNS lookup against that name to make sure the IP address returned matches the original IP address. If it does, you can safely allow the IP to view your sitemap file; if it doesn't, deny access and 404 the jokers. I got that technique talking to a Google techie, BTW, so it's pretty solid.
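A minimal PHP sketch of that forward-confirmed reverse DNS check might look like the following (the isVerifiedCrawler() name and the allowed host-name suffixes are illustrative, and this version only handles IPv4):

// Forward-confirmed reverse DNS: resolve the IP to a host name, check that the
// host name ends with a known crawler domain, then resolve that name back and
// confirm it maps to the original IP. IPv4 only in this sketch.
function isVerifiedCrawler($ip)
{
    $allowedSuffixes = array('.googlebot.com', '.google.com', '.search.msn.com');
    $host = gethostbyaddr($ip);              // reverse lookup (IP -> host name)
    if ($host === false || $host === $ip) {  // malformed input or no PTR record
        return false;
    }
    $suffixOk = false;
    foreach ($allowedSuffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffixOk = true;
            break;
        }
    }
    if (!$suffixOk) {
        return false;
    }
    $forward = gethostbynamel($host);        // forward lookup (host name -> IPs)
    return $forward !== false && in_array($ip, $forward, true);
}

if (!isVerifiedCrawler($_SERVER['REMOTE_ADDR'])) {
    header('HTTP/1.1 404 Not Found');        // "404 the jokers"
    exit;
}

Check each engine's documentation for the exact host names its crawlers resolve to before relying on a suffix list like this.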
Note, I own and operate a site that does around 4,000,000 page views a month, so for me this was a huge priority as I didn't want my data scraped that easily. Also, I show a reCAPTCHA after 50 page requests from the same IP in a 12-hour period, and that really works well to weed out bots.
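A rough sketch of that kind of per-IP counter in PHP, assuming the APCu extension is available and using a hypothetical showCaptchaPage() helper for the reCAPTCHA challenge:

$ip     = $_SERVER['REMOTE_ADDR'];
$key    = 'hits_' . $ip;
$limit  = 50;               // requests allowed per window
$window = 12 * 3600;        // 12 hours, in seconds

apcu_add($key, 0, $window); // create the counter with a 12-hour TTL if it does not exist yet
$hits = apcu_inc($key);     // atomically increment it

if ($hits !== false && $hits > $limit) {
    showCaptchaPage();      // hypothetical: render the reCAPTCHA challenge instead of the content
    exit;
}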
I took the time to write this post as I hope it will help someone else out and shed some light on what I think is a problem that goes largely unnoticed.
How about not creating sitemap.php on the fly? Instead, regenerate it once a day (or whatever makes sense) and serve it up as a static file. That way, even if 10,000 crawlers a day request it, so what?
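One way to sketch that in PHP, assuming a hypothetical buildSitemapIndex() function that returns the XML string: cache the output to disk and rebuild it only when the cached copy is older than a day, so most requests just get the static file streamed back.

$cacheFile = __DIR__ . '/cache/sitemap-index.xml';   // illustrative path
$maxAge    = 86400;                                  // one day, in seconds

if (!is_file($cacheFile) || filemtime($cacheFile) < time() - $maxAge) {
    // buildSitemapIndex() is hypothetical: whatever currently builds the XML on the fly
    file_put_contents($cacheFile, buildSitemapIndex(), LOCK_EX);
}

header('Content-Type: application/xml; charset=UTF-8');
readfile($cacheFile);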
You could use robots.txt to disallow the file, but you could also block the IPs. A simple way to do this is to look at the HTTP referrers in your web logs and write a cron job to take those IPs (by referrer) and add them to hosts.deny for your website.
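For the robots.txt side, something along these lines (the paths are illustrative) keeps the sitemap disallowed for everyone except the named crawlers; well-behaved bots honor it, but the aggressive scrapers in question typically ignore robots.txt entirely, so treat it as a first layer only:

User-agent: Googlebot
Allow: /sitemap.php

User-agent: bingbot
Allow: /sitemap.php

User-agent: *
Disallow: /sitemap.php
Disallow: /sitemaps/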