I have a website that I use for stuff, but not stuff on the public web. I use it for serving my calendars and private web applications. I use Apache's built-in authentication to keep it from being crawled and to keep casual visitors from wandering in. I have a domain name assigned to it from dyndns.org for convenience. The ddclient script runs on one of my boxes and updates the IP address over there whenever mine changes. The system works very well. Most of the time. Somehow one weekend, while I was out of town, the domain name was left pointing at my old IP address for a while. Whoever had that IP address sure was serving up a lot of nasty stuff. Now Google thinks all that nasty stuff is on my private domain.
I'm going to fix it. I use Google Webmaster Tools for other stuff and I see there's a URL removal tool in there. To use the tool you have to verify that you own the domain - a reasonable request. The thing is, the URLs I want to remove are on a domain that I don't want Google to crawl, and the way Google verifies that you own the domain is by retrieving a specific URL from that domain. What a dilemma.
Luckily Apache access control and authentication are quite flexible and can deal with this handily. My example uses the Apache httpd.conf file but the important directives that I use are available in .htaccess as well. What I really want to do is allow someone to have access to one file and one file only on my website. So my directory section used to look like this:
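A minimal sketch of such a section, assuming Apache 2.2-style access control directives. The directory path, the AuthName text and the location of user.passwd are placeholders, not my actual values:

```apache
# Sketch only: paths and the realm name are hypothetical.
<Directory "/usr/local/apache2/htdocs">
    # Password check: any user listed in user.passwd may log in.
    AuthType Basic
    AuthName "Private"
    AuthUserFile /usr/local/apache2/conf/user.passwd
    Require valid-user

    # Host check: refuse everyone except localhost and the LAN.
    Order deny,allow
    Deny from all
    Allow from 127.0.0 192.168.1

    # Passing either check (password OR address) is enough.
    Satisfy Any
</Directory>
```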
What this section says is to use the Basic authentication method to allow users listed in user.passwd to get at URLs in htdocs. Then it says to deny access to anyone except people on the 127.0.0 or 192.168.1 subnets (localhost and the LAN). Since there are two ways to figure out who gets in - either user name with password or IP address - the Satisfy Any directive says that passing either of the two tests is acceptable. Satisfy All would mean that users had to pass both tests (be on the LAN and have a valid username & password).
Google needs to get in too now. But just a little bit. I added a Files directive inside that Directory to provide an exception for their server.
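A sketch of what those exceptions might look like, again assuming Apache 2.2-style directives and a placeholder directory path. The file names are the ones discussed in this post; the exact directives inside each Files block are my reconstruction, not a verbatim copy of the config:

```apache
<Directory "/usr/local/apache2/htdocs">
    # The existing authentication and Allow/Deny directives stay here.

    # 1. The verification file Google asked for: open to everyone.
    <Files "google1234567890abcdef.html">
        Order allow,deny
        Allow from all
        Satisfy Any
    </Files>

    # 2. The file Google expects NOT to exist: also open, so the
    #    request reaches the filesystem and gets a 404, not a 401.
    <Files "noexist_1234567890abcdef.html">
        Order allow,deny
        Allow from all
        Satisfy Any
    </Files>

    # 3. robots.txt, so crawlers can read it without a password.
    <Files "robots.txt">
        Order allow,deny
        Allow from all
        Satisfy Any
    </Files>
</Directory>
```

Because Satisfy Any only needs one test to pass, "Allow from all" on these three names lets anyone fetch them without a password, while every other URL in the directory keeps the original restrictions.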
The second Files directive is there because Google has to establish that the server doesn't return HTTP code 20x for any old URL. The third one is there because I discovered that I'll need a robots.txt as well.
So they check for a file that should exist and one that shouldn't. The one that should exist is one they asked you to create called google1234567890abcdef.html. They assume that noexist_1234567890abcdef.html should not exist and expect to get a 404 for that. I allow access to both but of course I only created google1234567890abcdef.html (just an empty text file) so noexist_1234567890abcdef.html will 404. The access logs for my server showed me what to look for:
At the time of that request I didn't have the second Files directive. The 401 response to the second request confounds Google and it gave me the error:
Last attempt Sep 25, 2008: Our system has experienced a temporary problem.
After adding the second Files directive (and restarting Apache) I told Google to look again and it worked just fine. For bonus points, I could have used the IP address Google came in from (220.127.116.11) to restrict access more, as in
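A sketch of that tighter variant, assuming Apache 2.2-style directives. Only the address seen in my logs is let in without a password, which also assumes Google keeps verifying from that same address:

```apache
<Files "google1234567890abcdef.html">
    # Only the address Google was seen using skips the password check.
    Order deny,allow
    Deny from all
    Allow from 220.127.116.11
    # Everyone else can still get in with a valid username and password.
    Satisfy Any
</Files>
```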
Google Webmaster Tools has the removal tool at Tools >> Remove URLs. There it states that:
To block a page or image from your site, do one of the following, and then submit your removal request:
To remove your entire site, or a complete directory, use a robots.txt file
Okay, the way I read that, it sounds like I need a robots.txt to get this done quickly. That's because my site would give a 401 Unauthorized instead of 404 Not Found or 410 Gone for these URLs. Sigh.
I guess there's a good side-effect. If I goof up in the future and leave a link out there that lets a bot in, any bot that respects robots.txt will still skip everything on my site, so I'll make one. This is why there's a Files directive for robots.txt in the code above (I'm writing this as I do it). The contents of robots.txt are simple and should block crawling by all robots if they honour it:
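The standard way to say "all robots, stay out of everything" in robots.txt is the following two lines, shown here as a sketch of what the file would contain:

```
# Applies to every crawler that honours robots.txt.
User-agent: *
# Disallow the whole site.
Disallow: /
```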
Oh, and earlier I said this would work with just .htaccess. From Apache's documentation it looks like you'd just put a .htaccess file in the appropriate physical directory and use the Files directives in the same way, but without the Directory part around them.
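I haven't tried it, but a sketch of that .htaccess might look like the following. It assumes Apache 2.2-style directives and that the server's AllowOverride setting permits them in .htaccess files (Limit for Order/Allow/Deny and AuthConfig for Satisfy, as far as I can tell from the documentation):

```apache
# .htaccess in the directory that holds the verification file.
# No <Directory> wrapper: the file's location already implies it.
<Files "google1234567890abcdef.html">
    Order allow,deny
    Allow from all
    Satisfy Any
</Files>

<Files "noexist_1234567890abcdef.html">
    Order allow,deny
    Allow from all
    Satisfy Any
</Files>

<Files "robots.txt">
    Order allow,deny
    Allow from all
    Satisfy Any
</Files>
```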