Google does not always respect robots.txt… maybe

Here are the results for experiment 4.

To recap: I put a new link on the home page which was blocked by robots.txt. The link was to

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up in the Crawl errors page.)

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.

So, what happened?

Well, Google took it’s time to reindex the home page of the site (it’s not frequently updated and it’s not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in. The experiment is not, therefore, in a clearly controlled environment. But this seems quite unlikely, unless it has been accessed by some other crawler which has republished the destination URL somewhere, or someone was being very annoying to the point of being malicious. On a site like this, however, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE are reporting any links into the destination page.

There was only one way to find out exactly what had happened – to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately…  when I went to the logs to check out exactly what Gbot had been up to, I found that I hadn’t changed the default option in my hosting, which is not to keep the raw logs. So that’s not too smart. Sorry. I’ve got all the stats that I normally need, but neither AWStats, Webalizer or GA are giving me the detail that I need here.

On the balance of probability however, it seems that Google may be pinging the URLs that you tell them not to access with robots.txt, and checking the HTTP header returned. If it’s a 301, it will follow that redirect, and index the destination page in accordance with your settings for that page.

What’s the practical use of this information? Well, I can imagine a circumstance in which you have blocked certain pages using robots.txt because you are not happy with them being indexed or accessed, and you are planning to replace those pages with pages that you are happy with, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for those pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.

Tags: , , , , , ,

Leave a Reply

You must be logged in to post a comment.