More on Google and robots.txt

I spoke to a couple of fellow SEO types about Google’s behaviour in Experiment 4. I’ve almost been persuaded that, in the circumstances, that behaviour is not so very controversial.

In the experiment, I found evidence that Google was checking URLs that were disallowed in robots.txt, which initially seemed to me to be a breach of the robots protocol.

Here’s what Google says about robots.txt:

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.

Here’s what the site dedicated to the robots.txt protocol says:

The “Disallow: /” tells the robot that it should not visit any pages on the site.

There appears to be a slight discrepancy: Google says it will not crawl or index the content of blocked pages, whereas the robots protocol says that a blocked robot should not visit those pages at all.
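
For reference, that directive lives in a plain text file called robots.txt at the root of the site. A minimal file that blocks all compliant crawlers from the whole site looks like this:

    User-agent: *
    Disallow: /

A single crawler can be targeted instead by naming it on the User-agent line (Googlebot, for example).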

However, Google states that it may index just the URLs if it finds them elsewhere. This leads to the classic “thin result” in Google where you just see the URL and nothing else. It’s quite possible for such thin results not only to appear but to rank in searches, which has been an interesting way of demonstrating the power of anchor text in the past.

Google will presumably not want to index junk URLs. So when it finds URLs via links on other pages, but finds that they are blocked by robots.txt, it appears to send some kind of HTTP request – not enough to fetch any content, but enough to confirm that the URLs are valid and to pick up the HTTP response. I’d assume the approach is as follows (there’s a code sketch after the list):

  • Response: 200 – index a thin result URL
  • Response: 301 – follow the same process with the destination URL, index if not blocked
  • Response: 4xx or 5xx – don’t index, at least for now
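
To make that concrete, here’s a minimal Python sketch of the decision logic I’m imagining. To be clear, this is speculation about Google’s behaviour rather than anything Google has documented: the function name, the use of HEAD requests and the redirect-hop limit are all illustrative assumptions of mine.

    # Speculative sketch of the status-code handling guessed at above.
    # Nothing here is a documented Google process; names and limits are
    # illustrative only.
    from urllib.parse import urljoin
    import requests

    MAX_REDIRECT_HOPS = 5  # hypothetical safety limit

    def classify_blocked_url(url, hops=0):
        """Decide what to do with a URL that is disallowed by robots.txt
        but has been discovered via links on other pages."""
        # A HEAD request returns the status line and headers without the
        # body, so the page content itself is never fetched.
        response = requests.head(url, allow_redirects=False, timeout=10)
        status = response.status_code

        if status == 200:
            return "index a thin result (URL only, no content)"
        if status == 301 and hops < MAX_REDIRECT_HOPS:
            # Repeat the process with the destination URL; a real crawler
            # would also re-check robots.txt for that destination and
            # index it only if it is not blocked.
            destination = response.headers.get("Location")
            if destination:
                return classify_blocked_url(urljoin(url, destination), hops + 1)
        # 4xx or 5xx (or anything else unexpected): don't index for now.
        return "do not index, at least for now"

Whether the real request is a HEAD, a full GET whose body is discarded, or something else entirely, I obviously can’t say; the point is just that the status line and headers alone are enough to drive the behaviour in the list above.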

If this is indeed the case, it would explain the result of the experiment – though it seems to me that Google is not quite acting within the spirit of the robots protocol.

The upshot of this is that you have to be very careful about the combination of methods that you are using to restrict access to your pages. It’s well known, for example, that Google will not (or cannot) parse a page-level robots noindex instruction if that page is blocked by robots.txt (because they are respecting robots.txt and not looking at the content of the page). For similar reasons, Google would not be able to “see” a canonical link instruction on a page blocked by robots.txt. However, it seems that they can and will respect an HTTP-level redirect, because this response is not part of the “content” of the page.
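
To illustrate why, here’s a short sketch of what a strictly compliant crawler does, using Python’s standard-library robots.txt parser (the example.com URLs are placeholders). If robots.txt disallows a URL, the crawler never requests the page body, so a meta noindex or rel="canonical" element inside that body simply never reaches it; only things carried at the HTTP level – the status code and a Location header – could be picked up by the lighter kind of check speculated about above.

    # A strictly compliant crawler's decision, using Python's standard
    # library. The example.com URLs are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the robots.txt file

    url = "https://example.com/blocked-page/"
    if rp.can_fetch("Googlebot", url):
        # Allowed: fetch the body, then act on any meta robots or
        # rel="canonical" instructions found in the HTML.
        pass
    else:
        # Disallowed: a compliant crawler stops here. A noindex or
        # canonical tag inside the page's HTML is invisible to it,
        # because the HTML is never downloaded.
        pass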

I wonder if I’m the only person to find this stuff interesting!
