Posts Tagged ‘Indexing’

More on Google and robots.txt

Sunday, November 1st, 2009

I spoke to a couple of fellow SEO types about Google’s behaviour in Experiment 4. I’ve almost been persuaded that the behaviour of Google in the circumstances is not so very controversial.

In the experiment, I found evidence that Google was checking URLs that were disallowed in robots.txt, which initially seemed to me to be a breach of the robots protocol.

Here’s what Google says about robots.txt.

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.

Here’s what the site dedicated to the robots.txt protocol says:

The “Disallow: /” tells the robot that it should not visit any pages on the site.

There appears to be a slight discrepancy. Google says it will not crawl or index the content of blocked pages, whereas the robots protocol says that a disallowed robot should not visit those pages at all.

However, Google states that it may index just the URLs if it finds them elsewhere. This leads to the classic “thin result” in Google where you just see the URL and nothing else. It’s quite possible for such thin results not only to appear but to rank in searches, which has been an interesting way of demonstrating the power of anchor text in the past.

Google will presumably not want to index junk URLs. So when it finds URLs via links on other pages, but sees that they are blocked by robots.txt, it appears to send some kind of HTTP request anyway – enough to pick up the status code and confirm that the URL is valid, without fetching the content itself. I'd assume that the approach is as follows (sketched in code after the list):

  • Response: 200 – index a thin result URL
  • Response: 301 – follow the same process with the destination URL, index if not blocked
  • Response: 4xx or 5xx – don’t index, at least for now
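
Here's a minimal sketch of that assumed flow, just to make the reasoning concrete. This is a guess at the logic, not Google's actual implementation, and the requests library stands in for whatever Googlebot really uses:

import requests
from urllib.parse import urljoin

def check_blocked_url(url, max_hops=5):
    """Probe a robots.txt-blocked URL using only the HTTP status line
    and headers, never fetching or parsing any page content."""
    for _ in range(max_hops):
        # HEAD returns the status line and headers only, so no page
        # "content" is ever read.
        response = requests.head(url, allow_redirects=False, timeout=10)
        if response.status_code == 200:
            return ("index thin result", url)   # URL-only listing
        if response.status_code in (301, 302):
            # Repeat the process on the destination; in the assumed flow
            # the destination would also be checked against robots.txt.
            url = urljoin(url, response.headers["Location"])
            continue
        return ("don't index", url)             # 4xx / 5xx
    return ("don't index", url)                 # too many redirect hops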

This would explain the result of the experiment. It seems to me that Google is not quite acting within the spirit of the robots protocol, if this is indeed the case.

The upshot of this is that you have to be very careful about the combination of methods that you are using to restrict access to your pages. It’s well known, for example, that Google will not (or cannot) parse a page-level robots noindex instruction if that page is blocked by robots.txt (because they are respecting robots.txt and not looking at the content of the page). For similar reasons, Google would not be able to “see” a canonical link instruction on a page blocked by robots.txt. However, it seems that they can and will respect an HTTP-level redirect, because this response is not part of the “content” of the page.
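
The distinction is easy to see with a quick illustration (hypothetical, and on the same assumption as above that only the response line and headers are ever examined): a redirect shows up in the headers of a bare HEAD request, while noindex and canonical instructions live in the body that a compliant crawler never fetches.

import requests

url = "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"

resp = requests.head(url, allow_redirects=False, timeout=10)
print(resp.status_code)               # a 301 would be visible here...
print(resp.headers.get("Location"))   # ...along with its destination

# By contrast, <meta name="robots" content="noindex"> and
# <link rel="canonical" ...> live in the HTML body:
#     body = requests.get(url).text
# but that fetch is exactly what robots.txt forbids, so those
# instructions go unseen.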

I wonder if I’m the only person to find this stuff interesting!

Google does not always respect robots.txt… maybe

Tuesday, October 27th, 2009

Here are the results for experiment 4.

To recap: I put a new link on the home page which was blocked by robots.txt. The link was to http://www.search-experiments.com/experiments/exp4/experiment4main.shtml.

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/
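
As a quick local sanity check, Python's standard-library robots.txt parser reads the rule the same way (this only confirms what the file says, not how any given crawler will behave):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.search-experiments.com/robots.txt")
rp.read()

url = "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"
print(rp.can_fetch("*", url))           # False: blocked for all agents
print(rp.can_fetch("Googlebot", url))   # False: the wildcard rule applies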

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:

http://www.search-experiments.com/experiments/exp4/experiment4main.shtml
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up on the Crawl errors page.)

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.
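
Incidentally, a redirect like this can be confirmed from the outside without following it. A quick sketch with the Python requests library (which follows redirects by default, so that is switched off here):

import requests

resp = requests.head(
    "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml",
    allow_redirects=False,
    timeout=10,
)
print(resp.status_code)               # expect 301
print(resp.headers.get("Location"))   # the "destination" page URL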

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.

So, what happened?

Well, Google took its time to reindex the home page of the site (it’s not frequently updated and it’s not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in. The experiment is not, therefore, in a clearly controlled environment. But this seems quite unlikely, unless it has been accessed by some other crawler which has republished the destination URL somewhere, or someone was being very annoying to the point of being malicious. On a site like this, however, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE is reporting any links into the destination page.

There was only one way to find out exactly what had happened – to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately, when I went to the logs to check out exactly what Gbot had been up to, I found that I hadn’t changed the default option in my hosting, which is not to keep the raw logs. So that’s not too smart. Sorry. I’ve got all the stats that I normally need, but none of AWStats, Webalizer or GA gives me the detail that I need here.
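
For what it’s worth, the check itself would have been trivial. Something like this sketch would do it, assuming Apache-style access logs (the filename here is a placeholder):

BLOCKED_PATH = "/experiments/exp4/"

with open("access.log") as log:
    for line in log:
        # Googlebot identifies itself in the user-agent field; any hit
        # on the blocked folder would settle the question.
        if "Googlebot" in line and BLOCKED_PATH in line:
            print(line.rstrip())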

On the balance of probability, however, it seems that Google may be pinging the URLs that you tell it not to access with robots.txt, and checking the HTTP header returned. If it’s a 301, it will follow that redirect, and index the destination page in accordance with your settings for that page.

What’s the practical use of this information? Well, imagine you have blocked certain pages using robots.txt because you are not happy with them being indexed or accessed, and you plan to replace them with pages that you are happy with. In that case, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for those pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.

Ranking versus indexing – update

Friday, August 8th, 2008

Back after a week or so’s total inattention, and an interesting pattern is emerging. With the power of external links beginning to kick in, the home page for the site is now ranking at #5 in the big G for the familiar phrase. However, all of the non-blog pages appear to have disappeared from the index – at least, those few that were there already. None are currently indexed.

I suspect that this is part of a general fluctuation common with new sites, but I’d make the following observations. 

  1. The blog pages aren’t affected by this. Those that are set to be indexed are there in the big G index.
  2. Since the last update I’ve introduced an XML sitemap with all URLs in there (which updates with any new blog posts). So far no beneficial effect for non-blog pages.
  3. Some of the inbound links are showing in G webmaster tools now, but given the ranking, I suspect that all of them have been taken into account. The most powerful link sits on a page that has been crawled since the link was added.
  4. None of the links is from a page relevant to the subject matter. 
  5. Yahoo is not indexing anything but the homepage at present. Site Explorer is however showing links to pages other than the home page. I’m not quite sure how Y can count links to pages that it doesn’t recognise as indexed.

No firm conclusions as yet. Tentative conclusions are:

  • Using a blog platform is better for getting your pages indexed than hand-crafting HTML
  • Possibly this effect is helped by tagging, as other sites collate, aggregate and link to posts based on tagging
  • It’s easier to get one page to rank for a search term than it is to get a suite of pages into the index

However, at present this situation is also making it difficult to draw conclusions about either the CSS/picture experiment or the anchor text/noindex experiment.

Experiment 2 – anchor text and noindex

Tuesday, July 29th, 2008

I’ve begun a new search experiment. At that link you’ll find the “home page” for that experiment, which links to a couple of other pages. These pages, while not identical, are pretty similar in content. Each of them has a link using unusual anchor text to another pair of final destination pages, each of which has a little bit of text and a picture. One of the linking pages is set to “noindex”. 

The intention is to see whether both of the destination pages will rank for a search on the unusual anchor text, and if so, which one ranks the highest. 

The expectation would be that both pages would appear for that search, along with the pages that contain the terms. 

If that is the case, then I won’t put too much weight on the outcome of which one ranks the highest, but it should give us a platform for further iterations.