Archive for the ‘Indexing’ Category

Canonicals and noindex results

Sunday, November 1st, 2009

The results of the third experiment are in. This was quite a simple one: to see whether Google would respect a canonical link element on a page that had the noindex robots metatag.

No surprises here, happily. You’d expect Google to read the noindexed page, including the canonical link element, make the adjustment accordingly and index the destination page. That’s exactly what it did. Happy days.

Actually, it did it so quickly (compared with some other canonicals that I’ve implemented elsewhere) that I’m left wondering whether Google might actually be more inclined to pay swift attention to the canonical instruction if the page on which it is found is set not to be indexed. Just speculation, of course.

More on Google and robots.txt

Sunday, November 1st, 2009

I spoke to a couple of fellow SEO types about Google’s behaviour in Experiment 4. I’ve almost been persuaded that the behaviour of Google in the circumstances is not so very controversial.

In the experiment, I found evidence that Google was checking URLs that were disallowed in robots.txt, which initially seemed to me to be a breach of the robots protocol.

Here’s what Google says about robots.txt.

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.

Here’s what the site dedicated to the robots.txt protocol says:

The “Disallow: /” tells the robot that it should not visit any pages on the site.

There appears to be a slight discrepancy. Google says it will not crawl or index the content of pages blocked, whereas the robots protocol suggests that blocked agents should not visit the pages.

However, Google states that it may index just the URLs if it finds them elsewhere. This leads to the classic “thin result” in Google where you just see the URL and nothing else. It’s quite possible for such thin results not only to appear but to rank in searches, which has been an interesting way of demonstrating the power of anchor text in the past.

Google will presumably not want to index junk URLs. So when it discovers URLs via links on other pages but finds that they are blocked by robots.txt, it seems to be sending some kind of HTTP request – enough to confirm that they are valid, and apparently enough to pick up the HTTP response. I’d assume that the approach is as follows (a rough sketch in code appears after the list):

  • Response: 200 – index a thin result URL
  • Response: 301 – follow the same process with the destination URL, index if not blocked
  • Response: 4xx or 5xx – don’t index, at least for now
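
If that assumption is right, the logic would look something like the sketch below. This is just my guess at the behaviour, written out in Python; the function name and the wording of the outcomes are mine, and none of it comes from anything Google has published.

import http.client
from urllib.parse import urlparse

def ping_blocked_url(url):
    """Send a HEAD request (headers only, no page content) and decide
    what to do from the status code alone."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    status = response.status
    location = response.getheader("Location")
    conn.close()

    if status == 200:
        return "index a thin, URL-only result"
    if status == 301:
        # Run the same check against the redirect target and index it
        # if that URL is not itself blocked.
        return "follow the redirect to " + str(location) + " and reassess"
    # 4xx or 5xx: don't index, at least for now.
    return "do not index"

# The blocked URL from this experiment:
print(ping_blocked_url(
    "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"))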

This would explain the result of the experiment. It seems to me that Google is not quite acting within the spirit of the robots protocol, if this is indeed the case.

The upshot of this is that you have to be very careful about the combination of methods that you are using to restrict access to your pages. It’s well known, for example, that Google will not (or cannot) parse a page-level robots noindex instruction if that page is blocked by robots.txt (because they are respecting robots.txt and not looking at the content of the page). For similar reasons, Google would not be able to “see” a canonical link instruction on a page blocked by robots.txt. However, it seems that they can and will respect an HTTP-level redirect, because this response is not part of the “content” of the page.
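
To make that distinction concrete, here is a rough sketch of which signals a crawler can observe for a URL, depending on whether robots.txt blocks it. This is just my own framing of the behaviour described above, not anything documented by Google, and the signal names are labels I’ve made up.

def observable_signals(blocked_by_robots_txt):
    """Hypothetical helper: which indexing signals stay visible for a URL."""
    # HTTP-level signals (the status code and, for a 301, the Location
    # header) come back on any request, even one made only to validate
    # the URL; they are not part of the page "content".
    signals = {"http_status", "301_location_header"}
    if not blocked_by_robots_txt:
        # Page-level signals live in the HTML itself, which a compliant
        # crawler only fetches when robots.txt allows it.
        signals |= {"meta_robots_noindex", "canonical_link_element"}
    return signals

print(observable_signals(blocked_by_robots_txt=True))
# Only the HTTP-level signals remain; the noindex and canonical are invisible.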

I wonder if I’m the only person to find this stuff interesting!

Google does not always respect robots.txt… maybe

Tuesday, October 27th, 2009

Here are the results for experiment 4.

To recap: I put a new link on the home page, pointing to a page that is blocked by robots.txt. The link was to http://www.search-experiments.com/experiments/exp4/experiment4main.shtml.

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:

http://www.search-experiments.com/experiments/exp4/experiment4main.shtml
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up in the Crawl errors page.)
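
For what it’s worth, the same check can be reproduced locally with Python’s standard-library robots.txt parser, feeding it the two rules above and the blocked URL. (The choice of user agent strings to test is mine; the rules apply to all of them anyway.)

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /experiments/exp4/",
]
url = "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Both should print False, since the Disallow rule applies to every user agent.
print(parser.can_fetch("Googlebot", url))
print(parser.can_fetch("*", url))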

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.

So, what happened?

Well, Google took its time to reindex the home page of the site (it’s not frequently updated and it’s not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in. The experiment is not, therefore, in a clearly controlled environment. But this seems quite unlikely, unless the destination page has been accessed by some other crawler which has republished its URL somewhere, or unless someone was being very annoying to the point of being malicious. On a site like this, however, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE is reporting any links into the destination page.

There was only one way to find out exactly what had happened – to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately… when I went to the logs to check out exactly what Gbot had been up to, I found that I hadn’t changed the default option in my hosting, which is not to keep the raw logs. So that’s not too smart. Sorry. I’ve got all the stats that I normally need, but none of AWStats, Webalizer or GA gives me the detail that I need here.

On the balance of probability, however, it seems that Google may be pinging the URLs that you tell them not to access with robots.txt, and checking the HTTP response that comes back. If it’s a 301, it will follow that redirect, and index the destination page in accordance with your settings for that page.

What’s the practical use of this information? Well, I can imagine a circumstance in which you have blocked certain pages using robots.txt because you are not happy with them being indexed or accessed, and you are planning to replace those pages with pages that you are happy with. In that case, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for those pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.
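
Once the raw logs are being kept, pulling out Googlebot’s requests to the blocked directory should only need something like this sketch. The log location and an Apache-style combined log format are assumptions about my hosting, not things I’ve confirmed.

LOG_FILE = "logs/access.log"            # assumed location of the raw access log
BLOCKED_PREFIX = "/experiments/exp4/"   # the directory blocked in robots.txt

with open(LOG_FILE) as log:
    for line in log:
        if "Googlebot" in line and BLOCKED_PREFIX in line:
            # Each matching line should show the timestamp, the exact URL
            # requested, and the status code returned (200, 301, ...).
            print(line.rstrip())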

301 redirect and robots.txt exclusion combined

Tuesday, October 20th, 2009

Experiment 4 is now up on the Search Experiments home page.

What I’m up to here is again pretty simple. I’ve created two pages. The first has been linked to from the home page under Experiment 4, but it has also been blocked by robots.txt (by disallowing the directory in which it resides). To be on the safe side, the robots.txt exclusion was put in place for the directory before the page was even created.

This page, however, will never see the light of day, because it has also been 301-redirected to another page, the “destination” page for the experiment.

Fortunately this blog is so obscure that the destination page is unlikely to receive any other incoming links (please don’t link to it if you’re reading this…).

The hypothesis is that Google will NOT follow the URL from which it is blocked by robots.txt, and so it will NOT discover the 301 redirect, so the destination page should not appear in Google’s index. What we should see instead is a snippet-free URL for the original page.

That’s what should happen if my understanding is right. But that’s not necessarily the case. Results will be reported back here.

Canonical link element and noindex robots metatag

Tuesday, October 20th, 2009

I’ve actually explained what I’m doing in this experiment on the page itself, which is here. The set-up is as follows (a quick way to verify it in code appears after the list):

  • Create two almost identical pages
  • Link to the first one
  • Set the first page to “noindex,follow”
  • Give the first page a canonical link element in the head section, pointing to the second page
  • Set the second page to “index, follow”
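
And here is that quick way to verify the set-up from outside. It’s only a sketch: the URL below is a stand-in for the real first page, and the expected values in the comments are simply the settings listed above.

from html.parser import HTMLParser
from urllib.request import urlopen

class HeadDirectives(HTMLParser):
    """Collects the robots meta tag and the canonical link element."""
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

# Stand-in URL for the first (noindexed) page of the experiment.
first_page = "http://www.search-experiments.com/experiments/exp3/page1.html"

checker = HeadDirectives()
checker.feed(urlopen(first_page).read().decode("utf-8", errors="replace"))
print("robots meta:   ", checker.robots)     # expect "noindex,follow"
print("canonical href:", checker.canonical)  # expect the second page's URL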

Then, sit back and wait for Googlebot to work its magic – and see whether the second page makes it into the index. Really, provided that Google respects the noindex tag, and there’s no good reason why it should not, there should be no chance of the first page making it into the index. So the sole question is whether the second page will make it into the index or not.

My expectation, and hope, is that it will, despite being unlinked from anywhere else. Further variations on this theme will follow if it does not, and may in any case.

These terms only appear in links pointing to this page

Tuesday, August 19th, 2008

OK, so we have an initial result for our second search experiment. This was a five-page experiment, with a home page that linked to two further pages, one with meta robots set to index, the other set to noindex. Each of these pages linked to a destination page, the links having the same anchor text (which was a unique, or at least unusual, portmanteau word). The anchor text did not appear anywhere else on any pages.
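
For reference, the structure was along these lines. This is only a schematic; the page names and the anchor word are stand-ins, not the real ones.

ANCHOR_WORD = "uniqueportmanteauword"   # stand-in for the real anchor text

experiment_2 = {
    "home": {"links": [("linking_indexed", None), ("linking_noindexed", None)]},
    "linking_indexed": {
        "meta_robots": "index,follow",
        "links": [("destination_1", ANCHOR_WORD)],
    },
    "linking_noindexed": {
        "meta_robots": "noindex,follow",
        "links": [("destination_2", ANCHOR_WORD)],
    },
    "destination_1": {"links": []},
    "destination_2": {"links": []},
}
# The anchor word appears in those two links and nowhere else on any page.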

The expectation was that both destination pages would be indexed. I also expected both pages to appear in a search for the anchor text word, but I wasn’t absolutely sure about this. Then, if both destination pages did indeed appear for that word, I was interested to see which ranked better.

It took longer than I expected for all the pages to be indexed, but both destination pages made it in there eventually. The linking page that was set to noindex, of course, is not there.

The indexed linking page is the first result from the site for the anchor text term. This page contains the term, and is further up the site hierarchy. The second result from the site is the second destination page (ie the one linked from the noindex linking page). Google’s cache of that page contains the familiar phrase: “these terms only appear in links pointing to this page”, followed by the anchor text.

The first destination page does not appear in the results. It may do in future, and if it does I will report on its relative performance. But the page linked from the noindex parent was first to show…

This result demonstrates that pages set to noindex are passing link anchor text. This should not be too much of a surprise. From the initial result, it might appear that they are doing so more efficiently than an indexed page. I think that conclusion would not be correct. However, it might be reasonable to assume that they are passing anchor text at least as well as an indexed page.

Some further questions arise:

  • Does Google consider the non-linked textual content of a non-indexed page when determining the relevance of the links from that page?
  • Indeed, does Google treat “noindex” pages exactly the same as other pages in its index – assessing the content, placing them in the link graph and so on – with the only difference being that the pages are not returned in SERPs?
  • What difference would there be if the page had been excluded using robots.txt rather than locally set to noindex?
  • Is it a given that the page containing the anchor text link would rank higher for the phrase than the page linked to, if the page linked to did not itself contain the word? Or in other words, does textual content outrank anchor text?

I don’t think that last one can be true, and I feel another experiment coming on…

Bad CSS to blame for non-caching of images

Sunday, August 10th, 2008

The first SEO experiment on the main site was intended to determine which of two possible code faux pas was more likely to be the cause of images not showing up in Google’s Image search results, a problem that had occurred on another site – which is why the test was a little specific in nature, and not very generic.

On examining Google’s cache of the pages in question, it was clear that the main images on those pages were not appearing. Looking at the code, two possible culprits suggested themselves. 

Firstly, in a rather messy way, classes and ids were being used interchangeably as style selectors for divisions (“divs”), and although there were no repeated ids, there was a div with a particular id, which was then referenced as a class in another, nested div.

<div id="blah">
  <div class="blah">
    [picture and other content]
  </div>
</div>

Not invalid HTML, but messy.

The other candidate was some strange-looking CSS code, apparently designed to get over some problem with rendering in IE6 (which may itself have been caused by the messy HTML…)

.hack {
	color: blue;
	font-size: 18px;
	height: 1%;
	overflow: hidden;
	}

It’s the last two lines, obviously, that are the candidates for causing issues. This CSS validates, and the pages render as expected in all browsers that I have tried. Browsers are very forgiving, however…

So, I recreated pages with these problems, including controls and permutations with the different errors.

The conclusion is that it is the CSS hack that is causing the images not to render in Google’s cache. It’s too early to tell whether this is also having an effect on the indexing of these images, because none of the images is yet indexed.

The cache for the badly nested divs page shows the picture, whereas the cache for the CSS-hacked test page does not render the picture.

Does this mean that Google is excluding certain types of “hidden” content, or does it mean that its internal “browser” for rendering its cached pages is a bit more strict about rendering pages accurately? Only when the pages have settled in the index and the images on the test pages have made it (or not) into the image search results will we be able to speculate more intelligently on this.

Ranking versus indexing – update

Friday, August 8th, 2008

Back after a week or so’s total inattention, and an interesting pattern is emerging. With the power of external links beginning to kick in, the home page for the site is now ranking at #5 in the big G for the familiar phrase. However, all of the non-blog pages appear to have disappeared from the index – at least, those few that were there already. None are currently indexed.

I suspect that this is part of a general fluctuation common with new sites, but I’d make the following observations. 

  1. The blog pages aren’t affected by this. Those that are set to be indexed are there in the big G index.
  2. Since the last update I’ve introduced an XML sitemap with all URLs in there (which updates with any new blog posts). So far it has had no beneficial effect for non-blog pages.
  3. Some of the inbound links are showing in G webmaster tools now, but I suspect from the ranking that all of them have been taken into account. The most powerful link is on a page that has been crawled since the link was introduced.
  4. None of the links is from a page relevant to the subject matter.
  5. Yahoo is not indexing anything but the homepage at present. Site Explorer is, however, showing links to pages other than the home page. I’m not quite sure how Y can count links to pages that it doesn’t recognise as indexed.

No firm conclusions as yet. Tentative conclusions are:

  • Using a blog platform is better for getting your pages indexed than hand-crafting HTML
  • Possibly this effect is helped by tagging, as other sites collate, aggregate and link to posts based on tagging
  • It’s easier to get one page to rank for a search term than it is to get a suite of pages into the index

However, at present this fluctuation is also making it difficult to draw conclusions about either the CSS/picture experiment or the anchor text/noindex experiment.

The importance of a controlled environment

Thursday, July 24th, 2008

The meta-experiment relating to indexing is over, having fallen victim to a failure to maintain a hermetically sealed environment for the experiment.

The idea was to see whether pages from the site would be indexed by Google when they had no external links and no submission to Google had been made.

Despite the fact that no submission has been made and no links sought or set up, one has crept through.

It seems that Technorati have some detail about this blog, presumably through some hook-up with WordPress. The relevant Technorati pages don’t currently appear in the Goo index, but this aggregator is picking up blog posts from Technorati with certain tags, in this case “w3c”, and publishing them.

All very interesting in itself, but it does rather blow the intended experiment. Which just goes to show how hard it is to maintain a hermetic environment for experiments on the web.

Anyhow, now that it’s blown, I can pump in a bit of link juice from elsewhere – I need the pages to be indexed for current and future experiments.

Using Google tools

Wednesday, July 23rd, 2008

As part of the “how little can I do and still get indexed” experiment, I’ve added both Google Analytics and Google Webmaster Tools to the site, to see if these alone will inspire the big G to index its pages.

Expected outcome: not in the index

Next steps: Use the Google “Add URL” tool. I’m going to give this a couple of days though. 

Incidentally, if anyone thinks that a couple of days is not enough time to wait, I’m pretty confident that I could get indexed in 24 hours if I was in a rush – I will need to have this site indexed for other experiments in future, so I’m not prepared to wait indefinitely…