Archive for the ‘SEO experiments’ Category

New experiment in business search

Wednesday, November 10th, 2010

I’ve been involved in producing a new tag-based business directory for the UK, which went live today. It’s in very early beta and far from perfect, but it has some interesting features even at this early stage.

It’s very fast, and it uses a lot of tags – both for comprehension and for internal linking. I shall be very interested to see what Google and the rest will make of it.

Canonicals and noindex results

Sunday, November 1st, 2009

The results of the third experiment are in. This was quite a simple one: to see whether Google would respect a canonical link element on a page that had the noindex robots metatag.

No surprises here, happily. You’d expect Google to read the noindexed page, including the canonical link element, make the adjustment accordingly and index the destination page. That’s exactly what it did. Happy days.

Actually, it did it so quickly (compared with some other canonicals that I’ve implemented elsewhere) that I’m left wondering whether Google might actually be more inclined to pay swift attention to the canonical instruction if the page on which it is found is set not to be indexed. Just speculation, of course.

Google does not always respect robots.txt… maybe

Tuesday, October 27th, 2009

Here are the results for experiment 4.

To recap: I put a new link on the home page, pointing to a page that was blocked by robots.txt. The link was to http://www.search-experiments.com/experiments/exp4/experiment4main.shtml.

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:

http://www.search-experiments.com/experiments/exp4/experiment4main.shtml
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up in the Crawl errors page.)
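
If you want to reproduce that check without Webmaster Tools, the same test can be run locally. Here’s a minimal sketch using Python’s standard-library robotparser against the live robots.txt (the user agent string doesn’t matter much here, since the rule is set for User-agent: *):

from urllib.robotparser import RobotFileParser

# Read the live robots.txt and test the blocked URL against it,
# roughly mirroring what the Crawler Access tool reports above.
rp = RobotFileParser("http://www.search-experiments.com/robots.txt")
rp.read()

url = "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"
print(rp.can_fetch("Googlebot", url))  # expected: False (blocked)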

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.

So, what happened?

Well, Google took its time to reindex the home page of the site (it’s not frequently updated and it’s not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in. The experiment is not, therefore, in a clearly controlled environment. But this seems quite unlikely, unless the page has been accessed by some other crawler which has republished the destination URL somewhere, or someone was being very annoying to the point of being malicious. On a site like this, however, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE is reporting any links into the destination page.

There was only one way to find out exactly what had happened – to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately, when I went to the logs to check out exactly what Gbot had been up to, I found that I hadn’t changed the default option in my hosting, which is not to keep the raw logs. So that’s not too smart. Sorry. I’ve got all the stats that I normally need, but none of AWStats, Webalizer or GA gives me the detail that I need here.

On the balance of probability, however, it seems that Google may be pinging the URLs that you tell it not to access with robots.txt, and checking the HTTP header returned. If it’s a 301, it will follow that redirect, and index the destination page in accordance with your settings for that page.
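
For what it’s worth, the header that a crawler ignoring the robots.txt block would see can be checked with a quick HEAD request. This is just a sketch using Python’s http.client (which never follows redirects), not a claim about how Googlebot actually behaves:

import http.client

# Request only the headers of the blocked URL; http.client does not follow
# redirects, so a 301 and its Location header are visible directly.
conn = http.client.HTTPConnection("www.search-experiments.com")
conn.request("HEAD", "/experiments/exp4/experiment4main.shtml")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # expect 301 and the destination URL
conn.close()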

What’s the practical use of this information? Well, imagine a circumstance in which you have blocked certain pages using robots.txt because you are not happy with them being indexed or accessed, and you are planning to replace those pages with pages that you are happy with. In that case, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for those pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.
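
Once the raw logs are being kept, the check itself is trivial. Something along these lines should do it, assuming a standard combined-format access log (the file name below is a placeholder for wherever the host writes it):

# Scan the raw access log for Googlebot requests into the blocked directory.
# "access.log" is a placeholder path; adjust to wherever the host stores logs.
with open("access.log") as log:
    for line in log:
        if "Googlebot" in line and "/experiments/exp4/" in line:
            print(line.rstrip())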

301 redirect and robots.txt exclusion combined

Tuesday, October 20th, 2009

Experiment 4 is now up on the Search Experiments home page.

What I’m up to here is again pretty simple. I’ve created two pages. The first has been linked to from the home page under Experiment 4, but it has also been blocked by robots.txt (by disallowing the directory in which it resides). To be on the safe side, the robots.txt exclusion was put in place for the directory before the page was even created.

This page, however, will never see the light of day, because it has also been 301-redirected to another page, the “destination” page for the experiment.

Fortunately this blog is so obscure that the destination page is unlikely to receive any other incoming links (please don’t link to it if you’re reading this…).

The hypothesis is that Google will NOT request the URL that it is blocked from by robots.txt, and so it will NOT discover the 301 redirect, so the destination page should not appear in Google’s index. What we should see instead is a snippet-free URL for the original page.
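
To make that concrete, here is a rough sketch of the behaviour the hypothesis assumes: consult robots.txt first, and only request (and so only discover the 301 on) URLs that are allowed. It’s an illustration in Python’s standard library, not a description of Googlebot’s internals:

from urllib.robotparser import RobotFileParser
import http.client

# A robots.txt-respecting crawler checks the rules before requesting the URL.
rp = RobotFileParser("http://www.search-experiments.com/robots.txt")
rp.read()

path = "/experiments/exp4/experiment4main.shtml"
if rp.can_fetch("Googlebot", "http://www.search-experiments.com" + path):
    conn = http.client.HTTPConnection("www.search-experiments.com")
    conn.request("HEAD", path)
    print(conn.getresponse().status)  # only here would the 301 be discovered
    conn.close()
else:
    print("Blocked by robots.txt: the redirect is never seen")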

That’s what should happen if my understanding is right. But that’s not necessarily the case. Results will be reported back here.

Canonical link element and noindex robots metatag

Tuesday, October 20th, 2009

I’ve actually explained what I’m doing in this experiment on the page itself, which is here. The set-up is as follows (there’s a rough sketch of the markup involved after the list):

  • Create two almost identical pages
  • Link to the first one
  • Set the first page to “noindex,follow”
  • Give the first page a canonical link element in the head section, pointing to the second page
  • Set the second page to “index, follow”
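
In concrete terms, the two head sections look something like the sketch below. It’s held in Python strings purely for illustration, and the file names and the canonical href are placeholders rather than the real experiment URLs:

# Sketch of the two experiment pages' head sections; the file names and the
# canonical href are placeholders, not the real experiment URLs.
page_one_head = """<head>
<meta name="robots" content="noindex,follow">
<link rel="canonical" href="http://www.search-experiments.com/experiments/exp3/page-two.html">
</head>"""

page_two_head = """<head>
<meta name="robots" content="index,follow">
</head>"""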

Then, sit back and wait for Googlebot to work its magic – and see whether the second page makes it into the index. Really, provided that Google respects the noindex tag, and there’s no good reason why it should not, there should be no chance of the first page making it into the index. So the sole question is whether the second page will make it into the index or not.

My expectation, and hope, is that it will, despite being unlinked from anywhere else. Further variations on this theme will follow if it does not, and may in any case.

New experimental blog

Tuesday, October 13th, 2009

I’ve started a new blog about surnames to see whether the creation of new content will have any effect on another project. The new blog is hand-crafted, ie it doesn’t rely on autoblogging – it’s done the old-fashioned way. It’s not a highly controlled experiment, but as I’m checking the relevant rankings anyway it will be interesting to see whether the linked pages get a greater benefit than those that are not linked.

If it succeeds, I’ll continue with the blog as a permanent tactic; in any event I shall probably try something more along the autoblogging line at a later stage.

Causality part II

Sunday, October 19th, 2008

I left it a little longer than a week, during which time I didn’t change any of the blogs, add or edit any posts, or even check the search results.

Today, the home page of the site is back up to #9 for the search term “search experiments”, which makes the previous change look something like the usual non-random churn.

Could it be that the initial dip was some kind of penalty, but the fact that no new splog posts have been published means that the penalty has diminished? It could, but there are a hundred other possible explanations. Beware of jumping to conclusions.

Causality, splogging and speculation

Thursday, October 9th, 2008

Here is a common issue for anyone involved in SEO or Google-watching, and I suppose in many other areas as well.

You take an action X. Event Y follows. You can build a plausible hypothesis for a connection between X and Y. X therefore caused Y.

The SEO version of this goes: you released new feature/code tweak/section on your website. Traffic went up the following week. New feature was a success! 

My latest version of this runs as follows: I experimented with some splogging on another blog (the relationship with this one is not disguised: they are cross-linked and hosted on the same account). I created a number of automatic posts using RSS-generated content about Google, ensuring that every time the word “search” appeared in the posts, it linked to the home page of this site. Before I did this, the home page of this site was #5 or #6 on Google, which it had been for a while. Today it is #20. So it might be very easy to jump to the conclusion that Google has spotted my nefarious tactics, and has penalised my site. 

Is that a reasonable conclusion based on the evidence?

Update on experiment 1

Wednesday, September 3rd, 2008

It transpires that the whole experiment was somewhat misconceived.

To recap: we were having trouble getting images indexed on a certain part of another live site, and on examining the cache of the pages in Google we noted that the images were not appearing. We then identified a couple of candidate reasons why this might be, isolated them and set up some pages here to test which of the reasons might be causing it. 

We successfully identified the cause of the phenomenon.

However, the underlying assumption – that the absence of the image from Google’s cached version was somehow an indication that Google had not indexed the image – was incorrect, as I discovered when looking again at an offending cached page using another browser (in this case IE), which rendered the image.

I suppose that the experiment worked, but the hypothesis unfortunately died.

These terms only appear in links pointing to this page

Tuesday, August 19th, 2008

OK, so we have an initial result for our second search experiment. This was a five-page experiment, with a home page that linked to two further pages, one with meta robots set to index, the other set to noindex. Each of these pages linked to a destination page, the links having the same anchor text (which was a unique, or at least unusual, portmanteau word). The anchor text did not appear anywhere else on any pages.
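
For clarity, the relevant parts of the two linking pages looked something like this (sketched in Python strings; the anchor word and file names are stand-ins, since the real portmanteau word isn’t being repeated here):

# Stand-in sketch of the two linking pages. "blendword" and the file names
# are placeholders for the real anchor text and URLs used in the experiment.
indexed_linker = """<meta name="robots" content="index">
<a href="destination-one.html">blendword</a>"""

noindexed_linker = """<meta name="robots" content="noindex">
<a href="destination-two.html">blendword</a>"""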

The expectation was that both destination pages would be indexed. I also expected both pages to appear in a search for the anchor text word, but I wasn’t absolutely sure about this. Then, if both destination pages did indeed appear for that word, I was interested to see which ranked better.

It took longer than I expected for all the pages to be indexed, but both destination pages made it in there eventually. The linking page that was set to noindex, of course, is not there.

The indexed linking page is the first result from the site for the anchor text term. This page contains the term, and is further up the site hierarchy. The second result from the site is the second destination page (ie the one linked from the noindex linking page). Google’s cache of that page contains the familiar phrase: “these terms only appear in links pointing to this page”, followed by the anchor text.

The first destination page does not appear in the results. It may do in future, and if it does I will report on its relative performance. But the page linked from the noindex parent was first to show…

This result demonstrates that pages set to noindex are passing link anchor text. This should not be too much of a surprise. From the initial result, it might appear that it is doing so more efficiently than an indexed page. I think that conclusion would not be correct. However, it might be reasonable to assume that it is passing anchor text at least as well as an indexed page.

Some further questions arise:

  • Does Google consider the non-linked textual content of a non-indexed page when determining the relevance of the links from that page?
  • Indeed, does Google treat “noindex” pages exactly the same as other pages in its index – assessing the content, placing them in the link graph etc – with the only difference being that the pages are not returned in SERPs?
  • What difference would there be if the page had been excluded using robots.txt rather than being locally set to noindex?
  • Is it a given that the page containing the anchor text link would rank higher for the phrase than the page linked to, if the page linked to did not itself contain the word? Or in other words, does textual content outrank anchor text?

I don’t think that last one can be true, and I feel another experiment coming on…