New experiment in business search

November 10th, 2010

I’ve been involved in producing a new tag-based business directory for the UK, which went live today. It’s in very early beta and far from perfect, but it has some interesting features even at this early stage.

It’s very fast, and it uses a lot of tags – both for comprehension and for internal linking. I shall be very interested to see what Google and the rest will make of it.

A to Z of Google Instant

September 13th, 2010

Of course, the list below will vary by user and location, but here’s how Google Instant reacted to the alphabet and single numerals. This may have changed by now.

Argos
BBC
Currys
Debenhams
Ebay
Facebook
Google maps
Hotmail
ITV
John Lewis
KLM
Lotto
MSN
Next
O2
Paypal
QVC
Rightmove
Sky
Tesco
Utube
Virgin
Weather
Xbox
Youtube
Zara

192
24
3
4OD
5 day weather
6 music
7zip
8 ball
9 11

This list is from Google UK, as should be apparent. Three of the results return Google’s own products, and guessing “Utube” for “U” (which returns Youtube as the top result) seems a little cheeky.

Other than that, it’s quite a commercial list. Apart from the Google items, the suggestions are mostly brands, and mostly companies that want to sell you something.

Notable exceptions to this are the BBC (which also owns “Weather” and indeed “5 day weather”, at least in my locality), 24 (a Wikipedia link for the TV series), 6 music (more BBC) and 9 11, which goes to the Wikipedia page for the attacks. I’m guessing that last one might be seasonal.

It would be interesting to see how this varies for other users and other territories. It would also be interesting to see the Adwords spends for the companies so favoured…

Canonicals and noindex results

November 1st, 2009

The results of the third experiment are in. This was quite a simple one: to see whether Google would respect a canonical link element on a page that had the noindex robots metatag.

No surprises here, happily. You’d expect Google to read the noindexed page, including the canonical link element, make the adjustment accordingly and index the destination page. That’s exactly what it did. Happy days.

Actually, it did it so quickly (compared with some other canonicals that I’ve implemented elsewhere) that I’m left wondering whether Google might actually be more inclined to pay swift attention to the canonical instruction if the page on which it is found is set not to be indexed. Just speculation, of course.

More on Google and robots.txt

November 1st, 2009

I spoke to a couple of fellow SEO types about Google’s behaviour in Experiment 4, and I’ve almost been persuaded that it is not so very controversial after all.

In the experiment, I found evidence that Google was checking URLs that were disallowed in robots.txt, which initially seemed to me to be a breach of the robots protocol.

Here’s what Google says about robots.txt.

While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web.

Here’s what the site dedicated to the robots.txt protocol says:

The “Disallow: /” tells the robot that it should not visit any pages on the site.

There appears to be a slight discrepancy. Google says it will not crawl or index the content of pages blocked, whereas the robots protocol suggests that blocked agents should not visit the pages.

However, Google states that it may index just the URLs if it finds them elsewhere. This leads to the classic “thin result” in Google where you just see the URL and nothing else. It’s quite possible for such thin results not only to appear but to rank in searches, which has been an interesting way of demonstrating the power of anchor text in the past.

Google will presumably not want to index junk URLs. So when it finds URLs via links on other pages, but those URLs are blocked by robots.txt, it appears to send some kind of HTTP request – enough to confirm that the URL is valid, and enough to pick up the HTTP response code. I’d assume that the approach is as follows:

  • Response: 200 – index a thin result URL
  • Response: 301 – follow the same process with the destination URL, index if not blocked
  • Response: 4xx or 5xx – don’t index, at least for now

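To picture the kind of exchange I’m assuming (this is illustrative only, not captured from any log, and the destination filename is invented), a bare request to the blocked-but-redirected URL would come back with something like:

HEAD /experiments/exp4/experiment4main.shtml HTTP/1.1
Host: www.search-experiments.com

HTTP/1.1 301 Moved Permanently
Location: http://www.search-experiments.com/experiments/experiment4-destination.shtml

The Location header is visible without fetching any page content at all, which, if the reasoning above is right, is all Google needs.
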
This would explain the result of the experiment. It seems to me that Google is not quite acting within the spirit of the robots protocol, if this is indeed the case.

The upshot of this is that you have to be very careful about the combination of methods that you are using to restrict access to your pages. It’s well known, for example, that Google will not (or cannot) parse a page-level robots noindex instruction if that page is blocked by robots.txt (because they are respecting robots.txt and not looking at the content of the page). For similar reasons, Google would not be able to “see” a canonical link instruction on a page blocked by robots.txt. However, it seems that they can and will respect an HTTP-level redirect, because this response is not part of the “content” of the page.
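
To make that concrete, here is a minimal sketch of the combination that catches people out, using made-up example.com paths. The noindex and the canonical below can never take effect, because the directory they sit in is blocked from crawling:

In robots.txt:

User-agent: *
Disallow: /private/

In the head section of /private/old-page.html:

<meta name="robots" content="noindex,follow">
<link rel="canonical" href="http://www.example.com/new-page.html">

A 301 redirect served for /private/old-page.html, by contrast, can still be acted on, because the response headers arrive before any page content.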

I wonder if I’m the only person to find this stuff interesting!

Google does not always respect robots.txt… maybe

October 27th, 2009

Here are the results for experiment 4.

To recap: I put a new link on the home page pointing to a page blocked by robots.txt. The link was to http://www.search-experiments.com/experiments/exp4/experiment4main.shtml.

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:

http://www.search-experiments.com/experiments/exp4/experiment4main.shtml
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up in the Crawl errors page.)

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.
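
For anyone wanting to reproduce the set-up, a 301 of this kind can be a single line in an Apache .htaccess file, something like the following (the destination filename here is purely an illustration, not the real one):

Redirect 301 /experiments/exp4/experiment4main.shtml http://www.search-experiments.com/experiments/experiment4-destination.shtml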

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.

So, what happened?

Well, Google took its time to reindex the home page of the site (it’s not frequently updated and it’s not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in; the experiment is not, therefore, in a clearly controlled environment. But that seems quite unlikely, unless the page has been accessed by some other crawler which republished the destination URL somewhere, or someone was being annoying to the point of being malicious. On a site like this, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE is reporting any links into the destination page.

There was only one way to find out exactly what had happened: to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately… when I went to the logs to check out exactly what Gbot had been up to, I found that I hadn’t changed the default option in my hosting, which is not to keep the raw logs. So that’s not too smart. Sorry. I’ve got all the stats that I normally need, but none of AWStats, Webalizer or GA is giving me the detail that I need here.

On the balance of probability, however, it seems that Google may be pinging the URLs that you tell it not to access with robots.txt, and checking the HTTP header returned. If it’s a 301, it will follow that redirect and index the destination page in accordance with your settings for that page.

What’s the practical use of this information? Well, imagine a situation in which you have blocked certain pages using robots.txt because you are not happy with them being indexed or accessed, and you are planning to replace those pages with pages that you are happy with. In that case, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for those pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.
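
With the raw logs being kept, checking whether Googlebot touched the blocked URL next time should be a one-liner against the access log (the log filename will vary by host):

grep -i googlebot access.log | grep "/experiments/exp4/"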

301 redirect and robots.txt exclusion combined

October 20th, 2009

Experiment 4 is now up on the Search Experiments home page.

What I’m up to here is again pretty simple. I’ve created two pages. The first has been linked to from the home page under Experiment 4, but it has also been blocked by robots.txt (by disallowing the directory in which it resides). To be on the safe side, the robots.txt exclusion was put in place for the directory before the page was even created.

This page, however, will never see the light of day, because it has also been 301-redirected to another page, the “destination” page for the experiment.

Fortunately this blog is so obscure that the destination page is unlikely to receive any other incoming links (please don’t link to it if you’re reading this…).

The hypothesis is that Google will NOT request the URL that robots.txt blocks it from, and so it will NOT discover the 301 redirect; the destination page should therefore not appear in Google’s index. What we should see instead is a snippet-free URL for the original page.

That’s what should happen if my understanding is right. But that’s not necessarily the case. Results will be reported back here.

Canonical link element and noindex robots metatag

October 20th, 2009

I’ve actually explained what I’m doing in this experiment on the page itself, which is here. The set-up is as follows, with a markup sketch after the list:

  • Create two almost identical pages
  • Link to the first one
  • Set the first page to “noindex,follow”
  • Give the first page a canonical link element in the head section, pointing to the second page
  • Set the second page to “index, follow”

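In markup terms, the head sections of the two pages look roughly like this (generic example.com filenames, not the real ones):

Page one (linked from the home page, noindexed, canonicalised):

<meta name="robots" content="noindex,follow">
<link rel="canonical" href="http://www.example.com/experiment-destination.html">

Page two (the canonical target, http://www.example.com/experiment-destination.html):

<meta name="robots" content="index,follow">
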
Then, sit back and wait for Googlebot to work its magic – and see whether the second page makes it into the index. Provided that Google respects the noindex tag, and there’s no good reason why it should not, there should be no chance of the first page being indexed. So the sole question is whether the second page will be.

My expectation, and hope, is that it will, despite not being linked from anywhere else. Further variations on this theme will follow if it does not, and may follow in any case.

New experimental blog

October 13th, 2009

I’ve started a new blog about surnames to see whether the creation of new content will have any effect on another project. The new blog is hand-crafted, i.e. it doesn’t rely on autoblogging; it’s done the old-fashioned way. It’s not a highly controlled experiment, but as I’m checking the relevant rankings anyway, it will be interesting to see whether the linked pages get a greater benefit than those that are not linked.

If it succeeds, I’ll continue with the blog as a permanent tactic; in any event I shall probably try something more along the autoblogging line at a later stage.

WordPress 2.7 and Google Analytics: Google Analyticator plugin review

January 11th, 2009

I’ve previously written about Google Analytics and WordPress 2.7, and all I managed to do really was to show my ignorance. The problem that I’d had was that after installing Google Analytics manually (by inserting the relevant code in the footer), I’d then upgraded to 2.7 automatically, and naively expected it all to work magically.
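
For context, the manual method at the time meant pasting Google’s standard ga.js tracking snippet into the theme’s footer file, with your own account ID in place of the placeholder; it looked roughly like this:

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
pageTracker._trackPageview();
</script>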

I now understand that I am not yet an instinctive WordPress user. The first instinct of an experienced WP blogger looking to install analytics of any kind (or carry out almost any task) would naturally be: “Find me a plugin!”, whereas I’m more used to handbuilding web pages using simple HTML editors.

So when I recently changed the WordPress theme of one of my scratchpad blogs, and knew that as a result the analytics code I had placed in the footer file would have disappeared (changing themes is another way of losing manually added code), I decided to investigate the plugin route.

It didn’t take too long to find the Google Analyticator plugin, which is intended to make installing Google Analytics as simple as possible.

Test blog software version: WordPress 2.7

Installation and activation: worked without hitches of any kind. Following “activation”, you do have to enter the relevant ID from your Google Analytics account and enable tracking, so don’t fall into the trap of thinking that you’re done when you’ve activated the plugin. To be fair, you do get a great big warning at the top of the page letting you know about this.

Options:

a) You can choose to put the code in the footer rather than the header. I would have thought that this should be the default setting, as I’d always want the tracking code to load last on the page: visits where the visitor doesn’t wait for the full page to load before hitting the back button or moving on don’t seem to me to be worth counting. However, as the plugin’s author explains in the settings, apparently not all themes support placing the code in the footer.

b) You can choose whether to exclude visits from logged-in blog admins. My strong recommendation would be to use this, as you don’t want your own visits to the site distorting your traffic. (The more traffic you have, the less important this distortion will be.) A good feature, and one that appears to be working correctly. One warning: the way it works is to leave the tracking code out of the page entirely when you are viewing it as a logged-in admin (a generic sketch of the technique appears after this list). So, if you want to check your pages to see whether the tracking code is present and correct, you’ll need to log out first. That one confused me for a moment or two!

c) You can specify additional tracking code to go before or after the GA code. This allows you access to a range of additional tracking functions in GA. My needs here aren’t yet that sophisticated, but I can confirm that adding the text works as it should.

d) You can choose whether or not to turn on tracking of outbound links.

e) You can specify (by file suffix) any file links that you want to be counted as downloads.

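As promised under (b), here is a minimal, generic sketch of how a plugin can suppress tracking for logged-in admins. This is not the Google Analyticator’s actual code, just the usual WordPress pattern, with an invented function name:

<?php
// Generic sketch only, not the Google Analyticator source.
function my_output_ga_snippet() {
    // Emit nothing for logged-in administrators, so their own
    // visits never reach Google Analytics.
    if ( current_user_can( 'manage_options' ) ) {
        return;
    }
    echo '<!-- Google Analytics tracking snippet goes here -->';
}
// Print the snippet in the footer of every front-end page.
add_action( 'wp_footer', 'my_output_ga_snippet' );
?>
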
Persistence:

The key question for me was whether this useful-seeming plugin would plug the gap that I’d originally been hoping to fill: that is, whether it would maintain the correct analytics code in the right place if I were to upgrade or switch themes. As my test blog for this is on the current 2.7 version of WordPress, I can’t test the upgrade question, but I can see what happens if I switch themes. And I can report that it handles the transition perfectly. I’ll report on how it handles any upgrades at a later stage.

Verdict:

Adding Google Analytics to your WordPress blog is not that complicated a task, but the Google Analyticator makes it even simpler, and also gives an intelligent range of useful options. Congratulations and thanks to the plugin author, especially for taking the time to make it compatible with WordPress 2.7.

Yaab 1.2 and WordPress 2.7 not the best of friends yet

January 5th, 2009

The main test site that I have used for Yaab Autoblogger to date is running WordPress version 2.6. It’s all working pretty sweetly, apart from the occasional apparent double-posting, which I have yet to investigate properly and suspect is my fault or a problem with the feed.

I was keen to try out the new version (1.2) of Yaab, and continue to provide some feedback to its tireless author. As it happened, I’d just bought a new domain on which I’m intending to try out some experiments with autoblogging, autolinking and crosslinking on various subdomains, so I had the opportunity to carry out a fresh install.

Having set up a blog on one subdomain, using WordPress 2.7, I installed the latest version of the Yaab plugin.

Now, 2.7 represents a major change of interface for WordPress, and Yaab was clearly designed to fit in with the previous interface. However, anyone thinking about installing the plugin with 2.7 should be warned that you will not be seeing it at its best!

It will still function, but the changes in 2.7 have messed up Yaab’s attractive interface somewhat. I’m sure this is something that Satheesh will be looking to tidy up (once he has finished his exams), as I think the latest version of WordPress is likely to be very popular.

Now, I’ve broken a golden rule of experimentation by changing two elements at the same time. Fortunately I still have a live blog running 2.6, and I will try installing Yaab 1.2 there in order to give proper feedback on its new features. I might also have a go at installing the earlier version of Yaab on a 2.7 blog somewhere, if I get time for that as well.