Posts Tagged ‘SEO’

Google does not always respect robots.txt… maybe

Tuesday, October 27th, 2009

Here are the results for experiment 4.

To recap: I put a new link on the home page which was blocked by robots.txt. The link was to http://www.search-experiments.com/experiments/exp4/experiment4main.shtml.

Before even creating this page, I blocked all pages in that folder in robots.txt.

Here’s the very text that appears there:

User-agent: *
Disallow: /experiments/exp4/

Google Webmaster Tools confirms that the page is blocked when I use its Crawler Access tool:

Test results:

http://www.search-experiments.com/experiments/exp4/experiment4main.shtml
Blocked by line 3: Disallow: /experiments/exp4/

(However, it’s not yet showing up in the Crawl errors page.)

Then I put a 301 redirect in place on the page, redirecting to my “destination” page.

If Google properly respects robots.txt, then it should not request the blocked page. If it doesn’t request the blocked page, it shouldn’t find the 301 redirect to the destination page.

As that destination page is not linked to from anywhere else, that page should never appear in the index.
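(It's easy enough, by the way, to confirm from the outside that the blocked page really is answering with a 301 rather than a 200. The following is a minimal Python sketch, using the third-party requests library, that asks for the URL without following redirects; it's just a sanity check on the setup, not part of the experiment itself.)

# Sanity-check sketch: request the blocked URL without following redirects,
# so we see the raw status code and Location header that a crawler's first
# request would get. Uses the third-party "requests" library.
import requests

BLOCKED_URL = "http://www.search-experiments.com/experiments/exp4/experiment4main.shtml"

resp = requests.head(BLOCKED_URL, allow_redirects=False)
print(resp.status_code)              # should be 301 if the redirect is in place
print(resp.headers.get("Location"))  # the "destination" page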

So, what happened?

Well, Google took its time to reindex the home page of the site (it's not frequently updated, and it's not exactly a high-traffic site). But it did get around to it eventually.

And the destination page has also been indexed.

Now, it is of course possible that some other site has linked directly to the destination page, thereby giving Google an alternative and legitimate route in. The experiment is not, therefore, in a perfectly controlled environment. But that seems quite unlikely, unless the page has been accessed by some other crawler which has republished the destination URL somewhere, or someone was being annoying to the point of malice. On a site like this, with its minuscule readership, I think the chances of the latter are remote. Incidentally, neither Google nor YSE (Yahoo Site Explorer) is reporting any links into the destination page.

There was only one way to find out exactly what had happened: to look at the raw server logs for the period and see whether Google had indeed pinged the blocked URL. Unfortunately, when I went to the logs to check out exactly what Googlebot had been up to, I found that I hadn't changed the default option in my hosting, which is not to keep the raw logs. So that's not too smart. Sorry. I've got all the stats that I normally need, but none of AWStats, Webalizer or GA gives me the detail that I need here.
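For the record, here is roughly the check I would have run had the logs been kept: a quick Python sketch that scans a standard Apache combined-format access log (the filename is just a placeholder) for Googlebot requests to the blocked directory, printing the status code returned for each one. A 301 against the blocked URL in that output would have settled the question directly.

# Rough sketch only: scan a combined-format access log for Googlebot
# requests to the blocked directory. "access.log" is a placeholder filename.
BLOCKED_PATH = "/experiments/exp4/"

with open("access.log") as log:
    for line in log:
        if "Googlebot" in line and BLOCKED_PATH in line:
            parts = line.split('"')
            request = parts[1]            # e.g. GET /experiments/exp4/experiment4main.shtml HTTP/1.1
            status = parts[2].split()[0]  # e.g. 301
            print(status, request)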

On the balance of probability, however, it seems that Google may be pinging the URLs that you tell it not to access with robots.txt and checking the HTTP header returned. If it's a 301, it will follow that redirect and index the destination page in accordance with your settings for that page.

What’s the practical use of this information? Well, I can imagine a circumstance in which you have blocked certain pages with robots.txt because you are not happy with them being indexed or accessed, and you are planning to replace them with pages that you are happy with. In that situation, you shouldn’t rely on Google continuing to respect the robots.txt exclusion once you have arranged for the old pages to be redirected.

What’s the next step? Well, I’ve enabled the logs now, and will run a similar experiment in the near future.

301 redirect and robots.txt exclusion combined

Tuesday, October 20th, 2009

Experiment 4 is now up on the Search Experiments home page.

What I’m up to here is again pretty simple. I’ve created two pages. The first has been linked to from the home page under Experiment 4, but it has also been blocked by robots.txt (by disallowing the directory in which it resides). To be on the safe side, the robots.txt exclusion was put in place for the directory before the page was even created.

This page, however, will never see the light of day, because it has also been 301-redirected to another page, the “destination” page for the experiment.

Fortunately this blog is so obscure that the destination page is unlikely to receive any other incoming links (please don’t link to it if you’re reading this…).

The hypothesis is that Google will NOT request the URL from which it is blocked by robots.txt, and so it will NOT discover the 301 redirect, so the destination page should not appear in Google’s index. What we should see instead is a snippet-free URL for the original page.

That’s what should happen if my understanding is right. But that’s not necessarily the case. Results will be reported back here.

Splogging and search

Sunday, October 5th, 2008

I’ve been experimenting with the WordPress plugin WP-o-Matic on another blog of late. In combination with the SimplePie plugin, it allows you to automatically post to blogs using RSS feeds. 

The plugin allows you to create campaigns, into which you can place multiple RSS feeds – or just a single one if you prefer. For each campaign, you allocate a category, and the plug-in will post items from the feed as individual blog posts categorised accordingly. 

You can control how often each campaign checks the feed for new items, although I’ve had some teething problems getting this to work exactly as I would like. Ideally, you would organise this so that stories are published on a drip-feed basis pretty close to their publication dates, which means setting the check frequency to roughly the rate at which new items appear.

Incidentally, I’ve also had some difficulty getting the campaigns to refresh. I think it is something to do with being a bit new to cron jobs. More on that later.

So, why would you want to republish someone else’s RSS feeds as if they were your own blog posts? Isn’t this (a) a rather unethical theft of content and (b) unlikely to do you any good for search optimisation, as it will all be duplicate content?

I’ll leave the ethical questions for another time – for now, let’s just remember that the second S in RSS stands for “syndication”.

So, what possible benefits, including SEO benefits, could flow from republishing this material? The idea of each item in an RSS feed being reproduced as a new, individual post is definitely just dupe content spam, right?

Not really. There are all kinds of possible legitimate uses for this. For example, you might want to do some judicious selection of RSS feeds, perhaps filtered automatically as well, and combine them so that your particular blog carried every story that you thought was going to be of interest to your audience. Provided that the posts have links to the original story, your users could be reading the truncated RSS summary in your blog and then deciding whether to go to the full post.
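WP-o-Matic does all of this in PHP on top of SimplePie, but the underlying idea (fetch a few feeds, keep the items that match your theme, and republish them as posts linking back to the originals) is simple enough to sketch. Here is a rough Python illustration using the feedparser library; the feed URLs and keyword are just placeholders, and this is my sketch of the idea rather than anything the plugin actually does.

# Illustrative sketch only: fetch a couple of feeds, keep items matching a
# theme keyword, and print what would become the republished posts.
# Feed URLs and the keyword are placeholders; WP-o-Matic itself does this
# in PHP via SimplePie.
import feedparser

FEEDS = [
    "http://example.com/feed-one.rss",
    "http://example.com/feed-two.rss",
]
KEYWORD = "search"  # only keep items mentioning this

for feed_url in FEEDS:
    for entry in feedparser.parse(feed_url).entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        if KEYWORD in text:
            # In the real plugin each of these would become a categorised
            # blog post, with a link back to the original story.
            print(entry.get("title"), "->", entry.get("link"))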

Another possibility is that you effectively own the RSS feed – for example, it could be something like your del.icio.us feed, which you want to turn into a linkblog without doing any extra work, with a post created for each item.

However, from an SEO point of view there are some further uses.

First, although the posts themselves will not be unique, the permutation of them may well be, so that your main page – and in particular your category pages – can contain themed content in a combination that is not to be found elsewhere on the web. If reasonably well-linked, these pages could have a chance of ranking for those terms.

Second, there is a very nice feature in the plug-in that allows you to process the feeds as they come in using a search and replace function.

This is separated into two functions for ease of use: the first is a simple word-swap. The example that the author gives is that you could have the plugin search for “ass” and replace it with “butt”. Incidentally, this kind of auto-bowdlerisation is a risky business – witness the embarrassment of the right-wing Christian site that decided that “gay” was too euphemistic (and happy-sounding) for them, and then ended up publishing a number of stories about the Olympic sprinter “Tyson Homosexual”.

The second element enables you to automatically place links behind certain specified words/phrases. This is obviously pretty powerful for building lots of links with the right anchor text, quite quickly.
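Neither of these needs anything clever under the hood: they amount to a couple of substitutions run over each incoming item. Here is a rough Python sketch of the idea (not the plugin’s actual code; the swap words, link phrase and URL below are placeholders):

# Sketch of the two rewriting steps described above, in Python rather than
# the plugin's PHP. The swap words, link phrase and URL are placeholders.
import re

def word_swap(text, old, new):
    # Whole-word replacement, so "class" doesn't get mangled when swapping
    # "ass" for "butt" (it can't save you from genuine whole-word
    # collisions like "Tyson Gay", though).
    return re.sub(r"\b%s\b" % re.escape(old), new, text, flags=re.IGNORECASE)

def auto_link(text, phrase, url):
    # Wrap every occurrence of the phrase in a link with that anchor text.
    pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
    return pattern.sub(lambda m: '<a href="%s">%s</a>' % (url, m.group(0)), text)

post = "Yet another story about digital cameras."
post = word_swap(post, "story", "piece")
post = auto_link(post, "digital cameras", "http://example.com/")
print(post)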

I’m not sure whether the two would work together – I will give it a go – but on the assumption that they do, it would be possible to pick a news feed filtered on, say, Barack Obama, and republish all of those stories with the words “digital cameras” automatically replacing “Barack Obama”, and linking to your digital cameras site. You might even avoid some of the duplicate-spotting in this way…

Warning: doing very much of this kind of thing is pretty likely to get your site banned by Google.

Effects of taggregation, plus status updates

Friday, August 15th, 2008

I am a little surprised to find that the blog home page was briefly #2 (now #3) in Google for the phrase “search experiments”, and that the site home page is #2 in Yahoo (in each case, the UK varieties). Despite this apparent “success” (I don’t think that the term has driven any search visitors to the site), there remain pages of the site resolutely unindexed.

Google

The preference that Google is showing for the blog home page is also interesting, and it is worth looking into why this might be, particularly because the links that I have created are all to the website home page. (All the pages on the site link to the blog home, and all the pages/posts on the blog link back to the site home.)

So what is going on with Google here? A link: operator search returns no results, but Webmaster Tools credits the site overall with 39 external links. Eight of these are to the home page, the rest to blog pages. The eight, which I set up, are from a couple of other blogs, one of which is totally weak and the other fairly weak.

The links to blog pages are mostly from Technorati, and all Technorati links are from pages aggregating all blogs with particular tags. The other links look as if they are doing something similar, probably with material taken or scraped from Technorati.

There’s good cross-linking between the blog and the other site pages: all links from blog pages to the main site home page use the phrase, and conversely, all links from the non-blog pages to the blog home include the phrase.

So, crosslinking should pretty much cancel itself out in relation to relative ranking. Which leads to an interesting tentative hypothesis: that simply blogging and using tags can garner external links – from aggregator pages – that are as powerful as hand-edited links from existing sites.

I do have one reasonably powerful incoming link set up (from the home page of a five-year-old site with thousands of organic links), but this is not yet showing up as an external link in Webmaster Tools. (This link is to the home page, not the blog.)

OK, it could of course be passing PR without showing up in Webmaster Tools. I shall keep an eye out to see whether the relative ranking changes, and when the link shows up in Webmaster Tools.

Yahoo

In Yahoo, it’s the site home page that is showing up in the rankings. The blog home page is nowhere to be seen; indeed, Site Explorer doesn’t recognise it among the six pages that it currently lists.

However, Site Explorer is giving credit for the one relatively powerful link to the site.

Observations and predictions

1) The blog home page being “ahead” of the home page in Google rankings seems to suggest that the links garnered by tag aggregation – I am disappointed but not wholly surprised to discover that the word “taggregation” has already been coined – may have a significant role to play in getting content indexed and ranked. I will not put it more strongly than that at present. It may be worth experimenting with a new blog, unlinked elsewhere, to test this hypothesis – by watching how it performs up to the point that someone manually links to it.

2) Having a top 3 result for a plausible if specialised phrase does not necessarily generate traffic.

3) Google is more interested in blog content than Yahoo is (?)

Prediction: when Webmaster Tools shows the strong site in the external links, the home page for the site will outperform the blog home page in Google. 

Thinking about it, the other possible reason that the blog home page may be outperforming the home page is content – there’s typically a lot more content on the blog page and (obviously enough) the phrase “search experiments” gets mentioned all the time on it.