Internet Marketing Coach

An In-Depth robots.txt Guide

January 18th, 2008 | 8 comments

robots.txt - What Is It and Why Should I Use It?

The robots.txt file is a simple text file kept in a site’s root folder (yoursite.com/robots.txt) that contains instructions for web spiders (not just search engines) as to which files and/or folders the spider should not access.

There are several reasons to use a robots.txt file, including, but not limited to, privacy, reducing duplicate content and keeping spiders away from sensitive directories or files. This file is especially useful with CMS platforms such as WordPress, as their structure creates duplicate pages by default.

How to Create and Use a robots.txt File Correctly

Whether you’re running a blog or a static site, you most likely already have a robots.txt file. If you haven’t touched it, it is probably blank, or it restricts a file that your site doesn’t even contain just so the search engines know it’s there. Some people believe that is a good thing, but it’s a misconception: having a robots.txt file that doesn’t restrict anything is neither good nor bad.

- Creation

To create this file, just open a new text file with Notepad, WordPad or any other basic text editor and save the blank file to the root folder of your server. Of course, you’re going to name it robots.txt.

To ensure it’s there, just go to yoursite.com/robots.txt. You should see a blank page if you haven’t added anything to it yet.
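
As a rough sketch of the creation step, the Python snippet below writes an empty robots.txt into a temporary folder and checks it with the standard library’s urllib.robotparser to confirm that a blank file restricts nothing. The temporary folder is only a stand-in for your site’s root folder, and the page URL is a placeholder; neither comes from a real setup.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

# Assumption: a temporary folder stands in for the site's root folder.
site_root = Path(tempfile.mkdtemp())
robots_file = site_root / "robots.txt"
robots_file.write_text("")  # save a blank robots.txt

# Parse the blank file the way a well-behaved spider would.
parser = RobotFileParser()
parser.parse(robots_file.read_text().splitlines())

# With no rules in the file, nothing is restricted for any spider.
blank_allows_all = parser.can_fetch("Googlebot", "http://yoursite.com/any-page.html")
print(blank_allows_all)
```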

- Setting Controls

Now that your file is created you need to set controls so it can do its job. This is very easy.

First you need to tell it which bot (spider) you want to restrict. To do this, just insert the following into the file. For this example we’re restricting Google Search:

User-Agent: Googlebot

To restrict all spiders you can use a wild-card:

User-Agent: *

Now we need to tell it which page or directory we want to restrict Google from accessing. To do this we’ll just add that page or directory:

User-Agent: Googlebot
Disallow: /mybuttpictures.html

I don’t want pictures of my butt to be accessed by the Google bot (fictional page) :) so I just restrict it.
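
To see how a spider reads a rule like this, here is a minimal sketch using Python’s standard urllib.robotparser module. It assumes the two-line file above; "SomeOtherBot" is a made-up spider name used only to show that the rule applies to Googlebot alone.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as in the example above.
rules = """\
User-Agent: Googlebot
Disallow: /mybuttpictures.html
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://yoursite.com/mybuttpictures.html"

# Googlebot is named in the file, so the page is off limits to it.
googlebot_allowed = parser.can_fetch("Googlebot", page)

# "SomeOtherBot" is a hypothetical spider; no rule names it, so it is not restricted.
otherbot_allowed = parser.can_fetch("SomeOtherBot", page)

# Pages that are not listed stay accessible to Googlebot.
home_allowed = parser.can_fetch("Googlebot", "http://yoursite.com/index.html")

print(googlebot_allowed, otherbot_allowed, home_allowed)
```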

You can do the same with directories:

User-Agent: Googlebot
Disallow: /blog/

You can also use a wild-card, which restricts every URL that starts with the path in front of the asterisk:

User-Agent: Googlebot
Disallow: /blog/*

This will restrict everything within the blog directory.
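
Keep in mind that a wild-card inside a path is an extension that Googlebot understands; not every robots.txt parser does (Python’s built-in urllib.robotparser, for one, treats rule paths as plain prefixes). The sketch below is a hand-rolled illustration of how such a pattern can be matched against a URL path. The function name rule_matches is my own invention and this is not Google’s actual implementation.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern covers the given URL path.

    Illustrative only: '*' stands for any run of characters, and the
    pattern is matched from the start of the path (a prefix match).
    """
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex, path) is not None

# '/blog/*' and '/blog/' cover the same URLs, because rules match by prefix.
print(rule_matches("/blog/*", "/blog/some-post.html"))
print(rule_matches("/blog/", "/blog/some-post.html"))
print(rule_matches("/blog/*", "/about.html"))
```

In other words, for a whole directory the plain Disallow: /blog/ form is enough, and it is the form more spiders are likely to understand.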

To restrict multiple pages and/or directories just add them directly beneath one another:

User-Agent: *
Disallow: /blog/
Disallow: /page34234.html
Disallow: /mybutt.jpg
Disallow: /anotherdirectory/
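
A quick way to sanity-check a list like this before uploading it is to run a handful of paths through a parser. This is a small sketch with Python’s urllib.robotparser; the helper name blocked_paths and the sample paths are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-Agent: *
Disallow: /blog/
Disallow: /page34234.html
Disallow: /mybutt.jpg
Disallow: /anotherdirectory/
"""

def blocked_paths(rules, user_agent, paths):
    """Return the paths that the given rules restrict for user_agent."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]

# Hypothetical paths: four that should be restricted and two that should not.
candidates = [
    "/blog/first-post.html",
    "/page34234.html",
    "/mybutt.jpg",
    "/anotherdirectory/file.txt",
    "/index.html",
    "/contact.html",
]

blocked = blocked_paths(RULES, "Googlebot", candidates)
print(blocked)
```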

To restrict your entire Website just use the following:

User-Agent: *
Disallow: /

Things to Consider

1. The robots.txt file should always be contained within the main (root) directory of a site. For instance, this blog is hosted in a sub-directory, so in order to restrict URLs on this blog I use ez-onlinemoney.com/robots.txt rather than ez-onlinemoney.com/blog/robots.txt.

2. In a recent interview of Matt Cutts done by Eric Enge (found at Andy Beard’s Blog) Matt explains how a page which has been restricted by robots.txt will not be spidered by Googlebot, but it can still accrue PageRank and be returned in their search results.

3. The robots.txt file is viewable by everyone. I have heard of ways to hide it from visitors, but I don’t have any insight on that. I don’t know if the techniques I’ve seen are still friendly to the search engines, so I haven’t tried them.

People can see what you’re hiding from the search engines, so always keep that in mind! For instance, using robots.txt to restrict search engines from indexing a download that you sell wouldn’t be a good idea ;)
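
Point 1 above comes down to a simple rule: whatever page you start from, the robots.txt that applies sits at the root of the same host. A minimal sketch of that rule in Python follows; the function name robots_url and the example post path are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# A page inside the /blog/ sub-directory is still governed by the root file.
print(robots_url("http://ez-onlinemoney.com/blog/some-post/"))
```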

Always be Careful!

robots.txt has the ability to do some great things, but it can also hurt a site very quickly! An easy way to see if you’re restricting only the pages/directories you want is by analyzing your robots.txt with the tool provided within your Google Webmaster Tools account.

Also, although robots.txt is a good tool and one that should be used, it is not 100% effective all of the time. You should also add the rel="nofollow" attribute to the links you don’t want spidered. If you absolutely do NOT want those pages returned, insert a META robots tag into the page(s) as well (noindex is the value that asks search engines to keep a page out of their results; nofollow only covers the links on that page) and add rel="nofollow" to all of the known links that point to the page(s).
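
If you want to check your own pages for links that still lack the attribute, something like the following could be a starting point. It is only a sketch built on Python’s standard html.parser module; the class name, the function name and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser

class NofollowAudit(HTMLParser):
    """Collect links to target_path that lack rel="nofollow"."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # rel can hold several space-separated values, e.g. "external nofollow".
        rel_values = (attributes.get("rel") or "").lower().split()
        if href.endswith(self.target_path) and "nofollow" not in rel_values:
            self.missing.append(href)

def links_missing_nofollow(html, target_path):
    """Return the hrefs pointing at target_path without rel="nofollow"."""
    audit = NofollowAudit(target_path)
    audit.feed(html)
    return audit.missing

# Hypothetical page fragment: one unprotected link, one protected, one unrelated.
sample_html = (
    '<a href="/mybuttpictures.html">pictures</a>'
    '<a rel="nofollow" href="/mybuttpictures.html">pictures again</a>'
    '<a href="/index.html">home</a>'
)

print(links_missing_nofollow(sample_html, "/mybuttpictures.html"))
```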

8 comments

  1. Stumble into the Weekend 01/18
    19th January, 2008 at 1:38 am 

    [...] If you’re not sure what I’m talking about, just read Josh Spaulding’s “In-Depth robots.txt Guide“. Don’t neglect [...]

  2. Maurice (TheCaymanHost) (60 comments.)
    20th January, 2008 at 12:52 am 

    Good stuff Josh - a nicely written explanation.

    There are several free robots.txt generators available online - I prefer Yellowpipe.com which gives anyone a good starting point and it’s as easy as filling in the blanks on a form.

    It also offers the option to exclude known “bad bots” which is useful. Either way, it’s a nice simple way for people to build an error free robots.txt file, which is great for techno morons like me :0)

  3. Best Posts for the Week of January 14th 2008
    21st January, 2008 at 4:57 am 

    [...] An In-Depth robots.txt Guide. [...]

  4. Josh Spaulding
    21st January, 2008 at 2:13 pm 

    Thanks Maurice,

    I’ve never heard of a robots.txt generator, but that sounds like a good option for those who may be intimidated by them, although I don’t see a reason to be intimidated.

    The bad bot exclusion seems like a nice option. I’m not a techie, as you know, but I can’t believe a “bad” bot would obey a robots.txt file though. They are “bad”, aren’t they? So why would they obey robots.txt? :) I have seen many files disallowing a lot of bad bots though, so I guess it does work.

  5. [...] It wouldn’t hurt to restrict some directories etc. that don’t need to be crawled. Anyone can feel free to view my robots.txt and/or read my in-depth robots.txt guide. [...]

  6. Duplicate Content - It May Not be so Bad
    13th February, 2008 at 11:33 pm 

    [...] be indexed while the useless pages will not. To ensure the right pages are indexed be sure to use robots.txt and a good internal linking structure. If you’re running a WordPress blog there are many ways [...]

  7. [...] Spaulding provides an excellent explanation of the robots.txt file on his [...]

  8. Tamara
    1st April, 2008 at 9:43 pm 

    Ok my website is not done yet….but you are too funny with your butt pictures.

    I am so glad you brought this fact out, very interesting!

    Very useful information!

Disclosure Statement | © 2008 Spaulding Marketing Ent. All Rights Reserved. Syndication is NOT authorized without consent.

