A Guide to the Robots.txt Exclusion Protocol

Published 1st February 2013

Learn how to use robots.txt to control how search engine crawlers, or spiders, access and crawl your site.
Search Engine Optimisation Series
  1. SEO - Search Engine Optimization
  2. A Guide to the Robots.txt Exclusion Protocol
  3. What are XML sitemaps?
  4. Using Google Webmaster Tools
  5. Getting Started with Google Analytics
  6. Getting Started Earning Money with Adsense
  7. Website Loading Times Are Vital - How to Improve Yours
  8. Improve Website Speed by Enabling Compression
  9. Google Trends Keyword Comparison Tool
  10. 8 Excellent (and Free!) Search Engine Optimization Websites

A Web crawler, sometimes also called a spider or a bot, is an automated program which systematically browses the World Wide Web, typically for the purpose of Web indexing, although crawlers can be used to gather data of any kind. Sometimes these bots can be a bit overzealous in their crawling and generate thousands of hits per hour, which can overwhelm a server much like a denial-of-service attack.

The robots.txt file is a standard way to give instructions about a website to web robots. This standard is called the Robots Exclusion Protocol.

When a robot wants to crawl a site, it first checks the robots.txt file to see if it is allowed to crawl the site and whether there are any areas it should ignore.

The robots.txt is a plain text file which is placed in the root of the website, for example, http://www.example.com/robots.txt

The most basic of robots.txt contents looks like this:

User-agent: *
Disallow:

This simple content creates a rule which allows all web crawlers access to the entire site.

User-agent: * indicates that the following rule applies to all spider bots.

Disallow: The empty disallow field indicates nothing is blocked and every link can be crawled.
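If you want to check a rule like this without a live site, Python's standard-library urllib.robotparser module (not part of this article's original examples, just a convenient testing tool) can parse robots.txt content directly:

```python
# Sketch: verify the "allow everything" rule using Python's built-in
# urllib.robotparser module.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# With an empty Disallow field, every URL is crawlable by every agent.
print(parser.can_fetch("*", "http://www.example.com/any/page.html"))  # True
```

The URL here is purely illustrative; can_fetch() answers the same question a crawler asks before fetching a page.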

The opposite would be to block access for all web crawlers and prevent the site from being indexed.

User-agent: *
Disallow: /

User-agent: * indicates that the following rule applies to all spider bots.

Disallow: / indicates that the root URL, and every URL beneath it, is disallowed, or forbidden, so the entire site is off limits.

You can specify which URLs are blocked in the Disallow field.

User-agent: *
Disallow: /wp-admin/
Disallow: /private/
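You can test that these path rules behave as expected with the same urllib.robotparser approach (again an illustration added here, with made-up example URLs):

```python
# Sketch: confirm that disallowed paths are blocked while the rest of
# the site remains crawlable.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Anything under a disallowed path is blocked...
print(parser.can_fetch("*", "http://www.example.com/wp-admin/index.php"))  # False
# ...but other URLs are still allowed.
print(parser.can_fetch("*", "http://www.example.com/blog/post.html"))      # True
```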

Care must be taken when disallowing resources, as malicious users may use robots.txt to locate hidden areas and target them. For example, the above rules indicate to an attacker that the site is running WordPress and that there is a URL containing private resources; they can then tailor an attack to WordPress or probe those private resources.

Robots.txt is used to prevent search engines from listing a web page; it should not be used as a security measure.

You can also limit access on a per-bot basis by specifying them in the User-agent field.

Here are a few of the most popular web crawler user agents to use:

  • Googlebot - Google's own web crawler
  • Mediapartners-Google - Google AdSense/AdWords
  • Bingbot - Microsoft Bing
  • MSNBot - Microsoft's old MSN bot
  • Slurp - Yahoo! Search
  • ia_archiver - Internet Archive

You can give access to certain bots, whilst blocking all others:

User-agent: Googlebot
User-agent: Bingbot
User-agent: Slurp
Disallow:

# Everyone Else (NOT allowed)
User-agent: *
Disallow: /

In theory, this should block all "bad bots", i.e. those bots which scrape content and hog bandwidth, but bad bots do not honour robots.txt rules, or even fetch the file. The spiders that do follow robots.txt are precisely the ones you want indexing your site.
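Per-bot rules along these lines can also be checked locally with urllib.robotparser (a testing sketch added here; "BadBot" is a made-up user agent for illustration):

```python
# Sketch: named bots fall under their own record, everyone else falls
# under the catch-all "User-agent: *" record.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own record, which allows everything.
print(parser.can_fetch("Googlebot", "http://www.example.com/"))  # True
# An unlisted bot matches the catch-all record and is blocked.
print(parser.can_fetch("BadBot", "http://www.example.com/"))     # False
```

Note that a well-behaved unlisted bot is denied everything, while any named bot gets its own, more permissive record.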

Once you have created and uploaded a robots.txt, you can use the Google Webmaster Tools to check for errors and test to see if the rules work against a number of user agents.

