The Robots.txt File

The robots.txt file tells search engines and archivers which parts of your website they may and may not crawl. It is formally known as the Robots Exclusion Protocol, though most webmasters simply refer to it by its file name. Keep in mind the rules are advisory: rogue crawlers can ignore them, but all reputable search engines such as Google and Bing respect them. It is therefore an important file to add to your website.

Creating a robots.txt file is easy. All you have to do is create a plain text file, save it as robots.txt and upload it to the root of your domain (e.g. www.kevinmuldoon.com/robots.txt). The file can be as simple or as complicated as you want. For example, to allow all search engines to crawl your content, you just need to add this to your robots.txt:

User-agent: *
Disallow: 

The rule to stop all search engines from crawling your content is very similar. You just need to add a forward slash (/) to the Disallow directive.

User-agent: *
Disallow: /

You can achieve the same two results using the Allow directive instead. To allow everything, you could use the following.

User-agent: *
Allow: /

And, in principle, to disallow everything.

User-agent: *
Allow: 

Be aware that Allow is a later extension to the protocol: some parsers simply ignore an empty Allow line, so Disallow: / remains the dependable way to block an entire site.
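You can sanity-check how these basic rules behave using Python's built-in urllib.robotparser module (the user agent name and example.com URLs below are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# "Allow everything": an empty Disallow value blocks nothing.
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

# "Disallow everything": a single forward slash blocks the whole site.
deny_all = RobotFileParser()
deny_all.parse(["User-agent: *", "Disallow: /"])

print(allow_all.can_fetch("MyBot", "http://www.example.com/about.html"))  # True
print(deny_all.can_fetch("MyBot", "http://www.example.com/about.html"))   # False
```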

Understanding The Robots.txt File


The robots.txt file is one of the easiest files to understand. The four main directives used in robots.txt are User-agent, Disallow, Allow and Sitemap (though non-standard directives such as Crawl-delay and Request-rate also exist).

  • User-agent – Specifies which crawler the rules apply to (e.g. Googlebot).
  • Disallow – Specifies which files or folders should not be crawled.
  • Allow – Specifies which files or folders may be crawled.
  • Sitemap – Specifies the location of your sitemap.

Disallow is used much more frequently than Allow because, by default, search engine robots are allowed to crawl everything unless you say otherwise. Allow is useful for specifying that a file or folder can be crawled even though its parent folder is blocked. For example, say you wanted to block every file in a folder except one file and one sub-folder.

User-agent: *
Disallow: /mydocuments/
Allow: /mydocuments/intro.pdf
Allow: /mydocuments/guidelines/

Generally, you will use Disallow the vast majority of the time and only use the Allow option when you want a file or folder inside a blocked folder to be crawled.
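This interplay between Disallow and Allow can be tested with urllib.robotparser. One caveat worth hedging: Python's parser applies rules in file order (first match wins), so the Allow lines are listed before the broader Disallow in this sketch, whereas Google applies the most specific (longest) matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Allow lines come first because urllib.robotparser uses first-match-wins;
# Google would reach the same result here via longest-match precedence.
rules = [
    "User-agent: *",
    "Allow: /mydocuments/intro.pdf",
    "Allow: /mydocuments/guidelines/",
    "Disallow: /mydocuments/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyBot", "/mydocuments/intro.pdf"))        # True
print(rp.can_fetch("MyBot", "/mydocuments/guidelines/a.pdf")) # True
print(rp.can_fetch("MyBot", "/mydocuments/secret.pdf"))       # False
print(rp.can_fetch("MyBot", "/index.html"))                   # True
```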

Sitemap is used to tell search engines where your sitemap is located. The value should be the full URL of the sitemap, which is usually an XML file.
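Sitemap lines can also be read programmatically. In Python 3.8 and later, urllib.robotparser exposes them via the site_maps() method, as this sketch shows (the example.com URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "Sitemap: http://www.example.com/sitemap.xml",
    "User-agent: *",
    "Disallow:",
])

# site_maps() (Python 3.8+) returns the listed sitemap URLs, or None if absent.
print(rp.site_maps())  # ['http://www.example.com/sitemap.xml']
```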

Robots.txt For WordPress

Most websites stop search engines from crawling certain directories, such as the cgi-bin. WordPress users are often advised to block search engines from certain areas of the site, such as the admin folder, the includes folder and the plugins folder. Below you will find some suggestions of what your robots.txt file could look like for a WordPress website.

This longer robots.txt file is a popular suggestion for WordPress websites.

Sitemap: http://www.example.com/sitemap.xml

# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*

# Google AdSense
User-agent: Mediapartners-Google*
Disallow:

# digg mirror
User-agent: duggmirror
Disallow: /

# global
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /category/*/*
Disallow: */trackback/
Disallow: */feed/
Disallow: */comments/
Disallow: /*?
Allow: /wp-content/uploads/

Another alternative is the leaner robots.txt file from Jeff Starr.

Sitemap: http://example.com/sitemap.xml

User-agent: *

Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-

The lean version from Jeff Starr is a good starting point for any WordPress robots.txt file. You can then update it as you add new folders and files to your site.
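As a quick check, the lean rules above can be fed straight into urllib.robotparser to confirm they block the WordPress internals while leaving ordinary posts crawlable (the URL paths below are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The lean WordPress rules from above, parsed directly from a string.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
""".splitlines())

print(rp.can_fetch("MyBot", "/wp-admin/options.php"))  # False
print(rp.can_fetch("MyBot", "/wp-login.php"))          # False (the /wp- prefix catches it)
print(rp.can_fetch("MyBot", "/2013/my-blog-post/"))    # True
```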

Additional Resources

The robots.txt file of any website that has one can be viewed by simply visiting www.site.com/robots.txt, so a great way to learn more is to compare the robots.txt files of popular websites. You should also find the following resources useful for understanding the Robots Exclusion Protocol better and configuring robots.txt for your own website.