My robots.txt file for Simple Machines

What follows is the robots.txt file that I'm using for a photography forum I'm an admin on. The idea is to tell robots not to crawl pages that I think shouldn't be indexed, or that I feel may be causing duplicate content problems.

It's important to note that this file, like the robots.txt files on my other sites, is fluid. I change them as often as I need to in order to keep robots off pages I don't want crawled. So just because I'm using this robots.txt file now doesn't mean it won't change in a few days.

CODE:
  1. # robots.txt file for
  2. # http://forums.photoartsforum.com
  3.  
  4. Sitemap: http://forums.photoartsforum.com/index.php?action=sitemap;xml
  5.  
  6. User-Agent: *
  7. Disallow: /index.php?action=search
  8. Disallow: /index.php?action=calendar
  9. Disallow: /index.php?action=login
  10. Disallow: /index.php?action=register
  11. Disallow: /index.php?action=profile
  12. Disallow: /index.php?action=stats
  13. Disallow: /index.php?action=activate
  14. Disallow: /index.php?action=help
  15. Disallow: /index.php?action=admin
  16. Disallow: /index.php?action=pm
  17. Disallow: /index.php?action=mlist
  18. Disallow: /index.php?action=notify
  19. Disallow: /index.php?action=post
  20. Disallow: /index.php?action=markasread
  21. Disallow: /index.php?action=sendtopic
  22. Disallow: /index.php?action=printpage
  23. Disallow: /index.php?action=dlattach
  24. Disallow: /index.php?action=reminder
  25. Disallow: /attachments/
  26. Disallow: /avatars/
  27. Disallow: /Packages/
  28. Disallow: /sitemaps/
  29. Disallow: /Smileys/
  30. Disallow: /Sources/
  31. Disallow: /Themes/
  32.  
  33. Disallow: /*sort=
  34. Disallow: /*.msg
  35. Disallow: *wap2*

The first two lines are just comments so that I can tell apart the dozen or so robots.txt files I have lying around my hard drive.

Line 4 uses the Sitemap: directive, which is a recent addition to the robots.txt format and is not yet a standard. But the major search engines support it and others will probably follow suit. Either way, it's not going to hurt to have it there. What the Sitemap: line does is tell the search engine robot the full URL of your XML sitemap.
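If you want to check that a Sitemap: line is being picked up, here is a minimal sketch using Python's standard-library urllib.robotparser. It assumes Python 3.8 or later (that's when site_maps() was added), and it parses a shortened copy of the file from a string rather than fetching it from the live site.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt above, kept in a string so that
# nothing is fetched over the network.
ROBOTS_TXT = """\
# robots.txt file for
# http://forums.photoartsforum.com

Sitemap: http://forums.photoartsforum.com/index.php?action=sitemap;xml

User-Agent: *
Disallow: /index.php?action=search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of the Sitemap: URLs found, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

This only shows how one parser reads the line. How each search engine actually uses the sitemap is up to that search engine.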

Lines 6-31 are the rules for all robots: the User-Agent: * on line 6 means every crawler should follow the Disallow lines beneath it. What's nice is that any URL that starts with one of the disallowed strings is blocked, so I don't have to worry about listing every possible combination.
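To see the "starts with" behavior in action, here is a small sketch, again with Python's urllib.robotparser. It uses only a few of the rules above, and the user agent name "ExampleBot" and the extra bits tacked onto the URLs (";advanced", ";u=42", "topic=123.0") are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the file above. The wildcard lines are left out
# on purpose, since this parser only does plain prefix matching.
RULES = """\
User-Agent: *
Disallow: /index.php?action=search
Disallow: /index.php?action=profile
Disallow: /Themes/
"""

BASE = "http://forums.photoartsforum.com"

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Return True if a (hypothetical) crawler may fetch BASE + path."""
    return parser.can_fetch("ExampleBot", BASE + path)


# Hypothetical URLs: the first three extend a disallowed prefix,
# the last one matches no rule at all.
for path in [
    "/index.php?action=search;advanced",
    "/index.php?action=profile;u=42",
    "/Themes/default/style.css",
    "/index.php?topic=123.0",
]:
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

The first three paths come back blocked because each one begins with a disallowed string, while the topic URL is allowed.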

Lines 33 & 34 are nonstandard, but both Yahoo! Slurp and Googlebot understand wildcards, so I went ahead and listed them.
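For anyone wondering what the wildcard lines would catch, here is a rough sketch of the matching as I understand it: * stands for any run of characters, a trailing $ anchors the end of the URL, and the pattern is matched from the start of the path. This is my own approximation in Python, not the code any search engine actually runs, and the sample URLs are made up.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt wildcard pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, patterns: list) -> bool:
    """True if any pattern matches, starting from the beginning of the path."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


# The three wildcard patterns from the robots.txt file above.
WILDCARDS = ["/*sort=", "/*.msg", "*wap2*"]

# Hypothetical forum URLs to try the patterns against.
for path in [
    "/index.php?board=1.0;sort=subject",
    "/index.php?topic=123.msg456",
    "/index.php?wap2;topic=123.0",
    "/index.php?topic=123.0",
]:
    print(path, "->", "blocked" if is_blocked(path, WILDCARDS) else "allowed")
```

Only the last URL gets through, since it contains none of sort=, .msg, or wap2.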

Questions, Comments...

Do you have more questions? Please either leave a comment below or join us in our new forum.

4 Responses to “My robots.txt file for Simple Machines”

  1. [...] a short follow-up to the robots.txt for SMF posting I did yesterday, here is a similar article over at TheAdminZone.com on robots.txt for [...]

  2. Thank you for posting this! I’ll give it a try and use it as a starting point.

  3. Thanks for the comment granec. I will say that I haven’t had much luck getting my SMF forums indexed so this is just part of what I’m trying.

  4. Going through my Google results I found that they were indexing quite a few pages that were the wap2 version, which I didn’t want. So I added line 35 above to block those requests. We’ll see whether it helps.
