My robots.txt file for Simple Machines
What follows is the robots.txt file that I’m using for a photography forum I’m an admin on. The idea is to tell robots not to crawl pages that I think shouldn’t be indexed, or that I feel may be causing duplicate content problems.
It’s important to note that this file, like the robots.txt files on my other sites, is fluid. I change it whenever I need to block robots from crawling pages I don’t want them on. So just because I’m using this robots.txt file now doesn’t mean it won’t change in a few days.
# robots.txt file for
# http://forums.photoartsforum.com

Sitemap: http://forums.photoartsforum.com/index.php?action=sitemap;xml

User-Agent: *
Disallow: /index.php?action=search
Disallow: /index.php?action=calendar
Disallow: /index.php?action=login
Disallow: /index.php?action=register
Disallow: /index.php?action=profile
Disallow: /index.php?action=stats
Disallow: /index.php?action=activate
Disallow: /index.php?action=help
Disallow: /index.php?action=admin
Disallow: /index.php?action=pm
Disallow: /index.php?action=mlist
Disallow: /index.php?action=notify
Disallow: /index.php?action=post
Disallow: /index.php?action=markasread
Disallow: /index.php?action=sendtopic
Disallow: /index.php?action=printpage
Disallow: /index.php?action=dlattach
Disallow: /index.php?action=reminder
Disallow: /attachments/
Disallow: /avatars/
Disallow: /Packages/
Disallow: /sitemaps/
Disallow: /Smileys/
Disallow: /Sources/
Disallow: /Themes/

Disallow: /*sort=
Disallow: /*.msg
The first two lines are just comments so that I can tell apart the dozen or so robots.txt files I have lying around my hard drive.
Line 4 is a newer addition to the robots.txt format, and is not yet a standard. But the major search engines support it and others will probably follow suit. Either way, it’s not going to hurt to have it there. What the Sitemap: line does is tell the search engine robot the full URL of your XML sitemap.
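If you want to see how a robot might pick up that line, here is a minimal Python sketch using the standard library's urllib.robotparser. It assumes Python 3.8 or later (where site_maps() is available), and it parses a shortened copy of my file from a string rather than fetching it. The ROBOTS_TXT constant and the sitemap_urls helper are names I made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the file above, kept inline so nothing is fetched.
ROBOTS_TXT = """\
# robots.txt file for
# http://forums.photoartsforum.com

Sitemap: http://forums.photoartsforum.com/index.php?action=sitemap;xml

User-Agent: *
Disallow: /index.php?action=search
"""


def sitemap_urls(robots_text):
    """Return every Sitemap: URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


sitemaps = sitemap_urls(ROBOTS_TXT)
print(sitemaps)
```

The Sitemap: line stands apart from any User-Agent group, which is why it can sit on its own near the top of the file.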
Line 6, User-Agent: *, applies the rules that follow to all robots, and lines 7-31 are the urls they are told not to crawl. What’s nice is that urls that start with any of the disallowed strings are disallowed, so I don’t have to go and worry about every possible combination.
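To illustrate the prefix matching, here is a rough sketch that runs a few of the rules through Python's urllib.robotparser. The sample URLs (the u=1 profile, topic 42 and the style.css path) and the "ExampleBot" user agent are hypothetical and only there to show the behaviour. Real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the file above.
RULES = """\
User-Agent: *
Disallow: /index.php?action=profile
Disallow: /index.php?action=printpage
Disallow: /Themes/
"""

BASE = "http://forums.photoartsforum.com"

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path):
    """True if a robot obeying RULES may fetch BASE + path."""
    # "ExampleBot" is a made-up name; the * group applies to any agent.
    return parser.can_fetch("ExampleBot", BASE + path)


# Hypothetical URLs: each blocked one merely starts with a disallowed string.
for path in (
    "/index.php?action=profile;u=1",
    "/index.php?action=printpage;topic=42.0",
    "/Themes/default/style.css",
    "/index.php?topic=42.0",
):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

Only the last URL comes back as allowed, since it does not begin with any of the disallowed strings.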
Lines 33 & 34 are nonstandard, but both Yahoo! Slurp and Googlebot understand the * wildcard, so I went ahead and listed them.
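Python's urllib.robotparser does not handle wildcards, so to show roughly what those two lines do, here is a small sketch that turns a wildcard rule into a regular expression. It is an approximation of how a wildcard-aware robot could match, assuming * means "any run of characters" and that rules still match from the start of the path. It is not the actual code of any search engine. The helper names and sample URLs are made up for the example.

```python
import re

WILDCARD_RULES = ["/*sort=", "/*.msg"]


def rule_to_regex(rule):
    """Compile a Disallow value, treating * as 'any run of characters'."""
    # Escape each literal piece so '.' and '?' are matched literally.
    pieces = (re.escape(piece) for piece in rule.split("*"))
    return re.compile(".*".join(pieces))


def is_blocked(path, rules=WILDCARD_RULES):
    """True if the path (including its query string) matches any rule."""
    # re.match anchors at the start, mirroring prefix-style matching.
    return any(rule_to_regex(rule).match(path) for rule in rules)


# Hypothetical forum URLs for illustration.
for path in (
    "/index.php?board=3.0;sort=subject",
    "/index.php?topic=123.msg456",
    "/index.php?topic=123.0",
):
    print(path, "->", "blocked" if is_blocked(path) else "allowed")
```

The idea is that sorted board listings and per-message links are blocked as duplicates, while the plain topic URL stays crawlable.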
Questions, Comments...
Do you have more questions? Please either leave a comment below or join us in our new forum.
Thank you for posting this! I’ll give it a try and use it as a startpoint.
Thanks for the comment granec. I will say that I haven’t had much luck getting my SMF forums indexed so this is just part of what I’m trying.