Stopping web crawling bots from causing errors in BlogCFC

So I have been using BlogCFC for 970 days, and I love it. But one problem I have had since the beginning is that when my site gets hammered by web crawlers, I get a ton of errors. They usually hit between 2:00 AM and 6:00 AM and crawl my blog looking for new content. I appreciate what they do, but sometimes they can be VERY aggressive and start to cause timeout errors.

The result is that I wake up to dozens, or hundreds, of error emails and, very rarely, a crashed ColdFusion application server. Since I am on an awesome VPS, I rarely have problems with the crashing, even less so since I upgraded the JVM from the CF8 default. But I would rather not have my server brought to its knees every morning by bots, especially since I know that my worshippers from across the pond are just arriving at work and desire nothing more than to see if I have anything new to say.

So, finally, after 3 years, I decided to look into this problem. I've noticed that, more often than not, the timeout errors occur when a web crawler tries to hit the "print" link on every post. So I said to myself, "Self, do web crawlers need to index my 'print' page?"

Of course not.

Robots.txt

I hope that most people know what the robots.txt file is for, but in case you don't: robots.txt is a file you place at the root of your website to tell web crawlers which paths they should crawl and which they should ignore. This lets you keep them from trying to index certain parts of your site.

So here is the robots.txt file I added to my site.


User-agent: *
Disallow: /print.cfm
Disallow: /admin

Here I've told the web crawlers to ignore my /admin directory and the print.cfm file. I don't want them to index either one, for obvious reasons.
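If you want to verify that the file is actually being served, you can request it directly. Here's a quick check with curl (substitute your own domain for the example one):

curl http://www.example.com/robots.txt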

Hopefully this will help. We'll see what happens. And of course, you should be able to apply this logic to other applications as well; only the /print.cfm part is BlogCFC-specific. And since BlogCFC does not ship with a robots.txt file, I figured most of its users probably don't have one and could be experiencing the same problem.

Comments
@Jason,
Yeah, it's about time somebody posted something on this ... I might also add a rel="nofollow" to those necessary links as well.
# Posted By Steve Withington | 10/13/10 9:29 AM
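For illustration, adding rel="nofollow" to the print link might look something like this (the exact markup and variable names in BlogCFC's template are assumed here, not taken from the actual code):

<!--- Hypothetical print link; BlogCFC's actual template may differ --->
<a href="print.cfm?entry=#entry.id#" rel="nofollow">Print</a>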
Until you mentioned it, I forgot that I also had to set up a robots.txt file for my blog. Here's what mine looks like:

User-Agent: *
Sitemap: http://www.cfgears.com/googlesitemap.cfm
Disallow: /admin/
Disallow: /addcomment.cfm
Disallow: /addsub.cfm
Disallow: /trackback.cfm
Disallow: /trackbacks.cfm
Disallow: /contact.cfm
Disallow: /error.cfm
Disallow: /print.cfm

I don't remember now if there were valid reasons for each of those, or if I just went ahead and disallowed everything that I didn't think the bots needed to see.
# Posted By Eric Cobb | 10/13/10 9:29 AM
Oh, and also consider putting <meta name="robots" content="noindex,nofollow" /> on any pages you _don't_ want indexed, etc.
# Posted By Steve Withington | 10/13/10 9:41 AM
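In a ColdFusion page like print.cfm, one way to add that tag is with cfhtmlhead, which appends markup to the page's head (a minimal sketch; the file name and placement are assumed):

<!--- Ask well-behaved crawlers not to index this page or follow its links --->
<cfhtmlhead text='<meta name="robots" content="noindex,nofollow" />'>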
Thanks for the additional ideas/tips, guys. It's all good advice, I think.

At first, I was mad at the spider bots for hammering my site. But then I realized that the fault was at least partly mine for not giving the bots the appropriate guidance. Hopefully this will help solve my error e-mail problem.
# Posted By Jason Dean | 10/13/10 11:59 PM
I've also got a crawl-delay in my robots.txt. Some of the bots do take note of this, but some don't. It helps a little:

Crawl-delay: 10

This makes bots that adhere to this parameter wait 10 seconds before requesting another page.
# Posted By Stephen Moretti | 10/14/10 7:52 AM
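Note that Crawl-delay is a nonstandard directive that only some crawlers honor, and it belongs inside a User-agent group. Combined with the disallow rules from the post, a robots.txt might look like this (the 10-second value is just an example):

User-agent: *
Crawl-delay: 10
Disallow: /print.cfm
Disallow: /admin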
What a simple solution!

I was really looking for a solution that involved fixing the errors that the robots were triggering (I thought that perhaps they were running across actual errors that users hadn't seen yet), but now I'm realizing that it's the robots themselves that are calling some crazy URLs against my error.cfm page, and that's really what's sending me the dozens of e-mails each night.

Sometimes I love being wrong. Thanks!
# Posted By Brian Steck | 1/24/12 7:52 PM