Stopping web crawling bots from causing errors in BlogCFC

So I have been using BlogCFC for 970 days, and I love it. But one problem I have had since the beginning is that when my site gets hammered by web crawlers, I get a ton of errors. They usually hit between 2:00 AM and 6:00 AM and crawl my blog looking for new content. I appreciate what they do, but sometimes they can be VERY aggressive and start to cause timeout errors.

The result is that I wake up to dozens, or hundreds, of error emails and, very rarely, a crashed ColdFusion application server. Since I am on an Awesome VPS, I rarely have problems with the crashing, even less so since I upgraded the JVM from the CF8 default. But I would rather not have my server brought to its knees every morning by bots. Especially since I know that my worshippers from across the pond are just arriving at work and desire nothing more than to see if I have anything new to say.

So, finally, after 3 years, I decided to look into this problem. I've noticed that more often than not, the timeout errors are occurring when the web crawler tries to hit the "print" link on every post. So I said to myself, "Self, do web crawlers need to index my 'print' page?"


Comments
@Jason,
Yeah, it's about time somebody posted something on this ... I might also add a rel="nofollow" to those necessary links as well.
# Posted By Steve Withington | 10/13/10 9:29 AM
Until you mentioned it, I forgot that I also had to set up a robots.txt file for my blog. Here's what mine looks like:

User-Agent: *
Sitemap: http://www.cfgears.com/googlesitemap.cfm
Disallow: /admin/
Disallow: /addcomment.cfm
Disallow: /addsub.cfm
Disallow: /trackback.cfm
Disallow: /trackbacks.cfm
Disallow: /contact.cfm
Disallow: /error.cfm
Disallow: /print.cfm

I don't remember now if there were valid reasons for each of those, or if I just went ahead and disallowed everything that I didn't think the bots needed to see.
# Posted By Eric Cobb | 10/13/10 9:29 AM
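
A quick way to sanity-check rules like the ones Eric posted is to run them through a robots.txt parser before deploying them. The sketch below uses Python's standard urllib.robotparser module. The bot name ExampleBot and the sample paths (including the id query parameter) are made up for illustration. It only shows what a well-behaved crawler should do with these rules, not what any particular bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the comment above.
ROBOTS_TXT = """\
User-Agent: *
Sitemap: http://www.cfgears.com/googlesitemap.cfm
Disallow: /admin/
Disallow: /addcomment.cfm
Disallow: /addsub.cfm
Disallow: /trackback.cfm
Disallow: /trackbacks.cfm
Disallow: /contact.cfm
Disallow: /error.cfm
Disallow: /print.cfm
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths; "ExampleBot" is a made-up crawler name that falls
# through to the catch-all "*" group.
SAMPLE_PATHS = ["/index.cfm", "/print.cfm?id=123", "/admin/index.cfm", "/error.cfm"]
results = {path: parser.can_fetch("ExampleBot", path) for path in SAMPLE_PATHS}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Disallow rules match by path prefix, so a rule for /print.cfm also covers the same page with a query string attached.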
Oh, and also consider putting <meta name="robots" content="noindex,nofollow" /> on any pages you _don't_ want indexed, etc.
# Posted By Steve Withington | 10/13/10 9:41 AM
Thanks for the additional ideas/tips, guys. It's all good advice, I think.

At first, I was mad at the spider bots for hammering my site. But then I realized that the fault was, at least partly, mine for not giving the bots the appropriate guidance. Hopefully this will help solve my error e-mail problem.
# Posted By Jason Dean | 10/13/10 11:59 PM
I've also got a crawl-delay in my robots.txt. Some of the bots do take note of this, but some don't. It helps a little:

Crawl-delay: 10

This makes bots that adhere to this parameter wait 10 seconds before requesting another page.
# Posted By Stephen Moretti | 10/14/10 7:52 AM
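
For anyone curious how a cooperating crawler might pick up that setting, here is a minimal sketch using Python's urllib.robotparser, which exposes the value through crawl_delay(). It assumes the Crawl-delay line sits inside a catch-all "User-agent: *" group (the comment above does not show the surrounding group) and a reasonably recent Python 3. ExampleBot and the crawl_delay_for helper are made-up names.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the Crawl-delay line lives in a catch-all "User-agent: *" group.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, user_agent, default=0):
    """Return the Crawl-delay (in seconds) that applies to user_agent.

    Falls back to `default` when the file sets no delay for that agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else delay


# "ExampleBot" is a hypothetical crawler name.
delay_seconds = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
print(f"Wait {delay_seconds} seconds between requests")
```

A polite crawler would sleep for that many seconds between page requests. As noted above, honoring the value is entirely up to the bot.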
What a simple solution!

I was really looking for a solution that involved fixing the errors that the robots were returning (I thought that perhaps they were running across actual errors that users hadn't seen yet), but now I'm realizing that it's the robots themselves that are calling some crazy URLs from my error.cfm page, and that's really what's sending me the dozens of e-mails each night.

Sometimes I love being wrong. Thanks!
# Posted By Brian Steck | 1/24/12 7:52 PM
BlogCFC was created by Raymond Camden. This blog is running version 5.9.1.