So I have been using BlogCFC for 970 days, and I love it. But one problem I have had since the beginning is that when my site gets hammered by web crawlers, I get a ton of errors. They usually hit between 2:00 AM and 6:00 AM, crawling my blog looking for new content. I appreciate what they do, but sometimes they can be VERY aggressive and start to cause timeout errors.
The result is that I wake up to dozens, or hundreds, of error emails and, very rarely, a crashed ColdFusion application server. Since I am on an Awesome VPS, I rarely have problems with the crashing, even less so since I upgraded the JVM from the CF8 default. But I would rather not have my server brought to its knees every morning by bots. Especially since I know that my worshippers from across the pond are just arriving at work and desire nothing more than to see if I have anything new to say.
So, finally, after 3 years, I decided to look into this problem. I've noticed that more often than not, the timeout errors occur when a web crawler tries to hit the "print" link on every post. So I said to myself, "Self, do web crawlers need to index my print page?"
Of course not.
Robots.txt

I hope that most people know what the robots.txt file is for, but in case you don't: robots.txt is a file you place at the root of your website to tell web crawlers which paths they should crawl and which they should ignore.
So here is the robots.txt file I added to my site.
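    # Applies to all crawlers
    User-agent: *
    Disallow: /admin
    Disallow: /print.cfm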
Here I've told the web crawlers to ignore my /admin directory and the /print.cfm file. I don't want them to index either one, for obvious reasons.
Hopefully this will help. We'll see what happens. And of course, you should be able to apply this logic to other applications as well; only the /print.cfm part is BlogCFC-specific. And since BlogCFC does not ship with a robots.txt file, I figured most of its users probably don't have one and could be experiencing the same problem.