Googlebot Relentlessly Using Bandwidth

When one of my hosting clients complained about continuously running out of bandwidth on his low-traffic site, I took a peek at the access logs and discovered that Googlebot was indexing every single possible day on a simple calendar addon for the phpBB2 forum software installed on the site. (Googlebot is the program that crawls the web indexing everything so you can search for it using Google.)

The logs showed thousands of Googlebot requests for the forum calendar, each one for a different date, some of them centuries in the past or future:

[sourcecode language="bash"]
- - [01/Sep/2008:17:09:12 -0400] "GET /forums/calendar.php?m=7&d=21&y=1621&sid=79b643b30eer7140adcd2ba76732688a HTTP/1.1" 200 44000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:09:33 -0400] "GET /forums/calendar.php?m=4&d=2&y=2188&sid=e4da1ee0a488096e3897a8f15c31cea2 HTTP/1.1" 200 43997 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:09:44 -0400] "GET /forums/calendar.php?m=12&d=4&y=1624&sid=cc5d5084d158457ce3c7a9d38263f553 HTTP/1.1" 200 44076 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:10:05 -0400] "GET /forums/calendar.php?m=10&d=15&y=1621&sid=a4e8af0d20715g965b3e616ae6f95004 HTTP/1.1" 200 43751 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:10:15 -0400] "GET /forums/calendar.php?m=9&d=13&y=2187&sid=80c79b2491ddf3d8d46076d48a6282d1 HTTP/1.1" 200 43896 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:10:26 -0400] "GET /forums/calendar.php?m=5&d=30&y=1618&sid=f0619ba6517an57bcd6a7e9ca6289a32 HTTP/1.1" 200 43820 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [01/Sep/2008:17:10:38 -0400] "GET /forums/calendar.php?m=11&y=2189&d=30&sid=97c0a58bbd2b3914dbf255ea0a2b1a4c HTTP/1.1" 200 44107 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
[/sourcecode]
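If you'd rather not eyeball the log, a short script can total up what Googlebot is costing you per page. This is only a rough Python sketch, assuming the server writes the standard combined log format. The function name `googlebot_usage`, the sample lines, and the placeholder client addresses (`192.0.2.x`, a range reserved for documentation) are made up for illustration and are not taken from the client's real log.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_usage(lines):
    """Count Googlebot hits and bytes sent, grouped by path (query string dropped)."""
    hits = Counter()
    bytes_sent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or "Googlebot" not in match.group("agent"):
            continue  # skip unparseable lines and other visitors
        path = urlsplit(match.group("url")).path
        hits[path] += 1
        if match.group("size") != "-":
            bytes_sent[path] += int(match.group("size"))
    return hits, bytes_sent


# Hypothetical sample lines; a real run would read the access log file instead.
sample = [
    '192.0.2.1 - - [01/Sep/2008:17:09:12 -0400] "GET /forums/calendar.php?m=7&d=21&y=1621 HTTP/1.1" 200 44000 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.1 - - [01/Sep/2008:17:09:33 -0400] "GET /forums/calendar.php?m=4&d=2&y=2188 HTTP/1.1" 200 43997 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.7 - - [01/Sep/2008:17:09:40 -0400] "GET /forums/index.php HTTP/1.1" 200 12000 "-" "Mozilla/5.0 (X11; Linux)"',
]

hits, bytes_sent = googlebot_usage(sample)
for path, count in hits.most_common():
    print(path, count, bytes_sent[path])
```

Grouping by path rather than by full URL is the point here: every calendar request has a different query string, so only the path-level total makes the problem obvious.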

A quick Google search turned up many others who've had the same problem:

Just found exactly the same on one of my client’s sites. They were complaining that despite being a small site, they’d apparently used all of their bandwidth within 4 days.

They had one of these PHP calendars on their site, where you click the day and it tells you what’s on. Googlebot had tried to index EVERY SINGLE POSSIBLE DAY. And, in the first four days of September, had used up all this site’s bandwidth, clocking up an impressive 19,000 hits and 800MB of bandwidth.

You can use robots.txt to tell all decent robots to push off. I’ve just done that. Let’s see if it works!

So I added a file to the root web directory for the site and named it robots.txt. Inside, I put the following:

User-agent: *
Disallow: /forums/calendar.php
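Before waiting for the next crawl, you can sanity-check a rule like this locally. Here is a minimal sketch using Python's standard `urllib.robotparser`. The `example.com` URLs are placeholders, and Python's parser only approximates how Googlebot itself reads the file, so treat the result as a quick check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The same two rules as in the robots.txt above.
rules = """\
User-agent: *
Disallow: /forums/calendar.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs on a hypothetical host.
calendar_url = "http://example.com/forums/calendar.php?m=7&d=21&y=1621"
index_url = "http://example.com/forums/index.php"

print(parser.can_fetch("Googlebot", calendar_url))  # False: calendar is blocked
print(parser.can_fetch("Googlebot", index_url))     # True: rest of the forum is still crawlable
```

Because `Disallow` matches by prefix, the single rule covers every calendar URL regardless of its query string.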

Sure enough, the next time Googlebot came through, it skipped /forums/calendar.php and stopped using up ridiculous amounts of bandwidth on pages that never needed to be indexed.

I can't blame Googlebot, though. It was just doing its job. The fault lies with the creators of the calendar addon. What they should have done was add rel="nofollow" to all the links in the calendar. Adding that attribute to an individual link tells Googlebot not to follow it. Google introduced nofollow back in 2005 as a way to combat comment spam.
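To show what that markup would look like, here is a small sketch of a helper that builds a calendar day link with the attribute set. The real addon is written in PHP and I haven't patched it; this Python version, including the function name `calendar_day_link`, is purely illustrative. The `m`, `d`, and `y` parameters are the ones visible in the log entries.

```python
from html import escape
from urllib.parse import urlencode


def calendar_day_link(month, day, year, label):
    """Return an anchor tag for one calendar day, marked rel="nofollow"."""
    query = urlencode({"m": month, "d": day, "y": year})
    # Escape the URL for use inside an HTML attribute (& becomes &amp;).
    href = escape(f"/forums/calendar.php?{query}", quote=True)
    return f'<a href="{href}" rel="nofollow">{escape(label)}</a>'


print(calendar_day_link(7, 21, 1621, "21"))
```

The only change from an ordinary link is the extra `rel="nofollow"` attribute, which is why it would have been a cheap fix for the addon authors.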
