
I am having EXTREME bot problems on some of the websites within my hosting account. The bots use over 98% of my CPU resources and 99% of my bandwidth for the entire hosting account. These bots generate over 1 GB of traffic per hour for my sites, while the real human traffic for all of these sites is less than 100 MB per month.

I have done extensive research on both the robots.txt and .htaccess files to block these bots, but all methods have failed.

I have also put code in the robots.txt files to block access to the scripts directories, but these bots (Google, MS Bing, and Yahoo) ignore the rules and run the scripts anyway.

I do not want to completely block the Google, MS Bing, and Yahoo bots, but I want to limit their crawl rate. Also, adding a Crawl-delay statement to the robots.txt file does not slow the bots down. My current robots.txt and .htaccess code for all the sites is shown below.

I have set up both Microsoft and Google webmaster tools to slow the crawl rate to the absolute minimum, but they are still hitting these sites at a rate of 10 hits per second.

In addition, every time I upload a file that causes an error, the entire VPS web server goes down within seconds, so I cannot even access the site to correct the issue because of the onslaught of hits from these bots.

What can I do to stop this onslaught of traffic to my websites?

I have asked my web hosting company (site5.com) about this issue many times over the past months, and they cannot help me with this problem.

What I really need is to prevent the bots from running the rss2html.php script. I tried both sessions and cookies, and both failed.

robots.txt

User-agent: Mediapartners-Google
Disallow: 
User-agent: Googlebot
Disallow: 
User-agent: Adsbot-Google
Disallow: 
User-agent: Googlebot-Image
Disallow: 
User-agent: Googlebot-Mobile
Disallow: 
User-agent: MSNBot
Disallow: 
User-agent: bingbot
Disallow: 
User-agent: Slurp
Disallow: 
User-Agent: Yahoo! Slurp
Disallow: 
# Directories
User-agent: *
Disallow: /
Disallow: /cgi-bin/
Disallow: /ads/
Disallow: /assets/
Disallow: /cgi-bin/
Disallow: /phone/
Disallow: /scripts/
# Files
Disallow: /ads/random_ads.php
Disallow: /scripts/rss2html.php
Disallow: /scripts/search_terms.php
Disallow: /scripts/template.html
Disallow: /scripts/template_mobile.html

.htaccess

ErrorDocument 400 http://english-1329329990.spampoison.com
ErrorDocument 401 http://english-1329329990.spampoison.com
ErrorDocument 403 http://english-1329329990.spampoison.com
ErrorDocument 404 /index.php
SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
SetEnvIfNoCase User-Agent "^baidu*" bad_bot
Order Deny,Allow
Deny from env=bad_bot
RewriteEngine on
RewriteCond %{HTTP_user_agent} bot\* [OR]
RewriteCond %{HTTP_user_agent} \*bot
RewriteRule ^.*$ http://english-1329329990.spampoison.com [R,L]
RewriteCond %{QUERY_STRING} mosConfig_[a-zA-Z_]{1,21}(=|\%3D) [OR]
# Block out any script trying to base64_encode crap to send via URL
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [OR]
# Block out any script that includes a <script> tag in URL
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
# Block out any script trying to set a PHP GLOBALS variable via URL
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
# Block out any script trying to modify a _REQUEST variable via URL
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2})
# Send all blocked request to homepage with 403 Forbidden error!
RewriteRule ^(.*)$ index.php [F,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !^/index.php
RewriteCond %{REQUEST_URI} (/|\.php|\.html|\.htm|\.feed|\.pdf|\.raw|/[^.]*)$  [NC]
RewriteRule (.*) index.php
RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
# Don't show directory listings for directories that do not contain an index file (index.php, default.asp etc.)
Options -Indexes
<Files http://english-1329329990.spampoison.com>
order allow,deny
allow from all
</Files>
deny from 108.
deny from 123.
deny from 180.
deny from 100.43.83.132

UPDATE TO SHOW ADDED USER AGENT BOT CHECK CODE

<?php
function botcheck(){
 $spiders = array(
   array('AdsBot-Google','google.com'),
   array('Googlebot','google.com'),
   array('Googlebot-Image','google.com'),
   array('Googlebot-Mobile','google.com'),
   array('Mediapartners','google.com'),
   array('Mediapartners-Google','google.com'),
   array('msnbot','search.msn.com'),
   array('bingbot','bing.com'),
   array('Slurp','help.yahoo.com'),
   array('Yahoo! Slurp','help.yahoo.com')
 );
 $useragent = strtolower($_SERVER['HTTP_USER_AGENT']);
 foreach($spiders as $bot) {
   if(preg_match("/$bot[0]/i",$useragent)){
     $ipaddress = $_SERVER['REMOTE_ADDR']; 
     $hostname = gethostbyaddr($ipaddress);
     $iphostname = gethostbyname($hostname);
     if (preg_match("/$bot[1]/i",$hostname) && $ipaddress == $iphostname){return true;}
   }
 }
}
if(botcheck() == false) {
  // User Login - Read Cookie values
     $username = $_COOKIE['username'];
     $password = $_COOKIE['password'];
     $radio_1 = $_COOKIE['radio_1'];
     $radio_2 = $_COOKIE['radio_2'];
     if (($username == 'm3s36G6S9v' && $password == 'S4er5h8QN2') || ($radio_1 == '2' && $radio_2 == '5')) {
     } else {
       $selected_username = $_POST['username'];
       $selected_password = $_POST['password'];
       $selected_radio_1 = $_POST['group1'];
       $selected_radio_2 = $_POST['group2'];
       if (($selected_username == 'm3s36G6S9v' && $selected_password == 'S4er5h8QN2') || ($selected_radio_1 == '2' && $selected_radio_2 == '5')) {
         setcookie("username", $selected_username, time()+3600, "/");
         setcookie("password", $selected_password, time()+3600, "/");
         setcookie("radio_1", $selected_radio_1, time()+3600, "/");
         setcookie("radio_2", $selected_radio_2, time()+3600, "/");
       } else {
        header("Location: login.html");
       }
     }
}
?>

I also added the following to the top of the rss2html.php script:

// Checks to see if this script was called by the main site pages, (i.e. index.php or mobile.php) and if not, then sends to main page
   session_start();  
   if(isset($_SESSION['views'])){$_SESSION['views'] = $_SESSION['views']+ 1;} else {$_SESSION['views'] = 1;}
   if($_SESSION['views'] > 1) {header("Location: http://website.com/index.php");}
Sammy

11 Answers

5

You could set your script to throw a 404 error (or a redirect) based on the user agent string provided by the bots - they'll quickly get the hint and leave you alone.

if (isset($_SERVER['HTTP_USER_AGENT'])) {
   $agent = $_SERVER['HTTP_USER_AGENT'];
} else {
   $agent = ''; // no user agent header sent at all
}

if (preg_match('/^Googlebot/i', $agent)) {
   // send the bot packing with a permanent redirect
   header("HTTP/1.1 301 Moved Permanently");
   header("Location: http://www.google.com/");
   exit;
}

Pick through your logs and reject Bingbot, etc. in a similar manner. It won't stop the requests, but it might save some bandwidth - give Googlebot a taste of its own medicine - Mwhahahahaha!

Updated

Looking at your code, I think your problem is here:

if (preg_match("/$bot[1]/i",$hostname) && $ipaddress == $iphostname)

If they are malicious bots then they could be coming from anywhere; take that $ipaddress clause out and throw a 301 or 404 response at them.

Thinking right up by the side of the box

  1. Googlebot never accepts cookies, so it can't store them. In fact, if you require cookies for all users, that's probably going to keep the bot from accessing your page.
  2. Googlebot doesn't understand forms - or JavaScript - so you could dynamically generate your links or have the users click a button to reach your code (with a suitable token attached); a matching server-side check is sketched after this list.

    <a href="#" onclick="document.location='rss2html.php?validated=29e0-27fa12-fca4-cae3';">Rss2html.php</a>

    • rss2html.php?validated=29e0-27fa12-fca4-cae3 - human
    • rss2html.php - bot
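
For completeness, a minimal server-side counterpart in rss2html.php might look like the sketch below. The token value is just the one from the example link above; in practice you would generate and rotate it yourself rather than hard-code it:

<?php
// Sketch: refuse to serve rss2html.php unless the request carries the token
// that the JavaScript link appends. Bots that only follow plain links never
// send it. The token value here is only the example from the link above.
$expected_token = '29e0-27fa12-fca4-cae3';

if (!isset($_GET['validated']) || $_GET['validated'] !== $expected_token) {
    header("HTTP/1.1 404 Not Found");
    exit;
}
// ...continue with the normal feed output for human visitors...
?>
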
web_bod
3

If rss2html.php is not being used directly by the client (that is, if it's always PHP using it rather than it being a link or something), then forget trying to block bots. All you really have to do is define a constant or something in the main page, then include the other script. In the other script, check whether the constant is defined, and spit out a 403 error or a blank page or whatever if it's not defined.

Now, in order for this to work, you'll have to use include rather than file_get_contents, as the latter will either just read in the file (if you're using a local path) or run in a whole other process (if you're using a URL). But it's the method that software like Joomla! uses to prevent a script from being accessed directly. And use a file path rather than a URL, so that the PHP code isn't already parsed before you try to run it.
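
A minimal sketch of that pattern, assuming index.php is the main page and the script lives at scripts/rss2html.php as in the question (the constant name IN_SITE is just an example):

<?php
// index.php (or mobile.php) - define a marker constant, then include the script
define('IN_SITE', true);
include __DIR__ . '/scripts/rss2html.php';
?>

<?php
// scripts/rss2html.php - refuse to run unless included by one of the main pages
if (!defined('IN_SITE')) {
    header("HTTP/1.1 403 Forbidden");
    exit;
}
// ...rest of the feed-rendering code...
?>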

Even better would be to move rss2html.php out from under the document root, but some hosts make that difficult to do. Whether that's an option depends on your server/host's setup.

cHao
2

PHP Limit/Block Website requests for Spiders/Bots/Clients etc.

Here is a PHP function I wrote that can block unwanted requests to reduce your website traffic. It is good for spiders, bots, and annoying clients.

CLIENT/Bots Blocker

DEMO: http://szczepan.info/9-webdesign/php/1-php-limit-block-website-requests-for-spiders-bots-clients-etc.html

CODE:

/* Function which can Block unwanted Requests
 * @return array of error messages
 */
function requestBlocker()
{
        /*
        Version 1.0 11 Jan 2013
        Author: Szczepan K
        http://www.szczepan.info
        me[@] szczepan [dot] info
        ###Description###
        A PHP function which can Block unwanted Requests to reduce your Website-Traffic.
        Good for spiders, bots and annoying clients.

        */

        # Before using this function you must 
        # create & set this directory as writeable!!!!
        $dir = 'requestBlocker/';

        $rules   = array(
                #You can add multiple Rules in a array like this one here
                #Notice that large "sec definitions" (like 60*60*60) will blow up your client File
                array(
                        //if >5 requests in 5 Seconds then Block client 15 Seconds
                        'requests' => 5, //5 requests
                        'sek' => 5, //5 requests in 5 Seconds
                        'blockTime' => 15 // Block client 15 Seconds
                ),
                array(
                        //if >10 requests in 30 Seconds then Block client 20 Seconds
                        'requests' => 10, //10 requests
                        'sek' => 30, //10 requests in 30 Seconds
                        'blockTime' => 20 // Block client 20 Seconds
                ),
                array(
                        //if >200 requests in 1 Hour then Block client 10 Minutes
                        'requests' => 200, //200 requests
                        'sek' => 60 * 60, //200 requests in 1 Hour
                        'blockTime' => 60 * 10 // Block client 10 Minutes
                )
        );
        $time    = time();
        $blockIt = array();
        $user    = array();

        #Set Unique Name for each Client-File 
        $user[] = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'IP_unknown';
        $user[] = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        $user[] = strtolower(gethostbyaddr($user[0]));

        # Notice that I use files because bots do not accept Sessions
        $botFile = $dir . substr($user[0], 0, 8) . '_' . substr(md5(join('', $user)), 0, 5) . '.txt';


        if (file_exists($botFile)) {
                $file   = file_get_contents($botFile);
                $client = unserialize($file);

        } else {
                $client                = array();
                $client['time'][$time] = 0;
        }

        # Set/Unset Blocktime for blocked Clients
        if (isset($client['block'])) {
                foreach ($client['block'] as $ruleNr => $timestampPast) {
                        $elapsed = $time - $timestampPast;
                        if (($elapsed ) > $rules[$ruleNr]['blockTime']) {
                                unset($client['block'][$ruleNr]);
                                continue;
                        }
                        $blockIt[] = 'Block active for Rule: ' . $ruleNr . ' - unlock in ' . ($rules[$ruleNr]['blockTime'] - $elapsed) . ' Sec.';
                }
                if (!empty($blockIt)) {
                        return $blockIt;
                }
        }

        # log/count each access
        if (!isset($client['time'][$time])) {
                $client['time'][$time] = 1;
        } else {
                $client['time'][$time]++;

        }

        #check the Rules for Client
        $min = array(
                0
        );
        foreach ($rules as $ruleNr => $v) {
                $i            = 0;
                $tr           = false;
                $sum[$ruleNr] = 0;
                $requests     = $v['requests'];
                $sek          = $v['sek'];
                foreach ($client['time'] as $timestampPast => $count) {
                        if (($time - $timestampPast) < $sek) {
                                $sum[$ruleNr] += $count;
                                if ($tr == false) {
                                        #register non-use Timestamps for File 
                                        $min[] = $i;
                                        unset($min[0]);
                                        $tr = true;
                                }
                        }
                        $i++;
                }

                if ($sum[$ruleNr] > $requests) {
                        $blockIt[]                = 'Limit : ' . $ruleNr . '=' . $requests . ' requests in ' . $sek . ' seconds!';
                        $client['block'][$ruleNr] = $time;
                }
        }
        $min = min($min) - 1;
        # drop timestamps in the file that no rule needs any more
        $i = 0;
        foreach ($client['time'] as $k => $v) {
                if ($i < $min) {
                        unset($client['time'][$k]);
                }
                $i++;
        }
        $file = file_put_contents($botFile, serialize($client));


        return $blockIt;

}


if ($t = requestBlocker()) {
        echo 'dont pass here!';
        print_R($t);
} else {
        echo "go on!";
}
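
For example, at the top of rss2html.php you could turn a block into a real HTTP error instead of just printing a message. This is only a sketch; the file name requestBlocker.php and the 429 response are my choices, not part of the original function:

// top of rss2html.php, before any expensive feed work is done
require_once __DIR__ . '/requestBlocker.php'; // wherever the function is saved

if ($blocked = requestBlocker()) {
        header("HTTP/1.1 429 Too Many Requests");
        header("Retry-After: 60");
        exit('Too many requests.');
}
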
dazzafact
2

I've solved the same issue with the script available at http://perishablepress.com/blackhole-bad-bots/. With this blackhole approach I collected a list of malicious IPs and then denied them in .htaccess. (That part is not mandatory, since the script itself does the banning, but I wanted to reduce the server load by avoiding PHP parsing for known unwanted IPs.) Within three days my traffic came down from 5 GB per day to 300 MB, which is about what I expected.

Also check this page for a full list of .htaccess rules to block many known junk bots: http://www.askapache.com/htaccess/blocking-bad-bots-and-scrapers-with-htaccess.html

Nishad
1

It's likely that your site is being crawled by fake Googlebot(s). You can try adding a check and serving a 404 for all fake Googlebot requests.

Here is an article that explains how to verify Googlebot: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

Also you could check your records against known fake bots: http://stopmalvertising.com/security/fake-google-bots.html
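
If you want to do that check in PHP, a minimal sketch of the reverse-then-forward DNS verification described in that article could look like this (the googlebot.com/google.com suffixes are the ones Google documents; everything else is illustrative):

<?php
// Verify a claimed Googlebot: reverse-resolve the IP, check the hostname
// suffix, then forward-resolve the hostname and make sure it maps back.
function isRealGooglebot($ip)
{
    $host = gethostbyaddr($ip);
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false && !isRealGooglebot($_SERVER['REMOTE_ADDR'])) {
    header("HTTP/1.1 404 Not Found");
    exit;
}
?>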

1

You should really make sure, first of all, that any page requested by the user agent of whichever abusive crawler you have is served a static page.

Use Apache mod_rewrite with a condition, or the equivalent for your HTTP server. For Apache, something like this:

RewriteCond  %{HTTP_USER_AGENT}  ^GoogleBot [OR]
RewriteCond  %{HTTP_USER_AGENT}  ^OtherAbusiveBot
RewriteRule  ^/$                 /static_page_for_bots.html  [L]
smassey
1

To continue on smassey's post, you can put several conditions:

RewriteCond  %{HTTP_USER_AGENT}  ^GoogleBot [OR]
RewriteCond  %{HTTP_USER_AGENT}  ^OtherAbusiveBot
RewriteRule  ^rss2html\.php$     /static.html  [L]

This way, the bots still access your pages, just not that one. Since it's strange that the (legitimate) bots are not keeping to the rules, do you have any referrers that push bots to your page from other sources (domain-name forwarding, ...)?

ndrix
0

Old question, but current answers don't appear to have addressed the "error"(s) in the robots.txt file...

I have also put code in the robots.txt files to block access to the scripts directories, but these bots (Google, MS Bing, and Yahoo) ignore the rules and run the scripts anyway.

Yes, they are actually obeying the robots.txt file as written. This robots.txt file does not prevent these bots (Google, MS Bing, and Yahoo) from crawling the scripts directories!

When a bot parses the robots.txt file, it only matches against at most one group of directives, identified by the User-agent directive(s). The "special" User-agent: * (all bots) group only applies if the bot does not match against any other - more specific - group. It is not always processed, as appears to be the assumption here. All the named bots in this robots.txt file are permitted to crawl everything.

For "Google, MS Bing and Yahoo" they each have their own specific group that allows unrestricted crawling of everything (ie. Disallow: ). Consequently they don't see the directives in the User-agent: * group that blocks specific URLs.

User-agent: *
Disallow: /
Disallow: /cgi-bin/
Disallow: /ads/
:

The first Disallow: / directive blocks "all other" bots from crawling everything. The more specific Disallow directives that follow are entirely superfluous.

What I really need is to prevent the bots from running the rss2html.php script. I tried both sessions and cookies, and both failed.

To prevent all (good) bots from crawling rss2html.php, all you would need is the following, with no specific groups for the other bots:

User-agent: *
Disallow: /scripts/rss2html.php
MrWhite
0
// Checks to see if this script was called by the main site pages,
// (i.e. index.php or mobile.php) and if not, then sends to main page
session_start();  
if (isset($_SESSION['views'])) {$_SESSION['views'] = $_SESSION['views']+ 1;} else {$_SESSION['views'] = 1;}
if ($_SESSION['views'] > 1) {header("Location: http://website.com/index.php");}

This script does not do what the comment says; in fact, it does the complete opposite. It will always let the bots through, since the session variable will never be set when the bot requests your script. All it will potentially do is prevent legitimate requests (from index.php or mobile.php) from calling the script more than once.

In order to prevent a bot from accessing your script, you should only allow access if a session variable (or cookie) is actually set - assuming, of course, that the (malicious) bot does not accept cookies. (We know that the real Googlebot doesn't.)
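
A minimal sketch of that reversal, assuming index.php and mobile.php are the only legitimate entry points (the session key name allow_rss2html is just an example):

<?php
// index.php / mobile.php - set the flag before the page calls rss2html.php
session_start();
$_SESSION['allow_rss2html'] = true;
?>

<?php
// rss2html.php - only serve clients that already carry the session flag
session_start();
if (empty($_SESSION['allow_rss2html'])) {
    header("HTTP/1.1 403 Forbidden");
    exit;
}
?>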

As has already been mentioned, placing rss2html.php above the web root (outside of the public webspace) would prevent a bot from accessing the script directly - but you say this causes other problems? Or, place it in a directory and protect that directory with .htaccess. Or you might even be able to protect just the file itself in .htaccess from direct requests?

MrWhite
0

Go set up your domain on Cloudflare (they have a free plan for this). They block malicious bots at the domain level before they ever hit your server. It takes about 20 minutes, and you never have to monkey with the code.

I use this service on all my sites and all client sites. They identify malicious bots based on a number of techniques, including leveraging Project Honey Pot.

-2

What you need to do is install an SSL certificate on your server for Apache/nginx/email/FTP. Enable HSTS, and edit your ssl.conf file so that SSLv2, SSLv3, and TLSv1 are disabled and do not accept incoming connections. Harden your server the right way and you will not have any issues from bots.

Robert