A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork... my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

jdMorgan

6:27 pm on Sep 9, 2002 (gmt 0)

Yes, you can cut 'n paste that code into your .htaccess file - at your own risk.

A couple of points first, though...

The [1] at the beginning of the first line is spurious, and should be removed, leaving the line reading:

RewriteEngine On

As it stands, this example code will generate a 403-Forbidden response. You can also configure it to respond with other error codes, or with permanent or temporary redirects to other pages on your own site or elsewhere. I strongly encourage you to read the documentation [httpd.apache.org] on mod_rewrite, whether you plan to "tweak" this example code or not. As stated in the documentation, mod-rewrite is a powerful tool; and as such, it is also a dangerous tool. Some time spent "reading the fine manual" may save you a lot of grief in the future.
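
As a rough sketch of those alternatives: the bot name below is taken from the list above, while /bots.html is a hypothetical page that is not part of the original file. Use one variant or the other, not both:

```
# Variant 1: answer 410 Gone instead of 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [G]

# Variant 2: temporary redirect to an explanatory page (hypothetical /bots.html).
# The second condition keeps the page itself out of the rule,
# otherwise the redirect would loop.
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteCond %{REQUEST_URI} !^/bots\.html$
RewriteRule ^.* /bots.html [R=302,L]
```

Changing R=302 to R=301 would make the redirect permanent.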

By changing the final line to:
RewriteRule ^.* - [F,L]
you can minimize interactions with following rewrite rulesets, and also minimize CPU overhead for processing. The "L" tells mod-rewrite that this is the last rule that needs to be processed in this case, and to stop rewriting as soon as it is processed.

You can customize the 403-Forbidden page returned to the bad-bot (keeping in mind that at some time, as you modify this, you might introduce an error and catch an innocent person instead) to explain what happened and what to do about it. To do this, add:
ErrorDocument 403 /my403.html
at the beginning of the example code, and then create a custom 403 error page (called "my403.html" in this example.)
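
One thing to watch for, assuming the ban rule matches every URL as it does in this example: the request for my403.html would itself be forbidden, so the custom page could not be delivered to a banned user-agent. A minimal sketch of one way around this is to exclude the error page from the rule (only two of the user-agents are shown, for brevity):

```
ErrorDocument 403 /my403.html

RewriteEngine On
# Never apply the ban to the error page itself
RewriteCond %{REQUEST_URI} !^/my403\.html$
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F,L]
```

The first condition has no [OR], so it must be true in addition to at least one of the user-agent conditions.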

All RewriteConds in this example are case-sensitive. This leaves it open to a few more errors as you maintain the file. To make the pattern-match case-insensitive, change the [OR] flag at the end of each line to [NC,OR]. Note also that the [OR] must not be included on the very last RewriteCond - the one directly preceding the RewriteRule. If it is, you'll lock up your server, and you and your users will get 500-Server Error responses to all requests. (After changing anything in your .htaccess file, it's a very good idea to access your own site, and make sure it still works!)
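
A short sketch of what that looks like, using three of the user-agents from the list above. Since the last condition carries no [OR], it gets a plain [NC]:

```
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
# Last condition before the rule: case-insensitive, but no OR
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC]
RewriteRule ^.* - [F,L]
```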

All RewriteConds in this example assume that the user-agent starts with the pattern of characters shown (that's what the "^" character means). Some user-agent strings do not start with the "bad-bot" user agent string; they start with something common like "Mozilla/3.01" and then contain the bad-bot identification further on in the string. To catch these guys, you will need to remove the starting text anchor "^" from the pattern match string. This makes the pattern matching less efficient, and should only be done if necessary.

Here's one example that I know needs to be changed:
RewriteCond %{HTTP_USER_AGENT} Indy.Library [NC,OR]

Note that I removed the starting "^", so that it will ban any user-agent with "Indy Library" anywhere in its user-agent string, and that the "." in the pattern will accept any character - including a space - after "Indy".

Again - Yes, you can cut 'n paste this into your .htaccess file - at your own risk. I recommend that you minimize this risk by reading the mod_rewrite documentation.

Hope this helps,
Jim

Nick_W

6:38 pm on Sep 9, 2002 (gmt 0)

Helps a lot!

Thanks for taking the time to go through that with us Jim ;-)

Nick

jdMorgan

9:03 pm on Sep 9, 2002 (gmt 0)

Forgot something, though...

I've posted this before, but just in case:

mod-rewrite (and many related Apache modules) depend on "regular expressions" for pattern-matching. You can find a short and useful tutorial here [etext.lib.virginia.edu] on the University of Virginia Library Web site.

This is a big help in figuring out ^(what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean|how\ to\ write\ them\ correctly)\.$
;)
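
For anyone who hasn't read the tutorial yet, here is a rough cheat-sheet of the characters that show up most often in this thread, written as a commented ruleset. The patterns reuse user-agent names from the posts above; the grouping in the last condition is only an illustration and would match those two names exactly, with nothing before or after them.

```
# ^  anchors the match at the start of the user-agent string
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
# .  matches any single character; .* matches any run of characters (including none)
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
# [Ww] matches exactly one of the characters listed inside the brackets
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
# A backslash before a space keeps the space from ending the pattern
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
# |  separates alternatives inside (...); $ anchors the match at the end
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf)$
RewriteRule ^.* - [F,L]
```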

Jim

andreasfriedrich

9:22 pm on Sep 9, 2002 (gmt 0)

This is a big help in figuring out ^(what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean|how\ to\ write\ them\ correctly)\.$

Shouldn't it read: This is a big help in figuring out (?:(?:^.*what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean.*how\ to\ write\ them\ correctly)|(?:^.*how\ to\ write\ them\ correctly.*what\ all\ the\ strange\ characters\ in\ mod_rewrite\ directives\ mean))\.$

andreasfriedrich

12:41 am on Sep 10, 2002 (gmt 0)

Some of you may wonder about the performance impact of having such a large .htaccess file. I did some simple benchmark tests.

Setup

  • System

    Server: Apache/1.3.26 (Unix) mod_ssl/2.8.10
    OpenSSL/0.9.6d PHP/4.2.1

    Linux version 2.2.19-7.0.16
    Detected 467741 kHz processor.
    Memory: 257496k/262080k available (1076k kernel code,
    416k reserved, 3020k data, 72k init, 0k bigmem)
    128K L2 cache (4 way)
    CPU: L2 Cache: 128K
    CPU: Intel Celeron (Mendocino) stepping 05

  • benchmark script
    #!/usr/bin/perl

    use LWP::UserAgent;
    use LWP::Simple;
    use Time::HiRes qw(gettimeofday);

    $url = "http://server/root/test.html";
    foreach $agent (qw(BlackWidow Zeus AaronCarter)) {
        for($j=0;$j<10;$j++) {
            $ua = new LWP::UserAgent;
            $ua->agent($agent);
            $t0 = gettimeofday;

            # Request document and parse it as it arrives
            for(my $i=1;$i < 100;$i++) {
                $res = $ua->request(HTTP::Request->new(GET => $url),
                                    sub { });
            }
            $t{$agent} += gettimeofday-$t0;
        }
        $t{$agent} = $t{$agent}/($j+1);
    }

    print map { $_,' needed ', $t{$_}, ' seconds.', "\n"} sort keys %t;

  • stress script
    #!/usr/bin/perl

    use LWP::UserAgent;

    $url = "http://server/www.pension-schafspelz.de/";
    $ua = new LWP::UserAgent;

    # Request document and parse it as it arrives
    while (1) {
        $res = $ua->request(HTTP::Request->new(GET => $url),
                            sub { });
    }

  • .htaccess with single RewriteCond directive
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} BlackWidow|Bot\ mailto:craftbot@yahoo.com|ChinaClaw|DISCo|Download\ Demon|eCatch|EirGrabber|EmailSiphon|Express\ WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView|HTTrack|Image\ Stripper|Image\ Sucker|InterGET|Internet\ Ninja|JetCar|JOC\ Web\ Spider|larbin|LeechFTP|Mass\ Downloader|MIDown\ tool|Mister\ PiX|Navroad|NearSite|NetAnts|NetSpider|Net\ Vampire|NetZIP|Octopus|Offline\ Explorer|Offline\ Navigator|PageGrabber|Papa\ Foto|pcBrowser|RealDownload|ReGet|Siphon|SiteSnagger|SmartDownload|SuperBot|SuperHTTP|Surfbot|tAkeOut|Teleport\ Pro|VoidEYE|Web\ Image\ Collector|Web\ Sucker|WebAuto|WebCopier|WebFetch|WebReaper|WebSauger|Website\ eXtractor|WebStripper|WebWhacker|WebZIP|Wget|Widow|Xaldon\ WebSpider|Zeus
    RewriteRule .* - [F,L]

  • .htaccess with multiple RewriteCond directives

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [OR]
    RewriteCond %{HTTP_USER_AGENT} Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} Siphon [OR]
    RewriteCond %{HTTP_USER_AGENT} SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} Zeus
    RewriteRule .* - [F,L]

Results

  • .htaccess, multiple RewriteCond, idle server

    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 2.03904145414179 seconds.
    BlackWidow needed 1.89269917661493 seconds.
    Zeus needed 1.90201771259308 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 2.05427220734683 seconds.
    BlackWidow needed 1.90449017828161 seconds.
    Zeus needed 1.91795318776911 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 2.04453534429724 seconds.
    BlackWidow needed 1.89828474955125 seconds.
    Zeus needed 1.90684572133151 seconds.

  • httpd.conf, multiple RewriteCond, idle server

    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.48258856209842 seconds.
    BlackWidow needed 1.41852938045155 seconds.
    Zeus needed 1.4474944526499 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.47204467383298 seconds.
    BlackWidow needed 1.40937690301375 seconds.
    Zeus needed 1.42638698491183 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.49393899874254 seconds.
    BlackWidow needed 1.42769226160916 seconds.
    Zeus needed 1.44513262401928 seconds.

  • .htaccess, single RewriteCond, idle server

    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.77475028688257 seconds.
    BlackWidow needed 1.69021615115079 seconds.
    Zeus needed 1.59830655834892 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.76538353616541 seconds.
    BlackWidow needed 1.68273590911518 seconds.
    Zeus needed 1.5909228108146 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.77456200122833 seconds.
    BlackWidow needed 1.69423974644054 seconds.
    Zeus needed 1.60414087772369 seconds.

  • httpd.conf, single RewriteCond, idle server

    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.50137218562039 seconds.
    BlackWidow needed 1.43611000884663 seconds.
    Zeus needed 1.45189526948062 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.48307919502258 seconds.
    BlackWidow needed 1.41925389116461 seconds.
    Zeus needed 1.43655191768299 seconds.
    [af@server SuchmaschinenTricks]$ ./bm
    AaronCarter needed 1.49346754767678 seconds.
    BlackWidow needed 1.43283208933744 seconds.
    Zeus needed 1.44978855956684 seconds.

  • .htaccess, multiple RewriteCond, server under stress

    K:\SuchmaschinenTricks>perl bm
    AaronCarter needed 6.25990908796137 seconds.
    BlackWidow needed 5.3813636302948 seconds.
    Zeus needed 5.76372727480802 seconds.

  • httpd.conf, multiple RewriteCond, server under stress

    K:\SuchmaschinenTricks>perl bm
    AaronCarter needed 5.58627272735943 seconds.
    BlackWidow needed 5.23381818424572 seconds.
    Zeus needed 5.10827272588556 seconds.

  • .htaccess, single RewriteCond, server under stress

    K:\SuchmaschinenTricks>perl bm
    AaronCarter needed 6.03227272900668 seconds.
    BlackWidow needed 5.1229090907357 seconds.
    Zeus needed 5.46418181332675 seconds.

  • httpd.conf, single RewriteCond, server under stress

    K:\SuchmaschinenTricks>perl bm
    AaronCarter needed 5.32499999349768 seconds.
    BlackWidow needed 4.55199999159033 seconds.
    Zeus needed 5.22927273403515 seconds.

Conclusion

  • Use a single RewriteCond directive in your .htaccess files.
  • Use multiple RewriteCond directives in your httpd.conf file.

.htaccess files have to be read each and every time a request is made. It takes longer to parse and compile the multiple RewriteCond directives than the one with the longer, single regular expression. The parsing is the bottleneck in this scenario.

The httpd.conf file is read once, when the server starts up. The regular expressions are compiled once. Multiple short and simple REs execute faster than the one single, complex one. Execution time is the factor that matters most in this case.
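
As a minimal sketch of the httpd.conf variant, here is the multiple-RewriteCond ruleset at server level, shortened to three of the user-agents from the test. The directory path is a placeholder and not taken from the test setup. Depending on how virtual hosts are configured, the rules may need to go inside the relevant VirtualHost section instead.

```
# httpd.conf: parsed once at server startup
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} Zeus
RewriteRule .* - [F,L]

# Optional: stop Apache from looking for .htaccess files at all
# ("/var/www/html" is a placeholder for your document root)
<Directory "/var/www/html">
    AllowOverride None
</Directory>
```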

[edited by: jatar_k at 4:35 pm (utc) on Sep. 22, 2002]
[edit reason] stopped side scroll [/edit]

caine

1:06 am on Sep 10, 2002 (gmt 0)

This has been one hell of a read. I've got my copy of the .htaccess file and will drop it on the server for my next site. Impressive.

Who needs books!

jdMorgan

1:08 am on Sep 10, 2002 (gmt 0)

andreasfriedrich,

Excellent data...

I wonder what the result would be performance-wise with four RewriteCond lines: One each for start-anchored, end-anchored, fully-anchored, and unanchored pattern strings. I use this method to keep the patterns organized by type, and to keep them neat and easy to maintain.
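
A sketch of that layout, for illustration only: the start-anchored and unanchored names come from earlier in this thread, while ExampleBot and ExampleAgent are made-up placeholders for the end-anchored and fully-anchored cases.

```
# Start-anchored: user-agent begins with one of these
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro) [OR]
# End-anchored: user-agent ends with one of these (placeholder name)
RewriteCond %{HTTP_USER_AGENT} (ExampleBot)$ [OR]
# Fully-anchored: user-agent is exactly one of these (placeholder name)
RewriteCond %{HTTP_USER_AGENT} ^(ExampleAgent)$ [OR]
# Unanchored: name may appear anywhere in the user-agent
RewriteCond %{HTTP_USER_AGENT} (Indy.Library|larbin)
RewriteRule .* - [F,L]
```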

Thanks for the test - Very useful.

Jim

bird

1:22 am on Sep 10, 2002 (gmt 0)

Interesting comparison Andreas, thanks!

If I'm reading your code correctly, then you're demonstrating two things:

a) The response time difference between the two .htaccess files is roughly 10%.

b) Each call takes between 0.002 and 0.006 seconds, depending on the load of the machine.

Combining those two, we're talking about an average additional overhead caused by multiple RewriteConds of 0.0004 seconds (1/2500 seconds) per request, on a somewhat aged machine.

Since you're always fetching the same file from the disk cache, the remaining server side overhead is very low (in contrast to a real life situation, where each request may cause a different set of files to be loaded from disk first), so we can assume that the difference is indeed caused by the different rule sets. The comments in your code talk about parsing the HTML, but I don't understand enough Perl to see if that really happens.

In summary, I'd prefer the maintenance friendly multiple-rule version any time, if it only costs me such a small price in terms of response time. Of course, I'm not serving millions of requests per day, so your mileage may vary.

andreasfriedrich

1:37 am on Sep 10, 2002 (gmt 0)

comments in your code talk about parsing the HTML, but I don't understand enough Perl to see if that really happens.

No, there is no parsing going on, since it's irrelevant for the benchmarking.

on a somewhat aged machine

Don't insult my trusted old linux box. I cannot guarantee against any DoS attacks it might launch on its own ;)

Since you're always fetching the same file from the disk cache, the remaining server side overhead is very low [...], so we can assume that the difference is indeed caused by the different rule sets

That was the idea behind the admittedly artificial setup.

I'd prefer the maintenance friendly multiple-rule version any time

As is quite often the case, it's a tradeoff between speed and maintainability.

pmkpmk

9:36 am on Sep 11, 2002 (gmt 0)

Hi,

I've followed this thread for quite some time, since our server gets harassed by those "evil bots" as well. However, I couldn't quite decide to take action until jdMorgan's post, which was almost a how-to.

Having done this, though, I ran into my first problem, because my Apache 1.3 says:

Options FollowSymLinks or SymLinksIfOwnerMatch is off which implies that RewriteRule directive is forbidden

Simply adding a "FollowSymLinks on" on top of the htaccess doesn't work.

Any advice?
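
For reference, the Options directive does not take an "on" argument; options are switched on with a leading "+". A minimal sketch, assuming the server's AllowOverride setting permits Options in .htaccess files (if it does not, the line would have to go into httpd.conf instead):

```
# Enable symlink following so that mod_rewrite is allowed in this directory
Options +FollowSymLinks
RewriteEngine On
```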

On a sidenote: I found some packages like "Sugarplum" and "robotcop" which promise to automate some of the functions intended by this .htaccess. Any experiences with these?

This 243 message thread spans 25 pages: 243