A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork....my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
# Block requests referred from this site. RewriteRule only tests the URL-path,
# never the host name, so match any path here.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]

jdMorgan

10:49 pm on Oct 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



58sniper,

But since it's already there...

That first UA you've got commented out may have been meant to block "^Mozzilla*" - a misspelled and bogus user-agent. Blocking the "two z's" version is a good idea, blocking the common version certainly isn't! :)

Jim

Annii

11:09 pm on Oct 4, 2002 (gmt 0)

10+ Year Member



JDMorgan,
Thanks very much for your help.
So.. I've removed the 400 error doc from part 1 and changed the 403 error document to 403.htm

I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...

I've changed the last bit of Part 3 as you suggested.

Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

Is it better to use absolute or relative paths for the error documents?

How can I check that it's working once I've uploaded it?

Thanks again

Anni

carfac

11:44 pm on Oct 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Annii:

RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

RewriteRule- The conclusion of all those conditions, and the terminator of the block. It means that if any of those conditions match (any, because you use [OR]), do this.

!- this means not
^- means the request string BEGINS EXACTLY
403.htm- name of your 403 file
$- ends exactly (the above will not match 403.html!)

- means do nothing

F means this is forbidden- return FORBIDDEN (403)
L means LAST, do nothing further and end all rewrite rules for any requests affected by this block
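
The breakdown above can be checked outside Apache. This is only a rough sketch in Python, whose regex syntax agrees with Apache's for a pattern this simple; the helper name rule_applies is made up for illustration. It assumes the path is tested without a leading slash, as it is in a per-directory .htaccess file.

```python
import re

# Pattern copied from the rule. The dot is unescaped, so it matches any character.
PATTERN = re.compile(r"^403.htm$")

def rule_applies(path):
    """Mimic the leading "!": the rule fires when the pattern does NOT match."""
    return PATTERN.search(path) is None

print(rule_applies("index.html"))  # True: forbidden for a banned agent
print(rule_applies("403.htm"))     # False: the error page itself is let through
print(rule_applies("403.html"))    # True: "$" stops the pattern matching 403.html
print(rule_applies("403xhtm"))     # False: the unescaped dot also accepts "x"
```

Writing the pattern as ^403\.htm$ would close that last loophole.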

Is it better to use absolute or relative paths for the error documents?

for the first part of the RewriteRule, use a URI from the root directory
for the second part, you HAVE to use a full URL (http://www.domain.com/)

How can I check that it's working once I've uploaded it?

I would recommend going to [wannabrowser.com...]

and spoofing your UA!

Good Luck!

dave

jdMorgan

12:39 am on Oct 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I see that carfac has already replied, but I was called away while writing this, and since it took a while to write, I'm gonna post it anyway... :)

I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...

The change I suggested will do the same. You just had a somewhat complicated and inefficient regex pattern saying, "match anything that ends with htm or html". The two methods are equivalent, except that the original method would match htm, html, htmll, htmlll, or htmlllllllllllllllllllll, etc.
The new method will match anything that ends with htm or html only.

If you want to exclude shtm and shtml files from the match, use <FilesMatch "\.html?$"> which will require the path to end with ".htm" or ".html".
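
To see the difference on a few file names, here is a small Python sketch. The strict pattern is the one given above; the loose pattern is only a hypothetical stand-in for an "htm followed by any number of l's" regex, not the pattern from the original file, and the file names are invented.

```python
import re

STRICT = re.compile(r"\.html?$")  # pattern suggested above
LOOSE = re.compile(r"html*$")     # hypothetical stand-in for a looser pattern

names = ["page.htm", "page.html", "page.htmll", "page.shtml", "page.shtm"]

strict_hits = [n for n in names if STRICT.search(n)]
loose_hits = [n for n in names if LOOSE.search(n)]

print(strict_hits)  # ['page.htm', 'page.html']
print(loose_hits)   # all five names, including the .shtml and .shtm files
```

The escaped dot is what keeps .shtm and .shtml out: the character before "htm" has to be a literal dot.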

Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?

If the conditions match:
Leave any requested URL except 403.htm unchanged (that is what the "-" means), return a Forbidden server status code, and stop processing rewrite rules; this is the Last one to process. The result is that any banned User-Agent will receive a 403-Forbidden server response, with your custom 403 error page, 403.htm, as its body. The server fetches that page with an internal redirect, which passes through these rules again, which is why you must not forbid that URL. Most bad-bots will ignore the page content, but that's OK.

Is it better to use absolute or relative paths for the error documents?

ErrorDocument paths must be local URL-paths (for example /403.htm) rather than full http:// URLs, otherwise a 302-Moved Temporarily response code will be sent to the requesting client, masking the correct error code.

How can I check that it's working once I've uploaded it?

That's tricky... The first thing to check is whether you can still access your web site. Various errors in .htaccess can result in a 500-Server Error code being returned, and your site will be inaccessible. Be ready to remove your new .htaccess and replace it with a known-good backup if this happens! Then view your server error log to find out what caused the server error.

The next part can be done several ways. Checking whether your User-agent blocks work can be accomplished by modifying your registry entries for Internet Explorer to make it send a blocked User-agent string. Do this only if you are familiar with registry backups and editing! Otherwise, you can simply check your log files once in a while to confirm that bad bots are being blocked as expected.
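
Another way to send a blocked User-agent string, without touching the registry, is a few lines of script. The sketch below uses Python's standard urllib; the function names and the example URL are placeholders, not anything from this thread.

```python
import urllib.error
import urllib.request

def build_request(url, user_agent):
    """Build a GET request that announces the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is what we want.
        return err.code

# Example (placeholder URL, replace with your own site):
# fetch_status("http://www.example.com/", "Wget/1.8.2")
```

If the ban list is working, a banned string such as "Wget/1.8.2" should come back as 403, while an ordinary browser string should come back as 200.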

Testing the custom 404 error document is easy, just request a non-existent page from your site. Testing the 500-series codes is more difficult, since you will need to create redirects for several non-existent files and then request those files in order to test the custom handlers:

RewriteRule ^test501.htm$ - [R=501,L]
RewriteRule ^test502.htm$ - [R=502,L]
etc.

Also, unless you are handling password logins with a custom script, I suggest that you do not redirect 401s to a custom error document.

Again, spending some time reviewing the Apache server documentation [httpd.apache.org] will clear up many questions. I print it out once a year, or when my current copy is worn out, whichever comes first! :)

Jim

[edited by: jdMorgan at 12:59 am (utc) on Oct. 5, 2002]

Annii

12:53 am on Oct 5, 2002 (gmt 0)

10+ Year Member



Dave and JD Morgan
Thank you both so much for explaining this and taking so much time to do so... I really appreciate it. It's late here (UK) so I'll give it all a go tomorrow and see how I get on :)

Thanks again

Anni

Superman

2:52 am on Oct 5, 2002 (gmt 0)

10+ Year Member


Wow, this thread has gotten a lot of activity lately after being dead for a long time.

.htaccess rocks, and there are many other things I use it for. Here is a great one for preventing people from hotlinking your files:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*yourdomain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*12\.345\.67\.890/ [NC]
RewriteRule .* http://yourdomain.com [L,R]

Obviously "yourdomain.com" would be your domain name, and the "12.345.67.890" would be your site's IP address.

For example, I use this one in my "images" directory to prevent people from hotlinking my images on their sites. I also have it in my "logs" folder so they can't view my site logs.
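
To see which referers these two conditions let through, here is a rough Python sketch of the same test. The function name and the sample referers are invented for illustration, and the patterns are written with the dots escaped.

```python
import re

# The two referer patterns, both case-insensitive like [NC].
ALLOWED = [
    re.compile(r"^http://([a-z0-9-]+\.)*yourdomain\.com/", re.IGNORECASE),
    re.compile(r"^http://([a-z0-9-]+\.)*12\.345\.67\.890/", re.IGNORECASE),
]

def is_redirected(referer):
    """True when neither pattern matches, i.e. both negated conditions hold."""
    return not any(p.search(referer) for p in ALLOWED)

print(is_redirected("http://www.yourdomain.com/gallery.html"))  # False
print(is_redirected("http://othersite.example/page.html"))      # True
print(is_redirected(""))                                        # True
```

Note the last case: a request with no referer at all is redirected too, so visitors whose browser or proxy strips the Referer header will not get the files either.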

-Superman-

Superman

3:01 am on Oct 5, 2002 (gmt 0)

10+ Year Member



Another way to shorten your list is to put those bots that read and respect robots.txt there instead. I have moved "ia_archiver", "psbot", and "SlySearch" to my robots.txt file. "internetseer.com" also reads and respects the robots.txt file, although I actually signed up for their service so I don't block them anymore.
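
For what it's worth, the effect of such robots.txt entries can be checked with Python's standard robots.txt parser. The file content below is only a guess at what the entries would look like, and it tells you nothing about whether a given bot actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: shut out the three bots named above, site-wide.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: psbot
Disallow: /

User-agent: SlySearch
Disallow: /
"""

def allowed(user_agent, url="/index.html"):
    """Return True if a robots.txt-respecting bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("ia_archiver"))  # False
print(allowed("psbot"))        # False
print(allowed("Googlebot"))    # True: no entry applies, so it is allowed
```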

-Superman-

Superman

3:07 am on Oct 5, 2002 (gmt 0)

10+ Year Member



Sniper, you are blocking Googlebot by blocking everything with "bot" in it. Probably a bunch of other good bots as well.
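
A quick Python sketch of why this happens. The broad pattern is a hypothetical catch-all of the kind described, not the exact line from Sniper's file, and the user-agent strings are samples.

```python
import re

BROAD = re.compile(r"bot", re.IGNORECASE)  # hypothetical unanchored catch-all
NARROW = re.compile(r"^psbot")             # anchored entry for one specific bot

agents = [
    "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",
    "psbot/0.1 (+http://www.picsearch.com/bot.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]

broad_hits = [ua for ua in agents if BROAD.search(ua)]
narrow_hits = [ua for ua in agents if NARROW.search(ua)]

print(len(broad_hits))   # 2: Googlebot is caught along with psbot
print(len(narrow_hits))  # 1: only psbot
```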

-Superman-

Superman

3:27 am on Oct 5, 2002 (gmt 0)

10+ Year Member



Here is my latest, thoroughly researched .htaccess file to block evil bots and site downloaders ... with some new tricks integrated from the recent posts:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

Some notes:

1. The [1] at the beginning and the [2] at the end of my original file were added sometime after this site's format changed (something to do with the formatting codes). Anyway, they did not belong there.

2. All the things in my list have been thoroughly researched. 90 percent of them are Site Downloaders. There are also some email harvesters and other evil things (like VoidEye).

3. If you know a bot respects robots.txt, put it there. It will shorten your list (see my post above). If anybody sees something in my list that definitely obeys robots.txt, please let me know.

4. Adding [NC,OR] to all of your entries will only make your file that much bigger. 99 percent of these things always use the exact user-agent name. If there are anomalies (like httrack), then by all means make it case-insensitive. Same with the ^ character: they always start the same way, so leave the anchor on.
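
Here is a small Python sketch of how a chain of [OR] conditions behaves with and without NC and ^. Only four entries from the list are included, the helper name is_blocked is made up, and the user-agent strings are samples rather than strings taken from logs.

```python
import re

# (pattern, no_case) pairs taken from four entries in the list above.
CONDITIONS = [
    (r"^Wget", False),
    (r"HTTrack", True),        # unanchored, with [NC]
    (r"Indy Library", True),   # unanchored, with [NC]
    (r"^Zeus", False),
]

def is_blocked(user_agent):
    """Return True if any condition matches, like RewriteConds joined by [OR]."""
    for pattern, no_case in CONDITIONS:
        flags = re.IGNORECASE if no_case else 0
        if re.search(pattern, user_agent, flags):
            return True
    return False

print(is_blocked("Wget/1.8.2"))  # True
print(is_blocked("wget/1.8.2"))  # False: ^Wget is case-sensitive
print(is_blocked("Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"))  # True
print(is_blocked("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))  # False
```

The HTTrack entry has to be unanchored because the name sits in the middle of a Mozilla-style string; an anchored ^HTTrack would miss it.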

-Superman-

stapel

5:04 am on Oct 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I note that you do not include FrontPage in your list. If you don't mind my asking: Why not?

Eliz.
