Forum Moderators: coopster & phranque


A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork....my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]

DerekT

8:51 pm on Mar 12, 2003 (gmt 0)

10+ Year Member



StopSpam,

Check your sticky mail.

residuals

4:40 am on Mar 27, 2003 (gmt 0)



Wow-just some comments and humor-no questions yet ;)

As someone else stated in a previous post if I recall....This has been one hell of a read. I read every single post in one sitting tonight. To hell with books!

I just wanted to humorously/seriously note that, for all these attempts at getting rid of the bad bots/spammers etc., they (the bad people) may all end up reading this post after they search Google to find out why their "system" isn't working any more, LOL. All the effort you've all put into this post will possibly be read by the spammers and help them. However, the harder you make things, the better, to deter beginners and robots that wouldn't read these forums anyway (unless robots get so smart that they can read forums).

Maybe this forum should be encrypted in some Arabic language that no one can read, and only "good" people get the decryption software to read it. But to figure out who is bad and who is good, we are back to square one again...lol.

I find this forum possibly the most interesting and mind-exercising forum I've ever visited!

DrDoc

6:07 am on Mar 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So you only need to put the .htaccess in your root directory. What if the robots enter from another area?

If it's entering, say, directly at www.example.com/deep/path/to/some/file.html, the server will look for an .htaccess file in each directory, starting at the root, and process any directives it finds before sending the page.
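
A minimal sketch of that lookup for the example URL, assuming a hypothetical DocumentRoot of /var/www and that AllowOverride permits .htaccess files. One caveat that goes beyond what is said above: as far as I know, mod_rewrite rules in a parent .htaccess are not carried into a subdirectory whose own .htaccess contains rewrite directives unless that subdirectory asks to inherit them, so the second half shows a hypothetical /var/www/deep/.htaccess doing that.

```apache
# For a request to /deep/path/to/some/file.html the server checks, in order
# (hypothetical DocumentRoot /var/www, from the document root down):
#   /var/www/.htaccess
#   /var/www/deep/.htaccess
#   /var/www/deep/path/.htaccess
#   /var/www/deep/path/to/.htaccess
#   /var/www/deep/path/to/some/.htaccess

# Hypothetical /var/www/deep/.htaccess that has rewrite rules of its own.
RewriteEngine On
# Pull in the conditions and rules from the parent directory's .htaccess
# (the root ban list) instead of replacing them.
RewriteOptions Inherit
```

If no subdirectory has rewrite directives of its own, the root .htaccess alone should cover deep entry points like the one above.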

pmkpmk

2:41 pm on Apr 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep. My file is probably outdated too....

Somewhere I read a snippet of code that automatically adds bots to the ban list when they request a page forbidden in robots.txt. Can't find it anymore though...

Anyone with more details?

pmkpmk

3:38 pm on Apr 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's my list. And it has a problem too - a syntax error hidden somewhere. Can anybody help?

XBitHack on
Options +FollowSymLinks
RewriteEngine On

RewriteCond %{HTTP_REFERER} iaea\.org [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "LINKS ARoMATIZED" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^DA \d\.\d+" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Download" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Internet Explore" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4.0$" [OR] # dumb bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/\?\?$" [OR] # formmail attacker
RewriteCond %{HTTP_USER_AGENT} "compatible ; MSIE 6.0" [OR] # spambot (note extra space before semicolon)
RewriteCond %{HTTP_USER_AGENT} "efp@gmx\.net" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "mister pix" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^Atomz [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EasyDL/\d\.\d+ [OR] # OD
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlickBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^FrontPage [OR] # stupid user trying to edit my site
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^IE\ \d\.\d\ Compatible.*Browser$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MSIECrawler [OR] # IE’s "make available offline" mode
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^NG [OR] # unknown bot
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR] # NameProtect spybot
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^PersonaPilot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sqworm [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SurveyBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} anarchie [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} crescent [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf) [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} express [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} extractor [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} flashget [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} getright [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} go.?zilla [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} grabber [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} imagefetch [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip)[NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ninja [NC,OR] # Download Ninja OD
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} snagger [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} tele(port|soft) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} vayala [OR] # dumb bot, doesn’t know how to follow links, generates lots of 404s
RewriteCond %{HTTP_USER_AGENT} web.?(auto|bandit|collector|copier|devil|downloader|fetch|hook|mole|miner|mirror|reaper|sauger|sucker|site|snake|stripper|weasel|zip) [NC,OR] # ODs
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR] # Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR] # Turnitin spybot

RewriteCond %{HTTP_USER_AGENT} ^Zeus

RewriteRule !err_|robots\.txt - [F,L]

[edited by: jatar_k at 4:36 pm (utc) on April 1, 2003]
[edit reason] sidescroll, had to shrink a line [/edit]

StopSpam

5:24 pm on Apr 1, 2003 (gmt 0)

10+ Year Member



What does this line do?

RewriteRule !err_|robots\.txt - [F,L]

Can someone explain to me what the F and L mean, or point me to a site that explains it? ;-)

I'd like to learn this, as this coding is powerful ;-)

PMKPMK thank you big time!

Tamsy

7:41 am on Apr 2, 2003 (gmt 0)

10+ Year Member



Hi pmkpmk

Check your Syntax at line:
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip)[NC,OR] # OD

It should read:
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip) [NC,OR] # OD

You forgot the [Space] between ..|zip) and [NC,OR]

pmkpmk

10:40 am on Apr 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



StopSpam: I'm not an expert on this topic - rather doing some educated guessing combined with cut'n'paste :-)

The meaning of the "F" and "L" flags was discussed (much) earlier in this thread. The "!err_|robots\.txt" means that the rule applies to ALL files EXCEPT files beginning with "err_" (in my case those are my error documents, like err_403.html) and robots.txt (in order to give a bot a chance to see where it is not wanted).
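
For what it's worth, here is how that rule reads piece by piece, as a commented sketch. The single Wget condition only stands in for the full list, and the anchored variant at the end is an assumption about the intent, not something posted in this thread: since the pattern has no ^ anchor, it seems to exempt any path that merely contains "err_" or "robots.txt".

```apache
RewriteEngine On

# Stand-in for the long list of conditions above.
RewriteCond %{HTTP_USER_AGENT} ^Wget

# Pattern:       !err_|robots\.txt  -> "!" negates the whole alternation, so the
#                rule fires for any path containing neither "err_" nor "robots.txt".
# Substitution:  -   -> leave the URL unchanged.
# Flags:         F   -> answer with 403 Forbidden.
#                L   -> last rule, stop processing further rewrite rules.
RewriteRule !err_|robots\.txt - [F,L]

# Assumed stricter variant (use instead of the rule above): anchoring with ^
# exempts only paths that actually begin with "err_" or "robots.txt".
# RewriteCond %{HTTP_USER_AGENT} ^Wget
# RewriteRule !^(err_|robots\.txt) - [F,L]
```

The anchored form relies on the .htaccess sitting in the site root, where the path being matched has no leading slash.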

franklin dematto

6:11 am on Apr 3, 2003 (gmt 0)

10+ Year Member



Numerous UAs have been tagged in this thread. Could someone classify them? Many people only want to block one category or another. For instance, I want to block e-mail harvesters, but don't mind if people download my site for offline viewing or even archiving. And I want to avoid false positives.
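
As a rough starting point, the e-mail-related patterns that pmkpmk tagged "# spambot" above could be pulled into a block of their own. This is only a sketch: the selection relies entirely on those tags, none of these user agents has been verified here, and false positives are still possible.

```apache
RewriteEngine On

# Only harvester patterns marked "# spambot" in pmkpmk's list;
# the offline downloaders (tagged "OD") are deliberately left alone.
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC]
# Same exemption for error pages and robots.txt as in the full list.
RewriteRule !err_|robots\.txt - [F,L]
```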

ratboy

8:59 pm on Apr 9, 2003 (gmt 0)



This is really useful info. I've been wondering how much energy to put into blocking spiders, and this has given me enough to make an educated move. Too bad it is so ridiculously easy for spider programmers to bypass this .htaccess method, but it seems these techniques will help at least over the short term. Thanks especially to toolman and superman; you guys really put out some good stuff that saves us all a lot of work and many hours of pointless trial and error.
This 243 message thread spans 25 pages.