|
Default User Agents Discussion
|
wilderness
#:3626412
| 2:15 pm on April 14, 2008 (utc 0) |
System: The following message was cut out of thread at: http://www.webmasterworld.com/search_engine_spiders/3626003.htm by incredibill - 11:32 am on April 14, 2008 (PST -8)
You fellas have done a fine job of inclusion. Larbin is another, FrontPage as well. With a little broadening of the category, you might also consider the numerous link checkers.
|
incrediBILL
#:3626670
| 7:37 pm on April 14, 2008 (utc 0) |
Don, Larbin doesn't qualify for this particular information thread because it's an actual crawler itself, not a programming library or command-line tool used to make crawlers.
| Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network). |
| We may do another thread later about open-source and commercially available crawlers: Larbin, Nutch, Heritrix, etc. on the open-source side, and the Google appliance, a bunch of offline readers, and other stuff on the commercial side.
|
incrediBILL
#:3627968
| 8:39 am on April 16, 2008 (utc 0) |
Can anyone think of any other default UAs? I'm drawing a blank now...
|
Mokita
#:3628073
| 11:23 am on April 16, 2008 (utc 0) |
| Can anyone think of any other default UAs? |
| Please define a "default UA".
|
Hobbs
#:3628129
| 1:22 pm on April 16, 2008 (utc 0) |
| Please define a "default UA" |
| That would be the user agent appearing in your logs from scripts (or development libraries) that crawl your pages with the user agent left at its default setting, either by sloppy scrapers that you will block in .htaccess :-) or by hard-working folks developing useful web applications that you will be blocking too. :-))
|
Mokita
#:3628143
| 1:32 pm on April 16, 2008 (utc 0) |
Like "grub" or "nutch"?
|
incrediBILL
#:3628502
| 7:01 pm on April 16, 2008 (utc 0) |
Those will fit the next thread about default User Agents for off-the-shelf crawlers. At the moment we're just looking to compile a list of default user agents of libraries and command line tools commonly used in spiders that don't identify themselves.
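For anyone who wants to see how such a list might be put to work against log data, here is a minimal Python sketch. It is only an illustration under stated assumptions: the prefixes are a few well-known library and tool defaults, not the list compiled in the original thread, and the names `DEFAULT_UA_PREFIXES`, `is_default_ua`, and `count_default_uas` are hypothetical.

```python
from collections import Counter

# Assumption: a handful of well-known default User-Agent prefixes sent by
# libraries and command-line tools. Illustrative only, not the thread's list.
DEFAULT_UA_PREFIXES = (
    "Python-urllib/",
    "libwww-perl/",
    "Wget/",
    "curl/",
    "Java/",
)


def is_default_ua(user_agent):
    """Return True if the User-Agent starts with one of the listed defaults."""
    ua = user_agent.strip().lower()
    return any(ua.startswith(prefix.lower()) for prefix in DEFAULT_UA_PREFIXES)


def count_default_uas(user_agents):
    """Tally how often each default User-Agent string appears."""
    return Counter(ua for ua in user_agents if is_default_ua(ua))


if __name__ == "__main__":
    # Sample strings standing in for the User-Agent field of a log file.
    sample = [
        "Wget/1.10.2",
        "libwww-perl/5.805",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)",
        "Wget/1.10.2",
    ]
    for ua, hits in count_default_uas(sample).most_common():
        print(hits, ua)
```

Matching on the start of the string keeps the check cheap, and it avoids flagging browsers that merely mention one of these names somewhere later in their user agent.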
|