Robot Rules Parser

M-Software.de - Robot Rules Parser online

Prompted by a small problem with my robots.txt, I wrote a small test program that lets me check the file. Since I did not want to build everything myself, I looked at the different ways robots.txt files are processed. Among them were the parsers in wget, htdig and nutch, which are all open source, so their source code is available.

The purpose of robots.txt is to tell a crawler which pages of a domain it must not request. A typical robots.txt file looks like this:
# robots.txt for http://m-software.de/

User-agent: Bad Spider
Disallow: /

User-agent: *
Disallow: /intern1/
Disallow: /intern2/
Disallow: /rss.php
This robots.txt forbids all crawlers (*) from accessing the directories intern1 and intern2 and the file rss.php in the root directory. In addition, the crawler that goes by the name "Bad Spider" is denied the root directory, and with it the entire site. "Bad Spider" is a placeholder; replace it with the name of the spider that should not access the domain. To avoid mistakes in your robots.txt, you can test it with the small Java program loaded in the IFRAME.
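The example rules can also be checked offline. The following is a minimal sketch using Python's standard urllib.robotparser module, not the Java program from this page; the crawler name "ExampleBot" and the helper load_rules are made up for illustration, and the rules are parsed from a string rather than fetched from the server.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this page, embedded as a string.
ROBOTS_TXT = """\
# robots.txt for http://m-software.de/

User-agent: Bad Spider
Disallow: /

User-agent: *
Disallow: /intern1/
Disallow: /intern2/
Disallow: /rss.php
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
BASE = "http://m-software.de"

# "ExampleBot" is a hypothetical crawler that only matches the * group.
print(rules.can_fetch("ExampleBot", BASE + "/index.html"))      # True
print(rules.can_fetch("ExampleBot", BASE + "/intern1/a.html"))  # False
print(rules.can_fetch("ExampleBot", BASE + "/rss.php"))         # False

# "Bad Spider" matches its own group and is denied everything.
print(rules.can_fetch("Bad Spider", BASE + "/index.html"))      # False
```

Other parsers may match user-agent names differently, so results from this sketch are only indicative of how one particular implementation reads the file.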

Have fun with it. Of course, everyone is welcome to display the IFRAME on their own site.
<IFRAME SRC="http://service.m-software.de/robots/" WIDTH="467" HEIGHT="151" scrolling="no" frameborder="0"></IFRAME>
PS: Many spiders honor robots.txt, but you should not rely on it. Here is a list of suspicious robots.
User-Agent: ActiveAgent
User-Agent: Alexibot
User-Agent: Aqua_Products
User-Agent: AskJeeves
User-Agent: BackDoorBot
User-Agent: BackDoorBot 1.0
User-Agent: BackDoorBot/1.0
User-Agent: BackWeb
User-Agent: BecomeBot
User-Agent: Black Hole
User-Agent: BlackWidow
User-Agent: BlowFish
User-Agent: BlowFish 1.0
User-Agent: BlowFish/1.0
User-Agent: Bookmark search tool
User-Agent: BotALot
User-Agent: BotRightHere
User-Agent: BuiltBotTough
User-Agent: Bullseye
User-Agent: Bullseye/1.0
User-Agent: BunnySlippers
User-Agent: Cegbfeieh
User-Agent: CheeseBot
User-Agent: CherryPicker
User-Agent: CherryPicker /1.0
User-Agent: CherryPicker 1.0
User-Agent: CherryPickerElite 1.0
User-Agent: CherryPickerElite/1.0
User-Agent: CherryPickerSE 1.0
User-Agent: CherryPickerSE/1.0
User-Agent: ChinaClaw
User-Agent: Collector
User-Agent: Copernic
User-Agent: Copier
User-Agent: CopyRightCheck
User-Agent: Crescent
User-Agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-Agent: Crescent Internet ToolPak HTTPOLE Control v.1.0
User-Agent: DISCo
User-Agent: DISCo Pump
User-Agent: DISCo Pump 3.1
User-Agent: DittoSpyder
User-Agent: Download Demon
User-Agent: Download Wonder
User-Agent: Downloader
User-Agent: Drip
User-Agent: EirGrabber
User-Agent: EmailCollector
User-Agent: EmailCollector 1.0
User-Agent: EmailSiphon
User-Agent: EmailWolf
User-Agent: EmailWolf 1.00
User-Agent: Enterprise_Search
User-Agent: Enterprise_Search/1.0
User-Agent: EroCrawler
User-Agent: Express WebPictures
User-Agent: ExtractorPro
User-Agent: EyeNetIE
User-Agent: FairAd Client
User-Agent: FileHound
User-Agent: Flaming AttackBot
User-Agent: FlashGet
User-Agent: Foobot
User-Agent: FreeFind
User-Agent: Gaisbot
User-Agent: GetRight
User-Agent: GetRight/4.2
User-Agent: GetSmart
User-Agent: GetWeb!
User-Agent: Go!Zilla
User-Agent: Go-Ahead-Got-It
User-Agent: Googlebot-Image
User-Agent: GrabNet
User-Agent: Grabber
User-Agent: Grafula
User-Agent: HLoader
User-Agent: HMView
User-Agent: HTTrack
User-Agent: Harvest
User-Agent: Harvest 1.5
User-Agent: Harvest/1.5
User-Agent: Hatena Antenna
User-Agent: Image Stripper
User-Agent: Image Sucker
User-Agent: Indy Library
User-Agent: InfoNaviRobot
User-Agent: InterGET
User-Agent: Internet Ninja
User-Agent: Iria
User-Agent: Iron33
User-Agent: Iron33/1.0.2
User-Agent: JOC
User-Agent: JOC Web Spider
User-Agent: Jeeves
User-Agent: JennyBot
User-Agent: JetCar
User-Agent: Jetbot
User-Agent: Jetbot/1.0
User-Agent: JustView
User-Agent: Kenjin Spider
User-Agent: Keyword Density
User-Agent: Keyword Density/0.9
User-Agent: LNSpiderguy
User-Agent: LexiBot
User-Agent: LinkScan
User-Agent: LinkScan/8.1a Unix
User-Agent: LinkWalker
User-Agent: LinkextractorPro
User-Agent: MIDown tool
User-Agent: MIIxpc
User-Agent: MIIxpc/4.2
User-Agent: MSIECrawler
User-Agent: Mag-Net
User-Agent: Magnet
User-Agent: Mass Downloader
User-Agent: Mata Hari
User-Agent: Memo
User-Agent: Microsoft URL Control
User-Agent: Microsoft URL Control - 5.01.4511
User-Agent: Microsoft URL Control - 6.00.8169
User-Agent: Mirror
User-Agent: Mister PiX
User-Agent: Mozilla
User-Agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 2000)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 9
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows ME)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows XP)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; AIRF)
User-Agent: NICErsPRO
User-Agent: NPBot
User-Agent: Navroad
User-Agent: NearSite
User-Agent: Net Vampire
User-Agent: NetAnts
User-Agent: NetMechanic
User-Agent: NetSpider
User-Agent: NetZIP
User-Agent: Ninja
User-Agent: Nutch
User-Agent: Octopus
User-Agent: Offline Explorer
User-Agent: Offline Navigator
User-Agent: OmniExplorer_Bot
User-Agent: Openbot
User-Agent: Openfind
User-Agent: Openfind data gathere
User-Agent: Openfind data gatherer
User-Agent: Oracle Ultra Search
User-Agent: PageGrabber
User-Agent: Papa Foto
User-Agent: PerMan
User-Agent: ProPowerBot
User-Agent: ProPowerBot/2.14
User-Agent: ProWebWalker
User-Agent: Pump
User-Agent: Python-urllib
User-Agent: QueryN Metasearch
User-Agent: RMA
User-Agent: Radiation
User-Agent: Radiation Retriever
User-Agent: Radiation Retriever 1.1
User-Agent: ReGet
User-Agent: RealDownload
User-Agent: Reaper
User-Agent: Recorder
User-Agent: RepoMonkey
User-Agent: RepoMonkey Bait & Tackle/v1.01
User-Agent: Roverbot
User-Agent: Siphon
User-Agent: SiteSnagger
User-Agent: SmartDownload
User-Agent: Snake
User-Agent: SpaceBison
User-Agent: SpankBot
User-Agent: Stanford
User-Agent: Stanford Comp Sci
User-Agent: Sucker
User-Agent: SuperBot
User-Agent: SuperHTTP
User-Agent: Surfbot
User-Agent: Szukacz
User-Agent: Szukacz/1.4
User-Agent: Teleport
User-Agent: Teleport Pro
User-Agent: Teleport Pro/1.29.1590
User-Agent: Teleport Pro/1.29.1616
User-Agent: Teleport Pro/1.29.1632
User-Agent: Teleport Pro/1.29.1718
User-Agent: TeleportPro
User-Agent: Telesoft
User-Agent: Teoma
User-Agent: The Intraformant
User-Agent: TheNomad
User-Agent: TightTwatBot
User-Agent: Titan
User-Agent: True_Robot
User-Agent: True_Robot/1.0
User-Agent: URL Control
User-Agent: URL_Spider_Pro
User-Agent: URLy Warning
User-Agent: VCI
User-Agent: VCI WebViewer VCI WebViewer Win32
User-Agent: Vacuum
User-Agent: VoidEYE
User-Agent: WWW-Collector
User-Agent: WWW-Collector-E
User-Agent: WWWOFFLE
User-Agent: WX_mail
User-Agent: Web Image Collector
User-Agent: Web Sucker
User-Agent: WebAuto
User-Agent: WebBandit
User-Agent: WebBandit 2.1
User-Agent: WebBandit 3.50
User-Agent: WebBandit/3.50
User-Agent: WebCapture 2.0
User-Agent: WebCopier
User-Agent: WebCopier v.2.2
User-Agent: WebCopier v3.2a
User-Agent: WebEMailExtrac.
User-Agent: WebEMailExtractor 1.0B
User-Agent: WebEnhancer
User-Agent: WebFetch
User-Agent: WebGo IS
User-Agent: WebLeacher
User-Agent: WebReaper
User-Agent: WebSauger
User-Agent: WebStripper
User-Agent: WebVac
User-Agent: WebWhacker
User-Agent: WebZIP
User-Agent: WebZIP/4.21
User-Agent: WebZIP/5.0
User-Agent: WebZip
User-Agent: WebZip/4.0
User-Agent: WebmasterWorld
User-Agent: WebmasterWorld Extractor
User-Agent: WebmasterWorldForumBot
User-Agent: Website
User-Agent: Website Quester
User-Agent: Website eXtractor
User-Agent: Webster
User-Agent: Webster Pro
User-Agent: Wget
User-Agent: Wget/1.5.3
User-Agent: Wget/1.6
User-Agent: Whacker
User-Agent: WhoWhere
User-Agent: Widow
User-Agent: Xaldon
User-Agent: Xaldon/WebSpider
User-Agent: Xenu's
User-Agent: Xenu's Link Sleuth 1.1c
User-Agent: Zeus
User-Agent: Zeus 32297 Webster Pro V2.9 Win32
User-Agent: Zeus Link Scout
User-Agent: aconon Index
User-Agent: asterias
User-Agent: autoemailspider
User-Agent: b2w
User-Agent: b2w 0.1
User-Agent: b2w/0.1
User-Agent: cosmos
User-Agent: dloader(naverrobot)/1.0
User-Agent: dumbot
User-Agent: eCatch
User-Agent: emailcollector
User-Agent: es
User-Agent: gotit
User-Agent: grub
User-Agent: grub-client
User-Agent: hloader
User-Agent: httplib
User-Agent: humanlinks
User-Agent: ia_archiver
User-Agent: ia_archiver/1.6
User-Agent: larbin
User-Agent: lftp
User-Agent: libWeb
User-Agent: libWeb/clsHTTP
User-Agent: likse
User-Agent: looksmart
User-Agent: lwp-trivial
User-Agent: lwp-trivial/1.34
User-Agent: moget
User-Agent: moget/2.1
User-Agent: mozilla
User-Agent: mozilla/3
User-Agent: mozilla/4
User-Agent: mozilla/5
User-Agent: naver
User-Agent: pavuk
User-Agent: pcBrowser
User-Agent: psbot
User-Agent: scooter
User-Agent: searchpreview
User-Agent: sootle
User-Agent: spanner
User-Agent: suzuran
User-Agent: tAkeOut
User-Agent: toCrawl/UrlDispatcher
User-Agent: turingos
User-Agent: webbandit 4.00.0
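
To actually deny these robots, the User-Agent lines need a rule after them, as in the "Bad Spider" example. The sketch below assumes that a single "Disallow: /" for the whole group is what is wanted; the helper build_block_group and the crawler name "ExampleBot" are hypothetical, and the result is checked with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def build_block_group(agent_names):
    """Return one robots.txt group that denies the whole site to the given agents."""
    lines = [f"User-agent: {name.strip()}" for name in agent_names if name.strip()]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# A few names taken from the list; a real file would include all of them.
group = build_block_group(["HTTrack", "WebZIP", "EmailSiphon"])
print(group)

parser = RobotFileParser()
parser.parse(group.splitlines())

print(parser.can_fetch("HTTrack", "/index.html"))     # False
# "ExampleBot" is a made-up name that is not in the group.
print(parser.can_fetch("ExampleBot", "/index.html"))  # True
```

As noted in the PS, such a group only keeps out robots that choose to honor robots.txt.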