

Project description

About CrawlerDetect

This is a Python wrapper for CrawlerDetect, a web crawler detection library. It helps identify bots, crawlers, and spiders by analyzing the user agent and other HTTP headers. Currently, it can detect over 3,678 bots, spiders, and crawlers.

How to install

$ pip install crawlerdetect

How to use

Method Reference

camelCase      snake_case      Description
isCrawler()    is_crawler()    Check if the user agent is a crawler
getMatches()   get_matches()   Get the name of the detected crawler

Variant 1

from crawlerdetect import CrawlerDetect
crawler_detect = CrawlerDetect()
crawler_detect.isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')
# True if a crawler user agent is detected

Variant 2

from crawlerdetect import CrawlerDetect
crawler_detect = CrawlerDetect(user_agent='Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)')
crawler_detect.isCrawler()
# True if a crawler user agent is detected

Variant 3

from crawlerdetect import CrawlerDetect
crawler_detect = CrawlerDetect(headers={
    'DOCUMENT_ROOT': '/home/test/public_html',
    'GATEWAY_INTERFACE': 'CGI/1.1',
    'HTTP_ACCEPT': '*/*',
    'HTTP_ACCEPT_ENCODING': 'gzip, deflate',
    'HTTP_CACHE_CONTROL': 'no-cache',
    'HTTP_CONNECTION': 'Keep-Alive',
    'HTTP_FROM': 'googlebot(at)googlebot.com',
    'HTTP_HOST': 'www.test.com',
    'HTTP_PRAGMA': 'no-cache',
    'HTTP_USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'PATH': '/bin:/usr/bin',
    'QUERY_STRING': 'order=closingDate',
    'REDIRECT_STATUS': '200',
    'REMOTE_ADDR': '127.0.0.1',
    'REMOTE_PORT': '3360',
    'REQUEST_METHOD': 'GET',
    'REQUEST_URI': '/?test=testing',
    'SCRIPT_FILENAME': '/home/test/public_html/index.php',
    'SCRIPT_NAME': '/index.php',
    'SERVER_ADDR': '127.0.0.1',
    'SERVER_ADMIN': 'webmaster@test.com',
    'SERVER_NAME': 'www.test.com',
    'SERVER_PORT': '80',
    'SERVER_PROTOCOL': 'HTTP/1.1',
    'SERVER_SIGNATURE': '',
    'SERVER_SOFTWARE': 'Apache',
    'UNIQUE_ID': 'Vx6MENRxerBUSDEQgFLAAAAAS',
    'PHP_SELF': '/index.php',
    'REQUEST_TIME_FLOAT': 1461619728.0705,
    'REQUEST_TIME': 1461619728,
})
crawler_detect.isCrawler()
# True if a crawler user agent is detected

Output the name of the bot that matched (if any)

from crawlerdetect import CrawlerDetect
crawler_detect = CrawlerDetect()
crawler_detect.isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')
# True if a crawler user agent is detected
crawler_detect.getMatches()
# 'Sosospider'

Get version of the library

import crawlerdetect
crawlerdetect.__version__
# e.g. '0.3.2'

Contributing

The patterns and test cases are synced from the PHP repo. If you find a bot/spider/crawler user agent that crawlerdetect fails to detect, please submit a pull request with the regex pattern and a test case to the upstream PHP repo.

Failing that, just create an issue with the user agent you have found, and we'll take it from there :)

Development

Setup

$ poetry install

Running tests

$ poetry run pytest

Update crawlers from upstream PHP repo

$ ./update_data.sh

Bump version

$ poetry run bump-my-version bump [patch|minor|major]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crawlerdetect-0.3.2.tar.gz (16.8 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crawlerdetect-0.3.2-py3-none-any.whl (16.5 kB)

Uploaded Python 3

File details

Details for the file crawlerdetect-0.3.2.tar.gz.

File metadata

  • Download URL: crawlerdetect-0.3.2.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.10 Darwin/24.5.0

File hashes

Hashes for crawlerdetect-0.3.2.tar.gz
Algorithm Hash digest
SHA256 1c2f9ccbb786c756c4f5bce62503ac0792b88b0291df6dbd5633f3e9c8a7f432
MD5 0815a39e0e51686c1feb1ed2ac692c26
BLAKE2b-256 f697f33c16f3ececdfb98582ef0559aee745bcc9fe00d9873fbdb74ad06d7b8b

See more details on using hashes here.

File details

Details for the file crawlerdetect-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: crawlerdetect-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.12.10 Darwin/24.5.0

File hashes

Hashes for crawlerdetect-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 42e53a1fca1f99fc9459d5c699300e94e02d0a23d1cec4750fe37ce61e6e8441
MD5 53eb94baa5755d4b7d0997b8eaf34f31
BLAKE2b-256 e6471be5b2bc4ce8ab32e592817946017efb89f65957c0d17aa346b97c023f18

See more details on using hashes here.
