is-crawler

Crawler detection from User-Agent strings in 40 ns. Zero deps, no regex, ReDoS-safe.

pip install is-crawler

from is_crawler import is_crawler

is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)")  # True
is_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0")    # False

One call, runs on every request without blinking.

\(°o°)/   caught one!
 /| |\

Why

Crawler detection sits on the request hot path. Most libraries reach for big regex tables, which means slow first hits, ReDoS exposure on hostile UAs, and millisecond-scale latency you pay forever.

is_crawler runs str.find and small char scans against curated keywords. No backtracking, no DB load, no network. The optional crawler_info adds DB lookups when you want classification. Everything else (FCrDNS, IP ranges, robots.txt, middleware) is opt-in.

is-crawler  ▏                                                  0.04 µs
cua         ████████████████████████████████████████████████  66.00 µs
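| Feature | is-crawler | crawler-user-agents | ua-parser |
|---|---|---|---|
| Hot-path regex | no | yes | yes |
| ReDoS-safe | yes | no | no |
| FCrDNS verify | yes | no | no |
| IP range lookup | yes | no | no |
| WSGI/ASGI middleware | yes | no | no |
| Warm is_crawler | 0.04 µs | 66 µs | n/a |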

In the wild

What the API returns on real UAs you will actually see:

| User agent | is_crawler | crawler_name | crawler_version | crawler_url | crawler_signals | crawler_info.tags |
|---|---|---|---|---|---|---|
| Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot) | True | GPTBot | '1.2' | 'https://openai.com/gptbot' | ['bot_signal', 'bare_compatible', 'url_in_ua'] | ('ai-crawler',) |
| ChatGPT-User/1.0 | True | ChatGPT-User | '1.0' | None | ['bot_signal', 'no_browser_signature'] | ('ai-fetcher',) |
| Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36 | True | HeadlessChrome | '120.0.0.0' | None | ['bot_signal'] | ('browser-automation',) |
| curl/8.4.0 | True | curl | '8.4.0' | None | ['no_browser_signature'] | ('http-library',) |
| python-requests/2.31.0 | True | python-requests | '2.31.0' | None | ['no_browser_signature'] | ('http-library',) |
| Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) | True | AhrefsBot | '7.0' | 'http://ahrefs.com/robot/' | ['bot_signal', 'bare_compatible', 'url_in_ua'] | ('seo',) |
| facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | True | facebookexternalhit | '1.1' | 'http://www.facebook.com/externalhit_uatext.php' | ['bot_signal', 'no_browser_signature', 'url_in_ua'] | ('social-preview',) |
| Mozilla/5.0 (compatible; Nikto/2.5.0) | True | Nikto | '2.5.0' | None | ['bare_compatible', 'known_tool'] | ('scanner',) |
| Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 | False | None | None | None | [] | None |

Detection

from is_crawler import (
    is_crawler, crawler_signals, crawler_match, crawler_matches,
    crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, crawler_contact,
)

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"

is_crawler(ua)         # True
crawler_name(ua)       # 'Googlebot'
crawler_version(ua)    # '2.1'
crawler_url(ua)        # 'http://www.google.com/bot.html'
crawler_signals(ua)    # ['bot_signal', 'no_browser_signature', 'url_in_ua']
crawler_match(ua)      # 'bot'
crawler_matches(ua)    # ['bot', '+http://']

ua2 = "MyBot/1.0 (contact: bot@example.com)"
crawler_contact(ua2)   # 'bot@example.com'
crawler_contact(ua)    # None

is_crawler short-circuits on three rules: positive bot signal (keywords like bot/crawl/spider, known tools, embedded URL/email), missing browser signature (no Mozilla/, WebKit, OS token, etc.), or a bare (compatible; ...) block.

crawler_signals exposes which rules fired, for logging and diagnostics. crawler_match / crawler_matches return the original-case substring(s) that fired the bot signal ('GPTBot', '+http://', an embedded contact email), or None / [] when none did.
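
The three rules can be pictured with a simplified sketch. The keyword tables and the looks_like_crawler name below are illustrative assumptions, not the library's curated lists or its actual code; the point is only that every check is a plain substring search.

```python
# Hypothetical, heavily simplified tables; the library's curated lists are larger.
BOT_KEYWORDS = ("bot", "crawl", "spider", "+http://", "+https://")
OS_TOKENS = ("Windows NT", "Linux", "Mac OS X", "Android")
BROWSER_MARKERS = ("Mozilla/", "WebKit", "Gecko") + OS_TOKENS


def looks_like_crawler(ua: str) -> bool:
    """Return True as soon as one of the three rules fires (sketch only)."""
    lowered = ua.lower()
    # Rule 1: positive bot signal, found with str.find (no regex, no backtracking).
    if any(lowered.find(keyword) != -1 for keyword in BOT_KEYWORDS):
        return True
    # Rule 2: no browser signature anywhere in the string.
    if not any(ua.find(marker) != -1 for marker in BROWSER_MARKERS):
        return True
    # Rule 3: a "(compatible; ...)" block that carries no OS token.
    has_compatible = ua.find("(compatible;") != -1
    has_os = any(ua.find(token) != -1 for token in OS_TOKENS)
    return has_compatible and not has_os
```

Each rule returns early, so a typical crawler UA costs one or two substring scans.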

Classification

crawler_info matches against ~2900 curated patterns from tn3w/Crawlerdex, downloaded fresh on each release. Patterns compile lazily in 48-entry chunks.
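
A minimal sketch of what lazy, chunked compilation can look like. The pattern list, compiled_chunk and first_match are hypothetical stand-ins; only the 48-entry chunk size comes from the text above.

```python
import re
from functools import lru_cache

CHUNK_SIZE = 48  # chunk size quoted above

# Hypothetical stand-in patterns; the real list comes from the bundled DB.
PATTERNS = [f"ExampleBot{i:03d}" for i in range(100)]


@lru_cache(maxsize=None)
def compiled_chunk(index: int) -> re.Pattern:
    """Compile one chunk into a single alternation, on first use only."""
    chunk = PATTERNS[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]
    return re.compile("|".join(f"(?:{pattern})" for pattern in chunk))


def first_match(ua: str):
    """Scan chunks in order; chunks after the first hit are never compiled."""
    chunk_count = -(-len(PATTERNS) // CHUNK_SIZE)  # ceiling division
    for index in range(chunk_count):
        match = compiled_chunk(index).search(ua)
        if match:
            return match.group(0)
    return None
```

With this layout a UA that matches an early pattern pays for compiling one chunk, not the whole table.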

info = crawler_info(ua)
info.url            # 'http://www.google.com/bot.html'
info.description    # "Google's main web crawling bot..."
info.tags           # ('search-engine',)

crawler_has_tag(ua, "search-engine")        # True
crawler_has_tag(ua, ["ai-crawler", "seo"])  # False

Tags: search-engine, ai-crawler, ai-fetcher, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.

One-tag wrappers exist for each: is_search_engine, is_ai_crawler, is_ai_fetcher, is_seo, is_social_preview, is_advertising, is_archiver, is_feed_reader, is_monitoring, is_scanner, is_academic, is_http_library, is_browser_automation.

Quick gates:

is_good_crawler(ua)   # search-engine, social-preview, feed-reader, archiver, academic
is_bad_crawler(ua)    # ai-crawler, scanner, http-library, browser-automation, seo

ai-fetcher, advertising, and monitoring are policy-dependent and belong to neither group.
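
The two groups can be modelled as plain tag sets. This is a sketch under the assumption that a gate passes when any tag falls in its set; the helper names are hypothetical and the functions take a tags tuple rather than a UA.

```python
# Tag groups copied from the comments above; the helpers are illustrative only.
GOOD_TAGS = frozenset({"search-engine", "social-preview", "feed-reader", "archiver", "academic"})
BAD_TAGS = frozenset({"ai-crawler", "scanner", "http-library", "browser-automation", "seo"})


def has_good_tag(tags) -> bool:
    """True if at least one tag belongs to the good group."""
    return not GOOD_TAGS.isdisjoint(tags)


def has_bad_tag(tags) -> bool:
    """True if at least one tag belongs to the bad group."""
    return not BAD_TAGS.isdisjoint(tags)
```

A UA tagged only ai-fetcher, advertising, or monitoring passes neither gate, which leaves the decision to your own policy.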

Custom patterns

Register runtime patterns to tag internal scrapers or override false positives. Custom patterns are matched before the bundled DB, so they win on conflict. The optimized lookup (a combined regex) rebuilds on every change; when no custom patterns exist, the hot path pays only a single is None check.

from is_crawler import register_crawler, unregister_crawler, crawler_info, crawler_has_tag

register_crawler("internal", "InternalScraper", tags=["internal"], rdns=[".corp.example"])

crawler_has_tag("InternalScraper/1.0", "internal")  # True (even though not bot-like)
crawler_info("Googlebot/2.1")                        # unchanged, DB still wins on miss

register_crawler("not-a-bot", "Googlebot", tags=[])  # override a false-positive
unregister_crawler("internal")                       # True if it existed

register_crawler(name, pattern, *, url, description, tags, rdns): name is the registry key and pattern is a regex searched against the UA. rdns feeds straight into verify_crawler_ip, so custom crawlers get FCrDNS verification too.

custom_crawlers(*entries) is a context manager for per-test overrides; it restores the prior registry on exit. clear_custom_crawlers() removes all custom entries.

from is_crawler import custom_crawlers

with custom_crawlers({"name": "test", "pattern": "MyTestBot", "tags": ["test"]}):
    assert crawler_has_tag("MyTestBot/1.0", "test")
# registry restored here

IP verification

Two strategies, use either or both. socket only, no deps.

from is_crawler.ip import (
    verify_crawler_ip, reverse_dns, forward_confirmed_rdns,
    ip_in_range, known_crawler_ip, known_crawler_rdns,
)

verify_crawler_ip("Googlebot/2.1", "66.249.66.1")  # True (FCrDNS, UA-name matched)
verify_crawler_ip("Googlebot/2.1", "8.8.8.8")      # False (spoof)

ip_in_range("66.249.66.1")        # True (CIDR lookup, offline)
known_crawler_rdns("66.249.66.1") # True (rDNS suffix matches any known crawler)

reverse_dns("8.8.8.8")                                      # 'dns.google'
forward_confirmed_rdns("66.249.66.1", (".googlebot.com",))  # hostname or None

verify_crawler_ip does the full FCrDNS dance: rDNS lookup, suffix check against the UA's vendor, forward lookup, IP match. Catches UA spoofing.
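
The four steps can be sketched with the standard socket module. This is an illustration, not the library's implementation: fcrdns is a hypothetical name, and the reverse/forward parameters exist only so the resolvers can be swapped out in tests.

```python
import socket


def fcrdns(ip, suffixes, reverse=None, forward=None):
    """Return the confirmed hostname for ip, or None (forward-confirmed reverse DNS sketch)."""
    # Default resolvers use the stdlib; both raise OSError subclasses on failure.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)})
    try:
        host = reverse(ip).rstrip(".").lower()       # step 1: rDNS lookup
        if not host.endswith(tuple(suffixes)):        # step 2: suffix check
            return None
        addresses = forward(host)                     # step 3: forward lookup
        return host if ip in addresses else None      # step 4: IP match
    except OSError:
        return None
```

The comparison is a plain string match, so this sketch assumes the IP is passed in the same textual form the resolver returns (relevant mostly for IPv6).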

ip_in_range runs a bisect over collapsed CIDRs from 39 official sources (Google, Bing, OpenAI, Anthropic, Cloudflare, AWS, ...). Cheap and offline.
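
A bisect over collapsed CIDRs can be sketched with the stdlib ipaddress and bisect modules. build_index and in_ranges are hypothetical names, the CIDRs in the usage note are example data, and the sketch handles IPv4 only.

```python
import bisect
import ipaddress


def build_index(cidrs):
    """Collapse overlapping networks and return sorted (starts, ends) integer lists."""
    networks = ipaddress.collapse_addresses(ipaddress.ip_network(cidr) for cidr in cidrs)
    ranges = sorted((int(net.network_address), int(net.broadcast_address)) for net in networks)
    starts = [start for start, _ in ranges]
    ends = [end for _, end in ranges]
    return starts, ends


def in_ranges(ip, index):
    """Binary-search the range whose start is <= ip, then check its end."""
    starts, ends = index
    value = int(ipaddress.ip_address(ip))
    position = bisect.bisect_right(starts, value) - 1
    return position >= 0 and value <= ends[position]
```

For example, build_index(["66.249.64.0/20", "66.249.64.0/19", "192.0.2.0/24"]) collapses to two ranges. Collapsing first guarantees the ranges do not overlap, which is what makes a single bisect sufficient.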

Middleware

Drop-in for any WSGI or ASGI app. Zero deps.

from is_crawler.contrib import WSGICrawlerMiddleware, ASGICrawlerMiddleware

app = WSGICrawlerMiddleware(app)                                  # Flask, Django
app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler")  # FastAPI, Starlette

# Flask:    request.environ["is_crawler"].is_crawler
# Django:   request.META["is_crawler"].name
# FastAPI:  request.scope["is_crawler"].verified

Both attach a CrawlerMiddlewareResult with user_agent, ip, is_crawler, name, verified, in_ip_range, rdns_match.

Flags: block, block_tags, verify_ip, check_ip_range, check_rdns, trust_forwarded. A positive in_ip_range or rdns_match forces is_crawler=True, which catches UA-less crawlers. With trust_forwarded=True, IP comes from Forwarded, then X-Forwarded-For, then X-Real-IP, then the direct client.
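
The header precedence described for trust_forwarded can be sketched as follows. client_ip is a hypothetical helper, headers is assumed to be a dict with lowercase names, and taking the first hop of each header is an assumption of this sketch, not a documented behaviour.

```python
def _strip_port(value: str) -> str:
    """Drop quotes, IPv6 brackets, and a trailing port from an address token."""
    value = value.strip().strip('"')
    if value.startswith("["):                 # bracketed IPv6, optionally with a port
        return value[1:value.find("]")]
    if value.count(":") == 1:                 # IPv4 with a port
        return value.split(":")[0]
    return value


def client_ip(headers: dict, direct_ip: str) -> str:
    """Pick the client IP: Forwarded, then X-Forwarded-For, then X-Real-IP, then the peer."""
    first_hop = headers.get("forwarded", "").split(",")[0]
    for pair in first_hop.split(";"):
        key, _, value = pair.strip().partition("=")
        if key.lower() == "for" and value:
            return _strip_port(value)
    forwarded_for = headers.get("x-forwarded-for", "")
    if forwarded_for.strip():
        return _strip_port(forwarded_for.split(",")[0])
    real_ip = headers.get("x-real-ip", "")
    if real_ip.strip():
        return _strip_port(real_ip)
    return direct_ip
```

These headers are client-controlled unless a trusted proxy overwrites them, which is why the flag is opt-in.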

Recipes

Block AI scrapers, let search engines through (FastAPI):

from fastapi import FastAPI
from is_crawler.contrib import ASGICrawlerMiddleware

app = FastAPI()
app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler", trust_forwarded=True)

Serve a live robots.txt from the DB (Flask):

from flask import Flask, Response
from is_crawler import build_robots_txt

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    return Response(build_robots_txt(disallow=["ai-crawler", "scanner"]), mimetype="text/plain")

Verify Googlebot is real before trusting it:

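from flask import abort, request  # Flask shown as an example; adapt to your framework
from is_crawler import is_crawler
from is_crawler.ip import verify_crawler_ip

# inside a request handler:
ua = request.headers.get("User-Agent", "")
ip = request.remote_addr
if is_crawler(ua) and not verify_crawler_ip(ua, ip):
    abort(403)  # spoofed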

Crawler share of an access log:

awk -F'"' '{print $6}' access.log | python -m is_crawler | \
  jq -r '.is_crawler' | sort | uniq -c

Snippets

Standalone copy-paste gists in snippets/. No install. Single-file, stdlib only: drop into any project. Includes minimal/full is_crawler, crawler_name, crawler_version, and a compact parse.

robots.txt / ai.txt

Generate directives from tags. Names are extracted from DB patterns, slash/URL-only entries skipped.

from is_crawler import build_robots_txt, build_ai_txt, robots_agents_for_tags

print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
# ...

print(build_ai_txt())          # disallows all ai-crawler agents by default
# User-Agent: GPTBot
# Disallow: /
# ...

robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', ...]

build_robots_txt also accepts a rules list of (path, tags) pairs for per-path control:

build_robots_txt(rules=[("/api", "scanner"), ("/private", "ai-crawler")])
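
To make the output shape concrete, here is a minimal sketch that renders directives for a list of agent names. render_robots is a hypothetical helper and only mirrors the User-agent / Disallow layout shown above; it is not how build_robots_txt is implemented.

```python
def render_robots(agents, paths=("/",)):
    """Emit one User-agent group per agent, each disallowing the given paths."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in paths)
        lines.append("")  # blank line separates groups
    return "\n".join(lines)
```

Feeding it the result of robots_agents_for_tags would be one way to combine the two, assuming that function returns plain agent names as in the example above.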

assert_crawler(ua): like crawler_info but raises ValueError for unknown UAs.

CLI

is-crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"   # detect (default)
tail -f access.log | awk -F'"' '{print $6}' | is-crawler        # stream from stdin
is-crawler parse "Mozilla/5.0 ... Chrome/134.0.0.0 Safari/537.36"
is-crawler verify "Googlebot/2.1" 66.249.66.1                   # FCrDNS spoof check
is-crawler ip 66.249.66.1 8.8.8.8                               # range + rDNS lookup
is-crawler robots --disallow ai-crawler,scanner --path /        # generate robots.txt
is-crawler ai-txt                                               # generate ai.txt

Commands (also runnable as python -m is_crawler):

| Command | Output |
|---|---|
| detect [UA...] | JSON per UA: is_crawler, name, version, url, contact, signals, matches, info |
| parse [UA...] | Full UserAgent parse as JSON (browser, OS, device, ...) |
| verify <UA> <IP> | verified, in_ip_range, rdns, rdns_match |
| ip <IP...> | in_ip_range, rdns, rdns_match per IP |
| robots | robots.txt from --disallow / --allow / --path tags |
| ai-txt | ai.txt disallowing --disallow tags (default ai-crawler) |

detect/parse read UAs from stdin when none are given. Global flags: -p/--pretty (indented JSON), -h/--help, -V/--version. Tags are comma-separated.

UA Parser

parse(ua) returns a UserAgent with all common fields. Zero deps, no regex, 4096-entry LRU cache.

from is_crawler.parser import parse, parse_or_none

ua = parse("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36")

ua.browser          # 'Chrome'
ua.browser_version  # '134.0.0.0'
ua.browser_major    # '134'
ua.os               # 'Windows'
ua.os_version       # '10'
ua.engine           # 'Blink'
ua.engine_version   # '537.36'
ua.device           # 'Desktop'
ua.device_brand     # None
ua.device_model     # None
ua.cpu              # 'x86_64'
ua.is_mobile        # False
ua.is_tablet        # False
ua.is_crawler       # False
ua.is_webview       # False
ua.is_headless      # False
ua.channel          # None | 'beta' | 'dev' | 'canary' | 'nightly'
ua.app              # None | 'Facebook' | 'Instagram' | 'TikTok' ...
ua.app_version      # in-app browser version
ua.languages        # []
ua.rendering        # 'KHTML, like Gecko'
ua.product_token    # 'Mozilla/5.0'
ua.comment          # '(Windows NT 10.0; Win64; x64)'
ua.raw              # original string

ua.to_dict()        # all fields as dict

parse_or_none(value) normalises bytes/None/non-str, returns None for empty input.

Benchmarks

Python 3.14, Linux x86_64. cua = crawler-user-agents v1.47.

Apache Logs 42,512 UA entries (8,942 crawlers, 33,570 browsers, 21% ratio):

| Scenario | is_crawler | crawler_info | cua.is_crawler | cua.crawler_info |
|---|---|---|---|---|
| Warm cache | 0.037 µs | 0.116 µs | 66.234 µs | 1585.007 µs |
| Cold cache | 0.112 µs | 1.008 µs | - | - |

~1790× faster on the hot path, ~13660× faster for crawler_info warm. Full classify of 42,512 Apache log UAs runs in 1.80 ms.
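
The speedup figures follow directly from the warm-cache timings for the Apache log run; a quick check using only the numbers quoted above:

```python
# Warm-cache timings in microseconds, copied from the Apache log benchmark.
warm = {
    "is_crawler": 0.037,
    "crawler_info": 0.116,
    "cua.is_crawler": 66.234,
    "cua.crawler_info": 1585.007,
}

hot_path_speedup = warm["cua.is_crawler"] / warm["is_crawler"]
info_speedup = warm["cua.crawler_info"] / warm["crawler_info"]

print(round(hot_path_speedup))  # 1790
print(round(info_speedup))      # 13664
```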

Fixture UAs 2,149 crawlers + 19,910 browsers:

| Scenario | is_crawler (mixed) | crawler_info | cua.is_crawler (mixed) | cua.crawler_info |
|---|---|---|---|---|
| Warm cache | 0.05 µs | 1.24 µs | 80.95 µs | 563.53 µs |
| Cold cache | 1.37 µs | 4.51 µs | 82.00 µs | 581.76 µs |

UA parser 19,910 real browser UAs vs ua-parser (~24× faster):

| Scenario | parser.parse | ua-parser |
|---|---|---|
| Warm cache | 18.83 µs | 443.20 µs |
| Cold cache | 18.69 µs | 443.05 µs |

IP verification warm cache:

| Function | Time |
|---|---|
| ip_in_range | 0.06 µs |
| reverse_dns | 0.38 µs |
| forward_confirmed_rdns | 2.05 µs |
| known_crawler_rdns | 2.26 µs |
| verify_crawler_ip | 3.02 µs |

Every public function has a 32k-entry LRU cache. First-call rDNS latency is network-bound.
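
The warm/cold gap comes from memoisation. A minimal sketch with functools.lru_cache, assuming "32k" means 32,768 entries; cached_check is a hypothetical stand-in for any of the public functions.

```python
from functools import lru_cache


@lru_cache(maxsize=32_768)  # assumption: "32k" read as 32,768 entries
def cached_check(ua: str) -> bool:
    """Stand-in for an expensive check; the result is memoised per UA string."""
    return "bot" in ua.lower()


cached_check("Googlebot/2.1")   # cold: computed and stored
cached_check("Googlebot/2.1")   # warm: served from the cache
stats = cached_check.cache_info()
```

After the two calls, stats reports one miss and one hit. Real traffic repeats a small set of UA strings, so most requests take the warm path.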

Implementation

is_crawler uses str.find and char scans, never regex, so hostile UAs cannot trigger backtracking. crawler_info does use re, but only against curated upstream patterns that are simple by construction.

Data files live in is_crawler/. crawlers.min.json is downloaded fresh from tn3w/Crawlerdex releases on each publish. IP ranges are built by:

python3 tools/build_ip_ranges.py     # crawler-ip-ranges.json from 39 official sources

Source definitions for IP ranges live in tools/crawler-ip-ranges.json and can be extended without touching the build script.

Development

pip install -e ".[dev]"
ruff format . && ruff check --fix .
npx --yes prettier --write --single-quote --print-width=100 --trailing-comma=es5 --end-of-line=lf "**/*.{md,yml,yaml,html,css,js,ts}" "tools/*.json"

See CONTRIBUTING.md. Report vulnerabilities via GitHub private security advisory, not public issues. See SECURITY.md and CODE_OF_CONDUCT.md.

License

Apache-2.0
