is-crawler

Fast, regex-free crawler detection from user agents. Zero deps, ReDoS-safe heuristics, ~100× faster than alternatives. Includes FCrDNS IP verification for 100+ known crawlers.

Install

pip install is-crawler

Usage

from is_crawler import (
    is_crawler, crawler_signals, crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, CrawlerInfo,
)
from is_crawler.ip import (
    verify_crawler_ip,
    reverse_dns,
    forward_confirmed_rdns,
    ip_in_range,
    known_crawler_ip,
    known_crawler_rdns,
)

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
ip = "66.249.66.1"

is_crawler(ua)                                  # True
crawler_signals(ua)                             # ['bot_signal', 'no_browser_signature', 'url_in_ua']
crawler_name(ua)                                # 'Googlebot'
crawler_version(ua)                             # '2.1'
crawler_url(ua)                                 # 'http://www.google.com/bot.html'
verify_crawler_ip(ua, ip)                       # True - FCrDNS validation
reverse_dns(ip)                                 # 'crawl-66-249-66-1.googlebot.com'
forward_confirmed_rdns(ip, (".googlebot.com",)) # hostname or None
ip_in_range(ip)                                 # True - in known crawler CIDRs
known_crawler_ip(ip)                            # alias for ip_in_range
known_crawler_rdns(ip)                          # True - known crawler via FCrDNS/rDNS

info = crawler_info(ua)                         # CrawlerInfo(...)
if info is not None:
    info.url                                    # 'http://www.google.com/bot.html'
    info.description                            # "Google's main web crawling bot..."
    info.tags                                   # ('search-engine',)

crawler_has_tag(ua, "search-engine")            # True
crawler_has_tag(ua, ["ai-crawler", "seo"])      # False

API

is_crawler(ua: str) -> bool

Heuristic detection. Returns True if the UA is a crawler. No DB lookup, no regex.

Three short-circuit rules:

  1. Positive signal: bot keywords (bot, crawl, spider, scrape, headless, slurp, archiv, preview, ...), known tools (playwright, selenium, wget, lighthouse, sqlmap, nikto, nmap, httrack, pingdom, google-safety, ...), or a URL/email embedded in the UA.
  2. No browser signature: missing Mozilla/, WebKit, Gecko, Trident, Presto, KHTML, Links, Lynx, Opera, or an OS token like (Windows, (Linux, (X11, (Macintosh.
  3. Bare (compatible; ...): classic bot block without OS/browser tokens inside.

crawler_signals(ua: str) -> list[str]

Which individual rules fired. Subset of: bot_signal, no_browser_signature, bare_compatible, known_tool, url_in_ua. Useful for diagnostics and logging. is_crawler does not call this.
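
For quick diagnostics you can log which rules fired for a given UA. A minimal sketch; the comments only indicate which rule each sample string is meant to illustrate, and the exact signal lists may differ between versions:

from is_crawler import is_crawler, crawler_signals

samples = (
    "Googlebot/2.1 (+http://www.google.com/bot.html)",  # bot keyword plus embedded URL
    "Mozilla/5.0 (compatible; ExampleClient/1.0)",      # bare (compatible; ...) block
    "python-requests/2.31.0",                           # no browser signature
)
for ua in samples:
    print(is_crawler(ua), crawler_signals(ua))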

crawler_name(ua: str) -> str | None

Product name extracted from the UA.

  • Googlebot/2.1 ... → 'Googlebot'
  • Mozilla/5.0 (compatible; bingbot/2.0; ...) → 'bingbot'
  • Mozilla/5.0 ... Speedy Spider (...) → 'Speedy Spider'
  • Chrome/Firefox/Safari → None

crawler_version(ua: str) -> str | None

Version token extracted from the UA. Returns None if no non-browser version is detectable.

  • curl/7.64.1 → '7.64.1'
  • Mozilla/5.0 (compatible; Miniflux/2.0.10; ...) → '2.0.10'
  • Googlebot/2.1 ... → '2.1'

crawler_url(ua: str) -> str | None

URL embedded in the UA (after +, ;, or -).

  • Googlebot/2.1 (+http://www.google.com/bot.html) → 'http://www.google.com/bot.html'
  • UA with no embedded URL → None

crawler_info(ua: str) -> CrawlerInfo | None

DB lookup against 1200 known crawler patterns. Returns None for browsers (short-circuits via is_crawler).

class CrawlerInfo(NamedTuple):
    url: str                # crawler's info/docs URL (may be '')
    description: str        # human-readable description
    tags: tuple[str, ...]   # classification tags, e.g. ('search-engine',)

crawler_has_tag(ua: str, tags: str | Iterable[str]) -> bool

True if the crawler has any of the given tags. tags accepts a single string or an iterable of strings.

Available tags: search-engine, ai-crawler, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.
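
As a sketch, the tag check composes into a simple deny list; BLOCKED_TAGS and should_block below are illustrative names, not part of the package:

from is_crawler import crawler_has_tag

BLOCKED_TAGS = ("ai-crawler", "scanner", "browser-automation")

def should_block(ua: str) -> bool:
    # Deny any UA whose DB entry carries one of the blocked tags.
    return crawler_has_tag(ua, BLOCKED_TAGS)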

Category shortcuts

One-tag wrappers over crawler_has_tag:

is_search_engine(ua)       # 'search-engine'
is_ai_crawler(ua)          # 'ai-crawler'
is_seo(ua)                 # 'seo'
is_social_preview(ua)      # 'social-preview'
is_advertising(ua)         # 'advertising'
is_archiver(ua)            # 'archiver'
is_feed_reader(ua)         # 'feed-reader'
is_monitoring(ua)          # 'monitoring'
is_scanner(ua)             # 'scanner'
is_academic(ua)            # 'academic'
is_http_library(ua)        # 'http-library'
is_browser_automation(ua)  # 'browser-automation'

is_good_crawler(ua) / is_bad_crawler(ua)

Opinionated groupings for quick allow/deny gates.

  • Good (indexing, previews, archives, feeds, research): search-engine, social-preview, feed-reader, archiver, academic.
  • Bad (scraping, scanning, unattributed traffic): ai-crawler, scanner, http-library, browser-automation, seo.

advertising and monitoring intentionally belong to neither group; how to treat them is policy-dependent.
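
A minimal sketch of an allow/deny gate built on these groupings (classify is an illustrative helper, not part of the package):

from is_crawler import is_good_crawler, is_bad_crawler

def classify(ua: str) -> str:
    # Check the deny group first so a UA matching both groups errs toward blocking.
    if is_bad_crawler(ua):
        return "deny"
    if is_good_crawler(ua):
        return "allow"
    return "neutral"  # browsers, advertising, monitoring: decide per policy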

Middleware

from is_crawler.contrib import WSGICrawlerMiddleware

app = WSGICrawlerMiddleware(app)

# Flask
request.environ["is_crawler"].is_crawler

# Django
request.META["is_crawler"].name

from is_crawler.contrib import ASGICrawlerMiddleware

app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler")

# FastAPI / Starlette
request.scope["is_crawler"].is_crawler
request.state.crawler.verified

Both middlewares are zero-dep. They attach CrawlerMiddlewareResult with user_agent, ip, is_crawler, name, verified, in_ip_range, rdns_match.

  • WSGICrawlerMiddleware: Flask, Django, any WSGI app
  • ASGICrawlerMiddleware: FastAPI, Starlette, any ASGI app

Optional flags: block=True, block_tags=..., verify_ip=True, check_ip_range=True, check_rdns=True, trust_forwarded=True.

IP flags:

  • verify_ip → strict FCrDNS (rDNS + forward lookup, UA-name matched). Sets verified.
  • check_ip_range → CIDR lookup against shipped ranges. Sets in_ip_range. Cheap, offline.
  • check_rdns → rDNS suffix against any known crawler domain. Sets rdns_match. One DNS lookup.

A positive in_ip_range or rdns_match also forces is_crawler=True (catches UA-less crawlers).

With trust_forwarded=True, middleware uses the first IP from X-Forwarded-For, then X-Real-IP, before the direct client address.
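
Putting the pieces together for FastAPI, a sketch that wires the ASGI middleware with IP verification behind a reverse proxy (FastAPI is only an example host framework; the flag values are illustrative):

from fastapi import FastAPI, Request

from is_crawler.contrib import ASGICrawlerMiddleware

api = FastAPI()

@api.get("/")
async def index(request: Request):
    result = request.scope["is_crawler"]  # CrawlerMiddlewareResult set by the middleware
    return {"crawler": result.is_crawler, "name": result.name, "verified": result.verified}

# Wrap the ASGI app; trust_forwarded honours X-Forwarded-For / X-Real-IP from the proxy.
app = ASGICrawlerMiddleware(api, verify_ip=True, trust_forwarded=True)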

robots.txt helpers

Generate robots.txt directives from DB tags. Crawler names are extracted from the DB patterns (entries consisting only of slashes or URLs are skipped).

from is_crawler import build_robots_txt, robots_agents_for_tags, iter_crawlers

robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', 'Claude-Web', 'GPTBot', ...]

print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
#
# User-agent: Nikto
# Disallow: /
# ...

build_robots_txt(allow="search-engine", path="/public")
# User-agent: Googlebot
# Allow: /public
# ...

for info, name in iter_crawlers():      # (CrawlerInfo, robots-name) per DB entry
    ...
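
As a sketch, the generated text can be written out at build or deploy time (the output path and tag choices are just an example policy):

from is_crawler import build_robots_txt

with open("robots.txt", "w", encoding="utf-8") as fh:
    fh.write(build_robots_txt(disallow=["ai-crawler", "scanner"]))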

IP verification (is_crawler.ip)

Two complementary strategies: use either or both.

FCrDNS (forward-confirmed reverse DNS)

rDNS → suffix check → forward lookup → IP match. Catches UA spoofing. socket only, no deps.

from is_crawler.ip import verify_crawler_ip, forward_confirmed_rdns, reverse_dns

verify_crawler_ip("Googlebot/2.1 (+http://www.google.com/bot.html)", "66.249.66.1")
# True → rDNS ends in .googlebot.com AND forward lookup returns same IP

verify_crawler_ip("Googlebot/2.1", "8.8.8.8")               # False (spoof)
reverse_dns("8.8.8.8")                                       # 'dns.google'
forward_confirmed_rdns("66.249.66.1", (".googlebot.com",))   # hostname or None

Built-in suffixes: Googlebot, Bingbot, Applebot, DuckDuckBot, YandexBot, Baiduspider, FacebookBot, and 80+ more. Crawler name taken from crawler_name(ua).
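
A sketch combining the UA heuristics with FCrDNS so only verified claims are trusted (trusted_crawler is an illustrative helper):

from is_crawler import is_crawler
from is_crawler.ip import verify_crawler_ip

def trusted_crawler(ua: str, ip: str) -> bool:
    # Cheap UA check first; only claims confirmed by rDNS plus forward lookup count.
    return is_crawler(ua) and verify_crawler_ip(ua, ip)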

IP range lookup

Check whether an IP belongs to any known crawler's published CIDR ranges. Requires a prebuilt range database, which ships with the package (see Tools below).

from is_crawler.ip import ip_in_range, known_crawler_ip, known_crawler_rdns

ip_in_range("66.249.66.1")    # True : in Googlebot ranges
ip_in_range("8.8.8.8")        # False: not a known crawler range
known_crawler_ip("66.249.66.1")  # alias for ip_in_range
known_crawler_rdns("66.249.66.1")  # True: reverse DNS matches a known crawler domain

Results are LRU-cached. The file is optional: if absent, ip_in_range returns False rather than raising.
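
A sketch layering the two checks, cheapest first (ip_looks_like_crawler is an illustrative helper):

from is_crawler.ip import ip_in_range, known_crawler_rdns

def ip_looks_like_crawler(ip: str) -> bool:
    # Offline CIDR lookup first; fall back to a single reverse-DNS lookup.
    return ip_in_range(ip) or known_crawler_rdns(ip)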

Tools

Scripts in tools/ build the data files shipped inside the package.

build_user_agents.py

Compiles is_crawler/crawler-user-agents.json from the upstream monperrus/crawler-user-agents source plus local extras.

python3 tools/build_user_agents.py
python3 tools/build_user_agents.py --input crawler-user-agents.json --output is_crawler/crawler-user-agents.json

build_ip_ranges.py

Fetches live IP range data from 39 official sources (Google, Bing, DuckDuckGo, Apple, OpenAI, Anthropic, Perplexity, Common Crawl, Cloudflare, Fastly, AWS, Azure, Oracle Cloud, GitHub, Telegram, Ahrefs, Yandex, Facebook, Kagi, Amazon, UptimeRobot, Pingdom, Stripe, and more) and writes a flat is_crawler/crawler-ip-ranges.json mapping each source name to its CIDR list.

python3 tools/build_ip_ranges.py
python3 tools/build_ip_ranges.py --timeout 30 --skip-errors

Source definitions live in tools/crawler-ip-ranges.json (name → {url, pattern}) and can be extended independently of the build script.

CLI

python -m is_crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"
tail -f access.log | awk -F'"' '{print $6}' | python -m is_crawler

One JSON object per UA (arg or stdin line) with is_crawler, name, version, url, signals, info.
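
A sketch of driving the CLI from Python and reading one result back (the keys are those listed above; the exact JSON layout is an assumption of this illustration):

import json
import subprocess
import sys

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
out = subprocess.run(
    [sys.executable, "-m", "is_crawler", ua],
    capture_output=True, text=True, check=True,
).stdout
record = json.loads(out)  # one JSON object for the single UA argument
print(record["is_crawler"], record["name"])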

Caching

Every public function has a 32k-entry LRU cache. Repeat UAs hit in ~40 ns.

Benchmarks

Python 3.14, Linux x86_64. cua = crawler-user-agents v1.44.

Synthetic corpus

Corpus: 1,231 crawler UAs + 15,812 browser UAs.

Scenario     is_crawler   crawler_info   cua.is_crawler   cua.crawler_info
Warm cache   0.05 µs      0.60 µs        158.9 µs         732.0 µs
Cold cache   1.85 µs      2.07 µs        176.94 µs        733.4 µs

That is roughly 3000× faster for hot is_crawler, 96× faster for cold is_crawler, and 354× faster for cold crawler_info.

Real Apache logs

Corpus: 42,512 UA entries from two Apache access logs (8,942 crawlers, 33,570 browsers, 21.0% crawler ratio).

Scenario     is_crawler (all)   crawler_info (all)   cua.is_crawler (all)   cua.crawler_info (all)
Warm cache   0.044 µs           0.115 µs             64.121 µs              1513.618 µs
Cold cache   0.143 µs           0.970 µs             -                      -

Full-log classify time:

Log                   Time      Crawlers found
apache_access_1.txt   2.22 ms   6,462
apache_access_2.txt   0.77 ms   2,480
Combined              2.16 ms   8,942

IP verification

First-call rDNS latency is network-dependent.

Function                 Warm cache
ip_in_range              0.06 µs
known_crawler_ip         0.08 µs
reverse_dns              0.48 µs
forward_confirmed_rdns   3.69 µs
known_crawler_rdns       4.27 µs
verify_crawler_ip        3.23 µs

Notes

  • Warm cache reflects repeated lookups with LRU hits.
  • Cold cache clears the public API caches between benchmark runs.
  • DB patterns compile lazily per 48-entry chunk on first match.

Implementation Notes

Why regex-free?

Crawler detection runs on every request, so predictable runtime matters. is-crawler implements its hot-path heuristics with str.find plus char scans instead of regex backtracking. That keeps is_crawler() fast and avoids the usual ReDoS footguns from hostile user-agent strings.
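
A minimal sketch of the idea, not the library's actual implementation: fixed-substring scans are linear in the UA length and leave nothing for a hostile string to backtrack on.

BOT_KEYWORDS = ("bot", "crawl", "spider", "scrape", "headless")

def looks_like_bot(ua: str) -> bool:
    # Plain substring checks (str.find under the hood) instead of a regex alternation.
    lowered = ua.lower()
    return any(keyword in lowered for keyword in BOT_KEYWORDS)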

crawler_info() does use re, but only against curated upstream patterns from monperrus/crawler-user-agents, and those patterns are simple enough to avoid catastrophic backtracking in practice.

Formatting

pip install black isort
isort . && black .
npx prtfm

Contributing

Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md to get started.

Security

Report vulnerabilities via a GitHub private security advisory; do not open a public issue. See SECURITY.md.

Code of Conduct

See CODE_OF_CONDUCT.md.

License

Apache-2.0
