is-crawler
Fast, regex-free crawler detection from user agents. Zero deps, ReDoS-safe heuristics, ~100× faster than alternatives. Includes FCrDNS IP verification for 100+ known crawlers.
Install
```bash
pip install is-crawler
```
Usage
```python
from is_crawler import (
    is_crawler, crawler_signals, crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, CrawlerInfo,
)
from is_crawler.ip import (
    verify_crawler_ip,
    reverse_dns,
    forward_confirmed_rdns,
    ip_in_range,
    known_crawler_ip,
    known_crawler_rdns,
)

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
ip = "66.249.66.1"

is_crawler(ua)       # True
crawler_signals(ua)  # ['bot_signal', 'no_browser_signature', 'url_in_ua']
crawler_name(ua)     # 'Googlebot'
crawler_version(ua)  # '2.1'
crawler_url(ua)      # 'http://www.google.com/bot.html'

verify_crawler_ip(ua, ip)  # True - FCrDNS validation
reverse_dns(ip)            # 'crawl-66-249-66-1.googlebot.com'
forward_confirmed_rdns(ip, (".googlebot.com",))  # hostname or None
ip_in_range(ip)            # True - in known crawler CIDRs
known_crawler_ip(ip)       # alias for ip_in_range
known_crawler_rdns(ip)     # True - known crawler via FCrDNS/rDNS

info = crawler_info(ua)  # CrawlerInfo(...)
if info is not None:
    info.url          # 'http://www.google.com/bot.html'
    info.description  # "Google's main web crawling bot..."
    info.tags         # ('search-engine',)

crawler_has_tag(ua, "search-engine")        # True
crawler_has_tag(ua, ["ai-crawler", "seo"])  # False
```
API
is_crawler(ua: str) -> bool
Heuristic detection. Returns True if the UA is a crawler. No DB lookup, no regex.
Three short-circuit rules:
- Positive signal: bot keywords (`bot`, `crawl`, `spider`, `scrape`, `headless`, `slurp`, `archiv`, `preview`, ...), known tools (`playwright`, `selenium`, `wget`, `lighthouse`, `sqlmap`, `nikto`, `nmap`, `httrack`, `pingdom`, `google-safety`, ...), or a URL/email embedded in the UA.
- No browser signature: missing `Mozilla/`, `WebKit`, `Gecko`, `Trident`, `Presto`, `KHTML`, `Links`, `Lynx`, `Opera`, or an OS token like `(Windows`, `(Linux`, `(X11`, `(Macintosh`.
- Bare `(compatible; ...)`: the classic bot block without OS/browser tokens inside.
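A minimal sketch of the same technique, using plain substring scans with abbreviated keyword lists (illustrative only, not the library's actual implementation):

```python
# Every check is a bounded substring scan, so runtime is linear in the
# UA length with no regex backtracking.
BOT_WORDS = ("bot", "crawl", "spider", "scrape", "headless", "slurp")
BROWSER_TOKENS = ("mozilla/", "webkit", "gecko", "trident", "khtml",
                  "presto", "links", "lynx", "opera")

def looks_like_crawler(ua: str) -> bool:
    low = ua.lower()
    if any(w in low for w in BOT_WORDS):                     # positive signal
        return True
    if "http://" in low or "https://" in low or "@" in low:  # URL/email in UA
        return True
    if not any(t in low for t in BROWSER_TOKENS):            # no browser signature
        return True
    return False
```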
crawler_signals(ua: str) -> list[str]
Which individual rules fired. Subset of: bot_signal, no_browser_signature, bare_compatible, known_tool, url_in_ua. Useful for diagnostics and logging. is_crawler does not call this.
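For example, to log the verdict alongside the rules that fired (exact signal lists depend on the library's heuristics):

```python
from is_crawler import crawler_signals, is_crawler

for ua in ("curl/7.64.1",
           "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"):
    # crawler_signals() is independent diagnostics; pair it with the verdict.
    print(is_crawler(ua), crawler_signals(ua))
```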
crawler_name(ua: str) -> str | None
Product name extracted from the UA.
- `Googlebot/2.1 ...` → `'Googlebot'`
- `Mozilla/5.0 (compatible; bingbot/2.0; ...)` → `'bingbot'`
- `Mozilla/5.0 ... Speedy Spider (...)` → `'Speedy Spider'`
- Chrome/Firefox/Safari → `None`
crawler_version(ua: str) -> str | None
Version token extracted from the UA. Returns None if no non-browser version is detectable.
- `curl/7.64.1` → `'7.64.1'`
- `Mozilla/5.0 (compatible; Miniflux/2.0.10; ...)` → `'2.0.10'`
- `Googlebot/2.1 ...` → `'2.1'`
crawler_url(ua: str) -> str | None
URL embedded in the UA (after +, ;, or -).
- `Googlebot/2.1 (+http://www.google.com/bot.html)` → `'http://www.google.com/bot.html'`
- UA with no embedded URL → `None`
crawler_info(ua: str) -> CrawlerInfo | None
DB lookup against 1200 known crawler patterns. Returns None for browsers (short-circuits via is_crawler).
```python
class CrawlerInfo(NamedTuple):
    url: str               # crawler's info/docs URL (may be '')
    description: str       # human-readable description
    tags: tuple[str, ...]  # classification tags, e.g. ('search-engine',)
```
crawler_has_tag(ua: str, tags: str | Iterable[str]) -> bool
True if the crawler has any of the given tags. tags accepts a single string or a list.
Available tags: search-engine, ai-crawler, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.
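A small filtering example over these tags (the GPTBot UA string below is illustrative):

```python
from is_crawler import crawler_has_tag

uas = [
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "GPTBot/1.0 (+https://openai.com/gptbot)",  # illustrative UA string
]
search = [ua for ua in uas if crawler_has_tag(ua, "search-engine")]
ai_or_seo = [ua for ua in uas if crawler_has_tag(ua, ["ai-crawler", "seo"])]
```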
Category shortcuts
One-tag wrappers over crawler_has_tag:
```python
is_search_engine(ua)       # 'search-engine'
is_ai_crawler(ua)          # 'ai-crawler'
is_seo(ua)                 # 'seo'
is_social_preview(ua)      # 'social-preview'
is_advertising(ua)         # 'advertising'
is_archiver(ua)            # 'archiver'
is_feed_reader(ua)         # 'feed-reader'
is_monitoring(ua)          # 'monitoring'
is_scanner(ua)             # 'scanner'
is_academic(ua)            # 'academic'
is_http_library(ua)        # 'http-library'
is_browser_automation(ua)  # 'browser-automation'
```
is_good_crawler(ua) / is_bad_crawler(ua)
Opinionated groupings for quick allow/deny gates.
- Good (indexing, previews, archives, feeds, research): `search-engine`, `social-preview`, `feed-reader`, `archiver`, `academic`.
- Bad (scraping, scanning, unattributed traffic): `ai-crawler`, `scanner`, `http-library`, `browser-automation`, `seo`.
`advertising` and `monitoring` are intentionally in neither group: they are policy-dependent.
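A minimal gate built on these groupings (assuming the shortcuts are importable from the package top level, like the helpers above):

```python
from is_crawler import is_bad_crawler, is_search_engine

def should_block(ua: str) -> bool:
    # Deny the opinionated "bad" set; everything else passes.
    return is_bad_crawler(ua)

def should_prerender(ua: str) -> bool:
    # Serve prerendered HTML only to search-engine crawlers.
    return is_search_engine(ua)
```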
Middleware
```python
from is_crawler.contrib import WSGICrawlerMiddleware

app = WSGICrawlerMiddleware(app)

# Flask
request.environ["is_crawler"].is_crawler
# Django
request.META["is_crawler"].name
```

```python
from is_crawler.contrib import ASGICrawlerMiddleware

app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler")

# FastAPI / Starlette
request.scope["is_crawler"].is_crawler
request.state.crawler.verified
```
Both middlewares are zero-dep. They attach a `CrawlerMiddlewareResult` with `user_agent`, `ip`, `is_crawler`, `name`, and `verified`.

- `WSGICrawlerMiddleware`: Flask, Django, any WSGI app
- `ASGICrawlerMiddleware`: FastAPI, Starlette, any ASGI app
Optional flags: `block=True`, `block_tags=...`, `verify_ip=True`, `trust_forwarded=True`. With `trust_forwarded=True`, the middleware takes the first IP from `X-Forwarded-For`, then `X-Real-IP`, before falling back to the direct client address.
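A sketch of that precedence for a WSGI environ (a hypothetical helper, not the middleware's actual code):

```python
def client_ip(environ: dict, trust_forwarded: bool = False) -> str | None:
    if trust_forwarded:
        xff = environ.get("HTTP_X_FORWARDED_FOR")
        if xff:
            return xff.split(",")[0].strip()  # first (client-most) hop
        xri = environ.get("HTTP_X_REAL_IP")
        if xri:
            return xri.strip()
    return environ.get("REMOTE_ADDR")         # direct client address
```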
robots.txt helpers
Generate robots.txt directives from DB tags. Agent names are extracted from DB patterns (slash- and URL-only entries are skipped).
```python
from is_crawler import build_robots_txt, robots_agents_for_tags, iter_crawlers

robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', 'Claude-Web', 'GPTBot', ...]

print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
#
# User-agent: Nikto
# Disallow: /
# ...

build_robots_txt(allow="search-engine", path="/public")
# User-agent: Googlebot
# Allow: /public
# ...

for info, name in iter_crawlers():  # (CrawlerInfo, robots-name) per DB entry
    ...
```
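One way to serve the result, sketched with Flask (the disallow policy mirrors the example above):

```python
from flask import Flask, Response
from is_crawler import build_robots_txt

app = Flask(__name__)

@app.route("/robots.txt")
def robots_txt() -> Response:
    body = build_robots_txt(disallow=["ai-crawler", "scanner"])
    return Response(body, mimetype="text/plain")
```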
IP verification (is_crawler.ip)
Two complementary strategies: use either or both.
FCrDNS (forward-confirmed reverse DNS)
rDNS → suffix check → forward lookup → IP match. Catches UA spoofing. socket only, no deps.
```python
from is_crawler.ip import verify_crawler_ip, forward_confirmed_rdns, reverse_dns

verify_crawler_ip("Googlebot/2.1 (+http://www.google.com/bot.html)", "66.249.66.1")
# True → rDNS ends in .googlebot.com AND forward lookup returns the same IP

verify_crawler_ip("Googlebot/2.1", "8.8.8.8")  # False (spoof)

reverse_dns("8.8.8.8")  # 'dns.google'
forward_confirmed_rdns("66.249.66.1", (".googlebot.com",))  # hostname or None
```
Built-in suffixes: Googlebot, Bingbot, Applebot, DuckDuckBot, YandexBot, Baiduspider, FacebookBot, and 80+ more. Crawler name taken from crawler_name(ua).
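The full FCrDNS round trip is simple enough to sketch with the standard library (IPv4-oriented; illustrative, not the package's implementation):

```python
import socket

def fcrdns(ip: str, allowed_suffixes: tuple[str, ...]) -> str | None:
    """Return the verified hostname, or None if any step fails."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # 1. reverse DNS
    except OSError:
        return None
    if not host.endswith(allowed_suffixes):          # 2. suffix check
        return None
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # 3. forward lookup
    except OSError:
        return None
    return host if ip in addrs else None             # 4. IP match

fcrdns("66.249.66.1", (".googlebot.com",))  # hostname if Google-owned, else None
```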
IP range lookup
Check whether an IP belongs to any known crawler's published CIDR range. Requires building the range database first (see Tools below).
```python
from is_crawler.ip import ip_in_range, known_crawler_ip, known_crawler_rdns

ip_in_range("66.249.66.1")         # True : in Googlebot ranges
ip_in_range("8.8.8.8")             # False: not a known crawler range
known_crawler_ip("66.249.66.1")    # alias for ip_in_range
known_crawler_rdns("66.249.66.1")  # True: reverse DNS matches a known crawler domain
```
Results are LRU-cached. The range-database file is optional: if it is absent, ip_in_range returns False rather than raising.
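The underlying check is plain stdlib `ipaddress` containment. A minimal sketch over a mapping shaped like the shipped file (source name → CIDR list; the range below is illustrative):

```python
import ipaddress

def in_any_range(ip: str, ranges: dict[str, list[str]]) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr, strict=False)
        for cidrs in ranges.values()
        for cidr in cidrs
    )

in_any_range("66.249.66.1", {"googlebot": ["66.249.64.0/19"]})  # True
```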
Tools
Scripts in tools/ build the data files shipped inside the package.
build_user_agents.py
Compiles is_crawler/crawler-user-agents.json from the upstream monperrus/crawler-user-agents source plus local extras.
```bash
python3 tools/build_user_agents.py
python3 tools/build_user_agents.py --input crawler-user-agents.json --output is_crawler/crawler-user-agents.json
```
build_ip_ranges.py
Fetches live IP range data from 39 official sources (Google, Bing, DuckDuckGo, Apple, OpenAI, Anthropic, Perplexity, Common Crawl, Cloudflare, Fastly, AWS, Azure, Oracle Cloud, GitHub, Telegram, Ahrefs, Yandex, Facebook, Kagi, Amazon, UptimeRobot, Pingdom, Stripe, and more) and writes a flat is_crawler/crawler-ip-ranges.json mapping each source name to its CIDR list.
```bash
python3 tools/build_ip_ranges.py
python3 tools/build_ip_ranges.py --timeout 30 --skip-errors
```
Source definitions live in tools/crawler-ip-ranges.json (name → {url, pattern}) and can be extended independently of the build script.
CLI
```bash
python -m is_crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"
tail -f access.log | awk -F'"' '{print $6}' | python -m is_crawler
```
Prints one JSON object per UA (argument or stdin line) with is_crawler, name, version, url, signals, and info.
Caching
Every public function has a 32k-entry LRU cache. Repeat UAs hit in ~40 ns.
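In spirit this is plain `functools.lru_cache` (a sketch; reading "32k" as 32,768 entries is an assumption):

```python
from functools import lru_cache

@lru_cache(maxsize=32768)
def classify(ua: str) -> bool:
    # Stand-in for the real heuristics: repeat calls with the same UA
    # string are answered from the cache without rescanning it.
    return "bot" in ua.lower()
```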
Benchmarks
Python 3.14, Linux x86_64. `cua` = the crawler-user-agents package, v1.44.
Synthetic corpus
Corpus: 1,231 crawler UAs + 15,812 browser UAs.
| Scenario | is_crawler | crawler_info | cua.is_crawler | cua.crawler_info |
|---|---|---|---|---|
| Warm cache | 0.05 µs | 0.60 µs | 158.9 µs | 732.0 µs |
| Cold cache | 1.85 µs | 2.07 µs | 176.94 µs | 733.4 µs |
That is roughly 3000× faster for hot is_crawler, 96× faster for cold is_crawler, and 354× faster for cold crawler_info.
Real Apache logs
Corpus: 42,512 UA entries from two Apache access logs (8,942 crawlers, 33,570 browsers, 21.0% crawler ratio).
| Scenario | is_crawler (all) | crawler_info (all) | cua.is_crawler (all) | cua.crawler_info (all) |
|---|---|---|---|---|
| Warm cache | 0.044 µs | 0.115 µs | 64.121 µs | 1513.618 µs |
| Cold cache | 0.143 µs | 0.970 µs | - | - |
Full-log classify time:
| Log | Time | Crawlers found |
|---|---|---|
| apache_access_1.txt | 2.22 ms | 6,462 |
| apache_access_2.txt | 0.77 ms | 2,480 |
| Combined | 2.16 ms | 8,942 |
Notes
- Warm cache reflects repeated UA lookups with LRU hits.
- Cold cache clears the public API caches between benchmark runs.
- DB patterns compile lazily per 48-entry chunk on first match.
Implementation Notes
Why regex-free?
Crawler detection runs on every request, so predictable runtime matters. is-crawler implements its hot-path heuristics with str.find plus char scans instead of regex backtracking. That keeps is_crawler() fast and avoids the usual ReDoS footguns from hostile user-agent strings.
crawler_info() does use re, but only against curated upstream patterns from monperrus/crawler-user-agents, and those patterns are simple enough to avoid catastrophic backtracking in practice.
Formatting
```bash
pip install black isort
isort . && black .
npx prtfm
```
Contributing
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md to get started.
Security
Report vulnerabilities via a GitHub private security advisory; do not open a public issue. See SECURITY.md.
Code of Conduct
See CODE_OF_CONDUCT.md.
License
Download files
File details
Details for the file is_crawler-1.5.2.1.tar.gz.
File metadata
- Download URL: is_crawler-1.5.2.1.tar.gz
- Upload date:
- Size: 383.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4de20b4c97a9074fe765f79584e282c2b20a76c0d25b45406651122f1be9a08a` |
| MD5 | `b5713c2b7cbdfc8cb4050e5f9073a9cb` |
| BLAKE2b-256 | `ec7f0044db37c132c51523faa2603ab5d9fda480dbb9e0667095f44e27b720d7` |
Provenance
The following attestation bundles were made for is_crawler-1.5.2.1.tar.gz:
Publisher: publish.yml on tn3w/is-crawler

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: is_crawler-1.5.2.1.tar.gz
- Subject digest: 4de20b4c97a9074fe765f79584e282c2b20a76c0d25b45406651122f1be9a08a
- Sigstore transparency entry: 1373908222
- Sigstore integration time:
- Permalink: tn3w/is-crawler@dd1b59d0ae2e626336421f66c42b7bfb2ae24927
- Branch / Tag: refs/heads/master
- Owner: https://github.com/tn3w
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dd1b59d0ae2e626336421f66c42b7bfb2ae24927
- Trigger Event: push
File details
Details for the file is_crawler-1.5.2.1-py3-none-any.whl.
File metadata
- Download URL: is_crawler-1.5.2.1-py3-none-any.whl
- Upload date:
- Size: 372.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `67ed7f6d6020a805b159ac14366a1019e47feaf9153a6247a71e652cde8fe9d5` |
| MD5 | `18e09df066784b6f75874e8c37094719` |
| BLAKE2b-256 | `1b90fe6ebdc7eb041d157f92677b014e814d31d3efa2e89e7bd4c5d571ad1191` |
Provenance
The following attestation bundles were made for is_crawler-1.5.2.1-py3-none-any.whl:
Publisher: publish.yml on tn3w/is-crawler

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: is_crawler-1.5.2.1-py3-none-any.whl
- Subject digest: 67ed7f6d6020a805b159ac14366a1019e47feaf9153a6247a71e652cde8fe9d5
- Sigstore transparency entry: 1373908315
- Sigstore integration time:
- Permalink: tn3w/is-crawler@dd1b59d0ae2e626336421f66c42b7bfb2ae24927
- Branch / Tag: refs/heads/master
- Owner: https://github.com/tn3w
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dd1b59d0ae2e626336421f66c42b7bfb2ae24927
- Trigger Event: push