is-crawler
Fast, regex-free crawler detection from user agents. Zero deps, ReDoS-safe heuristics, ~100× faster than alternatives. Includes FCrDNS IP verification for 100+ known crawlers.
Why regex-free?
Regex is a frequent source of ReDoS vulnerabilities: one un-anchored .* or nested quantifier against a hostile UA can spike CPU time to seconds. Crawler detection runs on every request, so a catastrophic pattern is a denial-of-service primitive. is-crawler implements all heuristics with str.find and character scans: no regex engine, no backtracking, no ReDoS surface. crawler_info uses re only to match against curated DB patterns (monperrus/crawler-user-agents), which are simple literals (e.g. Googlebot\/, bingbot, AdsBot-Google([^-]|$), [wW]get) with no nested quantifiers and no catastrophic backtracking paths.
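To see the class of bug being avoided, here is a textbook illustration of catastrophic backtracking in Python's re (this pattern is not anything is-crawler ships):

import re
# The nested quantifier (a+)+ forces exponential backtracking once the
# trailing "!" makes the overall match fail.
evil = re.compile(r"(a+)+$")
# Uncommenting this pins a CPU core for seconds on ~30 characters of input:
# evil.match("a" * 30 + "!")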
Install
pip install is-crawler
Usage
from is_crawler import (
    is_crawler, crawler_signals, crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, CrawlerInfo,
)
from is_crawler.ip import verify_crawler_ip
ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"
ip = "66.249.66.1"
is_crawler(ua) # True
crawler_signals(ua) # ['bot_signal', 'no_browser_signature', 'url_in_ua']
crawler_name(ua) # 'Googlebot'
crawler_version(ua) # '2.1'
crawler_url(ua) # 'http://www.google.com/bot.html'
verify_crawler_ip(ua, ip) # True - FCrDNS validation
info = crawler_info(ua) # CrawlerInfo(...)
if info is not None:
    info.url # 'http://www.google.com/bot.html'
    info.description # "Google's main web crawling bot..."
    info.tags # ('search-engine',)
crawler_has_tag(ua, "search-engine") # True
crawler_has_tag(ua, ["ai-crawler", "seo"]) # False
API
is_crawler(ua: str) -> bool
Heuristic detection. Returns True if the UA is a crawler. No DB lookup, no regex.
Three short-circuit rules:
- Positive signal: bot keywords (bot, crawl, spider, scrape, headless, slurp, archiv, preview, ...), known tools (playwright, selenium, wget, lighthouse, sqlmap, nikto, nmap, httrack, pingdom, google-safety, ...), or a URL/email embedded in the UA. A minimal sketch of this rule follows the list.
- No browser signature: missing Mozilla/, WebKit, Gecko, Trident, Presto, KHTML, Links, Lynx, Opera, or an OS token like (Windows, (Linux, (X11, (Macintosh.
- Bare (compatible; ...): classic bot block without OS/browser tokens inside.
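A minimal sketch of the first rule, assuming nothing about the library's internals beyond the str.find approach described above; the token list is abbreviated:

def has_bot_signal(ua: str) -> bool:
    # Lowercase once, then do plain substring scans — no regex engine involved.
    ua_l = ua.lower()
    keywords = ("bot", "crawl", "spider", "scrape", "headless", "slurp", "archiv", "preview")
    return any(ua_l.find(k) != -1 for k in keywords)

has_bot_signal("Googlebot/2.1")  # True
has_bot_signal("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")  # False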
crawler_signals(ua: str) -> list[str]
Which individual rules fired. Subset of: bot_signal, no_browser_signature, bare_compatible, known_tool, url_in_ua. Useful for diagnostics and logging. is_crawler does not call this.
crawler_name(ua: str) -> str | None
Product name extracted from the UA.
- Googlebot/2.1 ... → 'Googlebot'
- Mozilla/5.0 (compatible; bingbot/2.0; ...) → 'bingbot'
- Mozilla/5.0 ... Speedy Spider (...) → 'Speedy Spider'
- Chrome/Firefox/Safari → None
crawler_version(ua: str) -> str | None
Version token extracted from the UA. Returns None if no non-browser version is detectable.
- curl/7.64.1 → '7.64.1'
- Mozilla/5.0 (compatible; Miniflux/2.0.10; ...) → '2.0.10'
- Googlebot/2.1 ... → '2.1'
crawler_url(ua: str) -> str | None
URL embedded in the UA (after +, ;, or -).
- Googlebot/2.1 (+http://www.google.com/bot.html) → 'http://www.google.com/bot.html'
- UA with no embedded URL → None
crawler_info(ua: str) -> CrawlerInfo | None
DB lookup against 1200 known crawler patterns. Returns None for browsers (short-circuits via is_crawler).
class CrawlerInfo(NamedTuple):
    url: str               # crawler's info/docs URL (may be '')
    description: str       # human-readable description
    tags: tuple[str, ...]  # classification tags, e.g. ('search-engine',)
crawler_has_tag(ua: str, tags: str | Iterable[str]) -> bool
True if the crawler has any of the given tags. tags accepts a single string or a list.
Available tags: search-engine, ai-crawler, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.
Category shortcuts
One-tag wrappers over crawler_has_tag:
is_search_engine(ua) # 'search-engine'
is_ai_crawler(ua) # 'ai-crawler'
is_seo(ua) # 'seo'
is_social_preview(ua) # 'social-preview'
is_advertising(ua) # 'advertising'
is_archiver(ua) # 'archiver'
is_feed_reader(ua) # 'feed-reader'
is_monitoring(ua) # 'monitoring'
is_scanner(ua) # 'scanner'
is_academic(ua) # 'academic'
is_http_library(ua) # 'http-library'
is_browser_automation(ua) # 'browser-automation'
is_good_crawler(ua) / is_bad_crawler(ua)
Opinionated groupings for quick allow/deny gates.
- Good (indexing, previews, archives, feeds, research): search-engine, social-preview, feed-reader, archiver, academic.
- Bad (scraping, scanning, unattributed traffic): ai-crawler, scanner, http-library, browser-automation, seo.
advertising and monitoring are intentionally neither: policy-dependent.
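A hedged sketch of a quick gate built on these groupings; the function shape and status codes are illustrative, not part of the library:

from is_crawler import is_bad_crawler, is_good_crawler

def gate(ua: str) -> int:
    # Illustrative policy: deny "bad", allow "good", default-allow the rest.
    if is_bad_crawler(ua):
        return 403  # scraping, scanning, unattributed traffic
    if is_good_crawler(ua):
        return 200  # indexing, previews, archives, feeds, research
    return 200      # browsers and policy-neutral categories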
Middleware
from is_crawler.contrib import WSGICrawlerMiddleware
app = WSGICrawlerMiddleware(app)
# Flask
request.environ["is_crawler"].is_crawler
# Django
request.META["is_crawler"].name
from is_crawler.contrib import ASGICrawlerMiddleware
app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler")
# FastAPI / Starlette
request.scope["is_crawler"].is_crawler
request.state.crawler.verified
Both middlewares are zero-dep. They attach CrawlerMiddlewareResult with
user_agent, ip, is_crawler, name, and verified.
- WSGICrawlerMiddleware: Flask, Django, any WSGI app
- ASGICrawlerMiddleware: FastAPI, Starlette, any ASGI app
Optional flags: block=True, block_tags=..., verify_ip=True, trust_forwarded=True. With trust_forwarded=True, the middleware takes the first IP from X-Forwarded-For in preference to the direct client address.
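A minimal wiring sketch for Starlette using the flags above; the route body is illustrative, and the exact blocking response is an assumption, not documented behavior:

from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route
from is_crawler.contrib import ASGICrawlerMiddleware

async def home(request: Request) -> JSONResponse:
    result = request.scope["is_crawler"]  # CrawlerMiddlewareResult
    return JSONResponse({"crawler": result.is_crawler, "name": result.name})

app = ASGICrawlerMiddleware(
    Starlette(routes=[Route("/", home)]),
    block=True,               # reject matching crawlers before they reach the app
    block_tags="ai-crawler",  # limit blocking to this category
    verify_ip=True,           # FCrDNS-check crawlers claiming a known identity
    trust_forwarded=True,     # honor X-Forwarded-For behind a reverse proxy
)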
robots.txt helpers
Generate directives from DB tags. Names extracted from DB patterns (slash/URL-only entries skipped).
from is_crawler import build_robots_txt, robots_agents_for_tags, iter_crawlers
robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', 'Claude-Web', 'GPTBot', ...]
print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
#
# User-agent: Nikto
# Disallow: /
# ...
build_robots_txt(allow="search-engine", path="/public")
# User-agent: Googlebot
# Allow: /public
# ...
for info, name in iter_crawlers(): # (CrawlerInfo, robots-name) per DB entry
...
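One way to put build_robots_txt to work: a hedged sketch serving the generated file from Flask (the route wiring is illustrative, not part of the library):

from flask import Flask, Response
from is_crawler import build_robots_txt

app = Flask(__name__)

@app.get("/robots.txt")
def robots() -> Response:
    # Regenerating per request is cheap; cache the string if traffic warrants.
    body = build_robots_txt(disallow=["ai-crawler", "scanner"])
    return Response(body, mimetype="text/plain")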
IP verification (is_crawler.ip)
Forward-confirmed reverse DNS (FCrDNS). rDNS → suffix check → forward lookup → IP match. Catches UA spoofing. socket only, no deps.
from is_crawler.ip import verify_crawler_ip, forward_confirmed_rdns, reverse_dns
verify_crawler_ip("Googlebot/2.1 (+http://www.google.com/bot.html)", "66.249.66.1")
# True → rDNS ends in .googlebot.com AND forward lookup returns same IP
verify_crawler_ip("Googlebot/2.1", "8.8.8.8") # False (spoof)
reverse_dns("8.8.8.8") # 'dns.google'
forward_confirmed_rdns("66.249.66.1", (".googlebot.com",)) # hostname or None
Built-in suffixes: Googlebot, Bingbot, Applebot, DuckDuckBot, YandexBot, Baiduspider, FacebookBot, and 80+ more. Crawler name taken from crawler_name(ua).
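For reference, the FCrDNS procedure itself fits in a few socket calls. A minimal sketch of the idea, not the library's implementation:

import socket

def fcrdns(ip: str, suffixes: tuple[str, ...]) -> str | None:
    try:
        host = socket.gethostbyaddr(ip)[0]  # 1. reverse DNS
    except OSError:
        return None
    if not host.endswith(suffixes):  # 2. suffix check, e.g. '.googlebot.com'
        return None
    try:
        infos = socket.getaddrinfo(host, None)  # 3. forward lookup
    except OSError:
        return None
    # 4. IP match: the hostname must resolve back to the original address
    return host if any(info[4][0] == ip for info in infos) else None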
CLI
python -m is_crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"
tail -f access.log | awk -F'"' '{print $6}' | python -m is_crawler
One JSON object per UA (arg or stdin line) with is_crawler, name, version, url, signals, info.
Caching
Every public function has a 32k-entry LRU cache. Repeat UAs hit in ~40 ns.
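The pattern here is ordinary functools.lru_cache; a sketch of the idea, with maxsize assuming "32k" means 2**15 and the wrapped function purely illustrative:

from functools import lru_cache

@lru_cache(maxsize=32768)
def detect(ua: str) -> bool:
    # First call pays the full heuristic cost; repeat UAs are a dict hit (~40 ns).
    ...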
Benchmarks
Python 3.14, Linux x86_64. Corpus: 1,231 crawler UAs, 15,812 browser UAs. cua = crawler-user-agents v1.44.
Hot-path (warm cache)
| Function | is_crawler | cua | speedup |
|---|---|---|---|
| is_crawler (mixed) | 0.05 µs | 158.9 µs | 3000× |
| crawler_info | 0.60 µs | 732.0 µs | 1220× |
| crawler_signals | 1.13 µs | - | - |
| crawler_name | 0.33 µs | - | - |
| crawler_version | 0.32 µs | - | - |
| crawler_url | 0.09 µs | - | - |
| crawler_has_tag | 0.10 µs | - | - |
Cold-cache (per-call, no LRU hits)
| Function | Test Case | is_crawler | cua | speedup |
|---|---|---|---|---|
| is_crawler | crawlers | 1.94 µs | 64.35 µs | 33× |
| is_crawler | browsers | 1.85 µs | 183.76 µs | 99× |
| is_crawler | mixed | 1.85 µs | 176.94 µs | 96× |
| crawler_info | - | 2.07 µs | 733.4 µs | 354× |
| crawler_name | - | 1.36 µs | - | - |
| crawler_version | - | 1.37 µs | - | - |
| crawler_url | - | 0.29 µs | - | - |
Cold-start
| Module | Cold-start |
|---|---|
| is_crawler | 1.29 ms |
| crawleruseragents | 0.80 ms |
DB patterns compile lazily per 48-entry chunk on first match.
Formatting
pip install black isort
isort . && black .
npx prtfm
Contributing
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md to get started.
Security
Report vulnerabilities via a GitHub private security advisory; do not open a public issue. See SECURITY.md.
Code of Conduct
See CODE_OF_CONDUCT.md.
License