is-crawler
Fast, regex-free crawler detection from user agents. Zero deps, ReDoS-safe heuristics, ~40× faster than alternatives.
Docs & live demo: is-crawler.tn3w.dev
Why regex-free?
Regex is a frequent source of ReDoS vulnerabilities: a single un-anchored .* or nested quantifier run against a hostile UA can spike CPU time from microseconds to seconds. Crawler detection runs on every request, so one catastrophic pattern is a denial-of-service primitive. is-crawler implements all heuristics with str.find and character scans: no regex engine, no backtracking, no ReDoS surface. crawler_info uses re only to match against curated DB patterns (monperrus/crawler-user-agents), which are simple literals (e.g. Googlebot\/, bingbot, AdsBot-Google([^-]|$), [wW]get) with no nested quantifiers and no catastrophic backtracking paths.
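To illustrate the idea (this is a minimal sketch, not the library's actual code; the keyword list and function name are made up for the example), a keyword scan built on str.find carries no backtracking state, so there is no pathological input class:

```python
# Sketch of a regex-free keyword scan in the spirit of is-crawler's
# heuristics; BOT_KEYWORDS and has_bot_keyword are illustrative names.
BOT_KEYWORDS = ("bot", "crawl", "spider", "scrape", "headless", "slurp")

def has_bot_keyword(ua: str) -> bool:
    """Substring search only: worst case O(n*m), no exponential blow-up."""
    ua = ua.lower()
    return any(ua.find(kw) != -1 for kw in BOT_KEYWORDS)

print(has_bot_keyword("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(has_bot_keyword("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))        # False
```

However adversarial the UA string, the cost stays linear in its length times the keyword count.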
Install
pip install is-crawler
Usage
from is_crawler import (
    is_crawler, crawler_signals, crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, CrawlerInfo,
)

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"

is_crawler(ua)       # True
crawler_signals(ua)  # ['bot_signal', 'no_browser_signature', 'url_in_ua']
crawler_name(ua)     # 'Googlebot'
crawler_version(ua)  # '2.1'
crawler_url(ua)      # 'http://www.google.com/bot.html'

info = crawler_info(ua)  # CrawlerInfo(...)
if info is not None:
    info.url          # 'http://www.google.com/bot.html'
    info.description  # "Google's main web crawling bot..."
    info.tags         # ('search-engine',)

crawler_has_tag(ua, "search-engine")        # True
crawler_has_tag(ua, ["ai-crawler", "seo"])  # False
API
is_crawler(ua: str) -> bool
Heuristic detection. Returns True if the UA is a crawler. No DB lookup, no regex.
Three short-circuit rules:
- Positive signal: bot keywords (bot, crawl, spider, scrape, headless, slurp, archiv, preview, ...), known tools (playwright, selenium, wget, lighthouse, sqlmap, nikto, nmap, httrack, pingdom, google-safety, ...), or a URL/email embedded in the UA.
- No browser signature: missing Mozilla/, WebKit, Gecko, Trident, Presto, KHTML, Links, Lynx, Opera, or an OS token like (Windows, (Linux, (X11, (Macintosh.
- Bare (compatible; ...): classic bot block without OS/browser tokens inside.
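A rough sketch of the second and third rules, with token lists taken from the text above (the library's real checks are more complete, and these function names are illustrative):

```python
BROWSER_TOKENS = ("Mozilla/", "WebKit", "Gecko", "Trident", "Presto",
                  "KHTML", "Links", "Lynx", "Opera")
OS_TOKENS = ("(Windows", "(Linux", "(X11", "(Macintosh")

def no_browser_signature(ua: str) -> bool:
    # Rule 2: none of the browser/OS tokens appear anywhere in the UA.
    return not any(tok in ua for tok in BROWSER_TOKENS + OS_TOKENS)

def bare_compatible(ua: str) -> bool:
    # Rule 3: a "(compatible; ...)" block with no OS token inside it.
    start = ua.find("(compatible;")
    if start == -1:
        return False
    end = ua.find(")", start)
    inner = ua[start:end] if end != -1 else ua[start:]
    return not any(tok.lstrip("(") in inner for tok in OS_TOKENS)

print(no_browser_signature("curl/7.64.1"))                            # True
print(bare_compatible("Mozilla/5.0 (compatible; bingbot/2.0; ...)"))  # True
```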
crawler_signals(ua: str) -> list[str]
Which individual rules fired. Subset of: bot_signal, no_browser_signature, bare_compatible, known_tool, url_in_ua. Useful for diagnostics and logging. is_crawler does not call this.
crawler_name(ua: str) -> str | None
Product name extracted from the UA.
- Googlebot/2.1 ... → 'Googlebot'
- Mozilla/5.0 (compatible; bingbot/2.0; ...) → 'bingbot'
- Mozilla/5.0 ... Speedy Spider (...) → 'Speedy Spider'
- Chrome/Firefox/Safari → None
crawler_version(ua: str) -> str | None
Version token extracted from the UA. Returns None if no non-browser version is detectable.
- curl/7.64.1 → '7.64.1'
- Mozilla/5.0 (compatible; Miniflux/2.0.10; ...) → '2.0.10'
- Googlebot/2.1 ... → '2.1'
crawler_url(ua: str) -> str | None
URL embedded in the UA (after +, ;, or -).
- Googlebot/2.1 (+http://www.google.com/bot.html) → 'http://www.google.com/bot.html'
- UA with no embedded URL → None
crawler_info(ua: str) -> CrawlerInfo | None
DB lookup against 646 known crawler patterns. Returns None for browsers (short-circuits via is_crawler).
class CrawlerInfo(NamedTuple):
url: str # crawler's info/docs URL (may be '')
description: str # human-readable description
tags: tuple[str, ...] # classification tags, e.g. ('search-engine',)
crawler_has_tag(ua: str, tags: str | Iterable[str]) -> bool
True if the crawler has any of the given tags. tags accepts a single string or a list.
Available tags: search-engine, ai-crawler, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.
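The any-of semantics can be sketched as simple membership testing (assumed behavior: a bare string is normalized to a one-element tuple, then any overlap counts):

```python
from typing import Iterable, Union

def has_any_tag(crawler_tags: tuple, tags: Union[str, Iterable[str]]) -> bool:
    # A single string is treated as a one-tag query.
    if isinstance(tags, str):
        tags = (tags,)
    return any(t in crawler_tags for t in tags)

print(has_any_tag(("search-engine",), "search-engine"))        # True
print(has_any_tag(("search-engine",), ["ai-crawler", "seo"]))  # False
```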
Category shortcuts
One-tag wrappers over crawler_has_tag:
is_search_engine(ua) # 'search-engine'
is_ai_crawler(ua) # 'ai-crawler'
is_seo(ua) # 'seo'
is_social_preview(ua) # 'social-preview'
is_advertising(ua) # 'advertising'
is_archiver(ua) # 'archiver'
is_feed_reader(ua) # 'feed-reader'
is_monitoring(ua) # 'monitoring'
is_scanner(ua) # 'scanner'
is_academic(ua) # 'academic'
is_http_library(ua) # 'http-library'
is_browser_automation(ua) # 'browser-automation'
is_good_crawler(ua) / is_bad_crawler(ua)
Opinionated groupings for quick allow/deny gates.
- Good (indexing, previews, archives, feeds, research): search-engine, social-preview, feed-reader, archiver, academic.
- Bad (scraping, scanning, unattributed traffic): ai-crawler, scanner, http-library, browser-automation, seo.
advertising and monitoring are intentionally neither: policy-dependent.
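The grouping reduces to set membership. A sketch with the tag sets copied from the lists above (the function name is illustrative):

```python
GOOD = {"search-engine", "social-preview", "feed-reader", "archiver", "academic"}
BAD = {"ai-crawler", "scanner", "http-library", "browser-automation", "seo"}

def classify(tags: tuple) -> str:
    if GOOD.intersection(tags):
        return "good"
    if BAD.intersection(tags):
        return "bad"
    return "policy-dependent"  # advertising and monitoring land here

print(classify(("search-engine",)))  # good
print(classify(("advertising",)))    # policy-dependent
```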
Middleware
from flask import abort, request  # Flask shown here; any framework works

from is_crawler import is_crawler, crawler_has_tag

@app.before_request
def gate():
    ua = request.headers.get("User-Agent", "")
    if crawler_has_tag(ua, "ai-crawler"):
        abort(403)
    if is_crawler(ua):
        log_crawler(ua)
robots.txt helpers
Generate directives from DB tags. Names extracted from DB patterns (slash/URL-only entries skipped).
from is_crawler import build_robots_txt, robots_agents_for_tags, iter_crawlers
robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', 'Claude-Web', 'GPTBot', ...]
print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
#
# User-agent: Nikto
# Disallow: /
# ...
build_robots_txt(allow="search-engine", path="/public")
# User-agent: Googlebot
# Allow: /public
# ...
for info, name in iter_crawlers():  # (CrawlerInfo, robots-name) per DB entry
    ...
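The output format above suggests a straightforward assembly step. A hedged sketch of joining per-agent blocks (agent names here are examples; the real helper reads them from the DB):

```python
def robots_block(agents: list, directive: str = "Disallow", path: str = "/") -> str:
    # One "User-agent" stanza per crawler name, blank-line separated.
    blocks = [f"User-agent: {agent}\n{directive}: {path}" for agent in agents]
    return "\n\n".join(blocks)

print(robots_block(["GPTBot", "CCBot"]))
```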
CLI
python -m is_crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"
tail -f access.log | awk -F'"' '{print $6}' | python -m is_crawler
One JSON object per UA (arg or stdin line) with is_crawler, name, version, url, signals, info.
Caching
Every public function has a 32k-entry LRU cache. Repeat UAs hit in ~40 ns.
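The cache size and repeat-hit figure come from the sentence above; the standard-library pattern such a cache likely uses looks like this (a sketch, not the library's code):

```python
from functools import lru_cache

@lru_cache(maxsize=32768)  # 32k entries, as the README states
def detect(ua: str) -> bool:
    # Placeholder heuristic standing in for the real rules.
    return "bot" in ua.lower()

detect("Googlebot/2.1")  # cold call: computes and stores
detect("Googlebot/2.1")  # warm call: served from the LRU cache
print(detect.cache_info().hits)  # 1
```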
Benchmarks
Python 3.14, Linux x86_64. Corpus: 1,231 crawler UAs, 15,812 browser UAs. cua = crawler-user-agents v1.44.
Hot-path (warm cache)
| Function | is_crawler | cua | speedup |
|---|---|---|---|
| is_crawler (mixed) | 0.05 µs | 158.9 µs | 3000× |
| crawler_info | 0.60 µs | 732.0 µs | 1220× |
| crawler_signals | 1.13 µs | - | - |
| crawler_name | 0.33 µs | - | - |
| crawler_version | 0.32 µs | - | - |
| crawler_url | 0.09 µs | - | - |
| crawler_has_tag | 0.10 µs | - | - |
Cold-cache (per-call, no LRU hits)
| Function | Test case | is_crawler | cua | speedup |
|---|---|---|---|---|
| is_crawler | crawlers | 1.94 µs | 64.35 µs | 33× |
| is_crawler | browsers | 1.85 µs | 183.76 µs | 99× |
| is_crawler | mixed | 1.85 µs | 176.94 µs | 96× |
| crawler_info | - | 2.07 µs | 733.4 µs | 354× |
| crawler_name | - | 1.36 µs | - | - |
| crawler_version | - | 1.37 µs | - | - |
| crawler_url | - | 0.29 µs | - | - |
Cold-start
| Module | Cold-start |
|---|---|
| is_crawler | 1.29 ms |
| crawleruseragents | 0.80 ms |
DB patterns compile lazily per 48-entry chunk on first match.
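A sketch of lazy per-chunk compilation, with the chunk size taken from the sentence above (the class name and the alternation-joining strategy are assumptions, not the library's actual design):

```python
import re

class LazyChunkDB:
    """Compile DB patterns in fixed-size chunks, only when first needed."""

    def __init__(self, patterns, chunk_size=48):
        self._patterns = patterns
        self._chunk_size = chunk_size
        self._compiled = {}  # chunk start index -> compiled alternation

    def matches(self, ua: str) -> bool:
        for start in range(0, len(self._patterns), self._chunk_size):
            rx = self._compiled.get(start)
            if rx is None:  # compile this chunk on first use
                chunk = self._patterns[start:start + self._chunk_size]
                rx = re.compile("|".join(chunk))
                self._compiled[start] = rx
            if rx.search(ua):
                return True
        return False

db = LazyChunkDB([r"Googlebot\/", r"bingbot", r"[wW]get"])
print(db.matches("Googlebot/2.1"))  # True
```

Chunks that never match anything stay uncompiled, which keeps import and first-call costs low.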
Formatting
pip install black isort
isort . && black .
npx prtfm
License