
Project description

is-crawler

Crawler detection from User-Agent strings in 50 ns. Zero deps, no regex, ReDoS-safe.


pip install is-crawler
from is_crawler import is_crawler

is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)")  # True
is_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0")    # False

One call, runs on every request without blinking.

\(°o°)/   caught one!
 /| |\

Why

Crawler detection sits on the request hot path. Most libraries reach for big regex tables, which means slow first hits, ReDoS exposure on hostile UAs, and millisecond-scale latency you pay forever.

is_crawler runs str.find and small char scans against curated keywords. No backtracking, no DB load, no network. The optional crawler_info adds DB lookups when you want classification. Everything else (FCrDNS, IP ranges, robots.txt, middleware) is opt-in.
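The substring-scan idea can be sketched in a few lines of plain Python. This is a toy illustration with a made-up keyword list, not the library's actual table:

```python
# Toy sketch of regex-free crawler detection: plain substring scans
# over a lowercased UA string. The real keyword table is far larger.
KEYWORDS = ("bot", "crawl", "spider", "curl", "python-requests")

def looks_like_crawler(ua: str) -> bool:
    """Return True if any crawler keyword appears in the UA string."""
    ua = ua.lower()
    # `in` (str.find under the hood) never backtracks, so a hostile
    # UA costs at most O(len(ua)) per keyword.
    return any(kw in ua for kw in KEYWORDS)

print(looks_like_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))    # False
```

Because every check is a plain scan, worst-case cost is linear in the UA length regardless of input, which is the property that makes the hot path ReDoS-safe.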

is-crawler  ▏                                                  0.04 µs
cua         ████████████████████████████████████████████████  64.00 µs
                   is-crawler  crawler-user-agents  ua-parser
Hot-path regex     no          yes                  yes
ReDoS-safe         yes         no                   no
FCrDNS verify      yes         no                   no
IP range lookup    yes         no                   no
WSGI/ASGI MW       yes         no                   no
Warm is_crawler    0.04 µs     64 µs                n/a

In the wild

What the API returns on real UAs you will actually see:

User agent                                                                  is_crawler  crawler_name         tag
Mozilla/5.0 ... Chrome/120.0.0.0 Safari/537.36                              False       None                 -
Googlebot/2.1 (+http://www.google.com/bot.html)                             True        Googlebot            search-engine
Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)            True        GPTBot               ai-crawler
Mozilla/5.0 ... HeadlessChrome/120.0.0.0 Safari/537.36                      True        HeadlessChrome       browser-automation
curl/8.4.0                                                                  True        curl                 http-library
python-requests/2.31.0                                                      True        python-requests      http-library
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)          True        AhrefsBot            seo
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)   True        facebookexternalhit  social-preview
Mozilla/5.0 (compatible; Nikto/2.5.0)                                       True        Nikto                scanner
Mozilla/5.0 ... Safari/605.1.15 (no UA marker, valid Safari)                False       None                 -

Detection

from is_crawler import (
    is_crawler, crawler_signals, crawler_info, crawler_has_tag,
    crawler_name, crawler_version, crawler_url, crawler_contact,
)

ua = "Googlebot/2.1 (+http://www.google.com/bot.html)"

is_crawler(ua)         # True
crawler_name(ua)       # 'Googlebot'
crawler_version(ua)    # '2.1'
crawler_url(ua)        # 'http://www.google.com/bot.html'
crawler_signals(ua)    # ['bot_signal', 'no_browser_signature', 'url_in_ua']

ua2 = "MyBot/1.0 (contact: bot@example.com)"
crawler_contact(ua2)   # 'bot@example.com'
crawler_contact(ua)    # None

is_crawler short-circuits on three rules: a positive bot signal (keywords like bot/crawl/spider, known tools, an embedded URL or email), a missing browser signature (no Mozilla/, WebKit, OS token, etc.), or a bare (compatible; ...) block.

crawler_signals exposes which rules fired, for logging and diagnostics.
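The three rules can be sketched as follows. The keyword lists and evaluation order here are simplified stand-ins, not the library's internals:

```python
# Toy sketch of the three detection rules; keyword lists are tiny
# illustrative subsets of what a real implementation would carry.
BOT_WORDS = ("bot", "crawl", "spider")
BROWSER_TOKENS = ("mozilla/", "applewebkit", "gecko", "windows", "linux", "mac os")

def signals(ua: str) -> list:
    """Return which rules fired, in the spirit of crawler_signals."""
    low = ua.lower()
    fired = []
    # Rule 1: positive bot signal (keyword, embedded URL, or email).
    if any(w in low for w in BOT_WORDS) or "+http" in low or "@" in low:
        fired.append("bot_signal")
    # Rule 2: a URL embedded in the UA is itself suspicious.
    if "://" in low:
        fired.append("url_in_ua")
    # Rule 3: nothing that looks like a real browser signature.
    if not any(t in low for t in BROWSER_TOKENS):
        fired.append("no_browser_signature")
    return fired
```

A real browser UA trips none of these, while a bot UA typically trips two or three at once, so the full check short-circuits quickly in both directions.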

Classification

crawler_info matches against 1200 curated patterns from monperrus/crawler-user-agents plus extras. Patterns compile lazily in 48-entry chunks.
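Lazy chunked compilation might look roughly like this. The pattern list and chunk size are made up for illustration (the real DB compiles in 48-entry chunks):

```python
import re
from functools import lru_cache

# Stand-in pattern list; the real DB holds ~1200 curated patterns.
PATTERNS = ["Googlebot", "GPTBot", "AhrefsBot", "bingbot"]
CHUNK = 2  # illustrative; the library uses 48-entry chunks

@lru_cache(maxsize=None)
def _compiled_chunk(i: int):
    # One alternation per chunk, compiled only on first use, so a UA
    # matched by an early chunk never pays for compiling later ones.
    return re.compile("|".join(PATTERNS[i:i + CHUNK]))

def match(ua: str):
    """Return the first matching pattern text, or None."""
    for i in range(0, len(PATTERNS), CHUNK):
        m = _compiled_chunk(i).search(ua)
        if m:
            return m.group(0)
    return None

print(match("Googlebot/2.1"))  # Googlebot
```

This is why warm and cold timings differ in the benchmarks below: the first lookup pays compilation for the chunks it touches, later lookups hit the cache.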

info = crawler_info(ua)
info.url            # 'http://www.google.com/bot.html'
info.description    # "Google's main web crawling bot..."
info.tags           # ('search-engine',)

crawler_has_tag(ua, "search-engine")        # True
crawler_has_tag(ua, ["ai-crawler", "seo"])  # False

Tags: search-engine, ai-crawler, seo, social-preview, advertising, archiver, feed-reader, monitoring, scanner, academic, http-library, browser-automation.

One-tag wrappers exist for each: is_search_engine, is_ai_crawler, is_seo, is_social_preview, is_advertising, is_archiver, is_feed_reader, is_monitoring, is_scanner, is_academic, is_http_library, is_browser_automation.

Quick gates:

is_good_crawler(ua)   # search-engine, social-preview, feed-reader, archiver, academic
is_bad_crawler(ua)    # ai-crawler, scanner, http-library, browser-automation, seo

advertising and monitoring are policy-dependent and belong to neither group.

IP verification

Two strategies; use either or both. Built on the stdlib socket module only, no third-party deps.

from is_crawler.ip import (
    verify_crawler_ip, reverse_dns, forward_confirmed_rdns,
    ip_in_range, known_crawler_ip, known_crawler_rdns,
)

verify_crawler_ip("Googlebot/2.1", "66.249.66.1")  # True (FCrDNS, UA-name matched)
verify_crawler_ip("Googlebot/2.1", "8.8.8.8")      # False (spoof)

ip_in_range("66.249.66.1")        # True (CIDR lookup, offline)
known_crawler_rdns("66.249.66.1") # True (rDNS suffix matches any known crawler)

reverse_dns("8.8.8.8")                                      # 'dns.google'
forward_confirmed_rdns("66.249.66.1", (".googlebot.com",))  # hostname or None

verify_crawler_ip does the full FCrDNS dance: rDNS lookup, suffix check against the UA's vendor, forward lookup, IP match. Catches UA spoofing.
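The FCrDNS sequence can be sketched with the stdlib socket module. The function name and signature here are hypothetical, and the lookup functions are injectable so the sketch can be exercised offline:

```python
import socket

def fcrdns(ip, allowed_suffixes, rdns=None, fdns=None):
    """Forward-confirmed rDNS sketch: reverse lookup, suffix check,
    forward lookup, IP match. Returns the verified hostname or None.

    rdns/fdns default to real socket lookups but can be swapped for
    fakes, which keeps this testable without network access.
    """
    rdns = rdns or (lambda i: socket.gethostbyaddr(i)[0])
    fdns = fdns or (lambda h: socket.gethostbyname_ex(h)[2])
    try:
        host = rdns(ip)                          # step 1: rDNS lookup
    except OSError:
        return None
    if not host.endswith(tuple(allowed_suffixes)):
        return None                              # step 2: not a vendor-owned name
    try:
        if ip in fdns(host):                     # steps 3-4: forward lookup, IP match
            return host
    except OSError:
        pass
    return None
```

The forward confirmation is the step that defeats spoofing: an attacker can set any rDNS name on their own IP block, but they cannot make the vendor's forward DNS resolve that name back to their IP.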

ip_in_range runs a bisect over collapsed CIDRs from 39 official sources (Google, Bing, OpenAI, Anthropic, Cloudflare, AWS, ...). Cheap and offline.
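The bisect-over-collapsed-CIDRs idea, sketched with a tiny made-up range list (the real data covers 39 sources):

```python
import bisect
import ipaddress

# Stand-in CIDR list; the real file holds collapsed ranges from
# Google, Bing, OpenAI, Anthropic, Cloudflare, AWS, and others.
CIDRS = ["66.249.64.0/19", "40.77.167.0/24"]

# Precompute sorted (start, end) integer pairs once at import time.
_ranges = sorted(
    (int(net.network_address), int(net.broadcast_address))
    for net in map(ipaddress.ip_network, CIDRS)
)
_starts = [start for start, _ in _ranges]

def ip_in_range(ip: str) -> bool:
    """Binary-search the range whose start precedes ip, then bound-check."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(_starts, n) - 1
    return i >= 0 and n <= _ranges[i][1]

print(ip_in_range("66.249.66.1"))  # True
print(ip_in_range("8.8.8.8"))      # False
```

Because the ranges are collapsed and non-overlapping, one bisect plus one comparison answers the lookup, which is why it stays well under a microsecond.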

Middleware

Drop-in for any WSGI or ASGI app. Zero deps.

from is_crawler.contrib import WSGICrawlerMiddleware, ASGICrawlerMiddleware

app = WSGICrawlerMiddleware(app)                                  # Flask, Django
app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler")  # FastAPI, Starlette

# Flask:    request.environ["is_crawler"].is_crawler
# Django:   request.META["is_crawler"].name
# FastAPI:  request.scope["is_crawler"].verified

Both attach a CrawlerMiddlewareResult with user_agent, ip, is_crawler, name, verified, in_ip_range, rdns_match.

Flags: block, block_tags, verify_ip, check_ip_range, check_rdns, trust_forwarded. A positive in_ip_range or rdns_match forces is_crawler=True, which catches UA-less crawlers. With trust_forwarded=True, IP comes from X-Forwarded-For then X-Real-IP then the direct client.
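A stripped-down WSGI middleware in the same spirit. The detector, environ key, and result shape here are simplified stand-ins for the real WSGICrawlerMiddleware and CrawlerMiddlewareResult:

```python
# Toy WSGI middleware: detect from the UA header, stash the verdict
# in environ, and optionally short-circuit with a 403.

def simple_is_crawler(ua):
    """Stand-in detector; the real library's checks are far richer."""
    return any(k in ua.lower() for k in ("bot", "crawl", "spider"))

class CrawlerMiddleware:
    def __init__(self, app, block=False):
        self.app = app
        self.block = block

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        environ["is_crawler_demo"] = simple_is_crawler(ua)
        if self.block and environ["is_crawler_demo"]:
            # Blocked requests never reach the wrapped app.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)

def hello(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

app = CrawlerMiddleware(hello, block=True)
```

Downstream views read the verdict from environ, the same pattern the Flask/Django snippets above rely on.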

Recipes

Block AI scrapers, let search engines through (FastAPI):

from fastapi import FastAPI
from is_crawler.contrib import ASGICrawlerMiddleware

app = FastAPI()
app = ASGICrawlerMiddleware(app, block=True, block_tags="ai-crawler", trust_forwarded=True)

Serve a live robots.txt from the DB (Flask):

from flask import Response
from is_crawler import build_robots_txt

@app.route("/robots.txt")
def robots():
    return Response(build_robots_txt(disallow=["ai-crawler", "scanner"]), mimetype="text/plain")

Verify Googlebot is real before trusting it:

from is_crawler import is_crawler
from is_crawler.ip import verify_crawler_ip

if is_crawler(ua) and not verify_crawler_ip(ua, ip):
    abort(403)  # spoofed

Crawler share of an access log:

awk -F'"' '{print $6}' access.log | python -m is_crawler | \
  jq -r '.is_crawler' | sort | uniq -c

robots.txt / ai.txt

Generate directives from tags. Agent names are extracted from DB patterns; slash- and URL-only entries are skipped.

from is_crawler import build_robots_txt, build_ai_txt, robots_agents_for_tags

print(build_robots_txt(disallow=["ai-crawler", "scanner"]))
# User-agent: GPTBot
# Disallow: /
# ...

print(build_ai_txt())          # disallows all ai-crawler agents by default
# User-Agent: GPTBot
# Disallow: /
# ...

robots_agents_for_tags("ai-crawler")
# ['AI2Bot', 'Applebot-Extended', 'Bytespider', 'CCBot', 'ChatGPT-User', ...]

build_robots_txt also accepts a rules list of (path, tags) pairs for per-path control:

build_robots_txt(rules=[("/api", "scanner"), ("/private", "ai-crawler")])
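A toy version of the tag-to-directives idea (the tag-to-agent table below is a tiny stand-in for the real DB, and build_robots is a hypothetical name):

```python
# Map tags to agent names, then emit one User-agent/Disallow pair per
# agent, in the shape robots.txt expects.
TAG_AGENTS = {
    "ai-crawler": ["GPTBot", "CCBot"],
    "scanner": ["Nikto"],
}

def build_robots(disallow):
    """Emit Disallow-all directives for every agent under the given tags."""
    lines = []
    for tag in disallow:
        for agent in TAG_AGENTS.get(tag, []):
            lines += [f"User-agent: {agent}", "Disallow: /", ""]
    return "\n".join(lines)

print(build_robots(["ai-crawler"]))
```

The real builder draws its agent list from the pattern DB, so the output tracks new crawlers as the DB is updated, with no hand-maintained name list.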

assert_crawler(ua) — like crawler_info but raises ValueError for unknown UAs.

CLI

python -m is_crawler "Googlebot/2.1 (+http://www.google.com/bot.html)"
tail -f access.log | awk -F'"' '{print $6}' | python -m is_crawler
python -m is_crawler --help     # usage
python -m is_crawler --version  # show version

One JSON object per UA with is_crawler, name, version, url, contact, signals, info.

UA Parser

parse(ua) returns a UserAgent with all common fields. Zero deps, no regex, 4096-entry LRU cache.

from is_crawler.parser import parse, parse_or_none

ua = parse("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36")

ua.browser          # 'Chrome'
ua.browser_version  # '134.0.0.0'
ua.browser_major    # '134'
ua.os               # 'Windows'
ua.os_version       # '10/11'
ua.engine           # 'Blink'
ua.engine_version   # '537.36'
ua.device           # 'Desktop'
ua.device_brand     # None
ua.device_model     # None
ua.cpu              # 'x86_64'
ua.is_mobile        # False
ua.is_tablet        # False
ua.is_crawler       # False
ua.languages        # []
ua.rendering        # 'KHTML, like Gecko'
ua.product_token    # 'Mozilla/5.0'
ua.comment          # '(Windows NT 10.0; Win64; x64)'
ua.raw              # original string

ua.to_dict()        # all fields as dict

parse_or_none(value) normalises bytes/None/non-str, returns None for empty input.
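The regex-free parsing style can be illustrated with the first step any UA parser takes: splitting the leading product token from its parenthesised comment. The helper name here is hypothetical:

```python
# Split a UA string into its leading product token and the first
# parenthesised comment, using only scans -- no regex, no backtracking.
def split_ua(ua):
    token, _, rest = ua.partition(" ")
    comment = None
    if rest.startswith("("):
        # Walk the string tracking paren depth; UA comments can nest.
        depth, end = 0, -1
        for i, ch in enumerate(rest):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    end = i
                    break
        if end != -1:
            comment = rest[:end + 1]
    return token, comment

print(split_ua("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))
# ('Mozilla/5.0', '(Windows NT 10.0; Win64; x64)')
```

From there a parser keywords its way through the comment and the trailing tokens to fill in OS, engine, and device fields.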

Benchmarks

Python 3.14, Linux x86_64. cua = crawler-user-agents v1.47.

Apache Logs 42,512 UA entries (8,942 crawlers, 33,570 browsers, 21% ratio):

Scenario     is_crawler  crawler_info  cua.is_crawler  cua.crawler_info
Warm cache   0.046 µs    0.116 µs      66.234 µs       1585.007 µs
Cold cache   0.151 µs    0.987 µs      -               -

~1440× faster on the hot path, ~13700× faster for crawler_info warm. Full classify of 42,512 Apache log UAs runs in 2.15 ms.

Fixture UAs 2,149 crawlers + 19,910 browsers:

Scenario     is_crawler (mixed)  crawler_info  cua.is_crawler (mixed)  cua.crawler_info
Warm cache   0.04 µs             1.33 µs       80.95 µs                563.53 µs
Cold cache   2.07 µs             4.85 µs       82.00 µs                581.76 µs

UA parser 19,910 real browser UAs vs ua-parser (~20× faster):

Scenario     parser.parse  ua-parser
Warm cache   21.45 µs      443.20 µs
Cold cache   21.20 µs      443.05 µs

IP verification warm cache:

Function                Time
ip_in_range             0.06 µs
reverse_dns             0.48 µs
verify_crawler_ip       3.23 µs
forward_confirmed_rdns  3.69 µs
known_crawler_rdns      4.27 µs

Every public function has a 32k-entry LRU cache. First-call rDNS latency is network-bound.

Implementation

is_crawler uses str.find and char scans, never regex, so hostile UAs cannot trigger backtracking. crawler_info does use re, but only against curated upstream patterns that are simple by construction.

Data files are built by scripts in tools/:

python3 tools/build_user_agents.py   # crawler-user-agents.json from monperrus/crawler-user-agents
python3 tools/build_ip_ranges.py     # crawler-ip-ranges.json from 39 official sources

Source definitions for IP ranges live in tools/crawler-ip-ranges.json and can be extended without touching the build script.

Development

pip install -e ".[dev]"
ruff format . && ruff check --fix .
npx --yes prettier --write --single-quote --print-width=100 --trailing-comma=es5 --end-of-line=lf "**/*.{md,yml,yaml,html,css,js,ts}" "tools/*.json"

See CONTRIBUTING.md. Report vulnerabilities via GitHub private security advisory, not public issues. See SECURITY.md and CODE_OF_CONDUCT.md.

License

Apache-2.0

Project details



Download files

Download the file for your platform.

Source Distribution

is_crawler-1.5.9.1.tar.gz (1.4 MB)

Uploaded Source

Built Distribution


is_crawler-1.5.9.1-py3-none-any.whl (398.4 kB)

Uploaded Python 3

File details

Details for the file is_crawler-1.5.9.1.tar.gz.

File metadata

  • Download URL: is_crawler-1.5.9.1.tar.gz
  • Size: 1.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for is_crawler-1.5.9.1.tar.gz
Algorithm Hash digest
SHA256 0b2d21548845edd1b640558fd5e219e8ef80cfaf2f34611fd10d7bd7f23f4655
MD5 52274f23038b4daad7607547137fad6c
BLAKE2b-256 b13ff365b130dcca8916c8c3b11fee9639676f70dca61973e652f977ea4881cd


Provenance

The following attestation bundles were made for is_crawler-1.5.9.1.tar.gz:

Publisher: publish.yml on tn3w/is-crawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file is_crawler-1.5.9.1-py3-none-any.whl.

File metadata

  • Download URL: is_crawler-1.5.9.1-py3-none-any.whl
  • Size: 398.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for is_crawler-1.5.9.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8a6ad4945c6c707dec336cfc88d53ae116fb817ce4755bb1a4d26538f58d10df
MD5 8b932e7e1c25541bde65c214edbc92f7
BLAKE2b-256 6d64d3c5ff8d7293eeadd86c65aacf7676524bcc38336a5c5ce5922aa0307d5e


Provenance

The following attestation bundles were made for is_crawler-1.5.9.1-py3-none-any.whl:

Publisher: publish.yml on tn3w/is-crawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
