doorknock
Detect a website's anti-bot defenses and rate how hard it is to scrape — from the command line or from Python.
doorknock performs a single set of HTTP probes against a target URL and runs a battery of detectors that look for:
- WAFs / CDNs — Cloudflare, Akamai, Sucuri, Imperva (Incapsula), F5 BIG-IP, AWS WAF, Fastly, Azure Front Door, StackPath, Wallarm, Reblaze, Barracuda, …
- Bot management — DataDome, PerimeterX / HUMAN, Kasada, Shape Security (F5), Imperva ABP (Distil), Arkose Labs, Reblaze, Radware, Netacea, …
- CAPTCHAs — Google reCAPTCHA / reCAPTCHA Enterprise, hCaptcha, Cloudflare Turnstile, Arkose FunCaptcha, GeeTest, DataDome captcha, custom image captchas, …
- JavaScript challenges — Cloudflare "Just a moment", Incapsula challenges, Akamai sensor cookies, Kasada KPSDK, PerimeterX interstitials, "checking your browser" pages, SPA / client-only rendering
- Rate-limit signals — `RateLimit-*`, `X-RateLimit-*`, `Retry-After`, hostile statuses (403/406/429/503)
- User-Agent filtering — compares a no-UA probe against a browser-UA probe
- Cookie/session requirements — large initial cookie sets, CSRF/XSRF tokens, `__Host-` / `__Secure-` cookies
- TLS / HTTP version — HTTPS, HTTP/2 (relevant to JA3/JA4 fingerprinting)
- robots.txt rules — `Disallow: /` for everyone, scraper-targeted user agents
- Client-side fingerprinting — FingerprintJS, Castle, Sift, Forter, ThreatMetrix, iovation, custom canvas/WebGL/audio probes
Findings are weighted by severity and aggregated into a single scraping difficulty rating from EASY to EXTREME along with a 0–100 score.
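To give a feel for what a detector of this kind might look at, here is a minimal sketch of two of the checks listed above: rate-limit signals and a blanket robots.txt `Disallow`. It is not doorknock's own code. The function names `rate_limit_signals` and `robots_disallows_everything` are hypothetical, and only the standard library is used.

```python
from urllib.robotparser import RobotFileParser

# Statuses the list above calls "hostile".
HOSTILE_STATUSES = {403, 406, 429, 503}


def rate_limit_signals(status, headers):
    """Return short descriptions of rate-limit hints in one response.

    `headers` is any mapping of header name -> value (hypothetical input
    shape; doorknock's internal representation may differ).
    """
    signals = []
    if status in HOSTILE_STATUSES:
        signals.append(f"hostile status {status}")
    for name in headers:
        lowered = name.lower()
        # Match RateLimit-*, X-RateLimit-* and Retry-After, case-insensitively.
        if lowered.startswith(("ratelimit-", "x-ratelimit-")) or lowered == "retry-after":
            signals.append(f"header {name}")
    return signals


def robots_disallows_everything(robots_txt, user_agent="*"):
    """Return True if the robots.txt text forbids `user_agent` from fetching '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")
```

Both helpers work on data that has already been fetched, so they can be tried offline, for example `rate_limit_signals(429, {"Retry-After": "30"})`.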
Install
pip install doorknock
For colored, prettier CLI output you can also install the cli extras (uses Rich):
pip install "doorknock[cli]"
Requires Python 3.8+.
CLI
doorknock https://example.com
======================================================================
Target: https://example.com
Final URL: https://example.com/
Status: 200
Difficulty: EASY (score 0/100)
======================================================================
Looks easy to scrape. No meaningful anti-bot defenses were detected.
A plain requests script with a polite User-Agent should work.
Useful flags:
| Flag | Purpose |
|---|---|
| `--json` | Emit machine-readable JSON. |
| `--timeout 10` | HTTP timeout in seconds. |
| `--no-verify` | Disable TLS verification. |
| `--no-robots` | Skip the robots.txt fetch. |
| `--no-color` | Disable ANSI colors in human output. |
| `--user-agent "..."` | Override the User-Agent for the main probe. |
| `--exit-code` | Exit non-zero when difficulty is HARD or worse (useful in CI). |
You can also run it without installing the entry point:
python -m doorknock https://example.com --json
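If the CI job is itself driven from Python, the `--exit-code` flag can be wrapped in a small gate. The sketch below only uses the flags documented above. The helper names `gate_command` and `passes_gate` are hypothetical, and it assumes a non-zero exit means "HARD or worse", although a crash or network failure may well produce a non-zero status too.

```python
import subprocess
import sys


def gate_command(url, timeout=10):
    """Build the argv for a doorknock run that signals difficulty via exit status."""
    return [
        sys.executable, "-m", "doorknock", url,
        "--timeout", str(timeout),
        "--exit-code",
    ]


def passes_gate(url, timeout=10):
    """Run doorknock and return True when it exits with status 0.

    Assumption: status 0 means the difficulty is below HARD. Other failures
    are not distinguished here and also count as "did not pass".
    """
    completed = subprocess.run(gate_command(url, timeout), check=False)
    return completed.returncode == 0
```

Calling `passes_gate("https://example.com")` requires doorknock to be installed in the same interpreter and performs real HTTP requests.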
Library usage
from doorknock import scan
result = scan("https://example.com")
print(result.difficulty) # Difficulty.EASY
print(result.score) # 0..100
print(result.summary)
for f in result.findings:
    print(f.severity.value, f.category.value, f.name)
The same data is available as a plain dict / JSON:
import json
print(json.dumps(result.to_dict(), indent=2))
# or
print(result.to_json())
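A common follow-up is to keep only the findings that matter. The sketch below filters by severity using the `f.severity.value` attribute shown above. It assumes the values are the lowercase strings `info`, `low`, `medium`, `high`, `critical` (the names used in the scoring section), which is not confirmed by the documented API. `findings_at_least` is a hypothetical helper.

```python
# Assumed ordering and spelling of severity values, weakest to strongest.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def findings_at_least(findings, minimum="high"):
    """Return the findings whose severity is `minimum` or stronger.

    Works on any objects exposing `.severity.value`, such as the items in
    `result.findings`.
    """
    threshold = SEVERITY_ORDER.index(minimum)
    return [
        f for f in findings
        if SEVERITY_ORDER.index(f.severity.value) >= threshold
    ]
```

With a real scan this would be called as `findings_at_least(result.findings, "medium")`.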
For more control, use the class directly:
from doorknock import AntiBotScanner
scanner = AntiBotScanner(
    timeout=10,
    user_agent="my-bot/1.0",
    extra_headers={"Accept-Language": "en-GB,en;q=0.9"},
    check_robots=True,
    probe_no_user_agent=True,
)
result = scanner.scan("https://example.com")
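One reason to hold on to a scanner object is to reuse the same settings across several targets. The following is a minimal sketch of that pattern. `scan_many` is a hypothetical helper, and because the exceptions `scan()` may raise are not documented here, it catches `Exception` broadly and records the error instead of stopping.

```python
def scan_many(scanner, urls):
    """Scan each URL with one shared scanner.

    Returns a dict mapping URL -> (difficulty, score), or URL -> the
    exception that was raised for that target.
    """
    report = {}
    for url in urls:
        try:
            result = scanner.scan(url)
        except Exception as exc:  # assumption: specific exception types are unknown
            report[url] = exc
        else:
            report[url] = (result.difficulty, result.score)
    return report
```

Any object with a `scan(url)` method returning something with `.difficulty` and `.score` will work, so the helper can be exercised with a stub before pointing it at real sites.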
Difficulty buckets
| Score | Difficulty | What it means |
|---|---|---|
| 0–14 | EASY | Nothing meaningful in the way. Plain requests works. |
| 15–34 | MODERATE | Light defenses: UA filtering, rate limits, generic CDN. Use a session and realistic headers. |
| 35–59 | HARD | Real WAF, CAPTCHA, or JS challenges. Plan for a real browser or proxies. |
| 60–84 | VERY_HARD | Layered defenses (bot management + CAPTCHA / JS challenge). Undetected browser + residential proxies likely. |
| 85–100 | EXTREME | Top-tier bot management (DataDome, PerimeterX, Kasada, Shape, …). Expect a serious engineering project or commercial unblockers. |
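The table can be read as a simple threshold lookup. The sketch below reproduces the score ranges from the table. The function name `bucket_for` is hypothetical, and it returns plain strings instead of the library's `Difficulty` values.

```python
# Lower bound of each bucket, checked from the hardest bucket down.
BUCKETS = [
    (85, "EXTREME"),
    (60, "VERY_HARD"),
    (35, "HARD"),
    (15, "MODERATE"),
    (0, "EASY"),
]


def bucket_for(score):
    """Map a 0..100 score to the difficulty label from the table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for floor, label in BUCKETS:
        if score >= floor:
            return label
```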
How the scoring works
Every finding has a severity (info, low, medium, high, critical) which maps to a weight. Weights are summed per category and capped so one chatty detector cannot dominate the result. A small "synergy bonus" is added when multiple serious categories show up together (e.g. bot management + CAPTCHA + fingerprinting), because layered defenses are meaningfully harder than a single layer.
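The description above can be sketched roughly as follows. All numbers here (the severity weights, the per-category cap, the size of the synergy bonus, and the rule that "serious" means `high` or `critical` in at least two categories) are illustrative assumptions, not doorknock's actual values.

```python
# Illustrative values only; the real weights and caps are not documented here.
SEVERITY_WEIGHTS = {"info": 0, "low": 3, "medium": 8, "high": 15, "critical": 25}
CATEGORY_CAP = 30
SYNERGY_BONUS = 10
SERIOUS = {"high", "critical"}


def score_findings(findings):
    """Score an iterable of (category, severity) pairs on a 0..100 scale."""
    per_category = {}
    serious_categories = set()
    for category, severity in findings:
        # Sum weights per category.
        per_category[category] = per_category.get(category, 0) + SEVERITY_WEIGHTS[severity]
        if severity in SERIOUS:
            serious_categories.add(category)
    # Cap each category so one chatty detector cannot dominate.
    total = sum(min(weight, CATEGORY_CAP) for weight in per_category.values())
    # Layered defenses: bonus when several categories have serious findings.
    if len(serious_categories) >= 2:
        total += SYNERGY_BONUS
    return min(total, 100)
```

Under these assumed numbers, two `critical` findings in the same category contribute only the cap of 30, while one `critical` bot-management finding plus one `high` CAPTCHA finding score 25 + 15 + 10 = 50.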
What it does NOT do
- It does not attempt to bypass anything. It performs read-only HTTP requests (`GET /`, `GET /robots.txt`, plus a no-UA probe).
- It does not execute JavaScript. Findings come purely from headers, cookies, status codes, and the raw HTML body.
- It is a heuristic tool. Some defenses (TLS / JA3 fingerprinting, behavioral biometrics, server-side ML) cannot be observed from a single HTTP request and are inferred from vendor signatures. False negatives are possible — especially for in-house systems.
- Detection is best-effort: real-world sites mix vendors and rebrand things constantly.
Ethics & legality
Use this tool to evaluate sites you have permission to access, to assess your own infrastructure, or for security research. Respect robots.txt, terms of service, and applicable law in your jurisdiction.
License
MIT