Pre-scraping intelligence tool — scan a website before scraping it
scrapalyser
Pre-scraping intelligence tool. Scan any website before writing a single line of scraper code.
Install
# Core (curl_cffi engine)
pip install scrapalyser
# With Playwright support
pip install scrapalyser[playwright]
playwright install chromium
Usage
import scrapalyser
# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)
# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",            # "json" (default) or "txt"
    lang="fr",               # "en" (default), "fr", "es", "br"
    save="report.txt",       # optional — write output to file
    engine="curl",           # "curl" (default) or "playwright"
    headless=True,           # True (default) or False — playwright only
    screenshot="shot.png",   # optional — playwright only
)
Example output
JSON
{
  "url": "https://example.com",
  "scanned_at": "2026-05-19T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
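Because the report is a plain dict, the decision it supports can be scripted. The following is a minimal sketch that assumes the keys shown in the JSON example above. `pick_approach` is a hypothetical helper, not part of scrapalyser, and the priority order (login wall, then API, then JS) is only one reasonable choice.

```python
def pick_approach(report: dict) -> str:
    """Suggest a scraping approach from a scrapalyser-style report dict."""
    login = report.get("login_wall") or {}
    api = report.get("api") or {}
    # A login wall changes everything else, so check it first (assumed priority).
    if isinstance(login, dict) and login.get("detected"):
        return "authenticated session"
    # A discovered API endpoint is usually simpler to consume than rendered HTML.
    if isinstance(api, dict) and api.get("detected") and api.get("endpoints"):
        return "call the API directly"
    if report.get("js_required") is True:
        return "browser automation"
    return "plain HTTP client"


# Trimmed copy of the JSON example above.
sample = {
    "js_required": True,
    "api": {"detected": True, "endpoints": ["https://api.example.com/v1/search"]},
    "login_wall": {"detected": False, "type": None},
}
print(pick_approach(sample))
```

The `isinstance` checks are there because a blocked scan may replace these sub-dicts with a plain string.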
TXT (French)
═══════════════════════════════════════════
SCRAPALYSER RAPPORT
https://example.com
Scanné le : 2026-05-19T12:32:00Z
Status : 200
═══════════════════════════════════════════
[ANTI-BOT]
⚠️ Détecté : Cloudflare Turnstile
[TECHNOLOGIE]
🖥️ Type : React
[JAVASCRIPT]
⚠️ JS requis : Oui → utiliser Playwright/Selenium
[API DÉTECTÉES]
✅ https://api.example.com/v1/search
[ROBOTS.TXT]
✅ https://example.com/robots.txt
[SITEMAP]
✅ https://example.com/sitemap.xml
[LOGIN WALL]
✅ Aucun login requis
═══════════════════════════════════════════
Features
- 🛡️ Anti-bot detection — Cloudflare, DataDome, PerimeterX, Akamai, Kasada, Imperva, Sucuri, reCAPTCHA, hCaptcha + Unknown/Custom antibot detection
- 🖥️ Technology detection — React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
- ⚡ JS requirement detection — know before you write whether requests is enough or Playwright is needed
- 🌐 API endpoint discovery — via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
- 🤖 robots.txt parser — URL extraction with common path variants (/robot.txt, /robots.txt)
- 🗺️ Sitemap detection — via robots.txt or direct probe (/sitemap.xml, /sitemaps.xml...)
- 🔐 Login wall detection — form, redirect, button, OAuth
- 📸 Screenshot — capture the page as seen by the browser (Playwright mode)
- 🚫 Blocked by antibot — if the site blocks you, all other fields return "blocked by antibot" instantly
- 🌍 Multi-language output — fr, en, es, br
- 📄 JSON & TXT export — machine-readable or human-readable
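To illustrate the sitemap feature, here is a rough sketch of "robots.txt first, direct probe second" using only the standard library. It is not scrapalyser's implementation: `sitemap_candidates` and `FALLBACK_PATHS` are hypothetical names, and the fallback list holds only the two paths named above.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Only the two probe paths named in the feature list; the real list may be longer.
FALLBACK_PATHS = ("/sitemap.xml", "/sitemaps.xml")


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, else the paths worth probing."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps()  # None when there is no "Sitemap:" line
    if declared:
        return list(declared)
    return [urljoin(base_url, path) for path in FALLBACK_PATHS]
```

The function takes the robots.txt body as text, so fetching it (and probing the returned candidates) is left to whichever HTTP client you already use.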
Engines
| Feature | curl | playwright |
|---|---|---|
| Speed | ⚡ Fast | 🐢 Slower |
| JS execution | ❌ | ✅ |
| XHR/Fetch API detection | ❌ | ✅ |
| Screenshot | ❌ | ✅ |
| Bot bypass | Partial | Better with headless=False |
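One way to use the trade-off in the table is to start with the fast engine and switch to the browser only when the first report says JavaScript is required. The sketch below assumes the `url`, `engine`, and `headless` parameters from the Usage section and the `js_required` key from the JSON example. `scan_with_fallback` is a hypothetical wrapper, and the scan function is passed in so the logic can be tried without network access.

```python
from typing import Callable


def scan_with_fallback(url: str, scan: Callable[..., dict]) -> dict:
    """Scan with the curl engine first; rescan with Playwright if JS is required."""
    report = scan(url=url, engine="curl")
    if report.get("js_required") is True:
        # The slower engine executes JS and can intercept XHR/Fetch calls.
        report = scan(url=url, engine="playwright", headless=True)
    return report
```

With the package and its Playwright extra installed, the call would look like `scan_with_fallback("https://example.com", scrapalyser.scan)`.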
Anti-bot detection
When the site blocks you (403, captcha page), scrapalyser reports which antibot is responsible
and marks all other fields as "blocked by antibot" — so you know exactly what you're up
against before writing anything.
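A caller can check for this marker before trusting the rest of a report. The sketch below assumes the marker appears as the value of top-level keys, which the text above does not spell out. `blocked_fields` is a hypothetical helper.

```python
BLOCKED = "blocked by antibot"  # marker string quoted in the text above


def blocked_fields(report: dict) -> list[str]:
    """Return the top-level keys whose value is the blocked marker."""
    return [key for key, value in report.items() if value == BLOCKED]
```

An empty list means no field carries the marker. A non-empty list is the signal to read only the antibot entry and ignore the rest.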
Supported:
| Solution | Detection method |
|---|---|
| Cloudflare / Turnstile | cf-ray header, __cf_bm cookie |
| DataDome | x-datadome header, datadome cookie |
| PerimeterX | _px cookie, script patterns |
| Akamai | ak_bmsc / bm_sz cookies |
| Kasada | kkrta / kasada.io scripts |
| Imperva | x-iinfo header, incap_ses cookie |
| Sucuri | x-sucuri-id header |
| reCAPTCHA | script pattern |
| hCaptcha | script pattern |
| Unknown / Custom | suspicious headers, cookies, obfuscated scripts |
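The header and cookie rows of the table can be read as a lookup problem. The sketch below shows that idea only and should not be taken as scrapalyser's detection code. `guess_antibot` and both signature dicts are hypothetical, cookie names are matched by prefix as an assumption, and the script-pattern rows (Kasada, reCAPTCHA, hCaptcha) are left out.

```python
from typing import Iterable, Mapping, Optional

# Signatures copied from the table above; the names are illustrative.
HEADER_SIGNATURES = {
    "cf-ray": "Cloudflare",
    "x-datadome": "DataDome",
    "x-iinfo": "Imperva",
    "x-sucuri-id": "Sucuri",
}
COOKIE_SIGNATURES = {
    "__cf_bm": "Cloudflare",
    "datadome": "DataDome",
    "_px": "PerimeterX",
    "ak_bmsc": "Akamai",
    "bm_sz": "Akamai",
    "incap_ses": "Imperva",
}


def guess_antibot(headers: Mapping[str, str], cookie_names: Iterable[str]) -> Optional[str]:
    """Return the first vendor whose header or cookie signature matches, else None."""
    lowered = {name.lower() for name in headers}  # header names are case-insensitive
    for header, vendor in HEADER_SIGNATURES.items():
        if header in lowered:
            return vendor
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_SIGNATURES.items():
            # Prefix match (assumption): real cookies often carry suffixes.
            if cookie.lower().startswith(prefix):
                return vendor
    return None
```

A match only shows that the vendor's product sits in front of the site. A header such as cf-ray also appears on responses that were not blocked, so the status code and page content still need to be checked.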
Contributing
- Fork the repo
- Create a feature branch (git checkout -b feat/my-feature)
- Commit your changes
- Open a pull request
License
MIT — see LICENSE.