Pre-scraping intelligence tool — scan a website before scraping it
scrapalyser
Pre-scraping intelligence tool. Scan any website before writing a single line of scraper code.
Install
# Core (curl_cffi engine)
pip install scrapalyser
# With Playwright support
pip install scrapalyser[playwright]
playwright install chromium
Usage
import scrapalyser
# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)
# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",            # "json" (default) or "txt"
    lang="fr",               # "en" (default), "fr", "es", "br"
    save="report.txt",       # optional: write output to file
    engine="curl",           # "curl" (default) or "playwright"
    headless=True,           # True (default) or False; playwright only
    screenshot="shot.png",   # optional; playwright only
)
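When several sites need checking, the same call can be wrapped in a small loop. The sketch below is illustrative only: `scan_many` is a hypothetical helper, not part of scrapalyser, and it takes the scan function as an argument so it can be exercised without network access. It assumes the function returns a dict when called with `url=...`, as the simple scan above does. The exception types scrapalyser may raise are not documented here, so the sketch catches `Exception` broadly.

```python
from typing import Callable, Iterable


def scan_many(urls: Iterable[str], scan: Callable[..., dict]) -> dict[str, dict]:
    """Scan each URL and collect the reports keyed by URL.

    `scan` is assumed to behave like scrapalyser.scan: called with
    url=..., it returns a dict report.
    """
    results: dict[str, dict] = {}
    for url in urls:
        try:
            results[url] = scan(url=url)
        except Exception as exc:  # broad on purpose: error types are an unknown here
            # Record the failure instead of aborting the whole batch.
            results[url] = {"url": url, "error": str(exc)}
    return results
```

In real use this would be called as `scan_many(["https://example.com"], scrapalyser.scan)`.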
Example output
JSON
{
  "url": "https://example.com",
  "scanned_at": "2026-04-30T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
TXT (French)
═══════════════════════════════════════════
SCRAPALYSER RAPPORT
https://example.com
Scanné le : 2026-04-30T12:32:00Z
Status : 200
═══════════════════════════════════════════
[ANTI-BOT]
⚠️ Détecté : Cloudflare Turnstile
[TECHNOLOGIE]
🖥️ Type : React
[JAVASCRIPT]
⚠️ JS requis : Oui → utiliser Playwright/Selenium
[API DÉTECTÉES]
✅ https://api.example.com/v1/search
[ROBOTS.TXT]
✅ https://example.com/robots.txt
[SITEMAP]
✅ https://example.com/sitemap.xml
[LOGIN WALL]
✅ Aucun login requis
═══════════════════════════════════════════
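The JSON report above is easy to act on in code. The following is a minimal sketch of one way to turn it into scraping hints. The `recommend` function and its wording are hypothetical, and the field names are taken only from the example output shown here, so other reports may contain keys this sketch does not handle.

```python
def recommend(report: dict) -> list[str]:
    """Derive plain-text scraping hints from a scrapalyser-style report dict."""
    hints: list[str] = []

    # Sections may be missing or null, so fall back to an empty dict.
    antibot = report.get("antibot") or {}
    api = report.get("api") or {}
    login = report.get("login_wall") or {}

    if antibot.get("detected"):
        hints.append(f"anti-bot protection: {antibot.get('name')}")

    endpoints = api.get("endpoints") or []
    if endpoints:
        # A discovered API is usually simpler to query than rendered HTML.
        hints.append("try the API first: " + ", ".join(endpoints))
    elif report.get("js_required"):
        hints.append("render the page with a browser engine")
    else:
        hints.append("plain HTTP requests should be enough")

    if login.get("detected"):
        hints.append(f"login required ({login.get('type')})")
    return hints


# Subset of the example report shown above.
sample = {
    "antibot": {"detected": True, "name": "Cloudflare Turnstile"},
    "js_required": True,
    "api": {"detected": True, "endpoints": ["https://api.example.com/v1/search"]},
    "login_wall": {"detected": False, "type": None},
}

for hint in recommend(sample):
    print(hint)
```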
Features
- 🛡️ Anti-bot detection — Cloudflare, DataDome, PerimeterX, Akamai, Kasada, reCAPTCHA, hCaptcha
- 🖥️ Technology detection — React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
- ⚡ JS requirement detection — know before you write whether `requests` is enough or Playwright is needed
- 🌐 API endpoint discovery — via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
- 🤖 robots.txt parser — URL extraction with common path variants
- 🗺️ Sitemap detection — via robots.txt or direct probe
- 🔐 Login wall detection — form, redirect, button, OAuth
- 📸 Screenshot — capture the page as seen by the browser (Playwright mode)
- 🌍 Multi-language output — fr, en, es, br
- 📄 JSON & TXT export — machine-readable or human-readable
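To illustrate the CSP route to API endpoint discovery listed above: a `Content-Security-Policy` header often names the hosts a page may call in its `connect-src` directive. The sketch below shows the general idea only. It is not scrapalyser's implementation, and `endpoints_from_csp` is a hypothetical name.

```python
def endpoints_from_csp(csp_header: str) -> list[str]:
    """Return the http(s) sources allowed by connect-src (or default-src)."""
    directives: dict[str, list[str]] = {}
    for part in csp_header.split(";"):
        tokens = part.split()
        if tokens:
            # First token is the directive name, the rest are its sources.
            directives[tokens[0].lower()] = tokens[1:]

    # connect-src governs fetch/XHR; default-src is its fallback.
    sources = directives.get("connect-src", directives.get("default-src", []))

    # Drop keywords such as 'self' and non-HTTP schemes such as wss:.
    return [s for s in sources if s.startswith(("http://", "https://"))]


header = (
    "default-src 'self'; "
    "connect-src 'self' https://api.example.com wss://ws.example.com; "
    "img-src *"
)
print(endpoints_from_csp(header))
```

The hosts found this way are only candidates: the header says what the page is allowed to call, not what it actually calls.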
Engines
| Feature | curl | playwright |
|---|---|---|
| Speed | Fast | Slower |
| JS execution | ❌ | ✅ |
| XHR/Fetch API detection | ❌ | ✅ |
| Screenshot | ❌ | ✅ |
| Bot detection bypass | Partial | Better with headless=False |
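Given the trade-offs in the table, one reasonable pattern is to scan with the fast engine first and switch to the browser engine only when the report says JavaScript is required. The sketch below assumes the scan function accepts `url` and `engine` and returns a dict with a `js_required` key, as in the usage and example output in this README. `scan_with_fallback` is a hypothetical helper, and the second scan needs the Playwright extra to be installed.

```python
from typing import Callable


def scan_with_fallback(url: str, scan: Callable[..., dict]) -> dict:
    """Scan with the curl engine first; re-scan with Playwright if JS is required."""
    report = scan(url=url, engine="curl")
    if report.get("js_required"):
        # The curl engine cannot execute JS or intercept XHR/Fetch calls,
        # so repeat the scan with the browser engine.
        report = scan(url=url, engine="playwright")
    return report
```

In real use this would be called as `scan_with_fallback("https://example.com", scrapalyser.scan)`.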
Contributing
- Fork the repo
- Create a feature branch (`git checkout -b feat/my-feature`)
- Commit your changes
- Open a pull request
License
MIT — see LICENSE.