Pre-scraping intelligence tool — scan a website before scraping it

scrapalyser

Pre-scraping intelligence tool. Scan any website before writing a single line of scraper.

Install

# Core (curl_cffi engine)
pip install scrapalyser

# With Playwright support
pip install "scrapalyser[playwright]"
playwright install chromium

Usage

import scrapalyser

# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)

# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",           # "json" (default) or "txt"
    lang="fr",              # "en" (default), "fr", "es", "br"
    save="report.txt",      # optional — write output to file
    engine="curl",          # "curl" (default) or "playwright"
    headless=True,          # True (default) or False — playwright only
    screenshot="shot.png",  # optional — playwright only
)

Example output

JSON

{
  "url": "https://example.com",
  "scanned_at": "2026-05-19T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
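
One way to act on a report like the one above is to map it to a scraping approach. The sketch below is only an illustration: `choose_strategy` is a hypothetical helper, not part of scrapalyser, and it assumes the report is a dict with the same keys as the JSON example. The priority order is a design choice, not something the library prescribes.

```python
def choose_strategy(report: dict) -> str:
    """Suggest a scraping approach from a scrapalyser-style report dict.

    Hypothetical helper; assumes the key layout of the JSON example.
    """
    antibot = report.get("antibot")
    api = report.get("api")
    login = report.get("login_wall")

    # Nested sections may be missing or replaced by a plain string
    # (for example when the scan was blocked), so check types first.
    if isinstance(login, dict) and login.get("detected"):
        return "authenticate first"
    if isinstance(api, dict) and api.get("detected") and api.get("endpoints"):
        return "call the API directly"
    if report.get("js_required") is True:
        return "use a browser engine"
    if isinstance(antibot, dict) and antibot.get("detected"):
        return "plain HTTP with care"
    return "plain HTTP"
```

For the example report, the discovered endpoint takes priority over the `js_required` flag, so the helper suggests calling the API directly.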

TXT (French)

═══════════════════════════════════════════
        SCRAPALYSER RAPPORT
        https://example.com
        Scanné le : 2026-05-19T12:32:00Z
        Status : 200
═══════════════════════════════════════════

[ANTI-BOT]
  ⚠️  Détecté : Cloudflare Turnstile

[TECHNOLOGIE]
  🖥️  Type : React

[JAVASCRIPT]
  ⚠️  JS requis : Oui → utiliser Playwright/Selenium

[API DÉTECTÉES]
  ✅  https://api.example.com/v1/search

[ROBOTS.TXT]
  ✅  https://example.com/robots.txt

[SITEMAP]
  ✅  https://example.com/sitemap.xml

[LOGIN WALL]
  ✅  Aucun login requis

═══════════════════════════════════════════

Features

🛡️ Anti-bot detection — Cloudflare, DataDome, PerimeterX, Akamai, Kasada, Imperva, Sucuri, reCAPTCHA, hCaptcha, plus unknown or custom anti-bot systems
🖥️ Technology detection — React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
⚡ JS requirement detection — know before you write any code whether `requests` is enough or Playwright is needed
🌐 API endpoint discovery — via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
🤖 robots.txt parser — URL extraction with common path variants (/robot.txt, /robots.txt)
🗺️ Sitemap detection — via robots.txt or direct probe (/sitemap.xml, /sitemaps.xml...)
🔐 Login wall detection — form, redirect, button, OAuth
📸 Screenshot — capture the page as seen by the browser (Playwright mode)
🚫 Blocked by antibot — if the site blocks you, all other fields are immediately reported as "blocked by antibot"
🌍 Multi-language output — fr, en, es, br
📄 JSON & TXT export — machine-readable or human-readable
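
The sitemap feature above mentions discovery via robots.txt. As a rough illustration of that idea, the sketch below collects `Sitemap:` lines from the text of a robots.txt file. It is not scrapalyser's implementation; `sitemaps_from_robots` is a hypothetical name and the parsing rules are simplified assumptions.

```python
def sitemaps_from_robots(robots_txt: str) -> list:
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the "field: value" pair.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # Field names are matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

If no `Sitemap:` line is present, a direct probe of paths such as /sitemap.xml would be the remaining option, as the feature list describes.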

Engines

Feature	`curl`	`playwright`
Speed	⚡ Fast	🐢 Slower
JS execution	❌	✅
XHR/Fetch API detection	❌	✅
Screenshot	❌	✅
Bot bypass	Partial	Better with `headless=False`
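
Given the trade-offs in the table, a reasonable pattern is to start with the fast `curl` engine and only pay for a browser when the first report says JavaScript is required. The sketch below assumes that `scan` accepts the `url` and `engine` keyword arguments shown in Usage and returns a dict containing `js_required`. `scan_with_fallback` is a hypothetical wrapper, and the scan function is passed in so the wrapper can be tried without network access.

```python
from typing import Callable


def scan_with_fallback(url: str, scan: Callable[..., dict]) -> dict:
    """Run a fast curl scan, then rescan with Playwright if JS is required."""
    report = scan(url=url, engine="curl")
    # Only escalate when the first pass explicitly reports a JS requirement.
    if report.get("js_required") is True:
        report = scan(url=url, engine="playwright")
    return report
```

With the library installed together with its Playwright extra, this would be called as `scan_with_fallback("https://example.com", scrapalyser.scan)`.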

Anti-bot detection

When the site blocks you (403, captcha page), scrapalyser reports which antibot is responsible and marks all other fields as "blocked by antibot" — so you know exactly what you're up against before writing anything.

Supported:

Solution	Detection method
Cloudflare / Turnstile	`cf-ray` header, `__cf_bm` cookie
DataDome	`x-datadome` header, `datadome` cookie
PerimeterX	`_px` cookie, script patterns
Akamai	`ak_bmsc` / `bm_sz` cookies
Kasada	`kkrta` / `kasada.io` scripts
Imperva	`x-iinfo` header, `incap_ses` cookie
Sucuri	`x-sucuri-id` header
reCAPTCHA	script pattern
hCaptcha	script pattern
Unknown / Custom	suspicious headers, cookies, obfuscated scripts
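
To make the header and cookie rows of the table concrete, here is a minimal sketch of signature matching. The header and cookie names are copied from the table; the matching logic (case-insensitive header lookup, cookie-name prefix match, first match wins) is an assumption and not necessarily how scrapalyser does it. Script-pattern checks are left out, and `detect_antibot` is a hypothetical name.

```python
from typing import Dict, Iterable, Optional

# Signatures taken from the table above.
HEADER_SIGNATURES = {
    "cf-ray": "Cloudflare",
    "x-datadome": "DataDome",
    "x-iinfo": "Imperva",
    "x-sucuri-id": "Sucuri",
}
COOKIE_PREFIXES = {
    "__cf_bm": "Cloudflare",
    "datadome": "DataDome",
    "_px": "PerimeterX",
    "ak_bmsc": "Akamai",
    "bm_sz": "Akamai",
    "incap_ses": "Imperva",
}


def detect_antibot(headers: Dict[str, str], cookie_names: Iterable[str]) -> Optional[str]:
    """Return the first vendor whose header or cookie signature matches."""
    lowered = {name.lower() for name in headers}
    for header, vendor in HEADER_SIGNATURES.items():
        if header in lowered:
            return vendor
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_PREFIXES.items():
            # Prefix match, since some cookies carry a suffix after the base name.
            if cookie.lower().startswith(prefix):
                return vendor
    return None
```

Note that a matching header or cookie only shows that a protection product is present, not that the request was actually blocked.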

Contributing

Fork the repo
Create a feature branch (git checkout -b feat/my-feature)
Commit your changes
Open a pull request

License

MIT — see LICENSE.

Release history

0.1.2 (this version): May 20, 2026
0.1.1: Apr 30, 2026

Download files

Source Distribution

scrapalyser-0.1.2.tar.gz (15.6 kB)

Uploaded May 20, 2026 Source

Built Distribution

scrapalyser-0.1.2-py3-none-any.whl (17.9 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file scrapalyser-0.1.2.tar.gz.

File metadata

Download URL: scrapalyser-0.1.2.tar.gz
Upload date: May 20, 2026
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapalyser-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`1591c2c90d8ea2ef9f66ac8e4e1763e386002bac7212504c2f793487517a5cbc`
MD5	`df8a9419468ef985d5cf7875472c470d`
BLAKE2b-256	`1248a90a3582ff99d8763488c39cd6602845ef6dc22cff38c0deccadc8e187be`

File details

Details for the file scrapalyser-0.1.2-py3-none-any.whl.

File metadata

Download URL: scrapalyser-0.1.2-py3-none-any.whl
Upload date: May 20, 2026
Size: 17.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapalyser-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`00e9d486e1659656b8f7e63e65dc3a45cec2bcff759116b3edfa4a58c2cbbf4e`
MD5	`9bd6586cf454f0bc42d3cc7801d76934`
BLAKE2b-256	`f6f3b1d8ecb617804ef34aade6a0a17af2522f7294b4dae7b2fa75939f6dcff8`
