Pre-scraping intelligence tool — scan a website before scraping it

Project description

scrapalyser

Requires Python 3.9+. License: MIT.

Pre-scraping intelligence tool. Scan any website before writing a single line of scraper code.


Install

# Core (curl_cffi engine)
pip install scrapalyser

# With Playwright support
pip install scrapalyser[playwright]
playwright install chromium

Usage

import scrapalyser

# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)

# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",           # "json" (default) or "txt"
    lang="fr",              # "en" (default), "fr", "es", "br"
    save="report.txt",      # optional — write output to file
    engine="curl",          # "curl" (default) or "playwright"
    headless=True,          # True (default) or False — playwright only
    screenshot="shot.png",  # optional — playwright only
)

Example output

JSON

{
  "url": "https://example.com",
  "scanned_at": "2026-05-19T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
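
The report keys above can drive a first decision about how to scrape a site. The sketch below is only an illustration: `suggest_approach` and `_section` are hypothetical helpers, not part of scrapalyser, and they assume the report has the shape of the JSON example.

```python
from typing import Any, Dict, List


def _section(report: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Return a nested section as a dict, or {} if it is missing or not a dict."""
    value = report.get(key)
    return value if isinstance(value, dict) else {}


def suggest_approach(report: Dict[str, Any]) -> List[str]:
    """Turn a scan report into a short list of scraping hints (hypothetical helper)."""
    hints: List[str] = []

    antibot = _section(report, "antibot")
    if antibot.get("detected"):
        hints.append(f"anti-bot present: {antibot.get('name', 'unknown')}")

    api = _section(report, "api")
    endpoints = api.get("endpoints") or []
    if api.get("detected") and endpoints:
        hints.append(f"try the API first: {endpoints[0]}")

    if report.get("js_required") is True:
        hints.append("page needs JavaScript: use a browser engine")
    else:
        hints.append("plain HTTP requests are probably enough")

    login = _section(report, "login_wall")
    if login.get("detected"):
        hints.append(f"login wall: {login.get('type')}")

    return hints
```

Calling `suggest_approach(scrapalyser.scan("https://example.com"))` would then give a list of plain-text hints to read before choosing a scraping stack.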

TXT (French)

═══════════════════════════════════════════
        SCRAPALYSER RAPPORT
        https://example.com
        Scanné le : 2026-05-19T12:32:00Z
        Status : 200
═══════════════════════════════════════════

[ANTI-BOT]
  ⚠️  Détecté : Cloudflare Turnstile

[TECHNOLOGIE]
  🖥️  Type : React

[JAVASCRIPT]
  ⚠️  JS requis : Oui → utiliser Playwright/Selenium

[API DÉTECTÉES]
  ✅  https://api.example.com/v1/search

[ROBOTS.TXT]
  ✅  https://example.com/robots.txt

[SITEMAP]
  ✅  https://example.com/sitemap.xml

[LOGIN WALL]
  ✅  Aucun login requis

═══════════════════════════════════════════

Features

  • 🛡️ Anti-bot detection: Cloudflare, DataDome, PerimeterX, Akamai, Kasada, Imperva, Sucuri, reCAPTCHA, hCaptcha, plus unknown/custom anti-bot systems
  • 🖥️ Technology detection: React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
  • JS requirement detection: find out before writing any code whether plain HTTP requests are enough or a browser engine such as Playwright is needed
  • 🌐 API endpoint discovery: via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
  • 🤖 robots.txt parser: URL extraction with common path variants (/robot.txt, /robots.txt)
  • 🗺️ Sitemap detection: via robots.txt or direct probe (/sitemap.xml, /sitemaps.xml...)
  • 🔐 Login wall detection: form, redirect, button, OAuth
  • 📸 Screenshot: capture the page as seen by the browser (Playwright mode)
  • 🚫 Blocked by antibot: if the site blocks the scan, all other fields return "blocked by antibot" immediately
  • 🌍 Multi-language output: fr, en, es, br
  • 📄 JSON & TXT export: machine-readable or human-readable
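
To make the "sitemap detection via robots.txt" item concrete, here is a minimal sketch of how sitemap URLs can be read out of a robots.txt body. It is not scrapalyser's actual implementation; `sitemaps_from_robots` is a hypothetical name, and the function only handles the standard `Sitemap:` directive.

```python
from typing import List


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the URLs declared by `Sitemap:` lines in a robots.txt body."""
    urls: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own "://" is preserved.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in urls:
                urls.append(url)
    return urls
```

If no `Sitemap:` line is present, the function returns an empty list, which is the point where a direct probe of paths such as /sitemap.xml would take over.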

Engines

Feature                    curl       playwright
Speed                      ⚡ Fast     🐢 Slower
JS execution               ❌         ✅
XHR/Fetch API detection    ❌         ✅
Screenshot                 ❌         ✅
Bot bypass                 Partial    Better with headless=False
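
One way to combine the two engines is to start with the fast curl engine and re-scan with Playwright only when the first attempt is blocked. The sketch below rests on two assumptions: `looks_blocked` and `scan_with_fallback` are hypothetical helpers, and a blocked report is assumed to carry the string "blocked by antibot" as a top-level field value. Only the `url`, `engine` and `headless` parameters shown under Usage are used.

```python
from typing import Any, Callable, Dict

BLOCKED = "blocked by antibot"  # marker string quoted in this README


def looks_blocked(report: Dict[str, Any]) -> bool:
    """Assumption: a blocked report has the marker string as a top-level value."""
    return any(value == BLOCKED for value in report.values())


def scan_with_fallback(
    url: str, scan: Callable[..., Dict[str, Any]]
) -> Dict[str, Any]:
    """Scan with the curl engine first; retry once with a visible browser if blocked."""
    report = scan(url=url, engine="curl")
    if looks_blocked(report):
        report = scan(url=url, engine="playwright", headless=False)
    return report
```

Passing the scan function in as an argument, for example `scan_with_fallback("https://example.com", scrapalyser.scan)`, keeps the helper easy to try out with a stub before pointing it at a real site.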

Anti-bot detection

When the site blocks the scan (for example with a 403 response or a captcha page), scrapalyser reports which anti-bot system is responsible and marks all other fields as "blocked by antibot", so you know what you are up against before writing any scraper code.

Supported:

Solution                 Detection method
Cloudflare / Turnstile   cf-ray header, __cf_bm cookie
DataDome                 x-datadome header, datadome cookie
PerimeterX               _px cookie, script patterns
Akamai                   ak_bmsc / bm_sz cookies
Kasada                   kkrta / kasada.io scripts
Imperva                  x-iinfo header, incap_ses cookie
Sucuri                   x-sucuri-id header
reCAPTCHA                script pattern
hCaptcha                 script pattern
Unknown / Custom         suspicious headers, cookies, obfuscated scripts
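
The header and cookie rows of this table amount to a signature lookup. The sketch below shows the idea under stated assumptions: `detect_antibot` and the two dictionaries are hypothetical, cookie names are matched by prefix, and script-pattern detection (Kasada, reCAPTCHA, hCaptcha) is left out. It is not scrapalyser's actual detection code.

```python
from typing import Dict, Iterable, Optional

# Response header names (lowercase) taken from the table above.
HEADER_SIGNATURES: Dict[str, str] = {
    "cf-ray": "Cloudflare",
    "x-datadome": "DataDome",
    "x-iinfo": "Imperva",
    "x-sucuri-id": "Sucuri",
}

# Cookie name prefixes taken from the table above.
COOKIE_PREFIXES: Dict[str, str] = {
    "__cf_bm": "Cloudflare",
    "datadome": "DataDome",
    "_px": "PerimeterX",
    "ak_bmsc": "Akamai",
    "bm_sz": "Akamai",
    "incap_ses": "Imperva",
}


def detect_antibot(
    headers: Dict[str, str], cookie_names: Iterable[str]
) -> Optional[str]:
    """Return the first vendor whose header or cookie signature matches, else None."""
    present = {name.lower() for name in headers}
    for header, vendor in HEADER_SIGNATURES.items():
        if header in present:
            return vendor
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_PREFIXES.items():
            if cookie.lower().startswith(prefix):
                return vendor
    return None
```

Headers are checked before cookies here purely for simplicity; a real detector would likely weigh several signals together before naming a vendor.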

Contributing

  1. Fork the repo
  2. Create a feature branch (git checkout -b feat/my-feature)
  3. Commit your changes
  4. Open a pull request

License

MIT — see LICENSE.

Download files

Source Distribution

scrapalyser-0.1.2.tar.gz (15.6 kB)

Uploaded Source

Built Distribution

scrapalyser-0.1.2-py3-none-any.whl (17.9 kB)

Uploaded Python 3

File details

Details for the file scrapalyser-0.1.2.tar.gz.

File metadata

  • Download URL: scrapalyser-0.1.2.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapalyser-0.1.2.tar.gz
Algorithm Hash digest
SHA256 1591c2c90d8ea2ef9f66ac8e4e1763e386002bac7212504c2f793487517a5cbc
MD5 df8a9419468ef985d5cf7875472c470d
BLAKE2b-256 1248a90a3582ff99d8763488c39cd6602845ef6dc22cff38c0deccadc8e187be

File details

Details for the file scrapalyser-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: scrapalyser-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 17.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for scrapalyser-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 00e9d486e1659656b8f7e63e65dc3a45cec2bcff759116b3edfa4a58c2cbbf4e
MD5 9bd6586cf454f0bc42d3cc7801d76934
BLAKE2b-256 f6f3b1d8ecb617804ef34aade6a0a17af2522f7294b4dae7b2fa75939f6dcff8
