Pre-scraping intelligence tool — scan a website before scraping it

scrapalyser

Requires Python 3.9+. License: MIT.

Pre-scraping intelligence tool. Scan any website before writing a single line of scraper code.


Install

# Core (curl_cffi engine)
pip install scrapalyser

# With Playwright support
pip install scrapalyser[playwright]
playwright install chromium

Usage

import scrapalyser

# Simple scan — returns a dict
report = scrapalyser.scan("https://example.com")
print(report)

# Full options
report = scrapalyser.scan(
    url="https://example.com",
    output="txt",           # "json" (default) or "txt"
    lang="fr",              # "en" (default), "fr", "es", "br"
    save="report.txt",      # optional — write output to file
    engine="curl",          # "curl" (default) or "playwright"
    headless=True,          # True (default) or False — playwright only
    screenshot="shot.png",  # optional — playwright only
)

Example output

JSON

{
  "url": "https://example.com",
  "scanned_at": "2026-04-30T12:32:00Z",
  "status_code": 200,
  "engine": "curl",
  "antibot": {
    "detected": true,
    "name": "Cloudflare Turnstile"
  },
  "technology": {
    "type": "React"
  },
  "js_required": true,
  "api": {
    "detected": true,
    "endpoints": [
      "https://api.example.com/v1/search"
    ]
  },
  "robots_txt": {
    "found": true,
    "url": "https://example.com/robots.txt"
  },
  "sitemap": {
    "found": true,
    "url": "https://example.com/sitemap.xml"
  },
  "login_wall": {
    "detected": false,
    "type": null
  }
}
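
The report is a plain dict, so it can drive decisions in code. The sketch below is one hedged way to turn the fields shown above into a short list of scraping notes. The function name `recommend_strategy`, the wording of the notes, and the priority order (API first, then browser rendering) are assumptions for illustration and are not part of scrapalyser.

```python
from typing import Any

# Sample report mirroring the JSON example above (trimmed to the fields used).
SAMPLE_REPORT: dict[str, Any] = {
    "url": "https://example.com",
    "antibot": {"detected": True, "name": "Cloudflare Turnstile"},
    "js_required": True,
    "api": {"detected": True, "endpoints": ["https://api.example.com/v1/search"]},
    "login_wall": {"detected": False, "type": None},
}


def recommend_strategy(report: dict[str, Any]) -> list[str]:
    """Hypothetical helper: derive scraping notes from a scan report."""
    notes: list[str] = []

    antibot = report.get("antibot", {})
    if antibot.get("detected"):
        notes.append(f"Anti-bot protection detected: {antibot.get('name')}")

    api = report.get("api", {})
    if api.get("detected") and api.get("endpoints"):
        # Assumption: a discovered API is usually simpler to scrape than HTML.
        notes.append(f"Try the discovered API first: {api['endpoints'][0]}")
    elif report.get("js_required"):
        notes.append("Render pages with a browser engine such as Playwright")
    else:
        notes.append("Plain HTTP requests are likely sufficient")

    if report.get("login_wall", {}).get("detected"):
        notes.append("A login is required before content is visible")

    return notes


if __name__ == "__main__":
    for note in recommend_strategy(SAMPLE_REPORT):
        print(note)
```

Using `.get()` with defaults keeps the helper tolerant of missing keys, which is a cautious choice given that the exact report schema may differ between versions.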

TXT (French)

═══════════════════════════════════════════
        SCRAPALYSER RAPPORT
        https://example.com
        Scanné le : 2026-04-30T12:32:00Z
        Status : 200
═══════════════════════════════════════════

[ANTI-BOT]
  ⚠️  Détecté : Cloudflare Turnstile

[TECHNOLOGIE]
  🖥️  Type : React

[JAVASCRIPT]
  ⚠️  JS requis : Oui → utiliser Playwright/Selenium

[API DÉTECTÉES]
  ✅  https://api.example.com/v1/search

[ROBOTS.TXT]
  ✅  https://example.com/robots.txt

[SITEMAP]
  ✅  https://example.com/sitemap.xml

[LOGIN WALL]
  ✅  Aucun login requis

═══════════════════════════════════════════

Features

  • 🛡️ Anti-bot detection — Cloudflare, DataDome, PerimeterX, Akamai, Kasada, reCAPTCHA, hCaptcha
  • 🖥️ Technology detection — React, Vue, Angular, Next.js, Nuxt, Svelte, WordPress, Shopify, Drupal, Joomla, Wix, Webflow
  • JS requirement detection — know before you write whether requests is enough or Playwright is needed
  • 🌐 API endpoint discovery — via CSP headers, inline scripts, and XHR/Fetch interception (Playwright mode)
  • 🤖 robots.txt parser — URL extraction with common path variants
  • 🗺️ Sitemap detection — via robots.txt or direct probe
  • 🔐 Login wall detection — form, redirect, button, OAuth
  • 📸 Screenshot — capture the page as seen by the browser (Playwright mode)
  • 🌍 Multi-language output — fr, en, es, br
  • 📄 JSON & TXT export — machine-readable or human-readable
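
To make two of the techniques listed above more concrete, here is a minimal sketch of reading candidate API origins from a Content-Security-Policy header and reading sitemap URLs from a robots.txt body. It uses only the Python standard library. The function names `endpoints_from_csp` and `sitemaps_from_robots` are hypothetical, and this is not scrapalyser's actual implementation, only an illustration of the general idea.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def endpoints_from_csp(csp_header: str) -> list[str]:
    """Return absolute sources listed under connect-src in a CSP header value."""
    origins: list[str] = []
    for directive in csp_header.split(";"):
        tokens = directive.split()
        if not tokens or tokens[0].lower() != "connect-src":
            continue
        for source in tokens[1:]:
            parts = urlsplit(source)
            # Keep only sources with a network scheme and a host; this skips
            # keywords such as 'self' and scheme-only sources such as data:
            if parts.scheme in ("http", "https", "ws", "wss") and parts.netloc:
                origins.append(source)
    return origins


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs declared on Sitemap: lines of a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    csp = "default-src 'self'; connect-src 'self' https://api.example.com"
    print(endpoints_from_csp(csp))

    robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml\n"
    print(sitemaps_from_robots(robots))
```

A `connect-src` entry only names an origin the page is allowed to call, so results from this kind of parsing are hints to investigate rather than confirmed endpoints.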

Engines

Feature                   curl      playwright
Speed                     Fast      Slower
JS execution              No        Yes
XHR/Fetch API detection   No        Yes
Screenshot                No        Yes
Bot detection bypass      Partial   Better with headless=False
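
Given the trade-off above, one possible pattern is to scan with the fast curl engine first and rescan with Playwright only when the first report says JavaScript is required. The sketch below assumes the report is a dict with a `js_required` key, as in the JSON example, and that the scan function accepts `url` and `engine` keyword arguments, as in the Usage section. The helper name `scan_with_fallback` is hypothetical, and the scan function is passed in as a parameter so the logic can be exercised without network access.

```python
from typing import Any, Callable


def scan_with_fallback(url: str, scan: Callable[..., dict[str, Any]]) -> dict[str, Any]:
    """Scan with the curl engine, escalating to Playwright if JS is required."""
    report = scan(url=url, engine="curl")
    if report.get("js_required"):
        # The curl engine cannot execute JavaScript, so rescan in a browser.
        report = scan(url=url, engine="playwright")
    return report


if __name__ == "__main__":
    # Stand-in for a real scan function, used here only for demonstration.
    def fake_scan(url: str, engine: str) -> dict[str, Any]:
        return {"url": url, "engine": engine, "js_required": engine == "curl"}

    print(scan_with_fallback("https://example.com", fake_scan))
```

With the library installed, the call would presumably look like `scan_with_fallback("https://example.com", scrapalyser.scan)`. The Playwright pass needs the optional `scrapalyser[playwright]` extra and a Chromium install.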

Contributing

  1. Fork the repo
  2. Create a feature branch (git checkout -b feat/my-feature)
  3. Commit your changes
  4. Open a pull request

License

MIT — see LICENSE.
