Analyze website anti-bot protections before you scrape

These details have not been verified by PyPI

Project links

Homepage

Project description

caniscrape 🔍

Know before you scrape. Analyze any website's anti-bot protections in seconds.

Stop wasting hours building scrapers only to discover the site has Cloudflare + JavaScript rendering + CAPTCHA + rate limiting. caniscrape does reconnaissance upfront so you know exactly what you're dealing with before writing a single line of code.

🎯 What It Does

caniscrape analyzes a URL and tells you:

What protections are active (WAF, CAPTCHA, rate limits, TLS fingerprinting, honeypots)
Difficulty score (0-10 scale: Easy → Very Hard)
Specific recommendations on what tools/proxies you'll need
Estimated complexity so you can decide: build it yourself or use a service

🚀 Quick Start

Installation

pip install caniscrape

Required dependencies:

# Install wafw00f (WAF detection)
pipx install wafw00f

# Install Playwright browsers (for JS detection)
playwright install chromium

Basic Usage

caniscrape https://example.com

Example Output

🔍 Analyzing: https://newegg.com...
🤖 Checking robots.txt...
🔬 Analyzing TLS fingerprint...
⚙️ Analyzing JavaScript rendering...
🕵️ Analyzing for behavioral traps (default scan)...
⚔️ Detecting CAPTCHA...
⏱️ Profiling rate limits with Python client...
🔍 Running WAF detection...


───────────────────────────────────────────────  DIFFICULTY SCORE: 6/10 (Hard)  ───────────────────────────────────────────────

╭───────────────────────╮
│🛡️  ACTIVE PROTECTIONS │
╰───────────────────────╯
    ✅ robots.txt: Website allows scraping (for details on specific pages, navigate to <url>/robots.txt in your browser.)
    ❌ TLS Fingerprinting: Site blocks standard Python clients but allows browser-like clients.
    ✅ JavaScript: Not required for main content.
    ✅ Behavioral Analysis: No obvious honeypot traps detected.
    ❌ CAPTCHA: Cloudflare Turnstile detected (on page load).
    ❌ Rate Limiting: Blocked Immediately (Blocked after 1 requests with a 3.0s delay.)
    💡 Advice: This is likely due to client fingerprinting (TLS fingerprinting, User-Agent, etc.), not a classic rate limit.
       Run the analysis again. A different browser identity will be used, which may not be blocked.
       Otherwise, try the --impersonate flag, it will take longer but is likely to succeed.
    ❌ WAF: Kona SiteDefender by (Akamai)

───────────────────────────────────────────────────── 💡 RECOMMENDATIONS ──────────────────────────────────────────────────────

Required Tools:
  • A CAPTCHA solving service (e.g., 2Captcha, Anti-Captcha).
  • A library with browser impersonation like curl_cffi, or a full headless browser.
  • A pool of high-quality proxies (residential or mobile) to rotate IP addresses.

Scraping Strategy:
  • Implement delays between requests (e.g., 3-5 seconds).
  • Integrate the CAPTCHA solver into your script to handle challenges when they appear.
  • Rotate User-Agents and other headers on every request.
  • Standard Python HTTP clients (like requests/aiohttp) will be blocked.

────────────────────────────────────────────────────── Analysis Complete ──────────────────────────────────────────────────────

🔬 What It Analyzes

1. WAF Detection

Identifies Web Application Firewalls (Cloudflare, Akamai, Imperva, DataDome, PerimeterX, etc.)

2. Rate Limiting

Tests with burst and sustained traffic patterns
Detects HTTP 429s, timeouts, throttling, soft bans
Determines blocking threshold (requests/min)
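
The threshold check described above can be illustrated with a minimal sketch. This is not caniscrape's implementation; `probe_rate_limit` and `http_status` are hypothetical helpers, and only HTTP 429 is treated as a block signal here (the tool also looks at timeouts, throttling, and soft bans).

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str) -> int:
    """Fetch a URL with the standard library and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return exc.code


def probe_rate_limit(
    fetch_status: Callable[[], int],
    max_requests: int = 20,
    delay: float = 1.0,
) -> Optional[int]:
    """Return the 1-based index of the first request answered with 429, or None."""
    for attempt in range(1, max_requests + 1):
        if fetch_status() == 429:
            return attempt
        if delay > 0:
            time.sleep(delay)
    return None
```

A call such as `probe_rate_limit(lambda: http_status("https://example.com"), delay=3.0)` would send up to 20 spaced requests. Only run it against sites you are permitted to test.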

3. JavaScript Rendering

Compares content with/without JS execution
Detects SPAs (React, Vue, Angular)
Calculates percentage of content missing without JS
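
One way to estimate the share of content that is missing without JS is to compare the visible text of the raw HTML with the text of the rendered DOM. The sketch below assumes you already have both HTML strings (for example, one from a plain HTTP fetch and one from a headless browser). The function names and the text-length heuristic are assumptions for illustration, not the tool's actual metric.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def percent_missing_without_js(static_html: str, rendered_html: str) -> float:
    """Percentage of rendered text (by length) absent from the static HTML."""
    static_len = len(visible_text(static_html))
    rendered_len = len(visible_text(rendered_html))
    if rendered_len == 0:
        return 0.0
    missing = max(rendered_len - static_len, 0)
    return round(100.0 * missing / rendered_len, 1)
```

An empty single-page-app shell such as `<div id="root"></div>` compared against its rendered version scores 100.0, while a fully server-rendered page scores 0.0.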

4. CAPTCHA Detection

Scans for reCAPTCHA, hCaptcha, Cloudflare Turnstile
Tests if CAPTCHA appears on load or after rate limiting
Monitors network traffic for challenge endpoints
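
As a rough illustration of the scanning step, the three providers can often be spotted by the script hosts and widget class names they use in page markup. The marker list below is a small hand-picked assumption, not caniscrape's detection rules, and it covers only static HTML (not network traffic).

```python
from typing import List

# Commonly seen script hosts and widget class names for each provider.
CAPTCHA_MARKERS = {
    "reCAPTCHA": ("google.com/recaptcha", "recaptcha.net/recaptcha", "g-recaptcha"),
    "hCaptcha": ("hcaptcha.com", "h-captcha"),
    "Cloudflare Turnstile": ("challenges.cloudflare.com/turnstile", "cf-turnstile"),
}


def detect_captchas(html: str) -> List[str]:
    """Return the names of CAPTCHA providers whose markers appear in the HTML."""
    lowered = html.lower()
    return [
        name
        for name, markers in CAPTCHA_MARKERS.items()
        if any(marker in lowered for marker in markers)
    ]
```

Substring matching like this can produce false positives (for instance, a blog post that mentions one of these hosts), so treat a hit as a hint rather than proof.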

5. TLS Fingerprinting

Compares standard Python clients vs browser-like clients
Detects if site blocks based on TLS handshake signatures

6. Behavioral Analysis

Scans for invisible "honeypot" links (bot traps)
Detects if site is monitoring mouse/scroll behavior
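
To make the honeypot idea concrete: a bot trap is typically a link that a human never sees but a naive crawler follows. The sketch below flags links hidden with the `hidden` attribute or inline CSS only. It is a simplified assumption of how such a scan could work; links hidden through external stylesheets would need computed styles from a real browser.

```python
import re
from html.parser import HTMLParser
from typing import List

_HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class _HiddenLinkFinder(HTMLParser):
    """Collect href values of <a> tags that are hidden inline."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        style = attr.get("style") or ""
        if "hidden" in attr or _HIDDEN_STYLE.search(style):
            self.hidden_links.append(href)


def find_hidden_links(html: str) -> List[str]:
    finder = _HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_links
```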

7. robots.txt

Checks scraping permissions
Extracts recommended crawl-delay
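
Both checks can be reproduced with Python's standard `urllib.robotparser`. The helper name `summarize_robots` and the sample rules are illustrative assumptions; this is not necessarily how caniscrape reads the file.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser


def summarize_robots(
    robots_txt: str, url: str, user_agent: str = "*"
) -> Tuple[bool, Optional[float]]:
    """Return (is_allowed, crawl_delay_seconds) for a URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
```

To check a live site, call `set_url("https://example.com/robots.txt")` followed by `read()` on the parser instead of `parse()`.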

🛠️ Advanced Usage

Aggressive WAF Detection

# Find ALL WAFs (slower, may trigger rate limits)
caniscrape https://example.com --find-all

Browser Impersonation

# Use curl_cffi for better stealth (slower but more likely to succeed)
caniscrape https://example.com --impersonate

Deep Honeypot Scanning

# Check 2/3 of links (more accurate, slower)
caniscrape https://example.com --thorough

# Check ALL links (most accurate, very slow on large sites)
caniscrape https://example.com --deep

Combine Options

caniscrape https://example.com --impersonate --find-all --thorough
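
If you want to run the CLI from a script (for example, to check a list of candidate sites), a thin wrapper around `subprocess` is enough. The helper names below are hypothetical, and treating `--thorough` and `--deep` as alternatives is an assumption made for this sketch. Only the flags documented above are used.

```python
import subprocess
from typing import List


def build_command(
    url: str,
    *,
    impersonate: bool = False,
    find_all: bool = False,
    scan_depth: str = "default",
) -> List[str]:
    """Build the caniscrape argv for a URL using the documented flags."""
    cmd = ["caniscrape", url]
    if impersonate:
        cmd.append("--impersonate")
    if find_all:
        cmd.append("--find-all")
    if scan_depth in ("thorough", "deep"):
        cmd.append(f"--{scan_depth}")
    elif scan_depth != "default":
        raise ValueError(f"unknown scan depth: {scan_depth!r}")
    return cmd


def run_analysis(url: str, **options) -> int:
    """Run caniscrape for one URL and return its exit code."""
    return subprocess.run(build_command(url, **options), check=False).returncode
```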

📊 Difficulty Scoring

The tool calculates a 0-10 difficulty score based on:

Factor	Impact
CAPTCHA on page load	+5 points
CAPTCHA after rate limit	+4 points
DataDome/PerimeterX WAF	+4 points
Akamai/Imperva WAF	+3 points
Aggressive rate limiting	+3 points
Cloudflare WAF	+2 points
Honeypot traps detected	+2 points
TLS fingerprinting active	+1 point

Score interpretation:

0-2: Easy (basic scraping will work)
3-4: Medium (need some precautions)
5-7: Hard (requires advanced techniques)
8-10: Very Hard (consider using a service)
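
The interpretation bands translate directly into a lookup. The function name is an assumption; the ranges are the ones listed above.

```python
def difficulty_label(score: int) -> str:
    """Map a 0-10 difficulty score to its band."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score <= 2:
        return "Easy"
    if score <= 4:
        return "Medium"
    if score <= 7:
        return "Hard"
    return "Very Hard"
```

A score of 6, as in the example output, falls in the Hard band.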

🔧 Installation Details

System Requirements

Python 3.9+
pip or pipx

Full Installation

# 1. Install caniscrape
pip install caniscrape

# 2. Install wafw00f (WAF detection)
# Option A: Using pipx (recommended)
python -m pip install --user pipx
pipx install wafw00f

# Option B: Using pip
pip install wafw00f

# 3. Install Playwright browsers (for JS/CAPTCHA/behavioral detection)
playwright install chromium

Dependencies

Core dependencies (installed automatically):

click - CLI framework
rich - Terminal formatting
aiohttp - Async HTTP requests
beautifulsoup4 - HTML parsing
playwright - Headless browser automation
curl_cffi - Browser impersonation

External tools (install separately):

wafw00f - WAF detection

🎓 Use Cases

For Developers

Before building a scraper: Check if it's even feasible
Debugging scraper issues: Identify what protection broke your scraper
Client estimates: Give accurate time/cost estimates for scraping projects

For Data Engineers

Pipeline planning: Know what infrastructure you'll need (proxies, CAPTCHA solvers)
Cost estimation: Calculate proxy/CAPTCHA costs before committing to a data source

For Researchers

Site selection: Find the easiest data sources for your research
Compliance: Check robots.txt before scraping

⚠️ Limitations & Disclaimers

What It Can't Detect

Dynamic protections: Some sites only trigger defenses under specific conditions
Behavioral AI: Advanced ML-based bot detection that adapts in real-time
Account-based restrictions: Protections that only activate for logged-in users

Legal & Ethical Notes

This tool is for reconnaissance only - it does not bypass protections
Always respect robots.txt and terms of service
Some sites may consider aggressive scanning hostile - use --find-all and --deep sparingly
You are responsible for how you use this tool and any scrapers you build

Technical Notes

Analysis takes 30-60 seconds per URL
Some checks require making multiple requests (may trigger rate limits)
Results are a snapshot - protections can change over time

🤝 Contributing

Found a bug? Have a feature request? Contributions are welcome!

Fork the repo
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

Built on top of:

wafw00f - WAF detection
Playwright - Browser automation
curl_cffi - Browser impersonation

📬 Contact

Questions? Feedback? Open an issue on GitHub.

Remember: This tool tells you HOW HARD it will be to scrape. It doesn't do the scraping for you. Use it to make informed decisions before you start building.

Release history

Version	Release date
1.0.3	Nov 2, 2025
1.0.2	Nov 2, 2025
1.0.1	Oct 31, 2025
1.0.0	Oct 29, 2025
0.3.0	Oct 24, 2025
0.2.0	Oct 22, 2025
0.1.8	Oct 20, 2025
0.1.7	Oct 20, 2025
0.1.6	Oct 20, 2025
0.1.5	Oct 20, 2025
0.1.4	Oct 20, 2025
0.1.3	Oct 19, 2025
0.1.2	Oct 19, 2025
0.1.1	Oct 19, 2025
0.1.0 (this version)	Oct 19, 2025

Download files

Source Distribution

caniscrape-0.1.0.tar.gz (20.7 kB)

Uploaded Oct 19, 2025 Source

Built Distribution

caniscrape-0.1.0-py3-none-any.whl (22.0 kB)

Uploaded Oct 19, 2025 Python 3

File details

Details for the file caniscrape-0.1.0.tar.gz.

File metadata

Download URL: caniscrape-0.1.0.tar.gz
Upload date: Oct 19, 2025
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caniscrape-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a1400b04be3fa94f22a48aa92787d4ee9eee2f61ae6d1dcd4c555f8324b9233d`
MD5	`efab612f869db16d9ff7c61d67607250`
BLAKE2b-256	`d5986d44f1e8c93e8d6b677cdfd8f1531edae656808c398caab1166c7ea04ed9`

File details

Details for the file caniscrape-0.1.0-py3-none-any.whl.

File metadata

Download URL: caniscrape-0.1.0-py3-none-any.whl
Upload date: Oct 19, 2025
Size: 22.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for caniscrape-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a244f76bfeade6609fbde78aac6d114ffc11f46958ee4e17ad1b2025d679dcf6`
MD5	`002277196343541e0edc264624ac12e5`
BLAKE2b-256	`e7bde1075d6cb5d9e52d9bf9e7cf866231137935692e6772e982734fe856a219`
