scrapy-stealth

Stealthy Crawling. Maximum Results.

A pluggable anti-bot and stealth framework for Scrapy.


scrapy-stealth extends Scrapy with browser impersonation, proxy rotation, fingerprint cycling, and intelligent retry strategies. It is designed for large-scale, production-grade crawling.


🧠 Why scrapy-stealth?

Scrapy is fast and powerful, but modern websites use advanced anti-bot protections such as:

  • TLS fingerprinting
  • Browser behavior detection
  • Rate limiting and IP blocking

scrapy-stealth helps by adding:

  • 🧬 Browser-level impersonation (TLS + HTTP/2 fingerprints)
  • 🔁 Smarter retry strategies
  • 🌐 Proxy and fingerprint rotation
  • 🛡️ Anti-bot detection

Result

  • Higher success rate
  • Lower proxy cost
  • More stable crawls

📊 Comparison

| Feature | scrapy-stealth | scrapy-impersonate | scrapy-playwright | scrapy-splash | Scrapy (default) |
| --- | --- | --- | --- | --- | --- |
| TLS fingerprint spoofing | ✅ | ✅ | ❌ | ❌ | ❌ |
| HTTP/2 support | ✅ | ✅ | ✅ | ❌ | ❌ |
| Browser impersonation | ✅ | ✅ | ⚠️ partial | ❌ | ❌ |
| Proxy rotation (built-in) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Fingerprint rotation | ✅ | ❌ | ❌ | ❌ | ❌ |
| Anti-bot detection | ✅ | ❌ | ❌ | ❌ | ❌ |
| Smart retry logic | ✅ | ❌ | ❌ | ❌ | ❌ |
| Per-request engine switching | ✅ | ❌ | ❌ | ❌ | ❌ |
| Headless browser required | ✅ | ❌ | ✅ | ✅ | ❌ |
| JavaScript rendering | ✅ | ❌ | ✅ | ✅ | ❌ |
| Screenshot / snapshot | ✅ | ❌ | ✅ | ✅ | ❌ |
| Native Scrapy integration | ✅ | ✅ | ✅ | ✅ | ✅ |
| Memory footprint | 🟢 Low | 🟢 Low | 🔴 High | 🔴 High | 🟢 Low |

⚠️ scrapy-playwright uses a real browser's TLS stack but does not spoof fingerprint profiles as scrapy-stealth does. scrapy-impersonate provides TLS/HTTP2 impersonation via curl_cffi but lacks built-in rotation, detection, and per-request engine switching. In scrapy-stealth, JavaScript rendering is available via the optional browser driver; use it selectively for pages that require a full browser.


✨ Features

  • 🔌 Pluggable engine system (scrapy, stealth)
  • 🧠 Per-request engine selection via request.meta
  • 🌐 Proxy support and rotation
  • 🧬 Browser fingerprint rotation
  • 🔁 Smart retry logic
  • 🛡️ Anti-bot detection (status + content-based, Cloudflare, Akamai)
  • ⚡ Thread-safe async integration
  • 🖥️ Real-browser engine (CDP) for JS-heavy pages
  • 🔄 Intelligent browser restart: restarts on consecutive bans, not a fixed request count
  • 📸 Built-in snapshot decorator (scrapy_stealth.decorators.snapshot)

📦 Installation

pip install scrapy-stealth

Requires Python 3.11+ and Scrapy 2.12–2.x.


⚙️ Setup

Option 1: Global (settings.py)

# 1. Enable the middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_stealth.middlewares.StealthDownloaderMiddleware": 950,
}

# 2. (Optional) Route ALL requests through stealth automatically; no meta needed per request
STEALTH_ENABLED = True
STEALTH_DRIVER  = "turbo"   # "basic" (default), "turbo", or "browser"

# 3. (Optional) Proxy list for automatic rotation
#    Used when rotate_proxy=True (per-request) or when STEALTH_ENABLED=True with rotate_proxy
#    Supported schemes: http, https, socks4, socks5
STEALTH_PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080",  # with authentication
    "socks5://proxy4:1080",
]

Option 2: Per-spider (custom_settings)

Configure the middleware and all stealth settings directly on the spider; no changes to settings.py are required.

import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_stealth.middlewares.StealthDownloaderMiddleware": 950,
        },
        "STEALTH_ENABLED": True,
        "STEALTH_DRIVER": "turbo",
        "STEALTH_PROXIES": [
            "http://proxy1:8080",
            "http://user:pass@proxy2:8080",
            "socks5://proxy3:1080",
        ],
    }

Proxies are validated at startup: an invalid format or unsupported scheme raises ValueError immediately.
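As an illustration of that kind of startup check, here is a minimal sketch using only the standard library. It is not the library's implementation: validate_proxy is a hypothetical helper, and only the list of supported schemes and the ValueError behaviour come from the text above.

```python
from urllib.parse import urlsplit

# Schemes listed as supported in the settings example above.
SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def validate_proxy(url: str) -> str:
    """Hypothetical validator: return the URL unchanged or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme in {url!r}")
    try:
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError as exc:
        raise ValueError(f"invalid port in {url!r}") from exc
    if not parts.hostname or port is None:
        raise ValueError(f"proxy must look like scheme://host:port, got {url!r}")
    return url
```

Running every entry of a proxy list through a check like this before the crawl starts surfaces a typo at launch time rather than as a failed request hours later.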


🚀 Quick Start

Option A: Per-request (stealth only on specific requests):

yield scrapy.Request(
    url="https://example.com",
    meta={"stealth": {}},
)

Option B: Global mode (stealth on every request automatically):

# settings.py or custom_settings
STEALTH_ENABLED = True
STEALTH_DRIVER  = "turbo"
# No meta needed: all requests go through stealth
yield scrapy.Request(url="https://example.com")

# Opt out for a specific request
yield scrapy.Request(url="https://api.internal/health", meta={"stealth": False})
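The routing rule implied by options A and B can be written down as a small decision function. This is a sketch of the documented behaviour only: uses_stealth is a hypothetical helper, not part of scrapy-stealth, and it ignores any other setting that may influence engine choice.

```python
def uses_stealth(meta: dict, stealth_enabled: bool = False) -> bool:
    """Hypothetical helper mirroring the documented routing rule."""
    value = meta.get("stealth")
    if value is False:
        return False  # explicit opt-out always wins
    if isinstance(value, dict):
        return True   # presence of the dict (even an empty one) activates stealth
    return stealth_enabled  # key absent: fall back to the global switch
```

For example, uses_stealth({"stealth": {}}) is true regardless of the global switch, while uses_stealth({"stealth": False}, stealth_enabled=True) is false.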

🔧 Global Configuration

Customise package-wide defaults via the shared config instance. Apply all settings at module level, before the spider class is defined: the engine client is created when the middleware is initialised, so changes made inside start_requests or parse have no effect.

# myspider.py
import scrapy
from scrapy_stealth.config import config

config.DEFAULT_ENGINE  = "stealth"      # "scrapy" (native) or "stealth" (browser impersonation)
config.DEFAULT_PROFILE = "chrome_147"   # browser profile when meta["stealth"]["profile"] is not set
config.DEFAULT_TIMEOUT = 30             # stealth request timeout in seconds
config.STEALTH_DRIVER  = "turbo"        # "basic" (default), "turbo", or "browser"
config.HTTP2           = True           # False for servers that only support HTTP/1.1
config.BLOCK_CODES    |= {407}          # extend blocked status codes (|= keeps defaults)
config.BLOCK_KEYWORDS.append("banned")  # extend blocked body-text patterns
config.BROWSER_HEADLESS = True          # browser driver: headless mode (False = visible window, more stealthy)
config.BROWSER_SETTLE_S = 4.0          # browser driver: seconds to wait after navigation for JS to finish
config.BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser"  # custom browser binary (default: auto-detect Chrome)
config.BROWSER_RESTART_AFTER_BANS = 5   # restart Chrome (fresh fingerprint) after 5 consecutive bans


class MySpider(scrapy.Spider):
    name = "example"
    ...


# ❌ Wrong: too late, the engine client has already been created
class MySpider(scrapy.Spider):
    def start_requests(self):
        config.HTTP2 = False  # has no effect
        ...
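Why a late change is ignored can be shown with plain Python. The classes below are stand-ins written for this illustration, not scrapy-stealth code; they assume only what the text states, namely that the client reads its settings once when it is created.

```python
class Config:
    """Stand-in for a shared settings object (not the real scrapy_stealth config)."""
    HTTP2 = True


class EngineClient:
    """Stand-in client that reads its settings once, at construction time."""

    def __init__(self, config: type[Config]) -> None:
        self.http2 = config.HTTP2  # value is copied here and never re-read


client = EngineClient(Config)   # analogous to middleware initialisation
Config.HTTP2 = False            # analogous to a late change in start_requests
print(client.http2)             # the client still holds the value it started with
```

The shared object does change, but nothing that was built from it is rebuilt, which is why the assignment has to happen before the middleware starts.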

You can also read any value programmatically:

config.get("DEFAULT_ENGINE")          # "scrapy"
config.get("MISSING_KEY", "default")  # "default"

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| DEFAULT_ENGINE | str | "scrapy" | Engine used when the request.meta["stealth"] key is absent |
| DEFAULT_PROFILE | str | "chrome_147" | Browser profile used when none is specified |
| DEFAULT_TIMEOUT | int | 30 | Request timeout in seconds |
| STEALTH_DRIVER | str | "basic" | Default driver: "basic", "turbo", or "browser". Also readable from Scrapy settings as STEALTH_DRIVER |
| HTTP2 | bool | True | HTTP/2 mode; overridable per request via meta["stealth"]["http2"] |
| BLOCK_CODES | frozenset[int] | {403, 429, 503} | HTTP status codes considered blocked |
| BLOCK_KEYWORDS | list[str] | ["captcha", "access denied", …] | Body-text patterns considered blocked |
| BROWSER_HEADLESS | bool | True | Browser driver: headless mode (False = visible window, more stealthy) |
| BROWSER_SETTLE_S | float | 4.0 | Browser driver: seconds to wait after navigation for JS to finish rendering |
| BROWSER_NO_SANDBOX | bool or None | None | Browser driver: disable the Chrome sandbox. None = auto-detect (enabled when running as root, e.g. Docker) |
| BROWSER_EXECUTABLE_PATH | str or None | None | Browser driver: path to the browser binary. None = auto-detect Chrome/Chromium. Set it to use Brave or a custom install (e.g. "/usr/bin/brave-browser") |
| BROWSER_MAX_TABS | int | 10 | Browser driver: max concurrent Chrome tabs across in-flight requests |
| BROWSER_RESTART_AFTER_BANS | int | 5 | Browser driver: restart Chrome (fresh fingerprint/cookies/CDP session) after this many consecutive banned/challenged responses. Any clean response resets the count |
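The configuration example extends BLOCK_CODES with |= rather than assigning to it. The snippet below uses plain Python sets to show the difference; the variable names are illustrative and nothing here touches the real config object.

```python
DEFAULT_BLOCK_CODES = frozenset({403, 429, 503})  # defaults from the table above

block_codes = DEFAULT_BLOCK_CODES
block_codes |= {407}         # builds a new frozenset: the defaults plus 407

replaced = frozenset({407})  # plain assignment would drop the defaults instead

print(sorted(block_codes))
print(sorted(replaced))
```

Because a frozenset is immutable, |= rebinds the name to a new set instead of modifying the old one, so the original defaults object is left untouched.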

For one-off overrides on a single request, set meta["stealth"]["driver"] or meta["stealth"]["http2"] (see Per-Request Configuration below).


⚙️ Per-Request Configuration

All options are passed via request.meta["stealth"].

The presence of meta["stealth"] (a dict) activates the stealth engine. Omit the key to use the default Scrapy engine. When STEALTH_ENABLED = True, all requests use stealth by default; pass meta={"stealth": False} to opt a specific request out.

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "turbo",
            "profile": "chrome_147",
            "proxy": "http://user:pass@proxy:8080",
            "stealth_timeout": 60,
            "http2": True,
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)

| Key | Type | Description |
| --- | --- | --- |
| driver | str | "basic", "turbo", or "browser"; overrides config.STEALTH_DRIVER for this request |
| profile | str | Browser profile (e.g. "chrome_147", "safari_ios_18_1_1") |
| proxy | str | Explicit proxy URL |
| stealth_timeout | int | Per-request timeout in seconds (overrides the default of 30 s) |
| http2 | bool | True = HTTP/2, False = HTTP/1.1 (overrides config.HTTP2 for this request) |
| rotate_proxy | bool | Auto-pick a proxy from STEALTH_PROXIES |
| rotate_profile | bool | Auto-pick a random browser profile |
| headless | bool | Browser driver only: True = headless, False = visible window (more stealthy) |
| settle | float | Browser driver only: seconds to wait for JS after navigation (default 4.0) |
| snapshot | bool | Browser driver only: capture a PNG snapshot; the result is available as response.meta["snapshot_content"] (bytes) |
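One way to picture how per-request keys interact with the package-wide defaults is a simple merge in which the request wins. The sketch below is an assumption about the precedence described in the table, not the library's code; Defaults and resolve_options are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defaults:
    """Stand-in for the package-wide defaults listed in Global Configuration."""
    driver: str = "basic"
    profile: str = "chrome_147"
    timeout: int = 30
    http2: bool = True


def resolve_options(stealth_meta: dict, defaults: Defaults = Defaults()) -> dict:
    """Hypothetical merge: per-request keys win, defaults fill the gaps."""
    return {
        "driver": stealth_meta.get("driver", defaults.driver),
        "profile": stealth_meta.get("profile", defaults.profile),
        "timeout": stealth_meta.get("stealth_timeout", defaults.timeout),
        "http2": stealth_meta.get("http2", defaults.http2),
    }
```

With this model, an empty meta["stealth"] dict resolves to the defaults, and {"driver": "turbo", "http2": False} changes only those two values.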

🖥️ Browser Engine

For sites protected by Cloudflare JS challenges or heavy JavaScript rendering, use the browser driver. It runs a real Chrome instance via the DevTools Protocol (no WebDriver), keeping one persistent browser and opening a new tab per request.

Per-request (most common):

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "headless": False,   # visible window, harder to detect (default: True)
            "settle": 4.0,       # seconds to wait for JS after page load
        }
    },
)

For heavy Cloudflare sites, increase the settle time:

meta={"stealth": {"driver": "browser", "headless": False, "settle": 12}}

Global default (all stealth requests use browser engine):

from scrapy_stealth.config import config

config.STEALTH_DRIVER   = "browser"
config.BROWSER_HEADLESS = False   # more stealthy
config.BROWSER_SETTLE_S = 6.0    # longer wait for JS

Custom browser binary (Brave, Chromium, or a non-default Chrome install):

from scrapy_stealth.config import config

config.BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser"  # Linux
# config.BROWSER_EXECUTABLE_PATH = r"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe"  # Windows

Or via settings.py / custom_settings:

BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser"

When BROWSER_EXECUTABLE_PATH is None (the default), scrapy-stealth auto-detects Google Chrome or Chromium from standard system paths. Set it explicitly when using Brave or a non-standard Chrome installation; a clear error is raised if the path does not exist.
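The lookup order described above (explicit path first, auto-detection otherwise) might be sketched as follows. Everything here is an assumption for illustration: resolve_browser is a hypothetical helper, the candidate names are guesses rather than the library's real detection list, and FileNotFoundError stands in for whatever error the library actually raises.

```python
import shutil
from pathlib import Path
from typing import Optional

# Assumed candidate names; the real detection list is not documented here.
CANDIDATES = ("google-chrome", "google-chrome-stable", "chromium", "chromium-browser")


def resolve_browser(executable_path: Optional[str] = None) -> Optional[str]:
    """Hypothetical resolver: an explicit path wins, otherwise search PATH."""
    if executable_path is not None:
        if not Path(executable_path).is_file():
            raise FileNotFoundError(f"browser binary not found: {executable_path}")
        return executable_path
    for name in CANDIDATES:
        found = shutil.which(name)
        if found:
            return found
    return None  # nothing detected; the caller decides how to report it
```

Failing early on a bad explicit path is the useful part of the pattern: a misconfigured binary is reported before any request is scheduled.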

Intelligent restart:

The browser engine restarts Chrome intelligently rather than on a fixed schedule: it restarts (getting a fresh fingerprint, cookies, and CDP session) only after BROWSER_RESTART_AFTER_BANS consecutive banned/challenged responses, as classified by the Anti-Bot Detection module (see below). A single clean response resets the streak, so a browser that is running cleanly is left running indefinitely; it is never restarted just because it has served many requests.

from scrapy_stealth.config import config

config.BROWSER_RESTART_AFTER_BANS = 5  # restart after 5 consecutive bans (default)
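The streak logic described above can be modelled with a small counter. This is a sketch of the stated rule only; BanStreak is a hypothetical class and says nothing about how the library tracks bans internally.

```python
class BanStreak:
    """Hypothetical counter for consecutive banned responses."""

    def __init__(self, restart_after: int = 5) -> None:
        self.restart_after = restart_after
        self.streak = 0

    def record(self, banned: bool) -> bool:
        """Record one response; return True when a restart is due."""
        if not banned:
            self.streak = 0          # any clean response resets the count
            return False
        self.streak += 1
        if self.streak >= self.restart_after:
            self.streak = 0          # a fresh browser starts with a clean slate
            return True
        return False
```

Four bans followed by one clean response leave the counter at zero, so only an unbroken run of five bans triggers a restart.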

Docker (running as root):

Chrome requires --no-sandbox when the process runs as root. scrapy-stealth detects this automatically, but you can also set it explicitly in settings.py:

BROWSER_NO_SANDBOX = True           # force no-sandbox (Docker, any root environment)
BROWSER_EXECUTABLE_PATH = "/usr/bin/chromium"  # use Chromium instead of Chrome in Docker

Or via config:

config.BROWSER_NO_SANDBOX = True
config.BROWSER_EXECUTABLE_PATH = "/usr/bin/chromium"
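The None = auto-detect behaviour of BROWSER_NO_SANDBOX can be approximated with a check of the effective user ID. The helper below is a hypothetical sketch under that assumption; the library's actual detection may differ.

```python
import os
from typing import Optional


def use_no_sandbox(setting: Optional[bool] = None) -> bool:
    """Hypothetical resolution of BROWSER_NO_SANDBOX (None = auto-detect)."""
    if setting is not None:
        return setting  # an explicit True/False always wins
    # os.geteuid only exists on Unix; treat other platforms as non-root.
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0
```

Inside a typical Docker container the process runs as root, so the auto-detected value is true there without any explicit setting.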

Performance note: the browser engine is slower than basic/turbo (~5-15s per page vs <2s). Use it selectively: route only JS-protected URLs to "browser" and keep everything else on "turbo".
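Selective routing can be as simple as a host lookup that builds the request meta. In this sketch the host names and the stealth_meta_for helper are hypothetical; only the "driver" and "settle" keys come from the options documented above.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts known to need a full browser.
JS_PROTECTED_HOSTS = {"shop.example.com", "tickets.example.org"}


def stealth_meta_for(url: str) -> dict:
    """Return request meta that picks a driver based on the URL's host."""
    host = urlsplit(url).hostname or ""
    if host in JS_PROTECTED_HOSTS:
        return {"stealth": {"driver": "browser", "settle": 4.0}}
    return {"stealth": {"driver": "turbo"}}
```

In a spider this would be used as yield scrapy.Request(url, meta=stealth_meta_for(url)), keeping the slow driver confined to the hosts that need it.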


📸 Screenshots

Capture a PNG screenshot of any page rendered by the browser driver and save it to disk.

Enable on the request

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "snapshot": True,
        }
    },
    callback=self.parse,
)

The raw PNG bytes are available at response.meta["snapshot_content"] inside your callback.

Auto-save with snapshot decorator

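import scrapy
from scrapy_stealth.decorators import snapshot


class MySpider(scrapy.Spider):
    name = "example"

    # Three alternative forms, shown on separate callbacks for comparison.

    @snapshot  # no arguments: the decorator chooses the save path
    def parse_default(self, response): ...

    @snapshot(path="stealth_shots/page.png")  # fixed file path
    def parse_fixed(self, response): ...

    @snapshot(path=lambda r: r.url.split("/")[-1] + ".png")  # path derived from the response
    def parse_dynamic(self, response): ...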

Note: Requires driver="browser" and snapshot=True in the request meta. Logs an error if no snapshot data is found in the response.

Custom handling (without the built-in helper)

The screenshot is just bytes in response.meta["snapshot_content"], so you can do anything you like with it:

def parse(self, response):
    shot: bytes | None = response.meta.get("snapshot_content")
    if shot is None:
        return  # screenshot was not requested or capture failed

    # Save manually
    with open("page.png", "wb") as f:
        f.write(shot)

    # Pass to a pipeline via item
    yield {"url": response.url, "screenshot": shot}
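When many pages are captured, a fixed name such as page.png is overwritten on every response. The sketch below shows one way to derive a per-URL file name and to sanity-check the bytes before writing them; both helpers are hypothetical and use only the standard library.

```python
import hashlib
from urllib.parse import urlsplit

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def snapshot_filename(url: str) -> str:
    """Build a filesystem-safe, per-URL file name (hypothetical helper)."""
    host = urlsplit(url).hostname or "page"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}-{digest}.png"


def looks_like_png(data: bytes) -> bool:
    """Cheap sanity check before writing snapshot bytes to disk."""
    return data.startswith(PNG_SIGNATURE)
```

Since the snapshot decorator accepts a callable for path, a helper like this could be passed as path=lambda r: snapshot_filename(r.url).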

🔁 Automatic Rotation

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)

🧩 Strategies

Proxy Rotation

from scrapy_stealth.strategies.proxy import ProxyRotator

proxy_rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "proxy": proxy_rotator.get(),
        }
    },
)
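The selection policy of ProxyRotator is not described here. As a point of reference, a plain round-robin rotator with the same get() shape looks like the sketch below; RoundRobinProxies is a stand-in written for this illustration, and the real class may choose proxies differently.

```python
import itertools


class RoundRobinProxies:
    """Stand-in rotator; the real ProxyRotator may choose proxies differently."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def get(self) -> str:
        """Return the next proxy, wrapping around at the end of the list."""
        return next(self._cycle)
```

Round-robin spreads load evenly across the pool, which is a reasonable baseline when all proxies are of similar quality.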

Fingerprint Rotation

from scrapy_stealth.strategies.fingerprint import ProfileRotator

fp = ProfileRotator()

yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "profile": fp.get(),
        }
    },
)
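Random profile selection, as rotate_profile is described, can be sketched with the standard random module. The helper below is hypothetical, and PROFILES contains only the two profile names that appear in this README, not the library's full list.

```python
import random
from typing import Optional

# The two profile names that appear in this README; the full list is not shown here.
PROFILES = ["chrome_147", "safari_ios_18_1_1"]


def pick_profile(rng: random.Random, last: Optional[str] = None) -> str:
    """Pick a random profile, avoiding an immediate repeat when possible."""
    choices = [p for p in PROFILES if p != last] or PROFILES
    return rng.choice(choices)
```

Passing in a random.Random instance rather than using the module-level functions makes the choice reproducible in tests when a seed is supplied.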

Intelligent Retry

from scrapy_stealth.strategies.retry import RetryHandler

retry = RetryHandler()


def parse(self, response):
    if retry.should_retry(response):
        yield retry.build(response.request)
        return
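What RetryHandler does internally is not documented here. A common pattern for this kind of retry is to cap the number of attempts and change identity on each one, as in the sketch below. The helper and the "stealth_retries" counter key are hypothetical; only rotate_proxy and rotate_profile are documented options.

```python
from typing import Optional


def next_retry_meta(meta: dict, max_retries: int = 3) -> Optional[dict]:
    """Return meta for the next attempt, or None once the budget is spent.

    "stealth_retries" is a hypothetical counter key, not one the library documents.
    """
    attempts = meta.get("stealth_retries", 0)
    if attempts >= max_retries:
        return None
    stealth = dict(meta.get("stealth") or {})
    stealth.update(rotate_proxy=True, rotate_profile=True)  # change identity on retry
    return {**meta, "stealth": stealth, "stealth_retries": attempts + 1}
```

In a callback, a non-None result could be applied with response.request.replace(meta=new_meta, dont_filter=True) so that Scrapy's duplicate filter does not drop the repeated URL.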

🛡️ Anti-Bot Detection

from scrapy_stealth.detectors.antibot import AntiBotDetector

detector = AntiBotDetector()

if detector.is_blocked(response):
    print("Blocked!")
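A status-plus-content check of the kind listed under Features can be sketched in a few lines. This is not AntiBotDetector's implementation: is_blocked is a hypothetical function, and the constants repeat only the defaults named in Global Configuration.

```python
BLOCK_CODES = frozenset({403, 429, 503})       # defaults from Global Configuration
BLOCK_KEYWORDS = ["captcha", "access denied"]  # the two defaults the README names


def is_blocked(status: int, body: str) -> bool:
    """Hypothetical status + content check, case-insensitive on the body."""
    if status in BLOCK_CODES:
        return True
    text = body.lower()
    return any(keyword in text for keyword in BLOCK_KEYWORDS)
```

The content check matters because some protections answer with HTTP 200 and a challenge page, which a status-only check would treat as success.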

📊 Example

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={
                "stealth": {
                    "rotate_proxy": True,
                    "rotate_profile": True,
                }
            },
        )

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }

⚡ Performance Insight

Using stealth selectively gives you:

  • ⚡ Faster crawling (Scrapy for simple pages)
  • 💰 Lower proxy cost
  • 🛡️ Better success rate on protected pages

📜 Changelog

See CHANGELOG.md for a full history of changes, or browse GitHub Releases.


🤝 Contributing

See CONTRIBUTING.md for guidelines on how to contribute.


📄 License

This project is licensed under the MIT License: free to use, modify, and distribute. See LICENSE for the full text.
