Stealthy Crawling. Maximum Results. A pluggable anti-bot and stealth framework for Scrapy.
scrapy-stealth
scrapy-stealth extends Scrapy with browser impersonation, proxy rotation, fingerprint cycling, and intelligent retry strategies. It is designed for large-scale, production-grade crawling.
Why scrapy-stealth?
Scrapy is fast and powerful, but modern websites use advanced anti-bot protections such as:
- TLS fingerprinting
- Browser behavior detection
- Rate limiting and IP blocking
scrapy-stealth helps by adding:
- Browser-level impersonation (TLS + HTTP/2 fingerprints)
- Smarter retry strategies
- Proxy and fingerprint rotation
- Anti-bot detection
Result
- Higher success rate
- Lower proxy cost
- More stable crawls
Comparison
| Feature | scrapy-stealth | scrapy-impersonate | scrapy-playwright | scrapy-splash | Scrapy (default) |
|---|---|---|---|---|---|
| TLS fingerprint spoofing | โ | โ | โ | โ | โ |
| HTTP/2 support | โ | โ | โ | โ | โ |
| Browser impersonation | โ | โ | โ ๏ธ partial | โ | โ |
| Proxy rotation (built-in) | โ | โ | โ | โ | โ |
| Fingerprint rotation | โ | โ | โ | โ | โ |
| Anti-bot detection | โ | โ | โ | โ | โ |
| Smart retry logic | โ | โ | โ | โ | โ |
| Per-request engine switching | โ | โ | โ | โ | โ |
| Headless browser required | โ | โ | โ | โ | โ |
| JavaScript rendering | ๏ธโ | โ | โ | โ | โ |
| Screenshot / snapshot | โ | โ | โ | โ | โ |
| Native Scrapy integration | โ | โ | โ | โ | โ |
| Memory footprint | Low | Low | High | High | Low |
Note: scrapy-playwright passes real browser TLS but does not spoof fingerprint profiles like scrapy-stealth does. scrapy-impersonate provides TLS/HTTP2 impersonation via curl_cffi but lacks built-in rotation, detection, or per-request engine switching. JavaScript rendering is available via the optional browser driver; use it selectively for pages that require a full browser.
Features
- Pluggable engine system (scrapy, stealth)
- Per-request engine selection via request.meta
- Proxy support and rotation
- Browser fingerprint rotation
- Smart retry logic
- Anti-bot detection (status + content-based, Cloudflare, Akamai)
- Thread-safe async integration
- Real-browser engine (CDP) for JS-heavy pages
- Built-in snapshot decorator (scrapy_stealth.decorators.snapshot)
Installation
pip install scrapy-stealth
Requires Python 3.11+ and Scrapy 2.12-2.x
Setup
Option 1: Global (settings.py)
# 1. Enable the middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_stealth.StealthDownloaderMiddleware": 950,
}
# 2. (Optional) Route ALL requests through stealth automatically; no meta needed per request
STEALTH_ENABLED = True
STEALTH_DRIVER = "turbo"  # "basic" (default), "turbo", or "browser"
# 3. (Optional) Proxy list for automatic rotation
# Used when rotate_proxy=True (per-request) or when STEALTH_ENABLED=True with rotate_proxy
# Supported schemes: http, https, socks4, socks5
STEALTH_PROXIES = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080",  # with authentication
    "socks5://proxy4:1080",
]
Option 2: Per-spider (custom_settings)
Configure the middleware and all stealth settings directly on the spider; no changes to settings.py are required.
import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_stealth.StealthDownloaderMiddleware": 950,
        },
        "STEALTH_ENABLED": True,
        "STEALTH_DRIVER": "turbo",
        "STEALTH_PROXIES": [
            "http://proxy1:8080",
            "http://user:pass@proxy2:8080",
            "socks5://proxy3:1080",
        ],
    }
Proxies are validated at startup: an invalid format or unsupported scheme raises ValueError immediately.
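The startup check described above can be pictured with a small standard-library sketch. This is not scrapy-stealth's actual code: validate_proxy and validate_proxies are hypothetical helpers, and the exact rules the library enforces may differ. The sketch only assumes the four schemes listed in the settings comment and that a proxy URL needs a host.

```python
from urllib.parse import urlsplit

# Schemes listed as supported in the settings comment above.
SUPPORTED_SCHEMES = {"http", "https", "socks4", "socks5"}


def validate_proxy(url: str) -> str:
    """Return the URL unchanged if it looks usable, else raise ValueError.

    Hypothetical helper for illustration; not part of the scrapy-stealth API.
    """
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme in {url!r}")
    if not parts.hostname:
        raise ValueError(f"proxy URL has no host: {url!r}")
    return url


def validate_proxies(urls: list[str]) -> list[str]:
    # Fail fast: the first bad entry raises before any request is sent.
    return [validate_proxy(u) for u in urls]
```

Running a check like this on your own proxy list before launching a crawl surfaces typos early instead of at middleware start-up.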
Quick Start
Option A: Per-request (stealth only on specific requests):
yield scrapy.Request(
    url="https://example.com",
    meta={"stealth": {}},
)
Option B: Global mode (stealth on every request automatically):
# settings.py or custom_settings
STEALTH_ENABLED = True
STEALTH_DRIVER = "turbo"
# No meta needed; all requests go through stealth
yield scrapy.Request(url="https://example.com")
# Opt out for a specific request
yield scrapy.Request(url="https://api.internal/health", meta={"stealth": False})
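The two options above boil down to a short decision rule. The function below is only a sketch of that rule as this README describes it; uses_stealth is a hypothetical name, not a library function, and it deliberately ignores config.DEFAULT_ENGINE.

```python
def uses_stealth(meta: dict, stealth_enabled: bool = False) -> bool:
    """Decide whether a request would be routed through the stealth engine.

    Illustrative only: mirrors the rules stated in this README, not the
    middleware's real implementation.
    """
    value = meta.get("stealth")
    if value is False:
        return False        # explicit opt-out always wins
    if isinstance(value, dict):
        return True         # a dict, even an empty one, opts the request in
    return stealth_enabled  # key absent: fall back to the global flag


print(uses_stealth({"stealth": {}}))                              # True
print(uses_stealth({}, stealth_enabled=True))                     # True
print(uses_stealth({"stealth": False}, stealth_enabled=True))     # False
```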
Global Configuration
Customise package-wide defaults via the shared config instance.
All settings must be applied at module level, before the spider class is defined. The engine client is
created at middleware initialisation, so changes made inside start_requests or parse have no effect.
# myspider.py
import scrapy
from scrapy_stealth.config import config
config.DEFAULT_ENGINE = "stealth" # "scrapy" (native) or "stealth" (browser impersonation)
config.DEFAULT_PROFILE = "chrome_147" # browser profile when meta["stealth"]["profile"] is not set
config.DEFAULT_TIMEOUT = 30 # stealth request timeout in seconds
config.STEALTH_DRIVER = "turbo" # "basic" (default), "turbo", or "browser"
config.HTTP2 = True # False for servers that only support HTTP/1.1
config.BLOCK_CODES |= {407} # extend blocked status codes (|= keeps defaults)
config.BLOCK_KEYWORDS.append("banned") # extend blocked body-text patterns
config.BROWSER_HEADLESS = True # browser driver: headless mode (False = visible window, more stealthy)
config.BROWSER_SETTLE_S = 4.0 # browser driver: seconds to wait after navigation for JS to finish
config.BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser" # custom browser binary (default: auto-detect Chrome)
class MySpider(scrapy.Spider):
    name = "example"
    ...

# Wrong: too late, the engine client is already created
class MySpider(scrapy.Spider):
    def start_requests(self):
        config.HTTP2 = False  # has no effect
        ...
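Why late changes are ignored is easier to see in a toy model. The sketch below does not use scrapy-stealth at all: Config and EngineClient are stand-in classes, and it assumes (as the text above suggests) that the client copies its settings once, when it is constructed.

```python
from dataclasses import dataclass


@dataclass
class Config:
    HTTP2: bool = True


config = Config()


class EngineClient:
    """Stand-in for a client that reads the config once, when it is built."""

    def __init__(self, cfg: Config) -> None:
        self.http2 = cfg.HTTP2  # the value is copied here, not looked up later


config.HTTP2 = False           # module level: applied before the client exists
client = EngineClient(config)  # roughly what happens at middleware initialisation
config.HTTP2 = True            # too late: the client already copied the old value

print(client.http2)  # False
```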
You can also read any value programmatically:
config.get("DEFAULT_ENGINE") # "scrapy"
config.get("MISSING_KEY", "default") # "default"
| Attribute | Type | Default | Description |
|---|---|---|---|
| DEFAULT_ENGINE | str | "scrapy" | Engine used when the request.meta["stealth"] key is absent |
| DEFAULT_PROFILE | str | "chrome_147" | Browser profile used when none is specified |
| DEFAULT_TIMEOUT | int | 30 | Request timeout in seconds |
| STEALTH_DRIVER | str | "basic" | Default driver: "basic", "turbo", or "browser". Also readable from Scrapy settings as STEALTH_DRIVER |
| HTTP2 | bool | True | HTTP/2 mode; overridable per-request via meta["stealth"]["http2"] |
| BLOCK_CODES | frozenset[int] | {403, 429, 503} | HTTP status codes considered blocked |
| BLOCK_KEYWORDS | list[str] | ["captcha", "access denied", …] | Body-text patterns considered blocked |
| BROWSER_HEADLESS | bool | True | Browser driver: headless mode (False = visible window, more stealthy) |
| BROWSER_SETTLE_S | float | 4.0 | Browser driver: seconds to wait after navigation for JS to finish rendering |
| BROWSER_NO_SANDBOX | bool \| None | None | Browser driver: disable Chrome sandbox. None = auto-detect (enabled when running as root, e.g. Docker) |
| BROWSER_EXECUTABLE_PATH | str \| None | None | Browser driver: path to the browser binary. None = auto-detect Chrome/Chromium. Set to use Brave or a custom install (e.g. "/usr/bin/brave-browser") |
For one-off overrides on a single request, set meta["stealth"]["driver"] or meta["stealth"]["http2"] (see Per-Request Configuration below).
Per-Request Configuration
All options are passed via request.meta["stealth"].
The presence of meta["stealth"] (a dict) activates the stealth engine. Omit the key to use the default Scrapy engine.
When STEALTH_ENABLED = True, all requests are stealth by default; pass meta={"stealth": False} to opt out for a specific request.
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "turbo",
            "profile": "chrome_147",
            "proxy": "http://user:pass@proxy:8080",
            "stealth_timeout": 60,
            "http2": True,
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)
| Key | Type | Description |
|---|---|---|
| driver | str | "basic", "turbo", or "browser"; overrides config.STEALTH_DRIVER per-request |
| profile | str | Browser profile (e.g. "chrome_147", "safari_ios_18_1_1") |
| proxy | str | Explicit proxy URL |
| stealth_timeout | int | Per-request timeout in seconds (overrides default 30s) |
| http2 | bool | True = HTTP/2, False = HTTP/1.1 (overrides config.HTTP2 for this request) |
| rotate_proxy | bool | Auto-pick a proxy from STEALTH_PROXIES |
| rotate_profile | bool | Auto-pick a random browser profile |
| headless | bool | Browser driver only: True = headless, False = visible window (more stealthy) |
| settle | float | Browser driver only: seconds to wait for JS after navigation (default 4.0) |
| snapshot | bool | Browser driver only: capture a PNG snapshot; the result is available as response.meta["snapshot_content"] (bytes) |
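Several keys in the table are described as overriding a package-wide default. A minimal sketch of that precedence, assuming a plain "per-request value wins" merge, is shown below. effective_options is a hypothetical helper and DEFAULTS simply restates the documented default values; the middleware's own resolution logic may be more involved.

```python
# Documented defaults, restated here for the example.
DEFAULTS = {
    "driver": "basic",        # config.STEALTH_DRIVER
    "profile": "chrome_147",  # config.DEFAULT_PROFILE
    "stealth_timeout": 30,    # config.DEFAULT_TIMEOUT
    "http2": True,            # config.HTTP2
}


def effective_options(stealth_meta: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge per-request options over the defaults (hypothetical helper)."""
    # Later keys win, so anything set in meta["stealth"] replaces the default.
    return {**defaults, **stealth_meta}


opts = effective_options({"driver": "turbo", "stealth_timeout": 60})
print(opts["driver"], opts["stealth_timeout"], opts["http2"])  # turbo 60 True
```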
Browser Engine
For sites protected by Cloudflare JS challenges or heavy JavaScript rendering, use the browser driver.
It runs a real Chrome instance via the DevTools Protocol (no WebDriver), keeping one persistent browser
and opening a new tab per request.
Per-request (most common):
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "headless": False,  # visible window, harder to detect (default: True)
            "settle": 4.0,      # seconds to wait for JS after page load
        }
    },
)
For heavy Cloudflare sites, increase the settle time:
meta={"stealth": {"driver": "browser", "headless": False, "settle": 12}}
Global default (all stealth requests use browser engine):
from scrapy_stealth.config import config
config.STEALTH_DRIVER = "browser"
config.BROWSER_HEADLESS = False # more stealthy
config.BROWSER_SETTLE_S = 6.0 # longer wait for JS
Custom browser binary (Brave, Chromium, or a non-default Chrome install):
from scrapy_stealth.config import config
config.BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser" # Linux
# config.BROWSER_EXECUTABLE_PATH = r"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe" # Windows
Or via settings.py / custom_settings:
BROWSER_EXECUTABLE_PATH = "/usr/bin/brave-browser"
When BROWSER_EXECUTABLE_PATH is None (the default), scrapy-stealth auto-detects Google Chrome or Chromium from standard system paths. Set it explicitly when using Brave or a non-standard Chrome installation; a clear error is raised if the path does not exist.
Docker (running as root):
Chrome requires --no-sandbox when the process runs as root. scrapy-stealth detects this automatically,
but you can also set it explicitly in settings.py:
BROWSER_NO_SANDBOX = True # force no-sandbox (Docker, any root environment)
BROWSER_EXECUTABLE_PATH = "/usr/bin/chromium" # use Chromium instead of Chrome in Docker
Or via config:
config.BROWSER_NO_SANDBOX = True
config.BROWSER_EXECUTABLE_PATH = "/usr/bin/chromium"
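The None = auto-detect behaviour of BROWSER_NO_SANDBOX can be approximated as follows. This is a guess at the shape of the check, not the library's code: resolve_no_sandbox is a hypothetical name, and the root test via os.geteuid is an assumption (that function does not exist on Windows, hence the getattr guard).

```python
import os
from typing import Optional


def resolve_no_sandbox(setting: Optional[bool]) -> bool:
    """Return True when Chrome should be started with --no-sandbox.

    Hypothetical helper: an explicit True/False is respected, None means
    "decide from the current user".
    """
    if setting is not None:
        return setting
    geteuid = getattr(os, "geteuid", None)  # missing on Windows
    return geteuid is not None and geteuid() == 0


print(resolve_no_sandbox(True))   # True
print(resolve_no_sandbox(False))  # False
```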
Performance note: the browser engine is slower than basic/turbo (~5-15s per page vs <2s). Use it selectively: route only JS-protected URLs to "browser" and keep everything else on "turbo".
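One way to apply the selective routing suggested in the performance note is to pick the driver from the URL's host. The sketch below is an assumption-laden example: JS_PROTECTED_HOSTS and stealth_meta_for are hypothetical names, and the host list and settle value are placeholders. Only the meta keys (driver, settle) come from this README.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts known to need a real browser.
JS_PROTECTED_HOSTS = {"shop.example.com"}


def stealth_meta_for(url: str) -> dict:
    """Build request meta: browser driver for protected hosts, turbo otherwise."""
    host = urlsplit(url).hostname or ""
    if host in JS_PROTECTED_HOSTS:
        return {"stealth": {"driver": "browser", "settle": 6.0}}
    return {"stealth": {"driver": "turbo"}}


print(stealth_meta_for("https://shop.example.com/item/1"))
print(stealth_meta_for("https://example.com/about"))
```

In a spider this would be used as scrapy.Request(url, meta=stealth_meta_for(url)).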
Screenshots
Capture a PNG screenshot of any page rendered by the browser driver and save it to disk.
Enable on the request
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "driver": "browser",
            "snapshot": True,
        }
    },
    callback=self.parse,
)
The raw PNG bytes are available at response.meta["snapshot_content"] inside your callback.
Auto-save with snapshot decorator
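import scrapy
from scrapy_stealth.decorators import snapshot

class MySpider(scrapy.Spider):
    # Three alternative forms; use only one of them on your callback.

    @snapshot  # bare decorator, no arguments
    def parse(self, response): ...

    @snapshot(path="stealth_shots/page.png")  # fixed output path
    def parse(self, response): ...

    @snapshot(path=lambda r: r.url.split("/")[-1] + ".png")  # path computed from the response
    def parse(self, response): ...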
Note: Requires driver="browser" and snapshot=True in the request meta. Logs an error if no snapshot data is found in the response.
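The path argument accepts either a string or a callable that receives the response. The sketch below shows how such an argument could be resolved and the bytes written out. It is not the decorator's implementation: resolve_snapshot_path and save_snapshot are hypothetical helpers, and the only detail taken from this README is that the PNG bytes live in response.meta["snapshot_content"].

```python
from pathlib import Path
from typing import Callable, Optional, Union

PathSpec = Union[str, Callable[[object], str]]


def resolve_snapshot_path(path: PathSpec, response) -> Path:
    """Turn a fixed string or a response -> str callable into a Path."""
    target = path(response) if callable(path) else path
    return Path(target)


def save_snapshot(path: PathSpec, response) -> Optional[Path]:
    """Write the snapshot bytes to disk; return None when there is no snapshot."""
    data = response.meta.get("snapshot_content")
    if data is None:
        return None
    target = resolve_snapshot_path(path, response)
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    target.write_bytes(data)
    return target
```

With the lambda from the example above, a response for https://example.com/products/42 would resolve to the file name 42.png.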
Custom handling (without the built-in helper)
The screenshot is just bytes in response.meta["snapshot_content"], so you can do anything you like with it:
def parse(self, response):
    shot: bytes | None = response.meta.get("snapshot_content")
    if shot is None:
        return  # screenshot was not requested or capture failed
    # Save manually
    with open("page.png", "wb") as f:
        f.write(shot)
    # Pass to a pipeline via item
    yield {"url": response.url, "screenshot": shot}
Automatic Rotation
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "rotate_proxy": True,
            "rotate_profile": True,
        }
    },
)
Strategies
Proxy Rotation
from scrapy_stealth.strategies.proxy import ProxyRotator
proxy_rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "proxy": proxy_rotator.get(),
        }
    },
)
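This README does not say which selection policy ProxyRotator uses. As a point of reference, a round-robin policy, one common choice, can be written in a few lines. RoundRobinProxies below is a hypothetical stand-in, not the library class; it only shares the get() method name.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinProxies:
    """Hand out proxies in a fixed, repeating order (illustrative only)."""

    def __init__(self, proxies: Iterable[str]) -> None:
        items = list(proxies)
        if not items:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(items)

    def get(self) -> str:
        return next(self._cycle)


rotator = RoundRobinProxies(["http://proxy1:8080", "http://proxy2:8080"])
print([rotator.get() for _ in range(3)])
# ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy1:8080']
```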
Fingerprint Rotation
from scrapy_stealth.strategies.fingerprint import ProfileRotator
fp = ProfileRotator()
yield scrapy.Request(
    url,
    meta={
        "stealth": {
            "profile": fp.get(),
        }
    },
)
Intelligent Retry
from scrapy_stealth.strategies.retry import RetryHandler
retry = RetryHandler()
def parse(self, response):
    if retry.should_retry(response):
        yield retry.build(response.request)
        return
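What RetryHandler does internally is not documented here. The sketch below shows one plausible shape for a bounded retry: count attempts in the request meta and switch on rotation for the next try. build_retry_meta and the retry_count key are hypothetical; only rotate_proxy and rotate_profile are real keys from this README.

```python
from typing import Optional


def build_retry_meta(meta: dict, max_retries: int = 3) -> Optional[dict]:
    """Return meta for the next attempt, or None once the budget is spent.

    Hypothetical helper; "retry_count" is an invented bookkeeping key.
    """
    attempts = meta.get("retry_count", 0)
    if attempts >= max_retries:
        return None
    # Copy so the original request's meta is left untouched.
    stealth = dict(meta.get("stealth") or {})
    stealth.update({"rotate_proxy": True, "rotate_profile": True})
    return {**meta, "stealth": stealth, "retry_count": attempts + 1}


print(build_retry_meta({"stealth": {"driver": "turbo"}}))
print(build_retry_meta({"retry_count": 3}))  # None
```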
Anti-Bot Detection
from scrapy_stealth.detectors.antibot import AntiBotDetector
detector = AntiBotDetector()
if detector.is_blocked(response):
    print("Blocked!")
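Status-plus-content detection, as configured through BLOCK_CODES and BLOCK_KEYWORDS, can be sketched with plain Python. This is not AntiBotDetector's implementation: looks_blocked is a hypothetical function, the keyword list is shortened to the two documented entries, and case-insensitive matching is an assumption.

```python
BLOCK_CODES = frozenset({403, 429, 503})
BLOCK_KEYWORDS = ["captcha", "access denied"]


def looks_blocked(status: int, body: str,
                  codes=BLOCK_CODES, keywords=BLOCK_KEYWORDS) -> bool:
    """Flag a response as blocked by status code or by body text."""
    if status in codes:
        return True
    lowered = body.lower()  # assumption: matching ignores case
    return any(keyword in lowered for keyword in keywords)


# A frozenset is immutable, so "|=" builds a new set and rebinds the name,
# which is why config.BLOCK_CODES |= {407} keeps the defaults.
extended = BLOCK_CODES | {407}

print(looks_blocked(200, "<h1>Access Denied</h1>"))  # True
print(looks_blocked(407, "", codes=extended))        # True
print(looks_blocked(200, "<title>Welcome</title>"))  # False
```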
Example
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={
                "stealth": {
                    "rotate_proxy": True,
                    "rotate_profile": True,
                }
            },
        )

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "url": response.url,
        }
Performance Insight
Using stealth selectively gives you:
- Faster crawling (Scrapy for simple pages)
- Lower proxy cost
- Better success rate on protected pages
Changelog
See CHANGELOG.md for a full history of changes, or browse GitHub Releases.
Contributing
See CONTRIBUTING.md for guidelines on how to contribute.
License
This project is licensed under the MIT License: free to use, modify, and distribute. See LICENSE for the full text.