Composable Scrapy middlewares: API/session headers, debug, captcha polling/webhook, proxy rotation, smart retry.
scrapyx-mw
Composable Scrapy middlewares packaged for reuse across projects:
- API / Session headers injection
- Debug logging
- Captcha solving: polling (via 2captcha or CapSolver) or webhook (with a sidecar SQLite store)
- Proxy rotation with health checking
- Smart retry with exponential backoff
This package is compatible with projects that store per-spider config under
SERVICES[SPIDER_NAME_UPPER] (e.g., CAPTCHA_REQUIRED, SITE_KEY, HEADERS),
as used in your compliance_scraper project.
📚 Documentation
- Full Documentation - Complete documentation index
- Deployment Guide - Production deployment instructions
- Migration Guide - Migrate existing projects
Install (with uv)
From a project that consumes this plugin:
uv add -e ./libs/scrapyx-mw
# or publish and: uv add scrapyx-mw
🚀 Features
Core Middlewares
- Session Management: Automatic session handling with configurable headers
- API Request: Inject API-specific headers from spider configuration
- Debug Logging: Comprehensive request/response logging for development
Captcha Solving
- Multi-Provider Support: 2captcha and CapSolver providers
- Polling Mode: Traditional polling-based captcha solving
- Webhook Mode: Webhook-based captcha solving with fallback to polling
- Configurable Timeouts: Customizable polling intervals and timeouts
Production Hardening
- Telemetry: Track captcha solve attempts, success rates, and costs
- Guardrails: Rate limiting, budget controls, and circuit breaker
- Log Redaction: Automatically redact sensitive information from logs
- Configuration Validation: Fail-fast validation of captcha settings
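As an illustration of the log redaction idea, here is a minimal sketch of a logging filter that masks secret-looking query parameters. The regular expression, the redact helper, and the RedactingFilter class are assumptions made for this example and are not taken from the package.

```python
import logging
import re

# Hypothetical pattern: matches key=..., apikey=..., clientKey=..., token=... pairs.
_SECRET_RE = re.compile(r"(?i)\b(api_?key|client_?key|key|token)=([^&\s]+)")


def redact(text: str) -> str:
    """Replace the value of secret-looking parameters with ***."""
    return _SECRET_RE.sub(lambda m: f"{m.group(1)}=***", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts secrets from the formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style args are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()  # the message is already formatted
        return True
```

A filter like this could be attached with logging.getLogger("scrapy").addFilter(RedactingFilter()); whether the package hooks in at that level is not stated in this README.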
Advanced Features
- Proxy Rotation: Intelligent proxy rotation with health checking
- Smart Retry: Exponential backoff with jitter and circuit breaker
- Scrapy Add-on: One-switch enablement of all features
Enable in your Scrapy project
In your project settings.py:
from scrapyx_mw import default_config, apply_downloader_middlewares, apply_spider_middlewares
# Toggle features here
SCRAPYX = default_config(
    api_request=True,
    session=True,
    debug=False,
    captcha="none",  # "none" | "polling" | "webhook"
    captcha_enabled=False,  # also controlled by env var CAPTCHA_ENABLED
    captcha_api_key="",  # required if captcha_enabled
    captcha_webhook_url="http://127.0.0.1:6801/webhook",
    session_headers={"Accept": "application/json"},
)
DOWNLOADER_MIDDLEWARES.update(apply_downloader_middlewares(globals(), SCRAPYX))
SPIDER_MIDDLEWARES.update(apply_spider_middlewares(globals(), SCRAPYX))
# Per-spider service configs, same shape as in your compliance_scraper:
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxx...your-site-key...",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"},
    }
}
Note: Your spiders can keep using spider.service_config, spider.captcha_needed, and spider.site_key (as in your BaseComplianceSpider); this plugin is designed to read those fields.
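To make the lookup concrete, here is a minimal sketch of how a spider could expose those three fields from SERVICES[SPIDER_NAME_UPPER]. The ServiceConfigMixin class is hypothetical and does not import Scrapy; a real base spider would also subclass scrapy.Spider and read SERVICES from the project settings.

```python
from typing import Any, Dict, Optional

SERVICES: Dict[str, Dict[str, Any]] = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxx...your-site-key...",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"},
    }
}


class ServiceConfigMixin:
    """Hypothetical mixin exposing the fields the middlewares read."""

    name = "example_service"
    services = SERVICES  # stand-in for settings["SERVICES"]

    @property
    def service_config(self) -> Dict[str, Any]:
        # The key is the spider name in upper case.
        return self.services.get(self.name.upper(), {})

    @property
    def captcha_needed(self) -> bool:
        return bool(self.service_config.get("CAPTCHA_REQUIRED", False))

    @property
    def site_key(self) -> Optional[str]:
        return self.service_config.get("SITE_KEY")
```

A spider whose name has no entry in SERVICES simply gets an empty config, so captcha_needed is False and site_key is None.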
Captcha options
- Polling (captcha="polling"): middleware submits & polls 2captcha or CapSolver.
- Webhook (captcha="webhook"): middleware submits with a callbackUrl and waits for the sidecar webhook receiver to store codes in SQLite for pickup.
Supported Providers
- 2captcha: Traditional polling-based captcha solving
- CapSolver: Modern API-based captcha solving with JSON endpoints
Provider Configuration
# For 2captcha
CAPTCHA_PROVIDER = "2captcha"
CAPTCHA_API_KEY = "your-2captcha-key"
CAPTCHA_2CAPTCHA_BASE = "https://2captcha.com" # optional
CAPTCHA_2CAPTCHA_METHOD = "userrecaptcha" # optional
# For CapSolver
CAPTCHA_PROVIDER = "capsolver"
CAPTCHA_API_KEY = "your-capsolver-key"
CAPTCHA_CAPSOLVER_BASE = "https://api.capsolver.com" # optional
CAPTCHA_CAPSOLVER_TASK_TYPE = "ReCaptchaV2TaskProxyLess" # optional
Webhook sidecar
Run the included receiver next to Scrapyd (bind to 127.0.0.1 by default):
uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: GET http://127.0.0.1:6801/health
# Webhook: POST from 2captcha to /webhook (id, code)
DB file path: /var/lib/scrapyd/webhook_solutions.db (same as your original project for drop-in compatibility).
Note: CapSolver webhook support is limited - it falls back to polling behavior in webhook mode.
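The store-and-pickup handoff between the receiver and the middleware can be sketched with the standard sqlite3 module. The table name, columns, and function names below are hypothetical; the real webhook_solutions.db schema is not shown in this README, and applying a TTL to stored codes is an assumption.

```python
import sqlite3
import time
from typing import Optional

# Hypothetical schema; the real webhook_solutions.db layout may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS solutions (
    captcha_id TEXT PRIMARY KEY,
    code       TEXT NOT NULL,
    created_at REAL NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_solution(conn: sqlite3.Connection, captcha_id: str, code: str,
                   now: Optional[float] = None) -> None:
    """Receiver side: persist the (id, code) pair posted by the provider."""
    ts = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO solutions VALUES (?, ?, ?)",
            (captcha_id, code, ts),
        )


def fetch_solution(conn: sqlite3.Connection, captcha_id: str,
                   ttl_s: float = 110.0, now: Optional[float] = None) -> Optional[str]:
    """Middleware side: return the code if it exists and is younger than ttl_s."""
    ts = time.time() if now is None else now
    row = conn.execute(
        "SELECT code, created_at FROM solutions WHERE captcha_id = ?",
        (captcha_id,),
    ).fetchone()
    if row is None or ts - row[1] > ttl_s:
        return None
    return row[0]
```

The 110 second default mirrors CAPTCHA_TOKEN_TTL_SECONDS; whether the package applies that TTL at read time is not confirmed here.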
Environment variables
CAPTCHA_ENABLED=false
CAPTCHA_API_KEY=
CAPTCHA_PROVIDER=2captcha
CAPTCHA_WEBHOOK_URL=http://127.0.0.1:6801/webhook
CAPTCHA_TOKEN_TTL_SECONDS=110
CAPTCHA_POLL_INITIAL_S=4.0
CAPTCHA_POLL_MAX_S=45.0
CAPTCHA_POLL_MAX_TIME_S=180.0
CAPTCHA_HTTP_TIMEOUT_S=15.0
CAPTCHA_HTTP_RETRIES=2
# 2captcha specific
CAPTCHA_2CAPTCHA_BASE=https://2captcha.com
CAPTCHA_2CAPTCHA_METHOD=userrecaptcha
# CapSolver specific
CAPTCHA_CAPSOLVER_BASE=https://api.capsolver.com
CAPTCHA_CAPSOLVER_TASK_TYPE=ReCaptchaV2TaskProxyLess
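To show how the three polling knobs could interact, the sketch below generates a sequence of sleep intervals from CAPTCHA_POLL_INITIAL_S, CAPTCHA_POLL_MAX_S, and CAPTCHA_POLL_MAX_TIME_S. The doubling growth factor and both function names are assumptions; the README does not say how the package grows the interval.

```python
from typing import Iterator, Mapping, Tuple


def poll_delays(initial_s: float, max_s: float, max_time_s: float,
                factor: float = 2.0) -> Iterator[float]:
    """Yield sleep intervals that grow by `factor`, capped at max_s,
    stopping before the cumulative wait would exceed max_time_s."""
    if initial_s <= 0:
        raise ValueError("initial_s must be positive")
    delay, elapsed = initial_s, 0.0
    while elapsed + delay <= max_time_s:
        yield delay
        elapsed += delay
        delay = min(delay * factor, max_s)


def poll_settings(env: Mapping[str, str]) -> Tuple[float, float, float]:
    """Read the three polling knobs, falling back to the defaults listed above."""
    return (
        float(env.get("CAPTCHA_POLL_INITIAL_S", "4.0")),
        float(env.get("CAPTCHA_POLL_MAX_S", "45.0")),
        float(env.get("CAPTCHA_POLL_MAX_TIME_S", "180.0")),
    )


# With the defaults: 4, 8, 16, 32, 45, 45 seconds (150 s in total).
print(list(poll_delays(*poll_settings({}))))
```

Passing os.environ to poll_settings would pick up real overrides; an empty mapping is used here so the output does not depend on the environment.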
Enable via Scrapy Add-on (recommended)
Instead of editing middleware dicts manually, turn on the package with one line:
# settings.py
ADDONS = {
    "scrapyx_mw.addon.ScrapyxAddon": 0,
}
Then tweak behavior using flags:
| Setting | Type | Default | Notes |
|---|---|---|---|
| SCRAPYX_SESSION_ENABLED | bool | True | Enable default session headers middleware |
| SCRAPYX_API_REQUEST_ENABLED | bool | True | Enable API header injector middleware |
| SCRAPYX_DEBUG_ENABLED | bool | False | Log outgoing requests at DEBUG |
| SCRAPYX_CAPTCHA_MODE | str | "none" | "none", "polling", or "webhook" |
| SCRAPYX_CAPTCHA_ENABLED | bool | False | Master switch; CAPTCHA_ENABLED is also honored |
| CAPTCHA_API_KEY | str | "" | Required if captcha is enabled |
| CAPTCHA_WEBHOOK_URL | str | http://127.0.0.1:6801/webhook | Webhook receiver URL |
| CAPTCHA_* knobs | mixed | see defaults | TTL, polling delays, HTTP timeouts/retries |
| SESSION_HEADERS | dict | {} | Global default headers (overridden by per-spider service_config["HEADERS"]) |
The Add-on composes DOWNLOADER_MIDDLEWARES / SPIDER_MIDDLEWARES with "addon" priority and avoids overwriting user-set values.
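The "avoids overwriting user-set values" behavior can be pictured as a merge in which user entries always win. This is a plain-dict sketch, not the Add-on's code; the middleware paths, priorities, and the compose function are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical middleware paths and priorities, for illustration only.
ADDON_DEFAULTS: Dict[str, Optional[int]] = {
    "example_pkg.SessionHeadersMiddleware": 543,
    "example_pkg.DebugMiddleware": 544,
}


def compose(user: Dict[str, Optional[int]],
            defaults: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Return defaults overlaid with user entries; user values always win,
    including None, which Scrapy treats as 'middleware disabled'."""
    merged = dict(defaults)  # copy so the defaults are never mutated
    merged.update(user)
    return merged


user_mw = {"example_pkg.DebugMiddleware": None, "myproject.middlewares.Custom": 100}
print(compose(user_mw, ADDON_DEFAULTS))
```

In Scrapy itself the same outcome comes from settings priorities: a value written at the lower "addon" priority does not replace one set at project level.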
Compatibility notes
- Reads per-spider config from SERVICES[SPIDER_NAME_UPPER] (same as your compliance_scraper).
- Honors spider.captcha_needed, spider.site_key, and injects request.meta["recaptcha_solution"].
- Webhook DB path remains /var/lib/scrapyd/webhook_solutions.db for drop-in compatibility.
Prefer Add-on mode for fleet-wide consistency. The original presets.py helpers still work if you want to compose stacks manually.
Notes on compatibility
- Reads SERVICES[SPIDER_NAME_UPPER] for CAPTCHA_REQUIRED, SITE_KEY, HEADERS.
- Honors spider.captcha_needed and spider.site_key, and sets request.meta["recaptcha_solution"].
- Webhook SQLite schema and path match the original (/var/lib/scrapyd/webhook_solutions.db).
- Session/API middlewares merge with service_config["HEADERS"] (exactly like your project).
How to use the new Add-on
In your Scrapy project (that consumes libs/scrapyx-mw):
# settings.py
ADDONS = {
    "scrapyx_mw.addon.ScrapyxAddon": 0,
}
# Optional toggles
SCRAPYX_SESSION_ENABLED = True
SCRAPYX_API_REQUEST_ENABLED = True
SCRAPYX_DEBUG_ENABLED = False
# Captcha
SCRAPYX_CAPTCHA_MODE = "polling" # or "webhook" or "none"
SCRAPYX_CAPTCHA_ENABLED = True # or set CAPTCHA_ENABLED=True
CAPTCHA_API_KEY = "your-2captcha-key"
# CAPTCHA_WEBHOOK_URL = "http://127.0.0.1:6801/webhook"
# (keep your existing SERVICES mapping)
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"},
    }
}
If you choose webhook mode, run the sidecar on the Scrapyd host:
uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: curl http://127.0.0.1:6801/health
ConfigValidator Extension
This extension validates CAPTCHA-related configuration on startup and aborts misconfigured crawls early.
Enabled automatically by the Add-on. To use independently:
EXTENSIONS = {
"scrapyx_mw.extensions.config_validator.ConfigValidator": 10,
}
Checks performed
- If CAPTCHA_ENABLED=True but CAPTCHA_API_KEY is missing → fail.
- For each spider where SERVICES[SPIDER_NAME_UPPER]["CAPTCHA_REQUIRED"]=True:
  - Ensure SITE_KEY is set.
Useful in CI and Scrapyd to prevent broken deployments.
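The two checks listed above can be sketched as a pure function that returns a list of error messages. The function name and the message wording are assumptions for illustration; the actual extension is described as aborting the crawl rather than returning a list.

```python
from typing import Any, Iterable, List, Mapping


def validate_captcha_config(settings: Mapping[str, Any],
                            spider_names: Iterable[str]) -> List[str]:
    """Return human-readable errors; an empty list means the config passes."""
    errors: List[str] = []

    # Check 1: captcha enabled globally but no API key.
    if settings.get("CAPTCHA_ENABLED") and not settings.get("CAPTCHA_API_KEY"):
        errors.append("CAPTCHA_ENABLED is true but CAPTCHA_API_KEY is missing")

    # Check 2: every spider that requires a captcha needs a SITE_KEY.
    services = settings.get("SERVICES", {})
    for name in spider_names:
        key = name.upper()
        cfg = services.get(key, {})
        if cfg.get("CAPTCHA_REQUIRED") and not cfg.get("SITE_KEY"):
            errors.append(f"SERVICES[{key!r}] requires a captcha but has no SITE_KEY")
    return errors


bad = {"CAPTCHA_ENABLED": True, "SERVICES": {"EXAMPLE_SERVICE": {"CAPTCHA_REQUIRED": True}}}
for problem in validate_captcha_config(bad, ["example_service"]):
    print(problem)
```

In a CI job, a non-empty result could be turned into a non-zero exit code so that a broken configuration never reaches Scrapyd.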
🔄 Proxy Rotation Middleware
The proxy rotation middleware provides intelligent proxy management:
Configuration
# Enable proxy rotation
SCRAPYX_PROXY_ROTATION_ENABLED = True
# Proxy sources
SCRAPYX_PROXY_LIST = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "socks5://proxy3:1080",
]
# Or from environment variable
SCRAPYX_PROXY_ENV_VAR = "SCRAPYX_PROXY_LIST" # Default
# export SCRAPYX_PROXY_LIST="http://proxy1:8080,http://proxy2:8080"
# Or from file
SCRAPYX_PROXY_FILE = "proxies.txt"
# Rotation strategy
SCRAPYX_PROXY_ROTATION_STRATEGY = "round_robin" # "round_robin" | "random" | "weighted"
# Health checking
SCRAPYX_PROXY_HEALTH_CHECK = True
SCRAPYX_PROXY_HEALTH_CHECK_INTERVAL = 300 # seconds
SCRAPYX_PROXY_MAX_FAILURES = 3
# Session persistence
SCRAPYX_PROXY_SESSION_PERSISTENCE = True
Features
- Multiple Sources: Load proxies from settings, environment variables, or files
- Health Checking: Automatically remove failed proxies
- Load Balancing: Round-robin, random, or weighted selection
- Session Persistence: Use same proxy for requests with same session_id
- Performance Tracking: Monitor proxy success rates and response times
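As a rough picture of how round-robin selection, failure-based eviction, and session persistence fit together, here is a small self-contained sketch. The RoundRobinProxies class and its method names are hypothetical and do not reflect the middleware's internals.

```python
from typing import Dict, List, Optional, Sequence


class RoundRobinProxies:
    """Hypothetical rotator: round-robin over proxies below max_failures."""

    def __init__(self, proxies: Sequence[str], max_failures: int = 3) -> None:
        self._proxies: List[str] = list(proxies)
        self._failures: Dict[str, int] = {p: 0 for p in self._proxies}
        self._max_failures = max_failures
        self._cursor = 0
        self._sessions: Dict[str, str] = {}  # session_id -> pinned proxy

    def _healthy(self) -> List[str]:
        return [p for p in self._proxies if self._failures[p] < self._max_failures]

    def pick(self, session_id: Optional[str] = None) -> Optional[str]:
        """Return the next proxy, reusing the pinned one for a known session."""
        healthy = self._healthy()
        if not healthy:
            return None
        if session_id is not None and self._sessions.get(session_id) in healthy:
            return self._sessions[session_id]
        proxy = healthy[self._cursor % len(healthy)]
        self._cursor += 1
        if session_id is not None:
            self._sessions[session_id] = proxy
        return proxy

    def report(self, proxy: str, ok: bool) -> None:
        """Reset the failure count on success, increment it on failure."""
        self._failures[proxy] = 0 if ok else self._failures[proxy] + 1
```

In a downloader middleware, the chosen value would typically be assigned to request.meta["proxy"], which is the key Scrapy's built-in HttpProxyMiddleware reads.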
🔁 Smart Retry Middleware
The smart retry middleware provides intelligent retry logic with exponential backoff:
Configuration
# Enable smart retry
SCRAPYX_SMART_RETRY_ENABLED = True
# Retry configuration
SCRAPYX_RETRY_MAX_TIMES = 3
SCRAPYX_RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Backoff configuration
SCRAPYX_RETRY_BASE_BACKOFF = 1.0 # seconds
SCRAPYX_RETRY_MAX_BACKOFF = 60.0 # seconds
SCRAPYX_RETRY_BACKOFF_MULTIPLIER = 2.0
SCRAPYX_RETRY_JITTER_ENABLED = True
SCRAPYX_RETRY_JITTER_RANGE = 0.1
# Circuit breaker
SCRAPYX_RETRY_CIRCUIT_BREAKER_ENABLED = True
SCRAPYX_RETRY_CIRCUIT_BREAKER_THRESHOLD = 5 # failures before circuit opens
SCRAPYX_RETRY_CIRCUIT_BREAKER_TIMEOUT = 60 # seconds before circuit resets
# Priority handling
SCRAPYX_RETRY_PRIORITY_ENABLED = True
SCRAPYX_RETRY_PRIORITY_MULTIPLIER = 1.5 # reduce delay for high-priority requests
Features
- Exponential Backoff: Increasing delays between retries
- Jitter: Random variation to prevent thundering herd
- Circuit Breaker: Stop retrying failed domains temporarily
- Priority Support: Faster retries for high-priority requests
- Statistics: Track retry success rates and delays per domain
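The backoff and circuit-breaker settings above can be read as the following sketch. It assumes the jitter range is a proportional spread (0.1 meaning plus or minus 10 percent) and that the breaker counts failures per domain; both are interpretations, and the function and class names are hypothetical.

```python
import random
from typing import Dict, Optional


def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  max_backoff: float = 60.0, jitter_range: float = 0.1,
                  rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Jitter is applied only when an rng is supplied, which keeps the
    function deterministic by default.
    """
    delay = min(max_backoff, base * multiplier ** attempt)
    if rng is not None and jitter_range > 0:
        delay *= 1.0 + rng.uniform(-jitter_range, jitter_range)
    return min(delay, max_backoff)


class CircuitBreaker:
    """Opens for a domain after `threshold` failures; resets after `timeout_s`."""

    def __init__(self, threshold: int = 5, timeout_s: float = 60.0) -> None:
        self.threshold = threshold
        self.timeout_s = timeout_s
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def record_failure(self, domain: str, now: float) -> None:
        self._failures[domain] = self._failures.get(domain, 0) + 1
        if self._failures[domain] >= self.threshold:
            self._opened_at[domain] = now

    def record_success(self, domain: str) -> None:
        self._failures.pop(domain, None)
        self._opened_at.pop(domain, None)

    def allows(self, domain: str, now: float) -> bool:
        """True if requests to `domain` may be retried at time `now`."""
        opened = self._opened_at.get(domain)
        if opened is None:
            return True
        if now - opened >= self.timeout_s:
            self.record_success(domain)  # timeout elapsed: close the circuit
            return True
        return False


# Without jitter the defaults give 1, 2, 4, 8, 16, 32, 60, 60 seconds.
print([backoff_delay(n) for n in range(8)])
```

Timestamps are passed in explicitly so the breaker can be exercised without sleeping; a middleware would pass time.monotonic() or similar.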