Composable Scrapy middlewares: API/session headers, debug, captcha polling/webhook, proxy rotation, smart retry.

scrapyx-mw

Composable Scrapy middlewares packaged for reuse across projects:

API / Session headers injection
Debug logging
Captcha solving, either by polling (2captcha or CapSolver) or by webhook (with a sidecar SQLite store)
Proxy rotation with health checking
Smart retry with exponential backoff

This package is compatible with projects that store per-spider config under SERVICES[SPIDER_NAME_UPPER] (e.g., CAPTCHA_REQUIRED, SITE_KEY, HEADERS), as used in your compliance_scraper project.

📚 Documentation

Full Documentation - Complete documentation index
Deployment Guide - Production deployment instructions
Migration Guide - Migrate existing projects

Install (with uv)

From a project that consumes this plugin:

uv add -e ./libs/scrapyx-mw
# or publish and: uv add scrapyx-mw

🚀 Features

Core Middlewares

Session Management: Automatic session handling with configurable headers
API Request: Inject API-specific headers from spider configuration
Debug Logging: Comprehensive request/response logging for development

Captcha Solving

Multi-Provider Support: 2captcha and CapSolver providers
Polling Mode: Traditional polling-based captcha solving
Webhook Mode: Webhook-based captcha solving with fallback to polling
Configurable Timeouts: Customizable polling intervals and timeouts

Production Hardening

Telemetry: Track captcha solve attempts, success rates, and costs
Guardrails: Rate limiting, budget controls, and circuit breaker
Log Redaction: Automatically redact sensitive information from logs
Configuration Validation: Fail-fast validation of captcha settings

Advanced Features

Proxy Rotation: Intelligent proxy rotation with health checking
Smart Retry: Exponential backoff with jitter and circuit breaker
Scrapy Add-on: One-switch enablement of all features

Enable in your Scrapy project

In your project settings.py:

from scrapyx_mw import default_config, apply_downloader_middlewares, apply_spider_middlewares

# Toggle features here
SCRAPYX = default_config(
    api_request=True,
    session=True,
    debug=False,
    captcha="none",         # "none" | "polling" | "webhook"
    captcha_enabled=False,  # also controlled by env var CAPTCHA_ENABLED
    captcha_api_key="",     # required if captcha_enabled
    captcha_webhook_url="http://127.0.0.1:6801/webhook",
    session_headers={"Accept": "application/json"},
)

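# settings.py may not define these dicts yet; create them before merging.
DOWNLOADER_MIDDLEWARES = globals().get("DOWNLOADER_MIDDLEWARES", {})
SPIDER_MIDDLEWARES = globals().get("SPIDER_MIDDLEWARES", {})

# Merge the middlewares selected in SCRAPYX into the project's stacks.
DOWNLOADER_MIDDLEWARES.update(apply_downloader_middlewares(globals(), SCRAPYX))
SPIDER_MIDDLEWARES.update(apply_spider_middlewares(globals(), SCRAPYX))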

# Per-spider service configs, same shape as in your compliance_scraper:
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxx...your-site-key...",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"}
    }
}

Note: Your spiders can keep using spider.service_config, spider.captcha_needed, and spider.site_key (as in your BaseComplianceSpider); this plugin is designed to read those fields.
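
The per-spider lookup described above can be pictured roughly as follows. This is a minimal sketch, not the plugin's actual code: the helper name `resolve_service_config` is hypothetical, and it assumes the key is simply the spider name upper-cased.

```python
from typing import Any

SERVICES: dict[str, dict[str, Any]] = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxx...your-site-key...",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"},
    }
}


def resolve_service_config(
    spider_name: str, services: dict[str, dict[str, Any]]
) -> dict[str, Any]:
    """Hypothetical helper: return SERVICES[SPIDER_NAME_UPPER], or {} if absent."""
    return services.get(spider_name.upper(), {})


# Derive the fields a spider would expose (captcha_needed, site_key).
config = resolve_service_config("example_service", SERVICES)
captcha_needed = bool(config.get("CAPTCHA_REQUIRED", False))
site_key = config.get("SITE_KEY", "")
```

A spider with no entry in SERVICES gets an empty config here, so `captcha_needed` would default to False.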

Captcha options

Polling (captcha="polling"): middleware submits & polls 2captcha or CapSolver.
Webhook (captcha="webhook"): middleware submits with a callbackUrl and waits for the sidecar webhook receiver to store codes in SQLite for pickup.

Supported Providers

2captcha: Traditional polling-based captcha solving
CapSolver: Modern API-based captcha solving with JSON endpoints

Provider Configuration

# For 2captcha
CAPTCHA_PROVIDER = "2captcha"
CAPTCHA_API_KEY = "your-2captcha-key"
CAPTCHA_2CAPTCHA_BASE = "https://2captcha.com"  # optional
CAPTCHA_2CAPTCHA_METHOD = "userrecaptcha"       # optional

# For CapSolver
CAPTCHA_PROVIDER = "capsolver"
CAPTCHA_API_KEY = "your-capsolver-key"
CAPTCHA_CAPSOLVER_BASE = "https://api.capsolver.com"           # optional
CAPTCHA_CAPSOLVER_TASK_TYPE = "ReCaptchaV2TaskProxyLess"       # optional

Webhook sidecar

Run the included receiver next to Scrapyd (bind to 127.0.0.1 by default):

uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: GET http://127.0.0.1:6801/health
# Webhook: POST from 2captcha to /webhook (id, code)

DB file path: /var/lib/scrapyd/webhook_solutions.db (same as your original project for drop-in compatibility).

Note: CapSolver webhook support is limited; in webhook mode it falls back to polling.
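
To illustrate the store-then-pickup flow of the sidecar, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names, and helper functions are assumptions made for this example; the real schema in webhook_solutions.db may differ.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (assumed) solutions table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS solutions ("
        " captcha_id TEXT PRIMARY KEY,"
        " code TEXT NOT NULL)"
    )
    return conn


def store_solution(conn: sqlite3.Connection, captcha_id: str, code: str) -> None:
    """What a webhook receiver might do when the provider posts (id, code)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO solutions (captcha_id, code) VALUES (?, ?)",
            (captcha_id, code),
        )


def pop_solution(conn: sqlite3.Connection, captcha_id: str) -> Optional[str]:
    """What a middleware might do on pickup: read the code once, then delete it."""
    with conn:
        row = conn.execute(
            "SELECT code FROM solutions WHERE captcha_id = ?", (captcha_id,)
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM solutions WHERE captcha_id = ?", (captcha_id,))
        return row[0]
```

Deleting the row on pickup keeps a solved token from being reused by a second request, which matters because captcha tokens are short-lived.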

Environment variables

CAPTCHA_ENABLED=false
CAPTCHA_API_KEY=
CAPTCHA_PROVIDER=2captcha
CAPTCHA_WEBHOOK_URL=http://127.0.0.1:6801/webhook
CAPTCHA_TOKEN_TTL_SECONDS=110
CAPTCHA_POLL_INITIAL_S=4.0
CAPTCHA_POLL_MAX_S=45.0
CAPTCHA_POLL_MAX_TIME_S=180.0
CAPTCHA_HTTP_TIMEOUT_S=15.0
CAPTCHA_HTTP_RETRIES=2

# 2captcha specific
CAPTCHA_2CAPTCHA_BASE=https://2captcha.com
CAPTCHA_2CAPTCHA_METHOD=userrecaptcha

# CapSolver specific
CAPTCHA_CAPSOLVER_BASE=https://api.capsolver.com
CAPTCHA_CAPSOLVER_TASK_TYPE=ReCaptchaV2TaskProxyLess
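
One plausible reading of the three polling knobs is an initial delay that grows up to a per-wait cap, within an overall time budget. The sketch below illustrates that reading only; the growth factor of 1.5 and the helper names `env_float` and `poll_delays` are assumptions, not documented behavior of the package.

```python
import os
from typing import Iterator


def env_float(name: str, default: float) -> float:
    """Read a float from the environment, falling back to a default when unset or empty."""
    raw = os.environ.get(name, "")
    return float(raw) if raw.strip() else default


def poll_delays(
    initial_s: float, max_s: float, max_time_s: float, factor: float = 1.5
) -> Iterator[float]:
    """Yield waits that grow by `factor`, capped at max_s, until max_time_s is used up."""
    elapsed = 0.0
    delay = initial_s
    while elapsed < max_time_s:
        # Never wait longer than the per-wait cap or the remaining budget.
        wait = min(delay, max_s, max_time_s - elapsed)
        yield wait
        elapsed += wait
        delay *= factor


# Build the schedule from the documented variables and their listed defaults.
schedule = poll_delays(
    env_float("CAPTCHA_POLL_INITIAL_S", 4.0),
    env_float("CAPTCHA_POLL_MAX_S", 45.0),
    env_float("CAPTCHA_POLL_MAX_TIME_S", 180.0),
)
```

A caller would sleep for each yielded value, check the provider for a result, and give up once the generator is exhausted.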

Enable via Scrapy Add-on (recommended)

Instead of editing middleware dicts manually, turn on the package with one line:

# settings.py
ADDONS = {
  "scrapyx_mw.addon.ScrapyxAddon": 0,
}

Then tweak behavior using flags:

Setting	Type	Default	Notes
`SCRAPYX_SESSION_ENABLED`	bool	`True`	Enable default session headers middleware
`SCRAPYX_API_REQUEST_ENABLED`	bool	`True`	Enable API header injector middleware
`SCRAPYX_DEBUG_ENABLED`	bool	`False`	Log outgoing requests at DEBUG
`SCRAPYX_CAPTCHA_MODE`	str	`"none"`	`"none" | "polling" | "webhook"`
`SCRAPYX_CAPTCHA_ENABLED`	bool	`False`	Master switch; `CAPTCHA_ENABLED` is also honored
`CAPTCHA_API_KEY`	str	`""`	Required if captcha is enabled
`CAPTCHA_WEBHOOK_URL`	str	`http://127.0.0.1:6801/webhook`	Webhook receiver URL
`CAPTCHA_*` knobs	mixed	see defaults	TTL, polling delays, HTTP timeouts/retries
`SESSION_HEADERS`	dict	`{}`	Global default headers (overridden by per-spider `service_config["HEADERS"]`)

The Add-on composes DOWNLOADER_MIDDLEWARES / SPIDER_MIDDLEWARES with "addon" priority and avoids overwriting user-set values.

Compatibility notes

Reads per-spider config from SERVICES[SPIDER_NAME_UPPER] (CAPTCHA_REQUIRED, SITE_KEY, HEADERS), the same layout as your compliance_scraper.
Honors spider.captcha_needed and spider.site_key, and sets request.meta["recaptcha_solution"].
The webhook SQLite schema and path match the original (/var/lib/scrapyd/webhook_solutions.db) for drop-in compatibility.
Session/API middlewares merge with service_config["HEADERS"] (exactly like your project).

Prefer Add-on mode for fleet-wide consistency. The original presets.py helpers still work if you want to compose stacks manually.
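
The header precedence (global SESSION_HEADERS as defaults, per-spider service_config["HEADERS"] on top) amounts to a dictionary merge. The sketch below shows the idea; `merge_headers` is a hypothetical name, and unlike real HTTP header handling this simplified version treats header names as case-sensitive.

```python
def merge_headers(
    session_headers: dict[str, str], service_headers: dict[str, str]
) -> dict[str, str]:
    """Start from the global defaults, then let per-spider headers override them."""
    merged = dict(session_headers)  # copy so the global defaults are not mutated
    merged.update(service_headers)
    return merged


session_headers = {"Accept": "application/json", "Accept-Language": "en"}
service_headers = {"Accept": "text/html,application/xhtml+xml"}
effective = merge_headers(session_headers, service_headers)
```

Here the per-spider Accept value replaces the global one, while Accept-Language passes through unchanged.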

How to use the new Add-on

In your Scrapy project (that consumes libs/scrapyx-mw):

# settings.py
ADDONS = {
    "scrapyx_mw.addon.ScrapyxAddon": 0,
}

# Optional toggles
SCRAPYX_SESSION_ENABLED = True
SCRAPYX_API_REQUEST_ENABLED = True
SCRAPYX_DEBUG_ENABLED = False

# Captcha
SCRAPYX_CAPTCHA_MODE = "polling"   # or "webhook" or "none"
SCRAPYX_CAPTCHA_ENABLED = True     # or set CAPTCHA_ENABLED=True
CAPTCHA_API_KEY = "your-2captcha-key"
# CAPTCHA_WEBHOOK_URL = "http://127.0.0.1:6801/webhook"

# (keep your existing SERVICES mapping)
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"}
    }
}

If you choose webhook mode, run the sidecar on the Scrapyd host:

uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: curl http://127.0.0.1:6801/health

ConfigValidator Extension

This extension validates CAPTCHA-related configuration on startup and aborts misconfigured crawls early.

Enabled automatically by the Add-on. To use independently:

EXTENSIONS = {
    "scrapyx_mw.extensions.config_validator.ConfigValidator": 10,
}

Checks performed

If CAPTCHA_ENABLED=True but CAPTCHA_API_KEY is missing → fail.
For each spider where SERVICES[SPIDER_NAME_UPPER]["CAPTCHA_REQUIRED"]=True:
- Ensure SITE_KEY is set.

Useful in CI and Scrapyd to prevent broken deployments.
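
The two checks listed above can be expressed as a small pure function. This is a sketch of the rules only, assuming settings arrive as a plain dict; the function name `validate_captcha_settings` and the error messages are illustrative and not taken from the extension.

```python
from typing import Any


def validate_captcha_settings(settings: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the config passes both checks."""
    errors: list[str] = []

    # Check 1: captcha enabled globally requires an API key.
    if settings.get("CAPTCHA_ENABLED") and not settings.get("CAPTCHA_API_KEY"):
        errors.append("CAPTCHA_ENABLED is true but CAPTCHA_API_KEY is missing")

    # Check 2: every service that requires a captcha must define SITE_KEY.
    for name, service in settings.get("SERVICES", {}).items():
        if service.get("CAPTCHA_REQUIRED") and not service.get("SITE_KEY"):
            errors.append(f"SERVICES[{name!r}] requires a captcha but has no SITE_KEY")

    return errors
```

In a CI job, a non-empty result would be turned into a failing exit code so the misconfiguration is caught before deployment.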

🔄 Proxy Rotation Middleware

The proxy rotation middleware provides intelligent proxy management:

Configuration

# Enable proxy rotation
SCRAPYX_PROXY_ROTATION_ENABLED = True

# Proxy sources
SCRAPYX_PROXY_LIST = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "socks5://proxy3:1080"
]

# Or from environment variable
SCRAPYX_PROXY_ENV_VAR = "SCRAPYX_PROXY_LIST"  # Default
# export SCRAPYX_PROXY_LIST="http://proxy1:8080,http://proxy2:8080"

# Or from file
SCRAPYX_PROXY_FILE = "proxies.txt"

# Rotation strategy
SCRAPYX_PROXY_ROTATION_STRATEGY = "round_robin"  # "round_robin" | "random" | "weighted"

# Health checking
SCRAPYX_PROXY_HEALTH_CHECK = True
SCRAPYX_PROXY_HEALTH_CHECK_INTERVAL = 300  # seconds
SCRAPYX_PROXY_MAX_FAILURES = 3

# Session persistence
SCRAPYX_PROXY_SESSION_PERSISTENCE = True

Features

Multiple Sources: Load proxies from settings, environment variables, or files
Health Checking: Automatically remove failed proxies
Load Balancing: Round-robin, random, or weighted selection
Session Persistence: Use same proxy for requests with same session_id
Performance Tracking: Monitor proxy success rates and response times
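
To make the interplay of round-robin selection, failure limits, and session persistence concrete, here is a minimal sketch. The class `RoundRobinProxyPool` and its methods are hypothetical and only model the behavior described above; the middleware's real implementation (including periodic health checks and the random and weighted strategies) is not shown.

```python
from typing import Optional


class RoundRobinProxyPool:
    """Illustrative pool: round-robin over proxies that are below the failure limit."""

    def __init__(self, proxies: list[str], max_failures: int = 3) -> None:
        self._proxies = list(proxies)
        self._failures = {proxy: 0 for proxy in self._proxies}
        self._max_failures = max_failures
        self._cursor = 0
        self._sessions: dict[str, str] = {}  # session_id -> pinned proxy

    def _healthy(self) -> list[str]:
        return [p for p in self._proxies if self._failures[p] < self._max_failures]

    def get(self, session_id: Optional[str] = None) -> str:
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        # Session persistence: reuse the pinned proxy while it stays healthy.
        if session_id is not None and self._sessions.get(session_id) in healthy:
            return self._sessions[session_id]
        proxy = healthy[self._cursor % len(healthy)]
        self._cursor += 1
        if session_id is not None:
            self._sessions[session_id] = proxy
        return proxy

    def report_failure(self, proxy: str) -> None:
        self._failures[proxy] += 1

    def report_success(self, proxy: str) -> None:
        self._failures[proxy] = 0  # a success clears the failure count
```

With SCRAPYX_PROXY_MAX_FAILURES = 3, a proxy in this model drops out of rotation after three failures in a row, and any session pinned to it is moved to the next healthy proxy.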

🔁 Smart Retry Middleware

The smart retry middleware provides intelligent retry logic with exponential backoff:

Configuration

# Enable smart retry
SCRAPYX_SMART_RETRY_ENABLED = True

# Retry configuration
SCRAPYX_RETRY_MAX_TIMES = 3
SCRAPYX_RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Backoff configuration
SCRAPYX_RETRY_BASE_BACKOFF = 1.0  # seconds
SCRAPYX_RETRY_MAX_BACKOFF = 60.0  # seconds
SCRAPYX_RETRY_BACKOFF_MULTIPLIER = 2.0
SCRAPYX_RETRY_JITTER_ENABLED = True
SCRAPYX_RETRY_JITTER_RANGE = 0.1

# Circuit breaker
SCRAPYX_RETRY_CIRCUIT_BREAKER_ENABLED = True
SCRAPYX_RETRY_CIRCUIT_BREAKER_THRESHOLD = 5  # failures before circuit opens
SCRAPYX_RETRY_CIRCUIT_BREAKER_TIMEOUT = 60  # seconds before circuit resets

# Priority handling
SCRAPYX_RETRY_PRIORITY_ENABLED = True
SCRAPYX_RETRY_PRIORITY_MULTIPLIER = 1.5  # reduce delay for high-priority requests

Features

Exponential Backoff: Increasing delays between retries
Jitter: Random variation to prevent thundering herd
Circuit Breaker: Stop retrying failed domains temporarily
Priority Support: Faster retries for high-priority requests
Statistics: Track retry success rates and delays per domain
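
The backoff and circuit-breaker settings can be read as the following calculation. This is a sketch under stated assumptions: jitter is modeled as a multiplicative variation of plus or minus SCRAPYX_RETRY_JITTER_RANGE, attempts are counted from 1, and the names `backoff_delay` and `DomainCircuitBreaker` are hypothetical. The priority multiplier is not modeled.

```python
import random
import time
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    multiplier: float = 2.0,
    max_backoff: float = 60.0,
    jitter_range: float = 0.1,
) -> float:
    """Delay before retry number `attempt` (1-based): base * multiplier**(attempt-1), capped."""
    delay = min(base * multiplier ** (attempt - 1), max_backoff)
    if jitter_range:
        # Assumed jitter model: scale by a random factor in [1 - range, 1 + range].
        delay *= 1.0 + random.uniform(-jitter_range, jitter_range)
    return min(delay, max_backoff)


class DomainCircuitBreaker:
    """Stop retrying a domain after `threshold` failures until `timeout` seconds pass."""

    def __init__(
        self,
        threshold: int = 5,
        timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = threshold
        self._timeout = timeout
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._opened_at: dict[str, float] = {}

    def record_failure(self, domain: str) -> None:
        self._failures[domain] = self._failures.get(domain, 0) + 1
        if self._failures[domain] >= self._threshold:
            self._opened_at[domain] = self._clock()  # open the circuit

    def record_success(self, domain: str) -> None:
        self._failures.pop(domain, None)
        self._opened_at.pop(domain, None)

    def allows(self, domain: str) -> bool:
        opened = self._opened_at.get(domain)
        if opened is None:
            return True
        if self._clock() - opened >= self._timeout:
            self.record_success(domain)  # timeout elapsed: reset and allow again
            return True
        return False
```

With the defaults shown in the configuration, the un-jittered delays for successive retries would be 1, 2, and 4 seconds, and the 60-second cap only comes into play from the seventh attempt onward.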

Project details

Release history

0.2.2 (Feb 9, 2026)
0.2.1 (Feb 9, 2026)
0.2.0 (Feb 2, 2026)
0.1.1 (Oct 27, 2025)
0.1.0 (Oct 27, 2025), this version

Download files

Source Distribution

scrapyx_mw-0.1.0.tar.gz (33.0 kB)

Uploaded Oct 27, 2025 Source

Built Distribution

scrapyx_mw-0.1.0-py3-none-any.whl (35.9 kB)

Uploaded Oct 27, 2025 Python 3

File details

Details for the file scrapyx_mw-0.1.0.tar.gz.

File metadata

Download URL: scrapyx_mw-0.1.0.tar.gz
Upload date: Oct 27, 2025
Size: 33.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scrapyx_mw-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4de6c7f828d1a2b530da4b2642340c79047e5a169749ee4600f0d4e7bc377829`
MD5	`c27ddcef087bba9e2927b95ac484206d`
BLAKE2b-256	`47f6cfc8cfb9a399c8d571ca0edc417fcd4af5d9e2261e683b9b71866b6a2173`

Provenance

The following attestation bundles were made for scrapyx_mw-0.1.0.tar.gz:

Publisher: release.yml on ArmanAvanesyan/scrapyx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapyx_mw-0.1.0.tar.gz
- Subject digest: 4de6c7f828d1a2b530da4b2642340c79047e5a169749ee4600f0d4e7bc377829
- Sigstore transparency entry: 642593976
- Sigstore integration time: Oct 27, 2025
Source repository:
- Permalink: ArmanAvanesyan/scrapyx@0944ceb78fa113f962e5c6c1c2c8a8616d08084b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ArmanAvanesyan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0944ceb78fa113f962e5c6c1c2c8a8616d08084b
- Trigger Event: push

File details

Details for the file scrapyx_mw-0.1.0-py3-none-any.whl.

File metadata

Download URL: scrapyx_mw-0.1.0-py3-none-any.whl
Upload date: Oct 27, 2025
Size: 35.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scrapyx_mw-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`01e5b9790a3f1f11d363d2631117938f9152a673940b338da7dc1f504c5672d7`
MD5	`4fa9828325ea14ea87d8e6442ebd1bab`
BLAKE2b-256	`9e31f02c298ee068d668b244ddf03a1d85cc40431413c57e9c986fc2a79026e0`

Provenance

The following attestation bundles were made for scrapyx_mw-0.1.0-py3-none-any.whl:

Publisher: release.yml on ArmanAvanesyan/scrapyx

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scrapyx_mw-0.1.0-py3-none-any.whl
- Subject digest: 01e5b9790a3f1f11d363d2631117938f9152a673940b338da7dc1f504c5672d7
- Sigstore transparency entry: 642594060
- Sigstore integration time: Oct 27, 2025
Source repository:
- Permalink: ArmanAvanesyan/scrapyx@0944ceb78fa113f962e5c6c1c2c8a8616d08084b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/ArmanAvanesyan
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0944ceb78fa113f962e5c6c1c2c8a8616d08084b
- Trigger Event: push
