Composable Scrapy middlewares: API/session headers, debug, captcha polling/webhook, proxy rotation, smart retry.

scrapyx-mw

Composable Scrapy middlewares packaged for reuse across projects:

  • API / Session headers injection
  • Debug logging
  • Captcha solving: polling (2captcha or CapSolver) or webhook (with a sidecar SQLite store)
  • Proxy rotation with health checking
  • Smart retry with exponential backoff

This package is compatible with projects that store per-spider config under SERVICES[SPIDER_NAME_UPPER] (e.g., CAPTCHA_REQUIRED, SITE_KEY, HEADERS), as used in your compliance_scraper project.

📚 Documentation

Install (with uv)

From a project that consumes this plugin:

uv add -e ./libs/scrapyx-mw
# or publish and: uv add scrapyx-mw

Only Scrapy and Twisted are required; the resolver will reuse already-installed versions. For browser impersonation (curl_cffi), use: uv add scrapyx-mw[curl-cffi].

🚀 Features

Core Middlewares

  • Session Management: Automatic session handling with configurable headers
  • API Request: Inject API-specific headers from spider configuration
  • Debug Logging: Comprehensive request/response logging for development

Captcha Solving

  • Multi-Provider Support: 2captcha and CapSolver providers
  • Polling Mode: Traditional polling-based captcha solving
  • Webhook Mode: Webhook-based captcha solving with fallback to polling
  • Configurable Timeouts: Customizable polling intervals and timeouts

Production Hardening

  • Telemetry: Track captcha solve attempts, success rates, and costs
  • Guardrails: Rate limiting, budget controls, and circuit breaker
  • Log Redaction: Automatically redact sensitive information from logs
  • Configuration Validation: Fail-fast validation of captcha settings

Advanced Features

  • Proxy Rotation: Intelligent proxy rotation with health checking
  • Smart Retry: Exponential backoff with jitter and circuit breaker
  • Scrapy Add-on: One-switch enablement of all features

Enable in your Scrapy project

In your project settings.py:

from scrapyx_mw import default_config, apply_downloader_middlewares, apply_spider_middlewares

# Toggle features here
SCRAPYX = default_config(
    api_request=True,
    session=True,
    debug=False,
    captcha="none",         # "none" | "polling" | "webhook"
    captcha_enabled=False,  # also controlled by env var CAPTCHA_ENABLED
    captcha_api_key="",     # required if captcha_enabled
    captcha_webhook_url="http://127.0.0.1:6801/webhook",
    session_headers={"Accept": "application/json"},
)

# DOWNLOADER_MIDDLEWARES and SPIDER_MIDDLEWARES must already be defined as dicts above this point.
DOWNLOADER_MIDDLEWARES.update(apply_downloader_middlewares(globals(), SCRAPYX))
SPIDER_MIDDLEWARES.update(apply_spider_middlewares(globals(), SCRAPYX))

# Per-spider service configs, same shape as in your compliance_scraper:
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxx...your-site-key...",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"}
    }
}

Note: Your spiders can keep using spider.service_config, spider.captcha_needed, and spider.site_key (as in your BaseComplianceSpider); this plugin is designed to read those fields.
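
The lookup convention described above (per-spider config stored under the upper-cased spider name) can be sketched as follows. This is a minimal illustration, not the plugin's code: service_config_for is a hypothetical helper, and the SERVICES dict simply mirrors the example settings.

```python
from typing import Any, Dict

# Same shape as the SERVICES example in settings.py (placeholder values).
SERVICES: Dict[str, Dict[str, Any]] = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "your-site-key",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"},
    }
}


def service_config_for(spider_name: str, services: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: return the entry keyed by the upper-cased spider name, or {}."""
    return services.get(spider_name.upper(), {})


# Derive the three fields the note above mentions.
config = service_config_for("example_service", SERVICES)
captcha_needed = bool(config.get("CAPTCHA_REQUIRED", False))
site_key = config.get("SITE_KEY")
headers = config.get("HEADERS", {})
```

A spider without an entry gets an empty config, so captcha_needed defaults to False and no extra headers are applied.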

Captcha options

  • Polling (captcha="polling"): middleware submits & polls 2captcha or CapSolver.
  • Webhook (captcha="webhook"): middleware submits with a callbackUrl and waits for the sidecar webhook receiver to store codes in SQLite for pickup.

Supported Providers

  • 2captcha: Traditional polling-based captcha solving
  • CapSolver: Modern API-based captcha solving with JSON endpoints

Provider Configuration

# Set exactly one provider; if both blocks are present, the later assignment wins.
# For 2captcha
CAPTCHA_PROVIDER = "2captcha"
CAPTCHA_API_KEY = "your-2captcha-key"
CAPTCHA_2CAPTCHA_BASE = "https://2captcha.com"  # optional
CAPTCHA_2CAPTCHA_METHOD = "userrecaptcha"       # optional

# For CapSolver
CAPTCHA_PROVIDER = "capsolver"
CAPTCHA_API_KEY = "your-capsolver-key"
CAPTCHA_CAPSOLVER_BASE = "https://api.capsolver.com"           # optional
CAPTCHA_CAPSOLVER_TASK_TYPE = "ReCaptchaV2TaskProxyLess"       # optional

Webhook sidecar

Run the included receiver next to Scrapyd (bind to 127.0.0.1 by default):

uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: GET http://127.0.0.1:6801/health
# Webhook: POST from 2captcha to /webhook (id, code)

DB file path: /var/lib/scrapyd/webhook_solutions.db (same as your original project for drop-in compatibility).

Note: CapSolver webhook support is limited; in webhook mode it falls back to polling.
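
To make the webhook flow concrete, here is a rough sketch of a store in which the receiver writes (id, code) pairs and the middleware later picks them up. The table name, column names, and helper functions are assumptions for illustration only; the real schema of webhook_solutions.db is not documented here.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per solved captcha, keyed by the provider's task id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS solutions (
    captcha_id TEXT PRIMARY KEY,
    code       TEXT NOT NULL
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_solution(conn: sqlite3.Connection, captcha_id: str, code: str) -> None:
    """Called by the webhook receiver when the provider posts a result."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO solutions (captcha_id, code) VALUES (?, ?)",
            (captcha_id, code),
        )


def take_solution(conn: sqlite3.Connection, captcha_id: str) -> Optional[str]:
    """Called by the middleware: return the code and delete it, or None if absent."""
    with conn:
        row = conn.execute(
            "SELECT code FROM solutions WHERE captcha_id = ?", (captcha_id,)
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM solutions WHERE captcha_id = ?", (captcha_id,))
        return row[0]


# In-memory database for the example; the sidecar uses a file on disk.
store = open_store(":memory:")
save_solution(store, "12345", "token-abc")
```

Deleting on pickup keeps a token from being used twice, which matters because solved captcha tokens expire quickly.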

Environment variables

CAPTCHA_ENABLED=false
CAPTCHA_API_KEY=
CAPTCHA_PROVIDER=2captcha
CAPTCHA_WEBHOOK_URL=http://127.0.0.1:6801/webhook
CAPTCHA_TOKEN_TTL_SECONDS=110
CAPTCHA_POLL_INITIAL_S=4.0
CAPTCHA_POLL_MAX_S=45.0
CAPTCHA_POLL_MAX_TIME_S=180.0
CAPTCHA_HTTP_TIMEOUT_S=15.0
CAPTCHA_HTTP_RETRIES=2

# 2captcha specific
CAPTCHA_2CAPTCHA_BASE=https://2captcha.com
CAPTCHA_2CAPTCHA_METHOD=userrecaptcha

# CapSolver specific
CAPTCHA_CAPSOLVER_BASE=https://api.capsolver.com
CAPTCHA_CAPSOLVER_TASK_TYPE=ReCaptchaV2TaskProxyLess
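
As an illustration of how the three polling knobs could interact, the sketch below builds a delay schedule that starts at CAPTCHA_POLL_INITIAL_S, caps each wait at CAPTCHA_POLL_MAX_S, and stops once the total would exceed CAPTCHA_POLL_MAX_TIME_S. The doubling growth factor and both helper names are assumptions; the middleware's actual schedule may differ.

```python
import os
from typing import Iterator


def env_float(name: str, default: float) -> float:
    """Read a float from the environment, falling back when unset or empty."""
    raw = os.environ.get(name, "")
    return float(raw) if raw.strip() else default


def poll_delays(initial_s: float, max_s: float, max_time_s: float,
                factor: float = 2.0) -> Iterator[float]:
    """Yield waits that grow by `factor` (assumed), capped at max_s,
    until the cumulative wait would exceed max_time_s."""
    elapsed = 0.0
    delay = initial_s
    while elapsed + delay <= max_time_s:
        yield delay
        elapsed += delay
        delay = min(delay * factor, max_s)


# Defaults match the environment variable listing above.
schedule = list(poll_delays(
    env_float("CAPTCHA_POLL_INITIAL_S", 4.0),
    env_float("CAPTCHA_POLL_MAX_S", 45.0),
    env_float("CAPTCHA_POLL_MAX_TIME_S", 180.0),
))
```

With the listed defaults and a doubling factor, this gives six polls over 150 seconds before the budget runs out.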

Enable via Scrapy Add-on (recommended)

Instead of editing middleware dicts manually, turn on the package with one line:

# settings.py
ADDONS = {
  "scrapyx_mw.addon.ScrapyxAddon": 0,
}

Then tweak behavior using flags:

| Setting | Type | Default | Notes |
|---|---|---|---|
| SCRAPYX_SESSION_ENABLED | bool | True | Enable default session headers middleware |
| SCRAPYX_API_REQUEST_ENABLED | bool | True | Enable API header injector middleware |
| SCRAPYX_DEBUG_ENABLED | bool | False | Log outgoing requests at DEBUG |
| SCRAPYX_CAPTCHA_MODE | str | "none" | "none", "polling", or "webhook" |
| SCRAPYX_CAPTCHA_ENABLED | bool | False | Master switch; CAPTCHA_ENABLED is also honored |
| CAPTCHA_API_KEY | str | "" | Required if captcha is enabled |
| CAPTCHA_WEBHOOK_URL | str | http://127.0.0.1:6801/webhook | Webhook receiver URL |
| CAPTCHA_* knobs | mixed | see defaults | TTL, polling delays, HTTP timeouts/retries |
| SESSION_HEADERS | dict | {} | Global default headers (overridden by per-spider service_config["HEADERS"]) |
| SCRAPYX_CURL_CFFI_ENABLED | bool | False | Enable CurlCffi download handler (opt-in: per request or global) |
| SCRAPYX_CURL_CFFI_MIDDLEWARE_ENABLED | bool | False | Enable CurlCffi middleware (opt-out: all requests use curl_cffi unless disabled in meta) |

The Add-on composes DOWNLOADER_MIDDLEWARES / SPIDER_MIDDLEWARES / DOWNLOAD_HANDLERS with "addon" priority and avoids overwriting user-set values.

CurlCffi (browser impersonation)

Optional support for curl_cffi for browser impersonation on anti-bot sites. Install the extra when using the handler or middleware: uv pip install scrapyx-mw[curl-cffi] or uv pip install curl-cffi.

  • Handler (SCRAPYX_CURL_CFFI_ENABLED=True): opt-in per request via request.meta['use_curl_cffi'] = True or globally; keeps the full download pipeline.
  • Middleware (SCRAPYX_CURL_CFFI_MIDDLEWARE_ENABLED=True): opt-out; all requests use curl_cffi unless request.meta['use_curl_cffi'] = False. Choose one approach per project (handler or middleware).

Per-request meta keys (both handler and middleware):

| Meta key | Description |
|---|---|
| curl_cffi_impersonate | Browser target, e.g. "chrome110" (default). |
| curl_cffi_http_version | Force HTTP version: "v1" (HTTP/1.1), "v2" (HTTP/2), "v3", "v3only". Use "v1" to avoid HTTP/2 stream closure errors (curl err 92) on some servers. |
| curl_cffi_curl_options | Optional dict of libcurl options, e.g. {CURLOPT_HTTP_VERSION: CURL_HTTP_VERSION_1_1}. |
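
The meta keys above can be assembled with a small helper like the one below. The helper name curl_cffi_meta is hypothetical and not part of the package; only the key names and allowed values come from the table.

```python
from typing import Any, Dict, Optional

# HTTP version values listed in the meta key table.
ALLOWED_HTTP_VERSIONS = {"v1", "v2", "v3", "v3only"}


def curl_cffi_meta(impersonate: str = "chrome110",
                   http_version: Optional[str] = None,
                   enabled: bool = True) -> Dict[str, Any]:
    """Hypothetical helper: build a request.meta dict for the curl_cffi handler/middleware."""
    meta: Dict[str, Any] = {
        "use_curl_cffi": enabled,
        "curl_cffi_impersonate": impersonate,
    }
    if http_version is not None:
        if http_version not in ALLOWED_HTTP_VERSIONS:
            raise ValueError(f"unsupported http_version: {http_version!r}")
        meta["curl_cffi_http_version"] = http_version
    return meta


# Force HTTP/1.1, e.g. for a server that triggers curl err 92 over HTTP/2.
meta = curl_cffi_meta(http_version="v1")
```

In a spider, the resulting dict would be passed as the meta argument of scrapy.Request, for example scrapy.Request(url, meta=curl_cffi_meta(http_version="v1")).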

Compatibility notes

  • Reads per-spider config from SERVICES[SPIDER_NAME_UPPER] (CAPTCHA_REQUIRED, SITE_KEY, HEADERS), same as your compliance_scraper.
  • Honors spider.captcha_needed and spider.site_key, and sets request.meta["recaptcha_solution"].
  • Webhook SQLite schema and path match the original (/var/lib/scrapyd/webhook_solutions.db) for drop-in compatibility.
  • Session/API middlewares merge with service_config["HEADERS"] (exactly like your project).

Prefer Add-on mode for fleet-wide consistency. The original presets.py helpers still work if you want to compose stacks manually.
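
The header merge mentioned in these notes can be pictured with the sketch below, in which global SESSION_HEADERS act as defaults and per-spider service_config["HEADERS"] win on conflict. The function name merged_headers is hypothetical, and the sketch compares header names case-sensitively, which the real middlewares may not do.

```python
from typing import Any, Dict


def merged_headers(session_headers: Dict[str, str],
                   service_config: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: global defaults first, per-spider HEADERS override them."""
    merged = dict(session_headers)  # copy so the global defaults stay untouched
    merged.update(service_config.get("HEADERS", {}))
    return merged


SESSION_HEADERS = {"Accept": "application/json", "Accept-Language": "en"}
service_config = {"HEADERS": {"Accept": "text/html,application/xhtml+xml"}}

result = merged_headers(SESSION_HEADERS, service_config)
```

Here the spider's Accept header replaces the global one, while Accept-Language is inherited unchanged.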

How to use the new Add-on

In your Scrapy project (that consumes libs/scrapyx-mw):

# settings.py
ADDONS = {
    "scrapyx_mw.addon.ScrapyxAddon": 0,
}

# Optional toggles
SCRAPYX_SESSION_ENABLED = True
SCRAPYX_API_REQUEST_ENABLED = True
SCRAPYX_DEBUG_ENABLED = False

# Captcha
SCRAPYX_CAPTCHA_MODE = "polling"   # or "webhook" or "none"
SCRAPYX_CAPTCHA_ENABLED = True     # or set CAPTCHA_ENABLED=True
CAPTCHA_API_KEY = "your-2captcha-key"
# CAPTCHA_WEBHOOK_URL = "http://127.0.0.1:6801/webhook"

# (keep your existing SERVICES mapping)
SERVICES = {
    "EXAMPLE_SERVICE": {
        "CAPTCHA_REQUIRED": True,
        "SITE_KEY": "6Lxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "HEADERS": {"Accept": "text/html,application/xhtml+xml"}
    }
}

If you choose webhook mode, run the sidecar on the Scrapyd host:

uv run python -m scrapyx_mw.scrapyd.webhook_service
# Health: curl http://127.0.0.1:6801/health

ConfigValidator Extension

This extension validates CAPTCHA-related configuration on startup and aborts misconfigured crawls early.

Enabled automatically by the Add-on. To use independently:

EXTENSIONS = {
    "scrapyx_mw.extensions.config_validator.ConfigValidator": 10,
}

Checks performed

  • If CAPTCHA_ENABLED=True but CAPTCHA_API_KEY is missing → fail.
  • For each spider where SERVICES[SPIDER_NAME_UPPER]["CAPTCHA_REQUIRED"]=True:
    • Ensure SITE_KEY is set.

Useful in CI and Scrapyd to prevent broken deployments.
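
The two checks listed above could look roughly like this. It is a sketch over a plain dict, not the extension's implementation; validate_captcha_config is a hypothetical name and the error messages are invented for illustration.

```python
from typing import Any, Dict, List


def validate_captcha_config(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passes."""
    errors: List[str] = []

    # Check 1: captcha enabled but no API key.
    if settings.get("CAPTCHA_ENABLED") and not settings.get("CAPTCHA_API_KEY"):
        errors.append("CAPTCHA_ENABLED is true but CAPTCHA_API_KEY is missing")

    # Check 2: every service that requires a captcha must define SITE_KEY.
    for name, cfg in settings.get("SERVICES", {}).items():
        if cfg.get("CAPTCHA_REQUIRED") and not cfg.get("SITE_KEY"):
            errors.append(f"SERVICES[{name!r}] requires a captcha but SITE_KEY is not set")

    return errors


bad_settings = {
    "CAPTCHA_ENABLED": True,
    "CAPTCHA_API_KEY": "",
    "SERVICES": {"EXAMPLE_SERVICE": {"CAPTCHA_REQUIRED": True}},
}
problems = validate_captcha_config(bad_settings)
```

Collecting every problem before failing lets a CI run report all misconfigurations at once instead of stopping at the first.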

🔄 Proxy Rotation Middleware

The proxy rotation middleware provides intelligent proxy management:

Configuration

# Enable proxy rotation
SCRAPYX_PROXY_ROTATION_ENABLED = True

# Proxy sources
SCRAPYX_PROXY_LIST = [
    "http://proxy1:8080",
    "http://proxy2:8080",
    "socks5://proxy3:1080"
]

# Or from environment variable
SCRAPYX_PROXY_ENV_VAR = "SCRAPYX_PROXY_LIST"  # Default
# export SCRAPYX_PROXY_LIST="http://proxy1:8080,http://proxy2:8080"

# Or from file
SCRAPYX_PROXY_FILE = "proxies.txt"

# Rotation strategy
SCRAPYX_PROXY_ROTATION_STRATEGY = "round_robin"  # "round_robin" | "random" | "weighted"

# Health checking
SCRAPYX_PROXY_HEALTH_CHECK = True
SCRAPYX_PROXY_HEALTH_CHECK_INTERVAL = 300  # seconds
SCRAPYX_PROXY_MAX_FAILURES = 3

# Session persistence
SCRAPYX_PROXY_SESSION_PERSISTENCE = True

Features

  • Multiple Sources: Load proxies from settings, environment variables, or files
  • Health Checking: Automatically remove failed proxies
  • Load Balancing: Round-robin, random, or weighted selection
  • Session Persistence: Use same proxy for requests with same session_id
  • Performance Tracking: Monitor proxy success rates and response times
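
To show how round-robin selection, failure-based removal, and session persistence might fit together, here is a small self-contained sketch. The class and method names are hypothetical, and the real middleware's health checks (interval-based probing, response-time tracking) are not modelled.

```python
from typing import Dict, List, Optional


class RoundRobinProxies:
    """Hypothetical rotator: round-robin over proxies that have not hit max_failures."""

    def __init__(self, proxies: List[str], max_failures: int = 3) -> None:
        self._proxies = list(proxies)
        self._failures: Dict[str, int] = {p: 0 for p in self._proxies}
        self._max_failures = max_failures
        self._next = 0
        self._sessions: Dict[str, str] = {}

    def _healthy(self) -> List[str]:
        return [p for p in self._proxies if self._failures[p] < self._max_failures]

    def pick(self, session_id: Optional[str] = None) -> str:
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        # Session persistence: reuse the pinned proxy while it is still healthy.
        if session_id is not None:
            pinned = self._sessions.get(session_id)
            if pinned in healthy:
                return pinned
        proxy = healthy[self._next % len(healthy)]
        self._next += 1
        if session_id is not None:
            self._sessions[session_id] = proxy
        return proxy

    def report_failure(self, proxy: str) -> None:
        self._failures[proxy] += 1

    def report_success(self, proxy: str) -> None:
        self._failures[proxy] = 0


rotator = RoundRobinProxies(
    ["http://proxy1:8080", "http://proxy2:8080", "socks5://proxy3:1080"],
    max_failures=3,
)
```

A proxy that reaches max_failures drops out of rotation until a success is reported for it; a session pinned to a dropped proxy is re-pinned on its next request.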

🔁 Smart Retry Middleware

The smart retry middleware provides intelligent retry logic with exponential backoff:

Configuration

# Enable smart retry
SCRAPYX_SMART_RETRY_ENABLED = True

# Retry configuration
SCRAPYX_RETRY_MAX_TIMES = 3
SCRAPYX_RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Backoff configuration
SCRAPYX_RETRY_BASE_BACKOFF = 1.0  # seconds
SCRAPYX_RETRY_MAX_BACKOFF = 60.0  # seconds
SCRAPYX_RETRY_BACKOFF_MULTIPLIER = 2.0
SCRAPYX_RETRY_JITTER_ENABLED = True
SCRAPYX_RETRY_JITTER_RANGE = 0.1

# Circuit breaker
SCRAPYX_RETRY_CIRCUIT_BREAKER_ENABLED = True
SCRAPYX_RETRY_CIRCUIT_BREAKER_THRESHOLD = 5  # failures before circuit opens
SCRAPYX_RETRY_CIRCUIT_BREAKER_TIMEOUT = 60  # seconds before circuit resets

# Priority handling
SCRAPYX_RETRY_PRIORITY_ENABLED = True
SCRAPYX_RETRY_PRIORITY_MULTIPLIER = 1.5  # reduce delay for high-priority requests
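
One plausible reading of the backoff settings above is sketched here: the delay grows geometrically from the base, is capped at the maximum, and is then perturbed by a random fraction. Treating SCRAPYX_RETRY_JITTER_RANGE as a plus-or-minus fraction of the delay is an assumption, as is the function name retry_delay.

```python
import random


def retry_delay(attempt: int,
                base: float = 1.0,
                multiplier: float = 2.0,
                max_backoff: float = 60.0,
                jitter_range: float = 0.1,
                rng=random) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    Assumed formula: min(base * multiplier**(attempt - 1), max_backoff),
    scaled by a random factor in [1 - jitter_range, 1 + jitter_range],
    then capped at max_backoff again.
    """
    delay = min(base * multiplier ** (attempt - 1), max_backoff)
    if jitter_range > 0:
        delay *= 1.0 + rng.uniform(-jitter_range, jitter_range)
    return min(delay, max_backoff)


# Without jitter the schedule is deterministic: 1s, 2s, 4s, ...
no_jitter = [retry_delay(n, jitter_range=0.0) for n in (1, 2, 3)]
```

Jitter spreads out retries from many concurrent requests so they do not all hit the server at the same instant.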

Features

  • Exponential Backoff: Increasing delays between retries
  • Jitter: Random variation to prevent thundering herd
  • Circuit Breaker: Stop retrying failed domains temporarily
  • Priority Support: Faster retries for high-priority requests
  • Statistics: Track retry success rates and delays per domain
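
The circuit breaker bullet can be illustrated with a per-domain counter that opens after a threshold of failures and resets after a timeout. This is a simplified sketch under assumed semantics (no half-open state, cumulative failure counting reset on success); the class name and methods are hypothetical.

```python
import time
from typing import Callable, Dict


class DomainCircuitBreaker:
    """Hypothetical per-domain breaker: open after `threshold` failures, reset after `timeout` seconds."""

    def __init__(self, threshold: int = 5, timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._timeout = timeout
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def allow(self, domain: str) -> bool:
        """Return True if requests (and retries) to this domain may proceed."""
        opened = self._opened_at.get(domain)
        if opened is None:
            return True
        if self._clock() - opened >= self._timeout:
            # Timeout elapsed: close the circuit and start counting afresh.
            del self._opened_at[domain]
            self._failures[domain] = 0
            return True
        return False

    def record_failure(self, domain: str) -> None:
        count = self._failures.get(domain, 0) + 1
        self._failures[domain] = count
        if count >= self._threshold:
            self._opened_at[domain] = self._clock()

    def record_success(self, domain: str) -> None:
        self._failures[domain] = 0
        self._opened_at.pop(domain, None)


breaker = DomainCircuitBreaker(threshold=5, timeout=60.0)
```

Passing the clock as a parameter keeps the class testable without sleeping; production code would use the default time.monotonic.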
