Scrapy downloader middleware that rotates proxies and retries on Cloudflare/DataDome/PerimeterX bans.

scrapy-rotating-proxy-middleware

A drop-in Scrapy downloader middleware that rotates proxies and retries on bans: 403, 429, Cloudflare "Just a moment", DataDome, and PerimeterX challenges. Point it at a static proxy list or a single rotating gateway and your spider stops dying on blocks.

pip install scrapy-rotating-proxy-middleware

Why

Scrapy's built-in HttpProxyMiddleware only applies the proxy it is given (through request.meta["proxy"] or the http_proxy/https_proxy environment variables) and never reacts when that exit IP gets blocked. In practice, many anti-bot blocks aren't about your spider logic; they're about the exit IP and the connection's TLS fingerprint being scored before your request reaches the page. This middleware:

  • assigns a proxy per request (random from a list, or a rotating gateway),
  • detects bans by status code and response-body signature (Cloudflare / DataDome / PerimeterX),
  • transparently rotates to a fresh proxy and retries, with a per-request retry budget,
  • moves inline user:pass credentials into the Proxy-Authorization header automatically.
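The last bullet can be illustrated with a minimal sketch that uses only the standard library. It is not the package's actual code: the function name split_proxy_credentials is hypothetical, and the sketch assumes HTTP Basic authentication, which is the scheme Scrapy's own HttpProxyMiddleware uses for inline credentials.

```python
import base64
from urllib.parse import unquote, urlsplit, urlunsplit


def split_proxy_credentials(proxy_url):
    """Return (proxy_url_without_credentials, Proxy-Authorization value or None).

    Hypothetical helper for illustration; assumes Basic authentication.
    """
    parts = urlsplit(proxy_url)

    # Rebuild the network location without the "user:pass@" prefix.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal needs its brackets back
        host = f"[{host}]"
    netloc = f"{host}:{parts.port}" if parts.port else host
    clean_url = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )

    if parts.username is None:
        return clean_url, None

    # Credentials may be percent-encoded inside the URL.
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return clean_url, f"Basic {token}"
```

Inside a downloader middleware, the first value would typically go into request.meta["proxy"] and the second into the request's Proxy-Authorization header.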

Setup

Enable it in settings.py and disable Scrapy's default proxy middleware:

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": None,
    "scrapy_rotating_proxy.middleware.RotatingProxyMiddleware": 610,
}

Option A — a rotating residential gateway (recommended)

A residential gateway gives you a new exit IP on every connection from a single URL, so you don't manage a list at all:

# settings.py
ROTATING_PROXY_GATEWAY = "http://USERNAME:PASSWORD@us.jibaoproxy.com:913"

Option B — a static proxy list

ROTATING_PROXY_LIST = [
    "http://USERNAME:PASSWORD@proxy-a.example.com:8000",
    "http://USERNAME:PASSWORD@proxy-b.example.com:8000",
    "socks5://USERNAME:PASSWORD@proxy-c.example.com:1080",
]

That's it — run your spider as usual.
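How the two options interact can be sketched as follows. This is an illustration under assumptions, not the package's implementation: the function name choose_proxy is hypothetical, and the only behavior taken from this README is that the list is used when no gateway is set and that list entries are picked at random.

```python
import random


def choose_proxy(settings, rng=random):
    """Pick the proxy URL for one request from a settings-like mapping.

    Hypothetical helper: the gateway wins when configured, otherwise a
    random entry from the static list is used. Returns None when neither
    setting is present.
    """
    gateway = settings.get("ROTATING_PROXY_GATEWAY")
    if gateway:
        return gateway

    proxies = settings.get("ROTATING_PROXY_LIST") or []
    if not proxies:
        return None
    return rng.choice(proxies)
```

Passing rng explicitly (for example random.Random(0)) makes the choice reproducible in tests.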

Configuration

Setting                      Default             Description
ROTATING_PROXY_GATEWAY                           Single rotating-gateway URL.
ROTATING_PROXY_LIST                              List of proxy URLs (used if no gateway).
ROTATING_PROXY_BAN_CODES     403, 407, 429, 503  Status codes treated as bans.
ROTATING_PROXY_MAX_RETRIES   5                   Proxy rotations per request before giving up.
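As an example, the two tuning settings could be overridden in settings.py like this. The values are illustrative, and writing the ban codes as a Python list is an assumption; the table above does not say which container type the middleware expects.

```python
# settings.py
# Assumption: ban codes are given as a list of integers.
ROTATING_PROXY_BAN_CODES = [403, 429]  # treat only these statuses as bans
ROTATING_PROXY_MAX_RETRIES = 3         # give up after three proxy rotations
```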

Set a proxy on a single request explicitly and the middleware leaves it alone:

yield scrapy.Request(url, meta={"proxy": "http://USERNAME:PASSWORD@host:port"})

Ban detection

A response counts as a ban when its status is in ROTATING_PROXY_BAN_CODES, or the first 4 KB of the body matches a known anti-bot signature (cf-chl, Just a moment, Attention Required, captcha-delivery/DataDome, px-captcha/PerimeterX). On a ban the request is re-scheduled with a fresh proxy and dont_filter=True, up to the retry budget.
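The rule above can be sketched in plain Python. This is a rough illustration, not the package's source: the function names, the proxy_retry_times meta key, and the case-insensitive matching are all assumptions, and the signature list simply mirrors the strings named in the paragraph.

```python
BAN_CODES = frozenset({403, 407, 429, 503})  # documented defaults
BAN_SIGNATURES = (  # lowercase, matched case-insensitively (assumption)
    b"cf-chl",
    b"just a moment",
    b"attention required",
    b"captcha-delivery",  # DataDome
    b"px-captcha",        # PerimeterX
)
SCAN_BYTES = 4096  # only the first 4 KB of the body is inspected


def is_ban(status, body, ban_codes=BAN_CODES):
    """Return True when the status code or the start of the body signals a ban."""
    if status in ban_codes:
        return True
    head = body[:SCAN_BYTES].lower()
    return any(signature in head for signature in BAN_SIGNATURES)


def plan_retry(meta, max_retries=5):
    """Return the meta dict for the next attempt, or None if the budget is spent.

    "proxy_retry_times" is a hypothetical meta key used to count rotations.
    """
    used = meta.get("proxy_retry_times", 0)
    if used >= max_retries:
        return None
    new_meta = dict(meta)
    new_meta["proxy_retry_times"] = used + 1
    # Drop the banned proxy so a fresh one is assigned on the next attempt.
    new_meta.pop("proxy", None)
    return new_meta
```

In a Scrapy downloader middleware, process_response could call these helpers and, when plan_retry returns a dict, return request.replace(meta=new_meta, dont_filter=True) so the request is scheduled again instead of being dropped by the duplicate filter.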

If you keep hitting bans after rotation, the exit IPs themselves are likely the problem: datacenter ranges are often scored as bot traffic at the ASN level, and residential exits with a clean ASN reputation tend to fare better. We build JiBao Proxy for exactly this: 72M+ residential IPs across 200+ countries, sticky sessions, and SOCKS5/HTTP gateways. The middleware works with any provider, though.

License

MIT
