
scrapy-common-downloadhandler

A composite Scrapy download handler that integrates cloudscraper, curl_cffi (via scrapy-impersonate), and Twisted HTTP/1.1 into a single handler with per-request routing via request.meta.

Inheritance Chain

HTTP11DownloadHandler             <- Twisted HTTP/1.1 (fallback)
  └── ImpersonateDownloadHandler  <- curl_cffi (when meta["impersonate"] is set)
        └── CommonDownloadHandler <- cloudscraper (when meta["use_cloudscraper"] is True)
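The routing implied by this chain can be sketched as a simple precedence check over request.meta. This is a simplified illustration of the dispatch order, not the handler's actual code:

```python
def select_handler(meta):
    # Most-derived handler wins: cloudscraper first, then curl_cffi,
    # falling through to plain Twisted HTTP/1.1 when neither flag is set.
    if meta.get("use_cloudscraper"):
        return "cloudscraper"
    if meta.get("impersonate"):
        return "impersonate"
    return "twisted-http11"

select_handler({"use_cloudscraper": True})  # -> "cloudscraper"
select_handler({"impersonate": "chrome"})   # -> "impersonate"
select_handler({})                          # -> "twisted-http11"
```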

Installation

pip install scrapy-common-downloadhandler

Quick Start

1. Configure the download handler

In your project's settings.py or spider's custom_settings:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_common_downloadhandler.CommonDownloadHandler",
    "https": "scrapy_common_downloadhandler.CommonDownloadHandler",
}
USER_AGENT = ""

USER_AGENT must be set to an empty string. This prevents Scrapy's UserAgentMiddleware from injecting a default User-Agent header (e.g. Scrapy/x.x.x), which would conflict with the browser User-Agent that curl_cffi automatically provides during impersonation — resulting in a TLS fingerprint / User-Agent mismatch detectable by anti-bot systems.

No other settings or flags are needed. All three download modes are available once the handler is configured.

2. Use in your spider

import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # cloudscraper
        yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)

        # curl_cffi impersonate
        yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)

        # default Twisted HTTP/1.1
        yield scrapy.Request(url, callback=self.parse)

Usage

cloudscraper Requests

# Basic
yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)

# With create_scraper() parameter passthrough
yield scrapy.Request(url, meta={
    "use_cloudscraper": True,
    "cloudscraper_args": {
        "browser": {"browser": "chrome", "mobile": False, "platform": "windows"},
        "delay": 10,
        "interpreter": "nodejs",
    },
}, callback=self.parse)

All keys in cloudscraper_args are passed directly to cloudscraper.create_scraper(**args).
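The passthrough is presumably a direct splat of the meta dict into create_scraper. A minimal sketch of that behavior, where extract_scraper_kwargs is a hypothetical helper and not part of the package API:

```python
def extract_scraper_kwargs(meta):
    # Hypothetical helper: pull cloudscraper_args out of request.meta
    # unchanged, ready to be passed as cloudscraper.create_scraper(**kwargs).
    return dict(meta.get("cloudscraper_args") or {})

meta = {
    "use_cloudscraper": True,
    "cloudscraper_args": {"delay": 10, "interpreter": "nodejs"},
}
kwargs = extract_scraper_kwargs(meta)
# create_scraper(**kwargs) would then receive delay=10, interpreter="nodejs"
```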

curl_cffi impersonate Requests

# Basic
yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)

# With parameter passthrough
yield scrapy.Request(url, meta={
    "impersonate": "chrome",
    "impersonate_args": {"timeout": 30},
}, callback=self.parse)

See scrapy-impersonate for full details on impersonate_args.

Default Twisted HTTP/1.1 Requests

# No special meta needed
yield scrapy.Request(url, callback=self.parse)

Parameter Passthrough Reference

Mode          meta flag                passthrough key        passthrough target
cloudscraper  use_cloudscraper: True   cloudscraper_args: {}  cloudscraper.create_scraper(**args)
curl_cffi     impersonate: "chrome"    impersonate_args: {}   curl_cffi request method
Twisted       (none)                   (none)                 Scrapy default settings

Proxy Support

Proxy middlewares that set request.meta["proxy"] work seamlessly:

  • cloudscraper: converts to proxies={"http": proxy, "https": proxy}
  • curl_cffi: read by ImpersonateDownloadHandler's RequestParser
  • Twisted: handled by Scrapy's built-in HttpProxyMiddleware
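The cloudscraper conversion described above can be sketched as follows. to_cloudscraper_proxies is a hypothetical helper shown for illustration, not the handler's actual internals:

```python
def to_cloudscraper_proxies(meta):
    # Hypothetical helper mirroring the conversion described above:
    # request.meta["proxy"] -> requests-style proxies dict for cloudscraper.
    proxy = meta.get("proxy")
    if not proxy:
        return {}
    return {"http": proxy, "https": proxy}

to_cloudscraper_proxies({"proxy": "http://127.0.0.1:8888"})
# -> {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
```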

scrapy-redis Compatibility

Fully compatible. scrapy-redis only handles scheduling and deduplication, which is independent of the download handler layer.

Response Flags

Responses carry a flag indicating which download mode was used:

  • "cloudscraper" in response.flags — downloaded via cloudscraper
  • "impersonate" in response.flags — downloaded via curl_cffi
  • Neither — downloaded via Twisted HTTP/1.1
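In a callback, the flag check above could look like this (download_mode is a hypothetical helper operating on response.flags, which Scrapy exposes as a plain list):

```python
def download_mode(flags):
    # Hypothetical helper: classify a response by the flags listed above.
    if "cloudscraper" in flags:
        return "cloudscraper"
    if "impersonate" in flags:
        return "impersonate"
    return "twisted"

download_mode(["cloudscraper"])  # -> "cloudscraper"
download_mode([])                # -> "twisted"
```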

Notes

  • USER_AGENT = "" is required. Without it, Scrapy's UserAgentMiddleware will set the User-Agent header before the request reaches the download handler, overriding the browser-matched User-Agent that curl_cffi provides during impersonation.
  • cloudscraper is a synchronous library (based on requests). The handler uses deferToThread to run it in a thread pool, avoiding reactor blocking.
  • Internal redirects are disabled (allow_redirects=False) in cloudscraper mode. Redirects are handled by Scrapy's RedirectMiddleware.
  • The Content-Encoding header is stripped from cloudscraper responses. Decompression is handled by Scrapy's HttpCompressionMiddleware.
  • Recent Scrapy versions default to AsyncioSelectorReactor, so no additional TWISTED_REACTOR configuration is needed.
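The deferToThread pattern mentioned in the notes (offloading a blocking, requests-based call so the reactor thread stays free) can be illustrated with stdlib tools. This is the same idea expressed with concurrent.futures, not the handler's actual code; blocking_download stands in for a synchronous scraper.get(url) call:

```python
from concurrent.futures import ThreadPoolExecutor

def blocking_download(url):
    # Stands in for a synchronous cloudscraper scraper.get(url) call;
    # run inline, blocking I/O like this would stall the Twisted reactor.
    return f"<html>body of {url}</html>"

# deferToThread does the equivalent of this inside Twisted's thread pool:
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(blocking_download, "https://example.com")
    body = future.result()  # the event loop stays free while the worker blocks
```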

License

MIT
