scrapy-common-downloadhandler
A composite Scrapy download handler that integrates cloudscraper, curl_cffi (via scrapy-impersonate), and Twisted HTTP/1.1 into a single handler with per-request routing via request.meta.
Inheritance Chain
```
HTTP11DownloadHandler                  <- Twisted HTTP/1.1 (fallback)
└── ImpersonateDownloadHandler         <- curl_cffi (when meta["impersonate"] is set)
    └── CommonDownloadHandler          <- cloudscraper (when meta["use_cloudscraper"] is True)
```
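The routing decision implied by this chain can be sketched as a small helper. This is only an illustration: `choose_mode` is a hypothetical name, and the real handler dispatches inside its `download_request` method rather than through a standalone function.

```python
def choose_mode(meta: dict) -> str:
    """Pick the download backend from request.meta.

    Mirrors the chain above: the most derived handler's check wins,
    and requests with no recognized meta keys fall through to
    Twisted HTTP/1.1.
    """
    if meta.get("use_cloudscraper"):
        return "cloudscraper"
    if meta.get("impersonate"):
        return "impersonate"
    return "twisted"
```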
Installation
```
pip install scrapy-common-downloadhandler
```
Quick Start
1. Configure the download handler
In your project's `settings.py` or a spider's `custom_settings`:

```python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_common_downloadhandler.CommonDownloadHandler",
    "https": "scrapy_common_downloadhandler.CommonDownloadHandler",
}

USER_AGENT = ""
```
`USER_AGENT` must be set to an empty string. This prevents Scrapy's `UserAgentMiddleware` from injecting a default User-Agent header (e.g. `Scrapy/x.x.x`), which would conflict with the browser User-Agent that curl_cffi supplies during impersonation and produce a TLS-fingerprint/User-Agent mismatch that anti-bot systems can detect.

No additional settings or flags are needed; all three download modes are available once the handler is configured.
2. Use in your spider
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # cloudscraper
        yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)
        # curl_cffi impersonate
        yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)
        # default Twisted HTTP/1.1
        yield scrapy.Request(url, callback=self.parse)
```
Usage
cloudscraper Requests
```python
# Basic
yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)

# With create_scraper() parameter passthrough
yield scrapy.Request(url, meta={
    "use_cloudscraper": True,
    "cloudscraper_args": {
        "browser": {"browser": "chrome", "mobile": False, "platform": "windows"},
        "delay": 10,
        "interpreter": "nodejs",
    },
}, callback=self.parse)
```
All keys in `cloudscraper_args` are passed directly to `cloudscraper.create_scraper(**args)`.
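Conceptually, this passthrough is plain keyword expansion. The sketch below shows the kwargs-collection step as a pure function so it can stand alone; `build_scraper_kwargs` is a hypothetical name, not part of the package's API.

```python
def build_scraper_kwargs(meta: dict) -> dict:
    """Collect the dict that would be expanded into
    cloudscraper.create_scraper(**kwargs).

    A missing or None "cloudscraper_args" key yields an empty dict,
    so create_scraper() would run with its defaults.
    """
    return dict(meta.get("cloudscraper_args") or {})
```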
curl_cffi impersonate Requests
```python
# Basic
yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)

# With parameter passthrough
yield scrapy.Request(url, meta={
    "impersonate": "chrome",
    "impersonate_args": {"timeout": 30},
}, callback=self.parse)
```
See scrapy-impersonate for full details on `impersonate_args`.
Default Twisted HTTP/1.1 Requests
```python
# No special meta needed
yield scrapy.Request(url, callback=self.parse)
```
Parameter Passthrough Reference
| Mode | meta flag | Passthrough key | Passthrough target |
|---|---|---|---|
| cloudscraper | `use_cloudscraper: True` | `cloudscraper_args: {}` | `cloudscraper.create_scraper(**args)` |
| curl_cffi | `impersonate: "chrome"` | `impersonate_args: {}` | curl_cffi request method |
| Twisted | (none) | (none) | Scrapy default settings |
Proxy Support
Proxy middlewares that set `request.meta["proxy"]` work seamlessly:

- cloudscraper: converted to `proxies={"http": proxy, "https": proxy}`
- curl_cffi: read by `ImpersonateDownloadHandler`'s `RequestParser`
- Twisted: handled by Scrapy's built-in `HttpProxyMiddleware`
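The cloudscraper-side conversion amounts to fanning a single proxy URL out to both schemes of the requests-style proxies mapping. A minimal sketch (the helper name is hypothetical):

```python
def proxies_from_meta(meta: dict) -> dict:
    """Translate Scrapy's single meta["proxy"] URL into the
    requests-style proxies dict that cloudscraper (built on
    requests) expects. Returns {} when no proxy is set, so the
    scraper connects directly.
    """
    proxy = meta.get("proxy")
    if not proxy:
        return {}
    return {"http": proxy, "https": proxy}
```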
scrapy-redis Compatibility
Fully compatible. scrapy-redis only handles scheduling and deduplication, which is independent of the download handler layer.
Response Flags
Responses carry a flag indicating which download mode was used:
- `"cloudscraper" in response.flags`: downloaded via cloudscraper
- `"impersonate" in response.flags`: downloaded via curl_cffi
- neither flag: downloaded via Twisted HTTP/1.1
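In a callback you can branch on these flags, for example to log or count which path served each page. A sketch (`mode_from_flags` is a hypothetical helper; `response.flags` is a plain list of strings):

```python
def mode_from_flags(flags) -> str:
    """Classify a response by the download-mode flag it carries."""
    if "cloudscraper" in flags:
        return "cloudscraper"
    if "impersonate" in flags:
        return "impersonate"
    return "twisted-http11"
```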
Notes
- `USER_AGENT = ""` is required. Without it, Scrapy's `UserAgentMiddleware` sets the User-Agent header before the request reaches the download handler, overriding the browser-matched User-Agent that curl_cffi provides during impersonation.
- cloudscraper is a synchronous library (built on requests). The handler runs it in a thread pool via `deferToThread`, avoiding reactor blocking.
- Internal redirects are disabled (`allow_redirects=False`) in cloudscraper mode; redirects are handled by Scrapy's `RedirectMiddleware`.
- The `Content-Encoding` header is stripped from cloudscraper responses; decompression is handled by Scrapy's `HttpCompressionMiddleware`.
- Scrapy's default reactor is `AsyncioSelectorReactor`; no additional `TWISTED_REACTOR` configuration is needed.
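The `Content-Encoding` stripping mentioned in the notes above reduces to a case-insensitive header filter, roughly like this (a sketch; `strip_content_encoding` is a hypothetical name, not the package's API):

```python
def strip_content_encoding(headers: dict) -> dict:
    """Remove Content-Encoding so HttpCompressionMiddleware does not
    try to decompress a body that cloudscraper (via requests) has
    already decoded. The match is case-insensitive because HTTP
    header names are.
    """
    return {k: v for k, v in headers.items() if k.lower() != "content-encoding"}
```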
License
MIT