Incubator for Scrapy download handlers

Overview

This is a collection of semi-official download handlers for Scrapy. See the Scrapy download handler documentation for more information.

They should work, and some of them may later be promoted to official status, but here they are provided as-is, with no support or stability promises. The documentation, including the lists of limitations and unsupported features, is also provided as-is and may be incomplete.

As this code may intentionally use private Scrapy APIs, it specifies a tight dependency on Scrapy. Currently only the unreleased 2.15.0 version is supported.

Features overview

The baseline for these handlers is the default Scrapy handler, HTTP11DownloadHandler, which uses Twisted and supports HTTP/1.1. Feature parity with it is an explicit goal, but parity is not always possible, and not every feature that could be implemented is implemented in every handler (this may change in the future). Popular features that HTTP11DownloadHandler lacks, such as HTTP/2 support, and features unique to particular handlers may or may not be implemented. See the sections for individual handlers for details.

The following table summarizes the most important differences:

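========================  ============  ================  ============  ============  =============  ==================
Handler                   HTTP/2        HTTP/3            Proxies       TLS logging   Impersonation  TLS version limits
========================  ============  ================  ============  ============  =============  ==================
(HTTP11DownloadHandler)   Not possible  Not possible      Yes           Yes           Not possible   No
AiohttpDownloadHandler    Not possible  Not possible      Yes           Partial       Not possible   No
CurlCffiDownloadHandler   Yes           Yes (not tested)  Yes           Not possible  No             Not possible
HttpxDownloadHandler      Yes           Not possible      Yes           Yes           Not possible   No
NiquestsDownloadHandler   Yes           No                Yes           Yes           Not possible   Not possible
PyreqwestDownloadHandler  Yes           Not possible      Not possible  Not possible  Not possible   No
========================  ============  ================  ============  ============  =============  ==================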
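As a rough aid for choosing a handler, the table above can be transcribed into a small lookup structure. This is only a sketch: the dictionary keys (`http2`, `proxies`, and so on) and the `handlers_with` helper are hypothetical names introduced here, not part of the package. Reading "Not possible" as "the underlying library cannot do it" and "No" as "not implemented in the handler" is an inference from the per-handler sections.

```python
# Feature matrix transcribed from the summary table (incubator handlers only).
# Keys and helper names are illustrative; they are not provided by the package.
FEATURES = {
    "AiohttpDownloadHandler": {
        "http2": "not possible", "http3": "not possible", "proxies": "yes",
        "tls_logging": "partial", "impersonation": "not possible",
        "tls_version_limits": "no",
    },
    "CurlCffiDownloadHandler": {
        "http2": "yes", "http3": "yes (not tested)", "proxies": "yes",
        "tls_logging": "not possible", "impersonation": "no",
        "tls_version_limits": "not possible",
    },
    "HttpxDownloadHandler": {
        "http2": "yes", "http3": "not possible", "proxies": "yes",
        "tls_logging": "yes", "impersonation": "not possible",
        "tls_version_limits": "no",
    },
    "NiquestsDownloadHandler": {
        "http2": "yes", "http3": "no", "proxies": "yes",
        "tls_logging": "yes", "impersonation": "not possible",
        "tls_version_limits": "not possible",
    },
    "PyreqwestDownloadHandler": {
        "http2": "yes", "http3": "not possible", "proxies": "not possible",
        "tls_logging": "not possible", "impersonation": "not possible",
        "tls_version_limits": "no",
    },
}


def handlers_with(*features: str) -> list[str]:
    """Return handlers whose table entry starts with "yes" for every feature."""
    return [
        name
        for name, row in FEATURES.items()
        if all(row[feature].startswith("yes") for feature in features)
    ]


if __name__ == "__main__":
    # Handlers that offer both HTTP/2 and proxy support according to the table.
    print(handlers_with("http2", "proxies"))
```

Entries such as "Partial" are deliberately not counted as support here; adjust the predicate if partial support is acceptable for your use case.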

All handlers support the following basic features unless their docs state otherwise:

  • Native asyncio integration without requiring a Twisted reactor

  • HTTP/1.1 for http and https schemes

  • Unified download handler exceptions

  • Proxies, including HTTP and HTTPS proxies for HTTP and HTTPS destinations

  • Proxy authentication via HttpProxyMiddleware

  • IPv6 destinations

  • DOWNLOAD_MAXSIZE, DOWNLOAD_WARNSIZE and the respective request meta keys

  • DOWNLOAD_TIMEOUT and the respective request meta key

  • DOWNLOAD_FAIL_ON_DATALOSS and the "dataloss" flag

  • Setting the download_latency request meta

  • DOWNLOAD_BIND_ADDRESS

  • DOWNLOAD_VERIFY_CERTIFICATES

  • headers_received and bytes_received signals

  • Not reading the proxy configuration from the environment variables

  • Not handling cookies, redirects, compression and other things handled by Scrapy itself
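The settings and request meta keys listed above can be combined as in the following sketch of a settings.py fragment. The concrete values are arbitrary examples, the choice of HttpxDownloadHandler is only for illustration, and `LARGE_FILE_META` is a hypothetical name. DOWNLOAD_VERIFY_CERTIFICATES is assumed to be a boolean.

```python
# settings.py fragment (sketch): project-wide limits applied by the handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
    "https": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
}

DOWNLOAD_TIMEOUT = 30                  # seconds before a download is aborted
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024    # abort responses larger than 10 MiB
DOWNLOAD_WARNSIZE = 1 * 1024 * 1024    # log a warning above 1 MiB
DOWNLOAD_FAIL_ON_DATALOSS = True       # treat truncated responses as failures
DOWNLOAD_VERIFY_CERTIFICATES = True    # assumption: boolean switch

# Per-request overrides use the corresponding request meta keys, e.g.
#   scrapy.Request(url, meta=LARGE_FILE_META)
# LARGE_FILE_META is a hypothetical name used only in this example.
LARGE_FILE_META = {
    "download_timeout": 120,
    "download_maxsize": 100 * 1024 * 1024,
    "download_warnsize": 50 * 1024 * 1024,
}
```

After a download, the handler sets `download_latency` in the request meta, so it can be read from `response.meta` in a callback.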

Handlers

AiohttpDownloadHandler

This handler supports HTTP/1.1 and uses the aiohttp library.

Install it with:

pip install scrapy-download-handlers-incubator[aiohttp]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.AiohttpDownloadHandler",
    "https": "scrapy_download_handlers_incubator.AiohttpDownloadHandler",
}

Features and limitations

==============================  =============================================================================
Proxies                         Yes (HTTPS proxies for HTTPS destinations are not supported on Python < 3.11)
HTTP/2                          No (not supported by the library)
TLS verbose logging             Partial (skipped for small responses)
response.ip_address             Partial (skipped for small responses)
response.certificate            Partial (DER bytes; skipped for small responses)
Per-request bindaddress         No (not supported by the library)
Proxy certificate verification  Follows DOWNLOAD_VERIFY_CERTIFICATES
==============================  =============================================================================

Notable features supported by the library but not implemented:

  • DNS resolving settings

  • Custom DNS resolvers

CurlCffiDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the curl_cffi library.

Install it with:

pip install scrapy-download-handlers-incubator[curl-cffi]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
    "https": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
}

Features and limitations

==============================  ====================================
Proxies                         Yes
HTTP/2                          Yes
HTTP/3                          Yes (but not tested)
TLS verbose logging             No (not supported by the library)
response.ip_address             Yes
response.certificate            No (not supported by the library)
Per-request bindaddress         No (not supported by the library)
Proxy certificate verification  Follows DOWNLOAD_VERIFY_CERTIFICATES
==============================  ====================================

Notable features supported by the library but not implemented:

  • Impersonation

  • Advanced libcurl tunables

Settings

  • CURL_CFFI_HTTP_VERSION (str, default: "v1", corresponding to “Enforce HTTP/1.1”): The HTTP version to use. The value is passed directly to the library, so the accepted values are those defined by curl_cffi.requests.utils.normalize_http_version(); the meanings of the underlying constants are described in the libcurl docs (CURLOPT_HTTP_VERSION). Set this to "v2tls" to enable HTTP/2 for HTTPS requests only, or to "v2" to enable it for all requests. Set this to "v3" to enable HTTP/3.
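For example, a settings.py fragment that enables this handler and restricts HTTP/2 to HTTPS requests might look like the sketch below. It uses only the setting name and values described above.

```python
# settings.py fragment (sketch): curl_cffi handler with HTTP/2 over TLS only.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
    "https": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
}

# "v1" (default): enforce HTTP/1.1
# "v2tls": HTTP/2 for HTTPS requests only
# "v2": HTTP/2 for all requests
# "v3": HTTP/3 (listed as supported but not tested)
CURL_CFFI_HTTP_VERSION = "v2tls"
```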

HttpxDownloadHandler

This is an updated copy of the official scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler handler. It supports HTTP/1.1 and HTTP/2 and uses the httpx library.

Install it with:

pip install scrapy-download-handlers-incubator[httpx]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
    "https": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
}

Features and limitations

==============================  ========================================
Proxies                         Yes (separate connection pool per proxy)
HTTP/2                          Yes
HTTP/3                          No (not supported by the library)
TLS verbose logging             Yes
response.ip_address             Yes
response.certificate            Yes (DER bytes)
Per-request bindaddress         No (not supported by the library)
Proxy certificate verification  Follows DOWNLOAD_VERIFY_CERTIFICATES
==============================  ========================================

Notable features supported by the library but not implemented:

  • SOCKS5 proxies

  • Alternative transports

  • Limiting the number of per-proxy connection pools to save resources

Settings

  • HTTPX_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.

NiquestsDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the niquests library.

Install it with:

pip install scrapy-download-handlers-incubator[niquests]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.NiquestsDownloadHandler",
    "https": "scrapy_download_handlers_incubator.NiquestsDownloadHandler",
}

Features and limitations

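==============================  ====================================
Proxies                         Yes
HTTP/2                          Yes
HTTP/3                          No (not implemented)
TLS verbose logging             Yes
response.ip_address             Yes
response.certificate            Yes (DER bytes)
Per-request bindaddress         No (not supported by the library)
Proxy certificate verification  Follows DOWNLOAD_VERIFY_CERTIFICATES
==============================  ====================================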

Notable features supported by the library but not implemented:

  • Custom DNS resolvers

  • SOCKS5 proxies

  • HTTP/2 tunables

Settings

  • NIQUESTS_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.

PyreqwestDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the pyreqwest library.

Install it with:

pip install scrapy-download-handlers-incubator[pyreqwest]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.PyreqwestDownloadHandler",
    "https": "scrapy_download_handlers_incubator.PyreqwestDownloadHandler",
}

Features and limitations

==============================  =================================
Proxies                         No (not supported by the library)
HTTP/2                          Yes
HTTP/3                          No (not supported by the library)
TLS verbose logging             No (not supported by the library)
response.ip_address             No (not supported by the library)
response.certificate            No (not supported by the library)
Per-request bindaddress         No (not supported by the library)
==============================  =================================

Notable features supported by the library but not implemented:

  • HTTP/2 tunables

Settings

  • PYREQWEST_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.
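The httpx, niquests and pyreqwest handlers each expose a boolean setting for HTTP/2, so switching between them amounts to changing a class path and one setting name. The following sketch builds such a settings dictionary; `handler_settings`, `PACKAGE` and `HTTP2_SETTINGS` are hypothetical names introduced for this example and are not part of the package.

```python
# Sketch: build the settings for one of the handlers that has a boolean
# HTTP/2 switch. All names defined here are illustrative only.
PACKAGE = "scrapy_download_handlers_incubator"

# Handler class name -> name of its HTTP/2 setting (taken from the docs above).
HTTP2_SETTINGS = {
    "HttpxDownloadHandler": "HTTPX_HTTP2_ENABLED",
    "NiquestsDownloadHandler": "NIQUESTS_HTTP2_ENABLED",
    "PyreqwestDownloadHandler": "PYREQWEST_HTTP2_ENABLED",
}


def handler_settings(handler: str, http2: bool = False) -> dict:
    """Return a settings dict that enables `handler` for http and https."""
    if handler not in HTTP2_SETTINGS:
        raise ValueError(f"no boolean HTTP/2 setting known for {handler!r}")
    path = f"{PACKAGE}.{handler}"
    return {
        "DOWNLOAD_HANDLERS": {"http": path, "https": path},
        HTTP2_SETTINGS[handler]: http2,
    }


if __name__ == "__main__":
    print(handler_settings("PyreqwestDownloadHandler", http2=True))
```

The resulting dictionary could, for instance, be assigned to a spider's `custom_settings` attribute to try a handler on a single spider without changing the project-wide settings.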
