Incubator for Scrapy download handlers

Overview

This is a collection of semi-official download handlers for Scrapy. See the Scrapy download handler documentation for more information.

They should work, and some may later be promoted to official status, but they are provided as-is, with no support or stability promises. The documentation, including the lists of limitations and unsupported features, is likewise provided as-is and may be incomplete.

As this code intentionally uses private Scrapy APIs, it specifies a tight dependency on Scrapy. This version of the package only supports Scrapy 2.16.x.
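
Because of the tight version pin, it can be useful to confirm at startup which Scrapy series is installed. The following is a minimal sketch, not part of this package: the helper names are hypothetical, and it only compares the major and minor components against the 2.16.x series stated above.

```python
from importlib.metadata import PackageNotFoundError, version

# Taken from the "Scrapy 2.16.x" statement above; update when the pin changes.
SUPPORTED_SERIES = ("2", "16")


def is_supported_scrapy(scrapy_version: str) -> bool:
    """Return True if the version string belongs to the supported series."""
    return tuple(scrapy_version.split(".")[:2]) == SUPPORTED_SERIES


def installed_scrapy_is_supported() -> bool:
    """Check the Scrapy version installed in the current environment."""
    try:
        return is_supported_scrapy(version("scrapy"))
    except PackageNotFoundError:
        # Scrapy is not installed at all.
        return False
```

In practice pip should already refuse an incompatible combination; a check like this only helps when diagnosing an environment that was assembled by other means.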

Features overview

The baseline for these handlers is the default Scrapy handler, HTTP11DownloadHandler, which uses Twisted and supports HTTP/1.1. Feature parity with it is an explicit goal, but parity is not always possible, and not every feature that is possible is implemented in every handler (this may change in the future). Popular features that HTTP11DownloadHandler lacks, such as HTTP/2 support, and features unique to individual handlers may or may not be implemented. See the section for each handler for details.

The following table summarizes the most important differences:

Handler                   HTTP/2        HTTP/3            Proxies       TLS logging   Impersonation  TLS version limits
(HTTP11DownloadHandler)   Not possible  Not possible      Yes           Yes           Not possible   No
AiohttpDownloadHandler    Not possible  Not possible      Yes           Yes           Not possible   No
CurlCffiDownloadHandler   Yes           Yes (not tested)  Yes           Not possible  No             Not possible
HttpxDownloadHandler      Yes           Not possible      Yes           Yes           Not possible   No
NiquestsDownloadHandler   Yes           No                Yes           Yes           Not possible   Not possible
PyreqwestDownloadHandler  Yes           Not possible      Not possible  Not possible  Not possible   No

The following basic features are supported by all handlers unless the handler's section says otherwise:

  • Native asyncio integration without requiring a Twisted reactor

  • HTTP/1.1 for http and https schemes

  • Unified download handler exceptions

  • Proxies, including HTTP and HTTPS proxies for HTTP and HTTPS destinations

  • Proxy authentication via HttpProxyMiddleware

  • IPv6 destinations

  • DOWNLOAD_MAXSIZE, DOWNLOAD_WARNSIZE and the respective request meta keys

  • DOWNLOAD_TIMEOUT and the respective request meta key

  • DOWNLOAD_FAIL_ON_DATALOSS and the "dataloss" flag

  • Setting the download_latency request meta

  • DOWNLOAD_BIND_ADDRESS

  • DOWNLOAD_VERIFY_CERTIFICATES

  • headers_received and bytes_received signals

  • Not reading the proxy configuration from the environment variables

  • Not handling cookies, redirects, compression and other things handled by Scrapy itself
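
As an illustration of the download limits listed above, here is a minimal sketch of project settings together with a helper that builds per-request overrides. The numeric values are arbitrary examples and per_request_meta() is a hypothetical helper, not part of this package; the meta keys are the standard Scrapy ones.

```python
# settings.py (sketch): project-wide limits that the handlers are documented to honor.
DOWNLOAD_TIMEOUT = 30                 # seconds
DOWNLOAD_MAXSIZE = 10 * 1024 * 1024   # cancel responses larger than 10 MiB
DOWNLOAD_WARNSIZE = 1 * 1024 * 1024   # log a warning above 1 MiB
DOWNLOAD_FAIL_ON_DATALOSS = True


def per_request_meta(timeout=None, maxsize=None, warnsize=None, fail_on_dataloss=None):
    """Build a Request.meta dict that overrides the limits for a single request."""
    overrides = {
        "download_timeout": timeout,
        "download_maxsize": maxsize,
        "download_warnsize": warnsize,
        "download_fail_on_dataloss": fail_on_dataloss,
    }
    # Only include keys the caller actually set, so the settings apply otherwise.
    return {key: value for key, value in overrides.items() if value is not None}
```

A request could then be created as scrapy.Request(url, meta=per_request_meta(timeout=5)), leaving the other limits at their project-wide values.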

Handlers

AiohttpDownloadHandler

This handler supports HTTP/1.1 and uses the aiohttp library.

Install it with:

pip install scrapy-download-handlers-incubator[aiohttp]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.AiohttpDownloadHandler",
    "https": "scrapy_download_handlers_incubator.AiohttpDownloadHandler",
}

Features and limitations

HTTP proxies                     Yes (HTTPS proxies for HTTPS destinations are not supported on Python < 3.11)
SOCKS proxies                    No (not supported by the library)
HTTP/2                           No (not supported by the library)
TLS verbose logging              Yes
response.ip_address              Yes
response.certificate             Yes (DER bytes)
Per-request bindaddress          No (not supported by the library)
Proxy certificate verification   Follows DOWNLOAD_VERIFY_CERTIFICATES

Notable features supported by the library but not implemented:

  • DNS resolving settings

  • Custom DNS resolvers
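
Since response.certificate is listed above as DER bytes, it can be handled with the standard library alone. The following is a minimal sketch, assuming the attribute holds the raw DER-encoded certificate as bytes; describe_certificate() is a hypothetical helper, not part of this package.

```python
import hashlib
import ssl


def describe_certificate(der_bytes: bytes) -> dict:
    """Return a PEM rendering and a SHA-256 fingerprint for DER certificate bytes."""
    return {
        # PEM is the base64 text form of the same DER data.
        "pem": ssl.DER_cert_to_PEM_cert(der_bytes),
        # The fingerprint is computed over the DER encoding.
        "sha256": hashlib.sha256(der_bytes).hexdigest(),
    }
```

In a spider callback this might be called as describe_certificate(response.certificate), after checking that the attribute is not None (for example, for plain http responses).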

CurlCffiDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the curl_cffi library.

Install it with:

pip install scrapy-download-handlers-incubator[curl-cffi]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
    "https": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
}

Features and limitations

HTTP proxies                     Yes
SOCKS proxies                    Yes (SOCKS4, SOCKS5)
HTTP/2                           Yes
HTTP/3                           Yes (but not tested)
TLS verbose logging              No (not supported by the library)
response.ip_address              Yes
response.certificate             No (not supported by the library)
Per-request bindaddress          No (not supported by the library)
Proxy certificate verification   Follows DOWNLOAD_VERIFY_CERTIFICATES

Notable features supported by the library but not implemented:

  • Impersonation

  • Advanced libcurl tunables

Settings

  • CURL_CFFI_HTTP_VERSION (str, default: "v1", corresponding to “Enforce HTTP/1.1”): The HTTP version to use. The value is passed directly to the library, so the accepted values are defined by curl_cffi.requests.utils.normalize_http_version(), and the meanings of the underlying constants are described in the libcurl docs (CURLOPT_HTTP_VERSION). Set this to "v2tls" to enable HTTP/2 for HTTPS requests only, or to "v2" to enable it for all requests. Set this to "v3" to enable HTTP/3.
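
Putting the handler registration and this setting together, a project that wants HTTP/2 for HTTPS requests only might use something like the following sketch. It uses only the names documented above; the choice of "v2tls" is an example.

```python
# settings.py (sketch): route both schemes through the curl_cffi handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
    "https": "scrapy_download_handlers_incubator.CurlCffiDownloadHandler",
}

# "v2tls" enables HTTP/2 for HTTPS requests only; plain http requests stay on HTTP/1.1.
# Use "v2" for all requests, or "v3" for HTTP/3 (documented above as not tested).
CURL_CFFI_HTTP_VERSION = "v2tls"
```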

HttpxDownloadHandler

This is an updated copy of the official scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler handler. It supports HTTP/1.1 and HTTP/2 and uses the httpx library.

Install it with:

pip install scrapy-download-handlers-incubator[httpx]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
    "https": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
}

Features and limitations

HTTP proxies                     Yes (separate connection pool per proxy)
SOCKS proxies                    Yes (SOCKS5; separate connection pool per proxy; requires httpx[socks])
HTTP/2                           Yes (requires httpx[http2])
HTTP/3                           No (not supported by the library)
TLS verbose logging              Yes
response.ip_address              Yes
response.certificate             Yes (DER bytes)
Per-request bindaddress          No (not supported by the library)
Proxy certificate verification   Follows DOWNLOAD_VERIFY_CERTIFICATES

Notable features supported by the library but not implemented:

  • Alternative transports

  • Limiting the number of per-proxy connection pools to save resources

Settings

  • HTTPX_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.
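
The handler and its HTTP/2 switch can also be scoped to a single spider through Scrapy's custom_settings mechanism instead of the project settings. This is a minimal sketch: the dict name HTTPX_HTTP2_SETTINGS is hypothetical, and it assumes the httpx[http2] extra noted above is installed.

```python
# Per-spider settings sketch. Assign this dict to a spider's custom_settings
# class attribute so that only that spider uses the httpx handler.
HTTPX_HTTP2_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
        "https": "scrapy_download_handlers_incubator.HttpxDownloadHandler",
    },
    # HTTP/2 is off by default and has to be enabled explicitly.
    "HTTPX_HTTP2_ENABLED": True,
}
```

The same pattern should apply to the other handlers that expose a boolean HTTP/2 setting, with the handler path and setting name swapped accordingly.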

NiquestsDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the niquests library.

Install it with:

pip install scrapy-download-handlers-incubator[niquests]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.NiquestsDownloadHandler",
    "https": "scrapy_download_handlers_incubator.NiquestsDownloadHandler",
}

Features and limitations

HTTP proxies                     Yes
SOCKS proxies                    Yes (SOCKS4, SOCKS5; requires niquests[socks])
HTTP/2                           Yes
HTTP/3                           No (not implemented)
TLS verbose logging              Yes
response.ip_address              Yes
response.certificate             Yes (DER bytes)
Per-request bindaddress          No (not supported by the library)
Proxy certificate verification   Follows DOWNLOAD_VERIFY_CERTIFICATES

Notable features supported by the library but not implemented:

  • Custom DNS resolvers

  • HTTP/2 tunables

Settings

  • NIQUESTS_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.

PyreqwestDownloadHandler

This handler supports HTTP/1.1 and HTTP/2 and uses the pyreqwest library.

Install it with:

pip install scrapy-download-handlers-incubator[pyreqwest]

Enable it with:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_download_handlers_incubator.PyreqwestDownloadHandler",
    "https": "scrapy_download_handlers_incubator.PyreqwestDownloadHandler",
}

Features and limitations

Proxies                          No (not supported by the library)
HTTP/2                           Yes
HTTP/3                           No (not supported by the library)
TLS verbose logging              No (not supported by the library)
response.ip_address              No (not supported by the library)
response.certificate             No (not supported by the library)
Per-request bindaddress          No (not supported by the library)

Notable features supported by the library but not implemented:

  • HTTP/2 tunables

Settings

  • PYREQWEST_HTTP2_ENABLED (bool, default: False): Whether to enable HTTP/2.
