scrapy-zyte-api

Client library to process URLs through Zyte API

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Natural Language
- English
Operating System
- OS Independent

Project description

Requirements

Python 3.7+
Scrapy 2.6+

Installation

pip install scrapy-zyte-api

This package requires Python 3.7+.

Configuration

Replace the default http and https download handlers in Scrapy’s DOWNLOAD_HANDLERS setting, in the settings.py file of your Scrapy project.

You also need to set ZYTE_API_KEY, either as a setting or as an environment variable.

Lastly, enable the asyncio-based Twisted reactor through the TWISTED_REACTOR setting, also in settings.py.

Here is an example of what a Scrapy project’s settings.py file needs:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler"
}

# Setting ZYTE_API_KEY as an environment variable instead also works.
ZYTE_API_KEY = "<your API key>"

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Usage

To send a scrapy.Request through Zyte Data API, set the zyte_api key in Request.meta to a dict of Zyte API parameters:

import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={
                "zyte_api": {
                    "browserHtml": True,
                }
            },
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        # }

You can see the full list of parameters in the Zyte Data API Specification. The url parameter is filled in automatically from request.url; all other parameters must be set explicitly.

The raw Zyte Data API response can be accessed via the raw_api_response attribute of the response object.

When you use the Zyte Data API parameters browserHtml, httpResponseBody, or httpResponseHeaders, the response body and headers are set accordingly.

Note that, for Zyte Data API requests, the spider receives ZyteAPIResponse and ZyteAPITextResponse objects, which are subclasses of scrapy.http.Response and scrapy.http.TextResponse, respectively.

If multiple requests target the same URL with different Zyte Data API parameters, pass dont_filter=True to Request so that Scrapy’s duplicate filter does not drop them.

Setting default parameters

Often the same configuration needs to be used for all Zyte API requests. For example, all requests may need to set the same geolocation, or the spider only uses browserHtml requests.

To set default parameters for Zyte API enabled requests, define the ZYTE_API_DEFAULT_PARAMS setting in settings.py or through any other Scrapy settings mechanism:

ZYTE_API_DEFAULT_PARAMS = {
    "browserHtml": True,
    "geolocation": "US",
}

ZYTE_API_DEFAULT_PARAMS only applies to requests whose zyte_api key in Request.meta is set; defining ZYTE_API_DEFAULT_PARAMS does not make all requests go through Zyte Data API. Parameters in ZYTE_API_DEFAULT_PARAMS are merged with the parameters set via the zyte_api meta key, with the values in meta taking priority.

import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    custom_settings = {
        "ZYTE_API_DEFAULT_PARAMS": {
            "geolocation": "US",  # You can set any Geolocation region you want.
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={
                "zyte_api": {
                    "browserHtml": True,
                    "javascript": True,
                    "echoData": {"some_value_I_could_track": 123},
                }
            },
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        #     'echoData': {'some_value_I_could_track': 123},
        # }

        print(response.request.meta)
        # {
        #     'zyte_api': {
        #         'browserHtml': True,
        #         'geolocation': 'US',
        #         'javascript': True,
        #         'echoData': {'some_value_I_could_track': 123}
        #     },
        #     'download_timeout': 180.0,
        #     'download_slot': 'quotes.toscrape.com'
        # }

As a shortcut, when a request needs exactly the parameters defined in the ZYTE_API_DEFAULT_PARAMS setting, with no further customization, set the zyte_api meta key to True or {}:

import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    custom_settings = {
        "ZYTE_API_DEFAULT_PARAMS": {
            "browserHtml": True,
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={"zyte_api": True},
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        # }

        print(response.request.meta)
        # {
        #     'zyte_api': {
        #         'browserHtml': True,
        #     },
        #     'download_timeout': 180.0,
        #     'download_slot': 'quotes.toscrape.com'
        # }
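The rules in this section can be pictured as a plain dict merge. The following is only a sketch of the documented behavior, not the library's actual code; the function name effective_params and the "DE" geolocation value are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def effective_params(
    meta: Dict[str, Any], default_params: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Model how zyte_api meta values and default parameters combine."""
    # effective_params is a hypothetical name, not a scrapy-zyte-api function.
    if "zyte_api" not in meta:
        # No zyte_api key: the request does not go through Zyte Data API.
        return None
    params = meta["zyte_api"]
    if params is True:
        # Shortcut: True means "use the defaults as they are".
        params = {}
    # Later keys win, so values from meta override the defaults.
    return {**default_params, **params}


defaults = {"browserHtml": True, "geolocation": "US"}
print(effective_params({}, defaults))
# None
print(effective_params({"zyte_api": True}, defaults))
# {'browserHtml': True, 'geolocation': 'US'}
print(effective_params({"zyte_api": {"geolocation": "DE"}}, defaults))
# {'browserHtml': True, 'geolocation': 'DE'}
```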

Customizing the retry policy

API requests are retried automatically using the default retry policy of python-zyte-api.

API requests that still fail once their retries are exhausted are dropped. You cannot manage API request retries through Scrapy downloader middlewares.

Use the ZYTE_API_RETRY_POLICY setting or the zyte_api_retry_policy request meta key to override the default python-zyte-api retry policy with a custom retry policy.

A custom retry policy must be an instance of tenacity.AsyncRetrying.

For example, to also retry HTTP 521 errors the same as HTTP 520 errors, you can subclass RetryFactory as follows:

# settings.py
from tenacity import RetryCallState, retry_if_exception
from zyte_api.aio.errors import RequestError
from zyte_api.aio.retry import RetryFactory

def is_http_521(exc: BaseException) -> bool:
    return isinstance(exc, RequestError) and exc.status == 521

class CustomRetryFactory(RetryFactory):

    retry_condition = (
        RetryFactory.retry_condition
        | retry_if_exception(is_http_521)
    )

    def wait(self, retry_state: RetryCallState) -> float:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_wait(retry_state=retry_state)
        return super().wait(retry_state)

    def stop(self, retry_state: RetryCallState) -> bool:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_stop(retry_state)
        return super().stop(retry_state)

ZYTE_API_RETRY_POLICY = CustomRetryFactory().build()

Stats

Stats from python-zyte-api are exposed as Scrapy stats with the scrapy-zyte-api prefix.
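To look only at those stats, the collected stats can be filtered by prefix. This is a minimal sketch: the helper name zyte_api_stats is hypothetical, and the keys in the sample dict are made up, since the real stat names depend on python-zyte-api.

```python
from typing import Any, Dict

PREFIX = "scrapy-zyte-api"


def zyte_api_stats(all_stats: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the stats whose key starts with the scrapy-zyte-api prefix."""
    # zyte_api_stats is a hypothetical helper, not part of scrapy-zyte-api.
    return {
        key: value
        for key, value in all_stats.items()
        if key.startswith(PREFIX)
    }


# Made-up stats for illustration only; real key names may differ.
sample = {
    "downloader/request_count": 10,
    "scrapy-zyte-api/example_stat": 7,
}
print(zyte_api_stats(sample))
# {'scrapy-zyte-api/example_stat': 7}
```

In a spider, the dict returned by self.crawler.stats.get_stats() could be passed to this helper, for example from the spider's closed() method.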

Release history

0.18.2

Apr 25, 2024

0.18.1

Apr 19, 2024

0.18.0

Apr 17, 2024

0.17.3

Mar 18, 2024

0.17.2

Mar 14, 2024

0.17.1

Mar 11, 2024

0.17.0

Mar 5, 2024

0.16.1

Feb 23, 2024

0.16.0

Feb 8, 2024

0.15.0

Jan 31, 2024

0.14.1

Jan 17, 2024

0.14.0

Jan 15, 2024

0.13.0

Dec 13, 2023

0.12.2

Oct 19, 2023

0.12.1

Sep 29, 2023

0.12.0

Sep 26, 2023

0.11.1

Aug 25, 2023

0.11.0

Aug 7, 2023

0.10.0

Jul 14, 2023

0.9.0

Jun 13, 2023

0.8.4

May 26, 2023

0.8.3

May 17, 2023

0.8.2

May 2, 2023

0.8.1

Apr 13, 2023

0.8.0

Mar 28, 2023

0.7.1

Jan 25, 2023

0.7.0

Dec 9, 2022

0.6.0

Oct 20, 2022

0.5.1

Sep 20, 2022

0.5.0

Aug 25, 2022

0.4.2

Aug 3, 2022

0.4.1

Aug 1, 2022

This version

0.4.0

Aug 1, 2022

0.3.0

Jul 22, 2022

0.2.0

May 31, 2022

0.1.0

Feb 3, 2022

Download files

Source Distribution

scrapy-zyte-api-0.4.0.tar.gz (9.4 kB)

Uploaded Aug 1, 2022 Source

Built Distribution

scrapy_zyte_api-0.4.0-py3-none-any.whl (8.3 kB)

Uploaded Aug 1, 2022 Python 3

Hashes for scrapy-zyte-api-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`f678a39d425d812aee63751cd3f26c416ed48855f7d237c286283cc951c75c3e`
MD5	`68f71c2fe0909f9827596d9ab764e218`
BLAKE2b-256	`587dd33ed99bbb08c0f4eac8090349b6c96c8c6f649b1f2d17011d6cca0b4a0c`

Hashes for scrapy_zyte_api-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1699d86f17d6c8f4d7a79add60e8342392c45e975f245524f68358e7118df6b8`
MD5	`fdbe626205603e1dba8c77a3f2d817b1`
BLAKE2b-256	`8830ca8b613f09bd3ee6c95a7d522308330f2f26cc00b14fb39d4f0e8f7b3f3d`