scrapy-promise

Promise-style workflow for Scrapy

Promise API for making Scrapy requests.

Usage & Examples

from scrapy_promise import fetch

The Promise here works like Promise in JavaScript. If you are new to Promise, a great starting point would be MDN's Promise API reference and the guide to Using Promises.

Creating and making requests

fetch() accepts all arguments that scrapy.http.Request accepts, except for callback and errback:

>>> fetch('https://example.org/login', method='POST', meta={'username': 'admin'})

fetch() returns a Promise object, which is an iterator/generator. You can return it directly in start_requests, or yield from it in an existing callback.
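
For orientation, here is a minimal spider sketch showing both placements. The spider name and URLs are placeholders, not part of the library; handlers are attached with .then() as described in the next section.

import scrapy

from scrapy_promise import fetch


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # A Promise is an iterator of Requests, so it can be
        # returned directly from start_requests.
        return fetch('https://httpbin.org/ip')

    def parse_page(self, response):
        # In an existing callback (reached from some other request),
        # yield from the Promise instead.
        yield from fetch('https://httpbin.org/headers')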

Adding handlers

If you only call fetch() and yield from it, then all it does is store the response once the request is finished:

request = fetch('https://httpbin.org/ip')
yield from request
# When the request is done
>>> request.is_fulfilled
True
>>> request.get()
<200 https://httpbin.org/ip>

fetch() returns a Promise object. Call its .then() method with a callable, and the Promise will call it once there is a response.

.then() returns another Promise that you can yield from:

def on_fulfill(response: TextResponse):
    # You can yield items from your handler
    # just like you would in a Scrapy callback
    yield Item(response)

>>> yield from fetch(...).then(on_fulfill)

You can also attach an error handler with .catch(), which will receive either a Twisted Failure or an Exception:

def on_reject(exc: Union[Failure, Exception]):
    if isinstance(exc, Failure):
        exc = exc.value
    ...

>>> yield from fetch(...).then(on_fulfill).catch(on_reject)
# will catch both exceptions during the request
# and exceptions raised in on_fulfill
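
As a concrete sketch of that behavior (the handler names and the logging calls are illustrative, not part of the library), a parsing error raised inside on_fulfill reaches on_reject just like a download error would:

import json
import logging

from twisted.python.failure import Failure

def on_fulfill(response):
    data = json.loads(response.text)   # a parsing error raised here...
    yield {'origin': data['origin']}

def on_reject(exc):
    # ...arrives here, alongside Failures from the request itself.
    if isinstance(exc, Failure):
        exc = exc.value
    logging.getLogger(__name__).error('request or parsing failed: %r', exc)

yield from fetch('https://httpbin.org/ip').then(on_fulfill).catch(on_reject)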

Branching and chaining

Because .then() and .catch() return another Promise, you can chain additional handlers.

Subsequent handlers will receive the return value of the previous handler. This is different from an ordinary Scrapy callback, whose return value is never passed on to another callback:

yield from (
    fetch('https://httpbin.org/ip')
    .then(parse_json)   # returns dict
    .then(create_item)  # will be passed the dict from the previous handler
    .catch(lambda exc: logging.getLogger().error(exc)))
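
For illustration, the two handlers above might look something like this (the names and the item field are placeholders, based on the JSON that httpbin.org/ip returns):

import json

def parse_json(response):
    # Whatever a handler returns becomes the argument of the next .then() handler.
    return json.loads(response.text)

def create_item(data: dict):
    # Receives the dict returned by parse_json, not a Response,
    # and can yield items just like a Scrapy callback.
    yield {'origin': data['origin']}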

Dynamic chaining: If you return another fetch() request in your handler, that request will be scheduled, and the next handler will be called with the Response of this new request. This lets you schedule multiple requests in order.

yield from (
    fetch('https://httpbin.org/ip')
    # A second Request is created from the response of the first one and is scheduled.
    .then(lambda response: fetch(json.loads(response.text)['origin']))
    .then(lambda response: (yield Item(response)))
    .catch(lambda exc: logging.getLogger().error(exc)))

Note that only the request you return is connected to subsequent handlers; requests yielded in the middle of a handler are scheduled directly by Scrapy.
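
A sketch of that distinction (URLs and handler names are placeholders):

from scrapy import Request

from scrapy_promise import fetch

def parse_sidebar(response):
    # Ordinary Scrapy callback for the yielded request below.
    yield {'sidebar': response.url}

def on_fulfill(response):
    # Yielded mid-handler: scheduled directly by Scrapy with its own
    # callback, not connected to the handlers chained after this one.
    yield Request('https://example.org/sidebar', callback=parse_sidebar)
    # Returned: connected to the chain, so the next .then() handler
    # receives this request's Response.
    return fetch('https://example.org/next-page')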

You can also attach multiple handlers to one request, and they will be evaluated in the order they were declared:

resource = fetch(...)
resource.then(save_token)
resource.then(parse_html).catch(log_error)
resource.then(next_page).catch(stop_spider)
yield from resource  # Evaluating any Promise in a chain/branch causes
                     # the entire Promise tree to be evaluated.

Promise aggregation functions

Promise provides several aggregation functions that give you finer control over how the requests are scheduled.

from notcallback import Promise  # dependency

Promise.all() will only fulfill when all requests succeed, and will reject as soon as one of the requests fails. If all the requests succeed, the handler will receive a list of Responses:

def parse_pages(responses: Tuple[TextResponse, ...]):
    for r in responses:
        ...

yield from Promise.all(*[fetch(url) for url in urls]).then(parse_pages)

Promise.race() will settle as soon as the first of the requests is fulfilled or rejected.

def select_fastest_cdn():
    yield from (
        Promise.race(*[fetch(url, method='HEAD') for url in cdn_list])
        .then(crawl_server))

Promise.all_settled() always fulfills when all requests are finished, regardless of whether or not they are successful. The handler will receive a list of Promises whose value (the response) can be accessed with the .get() method:

def report(promises: Tuple[Promise, ...]):
    for promise in promises:
        result = promise.get()
        if isinstance(result, Response):
            log.info(f'Crawled {result.url}')
        else:
            log.warning(f'Encountered error {result}')

yield from Promise.all_settled(*[fetch(u) for u in urls]).then(report)

Promise.any() fulfills with the first request that fulfills, and rejects if no request is successful:

def download(response):
    ...

yield from (
    Promise.any(*[fetch(u) for u in urls])
    .then(download)
    .catch(lambda exc: log.warning('No valid URL!')))

For more info on the Promise API, see notcallback.

See also

Other ways to schedule requests within a callback:
