Nodriver integration for Scrapy

scrapy-nodriver: Nodriver integration for Scrapy

A Scrapy Download Handler which performs requests using Nodriver. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc).

What sets this package apart from packages like scrapy-playwright is that it is optimized to stay undetected by most anti-bot solutions. Communicating with the browser over the Chrome DevTools Protocol (CDP) provides even better resistance against web application firewalls (WAFs), while performance gets a massive boost.

Requirements

Since version 2.0, which added support for coroutine syntax and asyncio, Scrapy can integrate asyncio-based projects such as Nodriver.

Minimum required versions

Python >= 3.8
Scrapy >= 2.0 (!= 2.4.0)

Installation

scrapy-nodriver is available on PyPI and can be installed with pip:

pip install scrapy-nodriver

nodriver is declared as a dependency, so it is installed automatically.

Activation

Download handler

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
    "https": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
}

Note that the ScrapyNodriverDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic usage), requests will be processed by the regular Scrapy download handler.

Twisted reactor

Install the asyncio-based Twisted reactor:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is the default in new projects since Scrapy 2.7.

Basic usage

Set the nodriver Request.meta key to download a request using Nodriver:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"nodriver": True})

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}

Supported settings

`NODRIVER_MAX_CONCURRENT_PAGES`

Type Optional[int], defaults to the value of Scrapy's CONCURRENT_REQUESTS setting

Maximum number of Nodriver pages allowed to be open concurrently.

NODRIVER_MAX_CONCURRENT_PAGES = 8
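
As a rough mental model of what a page limit means, the sketch below caps concurrent work with an asyncio.Semaphore. It is not the handler's actual implementation: the crawl and fetch helpers are hypothetical, and asyncio.sleep stands in for loading a page in the browser.

```python
import asyncio


async def crawl(urls, max_pages):
    # One slot per page that may be open at the same time.
    slots = asyncio.Semaphore(max_pages)
    open_pages = 0
    peak = 0

    async def fetch(url):
        nonlocal open_pages, peak
        async with slots:  # wait here while max_pages pages are already open
            open_pages += 1
            peak = max(peak, open_pages)
            await asyncio.sleep(0.01)  # stand-in for loading the page
            open_pages -= 1
        return url

    results = await asyncio.gather(*(fetch(url) for url in urls))
    return results, peak


urls = [f"https://example.org/{i}" for i in range(20)]
results, peak = asyncio.run(crawl(urls, max_pages=8))
print(len(results), peak)
```

All 20 requests complete, but the number of simultaneously open "pages" never exceeds the limit.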

`NODRIVER_BLOCKED_URLS`

Type Optional[List], default None

A list of URL patterns. Resources requested by the page whose URL matches one of the patterns are blocked.

NODRIVER_BLOCKED_URLS = [
    "*/*.jpg",
    "*/*.png",
    "*/*.gif",
    "*/*.webp",
    "*/*.svg",
    "*/*.ico"
]
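
To get a feel for which URLs the patterns above would cover, the sketch below matches them with Python's fnmatch module. This assumes that `*` behaves like a shell-style wildcard that also matches `/`; the real matching is performed by the browser and may differ in details. The is_blocked helper is hypothetical and not part of scrapy-nodriver.

```python
from fnmatch import fnmatchcase

BLOCKED_PATTERNS = ["*/*.jpg", "*/*.png", "*/*.gif", "*/*.webp", "*/*.svg", "*/*.ico"]


def is_blocked(url, patterns=BLOCKED_PATTERNS):
    """Return True if the URL matches any wildcard pattern (assumed semantics)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


checks = {
    url: is_blocked(url)
    for url in (
        "https://example.org/static/logo.png",
        "https://example.org/favicon.ico",
        "https://example.org/index.html",
    )
}
for url, blocked in checks.items():
    print(url, "blocked" if blocked else "allowed")
```

Under this assumption a URL with a query string, such as one ending in ".png?v=2", would not match "*/*.png"; a pattern ending in another `*` would be needed for that.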

`NODRIVER_HEADLESS`

Type Optional[bool], default True

Whether to run the browser in headless mode, i.e. without a visible window.

NODRIVER_HEADLESS = True
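
Taken together, a project's settings.py might look like the sketch below. Only the setting names and handler paths come from this README; the concrete values (4 pages, two blocked patterns, a visible browser window) are illustrative assumptions.

```python
# settings.py (sketch; values are illustrative, not recommendations)

# Route http/https requests through the Nodriver download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
    "https": "scrapy_nodriver.handler.ScrapyNodriverDownloadHandler",
}

# The asyncio-based reactor is required.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Keep at most 4 browser pages open at a time.
NODRIVER_MAX_CONCURRENT_PAGES = 4

# Skip some image downloads.
NODRIVER_BLOCKED_URLS = ["*/*.jpg", "*/*.png"]

# Show the browser window, e.g. while debugging a spider.
NODRIVER_HEADLESS = False
```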

Supported `Request.meta` keys

`nodriver`

Type bool, default False

If set to a value that evaluates to True, the request will be processed by Nodriver.

return scrapy.Request("https://example.org", meta={"nodriver": True})
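
The phrase "a value that evaluates to True" refers to ordinary Python truthiness. The sketch below illustrates the resulting routing decision on plain meta dictionaries; uses_nodriver is a hypothetical helper written for this illustration, not a function provided by scrapy-nodriver.

```python
def uses_nodriver(meta):
    """Return True when the request would be handled by the browser."""
    return bool(meta.get("nodriver"))


examples = [
    {"nodriver": True},   # explicit opt-in
    {"nodriver": 1},      # any truthy value counts
    {"nodriver": False},  # explicit opt-out
    {},                   # key absent: regular Scrapy download
]
decisions = [uses_nodriver(meta) for meta in examples]
print(decisions)  # [True, True, False, False]
```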

`nodriver_include_page`

Type bool, default False

If True, the Nodriver page that was used to download the request will be available in the callback at response.meta['nodriver_page']. If False (or unset), the page will be closed immediately after processing the request.

Important!

This meta key is entirely optional; it's NOT necessary for the page to load or for any asynchronous operation to be performed (specifically, it's NOT necessary for PageMethod objects to be applied). Use it only if you need access to the Page object in the callback that handles the response.

For more information and important notes see Receiving Page objects in callbacks.

return scrapy.Request(
    url="https://example.org",
    meta={"nodriver": True, "nodriver_include_page": True},
)

`nodriver_page_methods`

Type Iterable[PageMethod], default ()

An iterable of scrapy_nodriver.page.PageMethod objects to indicate actions to be performed on the page before returning the final response. See Executing actions on pages.

`nodriver_page`

Type Optional[nodriver.Tab], default None

A Nodriver page to be used to download the request. If unspecified, a new page is created for each request. This key can be used in conjunction with nodriver_include_page to make a chain of requests using the same page. For instance:

from nodriver import Tab

def start_requests(self):
    yield scrapy.Request(
        url="https://httpbin.org/get",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

def parse(self, response, **kwargs):
    page: Tab = response.meta["nodriver_page"]
    yield scrapy.Request(
        url="https://httpbin.org/headers",
        callback=self.parse_headers,
        meta={"nodriver": True, "nodriver_page": page},
    )

Receiving Page objects in callbacks

When nodriver_include_page is set to True, the page used for the request is available in the callback at response.meta["nodriver_page"]. Callbacks that await coroutines on the page must be defined with async def. For instance:

from nodriver import Tab
import scrapy

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_first,
            meta={"nodriver": True, "nodriver_include_page": True},
            errback=self.errback_close_page,
        )

    def parse_first(self, response):
        page: Tab = response.meta["nodriver_page"]
        return scrapy.Request(
            url="https://example.com",
            callback=self.parse_second,
            meta={"nodriver": True, "nodriver_include_page": True, "nodriver_page": page},
            errback=self.errback_close_page,
        )

    async def parse_second(self, response):
        page: Tab = response.meta["nodriver_page"]
        title = await page.evaluate("document.title")  # "Example Domain"
        await page.close()
        return {"title": title}

    async def errback_close_page(self, failure):
        page: Tab = failure.request.meta["nodriver_page"]
        await page.close()

Notes:

When passing nodriver_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if nodriver_include_page=False, pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by NODRIVER_MAX_CONCURRENT_PAGES and crawls could freeze if the limit is reached and pages remain open indefinitely.
Defining callbacks as async def is only necessary if you need to await things; it's NOT necessary if you just need to pass the Page object from one callback to another (see the example above).
Any network operations resulting from awaiting a coroutine on a Page object (get, etc) will be executed directly by Nodriver, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
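
The freeze described in the first note can be reproduced in miniature without a browser. In the sketch below a semaphore plays the role of the page limit, which is an assumption about the mechanism rather than a description of the handler's internals. Two "pages" are handed out and never closed, so a third request times out while waiting; releasing one slot lets it proceed.

```python
import asyncio


async def demo(max_pages=2):
    slots = asyncio.Semaphore(max_pages)

    # Two callbacks each receive a page and never close it.
    for _ in range(max_pages):
        await slots.acquire()

    # A third request now waits for a free slot and gives up after a timeout.
    try:
        await asyncio.wait_for(slots.acquire(), timeout=0.05)
        stalled = False
    except asyncio.TimeoutError:
        stalled = True

    # Closing one page frees its slot, so the next request can proceed.
    slots.release()
    await asyncio.wait_for(slots.acquire(), timeout=0.05)
    resumed = True
    return stalled, resumed


stalled, resumed = asyncio.run(demo())
print(stalled, resumed)
```

In a real crawl there is no timeout to rescue the waiting requests, which is why closing pages in both the callback and the errback matters.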

Executing actions on pages

An ordered iterable (e.g. list, tuple) of PageMethod objects can be passed in the nodriver_page_methods Request.meta key to request methods to be invoked, in order, on the Page object before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page (like scrolling down or clicking links) and you want to handle only the final result in your callback.

`PageMethod` class

`scrapy_nodriver.page.PageMethod(method: str, *args, **kwargs)`:

Represents a method to be called (and awaited if necessary) on a nodriver.Tab object (e.g. "select", "save_screenshot", "evaluate", etc). method is the name of the method, *args and **kwargs are passed when calling such method. The return value will be stored in the PageMethod.result attribute.

For instance:

from scrapy import Request
from scrapy_nodriver.page import PageMethod

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={
            "nodriver": True,
            "nodriver_page_methods": [
                PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
            ],
        },
    )

def parse(self, response, **kwargs):
    screenshot = response.meta["nodriver_page_methods"][0]
    # screenshot.result contains the image file path

produces the same effect as:

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={"nodriver": True, "nodriver_include_page": True},
    )

async def parse(self, response, **kwargs):
    page = response.meta["nodriver_page"]
    filepath = await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()
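
The equivalence between the two snippets above follows from how a PageMethod is described: a method name plus arguments, called on the page, awaited if necessary, with the return value stored in result. The sketch below mimics that behaviour with stand-in classes so it runs without a browser. PageMethodSketch, FakePage and apply_page_methods are hypothetical names for illustration only, not the package's real classes.

```python
import asyncio
import inspect


class PageMethodSketch:
    """Hypothetical stand-in for scrapy_nodriver.page.PageMethod."""

    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


class FakePage:
    """Hypothetical stand-in for a browser page; no browser is started."""

    def __init__(self):
        self.calls = []

    async def wait_for(self, selector):
        self.calls.append(("wait_for", selector))

    async def save_screenshot(self, filename, full_page=False):
        self.calls.append(("save_screenshot", filename))
        return filename

    def viewport(self):
        # A plain (non-async) method, to show that it is not awaited.
        return (1280, 720)


async def apply_page_methods(page, page_methods):
    # Call each method in order; await the outcome only if it is awaitable.
    for pm in page_methods:
        outcome = getattr(page, pm.method)(*pm.args, **pm.kwargs)
        if inspect.isawaitable(outcome):
            outcome = await outcome
        pm.result = outcome
    return page_methods


methods = [
    PageMethodSketch("wait_for", "div.quote"),
    PageMethodSketch("save_screenshot", filename="example.jpeg", full_page=True),
    PageMethodSketch("viewport"),
]
page = FakePage()
asyncio.run(apply_page_methods(page, methods))
print([pm.result for pm in methods])
```

Looking methods up by name is what allows any method of the page object to be requested without the handler knowing about it in advance.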

Supported methods

Refer to the upstream docs for the Tab class to see available methods.

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
from scrapy_nodriver.page import PageMethod

class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                nodriver=True,
                nodriver_include_page=True,
                nodriver_page_methods=[
                    PageMethod("wait_for", "div.quote"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response, **kwargs):
        page = response.meta["nodriver_page"]
        await page.save_screenshot(filename="quotes.jpeg", full_page=True)
        await page.close()
        return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages

Known issues

No proxy support

Specifying a proxy via the proxy Request meta key is not supported.

Reporting issues

Before opening an issue please make sure the unexpected behavior can only be observed by using this package and not with standalone Nodriver. To do this, translate your spider code to a reasonably close Nodriver script: if the issue also occurs this way, you should instead report it upstream. For instance:

import scrapy
from scrapy_nodriver.page import PageMethod

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                nodriver=True,
                nodriver_page_methods=[
                    PageMethod("save_screenshot", filename="example.jpeg", full_page=True),
                ],
            ),
        )

translates roughly to:

import nodriver as uc

async def main():
    browser = await uc.start()
    page = await browser.get("https://example.org")
    await page.save_screenshot(filename="example.jpeg", full_page=True)
    await page.close()

if __name__ == '__main__':
    uc.loop().run_until_complete(main())

Release history

0.0.6: Aug 24, 2024 (this version)
0.0.5: Aug 24, 2024
0.0.4: Jul 29, 2024
0.0.3: Jul 29, 2024
0.0.2: Jul 29, 2024
