Playwright integration for Scrapy

This project provides a Scrapy Download Handler which performs requests using Playwright. It can be used to handle pages that require JavaScript. This package does not interfere with regular Scrapy workflows such as request scheduling or item processing.

Motivation

After the release of version 2.0, which includes partial coroutine syntax support and experimental asyncio support, Scrapy allows integrating asyncio-based projects such as Playwright.

Requirements

  • Python 3.7+
  • Scrapy 2.0+
  • Playwright 0.7.0+

Installation

$ pip install scrapy-playwright
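
Note that pip does not install the browser binaries Playwright drives. Depending on your Playwright version, they can typically be downloaded with:

$ python -m playwright install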

Configuration

Replace the default http and https Download Handlers via DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler, and it will only use Playwright for requests that are explicitly marked (see the "Basic usage" section for details).

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

scrapy-playwright accepts the following settings (a combined example follows the list):

  • PLAYWRIGHT_BROWSER_TYPE (type str, default chromium)

    The browser type to be launched. Valid values are chromium, firefox and webkit. See the docs for the BrowserType class.

  • PLAYWRIGHT_LAUNCH_OPTIONS (type dict, default {})

    A dictionary with options to be passed when launching the Browser. See the docs for BrowserType.launch.

  • PLAYWRIGHT_CONTEXT_ARGS (type dict, default {})

    A dictionary with keyword arguments to be passed when creating the default Browser context. See the docs for Browser.new_context.

  • PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT (type Optional[int], default None)

    The timeout used when requesting pages with Playwright. If None or unset, the default value will be used (30000 ms at the time of writing). See the docs for page.setDefaultNavigationTimeout.
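
Putting these together, a project's settings.py could look like the following sketch (the values shown are illustrative, not defaults; keyword names inside the dicts follow the installed Playwright version's API):

# settings.py -- illustrative sketch, not defaults
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "firefox"  # "chromium" (default), "firefox" or "webkit"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}  # passed to BrowserType.launch
PLAYWRIGHT_CONTEXT_ARGS = {"viewport": {"width": 1280, "height": 720}}  # passed to Browser.new_context
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000  # milliseconds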

Basic usage

Set the playwright Request.meta key to download a request using Playwright:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        yield {"url": response.url}

Page coroutines

An ordered iterable (a list, tuple or dict, for instance) can be passed in the playwright_page_coroutines Request.meta key to request coroutines to be awaited on the Page before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want everything to count as a single Scrapy Response, containing the final result.
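
For example, a single request could wait for an element and then click a link before its callback runs. A minimal sketch (the URL and selectors are illustrative; PageCoroutine, described below, comes from scrapy_playwright.page):

yield scrapy.Request(
    url="https://example.org",
    meta={
        "playwright": True,
        "playwright_page_coroutines": [
            PageCoroutine("waitForSelector", "a"),
            PageCoroutine("click", selector="a"),
        ],
    },
)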

Supported actions

  • scrapy_playwright.page.PageCoroutine(method: str, *args, **kwargs):

    Represents a coroutine to be awaited on a playwright.page.Page object, such as "click", "screenshot", "evaluate", etc. method should be the name of the coroutine; *args and **kwargs are passed to the function call.

    The coroutine result will be stored in the PageCoroutine.result attribute.

    For instance,

    PageCoroutine("screenshot", options={"path": "quotes.png", "fullPage": True})
    

    produces the same effect as:

    # 'page' is a playwright.async_api.Page object
    await page.screenshot(options={"path": "quotes.png", "fullPage": True})
    

Receiving the Page object in the callback

Specifying a non-False value for the playwright_include_page meta key for a request will result in the corresponding playwright.async_api.Page object being available in the playwright_page meta key in the request callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

import scrapy

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()  # "Example Domain"
        yield {"title": title}
        await page.close()

Notes:

  • In order to avoid memory issues, it is recommended to manually close the page by awaiting the Page.close coroutine.
  • Any network operations resulting from awaiting a coroutine on a Page object (goto, goBack, etc.) will be executed directly by Playwright, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc.).

Examples

Click on a link, save the resulting page as PDF

import scrapy
from scrapy_playwright.page import PageCoroutine

class ClickAndSavePdfSpider(scrapy.Spider):
    name = "pdf"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                playwright=True,
                playwright_page_coroutines={
                    "click": PageCoroutine("click", selector="a"),
                    "pdf": PageCoroutine("pdf", options={"path": "/tmp/file.pdf"}),
                },
            ),
        )

    def parse(self, response):
        pdf_bytes = response.meta["playwright_page_coroutines"]["pdf"].result
        with open("iana.pdf", "wb") as fp:
            fp.write(pdf_bytes)
        yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
from scrapy_playwright.page import PageCoroutine

class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("waitForSelector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("waitForSelector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(options={"path": "quotes.png", "fullPage": True})
        yield {"quote_count": len(response.css("div.quote"))}  # quotes from several pages
        await page.close()

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-playwright-0.0.2.tar.gz (8.3 kB)

Built Distribution

scrapy_playwright-0.0.2-py3-none-any.whl (8.3 kB)

File details

Details for the file scrapy-playwright-0.0.2.tar.gz.

File metadata

  • Download URL: scrapy-playwright-0.0.2.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.6

File hashes

Hashes for scrapy-playwright-0.0.2.tar.gz

  • SHA256: 40d39a58da621ba552c61ee9a52d159d264a720e487927f652653056d04dea73
  • MD5: 7b1fd4557895d0c88f6e2e31a40b6c25
  • BLAKE2b-256: 76e6e4a44e67001634c29e7b10a892bf03768ea2f20fc02dd0f8a8abdb143e9d

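To verify a downloaded file against these digests, a standard checksum tool can be used; for instance:

$ sha256sum scrapy-playwright-0.0.2.tar.gz
40d39a58da621ba552c61ee9a52d159d264a720e487927f652653056d04dea73  scrapy-playwright-0.0.2.tar.gz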

File details

Details for the file scrapy_playwright-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: scrapy_playwright-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 8.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.6

File hashes

Hashes for scrapy_playwright-0.0.2-py3-none-any.whl

  • SHA256: 2b8dede3ce0465b008bfe64f60004a30785146bf58be52aecfce070cb7ec6522
  • MD5: 287d2a99adcb084419600eeeec36a110
  • BLAKE2b-256: 39d3ec6114515b716ee3a9d3ce469684be5b1d19bb64a6980b4ae65d55da4479

