Playwright integration for Scrapy

This project provides a Scrapy Download Handler which performs requests using Playwright for Python. It can be used to handle pages that require JavaScript. This package does not interfere with regular Scrapy workflows such as request scheduling or item processing.

Motivation

After the release of version 2.0, which includes partial coroutine syntax support and experimental asyncio support, Scrapy allows integrating asyncio-based projects such as Playwright.

Requirements

Python >= 3.7
Scrapy >= 2.0 (!= 2.4.0)
Playwright >= 1.8.0a1

Installation

$ pip install scrapy-playwright

Changelog

Please see the changelog.md file.

Configuration

Replace the default http and https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler, and it will only use Playwright for requests that are explicitly marked (see the "Basic usage" section for details).

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Settings

scrapy-playwright accepts the following settings:

PLAYWRIGHT_BROWSER_TYPE (type str, default chromium)

The browser type to be launched. Valid values are chromium, firefox and webkit.

PLAYWRIGHT_LAUNCH_OPTIONS (type dict, default {})

A dictionary with options to be passed when launching the Browser. See the docs for BrowserType.launch.

PLAYWRIGHT_CONTEXT_ARGS (type dict, default {})

A dictionary with default keyword arguments to be passed when creating the "default" Browser context.

Deprecated: use PLAYWRIGHT_CONTEXTS instead.

PLAYWRIGHT_CONTEXTS (type dict[str, dict], default {})

A dictionary which defines Browser contexts to be created on startup. It should be a mapping of (name, keyword arguments). For instance:
```
{
    "first": {
        "context_arg1": "value",
        "context_arg2": "value",
    },
    "second": {
        "context_arg1": "value",
    },
}
```
If no contexts are defined, a default context (called default) is created. The arguments passed here take precedence over the ones defined in PLAYWRIGHT_CONTEXT_ARGS. See the docs for Browser.new_context.

PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT (type Optional[int], default None)

The timeout, in milliseconds, used when requesting pages by Playwright. If None or unset, the default value will be used (30000 ms at the time of writing this). See the docs for BrowserContext.set_default_navigation_timeout.
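
Taken together, the configuration and settings described so far might look like the following in a project's settings.py. This is only a sketch: the choice of firefox, the context names ("default" and "mobile"), the viewport sizes and the 10-second timeout are made-up values for illustration. headless is an option of BrowserType.launch and viewport is an argument of Browser.new_context.

```python
# settings.py (illustrative values only)

# Route http/https downloads through the Playwright-aware handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "firefox"  # one of: chromium, firefox, webkit
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}  # passed to BrowserType.launch

# Each key is a context name; each value holds keyword arguments
# for Browser.new_context.
PLAYWRIGHT_CONTEXTS = {
    "default": {"viewport": {"width": 1280, "height": 720}},
    "mobile": {"viewport": {"width": 375, "height": 667}},
}

PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000  # milliseconds
```

Because PLAYWRIGHT_CONTEXTS is set here, the deprecated PLAYWRIGHT_CONTEXT_ARGS setting is left out.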

Basic usage

Set the playwright Request.meta key to download a request using Playwright:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        yield {"url": response.url}

Notes about the User-Agent header

By default, outgoing requests include the User-Agent set by Scrapy (either with the USER_AGENT or DEFAULT_REQUEST_HEADERS settings or via the Request.headers attribute). This could cause some sites to react in unexpected ways, for instance if the user agent does not match the Browser being used. If you prefer to send the User-Agent from the Browser, set the Scrapy user agent to None.
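
A minimal sketch of that last option, assuming the user agent is configured project-wide through the USER_AGENT setting rather than per request:

```python
# settings.py
# Unset Scrapy's user agent so that the browser's own User-Agent header is sent.
USER_AGENT = None
```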

Receiving the Page object in the callback

Specifying a non-False value for the playwright_include_page meta key for a request will result in the corresponding playwright.async_api.Page object being available in the playwright_page meta key in the request callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

import scrapy
import playwright

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={"playwright": True, "playwright_include_page": True},
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()  # "Example Domain"
        await page.close()
        return {"title": title}

Notes:

In order to avoid memory issues, it is recommended to manually close the page by awaiting the Page.close coroutine.
Any network operations resulting from awaiting a coroutine on a Page object (goto, go_back, etc) will be executed directly by Playwright, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
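
One way to follow the first note is to close the page in a finally clause, so that it is released even when the callback fails. The sketch below is runnable without a browser: FakePage is a hypothetical stand-in for playwright.async_api.Page, and parse_with_cleanup stands for the body of an async def parse callback.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for playwright.async_api.Page, for illustration only."""

    def __init__(self):
        self.closed = False

    async def title(self):
        raise RuntimeError("simulated failure while reading the title")

    async def close(self):
        self.closed = True


async def parse_with_cleanup(page):
    """Body of an 'async def parse' callback, with the page passed in directly."""
    try:
        return {"title": await page.title()}
    finally:
        # Runs on success and on failure, so the page is always released.
        await page.close()


async def demo():
    page = FakePage()
    try:
        await parse_with_cleanup(page)
    except RuntimeError:
        pass  # the simulated error still propagates to the caller
    return page.closed


page_was_closed = asyncio.run(demo())
```

In a real spider, page would come from response.meta["playwright_page"].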

Multiple browser contexts

Multiple browser contexts to be launched at startup can be defined via the PLAYWRIGHT_CONTEXTS setting.

Choosing a specific context for a request

Pass the name of the desired context in the playwright_context meta key:

yield scrapy.Request(
    url="https://example.org",
    meta={"playwright": True, "playwright_context": "first"},
)

Creating a context during a crawl

If the context specified in the playwright_context meta key does not exist, it will be created. You can specify keyword arguments to be passed to Browser.new_context in the playwright_context_kwargs meta key:

yield scrapy.Request(
    url="https://example.org",
    meta={
        "playwright": True,
        "playwright_context": "new",
        "playwright_context_kwargs": {
            "java_script_enabled": False,
            "ignore_https_errors": True,
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        },
    },
)

Please note that if a context with the specified name already exists, that context is used and playwright_context_kwargs are ignored.
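
The lookup rule can be pictured as a "get or create" on a dictionary of contexts keyed by name. The following is a simplified sketch of that behaviour, not the handler's actual code; FakeBrowser and get_or_create_context are hypothetical names used only for illustration.

```python
import asyncio


class FakeBrowser:
    """Hypothetical stand-in for a Playwright Browser; records new_context calls."""

    def __init__(self):
        self.calls = []

    async def new_context(self, **kwargs):
        self.calls.append(kwargs)
        return {"kwargs": kwargs}  # a plain dict stands in for a BrowserContext


async def get_or_create_context(contexts, browser, name, kwargs=None):
    """Return the context registered under 'name', creating it only if missing."""
    if name not in contexts:
        contexts[name] = await browser.new_context(**(kwargs or {}))
    return contexts[name]


async def demo():
    browser, contexts = FakeBrowser(), {}
    first = await get_or_create_context(
        contexts, browser, "new", {"java_script_enabled": False}
    )
    # Same name, different kwargs: the existing context is reused as-is.
    second = await get_or_create_context(
        contexts, browser, "new", {"java_script_enabled": True}
    )
    return first is second, browser.calls


same_context, new_context_calls = asyncio.run(demo())
```

Only the first set of keyword arguments reaches new_context; to apply different arguments, use a different context name.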

Closing a context during a crawl

After receiving the Page object in your callback, you can access a context through the corresponding Page.context attribute, and await close on it.

def parse(self, response):
    yield scrapy.Request(
        url="https://example.org",
        callback=self.parse_in_new_context,
        meta={"playwright": True, "playwright_context": "new", "playwright_include_page": True},
    )

async def parse_in_new_context(self, response):
    page = response.meta["playwright_page"]
    title = await page.title()
    await page.context.close()  # close the context
    await page.close()
    return {"title": title}

Page coroutines

An ordered iterable (a list, tuple or dict, for instance) can be passed in the playwright_page_coroutines Request.meta key to request coroutines to be awaited on the Page before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want everything to count as a single Scrapy Response, containing the final result.

Supported actions

scrapy_playwright.page.PageCoroutine(method: str, *args, **kwargs):

Represents a coroutine to be awaited on a playwright.async_api.Page object, such as "click", "screenshot", "evaluate", etc. method should be the name of the coroutine; *args and **kwargs are passed to the function call.

The coroutine result will be stored in the PageCoroutine.result attribute.

For instance,
```
PageCoroutine("screenshot", path="quotes.png", full_page=True)
```
produces the same effect as:
```
# 'page' is a playwright.async_api.Page object
await page.screenshot(path="quotes.png", full_page=True)
```
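
To make the mechanics concrete, here is a rough sketch of how such objects could be processed: each method is looked up by name on the page, awaited with the stored arguments, and its return value kept in result. SketchPageCoroutine, FakePage and run_page_coroutines are hypothetical names for illustration; this is not the library's implementation.

```python
import asyncio


class SketchPageCoroutine:
    """Hypothetical simplification of PageCoroutine: a method name plus arguments."""

    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


class FakePage:
    """Hypothetical stand-in for playwright.async_api.Page."""

    async def click(self, selector):
        return None

    async def evaluate(self, expression):
        return f"evaluated: {expression}"


async def run_page_coroutines(page, page_coroutines):
    # A dict contributes its values; insertion order is preserved either way.
    if isinstance(page_coroutines, dict):
        page_coroutines = page_coroutines.values()
    for pc in page_coroutines:
        method = getattr(page, pc.method)  # look the coroutine up by name
        pc.result = await method(*pc.args, **pc.kwargs)


coroutines = {
    "click": SketchPageCoroutine("click", selector="a"),
    "eval": SketchPageCoroutine("evaluate", "1 + 1"),
}
asyncio.run(run_page_coroutines(FakePage(), coroutines))
eval_result = coroutines["eval"].result
```

Passing a dict rather than a list makes it easy to retrieve a particular result by key afterwards.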

Page events

A dictionary of Page event handlers can be specified in the playwright_page_event_handlers Request.meta key. Keys are the names of the events to be handled (dialog, download, etc). Values can be either callables or strings (in which case a spider method with that name will be looked up).

Example:

import logging

import scrapy
from playwright.async_api import Dialog, Response as PlaywrightResponse

async def handle_dialog(dialog: Dialog) -> None:
    logging.info(f"Handled dialog with message: {dialog.message}")
    await dialog.dismiss()

class EventSpider(scrapy.Spider):
    name = "event"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                playwright=True,
                playwright_page_event_handlers={
                    "dialog": handle_dialog,
                    "response": "handle_response",
                },
            ),
        )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        logging.info(f"Received response with URL {response.url}")

See the upstream Page docs for a list of the accepted events and the arguments passed to their handlers.

Note: keep in mind that, unless they are removed later, these handlers will remain attached to the page and will be called for subsequent downloads using the same page. This is usually not a problem, since by default requests are performed in single-use pages.
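
The rule for handler values (callables are used directly, strings are looked up on the spider) can be sketched as follows. FakeSpider and resolve_handlers are hypothetical names, and the handlers are plain functions here so the example runs without a browser. How the library itself reacts to a missing method name is not covered by this sketch.

```python
class FakeSpider:
    """Hypothetical spider exposing one handler as a method."""

    def handle_response(self, response):
        return f"spider saw {response}"


def handle_dialog(dialog):
    return f"function saw {dialog}"


def resolve_handlers(spider, handlers):
    """Map event names to callables, looking string values up on the spider."""
    resolved = {}
    for event, handler in handlers.items():
        if isinstance(handler, str):
            # In this sketch, a missing method raises AttributeError.
            handler = getattr(spider, handler)
        resolved[event] = handler
    return resolved


resolved = resolve_handlers(
    FakeSpider(),
    {"dialog": handle_dialog, "response": "handle_response"},
)
dialog_output = resolved["dialog"]("a dialog")
response_output = resolved["response"]("a response")
```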

Examples

Click on a link, save the resulting page as PDF

import scrapy
from scrapy_playwright.page import PageCoroutine


class ClickAndSavePdfSpider(scrapy.Spider):
    name = "pdf"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                playwright=True,
                playwright_page_coroutines={
                    "click": PageCoroutine("click", selector="a"),
                    "pdf": PageCoroutine("pdf", path="/tmp/file.pdf"),
                },
            ),
        )

    def parse(self, response):
        pdf_bytes = response.meta["playwright_page_coroutines"]["pdf"].result
        with open("iana.pdf", "wb") as fp:
            fp.write(pdf_bytes)
        yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
from scrapy_playwright.page import PageCoroutine


class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="quotes.png", full_page=True)
        await page.close()
        return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages

For more examples, please see the scripts in the examples directory.

Release history

0.0.46 (Jan 21, 2026)
0.0.45 (Jan 16, 2026)
0.0.44 (Aug 13, 2025)
0.0.43 (Feb 22, 2025)
0.0.42 (Nov 6, 2024)
0.0.41 (Aug 13, 2024)
0.0.40 (Jul 16, 2024)
0.0.39 (Jul 11, 2024)
0.0.38 (Jul 6, 2024)
0.0.37 (Jul 3, 2024)
0.0.36 (Jun 24, 2024)
0.0.35 (Jun 1, 2024)
0.0.34 (Jan 1, 2024)
0.0.33 (Oct 19, 2023)
0.0.32 (Sep 4, 2023)
0.0.31 (Aug 28, 2023)
0.0.30 (Aug 17, 2023)
0.0.29 (Aug 11, 2023)
0.0.28 (Aug 5, 2023)
0.0.27 (Jul 24, 2023)
0.0.26 (Feb 1, 2023)
0.0.25 (Jan 24, 2023)
0.0.24 (Dec 4, 2022)
0.0.23 (Nov 27, 2022)
0.0.22 (Oct 9, 2022)
0.0.21 (Aug 8, 2022)
0.0.20 (Aug 3, 2022)
0.0.19 (Jul 17, 2022)
0.0.18 (Jun 18, 2022)
0.0.17 (May 22, 2022)
0.0.16 (May 14, 2022)
0.0.15 (May 9, 2022)
0.0.14 (Mar 26, 2022)
0.0.13 (Mar 24, 2022)
0.0.12 (Mar 15, 2022)
0.0.11 (Mar 12, 2022)
0.0.10 (Mar 2, 2022)
0.0.9 (Jan 27, 2022)
0.0.8 (Jan 13, 2022)
0.0.7 (Oct 20, 2021), this version
0.0.6 (Oct 19, 2021)
0.0.5 (Aug 20, 2021)
0.0.4 (Jul 16, 2021)
0.0.3 (Feb 22, 2021)
0.0.2 (Jan 13, 2021)
0.0.1 (Dec 19, 2020)

Download files

Source Distribution

scrapy-playwright-0.0.7.tar.gz (14.1 kB), uploaded Oct 20, 2021, Source

Built Distribution

scrapy_playwright-0.0.7-py3-none-any.whl (10.6 kB), uploaded Oct 20, 2021, Python 3

File details

Details for the file scrapy-playwright-0.0.7.tar.gz.

File metadata

Download URL: scrapy-playwright-0.0.7.tar.gz
Upload date: Oct 20, 2021
Size: 14.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for scrapy-playwright-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`76a2510a6f59279c5269d5b4ba253fb22afe90892aae0e68569faf241f06f305`
MD5	`86cc06c1d02f6176143b0b07aabb0b5e`
BLAKE2b-256	`2b01f683d3e3121ac14b8ad6a1df051e2efca03eab92df59003b4d60b3e1f1ce`

File details

Details for the file scrapy_playwright-0.0.7-py3-none-any.whl.

File metadata

Download URL: scrapy_playwright-0.0.7-py3-none-any.whl
Upload date: Oct 20, 2021
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for scrapy_playwright-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`39cf37b7dff12fcfb713044ec50f7c03aeba45b2071c3fc2bd7d4320d9b8fda2`
MD5	`80fa047731d6f791feb8e471393c2118`
BLAKE2b-256	`fc4d853f7c20df89b1c2d641d2120fd054e7d4fe206e749d7a0a38454f21f6eb`
