A library to use Puppeteer-managed browser in Scrapy spiders

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Framework
- Scrapy
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language

Project description

Scrapy-puppeteer-client

This package manages a Chrome browser with Puppeteer from Scrapy spiders. This makes it possible to scrape sites that require JS to function properly and to make the scraper behave more like a human visitor. It is a client library for scrapy-puppeteer-service.

⚠️ This repository is under development.

This project is under development. Use it at your own risk.

Installation

Using pip (master branch):

$ pip install scrapy-puppeteer-client

Configuration

You should have scrapy-puppeteer-service running. Then add its URL to settings.py and enable the Puppeteer downloader middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware': 1042
}

PUPPETEER_SERVICE_URL = "http://localhost:3000"  # Not necessary in other execution methods

# To change the execution method, you must add the corresponding setting:
EXECUTION_METHOD = "Puppeteer"

Available methods: Puppeteer, Pyppeteer, Playwright

Pyppeteer and Playwright methods do not require a running service. They use the pyppeteer and playwright libraries for Python to interact with the browser. Actions such as CustomJsAction, RecaptchaSolver, and Har are not available when using these methods.

To use Pyppeteer or Playwright methods you need to install Chromium.
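As a sketch of the alternative described above, a settings.py that switches to the Playwright method might look like the following. It reuses only the setting names shown earlier. Leaving out PUPPETEER_SERVICE_URL follows the note that the URL is not necessary outside the Puppeteer method.

```python
# settings.py (sketch): drive the browser through the Playwright method,
# so no running scrapy-puppeteer-service instance is assumed.
DOWNLOADER_MIDDLEWARES = {
    'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware': 1042
}

# One of the available methods: "Puppeteer", "Pyppeteer", "Playwright"
EXECUTION_METHOD = "Playwright"
```

Remember that CustomJsAction, RecaptchaSolver, and Har are not available with this method.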

Basic usage

Use scrapypuppeteer.PuppeteerRequest instead of scrapy.Request to render URLs with Puppeteer:

import scrapy
from scrapypuppeteer import PuppeteerRequest

class MySpider(scrapy.Spider):
    ...
    def start_requests(self):
        yield PuppeteerRequest('https://example.com', callback=self.parse)
    
    def parse(self, response):
        links = response.css(...)
        ...

Puppeteer responses

All response classes inherit from the parent PuppeteerResponse class.

Here is a list of them all:

PuppeteerHtmlResponse - has html and cookies properties
PuppeteerScreenshotResponse - has screenshot property
PuppeteerHarResponse - has har property
PuppeteerJsonResponse - has data property and to_html() method, which tries to convert the response into a PuppeteerHtmlResponse
PuppeteerRecaptchaSolverResponse(PuppeteerJsonResponse, PuppeteerHtmlResponse) - has recaptcha_data property
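A callback may receive any of these response types, so it can help to branch on what the response carries. The sketch below is a hypothetical helper, not part of the library: it inspects only the attribute names listed above and uses stand-in objects so that it runs without a browser. In a real spider, isinstance checks against the listed classes would be the more direct option.

```python
from types import SimpleNamespace


def describe_response(response):
    """Name the kind of payload a response carries, judged only by its attributes.

    The attribute names come from the list above; the helper is hypothetical.
    """
    # Check recaptcha_data first: that response type also has data and html.
    if hasattr(response, "recaptcha_data"):
        return "recaptcha"
    if hasattr(response, "html"):
        return "html"
    if hasattr(response, "screenshot"):
        return "screenshot"
    if hasattr(response, "har"):
        return "har"
    if hasattr(response, "data"):
        return "json"
    return "unknown"


# Stand-in objects so the sketch runs without the service.
html_like = SimpleNamespace(html="<html></html>", cookies=[])
solver_like = SimpleNamespace(html="<html></html>", data={}, recaptcha_data={})

print(describe_response(html_like))    # html
print(describe_response(solver_like))  # recaptcha
```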

Advanced usage

PuppeteerRequest's first argument is a browser action. Available actions are defined in the scrapypuppeteer.actions module as subclasses of PuppeteerServiceAction. Passing a URL to the request is a shortcut for the GoTo(url) action.

Here is the list of available actions:

GoTo(url, options) - navigate to URL
GoForward(options) - navigate forward in history
GoBack(options) - navigate back in history
Click(selector, click_options, wait_options) - click on element on page
CaptchaSolver(solve_recaptcha, solve_cloudflare_captcha, close_on_empty, options) - solve Recaptcha and Cloudflare captchas on the page
Compose(*actions) - composition of several puppeteer actions
Scroll(selector, wait_options) - scroll page
Screenshot(options) - take screenshot
Har() - get the HAR file; this requires passing the har_recording=True argument to PuppeteerRequest at the start of execution
FillForm(input_mapping, submit_button) - fill out and submit forms on the page
RecaptchaSolver(solve_recaptcha, close_on_empty, options) - find or solve recaptcha on page
CustomJsAction(js_function) - evaluate JS function on page

Available options essentially mirror service method parameters, which in turn mirror puppeteer API functions to some extent. See scrapypuppeteer.actions module for details.

You may pass the close_page=False option to a request to retain the browser tab and its state after the request completes. Then use response.follow to continue interacting with the same tab:

import scrapy
from scrapypuppeteer import PuppeteerRequest, PuppeteerHtmlResponse
from scrapypuppeteer.actions import Click

class MySpider(scrapy.Spider):
    ...
    def start_requests(self):
        yield PuppeteerRequest(
            'https://example.com',  # will be transformed into GoTo action
            close_page=False,
            callback=self.parse,
        )

    def parse(self, response: PuppeteerHtmlResponse):
        ...
        # parse and yield some items
        ...
        next_page_selector = 'button.next-page-or-smth'
        if response.css(next_page_selector):
            yield response.follow(
                Click(
                    next_page_selector,
                    wait_options={'selectorOrTimeout': 3000},  # wait 3 seconds
                ),
                close_page=False,
                callback=self.parse,
            )

You may also use the follow_all method to continue interacting.

On your first request, the service creates a new incognito browser context and a new page in it. Their ids are returned in the response object as the context_id and page_id attributes. Following such a response means passing the context and page ids to the next request. You may also specify a request's context and page ids directly.

Just before your spider finishes crawling, the service middleware closes all used browser contexts with scrapypuppeteer.CloseContextRequest, which accepts a list of all browser contexts to be closed.

You can control which PuppeteerRequest headers the service sends to the remote website, either per request via the include_headers attribute or globally with the PUPPETEER_INCLUDE_HEADERS setting. Available values are True (all headers), False (no headers), or a list of header names. By default, only cookies are sent.
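The three kinds of include_headers values can be illustrated with a small stand-alone function. This is a sketch of the semantics described above, not the library's implementation: select_headers is a hypothetical name, and matching header names case-insensitively is an assumption. Modelling the default as the list ['Cookie'] is also an assumption made only for this example.

```python
def select_headers(headers, include_headers):
    """Return the subset of headers chosen by include_headers.

    include_headers may be True (all headers), False (no headers) or a list
    of header names. Case-insensitive matching is an assumption of this sketch.
    """
    if include_headers is True:
        return dict(headers)
    if include_headers is False:
        return {}
    wanted = {name.lower() for name in include_headers}
    return {k: v for k, v in headers.items() if k.lower() in wanted}


headers = {
    "Cookie": "session=abc",
    "User-Agent": "my-bot",
    "Referer": "https://example.com",
}

print(select_headers(headers, True))        # all three headers
print(select_headers(headers, False))       # {}
print(select_headers(headers, ["Cookie"]))  # {'Cookie': 'session=abc'}
```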

You may also want to send meta with your request. By default this is disabled in order to preserve backward compatibility. You can change this behaviour by setting PUPPETEER_INCLUDE_META to True.

Automatic recaptcha solving

Enable PuppeteerRecaptchaDownloaderMiddleware to solve recaptchas automatically during scraping. We do not recommend using the RecaptchaSolver action while the middleware is active.

DOWNLOADER_MIDDLEWARES = {
    'scrapypuppeteer.middleware.PuppeteerRecaptchaDownloaderMiddleware': 1041,
    'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware': 1042
}

Note that the order number of the recaptcha middleware has to be lower than that of the service middleware. You must provide some settings to use the middleware:

PUPPETEER_INCLUDE_META = True  # Essential to send meta

RECAPTCHA_ACTIVATION = True  # Enables the middleware
RECAPTCHA_SOLVING = False  # Automatic recaptcha solving
RECAPTCHA_SUBMIT_SELECTORS = {  # Selectors for "submit recaptcha" button
    'www.google.com/recaptcha/api2/demo': '',  # No selectors needed
}

If you set RECAPTCHA_SOLVING to False, the middleware will only try to find captchas and will report how many it found on the page.

If you don't want the middleware to process a specific request, provide the special meta key 'dont_recaptcha': True. In this case the recaptcha middleware will skip the request.
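The ordering rule stated earlier in this section (the recaptcha middleware must have a lower number than the service middleware) is easy to get wrong when editing settings by hand. A small check such as the one below can guard against that. recaptcha_order_ok is a hypothetical helper, and treating a missing service middleware as an error is an assumption of this sketch.

```python
RECAPTCHA_MW = 'scrapypuppeteer.middleware.PuppeteerRecaptchaDownloaderMiddleware'
SERVICE_MW = 'scrapypuppeteer.middleware.PuppeteerServiceDownloaderMiddleware'


def recaptcha_order_ok(middlewares):
    """True if the recaptcha middleware is absent or numbered below the service one."""
    if RECAPTCHA_MW not in middlewares:
        return True  # nothing to order
    if SERVICE_MW not in middlewares:
        return False  # assumption: recaptcha middleware needs the service middleware
    return middlewares[RECAPTCHA_MW] < middlewares[SERVICE_MW]


DOWNLOADER_MIDDLEWARES = {RECAPTCHA_MW: 1041, SERVICE_MW: 1042}
print(recaptcha_order_ok(DOWNLOADER_MIDDLEWARES))  # True
```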

TODO

skeleton that could handle goto, click, scroll, and actions
headers and cookies management
proxy support for puppeteer
error handling for requests
har support

Release history

This version

0.4.0

Aug 1, 2025

0.3.9

Apr 10, 2025

0.3.8

Nov 1, 2024

0.3.5

Aug 22, 2024

0.3.4

Aug 9, 2024

0.3.2

Jul 22, 2024

0.3.1

Jul 5, 2024

0.3.0

Jul 2, 2024

0.2.0

Jun 27, 2024

0.1.5

Jan 19, 2024

0.1.4

Sep 11, 2023

0.1.3

Aug 31, 2023

0.1.2

Aug 24, 2023

0.1.1

Jul 26, 2023

0.1.0

Jul 26, 2023

0.0.8

May 24, 2023

0.0.7

May 2, 2023

0.0.6

Aug 2, 2022

0.0.5

Nov 23, 2020

0.0.3

Aug 27, 2020

Download files


Source Distribution

scrapy_puppeteer_client-0.4.0.tar.gz (26.7 kB)

Uploaded Aug 1, 2025 Source

Built Distribution


scrapy_puppeteer_client-0.4.0-py3-none-any.whl (33.1 kB)

Uploaded Aug 1, 2025 Python 3

File details

Details for the file scrapy_puppeteer_client-0.4.0.tar.gz.

File metadata

Download URL: scrapy_puppeteer_client-0.4.0.tar.gz
Upload date: Aug 1, 2025
Size: 26.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for scrapy_puppeteer_client-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`b895200666b4a237bba07da39f74170fc1b3f2f66088277e51dc520ade0ff196`
MD5	`5414f14ae5be6819dbcab3491539088a`
BLAKE2b-256	`4052486390938b4f7015483a640b8638538fad81557fbb8149848f79d622599e`


File details

Details for the file scrapy_puppeteer_client-0.4.0-py3-none-any.whl.

File metadata

Download URL: scrapy_puppeteer_client-0.4.0-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 33.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for scrapy_puppeteer_client-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16dc34a3118e28cce7640743adb633890d85d9f27587fe404090e858397c8b85`
MD5	`18e8e9c87d79e8a2339f5d51a7f015ea`
BLAKE2b-256	`03aafcf7c8f36143105986671516e0819fda938824ef671822662d3cd94a18e9`

