scrapy-pydoll

A Scrapy download handler that performs requests using Pydoll. It can be used to handle pages that require JavaScript (among other things) while adhering to the regular Scrapy workflow, i.e. without interfering with request scheduling, item processing, and so on.

Requirements

  • Python >= 3.12
  • Scrapy >= 2.0 (!= 2.4.0)
  • Pydoll-python >= 1.3.3
  • Google Chrome

Installation

pip install scrapy-pydoll

Basic Configuration

Add the following to your Scrapy project's settings:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_pydoll.handler.PydollDownloadHandler",
    "https": "scrapy_pydoll.handler.PydollDownloadHandler"
}

# Required for async support
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Settings Reference

Setting                    Type  Default  Description
PYDOLL_HEADLESS            bool  True     Run Chrome in headless mode
PYDOLL_PROXY               str   None     Proxy server URL (e.g. "http://proxy:8080")
PYDOLL_MAX_PAGES           int   4        Maximum number of concurrent browser pages
PYDOLL_NAVIGATION_TIMEOUT  int   30       Page navigation timeout in seconds
PYDOLL_ABORT_REQUEST       str   None     Resource type to block (e.g. "image", "stylesheet", "script")
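
As a quick reference, the settings listed above could be spelled out in settings.py as in the sketch below. The values are simply the documented defaults, so setting them explicitly is optional; how the handler validates or combines them is not covered here.

```python
# settings.py (sketch): every value below is the documented default.
PYDOLL_HEADLESS = True            # run Chrome without a visible window
PYDOLL_PROXY = None               # or a proxy URL string such as "http://proxy:8080"
PYDOLL_MAX_PAGES = 4              # upper bound on concurrent browser pages
PYDOLL_NAVIGATION_TIMEOUT = 30    # seconds allowed for page navigation
PYDOLL_ABORT_REQUEST = None       # or a resource type such as "image"
```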

Usage Examples

Basic Usage

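import scrapy
from scrapy_pydoll.page import PageMethod  # used in the page-method examples
from pydoll.constants import By  # locator strategies such as By.XPATH

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            meta={"pydoll": True},  # Enable Pydoll for this request
        )

    def parse(self, response):
        # The response is the page as returned by the Pydoll handler
        yield {"url": response.url, "title": response.css("title::text").get()}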

Wait for Elements

yield scrapy.Request(
    url,
    meta={
        "pydoll": True,
        "pydoll_page_methods": [
            PageMethod("wait_element", By.XPATH, "//div[@class='content']"),
        ]
    }
)

Take Screenshots

yield scrapy.Request(
    url,
    meta={
        "pydoll": True,
        "pydoll_page_methods": [
            PageMethod("get_screenshot", "output.png"),
        ]
    }
)
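
The request examples above repeat the same meta keys, "pydoll" and "pydoll_page_methods". A small helper can keep that in one place. The function name pydoll_meta is hypothetical and not part of scrapy-pydoll; it only builds the dictionary shape shown in the examples.

```python
from typing import Any


def pydoll_meta(*page_methods: Any, **extra: Any) -> dict[str, Any]:
    """Build a Request.meta dict that enables Pydoll (hypothetical helper)."""
    meta: dict[str, Any] = {"pydoll": True}
    if page_methods:
        # Preserve the order in which the caller listed the page methods.
        meta["pydoll_page_methods"] = list(page_methods)
    # Any other meta keys (e.g. Scrapy's own) are passed through unchanged.
    meta.update(extra)
    return meta


# Intended use inside a spider, assuming PageMethod and By are imported:
#   scrapy.Request(url, meta=pydoll_meta(
#       PageMethod("wait_element", By.XPATH, "//div[@class='content']"),
#   ))
```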

Block Resource Types

To block specific resource types (like images or scripts), use the PYDOLL_ABORT_REQUEST setting:

# In your settings.py
PYDOLL_ABORT_REQUEST = "image"  # Blocks all image requests

Complete Example

Here's a complete spider example that scrapes a JavaScript-rendered website:

import scrapy
from scrapy_pydoll.page import PageMethod
from pydoll.constants import By

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        yield scrapy.Request(
            "http://example.com/js/",
            meta={
                "pydoll": True,
                "pydoll_page_methods": [
                    PageMethod("wait_element", By.XPATH, "//div[@class='quote']"),
                    PageMethod("get_screenshot", "quotes.png"),
                ]
            }
        )

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath(".//span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
            }

Supported Pydoll methods

Refer to the upstream Pydoll documentation for the Page class to see the available methods.
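
The examples on this page pass Page method names as strings, such as "wait_element" and "get_screenshot". One plausible reading, and only an assumption about the handler's internals, is that each name is looked up on the page object and awaited with the given arguments. The sketch below illustrates that idea with stand-in classes (PageMethodSketch and FakePage are hypothetical) so it runs without a browser; it is not the scrapy-pydoll implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageMethodSketch:
    """Stand-in for PageMethod: a method name plus the arguments to pass to it."""
    method: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


class FakePage:
    """Hypothetical page object that records calls instead of driving Chrome."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    async def wait_element(self, by: str, value: str) -> str:
        self.calls.append(("wait_element", by, value))
        return "element"

    async def get_screenshot(self, path: str) -> None:
        self.calls.append(("get_screenshot", path))


async def apply_page_methods(page: Any, methods: list[PageMethodSketch]) -> list[Any]:
    """Look up each method by name on the page and await it, in list order."""
    results = []
    for pm in methods:
        bound = getattr(page, pm.method)  # raises AttributeError for unknown names
        results.append(await bound(*pm.args, **pm.kwargs))
    return results


page = FakePage()
results = asyncio.run(
    apply_page_methods(
        page,
        [
            PageMethodSketch("wait_element", ("xpath", "//div[@class='quote']")),
            PageMethodSketch("get_screenshot", ("quotes.png",)),
        ],
    )
)
```

Under this reading, a misspelled method name would fail at lookup time, which is one reason to check names against the Page class documentation.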
