scrapy-patchwright: Patchright integration for Scrapy

scrapy-patchwright is a variant of scrapy-playwright that uses Patchright to make the browser stealthier against websites with anti-bot protection.

Installation

scrapy-patchwright is available on PyPI and can be installed with pip:

pip install scrapy-patchwright

Important: as the Patchright project states, Patchright only patches Chromium-based browsers. Firefox and WebKit are not supported.

Installing the Chromium browser is therefore all that is needed:

patchright install chromium
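
Before running a spider, it can help to confirm that both packages are importable from the active environment. The sketch below is a minimal check using only the standard library. The helper name missing_modules is hypothetical, and the module names scrapy_patchwright and patchright are assumed to match the import paths used in this README.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Module names are assumptions based on the import paths in this README.
    missing = missing_modules(["scrapy_patchwright", "patchright"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All packages found.")
```

This only checks the Python packages. It does not verify that the Chromium binary itself was downloaded.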

Activation

Download handler

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
}

If only specific spiders should use this handler, set it in the spider's custom_settings instead:

import scrapy

class MySpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler. Unless explicitly marked, requests will be processed by the regular Scrapy download handler.
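
The opt-in rule described above can be illustrated with a small sketch. This is not the handler's actual code, only a model of the behavior: a request goes through the browser when its meta carries a truthy "patchright" key (the key used in the Basic Usage example), and otherwise falls back to the regular handler. The function names are hypothetical.

```python
from typing import Any, Mapping

# Meta key taken from the Basic Usage example in this README.
BROWSER_META_KEY = "patchright"


def uses_browser(meta: Mapping[str, Any]) -> bool:
    """Return True when a request is explicitly marked for the browser."""
    return bool(meta.get(BROWSER_META_KEY))


def pick_handler(meta: Mapping[str, Any]) -> str:
    """Model the routing decision: 'browser' if marked, else 'default'."""
    return "browser" if uses_browser(meta) else "default"
```

Under this model, a request with an empty meta, or with "patchright" set to False, is downloaded by Scrapy's regular handler.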

Twisted reactor

Install the asyncio-based Twisted reactor:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is the default in new projects since Scrapy 2.7.
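
Both activation steps are plain settings, so a quick sanity check is possible before a crawl. The sketch below is a hypothetical helper (check_activation is not part of scrapy-patchwright) that takes a settings dictionary and reports which of the two steps above look incomplete.

```python
from typing import Any, Dict, List

# Values copied from the settings shown in this README.
HANDLER_PATH = "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler"
ASYNCIO_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def check_activation(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means both steps are set."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS") or {}
    # The README allows replacing http and/or https, so one match is enough.
    if not any(handlers.get(scheme) == HANDLER_PATH for scheme in ("http", "https")):
        problems.append("DOWNLOAD_HANDLERS does not use the scrapy-patchwright handler")
    if settings.get("TWISTED_REACTOR") != ASYNCIO_REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio-based reactor")
    return problems
```

For example, check_activation({}) reports both problems, while a dictionary containing the two settings shown above returns an empty list.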

Common Settings

These are the settings most spiders will need. The setting names are the same as in scrapy-playwright, whose documentation describes them in more detail.

from browserforge.fingerprints import Screen

custom_settings = {
    'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None, # This is mandatory!
    'PLAYWRIGHT_LAUNCH_OPTIONS': {
        'headless': False,
        'channel': 'chrome',
        'no_viewport': True,
    },
    'PLAYWRIGHT_CONTEXTS': {
        'persistent': {
            'user_data_dir': 'patchright_data',
            'ignore_https_errors': True,
        }
    }
}
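
When several spiders share these settings but differ in a detail or two, a small factory function avoids copy-pasting the dictionary. The sketch below is one possible approach, not something the package provides: the helper name patchright_settings and its two parameters are hypothetical, and the base values are the ones listed above.

```python
import copy
from typing import Any, Dict

# Base values copied from the common settings in this README.
BASE_SETTINGS: Dict[str, Any] = {
    "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": False,
        "channel": "chrome",
        "no_viewport": True,
    },
    "PLAYWRIGHT_CONTEXTS": {
        "persistent": {
            "user_data_dir": "patchright_data",
            "ignore_https_errors": True,
        }
    },
}


def patchright_settings(headless: bool = False,
                        user_data_dir: str = "patchright_data") -> Dict[str, Any]:
    """Return a fresh copy of the base settings with two values overridden."""
    settings = copy.deepcopy(BASE_SETTINGS)  # never mutate the shared base
    settings["PLAYWRIGHT_LAUNCH_OPTIONS"]["headless"] = headless
    settings["PLAYWRIGHT_CONTEXTS"]["persistent"]["user_data_dir"] = user_data_dir
    return settings
```

A spider could then declare custom_settings = patchright_settings(user_data_dir="spider_a_data") so that each spider keeps its own browser profile directory.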

Basic Usage

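Usage is the same as with scrapy-playwright, except that a request is sent through the browser by setting the patchright key in its meta: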

import scrapy

class MySpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"patchright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"patchright": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
