Patchright integration for scrapy.
Project description
scrapy-patchwright: Patchright integration for Scrapy
scrapy-patchwright is a variant of scrapy-playwright that uses Patchright to make the browser stealthier against anti-bot websites.
Installation
scrapy-patchwright is available on PyPI and can be installed with pip:
pip install scrapy-patchwright
Important: as the Patchright project states, Patchright only patches Chromium-based browsers. Firefox and WebKit are not supported.
Installing the Chromium browser is therefore all that is needed:
patchright install chromium
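A quick way to confirm that both installation steps took effect is to check for the package and the command-line tool from Python. The following is a minimal sketch, not part of scrapy-patchwright: the function name `check_install` is hypothetical, the module name `scrapy_patchwright` is taken from the handler path used in the settings, and the check only looks for the `patchright` command on `PATH`. It does not verify that the Chromium binary itself was downloaded.

```python
import importlib.util
import shutil


def check_install() -> dict[str, bool]:
    """Report which parts of the setup are visible from this environment.

    Hypothetical helper for illustration; it is not provided by the package.
    """
    return {
        # Module name assumed from "scrapy_patchwright.handler...." in settings.py
        "scrapy_patchwright": importlib.util.find_spec("scrapy_patchwright") is not None,
        # The `patchright` command used to install Chromium
        "patchright_cli": shutil.which("patchright") is not None,
    }


if __name__ == "__main__":
    for name, found in check_install().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, rerunning the corresponding install command in the same virtual environment is usually the first thing to try.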
Activation
Download handler
Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
}
If only specific spiders need this handler, set it in each spider's custom_settings instead:
import scrapy

class MySpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler",
        },
    }
Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler. Unless explicitly marked, requests will be processed by the regular Scrapy download handler.
Twisted reactor
Install the asyncio-based Twisted reactor:
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
This is the default in new projects since Scrapy 2.7.
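Both activation steps are plain settings, so they can be sanity-checked before a crawl starts. The sketch below assumes the settings are available as an ordinary dict. The function name `activation_problems` is hypothetical and not part of scrapy-patchwright. The two constant values are copied from the settings shown above.

```python
HANDLER = "scrapy_patchwright.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def activation_problems(settings: dict) -> list[str]:
    """Return human-readable problems; an empty list means both steps are set.

    Hypothetical helper for illustration only.
    """
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS") or {}
    # The handler may replace http, https, or both, so one match is enough.
    if not any(handlers.get(scheme) == HANDLER for scheme in ("http", "https")):
        problems.append("DOWNLOAD_HANDLERS does not route http or https to the handler")
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio-based reactor")
    return problems


if __name__ == "__main__":
    example = {"DOWNLOAD_HANDLERS": {"https": HANDLER}, "TWISTED_REACTOR": REACTOR}
    print(activation_problems(example) or "activation settings look complete")
```

Inside a spider, the same check could be run against `self.settings` after converting the relevant keys to a dict, though that wiring is left out here.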
Common Settings
These are the settings most commonly used in a spider. See the scrapy-playwright documentation for the full list of options.
from browserforge.fingerprints import Screen
custom_settings = {
    'PLAYWRIGHT_PROCESS_REQUEST_HEADERS': None,  # This is mandatory!
    'PLAYWRIGHT_LAUNCH_OPTIONS': {
        'headless': False,
        'channel': 'chrome',
        'no_viewport': True,
    },
    'PLAYWRIGHT_CONTEXTS': {
        'persistent': {
            'user_data_dir': 'patchright_data',
            'ignore_https_errors': True,
        }
    }
}
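When several spiders share these settings but differ in a detail such as the profile directory, a small factory function avoids copying the dict around. This is a minimal sketch: `COMMON_SETTINGS` repeats the values shown above, while `patchright_settings` and its parameters are hypothetical names introduced for illustration and are not provided by scrapy-patchwright.

```python
import copy

# Same values as the common settings above.
COMMON_SETTINGS = {
    "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,  # marked as mandatory above
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": False,
        "channel": "chrome",
        "no_viewport": True,
    },
    "PLAYWRIGHT_CONTEXTS": {
        "persistent": {
            "user_data_dir": "patchright_data",
            "ignore_https_errors": True,
        }
    },
}


def patchright_settings(headless: bool = False,
                        user_data_dir: str = "patchright_data") -> dict:
    """Return an independent copy of the common settings with two overrides.

    Hypothetical helper; a deep copy keeps spiders from sharing nested dicts.
    """
    settings = copy.deepcopy(COMMON_SETTINGS)
    settings["PLAYWRIGHT_LAUNCH_OPTIONS"]["headless"] = headless
    settings["PLAYWRIGHT_CONTEXTS"]["persistent"]["user_data_dir"] = user_data_dir
    return settings
```

A spider could then declare `custom_settings = patchright_settings(user_data_dir="profile_a")`, where `profile_a` is only an example directory name.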
Basic Usage
Usage is the same as scrapy-playwright. Requests that should go through the browser are marked with the "patchright" meta key:
import scrapy

class MySpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"patchright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"patchright": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
File details
Details for the file scrapy_patchwright-0.0.46-py3-none-any.whl.
File metadata
- Download URL: scrapy_patchwright-0.0.46-py3-none-any.whl
- Upload date:
- Size: 20.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.22
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e7d305abf1d9ed40c55a7486da04df7495d4ec8dfb564a1419c4fc58f79d2972 |
| MD5 | edfe60d20d31fa0e48bf6aa5031775bf |
| BLAKE2b-256 | e35fbe9bb0df749bb8f3ba9193e99dbaeb3c128131b0793ade04e92c4d55df7e |