scrapy-playfox: Yet another camoufox integration for Scrapy
A Scrapy download handler that extends scrapy-playwright. scrapy-playfox acts as the glue between scrapy-playwright and Camoufox, without modifying either of them.
Why Camoufox?
Camoufox is a modern, effective and future-proof open-source solution for avoiding bot detection, with intelligent fingerprint rotation. It is a good choice for scraping websites with strong anti-bot protection.
Installation
scrapy-playfox is available on PyPI and can be installed with pip:
pip install scrapy-playfox
Camoufox is declared as a dependency, so it is installed automatically. You may still need to download the browser it uses:
camoufox fetch
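To confirm that the Python packages are importable after installation, a quick check along these lines may help. This is a minimal sketch: `missing_modules` is a hypothetical helper, the module name `scrapy_playfox` is taken from the handler path used in the settings, and `camoufox` is assumed to be the import name of the Camoufox package. It does not verify that the browser itself has been fetched.

```python
import importlib.util


def missing_modules(names=("scrapy_playfox", "camoufox")) -> list:
    """Return the top-level module names that cannot be found.

    Hypothetical helper for illustration; the default names are assumptions.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("missing:", missing or "none")
```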
Activation
Download handler
Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler",
}
If only specific spiders need this handler, set it in the spider's custom_settings instead:
import scrapy


class MySpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler',
        },
    }
Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler. Requests are processed by the regular Scrapy download handler unless they are explicitly marked with meta={"playwright": True}.
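The opt-in rule described above can be illustrated with plain dictionaries. The following is a minimal sketch, not the handler's actual code: `route_request` is a hypothetical helper, and it assumes that the `playwright` meta key used in the Basic Usage example is what marks a request for the browser.

```python
def route_request(meta: dict) -> str:
    """Return which download path a request with this meta would take.

    Hypothetical helper for illustration only: requests whose meta has a
    truthy "playwright" key go to the browser, all others take the
    regular Scrapy download path.
    """
    return "browser" if meta.get("playwright") else "default"


print(route_request({"playwright": True}))  # marked: handled by the browser
print(route_request({}))                    # unmarked: regular Scrapy handler
```

In practice this means the handler can be enabled project-wide while only the requests that need rendering pay the cost of a browser page.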
Twisted reactor
Install the asyncio-based Twisted reactor:
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
This is the default in new projects since Scrapy 2.7.
Common Settings
These are commonly used settings for a spider. See the Camoufox and scrapy-playwright documentation for more options.
from browserforge.fingerprints import Screen

custom_settings = {
    'PLAYWRIGHT_LAUNCH_OPTIONS': {
        'headless': False,
        'humanize': True,
        'screen': Screen(max_width=1280, max_height=800),
        'geoip': False,
    },
    'PLAYWRIGHT_CONTEXTS': {
        'persistent': {
            'user_data_dir': 'playfox_data',
        }
    }
}
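When several spiders share most of these settings, one option is to keep a baseline and override individual launch options per spider. The sketch below is an assumption about how that could be organized, not part of scrapy-playfox: `BASE_SETTINGS` and `playfox_settings` are hypothetical names, and the `screen` option is left out so the example needs no third-party imports.

```python
from copy import deepcopy

# Baseline mirroring the settings above, minus the Screen constraint.
BASE_SETTINGS = {
    "PLAYWRIGHT_LAUNCH_OPTIONS": {
        "headless": False,
        "humanize": True,
        "geoip": False,
    },
    "PLAYWRIGHT_CONTEXTS": {
        "persistent": {"user_data_dir": "playfox_data"},
    },
}


def playfox_settings(**launch_overrides) -> dict:
    """Return a deep copy of BASE_SETTINGS with launch options overridden.

    Hypothetical helper: keyword arguments replace or add entries in
    PLAYWRIGHT_LAUNCH_OPTIONS; BASE_SETTINGS itself is never modified.
    """
    settings = deepcopy(BASE_SETTINGS)
    settings["PLAYWRIGHT_LAUNCH_OPTIONS"].update(launch_overrides)
    return settings


# Example: a headless variant, e.g. for a machine without a display.
headless_settings = playfox_settings(headless=True)
```

A spider could then declare `custom_settings = playfox_settings(headless=True)`. The deep copy keeps one spider's overrides from leaking into another's.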
Basic Usage
Usage is the same as with scrapy-playwright.
import scrapy


class MySpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
File details
Details for the file scrapy_playfox-0.0.1.tar.gz.
File metadata
- Download URL: scrapy_playfox-0.0.1.tar.gz
- Upload date:
- Size: 7.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d66d10739c26913acb3f1a01212262ae155474f63008554f6613dc41a43a480 |
| MD5 | 4b5ae677f40de4a73208e5f5342de86a |
| BLAKE2b-256 | 3223a24f24d17763d4c595531be62fae1430a2bfab7c94b919c1334d6323c2e1 |
File details
Details for the file scrapy_playfox-0.0.1-py3-none-any.whl.
File metadata
- Download URL: scrapy_playfox-0.0.1-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 131653867097c643f52be0a27ffb7a50227cc0286f00ea82f25e2a14ef0cc5a8 |
| MD5 | e5df624507e4d051ae913130fc859f2f |
| BLAKE2b-256 | 781c7f69fb99bdf104032f4d8af79c0d3d591edb90cd30683be62c4421e4b6ad |