Yet another camoufox integration for scrapy.

scrapy-playfox: Yet another camoufox integration for Scrapy

A Scrapy download handler that extends scrapy-playwright. scrapy-playfox acts as the glue between scrapy-playwright and Camoufox, without modifying either of them.

Why Camoufox?

Camoufox is a modern, open-source browser built for avoiding bot detection, with intelligent fingerprint rotation. It is a good choice for scraping websites with strong anti-bot protection.

Installation

scrapy-playfox is available on PyPI and can be installed with pip:

pip install scrapy-playfox

Camoufox is declared as a dependency, so it is installed automatically. However, you may still need to download the browser that will be used:

camoufox fetch
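To confirm that the packages landed in the active environment before running a crawl, a quick check with the standard library can help. This is only a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution names are assumed to match the PyPI names. It does not verify that `camoufox fetch` has downloaded the browser.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumed to be the same as on PyPI.
    for dist in ("scrapy-playfox", "scrapy-playwright", "camoufox"):
        print(f"{dist}: {installed_version(dist) or 'not installed'}")
```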

Activation

Download handler

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler",
}

If only specific spiders should use this handler, set it in the spider's custom_settings instead:

import scrapy

class MySpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler',
        },
    }

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler. Unless a request is explicitly marked (for example with meta={"playwright": True}, as in Basic Usage), it is processed by the regular Scrapy download handler.

Twisted reactor

Install the asyncio-based Twisted reactor:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is the default in new projects since Scrapy 2.7.
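Both activation steps can be checked together before starting a crawl. The helper below is a hypothetical sanity check, not part of scrapy-playfox; it works on a plain dict such as a spider's custom_settings and only compares the two values shown in this section.

```python
HANDLER = "scrapy_playfox.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def activation_problems(settings: dict) -> list:
    """Return a list of human-readable problems found in a settings dict."""
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS") or {}
    # The handler may be set for http, https, or both, so one match is enough.
    if HANDLER not in (handlers.get("http"), handlers.get("https")):
        problems.append("no scheme in DOWNLOAD_HANDLERS uses the playfox handler")
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems


if __name__ == "__main__":
    example = {
        "DOWNLOAD_HANDLERS": {"https": HANDLER},
        "TWISTED_REACTOR": REACTOR,
    }
    print(activation_problems(example))
```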

Common Settings

These are commonly used settings for your spider. See the Camoufox and scrapy-playwright documentation for more options.

from browserforge.fingerprints import Screen

custom_settings = {
    'PLAYWRIGHT_LAUNCH_OPTIONS': {
        'headless': False,
        'humanize': True,
        'screen': Screen(max_width=1280, max_height=800),
        'geoip': False,
    },
    'PLAYWRIGHT_CONTEXTS': {
        'persistent': {
            'user_data_dir': 'playfox_data',
        }
    }
}

Basic Usage

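Usage is the same as with scrapy-playwright: set meta={"playwright": True} on each request that should be rendered in the browser.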

import scrapy

class MySpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
