scrapy-seleniumbase-cdp

Scrapy downloader middleware that uses SeleniumBase's pure CDP mode to make requests, which allows it to bypass most anti-bot protections (e.g. Cloudflare).

Using SeleniumBase's pure CDP mode also makes the middleware more platform-independent, as no WebDriver is required.

Installation
Configuration
Usage
- Additional arguments
Error handling
Tips for headless Linux environments
- Recording an Xvfb session with ffmpeg
- Connecting via VNC to an Xvfb session
Enabling debug logs
Architecture
License

Installation

pip install scrapy-seleniumbase-cdp

Configuration

Add the SeleniumBaseAsyncCDPMiddleware to the downloader middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware': 800
}

If needed, configuration can be provided to the SeleniumBase browser instance. For example, to enable the built-in ad blocker (blocks 30+ ad and tracking domains via CDP):
```
SELENIUMBASE_BROWSER_OPTIONS = {
    'ad_block': True,
}
```

Usage

To have SeleniumBase handle a request, use scrapy_seleniumbase_cdp.SeleniumBaseRequest instead of Scrapy's built-in Request:

from scrapy_seleniumbase_cdp import SeleniumBaseRequest

async def start(self):
    yield SeleniumBaseRequest(url=url, callback=self.parse_result)

Additional arguments

scrapy_seleniumbase_cdp.SeleniumBaseRequest accepts additional arguments. The middleware processes them in the order presented below:

`page_load_timeout`

Maximum number of seconds to wait for both the HTTP response and the page load event before proceeding. If the timeout is reached, a warning is logged but the request continues. Defaults to 10.
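The "wait for both, but give up after a timeout and carry on" behaviour can be illustrated with plain asyncio. This is only a sketch of the pattern, not the middleware's code: `wait_for_all` and `demo` are hypothetical names, and the two `asyncio.sleep` calls merely stand in for the HTTP response and the page load event.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def wait_for_all(waiters, timeout=10):
    """Wait for every awaitable in `waiters`, giving up after `timeout` seconds.

    Returns True if everything finished in time, False otherwise. On a
    timeout a warning is logged and the caller simply continues.
    """
    tasks = [asyncio.ensure_future(w) for w in waiters]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    if pending:
        # Stop waiting on whatever is still outstanding.
        for task in pending:
            task.cancel()
        logger.warning('Page load timed out after %s seconds, continuing', timeout)
        return False
    return True


async def demo():
    fast = asyncio.sleep(0.01)  # stands in for the HTTP response
    slow = asyncio.sleep(1)     # stands in for a late page load event
    return await wait_for_all([fast, slow], timeout=0.1)


if __name__ == '__main__':
    print(asyncio.run(demo()))  # prints False
```

Because the timeout only produces a warning, a callback may receive a page that has not finished loading; `wait_for_element` is the stricter option when a specific element must be present.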

Captcha handling

After navigating to a page, the middleware waits for both the HTTP response status and the page load event. It then attempts to solve any captcha present on the page using SeleniumBase's built-in solver, retrying up to a configurable maximum number of attempts.

The delay before the first solve attempt and between retries depends on the HTTP status code:

- 2xx responses: wait `captcha_delay` seconds (default 0)
- Blocked responses (status in `captcha_blocked_codes`): wait `captcha_blocked_delay` seconds (default 4)

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    captcha_delay=1,
    captcha_blocked_delay=5,
    captcha_blocked_codes=[403, 429, 503],
    captcha_max_attempts=5)

Available captcha configuration:

- `captcha_delay`: Seconds to wait before solving on a successful response. Defaults to 0.
- `captcha_blocked_delay`: Seconds to wait before solving on a blocked response. Defaults to 4.
- `captcha_blocked_codes`: List of HTTP status codes treated as blocked. Defaults to [403, 429, 503].
- `captcha_max_attempts`: Maximum number of solve attempts. Defaults to 3. After exhausting all attempts, the middleware continues normally but logs a warning.
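The interaction between these four options can be sketched in a few lines. This is a rough model under stated assumptions, not the middleware's implementation: `pick_delay`, `solve_with_retries` and `try_solve` are hypothetical names, `try_solve` stands in for SeleniumBase's solver, and any status that is not in `captcha_blocked_codes` is assumed to use `captcha_delay`.

```python
import time


def pick_delay(status, captcha_delay=0, captcha_blocked_delay=4,
               captcha_blocked_codes=(403, 429, 503)):
    """Return the number of seconds to wait before a solve attempt."""
    if status in captcha_blocked_codes:
        return captcha_blocked_delay
    # Assumption: every non-blocked status is treated like a 2xx response.
    return captcha_delay


def solve_with_retries(status, try_solve, captcha_max_attempts=3,
                       sleep=time.sleep, **delay_options):
    """Wait, then call `try_solve`, until it succeeds or attempts run out.

    Returns a (solved, attempts_made) tuple.
    """
    for attempt in range(1, captcha_max_attempts + 1):
        # The same delay applies before the first attempt and between retries.
        sleep(pick_delay(status, **delay_options))
        if try_solve():
            return True, attempt
    return False, captcha_max_attempts
```

With the defaults, a 503 response that is never solved costs three attempts and about 12 seconds of waiting, which is worth keeping in mind when tuning `captcha_blocked_delay` and `captcha_max_attempts` together.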

`wait_for_element` / `element_timeout`

When used, SeleniumBase waits for the element matching the given CSS selector to appear. The default timeout is 10 seconds and can be changed with element_timeout. If the element is not found within the timeout, a full-page error screenshot is captured and stored in request.meta['error_screenshot'], then the request is skipped (Scrapy's IgnoreRequest is raised). The screenshot image format is taken from the request's screenshot configuration if set; otherwise it defaults to PNG.

The error screenshot is accessible in the request's errback via failure.request.meta['error_screenshot']:

yield SeleniumBaseRequest(
    url=url,
    callback=self.parse_result,
    errback=self.handle_error,
    wait_for_element='h1.some-class',
    element_timeout=5)


def handle_error(self, failure):
    screenshot = failure.request.meta.get('error_screenshot')
    if screenshot:
        with open('error.png', 'wb') as f:
            f.write(screenshot)
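Since the error screenshot may be PNG or JPEG depending on the request's screenshot configuration, an errback that always writes `error.png` can end up with a mislabeled file. One possible workaround, sketched here with hypothetical helper names, is to pick the extension from the image's leading bytes (the standard PNG and JPEG file signatures):

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
JPEG_SIGNATURE = b'\xff\xd8\xff'


def screenshot_extension(data):
    """Guess a file extension from the first bytes of the image data."""
    if data.startswith(PNG_SIGNATURE):
        return 'png'
    if data.startswith(JPEG_SIGNATURE):
        return 'jpg'
    return 'bin'  # unknown format: keep the bytes, do not guess


def error_screenshot_filename(data, stem='error'):
    """Build a file name such as 'error.png' or 'error.jpg'."""
    return f'{stem}.{screenshot_extension(data)}'
```

In `handle_error`, the file could then be opened with `open(error_screenshot_filename(screenshot), 'wb')` instead of a hard-coded name.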

`browser_callback`

You can provide an async callback to interact with the browser instance and/or its tabs. The callback's return value is stored in response.meta['callback'].

async def start(self):
    async def maximize_window(browser: Browser):
        await browser.main_tab.maximize()

    yield SeleniumBaseRequest(…, browser_callback=maximize_window)

`script`

When used, SeleniumBase will execute the provided JavaScript code.

yield SeleniumBaseRequest(
    # …
    script='window.scrollTo(0, document.body.scrollHeight)')

If the script returns a Promise, it is possible to await its result:

yield SeleniumBaseRequest(
    # …
    script={
        'await_promise': True,
        'script': '''
            document.getElementById('onetrust-accept-btn-handler').click()
            new Promise(resolve => setTimeout(resolve, 1000))
        '''
    })

The result of the JavaScript code is stored in response.meta['script'].

`screenshot`

When used, SeleniumBase will take a screenshot of the page and the binary data will be stored in response.meta['screenshot']:

yield SeleniumBaseRequest(url=url, callback=self.parse_result, screenshot=True)


def parse_result(self, response):
    # …
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])

You can also specify additional configuration options:

yield SeleniumBaseRequest(…, screenshot={'format': 'jpg', 'full_page': False})

Or provide a path to automatically save the screenshot (in this case, the image data is not stored in the response):

yield SeleniumBaseRequest(…, screenshot={'path': 'output/image.png'})

Available configuration keys:

- `path`: File path where the screenshot will be saved. Use `auto` for the SeleniumBase default path. Leave empty to return the data in the response meta.
- `format`: Image format. Defaults to `png`; `jpg` is also available.
- `full_page`: Capture the full page or just the viewport. Defaults to True.
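To make the defaults concrete, here is a small sketch of how the `screenshot` argument (either `True` or a dict) could be expanded into a full configuration. How the middleware does this internally is not shown here, so `normalize_screenshot` is a hypothetical helper, and rejecting unknown keys is a choice made for this sketch only.

```python
SCREENSHOT_DEFAULTS = {'path': None, 'format': 'png', 'full_page': True}


def normalize_screenshot(option):
    """Expand the `screenshot` argument into a dict with all three keys.

    Returns None when no screenshot was requested.
    """
    if option is None or option is False:
        return None
    if option is True:
        return dict(SCREENSHOT_DEFAULTS)
    unknown = set(option) - set(SCREENSHOT_DEFAULTS)
    if unknown:
        # Assumption for this sketch: fail loudly on misspelled keys.
        raise ValueError(f'Unknown screenshot options: {sorted(unknown)}')
    return {**SCREENSHOT_DEFAULTS, **option}
```

For example, `normalize_screenshot({'format': 'jpg'})` keeps `full_page=True` and `path=None`, so the JPEG bytes would be expected in response.meta['screenshot'] rather than on disk.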

Error handling

The middleware checks the HTTP status code right after loading the page to determine captcha-solving behaviour (see Captcha handling above).

wait_for_element timeout: if the expected element is not found within element_timeout seconds, a full-page error screenshot is captured and stored in request.meta['error_screenshot'], then IgnoreRequest is raised, causing Scrapy to skip the request. The screenshot is accessible in the request's errback via failure.request.meta['error_screenshot'] (see wait_for_element for an example).

Tips for headless Linux environments

When running Scrapy with this middleware in headless mode using Xvfb on Linux, you may want to record or visually inspect browser sessions for debugging. The examples below assume an Xvfb display at :1001; adjust this to match your setup.

Recording an Xvfb session with ffmpeg

Use ffmpeg to capture the virtual display as a video file:

ffmpeg -f x11grab -r 30 -s 1440x900 -i :1001 \
    -codec:v libx264 -preset ultrafast -pix_fmt yuv420p \
    /home/user/session_$(date +%Y%m%d_%H%M%S).mp4

Key flags:

- `-f x11grab`: capture from an X11 display
- `-r 30`: frame rate (30 fps)
- `-s 1440x900`: resolution (must match your Xvfb geometry)
- `-i :1001`: X display to capture
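If the recording should start and stop together with a crawl, the same command can be assembled in Python and handed to `subprocess`. The helper below is a sketch: `ffmpeg_record_command` and its parameters are hypothetical names, and it only builds the argument list shown above without running anything.

```python
import shlex
from datetime import datetime


def ffmpeg_record_command(display=':1001', size='1440x900', fps=30,
                          out_dir='/home/user', now=None):
    """Build the ffmpeg argument list for recording an Xvfb display."""
    # `now` can be injected to get a predictable file name.
    stamp = (now or datetime.now()).strftime('%Y%m%d_%H%M%S')
    return [
        'ffmpeg',
        '-f', 'x11grab',          # capture from an X11 display
        '-r', str(fps),           # frame rate
        '-s', size,               # must match the Xvfb geometry
        '-i', display,            # X display to capture
        '-codec:v', 'libx264',
        '-preset', 'ultrafast',
        '-pix_fmt', 'yuv420p',
        f'{out_dir}/session_{stamp}.mp4',
    ]


if __name__ == '__main__':
    # Print the command as a copy-pastable shell line.
    print(shlex.join(ffmpeg_record_command()))
```

The list can be passed to `subprocess.Popen`. Stopping ffmpeg with an interrupt signal rather than a kill typically lets it finish writing the MP4 file.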

Connecting via VNC to an Xvfb session

Use x11vnc to expose the virtual display over VNC for live inspection:

x11vnc -display :1001 -passwd secret -forever -xkb

Then connect from any VNC client to <host>:5900. Key flags:

- `-display :1001`: X display to share
- `-passwd secret`: VNC password
- `-forever`: keep the server running after the first client disconnects
- `-xkb`: use the XKEYBOARD extension for better keyboard handling
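Before pointing a VNC client at the machine, it can help to confirm that something is actually listening on the port. The check below is a generic TCP probe written for this purpose; `vnc_port_open` is a hypothetical helper name, and the default port 5900 is the one mentioned above.

```python
import socket


def vnc_port_open(host, port=5900, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved hosts.
        return False
```

A `False` result usually means x11vnc is not running, is bound to another port, or is blocked by a firewall. A `True` result only shows that the port accepts connections, not that the VNC password is correct.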

Enabling debug logs

The middleware logs operational details (page load events, captcha attempts, screenshot captures, etc.) at the DEBUG level. Log messages are emitted under the scrapy_seleniumbase_cdp.middleware_async logger name. Warnings and errors (page load timeouts, element wait timeouts, max captcha attempts reached) use higher log levels and are always visible.

To see all debug output, set Scrapy's global log level in your settings.py:

LOG_LEVEL = 'DEBUG'

If you prefer to keep Scrapy's own output at a higher level and only enable debug logging for this middleware, configure the parent logger directly:

# settings.py or spider __init__
import logging
logging.getLogger('scrapy_seleniumbase_cdp').setLevel(logging.DEBUG)

This works because the middleware uses a module-level logger named scrapy_seleniumbase_cdp.middleware_async, which inherits settings from the scrapy_seleniumbase_cdp parent logger.
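The inheritance described here is standard-library behaviour and can be checked without Scrapy. The snippet below only demonstrates logger levels; whether a DEBUG record is finally printed also depends on the levels of the attached handlers, which this sketch does not cover.

```python
import logging

parent = logging.getLogger('scrapy_seleniumbase_cdp')
child = logging.getLogger('scrapy_seleniumbase_cdp.middleware_async')

# Only the parent gets an explicit level.
parent.setLevel(logging.DEBUG)

# The child has no level of its own, so it falls back to the parent's.
child_has_own_level = child.level != logging.NOTSET
child_effective_level = logging.getLevelName(child.getEffectiveLevel())

if __name__ == '__main__':
    print(child_has_own_level)    # prints False
    print(child_effective_level)  # prints DEBUG
```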

Where your Scrapy version provides the LOG_CATEGORIES setting (Scrapy ≥ 2.8), per-module log levels can also be configured there:

LOG_CATEGORIES = {
    'scrapy_seleniumbase_cdp': 'DEBUG',
}

Architecture

See docs/ARCHITECTURE.md for a detailed overview of the middleware internals, including a sequence diagram of the request processing pipeline.

License

This project is licensed under the MIT License. It is a fork of Quartz-Core/scrapy-seleniumbase which was originally released under the WTFPL.

Project details

Maintainers

nyg

Release history

- 2.0.4: May 2, 2026 (this version)
- 2.0.3: Apr 20, 2026
- 2.0.1: Apr 19, 2026
- 2.0.0: Mar 28, 2026
- 1.0.3: Mar 23, 2026
- 1.0.2: Mar 14, 2026
- 1.0.1: Mar 14, 2026
- 1.0.0: Dec 27, 2025
- 0.0.5: Dec 21, 2025
- 0.0.4: Dec 20, 2025
- 0.0.3: Dec 8, 2025
- 0.0.2: Dec 8, 2025
- 0.0.1: Dec 6, 2025
