scrapy-seleniumbase-cdp
Scrapy downloader middleware that uses SeleniumBase's pure CDP mode to make requests, which allows it to bypass most anti-bot protections (e.g. Cloudflare).
Using SeleniumBase's pure CDP mode also makes the middleware more platform-independent, as no WebDriver is required.
Installation
pip install scrapy-seleniumbase-cdp
Configuration
- Add the SeleniumBaseAsyncCDPMiddleware to the downloader middlewares:

      DOWNLOADER_MIDDLEWARES = {
          'scrapy_seleniumbase_cdp.SeleniumBaseAsyncCDPMiddleware': 800,
      }

- If needed, configuration can be provided to the SeleniumBase browser instance. For example, to enable the built-in ad blocker (blocks 30+ ad and tracking domains via CDP):

      SELENIUMBASE_BROWSER_OPTIONS = {
          'ad_block': True,
      }
Usage
To have SeleniumBase handle requests, use scrapy_seleniumbase_cdp.SeleniumBaseRequest instead of Scrapy's built-in Request:

    from scrapy_seleniumbase_cdp import SeleniumBaseRequest

    async def start(self):
        yield SeleniumBaseRequest(url=url, callback=self.parse_result)
Additional arguments
The scrapy_seleniumbase_cdp.SeleniumBaseRequest accepts five additional
arguments. They are executed in the order presented below:
wait_for / wait_timeout
When used, SeleniumBase will wait for the element matching the given CSS selector
to appear. The default timeout is 10 seconds but can be changed if needed. If the
element is not found within the timeout, the request is skipped (Scrapy's
IgnoreRequest is raised) and a full-page debug screenshot is saved using
SeleniumBase's default path.
    yield SeleniumBaseRequest(
        url=url,
        callback=self.parse_result,
        wait_for='h1.some-class',
        wait_timeout=5)
browser_callback
If needed, it is possible to provide a callback to interact with the browser
instance and/or its tabs. The return value of the async callback is stored in
response.meta['callback'].
    async def start(self):
        async def maximize_window(browser: Browser):
            await browser.main_tab.maximize()

        yield SeleniumBaseRequest(…, browser_callback=maximize_window)
script
When used, SeleniumBase will execute the provided JavaScript code.
    yield SeleniumBaseRequest(
        # …
        script='window.scrollTo(0, document.body.scrollHeight)')

If the script returns a Promise, it is possible to await its result:

    yield SeleniumBaseRequest(
        # …
        script={
            'await_promise': True,
            'script': '''
                document.getElementById('onetrust-accept-btn-handler').click()
                new Promise(resolve => setTimeout(resolve, 1000))
            '''
        })
The result of the JavaScript code is stored in response.meta['script'].
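When the element id or the delay comes from a Python variable, the script string has to be assembled safely. The following is a minimal sketch, not part of the package: `click_then_wait` is a hypothetical helper name, and only the `await_promise` and `script` keys are taken from the description above.

```python
import json


def click_then_wait(element_id: str, delay_ms: int = 1000) -> dict:
    """Build a `script` argument that clicks an element, then waits.

    Hypothetical helper: it only assembles the dict form shown above.
    """
    # json.dumps yields a valid JavaScript string literal, so quotes or
    # backslashes in element_id cannot break out of the script.
    js_id = json.dumps(element_id)
    script = (
        # `?.` skips the click instead of throwing if the element is absent.
        f"document.getElementById({js_id})?.click();\n"
        # The last expression is a Promise, so await_promise waits for it.
        f"new Promise(resolve => setTimeout(resolve, {int(delay_ms)}))"
    )
    return {'await_promise': True, 'script': script}
```

The returned dict could then be passed as `script=click_then_wait('onetrust-accept-btn-handler')`.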
screenshot
When used, SeleniumBase will take a screenshot of the page and the binary data
will be stored in response.meta['screenshot']:
    yield SeleniumBaseRequest(url=url, callback=self.parse_result, screenshot=True)

    def parse_result(self, response):
        # …
        with open('image.png', 'wb') as image_file:
            image_file.write(response.meta['screenshot'])
You can also specify additional configuration options:

    yield SeleniumBaseRequest(…, screenshot={'format': 'jpg', 'full_page': False})

Or provide a path to automatically save the screenshot (in this case, the image data is not stored in the response):

    yield SeleniumBaseRequest(…, screenshot={'path': 'output/image.png'})

Available configuration keys:
- path: File path where the screenshot will be saved. Use auto for SeleniumBase's default path. Leave empty to return the data in the response meta.
- format: Image format, defaults to png; jpg is also available.
- full_page: Capture the full page or just the viewport, defaults to True.
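Because the image data is only present in the response meta when no path was configured, a callback shared by both kinds of requests may want to check before writing. A minimal sketch, assuming `meta` behaves like a plain dict; `save_screenshot` is a hypothetical helper, not part of the package:

```python
from pathlib import Path


def save_screenshot(meta: dict, target: str) -> bool:
    """Write meta['screenshot'] to `target`; return False if no data is present."""
    data = meta.get('screenshot')
    # No binary data: the screenshot was not requested, or it was already
    # saved by the middleware because a `path` was given.
    if not isinstance(data, (bytes, bytearray)):
        return False
    path = Path(target)
    # Create missing parent directories such as 'output/'.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

Inside a spider callback this would be called as `save_screenshot(response.meta, 'output/image.png')`.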
Error handling
The middleware checks the HTTP status code right after loading the page:
- Non-2xx responses: wait_for, browser_callback, and script are skipped. A screenshot is still taken if configured. The response is returned with the real status code.
- wait_for timeout: if the expected element is not found within wait_timeout seconds, a full-page debug screenshot is saved using SeleniumBase's default path and IgnoreRequest is raised, causing Scrapy to skip the request.
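A callback that receives error responses can branch on the status code before reading the stored results. Note that Scrapy's HttpErrorMiddleware normally filters non-2xx responses out before they reach a callback unless the spider allows them, for example via `handle_httpstatus_list`. The sketch below assumes the skipped steps simply leave their meta keys unset; `extract_results` is a hypothetical helper, and only the meta keys come from the sections above.

```python
def extract_results(status: int, meta: dict) -> dict:
    """Collect whatever the middleware stored for a response."""
    ok = 200 <= status < 300
    return {
        'ok': ok,
        # script and browser_callback are skipped on non-2xx responses,
        # so their results are only read on success.
        'script': meta.get('script') if ok else None,
        'callback': meta.get('callback') if ok else None,
        # A screenshot is still taken on errors if it was configured.
        'screenshot': meta.get('screenshot'),
    }
```

In a spider callback this would be called as `extract_results(response.status, response.meta)`.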
License
This project is licensed under the MIT License. It is a fork of Quartz-Core/scrapy-seleniumbase which was originally released under the WTFPL.