Unofficial scrapy plugin for Selenium Grid
A Scrapy Download Handler which performs requests using Selenium Grid (via aiohttp). It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc.).
This is an unofficial Scrapy plugin and an unofficial Selenium plugin for Scrapy.
The development of this module is heavily inspired by scrapy-playwright and asyncselenium.
Requirements
Since the release of version 2.0, which added coroutine syntax support and asyncio support, Scrapy can integrate asyncio-based projects such as aiohttp.
Minimum required versions
- Python >= 3.8
- Scrapy >= 2.0
- aiohttp
Installation
scrapy-selenium-grid is available on PyPI and can be installed with pip:
pip install scrapy-selenium-grid
Activation
Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
    'https': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
}
Note that the ScrapyDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic Usage), requests will be processed by the regular Scrapy download handler.
Also, be sure to install the asyncio-based Twisted reactor:
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
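For reference, both activation steps can live together in settings.py. This is a minimal sketch assembled from the snippets above; SELENIUM_GRID_URL is shown with its documented default and only needs changing if your hub runs elsewhere:

```python
# settings.py -- minimal activation sketch for scrapy-selenium-grid.
# Values are taken from this README; adjust SELENIUM_GRID_URL for your hub.
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
    'https': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
SELENIUM_GRID_URL = 'http://127.0.0.1:4444'  # documented default
```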
Basic Usage
Set the selenium_grid key to download a request using Selenium Grid:
import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"selenium_grid": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"selenium_grid": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}
Supported Settings
SELENIUM_GRID_BROWSER_NAME
Type str, default chrome
The browser type to be used in Selenium Grid, e.g. chrome, edge, firefox, ie, safari.
SELENIUM_GRID_URL
Type str, default http://127.0.0.1:4444
The Selenium Grid hub URL.
SELENIUM_GRID_IMPLICIT_WAIT_INSEC
Type int, default 0
Selenium has a built-in way to automatically wait for elements.
This is a global setting that applies to every element location call for the entire session. The default value is 0, which means that if the element is not found, it will immediately return an error. If an implicit wait is set, the driver will wait for the duration of the provided value before returning the error. Note that as soon as the element is located, the driver will return the element reference and the code will continue executing, so a larger implicit wait value won’t necessarily increase the duration of the session.
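The polling behaviour described above can be sketched in plain Python. This is a conceptual model of implicit-wait semantics, not the plugin's or Selenium's actual implementation; find_with_implicit_wait and locate are illustrative names:

```python
import time

def find_with_implicit_wait(locate, timeout_insec, poll_interval=0.05):
    """Conceptual model of Selenium's implicit wait: keep re-trying the
    element lookup until it succeeds or the timeout elapses.

    `locate` is any callable returning the element, or None if not found.
    """
    deadline = time.monotonic() + timeout_insec
    while True:
        element = locate()
        if element is not None:
            # Found: return immediately, so a large timeout does not
            # slow down lookups for elements that are already present.
            return element
        if time.monotonic() >= deadline:
            # With timeout_insec == 0 this raises on the first miss,
            # matching the default behaviour described above.
            raise LookupError("element not found within implicit wait")
        time.sleep(poll_interval)
```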
Supported Request Meta
selenium_grid
Type bool, default False
If set to a value that evaluates to True, the request will be processed by Selenium Grid.
return scrapy.Request("https://example.org", meta={"selenium_grid": True})
selenium_grid_driver
Type scrapy_selenium_grid.webdriver.WebDriver
This is set to the asynchronous Selenium driver when selenium_grid is enabled in the request meta.
import scrapy

from scrapy_selenium_grid.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        yield scrapy.Request(
            url="https://httpbin.org/get",
            meta={"selenium_grid": True},
        )

    async def parse(self, response, **kwargs):
        driver = response.meta["selenium_grid_driver"]
        await ActionChains(driver).key_down(Keys.F12).key_up(Keys.F12).perform()
        inp_userid = await driver.find_element(By.CSS_SELECTOR, 'input[name="userid"]')
        assert await inp_userid.is_displayed()
        await inp_userid.send_keys("Username")
        print(await driver.get_log('browser'))
selenium_grid_browser
Type str, default None
Same values as SELENIUM_GRID_BROWSER_NAME, but set per request.
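If many requests share these meta keys, a small helper keeps them consistent. grid_meta below is a hypothetical convenience function, not part of this plugin; it only builds the meta dict documented above:

```python
def grid_meta(browser=None):
    """Build the request meta for Selenium Grid processing.

    `browser` optionally sets selenium_grid_browser per request
    (same values as SELENIUM_GRID_BROWSER_NAME, e.g. "firefox").
    Hypothetical helper, not part of scrapy-selenium-grid.
    """
    meta = {"selenium_grid": True}
    if browser is not None:
        meta["selenium_grid_browser"] = browser
    return meta

# Usage inside a spider:
# yield scrapy.Request("https://example.org", meta=grid_meta("firefox"))
```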