Unofficial Scrapy plugin for Selenium Grid

Project description

A Scrapy Download Handler which performs requests using Selenium Grid (via aiohttp). It can be used to handle pages that require JavaScript, among other things, while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc.).

This is an unofficial Scrapy plugin; it is affiliated with neither the Scrapy nor the Selenium projects.

The development of this module is heavily inspired by scrapy-playwright and asyncselenium.

Requirements

Since the release of version 2.0, which added coroutine syntax support and asyncio support, Scrapy has allowed the integration of asyncio-based projects such as aiohttp.

Minimum required versions

  • Python >= 3.8

  • Scrapy >= 2.0

  • aiohttp

Installation

scrapy-selenium-grid is available on PyPI and can be installed with pip:

pip install scrapy-selenium-grid

Activation

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
    'https': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
}

Note that the ScrapyDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic Usage), requests will be processed by the regular Scrapy download handler.

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
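
Both settings can also be scoped to a single spider through Scrapy's custom_settings class attribute, as in this minimal sketch (setting TWISTED_REACTOR per spider requires Scrapy 2.7 or later; earlier versions ignore a reactor set this way):

import scrapy

class GridSpider(scrapy.Spider):
    name = "grid"

    # Same activation settings as above, applied to this spider only.
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_selenium_grid.download_handler.ScrapyDownloadHandler",
            "https": "scrapy_selenium_grid.download_handler.ScrapyDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }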

Basic Usage

Set the selenium_grid Request.meta key to download a request using Selenium Grid:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"selenium_grid": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"selenium_grid": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}

Supported Settings

SELENIUM_GRID_BROWSER_NAME

Type str, default chrome

The browser type to be used in Selenium Grid, e.g. chrome, edge, firefox, ie, safari.

SELENIUM_GRID_URL

Type str, default http://127.0.0.1:4444

The Selenium Grid hub URL.

SELENIUM_GRID_IMPLICIT_WAIT_INSEC

Type int, default 0

Selenium has a built-in way to automatically wait for elements: the implicit wait.

This is a global setting that applies to every element-location call for the entire session, expressed in seconds. With the default value of 0, an error is returned immediately if an element is not found. If an implicit wait is set, the driver waits up to that many seconds before returning the error. Note that as soon as the element is located, the driver returns the element reference and the code continues executing, so a larger implicit wait won't necessarily increase the duration of the session.
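
Put together, a project's settings.py enabling the handler with these options might look like the following sketch (the browser and implicit-wait values are illustrative, not the defaults):

# settings.py -- minimal sketch combining activation with the settings above.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium_grid.download_handler.ScrapyDownloadHandler",
    "https": "scrapy_selenium_grid.download_handler.ScrapyDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

SELENIUM_GRID_BROWSER_NAME = "firefox"       # default: chrome
SELENIUM_GRID_URL = "http://127.0.0.1:4444"  # the default local hub URL
SELENIUM_GRID_IMPLICIT_WAIT_INSEC = 5        # default: 0 (fail immediately)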

Supported Request Meta

selenium_grid

Type bool, default False

If set to a value that evaluates to True, the request will be processed by Selenium Grid.

return scrapy.Request("https://example.org", meta={"selenium_grid": True})

selenium_grid_driver

Type scrapy_selenium_grid.webdriver.WebDriver

This is set to the asynchronous Selenium driver when selenium_grid is enabled in the request meta.

import scrapy
from scrapy_selenium_grid.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        yield scrapy.Request(
            url="https://httpbin.org/get",
            meta={"selenium_grid": True},
        )

    async def parse(self, response, **kwargs):
        # The asynchronous driver for this page, set by the download handler.
        driver = response.meta["selenium_grid_driver"]

        # Send a key chord to the page.
        await ActionChains(driver).key_down(Keys.F12).key_up(Keys.F12).perform()

        # Illustrative: the target page must actually contain this input.
        inp_userid = await driver.find_element(By.CSS_SELECTOR, 'input[name="userid"]')
        assert await inp_userid.is_displayed()
        await inp_userid.send_keys("Username")

        # Browser console logs (supported by Chromium-based browsers).
        print(await driver.get_log("browser"))

selenium_grid_browser

Type str, default None

Accepts the same values as SELENIUM_GRID_BROWSER_NAME, but set per request; it overrides the global setting for that request.
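
For example, to send a single request through Firefox while the rest of the crawl keeps the configured default (a sketch; https://example.org stands in for a real target):

def start_requests(self):
    # The browser override applies to this request only; selenium_grid
    # must still be enabled for the request to go through the Grid.
    yield scrapy.Request(
        "https://example.org",
        meta={"selenium_grid": True, "selenium_grid_browser": "firefox"},
    )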

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy_selenium_grid-0.0.1.tar.gz (35.8 kB)

Built Distribution

scrapy_selenium_grid-0.0.1-py3-none-any.whl (49.2 kB)

File details

Details for the file scrapy_selenium_grid-0.0.1.tar.gz.

File metadata

  • Download URL: scrapy_selenium_grid-0.0.1.tar.gz
  • Size: 35.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for scrapy_selenium_grid-0.0.1.tar.gz

  • SHA256: 120375744959ee8894ae527e89ab27355bef96c00d66152794f8f58d226cd06d
  • MD5: 3cb9b0469a87b43d79a633766286a6f8
  • BLAKE2b-256: 64121323a415412a938b6bc7347ec95fc0dc73f211aa342ce6fae5c82dfab54b

File details

Details for the file scrapy_selenium_grid-0.0.1-py3-none-any.whl.

File hashes

Hashes for scrapy_selenium_grid-0.0.1-py3-none-any.whl

  • SHA256: 3827f62a9baff6e7b278c824bd1a7a8f0dd106dd12fd272da859601807ae2434
  • MD5: c4f190de1c03162bfd98a465d98fb36d
  • BLAKE2b-256: 3624667a92ccfe66292c4aae7fc8fdf01c3083db99b3df524b28eac207d55128
