Unofficial scrapy plugin for Selenium Grid

A Scrapy Download Handler that performs requests through Selenium Grid, using aiohttp. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc.).

This is an unofficial Scrapy plugin and an unofficial Selenium plugin.

The development of this module is heavily inspired by scrapy-playwright and asyncselenium.

Requirements

Since version 2.0, which added coroutine syntax and asyncio support, Scrapy can integrate asyncio-based projects such as aiohttp.

Minimum required versions

  • Python >= 3.8

  • Scrapy >= 2.0

  • aiohttp

Installation

scrapy-selenium-grid is available on PyPI and can be installed with pip:

pip install scrapy-selenium-grid

Activation

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
    'https': 'scrapy_selenium_grid.download_handler.ScrapyDownloadHandler',
}

Note that the ScrapyDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic Usage), requests will be processed by the regular Scrapy download handler.
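
The routing rule described above can be illustrated with a small stand-alone sketch. It does not import the plugin, and the function names `should_use_selenium_grid` and `route` are hypothetical; the only thing taken from this README is that a truthy `selenium_grid` meta key selects Selenium Grid and everything else falls through to the regular handler.

```python
# Hypothetical stand-in for the routing decision; not the plugin's actual code.
def should_use_selenium_grid(meta: dict) -> bool:
    """Return True when the request meta marks the request for Selenium Grid."""
    return bool(meta.get("selenium_grid", False))


def route(meta: dict) -> str:
    """Name the handler that would process a request carrying this meta."""
    return "selenium_grid" if should_use_selenium_grid(meta) else "default"


print(route({"selenium_grid": True}))  # marked request
print(route({}))                       # unmarked request
print(route({"selenium_grid": 0}))     # falsy value, treated as unmarked
```

Because the check is on truthiness rather than on the literal value True, requests that never set the key keep the default Scrapy behaviour.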

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

Basic Usage

Set the selenium_grid key in the request meta to download a request using Selenium Grid:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"selenium_grid": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"selenium_grid": True},
        )

    def parse(self, response, **kwargs):
        # 'response' contains the page as seen by the browser
        return {"url": response.url}

Supported Settings

SELENIUM_GRID_BROWSER_NAME

Type str, default chrome

The browser type to be used in Selenium Grid, e.g. chrome, edge, firefox, ie, safari.

SELENIUM_GRID_URL

Type str, default http://127.0.0.1:4444

The URL of the Selenium Grid hub.

SELENIUM_GRID_IMPLICIT_WAIT_INSEC

Type int, default 0

Selenium has a built-in way to automatically wait for elements.

This is a global setting that applies to every element location call for the entire session. The default value is 0, which means that if the element is not found, it will immediately return an error. If an implicit wait is set, the driver will wait for the duration of the provided value before returning the error. Note that as soon as the element is located, the driver will return the element reference and the code will continue executing, so a larger implicit wait value won’t necessarily increase the duration of the session.
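
To make the implicit wait semantics concrete, here is a minimal simulation in plain Python. It is not Selenium's or this plugin's implementation: `find_with_implicit_wait`, its `poll_interval` parameter, and the use of `LookupError` as a stand-in for the driver's "element not found" error are all assumptions made for illustration.

```python
import time


def find_with_implicit_wait(locate, implicit_wait_sec=0, poll_interval=0.05):
    """Call locate() until it returns an element or the wait expires.

    locate is any callable returning an element or None (hypothetical).
    """
    deadline = time.monotonic() + implicit_wait_sec
    while True:
        element = locate()
        if element is not None:
            # Return as soon as the element exists; no extra waiting.
            return element
        if time.monotonic() >= deadline:
            # With a wait of 0 this is reached after the first attempt.
            raise LookupError("element not found")
        time.sleep(poll_interval)


# Simulated page where the element shows up on the third lookup.
attempts = {"count": 0}


def appears_on_third_try():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


result = find_with_implicit_wait(appears_on_third_try, implicit_wait_sec=2)
print(result, attempts["count"])
```

The lookup returns after the third attempt, well before the two-second limit, which is why a generous implicit wait does not necessarily lengthen the session.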

Supported Request Meta

selenium_grid

Type bool, default False

If set to a value that evaluates to True, the request will be processed by Selenium Grid.

return scrapy.Request("https://example.org", meta={"selenium_grid": True})

selenium_grid_driver

Type scrapy_selenium_grid.webdriver.WebDriver

When selenium_grid is enabled in the request meta, this key is set to the asynchronous Selenium driver used for the request. It can be read from response.meta in the callback.

import scrapy
from scrapy_selenium_grid.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class DriverSpider(scrapy.Spider):
    # Spider name chosen for this example only
    name = "driver_example"

    def start_requests(self):
        yield scrapy.Request(
            url="https://httpbin.org/get",
            meta={"selenium_grid": True},
        )

    async def parse(self, response, **kwargs):
        # The asynchronous driver is attached to the response meta
        driver = response.meta["selenium_grid_driver"]

        # Press and release F12 through an action chain
        await ActionChains(driver).key_down(Keys.F12).key_up(Keys.F12).perform()

        # Locate an input field, check that it is visible, then type into it
        inp_userid = await driver.find_element(By.CSS_SELECTOR, 'input[name="userid"]')
        assert await inp_userid.is_displayed()
        await inp_userid.send_keys("Username")

        # Print the browser console log
        print(await driver.get_log('browser'))

selenium_grid_browser

Type str, default None

Accepts the same values as SELENIUM_GRID_BROWSER_NAME, but is set per request.
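
One plausible reading of how the per-request key and the project-wide setting combine is sketched below. The precedence (the request meta wins, otherwise the setting, otherwise chrome) is an assumption drawn from the defaults listed in this README, and `resolve_browser` is a hypothetical helper, not part of the plugin.

```python
def resolve_browser(meta: dict, settings: dict) -> str:
    """Pick the browser for one request (assumed precedence, see above)."""
    per_request = meta.get("selenium_grid_browser")  # default None
    if per_request:
        return per_request
    # Fall back to the project setting, whose documented default is chrome.
    return settings.get("SELENIUM_GRID_BROWSER_NAME", "chrome")


settings = {"SELENIUM_GRID_BROWSER_NAME": "edge"}
print(resolve_browser({"selenium_grid": True}, settings))
print(resolve_browser({"selenium_grid": True, "selenium_grid_browser": "firefox"}, settings))
print(resolve_browser({"selenium_grid": True}, {}))
```

In a spider, the override would be passed alongside the selenium_grid key, for example meta={"selenium_grid": True, "selenium_grid_browser": "firefox"}.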
