Middleware for scrapy using selenium webdriver.

Scralenium

Scralenium is a scrapy middleware that lets you use a selenium webdriver with scrapy to scrape data from dynamic web pages. The name is actually really clever: if you didn't notice, it is scrapy + selenium = scralenium. Genius, right? :)

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed the latest version of python 3
  • You are familiar with the scrapy framework
  • You are familiar with selenium
  • You have a webdriver installed/available

Requirements:

  • scrapy
  • selenium

Installing

To install scralenium, follow these steps:

git clone https://github.com/alexpdev/scralenium.git
cd scralenium
pip install .

From PyPI:

pip install scralenium
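
After either install route, a quick way to confirm that the package is visible to your interpreter is to query its installed version. The snippet below is a minimal sketch using only the standard library; the helper name installed_version is made up for this example and is not part of scralenium.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent.

    Hypothetical helper for illustration; not part of scralenium.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if the install did not succeed.
print(installed_version("scralenium"))
```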

License

This project uses the following license: Apache 2.0.

Usage

Using scralenium is really simple.

In your scrapy settings, set the SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE fields. scralenium currently supports chrome as the SELENIUM_DRIVER_NAME value. If the webdriver executable is already on your PATH, SELENIUM_DRIVER_EXECUTABLE can be omitted. You also need to enable the ScraleniumDownloaderMiddleware in the DOWNLOADER_MIDDLEWARES setting.

from shutil import which
SELENIUM_DRIVER_EXECUTABLE = which("chromedriver")
SELENIUM_DRIVER_NAME = "chrome"
DOWNLOADER_MIDDLEWARES = {
    "scralenium.ScraleniumDownloaderMiddleware" : 950
}
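
If you would rather build these settings in code, for example to leave out SELENIUM_DRIVER_EXECUTABLE when the driver is already on your PATH, something like the following sketch may help. The helper name scralenium_settings and its parameters are assumptions made for this example and are not part of scralenium; only the setting names and the middleware path come from the snippet above.

```python
def scralenium_settings(executable=None, driver_name="chrome", priority=950):
    """Build the scralenium-related settings as a plain dict.

    Hypothetical helper for illustration; not part of scralenium.
    """
    settings = {
        "SELENIUM_DRIVER_NAME": driver_name,
        "DOWNLOADER_MIDDLEWARES": {
            "scralenium.ScraleniumDownloaderMiddleware": priority,
        },
    }
    # Only set the executable when an explicit path is given; if the
    # driver is already on PATH, the setting can be left out.
    if executable is not None:
        settings["SELENIUM_DRIVER_EXECUTABLE"] = executable
    return settings


# Driver on PATH: no executable setting is emitted.
print(scralenium_settings())

# Explicit driver location (example path only).
print(scralenium_settings(executable="/opt/drivers/chromedriver"))
```

The resulting dict can be merged into settings.py or assigned to a spider's custom_settings attribute.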

Once you have added the settings to the settings.py file or to the spider's custom_settings attribute, all that is needed is to yield a ScraleniumRequest from the start_requests method or from your parse callback methods. The pause argument sets the webdriver's implicit wait value. The response argument in the parse callback methods gives you full access to the normal scrapy response as well as all the features of the webdriver.

import scrapy
from scralenium import ScraleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class MySpider(scrapy.Spider):
    """Example of using scrapy.Spider with ScraleniumRequest"""
    ...

    def start_requests(self):
        for url in self.start_urls:
            # pause sets the webdriver's implicit wait value
            yield ScraleniumRequest(url, callback=self.parse, pause=4)

    def parse(self, response):
        # The normal scrapy response API is available
        html = response.text
        title = response.xpath("//title/text()").get()
        # The webdriver features are available on the same response object
        element = response.find_element(By.ID, "submit-button")
        element.send_keys(Keys.RETURN)
        next_page = response.xpath("//a[@class='next-page-link']/@href").get()
        # Only follow the next-page link when one was found
        if next_page:
            next_url = response.urljoin(next_page)
            yield ScraleniumRequest(next_url, callback=self.parse, pause=4)
        yield {"title": title}
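
The next-page step in the parse callback resolves a possibly relative href against the current page URL. As a rough, standard-library-only illustration of that resolution (scrapy's response.urljoin may additionally honour a base tag in the page), the sketch below uses urllib.parse.urljoin. The function name next_page_url and the example.com URLs are made up for this example.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a next-page href against the current page URL.

    Returns None when no href was extracted, so the caller can skip
    yielding a follow-up request. Hypothetical helper for illustration.
    """
    if not href:
        return None
    return urljoin(page_url, href)


print(next_page_url("https://example.com/list?page=1", "/list?page=2"))
print(next_page_url("https://example.com/list?page=1", None))
```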

I have added some additional features but am behind on documenting them.

TODO

[x] add features
[ ] document them
