Scralenium

Middleware for scrapy using selenium webdriver.

scralenium is a middleware that lets you use selenium webdriver with scrapy to scrape data from dynamic web pages. The name is actually really clever, if you didn't notice: scrapy + selenium = scralenium. Genius, right? :)

Prerequisites

Before you begin, ensure you have met the following requirements:

  • You have installed the latest version of Python 3
  • You are familiar with the scrapy framework
  • You are familiar with selenium
  • You have a webdriver installed/available (see the quick check below)
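
As a quick sanity check for the last point, you can look the driver binary up on your PATH with shutil.which, the same helper used in the settings example later on. This sketch assumes chromedriver is the driver you intend to use:

from shutil import which

# returns the full path to the binary, or None if it is not on PATH
driver_path = which("chromedriver")
if driver_path is None:
    raise SystemExit(
        "chromedriver not found on PATH; install it or set "
        "SELENIUM_DRIVER_EXECUTABLE to its location"
    )
print(f"Using webdriver at {driver_path}")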

Requirements:

  • scrapy
  • selenium

Installing

To install scralenium, follow these steps:

git clone https://github.com/alexpdev/scralenium.git
cd scralenium
pip install .

From PyPi

pip install scralenium

License

This project uses the following license: Apache 2.0.

Usage

Using scralenium is really simple.

In your scrapy settings, set the SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE fields. scralenium currently supports "chrome" as the SELENIUM_DRIVER_NAME value. If the webdriver executable is already on your PATH, then SELENIUM_DRIVER_EXECUTABLE can be omitted. You also need to enable ScraleniumDownloaderMiddleware in the DOWNLOADER_MIDDLEWARES setting.

from shutil import which

# locate the chromedriver executable; omit if it is already on PATH
SELENIUM_DRIVER_EXECUTABLE = which("chromedriver")
SELENIUM_DRIVER_NAME = "chrome"

# enable the scralenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    "scralenium.ScraleniumDownloaderMiddleware": 950
}

Once you have added the settings to the settings.py file (or to the spider's custom_settings attribute), all that is needed is to use ScraleniumRequest when yielding from the start_requests method or from your parse callback methods. The pause argument can be used to set the webdriver's implicit wait value, and the response argument in the parse callback methods gives you full access to the normal scrapy response as well as all the features of the webdriver.

import scrapy
from scralenium import ScraleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class MySpider(scrapy.Spider):
    """Example of using scrapy.Spider with ScraleniumRequest."""

    ...

    def start_requests(self):
        for url in self.start_urls:
            # pause=4 sets the webdriver's implicit wait to 4 seconds
            yield ScraleniumRequest(url, callback=self.parse, pause=4)

    def parse(self, response):
        html = response.text
        title = response.xpath("//title/text()").get()
        # the response exposes webdriver methods alongside the scrapy API
        element = response.find_element(By.ID, "submit-button")
        element.send_keys(Keys.RETURN)
        next_page = response.xpath("//a[@class='next-page-link']/@href").get()
        next_url = response.urljoin(next_page)
        yield ScraleniumRequest(next_url, callback=self.parse, pause=4)
        yield {"title": title}

I have added some additional features but am behind on documenting them.

TODO

[x] add features
[ ] document them
