Middleware for scrapy using selenium webdriver.
Scralenium
scralenium is a middleware that lets you use a Selenium webdriver with Scrapy to scrape data from dynamic web pages. The name is actually really clever, if you didn't notice: it is scrapy + selenium = scralenium. Genius, right? :)
Prerequisites
Before you begin, ensure you have met the following requirements:
- You have installed the latest version of python 3
- You are familiar with the scrapy framework
- You are familiar with selenium
- You have a webdriver installed/available
Requirements:
- scrapy
- selenium
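The webdriver prerequisite is easy to check from Python before going further. A minimal sketch, assuming chromedriver (the executable name is an assumption; substitute geckodriver or whichever driver you use):

```python
from shutil import which

# which() returns the executable's full path if it is on PATH, else None.
driver_path = which("chromedriver")
if driver_path:
    print(f"chromedriver found at {driver_path}")
else:
    print("chromedriver not on PATH; you will need to pass its path explicitly")
```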
Installing
To install scralenium, follow these steps:
git clone https://github.com/alexpdev/scralenium.git
cd scralenium
pip install .
From PyPI
pip install scralenium
License
This project uses the following license: Apache 2.0.
Usage
Using scralenium is really simple.
In your scrapy settings, set the SELENIUM_DRIVER_NAME and SELENIUM_DRIVER_EXECUTABLE fields. scralenium currently supports only "chrome" as the SELENIUM_DRIVER_NAME. If the webdriver executable is already on your PATH, the SELENIUM_DRIVER_EXECUTABLE field can be omitted. You also need to enable the ScraleniumDownloaderMiddleware in the DOWNLOADER_MIDDLEWARES setting.
from shutil import which
SELENIUM_DRIVER_EXECUTABLE = which("chromedriver")
SELENIUM_DRIVER_NAME = "chrome"
DOWNLOADER_MIDDLEWARES = {
    "scralenium.ScraleniumDownloaderMiddleware": 950,
}
Once you have added the settings to the settings.py file or to the spider's custom_settings attribute, all that is needed is to use ScraleniumRequest when yielding from the start_requests method or from your parse callback methods. The pause argument sets the webdriver's implicit wait value, and the response argument in the parse callback methods gives you full access to the normal scrapy response as well as all the features of the webdriver.
import scrapy
from scralenium import ScraleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class MySpider(scrapy.Spider):
    """Example of using scrapy.Spider with ScraleniumRequest."""

    ...

    def start_requests(self):
        for url in self.start_urls:
            yield ScraleniumRequest(url, callback=self.parse, pause=4)

    def parse(self, response):
        html = response.text
        title = response.xpath("//title/text()").get()
        element = response.find_element(By.ID, "submit-button")
        element.send_keys(Keys.RETURN)
        next_page = response.xpath("//a[@class='next-page-link']/@href").get()
        next_url = response.urljoin(next_page)
        yield ScraleniumRequest(next_url, callback=self.parse, pause=4)
        yield {"title": title}
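As mentioned above, the same three settings can live on the spider itself instead of in settings.py. A minimal sketch of such a custom_settings dict (shown standalone here; in practice it would be a class attribute on the spider):

```python
from shutil import which

# Hypothetical per-spider settings dict; assign it to the spider's
# custom_settings class attribute rather than editing settings.py.
custom_settings = {
    "SELENIUM_DRIVER_NAME": "chrome",
    # which() returns None if chromedriver is not on PATH.
    "SELENIUM_DRIVER_EXECUTABLE": which("chromedriver"),
    "DOWNLOADER_MIDDLEWARES": {
        "scralenium.ScraleniumDownloaderMiddleware": 950,
    },
}
print(sorted(custom_settings))
```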
I have added some additional features but am behind on documenting them.
TODO
- [x] add features
- [ ] document them