Middleware for scrapy using selenium webdriver.
Scralenium
Scralenium is a package that lets you use the selenium webdriver with scrapy to scrape data from dynamic web pages. The name is actually pretty clever: in case you didn't notice, it is scrapy + selenium = scralenium. Genius, right? :)
Prerequisites
Before you begin, ensure you have met the following requirements:
- You have installed the latest version of python 3
- You are familiar with the scrapy framework
- You are familiar with selenium
- You have a webdriver installed/available
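A quick way to check the last prerequisite is to look the driver up on the PATH. The sketch below uses only the standard library; the helper name find_webdriver is hypothetical, and it assumes the driver executable is called chromedriver, as in the settings example in this README.

```python
import shutil
import sys
from typing import Optional


def find_webdriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # Report the interpreter version and whether the driver was found.
    print("Python version:", sys.version.split()[0])
    print("chromedriver:", find_webdriver() or "not found on PATH")
```

If the driver is not found, you can still pass its location explicitly through the SELENIUM_DRIVER_EXECUTABLE setting.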
Requirements:
- scrapy
- selenium
Installing
To install scralenium from source, follow these steps:
git clone https://github.com/alexpdev/scralenium.git
cd scralenium
pip install .
From PyPI
pip install scralenium
License
This project uses the following license: Apache 2.0.
Usage
Using scralenium is really simple.
In your scrapy settings, set the SELENIUM_DRIVER_NAME and
SELENIUM_DRIVER_EXECUTABLE fields. scralenium currently supports
chrome as the value of the SELENIUM_DRIVER_NAME field. If the webdriver
executable is already on the path, SELENIUM_DRIVER_EXECUTABLE can be
omitted. You also need to enable the ScraleniumDownloaderMiddleware in
the DOWNLOADER_MIDDLEWARES setting.
from shutil import which

SELENIUM_DRIVER_EXECUTABLE = which("chromedriver")
SELENIUM_DRIVER_NAME = "chrome"
DOWNLOADER_MIDDLEWARES = {
    "scralenium.ScraleniumDownloaderMiddleware": 950
}
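The same values can be assembled as a dict, for instance to assign to a spider's custom_settings attribute. This is a minimal sketch, not part of scralenium: the helper name build_scralenium_settings is hypothetical, and it assumes, as noted above, that SELENIUM_DRIVER_EXECUTABLE can be left out when the driver is already on the PATH.

```python
from typing import Optional


def build_scralenium_settings(
    driver_path: Optional[str] = None, priority: int = 950
) -> dict:
    """Return a settings dict using the setting names from this README."""
    settings = {
        "SELENIUM_DRIVER_NAME": "chrome",
        "DOWNLOADER_MIDDLEWARES": {
            "scralenium.ScraleniumDownloaderMiddleware": priority,
        },
    }
    # Only set the executable when an explicit path is given; otherwise the
    # driver is expected to be found on the PATH.
    if driver_path:
        settings["SELENIUM_DRIVER_EXECUTABLE"] = driver_path
    return settings
```

In a spider this could be used as custom_settings = build_scralenium_settings(), keeping the project-wide settings.py untouched.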
Once you have added the settings to the settings.py file or to the
spider's custom_settings attribute, all that is needed is to use
ScraleniumRequest when yielding from the start_requests method or
from your parse callback methods. The pause argument can be used to set
the webdriver's implicit wait value. The response argument in the
parse callback methods gives you full access to the normal scrapy response
as well as all the features of the webdriver.
import scrapy
from scralenium import ScraleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class MySpider(scrapy.Spider):
    """Example of using scrapy.Spider with ScraleniumRequest"""
    ...

    def start_requests(self):
        for url in self.start_urls:
            # pause sets the webdriver's implicit wait value
            yield ScraleniumRequest(url, callback=self.parse, pause=4)

    def parse(self, response):
        # the usual scrapy response features
        html = response.text
        title = response.xpath("//title/text()").get()
        # webdriver features on the same response object
        element = response.find_element(By.ID, "submit-button")
        element.send_keys(Keys.RETURN)
        next_page = response.xpath("//a[@class='next-page-link']/@href").get()
        # only follow the link when the page actually has one
        if next_page:
            next_url = response.urljoin(next_page)
            yield ScraleniumRequest(next_url, callback=self.parse, pause=4)
        yield {"title": title}
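When a page has no next-page link, the XPath lookup in the parse callback returns None, so it is worth checking for that before building the follow-up request. The sketch below illustrates the idea with the standard library only: resolve_next_url is a hypothetical helper, and urllib.parse.urljoin is used here as an approximation of what response.urljoin does with the response URL.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_next_url(page_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve a possibly relative href against the page URL.

    Returns None when there is no link to follow.
    """
    if not href:
        return None
    return urljoin(page_url, href)


# Relative and absolute-path links resolve against the current page;
# a missing link yields None, so no further request should be made.
print(resolve_next_url("https://example.com/list?page=1", "/list?page=2"))
print(resolve_next_url("https://example.com/list?page=1", None))
```

In a spider, the callback would yield a new ScraleniumRequest only when the resolved URL is not None.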
I have added some additional features but am behind on documenting them.
TODO
[x] add features
[] document them
File details
Details for the file scralenium-0.1.3.tar.gz.
File metadata
- Download URL: scralenium-0.1.3.tar.gz
- Upload date:
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c38a43d5c37dd0bed6cd4f862f874d20a7e41215792ef2059e3797389c07f999 |
| MD5 | e265f71fa6564c451f4f1bba913c1541 |
| BLAKE2b-256 | 525b853d435b1080ed966e711478707781fd9b339354a75904957efbc270dcc8 |
File details
Details for the file scralenium-0.1.3-py3-none-any.whl.
File metadata
- Download URL: scralenium-0.1.3-py3-none-any.whl
- Upload date:
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52abaa98d1669244a8a9babccce2cc2fd2ed438ce10e3dde89dfc035a4d6b41a |
| MD5 | 3e15610a6bf27bf631c73b5bc40e28ec |
| BLAKE2b-256 | 7693e490111b058c4cfaa1f5ec157245b2f55c1eb16c04c8ef37303eac283c92 |