# Scrapy with selenium
Scrapy middleware to handle javascript pages using selenium.
## Installation
```
$ pip install scrapy_selenium_python_pi
```
You should use **python>=3.5**.
You will also need one of the Selenium [compatible browsers](http://www.seleniumhq.org/about/platforms.jsp).
## Configuration
1. Add the browser to use, the path to the executable, and the arguments to pass to the executable to the scrapy settings:
```python
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
```
2. Add the `SeleniumMiddleware` to the downloader middlewares:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium_python_pi.SeleniumMiddleware': 800
}
```
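If a project switches between browsers, the three `SELENIUM_*` settings can be derived from the driver name. The sketch below is only an illustration: `DRIVER_PROFILES` and `selenium_settings` are hypothetical names, not part of `scrapy_selenium_python_pi`, and the `chromedriver` executable name for Chrome is an assumption based on the usual ChromeDriver install.

```python
from shutil import which

# Hypothetical lookup table (not part of scrapy_selenium_python_pi):
# driver name -> (driver executable, headless flag).
DRIVER_PROFILES = {
    'firefox': ('geckodriver', '-headless'),
    'chrome': ('chromedriver', '--headless'),  # executable name is an assumption
}


def selenium_settings(driver_name):
    """Build the three SELENIUM_* settings for a driver listed in DRIVER_PROFILES."""
    executable, headless_flag = DRIVER_PROFILES[driver_name]
    return {
        'SELENIUM_DRIVER_NAME': driver_name,
        # which() returns None when the executable is not on PATH.
        'SELENIUM_DRIVER_EXECUTABLE_PATH': which(executable),
        'SELENIUM_DRIVER_ARGUMENTS': [headless_flag],
    }


# In settings.py the result could be merged into the module namespace:
# globals().update(selenium_settings('firefox'))
```

Keeping the flag next to the driver name avoids passing Firefox's `-headless` to Chrome or the reverse.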
## Usage
Use `scrapy_selenium_python_pi.SeleniumRequest` instead of the scrapy built-in `Request`:
```python
from scrapy_selenium_python_pi import SeleniumRequest
yield SeleniumRequest(url, self.parse_result)
```
The request will be handled by selenium, and the response will have an additional `meta` key named `driver`, containing the selenium driver that processed the request.
```python
def parse_result(self, response):
    print(response.meta['driver'].title)
```
For more information about the available driver methods and attributes, refer to the [selenium python documentation](http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webdriver).
The `selector` response attribute works as usual (but contains the html processed by the selenium driver).
```python
def parse_result(self, response):
    # text() selects the title's text node; '@text' would look for an attribute named "text"
    print(response.selector.xpath('//title/text()'))
```
### Additional arguments
The `scrapy_selenium_python_pi.SeleniumRequest` accepts 3 additional arguments:
#### `wait_time` / `wait_until`
When used, selenium will perform an [Explicit wait](http://selenium-python.readthedocs.io/waits.html#explicit-waits) before returning the response to the spider.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url,
    self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid'))
)
```
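Selenium's explicit waits call the condition with the driver until it returns a truthy value, so a plain callable can stand in for the built-in `expected_conditions`. The sketch below assumes the middleware hands `wait_until` to Selenium's `WebDriverWait(...).until(...)` unchanged, which this README does not state explicitly. `title_contains_any` is a hypothetical helper, not part of either library.

```python
def title_contains_any(*words):
    """Return a wait condition that is truthy once the page title contains any of `words`."""
    def condition(driver):
        # driver.title is the current page title; treat a missing title as empty.
        title = driver.title or ''
        return any(word in title for word in words)
    return condition


# Possible use inside a spider callback (assumes the behaviour described above):
# yield SeleniumRequest(
#     url,
#     self.parse_result,
#     wait_time=10,
#     wait_until=title_contains_any('Results', 'Search'),
# )
```

A custom condition like this is useful when no single element reliably signals that the page has finished rendering.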
#### `screenshot`
When used, selenium will take a screenshot of the page, and the binary data of the captured .png will be added to the response `meta`:
```python
yield SeleniumRequest(
    url,
    self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    # response.meta['screenshot'] holds the raw PNG bytes
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```
```
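When many pages are captured, it can help to check the bytes before writing them and to create the target directory on demand. This is a minimal sketch under those assumptions: `save_screenshot` and `PNG_SIGNATURE` are hypothetical names, and the only fact taken from the README is that `response.meta['screenshot']` contains PNG binary data.

```python
from pathlib import Path

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # the 8 bytes that start every PNG file


def save_screenshot(png_bytes, path):
    """Write screenshot bytes to `path`; raise ValueError if they are not PNG data."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError('screenshot data does not start with the PNG signature')
    target = Path(path)
    # Create missing parent directories so callers can pass nested paths.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(png_bytes)
    return target


# Possible use inside a callback:
# save_screenshot(response.meta['screenshot'], 'screenshots/page.png')
```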