# Scrapy with selenium
[![PyPI](https://img.shields.io/pypi/v/scrapy-selenium.svg)](https://pypi.python.org/pypi/scrapy-selenium) [![Build Status](https://travis-ci.org/clemfromspace/scrapy-selenium.svg?branch=master)](https://travis-ci.org/clemfromspace/scrapy-selenium) [![Test Coverage](https://api.codeclimate.com/v1/badges/5c737098dc38a835ff96/test_coverage)](https://codeclimate.com/github/clemfromspace/scrapy-selenium/test_coverage) [![Maintainability](https://api.codeclimate.com/v1/badges/5c737098dc38a835ff96/maintainability)](https://codeclimate.com/github/clemfromspace/scrapy-selenium/maintainability)

Scrapy middleware to handle JavaScript pages using Selenium.

## Installation
```
$ pip install scrapy-selenium
```
You should use **python>=3.6**.
You will also need one of the Selenium [compatible browsers](http://www.seleniumhq.org/about/platforms.jsp).

## Configuration
1. In the Scrapy settings, set the browser to use, the path to the driver executable, and the arguments to pass to that executable:
```python
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
```

Optionally, set the path to the browser executable:
```python
SELENIUM_BROWSER_EXECUTABLE_PATH = which('firefox')
```

2. Add the `SeleniumMiddleware` to the downloader middlewares:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
```
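
For Chrome, the two configuration steps above might combine into the settings sketched below. This assumes that `chromedriver` is on the `PATH` and that `'chrome'` is an accepted value for `SELENIUM_DRIVER_NAME`; the `--headless` flag is taken from the comment in step 1.

```python
from shutil import which

# Assumed Chrome equivalents of the Firefox settings shown above.
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')  # None if chromedriver is not on PATH
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # Chrome takes two dashes, Firefox one

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```
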
## Usage
Use `scrapy_selenium.SeleniumRequest` instead of Scrapy's built-in `Request`, as shown below:
```python
from scrapy_selenium import SeleniumRequest

# Pass url and callback by keyword so they are not taken for wait_time / wait_until.
yield SeleniumRequest(url=url, callback=self.parse_result)
```
The request will be handled by Selenium, and its `meta` will have an additional key, `driver`, containing the Selenium driver that processed the request.
```python
def parse_result(self, response):
    print(response.request.meta['driver'].title)
```
For more information about the available driver methods and attributes, refer to the [Selenium Python documentation](http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webdriver).

The `selector` response attribute works as usual (but contains the HTML processed by the Selenium driver).
```python
def parse_result(self, response):
    # text() selects the title's text node; '@text' would look for an attribute named "text"
    print(response.selector.xpath('//title/text()'))
```

### Additional arguments
`scrapy_selenium.SeleniumRequest` accepts four additional arguments:

#### `wait_time` / `wait_until`

When used, Selenium will perform an [explicit wait](http://selenium-python.readthedocs.io/waits.html#explicit-waits) before returning the response to the spider.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid'))
)
```
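
`wait_until` does not have to come from `expected_conditions`. Selenium's `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value once the wait should end. Assuming the middleware hands `wait_until` to `WebDriverWait.until` unchanged, a plain function should also work. The sketch below uses a hypothetical `document_ready` condition that is not part of scrapy-selenium:

```python
def document_ready(driver):
    """Hypothetical wait condition: true once the browser has finished loading the page."""
    # execute_script is a standard WebDriver method; document.readyState is
    # 'loading', 'interactive' or 'complete'.
    return driver.execute_script('return document.readyState') == 'complete'


# Intended use inside a spider (assumption: wait_until is passed straight to
# WebDriverWait.until):
#
#     yield SeleniumRequest(
#         url=url,
#         callback=self.parse_result,
#         wait_time=10,
#         wait_until=document_ready,
#     )
```

A condition like this only checks that the document itself has loaded; content injected later by JavaScript may still need an element-based condition such as the one above.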

#### `screenshot`
When used, Selenium will take a screenshot of the page, and the binary data of the captured .png will be added to the response `meta`:
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```
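
When saving many screenshots, it may help to check the data before writing it. The sketch below is one possible approach, not part of scrapy-selenium: the hypothetical `save_screenshot` helper verifies the standard eight-byte PNG signature and creates the target directory if needed.

```python
from pathlib import Path

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first eight bytes of every PNG file


def save_screenshot(data, path):
    """Write screenshot bytes to `path`, refusing anything that is not PNG data."""
    if not isinstance(data, (bytes, bytearray)) or not data.startswith(PNG_SIGNATURE):
        raise ValueError('screenshot data is missing or is not a PNG image')
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    target.write_bytes(data)
    return target
```

In a callback this could be called as `save_screenshot(response.meta.get('screenshot'), 'shots/page.png')`, where the `shots/page.png` path is only an example.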

#### `script`
When used, Selenium will execute custom JavaScript code.
```python
yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    script='window.scrollTo(0, document.body.scrollHeight);',
)
```
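
Because `script` is a plain string of JavaScript, values coming from Python have to be quoted before they are embedded in it. The sketch below shows one way to do that with `json.dumps`, which produces a valid JavaScript string literal for ordinary selector text. The `scroll_into_view_script` helper is hypothetical and not part of scrapy-selenium.

```python
import json


def scroll_into_view_script(css_selector):
    """Build JavaScript that scrolls the first element matching `css_selector` into view."""
    quoted = json.dumps(css_selector)  # adds the quotes and escapes special characters
    return 'var el = document.querySelector(%s); if (el) { el.scrollIntoView(); }' % quoted
```

The returned string can then be passed as `script=scroll_into_view_script('#footer')`, where `#footer` is only an example selector.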

