scrapy-puppeteer·PyPI

Scrapy with puppeteer

Project description

# Scrapy with Puppeteer
[![PyPI](https://img.shields.io/pypi/v/scrapy-puppeteer.svg)](https://pypi.python.org/pypi/scrapy-puppeteer) [![Build Status](https://travis-ci.org/clemfromspace/scrapy-puppeteer.svg?branch=master)](https://travis-ci.org/clemfromspace/scrapy-puppeteer) [![Test Coverage](https://api.codeclimate.com/v1/badges/86603b736e684dd4f8c9/test_coverage)](https://codeclimate.com/github/clemfromspace/scrapy-puppeteer/test_coverage) [![Maintainability](https://api.codeclimate.com/v1/badges/86603b736e684dd4f8c9/maintainability)](https://codeclimate.com/github/clemfromspace/scrapy-puppeteer/maintainability)

Scrapy middleware for handling JavaScript-rendered pages with [puppeteer](https://github.com/GoogleChrome/puppeteer).

## ⚠ IN ACTIVE DEVELOPMENT - READ BEFORE USING ⚠

This is an attempt to make Scrapy and Puppeteer work together to handle JavaScript-rendered pages.
The design is strongly inspired by the Scrapy [Splash plugin](https://github.com/scrapy-plugins/scrapy-splash).

**Scrapy and Puppeteer**

The main issue when running Scrapy and Puppeteer together is that Scrapy uses [Twisted](https://twistedmatrix.com/trac/), whereas [Pyppeteer](https://miyakogi.github.io/pyppeteer/) (the Python port of puppeteer used here) uses [asyncio](https://docs.python.org/3/library/asyncio.html) for asynchronous operations.

Luckily, Twisted's [asyncio reactor](https://twistedmatrix.com/documents/18.4.0/api/twisted.internet.asyncioreactor.html) lets the two talk to each other.

That's why you **cannot** use the built-in `scrapy` command line (it installs the default reactor); you will have to use the `scrapyp` one provided by this module.

If you are running your spiders from a script, you will have to make sure you install the asyncio reactor before importing scrapy or doing anything else:

```python
import asyncio
from twisted.internet import asyncioreactor

asyncioreactor.install(asyncio.get_event_loop())
```
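As a rough sketch of how a complete script might look, the helpers below install the reactor first and only then import Scrapy. The names `install_asyncio_reactor` and `run_spider` are hypothetical and not part of this package. The sketch also swaps `asyncio.get_event_loop()` for an explicitly created loop, on the assumption that recent Python versions prefer this. `CrawlerProcess` is Scrapy's standard way to run a spider from a script, and how it behaves together with this middleware has not been verified here.

```python
import asyncio


def install_asyncio_reactor():
    """Install Twisted's asyncio reactor on a fresh event loop."""
    # Imported lazily so that nothing installs the default reactor first.
    from twisted.internet import asyncioreactor

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    asyncioreactor.install(loop)


def run_spider(spider_cls, settings=None):
    """Install the reactor, then crawl `spider_cls` until it finishes."""
    install_asyncio_reactor()

    # Scrapy is imported only after the asyncio reactor is in place.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings or {})
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is done
```

The imports are deferred into the functions on purpose: the ordering requirement described above is then enforced by the code rather than by the layout of the file. The `settings` dictionary would need to carry the `DOWNLOADER_MIDDLEWARES` entry from the Configuration section.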

## Installation
```
$ pip install scrapy-puppeteer
```

## Configuration
Add the `PuppeteerMiddleware` to the downloader middlewares:
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_puppeteer.PuppeteerMiddleware': 800,
}
```

## Usage
Use `scrapy_puppeteer.PuppeteerRequest` instead of the Scrapy built-in `Request`, as shown below:
```python
from scrapy_puppeteer import PuppeteerRequest

def your_parse_method(self, response):
    # Your code...
    yield PuppeteerRequest('http://httpbin.org', self.parse_result)
```
The request will then be handled by puppeteer.

The `selector` response attribute works as usual (but contains the HTML processed by puppeteer).

```python
def parse_result(self, response):
    # text() selects the title's text node; @text would look for an attribute
    print(response.selector.xpath('//title/text()').get())
```

### Additional arguments
`scrapy_puppeteer.PuppeteerRequest` accepts three additional arguments:

#### `wait_until`

Passed to the [`waitUntil`](https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.goto) parameter of puppeteer.
Defaults to `domcontentloaded`.
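
Because the value is forwarded to the browser, a typo in the event name may only surface at navigation time. The helper below is a minimal sketch of checking the value up front. `check_wait_until` and `WAIT_UNTIL_EVENTS` are hypothetical names, not part of this package. The event names are the ones pyppeteer documents for `Page.goto`; whether this middleware also forwards a list of events unchanged is an assumption.

```python
# Event names documented for the `waitUntil` option of pyppeteer's Page.goto.
WAIT_UNTIL_EVENTS = frozenset(
    {"load", "domcontentloaded", "networkidle0", "networkidle2"}
)


def check_wait_until(value):
    """Return `value` unchanged if it only names known events, else raise."""
    # Accept either a single event name or an iterable of names.
    events = [value] if isinstance(value, str) else list(value)
    unknown = [event for event in events if event not in WAIT_UNTIL_EVENTS]
    if unknown:
        raise ValueError(f"unknown wait_until event(s): {unknown}")
    return value
```

It could then wrap the argument at the call site, for example `wait_until=check_wait_until('networkidle2')`.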

#### `wait_for`
Passed to puppeteer's [`waitFor`](https://miyakogi.github.io/pyppeteer/reference.html?highlight=image#pyppeteer.page.Page.waitFor) method.

#### `screenshot`
When set, puppeteer takes a [screenshot](https://miyakogi.github.io/pyppeteer/reference.html?highlight=headers#pyppeteer.page.Page.screenshot) of the page, and the binary data of the captured .png is added to the response `meta`:
```python
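# Inside a spider callback: request the page and ask for a screenshot
yield PuppeteerRequest(
    url,
    self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    # The PNG bytes are stored under the 'screenshot' key of response.meta
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
```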
```
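
A callback may also receive responses that carry no screenshot, for instance when the request was made without `screenshot=True`. The helper below is a small sketch of a more defensive write. `save_screenshot` is a hypothetical name, not part of this package, and it assumes only what is stated above: that `response.meta['screenshot']` holds the PNG bytes.

```python
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_screenshot(meta, path):
    """Write meta['screenshot'] to `path`; return True if a file was written."""
    data = meta.get("screenshot")
    # Skip missing entries and anything that does not look like a PNG.
    if not data or not data.startswith(PNG_SIGNATURE):
        return False
    Path(path).write_bytes(data)
    return True
```

Inside `parse_result`, this could be called as `save_screenshot(response.meta, 'image.png')`.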

Project details

Release history

This version: 0.0.1b0 (pre-release), Nov 30, 2018

Download files

Source Distribution

scrapy-puppeteer-0.0.1b0.tar.gz (5.1 kB)

Uploaded Nov 30, 2018 Source

Built Distribution

scrapy_puppeteer-0.0.1b0-py3-none-any.whl (6.5 kB)

Uploaded Nov 30, 2018 Python 3

File details

Details for the file scrapy-puppeteer-0.0.1b0.tar.gz.

File metadata

Download URL: scrapy-puppeteer-0.0.1b0.tar.gz
Upload date: Nov 30, 2018
Size: 5.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for scrapy-puppeteer-0.0.1b0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `5f6fd2b0868217506805cf9d4433d49732d721d8488b10293a88f4d7b07adfc8` |
| MD5 | `b5919eb16db8a2ec40c953753bde7da5` |
| BLAKE2b-256 | `dede735b5bb8e9884590a979b7f5c0f69fb034e80cc9c88713006a9b85615a5f` |

File details

Details for the file scrapy_puppeteer-0.0.1b0-py3-none-any.whl.

File metadata

Download URL: scrapy_puppeteer-0.0.1b0-py3-none-any.whl
Upload date: Nov 30, 2018
Size: 6.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for scrapy_puppeteer-0.0.1b0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `fb633b248444817f1f9c7f57f78bfb2c74752f9322986c82ebf45a36cc7d666a` |
| MD5 | `1c73d3eafefc7827d68c51ee19e5175f` |
| BLAKE2b-256 | `a38ed8aefc1d78710a56ddd9a10124dd81bee896b330be0a083f6fc26251c485` |
