
Scrapy with requests-html


scrapy-requests


Scrapy middleware to asynchronously handle JavaScript pages using requests-html.

requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. Its API is small and straightforward; see the requests-html documentation for details.

Requirements

  • Python >= 3.6
  • Scrapy >= 2.0
  • requests-html

Installation

 pip install scrapy-requests

Configuration

Make Twisted use the asyncio event loop, and add RequestsMiddleware to the downloader middlewares.

settings.py

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800
}

Usage

Use scrapy_requests.HtmlRequest instead of scrapy.Request

from scrapy_requests import HtmlRequest

yield HtmlRequest(url=url, callback=self.parse)

The requests will be handled by requests-html, and each request gains an additional meta variable, page, containing the HTML object.

def parse(self, response):
    page = response.request.meta['page']
    page.html.render()
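
A callback usually goes on to read something from the page object. The sketch below is one way to do that. It assumes the page object exposes requests-html's `html.find` and `html.absolute_links`, as the `page.html` access above suggests; `summarize_page` is a hypothetical helper name, not part of scrapy-requests.

```python
def summarize_page(response):
    """Return the title and absolute links of the requests-html page
    attached to a response (hypothetical helper, not part of scrapy-requests)."""
    # scrapy-requests stores the requests-html object under meta['page'].
    page = response.request.meta['page']

    # find(..., first=True) returns a single element, or None when nothing matches.
    title = page.html.find('title', first=True)

    return {
        'title': title.text if title is not None else None,
        # absolute_links is a set, so sort it for stable output.
        'links': sorted(page.html.absolute_links),
    }
```

Inside a spider, parse could then end with `yield summarize_page(response)`.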

Additional settings

To have the page rendered by pyppeteer, pass True to the render keyword parameter.

yield HtmlRequest(url=url, callback=self.parse, render=True)

You can choose more specific functionality for the HTML object.

For example, you can set a sleep timer for the render and execute a JavaScript snippet when the page loads:

script = "document.body.querySelector('.btn').click();"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
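
When the selector comes from a variable, building the script string by hand invites quoting mistakes. Below is a minimal sketch of one way around that, assuming the options dictionary accepts the `sleep` and `script` keys shown above. `click_script` and `render_options` are hypothetical helper names.

```python
import json


def click_script(selector):
    """Build a JavaScript snippet that clicks the first element matching selector."""
    # json.dumps yields a valid JavaScript string literal, so quotes in the
    # selector cannot break out of the script.
    return f"document.body.querySelector({json.dumps(selector)}).click();"


def render_options(selector, sleep=2):
    """Return an options dict with the same keys as the example above."""
    return {'sleep': sleep, 'script': click_script(selector)}
```

The result can be passed to HtmlRequest as `options=render_options('.btn')`.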

You can pass default settings to the requests-html session, such as headers, proxies, and auth settings, by specifying an additional variable in settings.py:

DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # Whether to verify SSL certificates (False disables verification)
    'mock_browser': True,  # Mock a browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],
}
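
The proxy address in the example above is a placeholder. One way to keep the real value out of settings.py is to read it from the environment. This is a minimal sketch in which the `SCRAPY_PROXY` variable name and the `build_requests_settings` helper are assumptions for illustration only:

```python
import os


def build_requests_settings(proxy=None, verify=True):
    """Assemble the DEFAULT_SCRAPY_REQUESTS_SETTINGS dict (hypothetical helper)."""
    browser_args = ['--no-sandbox']
    if proxy:
        # Only add the proxy flag when a proxy is actually configured.
        browser_args.append(f'--proxy-server={proxy}')
    return {
        'verify': verify,
        'mock_browser': True,
        'browser_args': browser_args,
    }


# SCRAPY_PROXY is an assumed environment variable name, e.g. "host:port".
DEFAULT_SCRAPY_REQUESTS_SETTINGS = build_requests_settings(
    proxy=os.environ.get('SCRAPY_PROXY'),
)
```

With the variable unset, the proxy flag is simply omitted and the remaining defaults still apply.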

Notes

Please star this repo if you found it useful.

Feel free to contribute and propose issues & additional features.

License is MIT.
