Scrapy with requests-html
Project description
scrapy-requests
Scrapy middleware to handle javascript pages using requests-html.
requests-html uses pyppeteer to load javascript pages, and handles user-agent specification for you. Using requests-html is very intuitive and simple. Check out their documentation.
Requirements
- Python >= 3.6
- Scrapy
- requests-html
Installation
pip install scrapy-requests
Configuration
Add RequestsMiddleware to the downloader middleware
settings.py
DOWNLOADER_MIDDLEWARES = {
'scrapy_requests.RequestsMiddleware': 800
}
Usage
Use scrapy_requests.HtmlRequest instead of scrapy.Request
from scrapy_requests import HtmlRequest
yield HtmlRequest(url=url, callback=self.parse)
The requests will be handled by requests_html, and the request will add an additional meta varialble page
containing the HTML object.
def parse(self, response):
page = response.request.meta['page']
page.html.render()
Additional settings
If you would like the page to rendered by pyppeteer - pass True
to the render
key paramater.
yield HtmlRequest(url=url, callback=self.parse, render=True)
You could choose a more speific requirement for the HTML object.
For example - You could set up a sleep timer before loading the page, and js script execution when loading the page - doing it this way:
script = "document.body.querySelector('.btn').click();"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
Notes
Please star this repo if you found it useful.
Feel free to contribute and propose issues & additional features.
License is MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.