Scrapy with requests-html
scrapy-requests
Scrapy middleware to asynchronously handle JavaScript pages using requests-html.
requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. Its API is small and straightforward; see the requests-html documentation for details.
Requirements
- Python >= 3.6
- Scrapy >= 2.0
- requests-html
Installation
pip install scrapy-requests
Configuration
Make Twisted use the asyncio event loop, and add RequestsMiddleware to the downloader middlewares:
settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800,
}
Usage
Use scrapy_requests.HtmlRequest instead of scrapy.Request
from scrapy_requests import HtmlRequest
yield HtmlRequest(url=url, callback=self.parse)
The requests will be handled by requests-html, and each request gets an additional meta variable, page, containing the HTML object.
def parse(self, response):
    # 'page' is added to the request meta by the middleware
    page = response.request.meta['page']
    page.html.render()
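As a rough sketch of what a callback might do with the page object: requests-html exposes the links found on a page through `html.absolute_links`, so a small helper can read them from the request meta. The helper name `extract_links` and the stand-in objects below are hypothetical and only serve to illustrate the access pattern.

```python
from types import SimpleNamespace


def extract_links(response):
    """Return the absolute links on the page, sorted for stable output.

    Assumes response.request.meta['page'] is the object the middleware
    attaches, and that it exposes a requests-html style `html` attribute.
    """
    page = response.request.meta['page']
    # absolute_links is a set of URLs in requests-html, so sort it.
    return sorted(page.html.absolute_links)


# Stand-in objects so the helper can be tried without running a crawl.
fake_page = SimpleNamespace(
    html=SimpleNamespace(
        absolute_links={"https://example.com/b", "https://example.com/a"}
    )
)
fake_response = SimpleNamespace(request=SimpleNamespace(meta={"page": fake_page}))
print(extract_links(fake_response))
```

Inside a spider, the same helper could be called from parse with the real response.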
Additional settings
If you would like the page to be rendered by pyppeteer, pass True to the render keyword parameter.
yield HtmlRequest(url=url, callback=self.parse, render=True)
You can choose more specific functionality for the HTML object.
For example, you can set a sleep timer and a JavaScript snippet to execute when the page loads, like this:
script = "document.body.querySelector('.btn').click();"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
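If several requests need the same kind of options, the dict can be built by a small helper. The following is a minimal sketch; `build_render_options` is a hypothetical name, not part of scrapy-requests, and it only assembles the same 'sleep' and 'script' keys used above.

```python
import json


def build_render_options(click_selector, sleep=2):
    """Build an options dict with a sleep timer and a click script."""
    # json.dumps yields a safely quoted JavaScript string literal,
    # so selectors containing quotes do not break the script.
    quoted = json.dumps(click_selector)
    script = f"document.body.querySelector({quoted}).click();"
    return {"sleep": sleep, "script": script}


options = build_render_options(".btn", sleep=2)
print(options["script"])
```

The resulting dict could then be passed as the options argument of HtmlRequest together with render=True.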
You can pass default settings to the requests-html session, such as headers, proxies, and auth settings.
Do this by specifying an additional variable in settings.py:
DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # Whether to verify SSL certificates (False disables verification)
    'mock_browser': True,  # Mock browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],
}
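Because settings.py is ordinary Python, the dict can also be assembled programmatically, for example to make the proxy optional. This is only a sketch: `make_session_settings` is a hypothetical helper, and the proxy address 127.0.0.1:8080 is a placeholder, not a value from this project.

```python
def make_session_settings(proxy=None, verify=True, mock_browser=True):
    """Build a dict suitable for DEFAULT_SCRAPY_REQUESTS_SETTINGS."""
    browser_args = ["--no-sandbox"]
    if proxy:
        # Chromium flag that routes browser traffic through the proxy.
        browser_args.append(f"--proxy-server={proxy}")
    return {
        "verify": verify,
        "mock_browser": mock_browser,
        "browser_args": browser_args,
    }


# Placeholder proxy address, for illustration only.
DEFAULT_SCRAPY_REQUESTS_SETTINGS = make_session_settings(
    proxy="127.0.0.1:8080", verify=False
)
print(DEFAULT_SCRAPY_REQUESTS_SETTINGS)
```

Leaving proxy unset keeps only the '--no-sandbox' argument.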
Notes
Please star this repo if you found it useful.
Feel free to contribute and propose issues & additional features.
License is MIT.
Hashes for scrapy_requests-0.2.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3bc988c53589607a676a4a82dabe199d06c5b85562f4abbd83c913c0d7f95646
MD5 | c60a51c2cbfa55c39bf6a93689e6cea6
BLAKE2b-256 | b3a249cef9d2e348b12cede957f59004202eb8fe59df38309ce5afe1ca170cd8