Scrapy with requests-html
scrapy-requests
Scrapy middleware to asynchronously handle JavaScript pages using requests-html.
requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. Its API is simple and intuitive; see the requests-html documentation for details.
Requirements
- Python >= 3.6
- Scrapy >= 2.0
- requests-html
Installation
pip install scrapy-requests
Configuration
Make Twisted use the asyncio event loop, and add RequestsMiddleware to the downloader middlewares:
settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800,
}
Usage
Use scrapy_requests.HtmlRequest instead of scrapy.Request:
from scrapy_requests import HtmlRequest
yield HtmlRequest(url=url, callback=self.parse)
Requests are handled by requests-html, and each request gets an additional meta variable, page, which holds the requests-html object for the loaded page (its HTML is available as page.html).
def parse(self, response):
    page = response.request.meta['page']
    page.html.render()
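To illustrate what a callback might do with the page object, here is a minimal sketch of a helper that a parse method could call. The helper name summarize_page and the item fields are hypothetical; find(..., first=True), .text and absolute_links are taken from the requests-html API, and the sketch assumes page.html behaves like a regular requests-html HTML object.

```python
def summarize_page(response):
    """Build a small item from the requests-html object stored in request.meta."""
    # 'page' is the meta key described above; .get() avoids a KeyError when
    # the request did not pass through RequestsMiddleware.
    page = response.request.meta.get('page')
    if page is None:
        return None

    # In requests-html, find(..., first=True) returns one element or None.
    title = page.html.find('title', first=True)
    return {
        'url': response.url,
        'title': title.text if title is not None else None,
        # absolute_links is a set, so sort it for a stable result.
        'links': sorted(page.html.absolute_links),
    }
```

Inside a spider, parse could then simply yield summarize_page(response) when the result is not None.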
Additional settings
If you would like the page to be rendered by pyppeteer, pass True to the render keyword parameter.
yield HtmlRequest(url=url, callback=self.parse, render=True)
You can also request more specific behavior from the HTML object.
For example, you can set a sleep timer before the page is loaded and run a JavaScript snippet when it loads:
script = "document.body.querySelector('.btn').click();"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})
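When the selector to click varies between requests, building the options dict in a helper keeps the quoting of the JavaScript string in one place. The following is a minimal sketch; click_script and render_options are hypothetical helper names, and it assumes that only the sleep and script keys shown above are needed.

```python
import json


def click_script(selector):
    """Return a JavaScript snippet that clicks the first element matching selector."""
    # json.dumps produces a double-quoted, escaped string literal, which is
    # also a valid JavaScript string literal.
    return f"document.body.querySelector({json.dumps(selector)}).click();"


def render_options(selector, sleep=2):
    """Build the options dict for HtmlRequest, using the keys shown above."""
    return {'sleep': sleep, 'script': click_script(selector)}
```

A request could then be created with HtmlRequest(url=url, callback=self.parse, render=True, options=render_options('.btn')).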
You can pass default settings to the requests-html session, such as headers, proxies and auth settings.
Do this by specifying an additional variable in settings.py:
DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # Verifying SSL certificates
    'mock_browser': True,  # Mock browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],
}
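Since settings.py is ordinary Python, the proxy address does not have to be hard-coded. The sketch below builds the same dictionary from an environment variable; the helper requests_settings and the variable name SCRAPER_PROXY are hypothetical, and only the three keys shown above are used.

```python
import os


def requests_settings(proxy=None, verify=True):
    """Build a DEFAULT_SCRAPY_REQUESTS_SETTINGS dict with an optional proxy."""
    browser_args = ['--no-sandbox']
    if proxy:
        # Same Chromium flag as in the example above, e.g. '10.0.0.1:8080'.
        browser_args.append(f'--proxy-server={proxy}')
    return {
        'verify': verify,          # Verifying SSL certificates
        'mock_browser': True,      # Mock browser user-agent
        'browser_args': browser_args,
    }


# SCRAPER_PROXY is an assumed variable name; when it is unset, no proxy flag is added.
DEFAULT_SCRAPY_REQUESTS_SETTINGS = requests_settings(
    proxy=os.environ.get('SCRAPER_PROXY'),
)
```

Scrapy only reads upper-case names from the settings module, so the lower-case helper does not become a setting itself.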
Notes
Please star this repo if you found it useful.
Feel free to contribute and propose issues & additional features.
License is MIT.