Scrapy with requests-html

Project description

scrapy-requests

Scrapy middleware to asynchronously handle JavaScript pages using requests-html.

requests-html uses pyppeteer to load JavaScript pages and handles user-agent specification for you. Using requests-html is intuitive and simple; check out its documentation.
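
For context, this is roughly what a standalone requests-html session looks like outside of Scrapy (the URL and the selector are placeholders for illustration):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com')  # placeholder URL
r.html.render()  # pyppeteer executes the page's JavaScript
titles = r.html.find('h1')  # query the rendered DOM with CSS selectors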

Requirements

  • Python >= 3.6
  • Scrapy >= 2.0
  • requests-html

Installation

 pip install scrapy-requests

Configuration

Make Twisted use the asyncio event loop and add RequestsMiddleware to the downloader middlewares.

settings.py

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800
}

Usage

Use scrapy_requests.HtmlRequest instead of scrapy.Request

from scrapy_requests import HtmlRequest

yield HtmlRequest(url=url, callback=self.parse)

Requests will be handled by requests-html, and each request will carry an additional meta variable page containing the HTML object.

def parse(self, response):
    page = response.request.meta['page']  # page object added by the middleware
    page.html.render()  # render the page's JavaScript with pyppeteer
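
Putting the pieces together, a minimal spider might look like the sketch below. It assumes the settings.py configuration shown above; the spider name, start URL, and CSS selector are placeholders.

import scrapy
from scrapy_requests import HtmlRequest


class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # placeholder spider name
    start_urls = ['http://quotes.toscrape.com/js/']  # placeholder JavaScript-rendered page

    def start_requests(self):
        for url in self.start_urls:
            # handled by RequestsMiddleware through requests-html
            yield HtmlRequest(url=url, callback=self.parse, render=True)

    def parse(self, response):
        page = response.request.meta['page']  # page object added by the middleware
        for quote in page.html.find('.quote .text'):  # placeholder selector
            yield {'text': quote.text}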

Additional settings

If you would like the page to be rendered by pyppeteer, pass True to the render keyword parameter.

yield HtmlRequest(url=url, callback=self.parse, render=True)

You can also use more specific functionality of the HTML object.

For example, you can set up a sleep timer before the page loads and execute a JavaScript snippet while loading it, like this:

script = "document.body.querySelector('.btn').click();"  # clicks the first element with class "btn"
yield HtmlRequest(url=url, callback=self.parse, render=True, options={'sleep': 2, 'script': script})

You can pass default settings to the requests-html session, such as headers, proxies, and auth settings, by specifying an additional variable in settings.py:

DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # whether to verify SSL certificates
    'mock_browser': True,  # mock a browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],  # browser launch arguments
}
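
Taken together, a settings.py configured for this middleware might look like the following sketch (the values mirror the examples above; the proxy address is a placeholder):

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_requests.RequestsMiddleware': 800,
}

DEFAULT_SCRAPY_REQUESTS_SETTINGS = {
    'verify': False,  # whether to verify SSL certificates
    'mock_browser': True,  # mock a browser user-agent
    'browser_args': ['--no-sandbox', '--proxy-server=x.x.x.x:xxxx'],  # placeholder proxy
}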

Notes

Please star this repo if you found it useful.

Feel free to contribute, report issues, and propose additional features.

License is MIT.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-requests-0.2.0.tar.gz (4.7 kB)

Uploaded Source

Built Distribution

scrapy_requests-0.2.0-py3-none-any.whl (6.2 kB)

Uploaded Python 3

File details

Details for the file scrapy-requests-0.2.0.tar.gz.

File metadata

  • Download URL: scrapy-requests-0.2.0.tar.gz
  • Upload date:
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.2

File hashes

Hashes for scrapy-requests-0.2.0.tar.gz

  • SHA256: b4176eaa11691fc3cf5a772d26cb26625f4ca2435ae05d0c5e21a7b50cae366e
  • MD5: 9765df2bb5d2abf5ae9f385da65652cc
  • BLAKE2b-256: 331076fc04b22ad261867080471d9d18ff45ff6acd41e051f71664b7deda68a1


File details

Details for the file scrapy_requests-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: scrapy_requests-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.1.3 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.9.2

File hashes

Hashes for scrapy_requests-0.2.0-py3-none-any.whl

  • SHA256: 3bc988c53589607a676a4a82dabe199d06c5b85562f4abbd83c913c0d7f95646
  • MD5: c60a51c2cbfa55c39bf6a93689e6cea6
  • BLAKE2b-256: b3a249cef9d2e348b12cede957f59004202eb8fe59df38309ce5afe1ca170cd8

