
JavaScript support and proxy rotation for Scrapy with ScrapingBee


Scrapy ScrapingBee Middleware


Integrate Scrapy with the ScrapingBee API to use headless browsers for JavaScript rendering and proxy rotation. You need to create an account on scrapingbee.com to get an API key.

Installation

pip install scrapy-scrapingbee

Configuration

Add your SCRAPINGBEE_API_KEY and the ScrapingBeeMiddleware to your project settings.py. Don't forget to set CONCURRENT_REQUESTS according to your ScrapingBee plan.

SCRAPINGBEE_API_KEY = 'REPLACE-WITH-YOUR-API-KEY'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_scrapingbee.ScrapingBeeMiddleware': 725,
}

CONCURRENT_REQUESTS = 1

Usage

Inherit your spiders from ScrapingBeeSpider and yield a ScrapingBeeRequest.

ScrapingBeeSpider overrides the default logger to hide your API key in the Scrapy logs.
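The spider's actual implementation may differ, but the idea can be sketched with a standard logging.Filter that redacts the key before a record is emitted (the class name here is illustrative, not part of the library):

```python
import logging


class ApiKeyFilter(logging.Filter):
    """Illustrative sketch: redact an API key from log records.

    Attached to a logger, it rewrites each message so the raw key
    never appears in the Scrapy logs.
    """

    def __init__(self, api_key):
        super().__init__()
        self.api_key = api_key

    def filter(self, record):
        # Replace the key wherever it occurs in the log message.
        record.msg = str(record.msg).replace(self.api_key, 'API_KEY_HIDDEN')
        return True
```
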

Below you can see an example from the spider in httpbin.py.

from scrapy_scrapingbee import ScrapingBeeSpider, ScrapingBeeRequest

JS_SNIPPET = 'window.scrollTo(0, document.body.scrollHeight);'


class HttpbinSpider(ScrapingBeeSpider):
    name = 'httpbin'
    start_urls = [
        'https://httpbin.org',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield ScrapingBeeRequest(url, params={
                # 'render_js': False,
                # 'block_ads': True,
                # 'block_resources': False,
                # 'js_snippet': JS_SNIPPET,
                # 'premium_proxy': True,
                # 'country_code': 'fr',
                # 'return_page_source': True,
                # 'wait': 3000,
                # 'wait_for': '#swagger-ui',
            },
            headers={
                # 'Accept-Language': 'en-US',
            },
            cookies={
                # 'name_1': 'value_1',
            })

    def parse(self, response):
        ...

You can pass ScrapingBee parameters in the params argument of a ScrapingBeeRequest. Headers and cookies are passed like in a normal Scrapy Request. ScrapingBeeRequest converts all parameters, headers and cookies into the format expected by the ScrapingBee API.
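As an illustrative sketch (not the middleware's actual code), the conversion roughly amounts to prefixing forwarded header names with Spb- and joining cookies into a single name=value string, following the ScrapingBee API conventions; the helper names below are hypothetical:

```python
from urllib.parse import urlencode

SCRAPINGBEE_ENDPOINT = 'https://app.scrapingbee.com/api/v1/'


def format_headers(headers):
    # ScrapingBee forwards custom headers that are prefixed with 'Spb-'
    # when header forwarding is enabled.
    return {'Spb-' + name: value for name, value in headers.items()}


def format_cookies(cookies):
    # The API accepts cookies as a single 'name=value;name2=value2' string.
    return ';'.join(f'{k}={v}' for k, v in cookies.items())


def build_api_url(api_key, url, params=None, cookies=None):
    # Assemble the query string sent to the ScrapingBee endpoint.
    query = {'api_key': api_key, 'url': url}
    query.update(params or {})
    if cookies:
        query['cookies'] = format_cookies(cookies)
    return SCRAPINGBEE_ENDPOINT + '?' + urlencode(query)
```
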

Examples

To run the examples, clone this repository and add your API key to the example project's settings.py. Then, in your terminal, go to examples/httpbin/httpbin and run the example spider with:

scrapy crawl httpbin

