WebScrapingApi Python Scrapy SDK

WebScrapingAPI Scrapy SDK

WebScrapingAPI is an API for scraping websites through rotating proxies, which helps prevent bans. This SDK lets you create a Scrapy spider that sends its requests through our API.

API Key

To use the API and the SDK you will need an API key. You can get one by registering at WebScrapingAPI.

Installation

Run the following command in the main folder of your project:

pip install webscrapingapi-scrapy-sdk

Usage

To use our API with Scrapy, first install Scrapy and the SDK, then create a new project by running these commands:

pip install scrapy
pip install webscrapingapi-scrapy-sdk
scrapy startproject myproject
cd myproject

Now that Scrapy has created the project, the first step is to update its settings by adding the following lines at the end of myproject/myproject/settings.py:

WEBSCRAPINGAPI_API_KEY = 'YOUR_API_KEY'

DOWNLOADER_MIDDLEWARES = {
    'webscrapingapi_scrapy_sdk.WebScrapingApiMiddleware': 543,
}

CONCURRENT_REQUESTS = 1
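
Hard-coding the key in settings.py makes it easy to commit by accident. As a minimal sketch, the key could be read from an environment variable instead. The variable name WEBSCRAPINGAPI_API_KEY and the helper load_api_key are assumptions made for illustration; they are not part of the SDK.

```python
import os


def load_api_key(env=os.environ, var_name="WEBSCRAPINGAPI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message.

    Both the helper and the environment variable name are hypothetical.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key")
    return key


# In settings.py this would replace the hard-coded value:
# WEBSCRAPINGAPI_API_KEY = load_api_key()
```

Since a Scrapy settings file is an ordinary Python module, a helper like this can be defined or imported at the top of settings.py.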

The next step is to create the spider. We will name the file example.py and place it in myproject/myproject/spiders/

The source code for the spider is:

from webscrapingapi_scrapy_sdk import WebScrapingApiSpider, WebScrapingApiRequest

class ExampleSpider(WebScrapingApiSpider):
    name = 'example'
    parseIndex = 0

    def start_requests(self):
        start_urls = [
            'https://httpbin.org',
            'http://httpbin.org/ip',
        ]
        for url in start_urls:
            yield WebScrapingApiRequest(url, params={
                # API Parameters
                # Set to 0 (off, default) or 1 (on) to control whether JavaScript is rendered on the target web page. Rendering is done in a browser.
                'render_js': 1,
                # Set to datacenter (default) or residential, depending on which proxy type you want to use for your scraping request. Note that a single residential proxy API request is counted as 25 API requests.
                'proxy_type': 'datacenter',
                # Specify the 2-letter code of the country you would like to use as a proxy geolocation for your scraping API request. Supported countries differ by proxy type; please refer to the Proxy Locations section for details.
                'country': 'us',
                # Set this parameter to use the same proxy address for your requests.
                'session': 1,
                # Specify the maximum timeout in milliseconds for your scraping API request. To force a timeout, specify a number such as 1000. This aborts the request after 1000 ms and returns whatever HTML response was obtained up to that point.
                'timeout': 10000,
                # Set to desktop (default), mobile, or tablet, depending on which device type you want to use for your scraping request.
                'device': 'desktop',
                # Specify the option you would like to use as the wait condition for your scraping API request. Can only be used when render_js=1 is set.
                'wait_until': 'domcontentloaded',
                # Some websites use JavaScript frameworks that may need a few extra seconds to load their content. This parameter specifies the time in milliseconds to wait for the website. Recommended values are in the interval 5000-10000.
                'wait_for': 0,
            }, headers={
                # API Headers
                'authorization': 'bearer test',
                # Specify custom cookies to be passed to the request.
                'cookie': 'test_cookie=abc; cookie_2=def'
            })

    def parse(self, response):
        self.parseIndex += 1
        filename = f'page-{self.parseIndex}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
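
The comments in the spider describe a few constraints on the parameters: render_js is 0 or 1, proxy_type is datacenter or residential, device is desktop, mobile, or tablet, and wait_until only applies when render_js=1. A minimal sketch of checking those constraints before a request is built could look like the following. The helper build_api_params is hypothetical and only encodes the rules stated in the comments; it is not part of the SDK, and the API itself may accept or reject other values.

```python
ALLOWED_PROXY_TYPES = {"datacenter", "residential"}
ALLOWED_DEVICES = {"desktop", "mobile", "tablet"}


def build_api_params(render_js=0, proxy_type="datacenter", device="desktop",
                     country=None, timeout=None, wait_until=None, wait_for=None):
    """Build a params dict, checking the rules described in the spider comments.

    Hypothetical helper for illustration only.
    """
    if render_js not in (0, 1):
        raise ValueError("render_js must be 0 or 1")
    if proxy_type not in ALLOWED_PROXY_TYPES:
        raise ValueError(f"proxy_type must be one of {sorted(ALLOWED_PROXY_TYPES)}")
    if device not in ALLOWED_DEVICES:
        raise ValueError(f"device must be one of {sorted(ALLOWED_DEVICES)}")
    if wait_until is not None and render_js != 1:
        raise ValueError("wait_until can only be used with render_js=1")

    params = {"render_js": render_js, "proxy_type": proxy_type, "device": device}
    optional = {"country": country, "timeout": timeout,
                "wait_until": wait_until, "wait_for": wait_for}
    # Only send optional parameters that were explicitly set.
    params.update({name: value for name, value in optional.items() if value is not None})
    return params
```

In the spider, the result would be passed as the params argument, for example WebScrapingApiRequest(url, params=build_api_params(render_js=1, wait_until='domcontentloaded')).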

To better understand the WebScrapingAPI parameters, please read our documentation.
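
The proxy_type comment in the spider notes that a single residential proxy request is counted as 25 API requests. As a rough sketch of what that means for usage, the arithmetic can be written out as below. The function estimate_api_requests is hypothetical, and it only reflects that one statement; any other factors that affect how requests are counted are not covered here.

```python
RESIDENTIAL_MULTIPLIER = 25  # from the proxy_type comment in the spider


def estimate_api_requests(datacenter_requests, residential_requests):
    """Estimate how many API requests a crawl is counted as.

    Hypothetical helper: datacenter requests count once each,
    residential requests count 25 times each.
    """
    if datacenter_requests < 0 or residential_requests < 0:
        raise ValueError("request counts must not be negative")
    return datacenter_requests + RESIDENTIAL_MULTIPLIER * residential_requests


# 10 datacenter requests plus 2 residential requests: 10 + 2 * 25
print(estimate_api_requests(10, 2))  # prints 60
```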

Now that we have the spider, the only thing left is to run it with the following command:

scrapy crawl example

The spider should create two HTML files in the project folder, containing the HTML source of https://httpbin.org and http://httpbin.org/ip.
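
After the crawl finishes, a quick way to confirm the result is to list the saved files. The sketch below assumes the page-N.html naming used by the parse method of the example spider; list_saved_pages is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path


def list_saved_pages(folder="."):
    """Return (file name, size in bytes) for each page-*.html file in folder."""
    return sorted((path.name, path.stat().st_size)
                  for path in Path(folder).glob("page-*.html"))


if __name__ == "__main__":
    for name, size in list_saved_pages():
        print(f"{name}: {size} bytes")
```

An empty list, or files of zero bytes, would suggest that the requests did not come back as expected and that the Scrapy log is worth checking.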

For any questions or issues, please contact us via the contact page.

Download files

Source Distribution

webscrapingapi_scrapy_sdk-1.0.7.tar.gz (5.2 kB)

Uploaded Source

Built Distribution

webscrapingapi_scrapy_sdk-1.0.7-py3-none-any.whl (6.1 kB)

Uploaded Python 3

File details

Details for the file webscrapingapi_scrapy_sdk-1.0.7.tar.gz.

File metadata

  • Download URL: webscrapingapi_scrapy_sdk-1.0.7.tar.gz
  • Size: 5.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for webscrapingapi_scrapy_sdk-1.0.7.tar.gz

Algorithm    Hash digest
SHA256       3b0218fbed521448f4db01e7a1ee2a4d1813cd70e45ae641cf5d67f7fd29baaa
MD5          231ccaed18e6510252fd99fe7522bc0a
BLAKE2b-256  f23969e56a119cd6b21954b4e9fe0b16fc738bd3c34d4b279e132dc3d97a41ce
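
A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. The following is a minimal sketch; the helper names sha256_of_file and matches_expected are made up for this example, and the expected digest is copied from the source distribution entry above.

```python
import hashlib

# SHA256 digest listed above for webscrapingapi_scrapy_sdk-1.0.7.tar.gz
EXPECTED_SDIST_SHA256 = "3b0218fbed521448f4db01e7a1ee2a4d1813cd70e45ae641cf5d67f7fd29baaa"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_expected('webscrapingapi_scrapy_sdk-1.0.7.tar.gz') would be called from the folder where the archive was saved.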

File details

Details for the file webscrapingapi_scrapy_sdk-1.0.7-py3-none-any.whl.

File metadata

  • Download URL: webscrapingapi_scrapy_sdk-1.0.7-py3-none-any.whl
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6

File hashes

Hashes for webscrapingapi_scrapy_sdk-1.0.7-py3-none-any.whl

Algorithm    Hash digest
SHA256       217ee4b4098f0b3ed84d4abac0b884331c03459ff61f6e9b030452cf3683d71e
MD5          226d8c7daf60ffe8a3a1aef8cd55f022
BLAKE2b-256  1948ac79b9bbb4d1c67724cd2e80a5aab5b386edef659f1df2eb84dc5ed83ba0
