WebScrapingApi Python Scrapy SDK

WebScrapingAPI Scrapy SDK

WebScrapingApi is an API for scraping websites through rotating proxies, which helps prevent bans. This SDK for Scrapy lets you create a Scrapy spider that is integrated with our API.

API Key

To use the API and the SDK, you need an API key. You can get one by registering at WebScrapingApi.

Installation

Run the following command in the main folder of your project:

pip install webscrapingapi-scrapy-sdk

Usage

To use our API with Scrapy, first install Scrapy and the SDK, then create a new project by running these commands:

pip install scrapy
pip install webscrapingapi-scrapy-sdk
scrapy startproject myproject
cd myproject

Once Scrapy has created the project, update the project's settings by adding the following lines at the end of myproject/myproject/settings.py:

WEBSCRAPINGAPI_API_KEY = 'YOUR_API_KEY'

DOWNLOADER_MIDDLEWARES = {
    'webscrapingapi_scrapy_sdk.WebScrapingApiMiddleware': 543,
}

CONCURRENT_REQUESTS = 1
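Hard-coding the key in settings.py makes it easy to commit by accident. One possible variation, sketched below, reads the key from an environment variable and falls back to the placeholder. The helper load_api_key and the environment variable name are assumptions made for this example; the SDK itself is only known to read the WEBSCRAPINGAPI_API_KEY setting shown above.

```python
import os


def load_api_key(environ=os.environ, default='YOUR_API_KEY'):
    """Return the API key from the environment, or the placeholder default."""
    # The environment variable name is an assumption for this sketch;
    # the SDK only sees the resulting Scrapy setting below.
    return environ.get('WEBSCRAPINGAPI_API_KEY', default)


# Same settings as above, with the key taken from the environment when present.
WEBSCRAPINGAPI_API_KEY = load_api_key()

DOWNLOADER_MIDDLEWARES = {
    'webscrapingapi_scrapy_sdk.WebScrapingApiMiddleware': 543,
}

CONCURRENT_REQUESTS = 1
```

With this variation, the key would be exported in the shell before running the spider instead of being stored in the file.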

The next step is to create the spider. Name the file example.py and place it in myproject/myproject/spiders/.

The source code for the spider is:

from webscrapingapi_scrapy_sdk import WebScrapingApiSpider, WebScrapingApiRequest

import urllib.parse as urlparse
from urllib.parse import parse_qs

class ExampleSpider(WebScrapingApiSpider):
    name = 'example'

    def start_requests(self):
        start_urls = [
            'https://httpbin.org',
            'http://httpbin.org/ip',
        ]
        for url in start_urls:
            yield WebScrapingApiRequest(url, params={
                # API Parameters
                # Set to 0 (off, default) or 1 (on) to control whether JavaScript is rendered on the target web page. JavaScript rendering is done using a browser.
                'render_js': 1,
                # Set to datacenter (default) or residential, depending on which proxy type you want to use for your scraping request. Please note that a single residential proxy API request is counted as 25 API requests.
                'proxy_type': 'datacenter',
                # Specify the 2-letter code of the country you would like to use as a proxy geolocation for your scraping API request. Supported countries differ by proxy type; please refer to the Proxy Locations section for details.
                'country': 'us',
                # Set depending on whether or not to use the same proxy address for your requests.
                'session': 1,
                # Specify the maximum timeout in milliseconds you would like to use for your scraping API request. In order to force a timeout, you can specify a number such as 1000. This will abort the request after 1000ms and return whatever HTML response was obtained until this point in time.
                'timeout': 10000,
                # Set to desktop (default), mobile, or tablet, depending on the device type you want to use for your scraping request.
                'device': 'desktop',
                # Specify the option you would like to use as a condition for your scraping API request. Can only be used when the parameter render_js=1 is activated.
                'wait_until': 'domcontentloaded',
                # Some websites use JavaScript frameworks that may need a few extra seconds to load their content. This parameter specifies the time in milliseconds to wait for the website. Recommended values are in the interval 5000-10000.
                'wait_for': 0,
            }, headers={
                # API Headers
                'authorization': 'bearer test',
                # Specify custom cookies to be passed to the request.
                'cookie': 'test_cookie=abc; cookie_2=def'
            })

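    def parse(self, response):
        # The target page's address is read from the 'url' query parameter of response.url.
        parsed_url = urlparse.urlparse(response.url)
        # Drop the scheme and the empty segment after it, keeping the host and path parts.
        page = parse_qs(parsed_url.query)['url'][0].split("/")[2:]
        page = "-".join(page)
        filename = f'page-{page}.html'
        # Write the raw response body to a file named after the target page.
        with open(filename, 'wb') as f:
            f.write(response.body)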

To better understand the WebScrapingAPI parameters, please read our documentation.
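The comments in the spider describe which values some of the parameters accept. A small pre-flight check, sketched below, can catch typos before a request is sent. The helper check_params is hypothetical and not part of the SDK, and the allowed values are taken only from the comments in the spider above, so the API documentation remains the authoritative list.

```python
# Allowed values as described in the spider's comments (not an exhaustive list).
ALLOWED = {
    'render_js': {0, 1},
    'proxy_type': {'datacenter', 'residential'},
    'device': {'desktop', 'mobile', 'tablet'},
}


def check_params(params):
    """Return a list of problems found in params (empty if none were found)."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name in params and params[name] not in allowed:
            options = sorted(allowed, key=str)
            problems.append(f'{name}={params[name]!r} is not one of {options}')
    # wait_until is described as usable only together with render_js=1.
    if 'wait_until' in params and params.get('render_js', 0) != 1:
        problems.append('wait_until requires render_js=1')
    # timeout and wait_for are durations in milliseconds.
    for name in ('timeout', 'wait_for'):
        value = params.get(name)
        if value is not None and (not isinstance(value, int) or value < 0):
            problems.append(f'{name} must be a non-negative integer (milliseconds)')
    return problems
```

Calling check_params on the dictionary passed as params in start_requests returns an empty list when every checked value matches the descriptions above.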

Now that the spider is ready, run it by executing the following command:

scrapy crawl example

The spider should create two HTML files in the project folder, containing the HTML source of https://httpbin.org and http://httpbin.org/ip.
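The file names follow from the parse method, which reads the target address from the url query parameter of the response URL. The sketch below reproduces that naming logic outside Scrapy. The endpoint in EXAMPLE_ENDPOINT is a placeholder assumed for illustration, not the real API address, and wrap_url and filename_for are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint, assumed for illustration only.
EXAMPLE_ENDPOINT = 'https://api.example.invalid/v1'


def wrap_url(target_url):
    """Build a stand-in API URL that carries the target in its 'url' parameter."""
    return EXAMPLE_ENDPOINT + '?' + urlencode({'url': target_url})


def filename_for(response_url):
    """Mirror the naming logic of ExampleSpider.parse for a given response URL."""
    target = parse_qs(urlparse(response_url).query)['url'][0]
    # Skip the scheme ('https:') and the empty segment that follows it.
    parts = target.split('/')[2:]
    return f"page-{'-'.join(parts)}.html"


for target in ('https://httpbin.org', 'http://httpbin.org/ip'):
    print(filename_for(wrap_url(target)))
# prints:
# page-httpbin.org.html
# page-httpbin.org-ip.html
```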

For any questions or issues, please contact us via the contact page.
