WebScrapingApi Python Scrapy SDK

WebScrapingAPI Scrapy SDK

WebScrapingAPI is an API for scraping websites through rotating proxies, which helps prevent bans. This SDK lets you create a Scrapy spider that is integrated with our API.

API Key

To use the API and the SDK you will need an API key. You can get one by registering at WebScrapingAPI.

Installation

Run the following command in the main folder of your project:

pip install webscrapingapi-scrapy-sdk

Usage

To use our API with Scrapy, first install Scrapy and the SDK, then create a new project by running these commands:

pip install scrapy
pip install webscrapingapi-scrapy-sdk
scrapy startproject myproject
cd myproject

Now that Scrapy has created the project, the first step is to update the project's settings by adding the following lines to the end of myproject/myproject/settings.py:

WEBSCRAPINGAPI_API_KEY = 'YOUR_API_KEY'

DOWNLOADER_MIDDLEWARES = {
    'webscrapingapi_scrapy_sdk.WebScrapingApiMiddleware': 543,
}

CONCURRENT_REQUESTS = 1
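If you would rather not commit the key to settings.py, one option is to read it from the environment. The sketch below is not part of the SDK: the helper name load_api_key and the use of an environment variable named WEBSCRAPINGAPI_API_KEY are assumptions made for illustration.

```python
import os


def load_api_key(env=os.environ, name="WEBSCRAPINGAPI_API_KEY"):
    """Return the API key stored in `env` under `name` (hypothetical helper).

    The environment variable name is our own convention, not something the
    SDK requires. A missing or blank value raises an error early instead of
    letting the spider start without a key.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# In settings.py you could then write:
# WEBSCRAPINGAPI_API_KEY = load_api_key()
```

With this in place, export the variable in your shell before running the spider.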

The next step is creating the spider. We will name the file example.py and place it in myproject/myproject/spiders/.

The source code for the spider is:

from webscrapingapi_scrapy_sdk import WebScrapingApiSpider, WebScrapingApiRequest

import urllib.parse as urlparse
from urllib.parse import parse_qs

class ExampleSpider(WebScrapingApiSpider):
    name = 'example'

    def start_requests(self):
        start_urls = [
            'https://httpbin.org',
            'http://httpbin.org/ip',
        ]
        for url in start_urls:
            yield WebScrapingApiRequest(url, params={
                # API Parameters
                # Set to 0 (off, default) or 1 (on) depending on whether or not to render JavaScript on the target web page. JavaScript rendering is done by using a browser.
                'render_js': 1,
                # Set to datacenter (default) or residential depending on which proxy type you want to use for your scraping request. Please note that a single residential proxy API request is counted as 25 API requests.
                'proxy_type': 'datacenter',
                # Specify the 2-letter code of the country you would like to use as a proxy geolocation for your scraping API request. Supported countries differ by proxy type; please refer to the Proxy Locations section for details.
                'country': 'us',
                # Set depending on whether or not to use the same proxy address for your requests.
                'session': 1,
                # Specify the maximum timeout in milliseconds you would like to use for your scraping API request. To force a timeout, you can specify a number such as 1000. This will abort the request after 1000 ms and return whatever HTML response was obtained up to that point.
                'timeout': 10000,
                # Set to desktop (default), mobile or tablet, depending on which device type you want to use for your scraping request.
                'device': 'desktop',
                # Specify the option you would like to use as a condition for your scraping API request. Can only be used when the parameter render_js=1 is activated.
                'wait_until': 'domcontentloaded',
                # Some websites use JavaScript frameworks that may require a few extra seconds to load their content. This parameter specifies the time in milliseconds to wait for the website. Recommended values are in the interval 5000-10000.
                'wait_for': 0,
            }, headers={
                # API Headers
                'authorization': 'bearer test',
                # Specify custom cookies to be passed to the request.
                'cookie': 'test_cookie=abc; cookie_2=def'
            })

    def parse(self, response):
        parsed_url = urlparse.urlparse(response.url)
        page = parse_qs(parsed_url.query)['url'][0].split("/")[2:]
        page = "-".join(page)
        filename = f'page-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
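The parse method above derives each file name from the url query parameter of the response URL. The standalone sketch below reproduces that logic so it can be tried without running a crawl. The endpoint https://api.example.com/v1 and the function name filename_for are placeholders chosen for illustration; only the presence of a url query parameter is taken from the spider code.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def filename_for(response_url):
    """Build the output file name the same way ExampleSpider.parse does."""
    # The target page is carried in the `url` query parameter.
    target = parse_qs(urlparse(response_url).query)['url'][0]
    # Drop the scheme ("http:" and the empty piece after "//").
    parts = target.split('/')[2:]
    page = '-'.join(parts)
    return f'page-{page}.html'


# Hypothetical API URL; the real endpoint may look different.
api_url = 'https://api.example.com/v1?' + urlencode({
    'api_key': 'YOUR_API_KEY',
    'url': 'http://httpbin.org/ip',
})
print(filename_for(api_url))  # page-httpbin.org-ip.html
```

Every path segment of the target URL is joined with a hyphen, so two pages on the same host end up in different files.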

To better understand the WebScrapingAPI parameters, please read our documentation.
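As a quick illustration of the values described in the spider comments, the sketch below builds a params dictionary and rejects values outside the documented choices. The helper build_params and the ALLOWED table are hypothetical and not part of the SDK. They cover only the three parameters whose accepted values are listed above, and the API documentation remains the authoritative reference.

```python
# Accepted values as stated in the comments of the example spider.
ALLOWED = {
    'render_js': {0, 1},
    'proxy_type': {'datacenter', 'residential'},
    'device': {'desktop', 'mobile', 'tablet'},
}


def build_params(**overrides):
    """Return a params dict, starting from the documented defaults."""
    params = {'render_js': 0, 'proxy_type': 'datacenter', 'device': 'desktop'}
    params.update(overrides)
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f'{name} must be one of {sorted(allowed)}')
    return params


params = build_params(render_js=1, country='us')
```

The resulting dictionary can be passed as the params argument of WebScrapingApiRequest, as in the spider above.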

Now that we have the spider, the only thing left to do is to run it with the following command:

scrapy crawl example

This spider should create two HTML files in the project folder, containing the HTML sources of https://httpbin.org and http://httpbin.org/ip.
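To confirm that the crawl wrote its output, you can list the saved pages. This is a small sketch, assuming the files follow the page-*.html naming used in the parse method. The function name saved_pages is our own.

```python
from pathlib import Path


def saved_pages(folder='.'):
    """Return the sorted names of page-*.html files found in `folder`."""
    return sorted(p.name for p in Path(folder).glob('page-*.html'))


# Run this from the folder where `scrapy crawl example` was executed.
print(saved_pages())
```

An empty list suggests the requests failed, for example because the API key is missing or invalid.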

For any questions or issues, please contact us via the contact page.

Release history

1.0.7: Jul 7, 2021
1.0.6: Jul 7, 2021
1.0.5: Jul 7, 2021
1.0.4: Jul 7, 2021
1.0.3 (this version): Jul 6, 2021

Download files

Source Distribution

webscrapingapi_scrapy_sdk-1.0.3.tar.gz (4.4 kB)

Uploaded Jul 6, 2021 Source

File details

Details for the file webscrapingapi_scrapy_sdk-1.0.3.tar.gz.

File metadata

Download URL: webscrapingapi_scrapy_sdk-1.0.3.tar.gz
Upload date: Jul 6, 2021
Size: 4.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.15.0 pkginfo/1.7.0 requests/2.25.1 setuptools/44.1.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/2.7.18

File hashes

Hashes for webscrapingapi_scrapy_sdk-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`640a1b8b61a7759d6c30ddfee68b32485a221c7917d4d6f9a7e5fa1b5ccf3bb2`
MD5	`c7f197d4f6712d5194e561537223d221`
BLAKE2b-256	`6564018fb1149e58d9bb3b70d476569bf8deed45c14d9c34f0814e111b97f160`
