
Simple accessory tools for Scrapy.


Scrapy Accessory

Introduction

Useful accessory utilities for Scrapy.

It contains:

  • middleware
  • feed exporter storage backend

Installation

pip install scrapy-accessory

Usage

Middleware

RandomUserAgentDownloadMiddleware

Adds a random user-agent header to each request.

In settings.py add:

# USER_AGENT_LIST_FILE = 'path-to-files'
USER_AGENT_LIST = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}

You can use either USER_AGENT_LIST_FILE or USER_AGENT_LIST to configure user-agents. USER_AGENT_LIST_FILE points to a text file containing one user-agent per line. USER_AGENT_LIST is a list or tuple of user-agents.
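A middleware of this kind typically picks one user-agent per outgoing request inside Scrapy's process_request hook. The sketch below illustrates the idea only; the class name is hypothetical and this is not the package's actual implementation:

```python
import random

# Hedged sketch of a random user-agent downloader middleware.
# Scrapy is deliberately not imported; a real middleware would be
# registered via DOWNLOADER_MIDDLEWARES as shown above.
class RandomUserAgentSketch:
    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("user_agents must not be empty")
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request.
        # Returning None lets the request continue through the chain.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

In a real project the user-agents would come from USER_AGENT_LIST or be read line by line from USER_AGENT_LIST_FILE.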

ProxyDownloadMiddleware

Adds an HTTP or HTTPS proxy to requests.

In settings.py add:

PROXY_ENABLED = True  # True to use proxy, default is False
# PROXY_HOST = 'localhost:8080'  # default static proxy, format: <ip>:<port>, default empty
PROXY_CACHE = 'redis://localhost:6379/0'  # cache for proxy, use redis://<host>:<port>/<db> to use redis cache, default dict in memory
PROXY_TTL = 30  # proxy cache TTL in seconds, default 30
CHANGE_PROXY_STATUS = [429]  # status codes that trigger a proxy change when received, default [429]
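The behavior described by PROXY_CACHE (in-memory dict backend) and PROXY_TTL can be pictured with a small TTL-bounded cache. This is an illustrative sketch with made-up names, not the package's actual code:

```python
import time

# Sketch of an in-memory proxy cache with a TTL, mirroring the
# semantics of PROXY_CACHE (dict backend) and PROXY_TTL.
class ProxyCacheSketch:
    def __init__(self, ttl=30):
        self.ttl = ttl
        self._proxy = None
        self._expires_at = 0.0

    def get(self):
        # Return the cached proxy while it is still fresh, else None.
        if self._proxy and time.monotonic() < self._expires_at:
            return self._proxy
        return None

    def set(self, proxy):
        self._proxy = proxy
        self._expires_at = time.monotonic() + self.ttl

    def invalidate(self):
        # Would be called when a CHANGE_PROXY_STATUS code (e.g. 429)
        # is received, forcing a fresh proxy on the next request.
        self._proxy = None
```

With the redis:// form of PROXY_CACHE, the same get/set/expire cycle would be backed by Redis instead of process memory.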

By default a static proxy configured in settings.py is used, but you can supply dynamic proxies from an API or another source: extend the ProxyDownloadMiddleware class and implement the generate_proxy method.

Example:

import requests

from scrapy_accessory.middlewares import ProxyDownloadMiddleware


class DynamicProxyDownloadMiddleware(ProxyDownloadMiddleware):

    api = 'http://api-to-get-proxy-ip'

    def generate_proxy(self):
        res = requests.get(self.api)
        if res.status_code < 300:
            return res.text  # return format <ip>:<port>
        return None

Feed exporter storage backend

ObsFeedStorage

Feed exporter storage backend for Huawei Cloud OBS.

Install the OBS SDK first:

pip install esdk-obs-python

Configure in settings.py:

FEED_STORAGES = {
    'obs': 'scrapy_accessory.feedexporter.ObsFeedStorage',
}
HUAWEI_ACCESS_KEY_ID = '<your access key id>'
HUAWEI_SECRET_ACCESS_KEY = '<your secret access key>'
HUAWEI_OBS_ENDPOINT = '<your obs bucket endpoint>'  # e.g. https://obs.cn-north-4.myhuaweicloud.com

Output to OBS with the obs scheme: -o obs://<bucket>/<key>
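As a usage illustration (the spider and bucket names here are hypothetical), an export run could look like:

```shell
scrapy crawl quotes -o obs://my-bucket/quotes.json
```

Scrapy resolves the obs:// URI through the FEED_STORAGES mapping configured above and hands the finished feed to ObsFeedStorage for upload.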

OssFeedStorage

Feed exporter storage backend for Alibaba Cloud OSS.

Install the OSS SDK first:

pip install oss2

Configure in settings.py:

FEED_STORAGES = {
    'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
ALI_ACCESS_KEY_ID = '<your access key id>'
ALI_SECRET_ACCESS_KEY = '<your secret access key>'
ALI_OSS_ENDPOINT = '<your oss bucket endpoint>'  # e.g. https://oss-cn-beijing.aliyuncs.com

Output to OSS with the oss scheme: -o oss://<bucket>/<key>
