Simple accessory tools for Scrapy.


Scrapy Accessory

Introduction

Useful accessory utilities for Scrapy.

It includes:

middleware
item pipeline
feed exporter storage backend

Installation

pip install scrapy-accessory

Usage

Middleware

RandomUserAgentDownloadMiddleware

Adds a random User-Agent header to each request.

In settings.py, add:

# USER_AGENT_LIST_FILE = 'path-to-file'
USER_AGENT_LIST = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}

You can use either USER_AGENT_LIST_FILE or USER_AGENT_LIST to configure user-agents. USER_AGENT_LIST_FILE points to a text file containing one user-agent per line. USER_AGENT_LIST is a list or tuple of user-agents.
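
The two configuration options can be pictured with a minimal sketch. It is not the package's implementation: `load_user_agents` and `pick_user_agent` are hypothetical helper names, and letting the file take precedence when both options are set is an assumption made only for this illustration.

```python
import random
from pathlib import Path


def load_user_agents(list_file=None, ua_list=None):
    """Return user-agents from a file (one per line) or from a list/tuple.

    Assumption for this sketch: the file wins when both are given.
    """
    if list_file:
        lines = Path(list_file).read_text(encoding="utf-8").splitlines()
        # Skip blank lines so a trailing newline does not yield an empty agent.
        return [line.strip() for line in lines if line.strip()]
    return list(ua_list or [])


def pick_user_agent(agents, rng=random):
    """Choose one user-agent at random, or None when none are configured."""
    return rng.choice(agents) if agents else None
```

A downloader middleware would typically call something like `pick_user_agent` once per request and write the result into the request's User-Agent header.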

ProxyDownloadMiddleware

Adds an HTTP or HTTPS proxy to requests.

In settings.py, add:

PROXY_ENABLED = True  # set to True to use a proxy; default is False
# PROXY_HOST = 'localhost:8080'  # default static proxy in <ip>:<port> format; default is empty
PROXY_CACHE = 'redis://localhost:6379/0'  # proxy cache; use redis://<host>:<port>/<db> for Redis; default is an in-memory dict
PROXY_TTL = 30  # proxy cache TTL in seconds; default is 30
CHANGE_PROXY_STATUS = [429]  # status codes that force a proxy change when received; default is [429]

By default, the middleware uses the static proxy configured in settings.py. To use a dynamic proxy, for example one fetched from an API, extend the ProxyDownloadMiddleware class and implement the generate_proxy method.

Example:

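import requests

# Import path assumed from the middleware module used in the settings above.
from scrapy_accessory.middlewares import ProxyDownloadMiddleware


class DynamicProxyDownloadMiddleware(ProxyDownloadMiddleware):

    # Placeholder URL of an API that returns a proxy address
    api = 'http://api-to-get-proxy-ip'

    def generate_proxy(self):
        # A timeout keeps a slow proxy API from stalling the crawl
        res = requests.get(self.api, timeout=10)
        if res.status_code < 300:
            return res.text.strip()  # expected format: <ip>:<port>
        return None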
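
The PROXY_TTL and CHANGE_PROXY_STATUS settings suggest that a generated proxy is cached for a while and replaced early when a response signals throttling. The sketch below illustrates that idea only. `TtlProxyCache` and `should_change_proxy` are hypothetical names, not part of scrapy-accessory, and the package's actual caching logic may differ.

```python
import time


class TtlProxyCache:
    """Hold one proxy and refresh it when it expires or is invalidated."""

    def __init__(self, generate, ttl=30, clock=time.monotonic):
        self._generate = generate  # callable returning '<ip>:<port>' or None
        self._ttl = ttl            # seconds a proxy stays valid
        self._clock = clock        # injectable so the cache can be tested
        self._proxy = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached proxy, generating a new one when needed."""
        now = self._clock()
        if self._proxy is None or now >= self._expires_at:
            self._proxy = self._generate()
            self._expires_at = now + self._ttl
        return self._proxy

    def invalidate(self):
        """Drop the cached proxy so the next get() generates a fresh one."""
        self._proxy = None


def should_change_proxy(status, change_statuses=(429,)):
    """True when a response status should force a proxy change."""
    return status in change_statuses
```

In this sketch a middleware would call `cache.get()` for each request and `cache.invalidate()` whenever `should_change_proxy(response.status)` is true.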

Feed exporter storage backend

ObsFeedStorage

Feed exporter storage backend for Huawei Cloud OBS.

Install the OBS SDK first:

pip install esdk-obs-python

Configure it in settings.py:

FEED_STORAGES = {
    'obs': 'scrapy_accessory.feedexporter.ObsFeedStorage',
}
HUAWEI_ACCESS_KEY_ID = '<your access key id>'
HUAWEI_SECRET_ACCESS_KEY = '<your secret access key>'
HUAWEI_OBS_ENDPOINT = '<your obs bucket endpoint>'  # e.g. 'https://obs.cn-north-4.myhuaweicloud.com'

To write the feed to OBS, use the obs scheme in the output URI: -o obs://<bucket>/<key>
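
As an illustration of how such a URI breaks down into bucket and object key, here is a small sketch using only the standard library. `split_feed_uri` is a hypothetical helper written for this example; how the storage backend itself parses the URI is not documented here.

```python
from urllib.parse import urlparse


def split_feed_uri(uri, expected_scheme="obs"):
    """Split '<scheme>://<bucket>/<key>' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != expected_scheme:
        raise ValueError(f"expected {expected_scheme}:// URI, got {uri!r}")
    # netloc carries the bucket; the path minus its leading slash is the key.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_feed_uri("obs://my-bucket/exports/items.json")` yields the bucket `my-bucket` and the key `exports/items.json`, where both names are placeholders.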

OssFeedStorage

Feed exporter storage backend for Alibaba Cloud OSS.

Install the OSS SDK first:

pip install oss2

Configure it in settings.py:

FEED_STORAGES = {
    'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
ALI_ACCESS_KEY_ID = '<your access key id>'
ALI_SECRET_ACCESS_KEY = '<your secret access key>'
ALI_OSS_ENDPOINT = '<your oss bucket endpoint>'  # e.g. 'https://oss-cn-beijing.aliyuncs.com'

To write the feed to OSS, use the oss scheme in the output URI: -o oss://<bucket>/<key>
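
Instead of passing -o on the command line, the output URI can in principle be declared in settings.py through Scrapy's FEEDS setting, which exists in Scrapy 2.1 and later. The fragment below is a sketch under that assumption. The bucket name and key are hypothetical, and whether this version of OssFeedStorage works with FEEDS has not been verified here.

```python
# settings.py (sketch; assumes Scrapy >= 2.1, where the FEEDS setting exists)
FEED_STORAGES = {
    'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
FEEDS = {
    # 'my-bucket' and the object key are hypothetical placeholders.
    'oss://my-bucket/exports/items.json': {'format': 'json'},
}
```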

Item Pipeline

RedisListPipeline

Exports items to a Redis list.

Install the redis client first:

pip install redis

Configure it in settings.py:

REDIS_CONNECTION_URL = 'redis://localhost:6379/0'  # required
REDIS_DEFAULT_QUEUE = 'test'  # default queue; a spider's queue attribute overrides it
REDIS_MAX_RETRY = 5  # default is 5

Add scrapy_accessory.pipelines.RedisListPipeline to your ITEM_PIPELINES setting:

ITEM_PIPELINES = {
    'scrapy_accessory.pipelines.RedisListPipeline': 1
}
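
The queue override and the retry setting can be pictured with the following sketch. It is an illustration, not the pipeline's source: `resolve_queue` and `push_with_retry` are hypothetical names, JSON serialization and RPUSH are assumptions, and REDIS_MAX_RETRY is treated here as the total number of attempts.

```python
import json


def resolve_queue(spider, default_queue):
    """Prefer the spider's `queue` attribute, else the configured default."""
    return getattr(spider, "queue", None) or default_queue


def push_with_retry(client, queue, item, max_retry=5, retry_on=(ConnectionError,)):
    """Serialize `item` as JSON and append it to a Redis list.

    `client` is any object with an rpush(name, value) method, such as a
    redis.Redis instance. With redis-py you would likely pass
    retry_on=(redis.exceptions.ConnectionError,), since that class is not a
    subclass of the built-in ConnectionError used as the default here.
    """
    payload = json.dumps(dict(item))
    for attempt in range(1, max_retry + 1):
        try:
            return client.rpush(queue, payload)
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == max_retry:
                raise
```

A spider opts into its own queue simply by defining a class attribute such as `queue = 'custom'`; spiders without it fall back to REDIS_DEFAULT_QUEUE.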

Release history

0.2.1 (this version): Mar 11, 2021
0.2.0: Mar 27, 2020
0.1.0: Jan 13, 2020

Download files

Source Distribution

scrapy-accessory-0.2.1.tar.gz (5.5 kB), uploaded Mar 11, 2021, Source

Built Distribution

scrapy_accessory-0.2.1-py3-none-any.whl (8.4 kB), uploaded Mar 11, 2021, Python 3

File details

Details for the file scrapy-accessory-0.2.1.tar.gz.

File metadata

Download URL: scrapy-accessory-0.2.1.tar.gz
Upload date: Mar 11, 2021
Size: 5.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0.post20200106 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6

File hashes

Hashes for scrapy-accessory-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`18301fcdb52871c84e1f6ba603ff58710fd146d65cfc6cb45810ca57f7494040`
MD5	`4828d5b118073388d8aa5c92ffb29e82`
BLAKE2b-256	`4c4fc5af4be32ab54c3d73a077e7ec3e18e4263e9d86c2e97c6ba33c89da7f39`

File details

Details for the file scrapy_accessory-0.2.1-py3-none-any.whl.

File metadata

Download URL: scrapy_accessory-0.2.1-py3-none-any.whl
Upload date: Mar 11, 2021
Size: 8.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0.post20200106 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.6

File hashes

Hashes for scrapy_accessory-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f5317e08b24d39fa56d5c1a671ee6108289fe1e96a5b72a66fb15dafa291bde`
MD5	`a091e97387f9280b6f54a24cf68d447f`
BLAKE2b-256	`10e728ac4cfbf1c808e67cfae3c28b903610394504a90dbc6f198b0c2ff0c344`
