Simple accessory tools for Scrapy.
Scrapy Accessory
Introduction
Useful accessory utilities for Scrapy.
It contains:
- middleware
- item pipeline
- feed exporter storage backend
Installation
pip install scrapy-accessory
Usage
Middleware
RandomUserAgentDownloadMiddleware
Add random user-agent to requests.
In settings.py add
# USER_AGENT_LIST_FILE = 'path-to-files'
USER_AGENT_LIST = [
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
DOWNLOADER_MIDDLEWARES = {
'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}
You can use either USER_AGENT_LIST_FILE or USER_AGENT_LIST to configure user-agents. USER_AGENT_LIST_FILE points to a text file containing one user-agent per line, while USER_AGENT_LIST is a list or tuple of user-agents.
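To illustrate how such a middleware might behave, here is a minimal sketch that picks a random entry from the list for each request. This is an assumption about the mechanism, not the library's actual code, and FakeRequest is a hypothetical stand-in so the sketch runs without Scrapy:

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

class RandomUserAgentSketch:
    """Sketch of a random user-agent downloader middleware."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider=None):
        # Assign a randomly chosen user-agent to each outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)

class FakeRequest:
    """Stand-in for scrapy.Request, just enough headers support for the sketch."""
    def __init__(self):
        self.headers = {}

req = FakeRequest()
RandomUserAgentSketch(USER_AGENTS).process_request(req)
```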
ProxyDownloadMiddleware
Add an HTTP or HTTPS proxy to requests.
In settings.py add
PROXY_ENABLED = True # True to use proxy, default is False
# PROXY_HOST = 'localhost:8080' # default static proxy, format: <ip>:<port>, default empty
PROXY_CACHE = 'redis://localhost:6379/0' # cache for proxy, use redis://<host>:<port>/<db> to use redis cache, default dict in memory
PROXY_TTL = 30 # proxy cache ttl in seconds, default 30s
CHANGE_PROXY_STATUS = [429] # a list of status codes that force to change proxy if received, default [429]
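The PROXY_CACHE and PROXY_TTL settings imply that fetched proxies are cached with an expiry. A minimal in-memory TTL cache along those lines might look like this; it is a sketch of the idea, not the middleware's actual implementation:

```python
import time

class TtlProxyCache:
    """Sketch of an in-memory proxy cache with a time-to-live,
    mirroring the default dict-in-memory PROXY_CACHE behaviour."""

    def __init__(self, ttl=30):
        self.ttl = ttl
        self._proxy = None
        self._stored_at = 0.0

    def get(self):
        # Return the cached proxy only while it is still fresh.
        if self._proxy and time.time() - self._stored_at < self.ttl:
            return self._proxy
        return None

    def set(self, proxy):
        self._proxy = proxy
        self._stored_at = time.time()

cache = TtlProxyCache(ttl=30)
cache.set('127.0.0.1:8080')
```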
By default the middleware uses the static proxy configured in settings.py, but you can also obtain proxies dynamically from an API or another source. Simply extend the ProxyDownloadMiddleware class and implement the generate_proxy method.
Example:
import requests

class DynamicProxyDownloadMiddleware(ProxyDownloadMiddleware):
    api = 'http://api-to-get-proxy-ip'

    def generate_proxy(self):
        res = requests.get(self.api)
        if res.status_code < 300:
            return res.text  # return format <ip>:<port>
        return None
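As with the user-agent middleware, the subclass must be registered in DOWNLOADER_MIDDLEWARES to take effect. The module path and priority below are placeholders; adjust them to your project:

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DynamicProxyDownloadMiddleware': 350,
}
```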
Feed exporter storage backend
ObsFeedStorage
Feed exporter storage backend for Huawei Cloud OBS.
Install the OBS SDK first:
pip install esdk-obs-python
Configure in settings.py
FEED_STORAGES = {
'obs': 'scrapy_accessory.feedexporter.ObsFeedStorage',
}
HUAWEI_ACCESS_KEY_ID = '<your access key id>'
HUAWEI_SECRET_ACCESS_KEY = '<your secret access key>'
HUAWEI_OBS_ENDPOINT = '<your obs bucket endpoint> ex: https://obs.cn-north-4.myhuaweicloud.com'
Output to OBS with the obs scheme: -o obs://<bucket>/<key>
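If you prefer configuring feeds in settings.py over the -o flag, recent Scrapy versions also accept the FEEDS setting; assuming standard feed-export behaviour, the same URI scheme should work there once the storage backend is registered (bucket and key below are placeholders):

```python
FEEDS = {
    'obs://my-bucket/items.json': {'format': 'json'},
}
```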
OssFeedStorage
Feed exporter storage backend for Alibaba Cloud OSS.
Install the OSS SDK first:
pip install oss2
Configure in settings.py
FEED_STORAGES = {
'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
ALI_ACCESS_KEY_ID = '<your access key id>'
ALI_SECRET_ACCESS_KEY = '<your secret access key>'
ALI_OSS_ENDPOINT = '<your oss bucket endpoint> ex: https://oss-cn-beijing.aliyuncs.com'
Output to OSS with the oss scheme: -o oss://<bucket>/<key>
Item Pipeline
RedisListPipeline
Export items to a Redis list.
Install the Redis client first:
pip install redis
Configure in settings.py
REDIS_CONNECTION_URL = 'redis://localhost:6379/0' # required
REDIS_DEFAULT_QUEUE = 'test' # use spider's queue attribute to override it
REDIS_MAX_RETRY = 5 # default 5
Add scrapy_accessory.pipelines.RedisListPipeline to your ITEM_PIPELINES setting.
ITEM_PIPELINES = {
'scrapy_accessory.pipelines.RedisListPipeline': 1
}
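A pipeline like this presumably serializes each item and pushes it onto the configured list, retrying failed pushes up to REDIS_MAX_RETRY times. The sketch below shows that pattern with a fake client so it runs without a Redis server; it illustrates the idea and is not the pipeline's actual code:

```python
import json

class FakeRedis:
    """Stand-in for redis.Redis, just enough for the sketch."""
    def __init__(self):
        self.data = {}

    def rpush(self, queue, value):
        # Append the value to the named list, like Redis RPUSH.
        self.data.setdefault(queue, []).append(value)

def push_with_retry(client, queue, item, max_retry=5):
    # Serialize the item and append it to the list, retrying
    # transient connection errors up to max_retry times.
    payload = json.dumps(dict(item))
    for attempt in range(max_retry):
        try:
            client.rpush(queue, payload)
            return item
        except ConnectionError:
            if attempt == max_retry - 1:
                raise

client = FakeRedis()
push_with_retry(client, 'test', {'title': 'example'})
```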