Simple accessory tools for Scrapy.
Scrapy Accessory
Introduction
Useful accessory utilities for Scrapy, containing:
- middleware
- item pipeline
- feed exporter storage backend
Installation
pip install scrapy-accessory
Usage
Middleware
RandomUserAgentDownloadMiddleware
Add a random user-agent to each request.
In settings.py add
# USER_AGENT_LIST_FILE = 'path-to-file'
USER_AGENT_LIST = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
DOWNLOADER_MIDDLEWARES = {
    'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}
You can use either USER_AGENT_LIST_FILE or USER_AGENT_LIST to configure user-agents. USER_AGENT_LIST_FILE points to a text file containing one user-agent per line. USER_AGENT_LIST is a list or tuple of user-agents.
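For example, a minimal sketch of the file-based configuration (the file name user_agents.txt is only an illustration):
# settings.py
USER_AGENT_LIST_FILE = 'user_agents.txt'  # plain text file, one user-agent per line
DOWNLOADER_MIDDLEWARES = {
    'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}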
ProxyDownloadMiddleware
Add an HTTP or HTTPS proxy to requests.
In settings.py add
PROXY_ENABLED = True # True to use proxy, default is False
# PROXY_HOST = 'localhost:8080' # default static proxy, format: <ip>:<port>, default empty
PROXY_CACHE = 'redis://localhost:6379/0' # cache for proxy, use redis://<host>:<port>/<db> to use redis cache, default dict in memory
PROXY_TTL = 30 # proxy cache ttl in seconds, default 30s
CHANGE_PROXY_STATUS = [429] # status codes that force a proxy change when received, default [429]
By default, a static proxy configured in settings.py is used. You can also obtain a dynamic proxy from an API or another source: just extend the ProxyDownloadMiddleware class and implement the generate_proxy method. Example:
import requests  # used to call the proxy API

# import path assumed to match this package's layout
from scrapy_accessory.middlewares import ProxyDownloadMiddleware


class DynamicProxyDownloadMiddleware(ProxyDownloadMiddleware):
    api = 'http://api-to-get-proxy-ip'

    def generate_proxy(self):
        res = requests.get(self.api)
        if res.status_code < 300:
            return res.text  # return format <ip>:<port>
        return None
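To enable it, register your subclass in DOWNLOADER_MIDDLEWARES; the module path myproject.middlewares and the priority 350 below are only illustrative:
# settings.py
PROXY_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DynamicProxyDownloadMiddleware': 350,
}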
Feed exporter storage backend
ObsFeedStorage
Feed exporter storage backend for Huawei Cloud OBS.
Install the OBS SDK first:
pip install esdk-obs-python
Configure in settings.py
FEED_STORAGES = {
'obs': 'scrapy_accessory.feedexporter.ObsFeedStorage',
}
HUAWEI_ACCESS_KEY_ID = '<your access key id>'
HUAWEI_SECRET_ACCESS_KEY = '<your secret access key>'
HUAWEI_OBS_ENDPOINT = '<your obs bucket endpoint>'  # e.g. https://obs.cn-north-4.myhuaweicloud.com
Output to OBS with the obs scheme: -o obs://<bucket>/<key>
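For example, assuming a spider named myspider and a bucket named my-bucket (both illustrative):
scrapy crawl myspider -o obs://my-bucket/items.json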
OssFeedStorage
Feed exporter storage backend for Alibaba Cloud OSS.
Install the OSS SDK first:
pip install oss2
Configure in settings.py
FEED_STORAGES = {
'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
ALI_ACCESS_KEY_ID = '<your access key id>'
ALI_SECRET_ACCESS_KEY = '<your secret access key>'
ALI_OSS_ENDPOINT = '<your oss bucket endpoint>'  # e.g. https://oss-cn-beijing.aliyuncs.com
Output to OSS with the oss scheme: -o oss://<bucket>/<key>
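Usage mirrors the OBS backend, for example (spider and bucket names illustrative):
scrapy crawl myspider -o oss://my-bucket/items.json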
Item Pipeline
RedisListPipeline
Export items to a Redis list.
Install the redis package first:
pip install redis
Configure in settings.py
REDIS_CONNECTION_URL = 'redis://localhost:6379/0' # required
REDIS_DEFAULT_QUEUE = 'test' # default list name; override it with the spider's queue attribute
REDIS_MAX_RETRY = 5 # default 5
Add scrapy_accessory.pipelines.RedisListPipeline to your ITEM_PIPELINES settings.
ITEM_PIPELINES = {
    'scrapy_accessory.pipelines.RedisListPipeline': 1,
}
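As noted above, a spider can override REDIS_DEFAULT_QUEUE with its own queue attribute. A minimal sketch (spider name, queue name, and URL are illustrative):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    queue = 'quotes_items'  # items from this spider go to the 'quotes_items' Redis list
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}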