A simple accessory tools for Scrapy.
Project description
Scrapy Accessory
Introduction
Useful accessory utilities for Scrapy.
Containing:
- middleware
- feed exporter storage backend
Installation
pip install scrapy-accessory
Usage
Middleware
RandomUserAgentDownloadMiddleware
Add random user-agent to requests.
In settings.py add
# USER_AGENT_LIST_FILE = 'path-to-files'
USER_AGENT_LIST = [
'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0',
'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
]
DOWNLOADER_MIDDLEWARES = {
'scrapy_accessory.middlewares.RandomUserAgentDownloadMiddleware': 200,
}
You can use either USER_AGENT_LIST_FILE
or USER_AGENT_LIST
to configure user-agents.
USER_AGENT_LIST_FILE
points to a text file containing one user-agent per line.
USER_AGENT_LIST
is a list or tuple of user-agents.
ProxyDownloadMiddleware
Add http or https proxy for requests.
In settings.py add
PROXY_ENABLED = True # True to use proxy, default is False
# PROXY_HOST = 'localhost:8080' # default static proxy, format: <ip>:<port>, default empty
PROXY_CACHE = 'redis://localhost:6379/0' # cache for proxy, use redis://<host>:<port>/<db> to use redis cache, default dict in memory
PROXY_TTL = 30 # proxy cache ttl in seconds, default 30s
CHANGE_PROXY_STATUS = [429] # a list of status codes that force to change proxy if received, default [429]
Default is a static proxy configured in settings.py, you can add dynamic proxy from API or others.
Just need to extend the ProxyDownloadMiddleware
class and implement the generate_proxy
method.
Example:
class DynamicProxyDownloadMiddleware(ProxyDownloadMiddleware):
api = 'http://api-to-get-proxy-ip'
def generate_proxy(self):
res = requests.get(self.api)
if res.status_code < 300:
return res.text # return format <ip>:<port>
return None
Feed exporter storage backend
ObsFeedStorage
Feed exporter storage backend for huawei cloud OBS.
Install obs sdk first
pip install esdk-obs-python
Configure in settings.py
FEED_STORAGES = {
'obs': 'scrapy_accessory.feedexporter.ObsFeedStorage',
}
HUAWEI_ACCESS_KEY_ID = '<your access key id>'
HUAWEI_SECRET_ACCESS_KEY = '<your secret access key>'
HUAWEI_OBS_ENDPOINT = '<your obs bucket endpoint> ex: https://obs.cn-north-4.myhuaweicloud.com'
Output to OBS by obs schema -o obs://<bucket>/<key>
OssFeedStorage
Feed exporter storage backend for ali cloud OSS.
Install oss sdk first
pip install oss2
Configure in settings.py
FEED_STORAGES = {
'oss': 'scrapy_accessory.feedexporter.OssFeedStorage',
}
ALI_ACCESS_KEY_ID = '<your access key id>'
ALI_SECRET_ACCESS_KEY = '<your secret access key>'
ALI_OSS_ENDPOINT = '<your oss bucket endpoint> ex: https://oss-cn-beijing.aliyuncs.com'
Output to OSS by oss schema -o oss://<bucket>/<key>
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for scrapy_accessory-0.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2be74e25249c961414d0d1effd418fea96279304853c396df451584eb0a30dea |
|
MD5 | 6ff0f8e3cb0b13e95c56853258f4571e |
|
BLAKE2b-256 | bff584fab6d9e753374d242eef95145ce7167ee345c102408c5d4d9f3d14ea5d |