Scrapy Proxies: random proxy middleware for Scrapy (supports loading proxies from IPProxyTool)
Project description
Random proxy middleware for Scrapy (http://scrapy.org/).
Based on https://github.com/aivarsk/scrapy-proxies, with added support for loading proxies from https://github.com/qiyeboy/IPProxyPool.
Processes Scrapy requests through a proxy chosen at random from a list, to avoid IP bans and improve crawling speed.
Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste it into a text file and reformat each entry to http://host:port).
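Copied lists often arrive as bare host:port pairs. The following is a minimal sketch of the reformatting step, assuming one entry per line; the function name normalize_proxy_lines and the sample addresses are hypothetical and not part of this package.

```python
def normalize_proxy_lines(lines, scheme="http"):
    """Return proxy URLs in scheme://host:port form, skipping blanks and comments."""
    proxies = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        # Keep entries that already carry a scheme; prefix the rest.
        if "://" not in entry:
            entry = f"{scheme}://{entry}"
        # Drop duplicates while preserving the original order.
        if entry not in proxies:
            proxies.append(entry)
    return proxies


if __name__ == "__main__":
    # Hypothetical sample input, as it might look after copy-pasting.
    sample = ["10.0.0.1:8080", "", "http://10.0.0.2:3128", "10.0.0.1:8080"]
    for proxy in normalize_proxy_lines(sample):
        print(proxy)
```

The resulting lines can be written to the text file that the 'list' setting points to.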
Install
The quick way:
pip install scrapy-proxies-tool
Or check out the source and run:
python setup.py install
settings.py
# Retry many times, since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes, since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_SETTINGS = {
    # Path to a proxy list file containing entries like
    #   http://host1:port
    #   http://username:password@host2:port
    #   http://host3:port
    #   ...
    # If 'from_proxies_server' is True, 'list' holds the proxy server address instead
    # (see https://github.com/qiyeboy/IPProxyPool and https://github.com/awolfly9/IPProxyTool).
    # Only HTTP is supported (see https://github.com/qiyeboy/IPProxyPool#%E5%8F%82%E6%95%B0):
    # 'list': ['http://localhost:8000?protocol=0'],
    'list': ['/path/to/proxy/list.txt'],
    # Disable the proxy settings and use the real IP when all proxies are unusable
    'use_real_when_empty': False,
    'from_proxies_server': False,
    # If the proxy mode is 2, uncomment this line:
    # 'custom_proxy': "http://host1:port",
    # Proxy mode
    # 0 = Every request uses a different proxy
    # 1 = Take only one proxy from the list and assign it to every request
    # 2 = Use the custom proxy given in the settings
    'mode': 0
}
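To make the three proxy modes concrete, here is a rough sketch of how a proxy could be picked from these settings. It is an illustration only, not the middleware's actual code: the function choose_proxy, its parameters, and the behaviour when the list is empty are assumptions.

```python
import random


def choose_proxy(settings, proxies, current=None, rng=random):
    """Pick a proxy URL according to settings['mode'] (0, 1 or 2).

    Hypothetical helper: `proxies` is the loaded list and `current` is the
    proxy chosen earlier, if any.
    """
    mode = settings.get("mode", 0)
    if mode == 2:
        # Mode 2: always use the proxy given in the settings.
        return settings["custom_proxy"]
    if not proxies:
        # Assumption: None stands for "no proxy, use the real IP".
        if settings.get("use_real_when_empty", False):
            return None
        raise ValueError("no usable proxies left")
    if mode == 1 and current in proxies:
        # Mode 1: keep the proxy that was chosen the first time.
        return current
    # Mode 0, or the first pick in mode 1: random choice from the list.
    return rng.choice(proxies)
```

For example, choose_proxy({'mode': 2, 'custom_proxy': 'http://host1:port'}, []) returns the custom proxy regardless of the list contents.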
For older versions of Scrapy (before 1.0.0) you have to use
scrapy.contrib.downloadermiddleware.retry.RetryMiddleware
and
scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware
middlewares instead.
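For reference, the middleware setting for such an older installation would then presumably look like this. It is a sketch that reuses the priorities from the settings.py example and swaps in only the two module paths named above.

```python
# Assumption: Scrapy older than 1.0.0, where the built-in middlewares
# live under scrapy.contrib. Priorities match the settings.py example.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
```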
Your spider
In each callback, make sure the proxy really returned your target page by checking for the site logo or some other significant element. If it did not, retry the request with dont_filter=True:
# '//get/site/logo' is a placeholder XPath; replace it with one that matches your target site
if not response.xpath('//get/site/logo'):
    yield Request(url=response.url, dont_filter=True)
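Retrying unconditionally can loop forever if a page never validates. One way to bound it is to count attempts in the request meta. The sketch below is not part of this package: the helper retry_meta, the key name page_retries, and the default cap of 5 are hypothetical choices.

```python
def retry_meta(meta, max_page_retries=5, key="page_retries"):
    """Return meta for one more retry, or None once the cap is reached.

    Only the counter is carried over, so no other key from the previous
    attempt is reused on the new request.
    """
    attempts = meta.get(key, 0)
    if attempts >= max_page_retries:
        return None
    return {key: attempts + 1}


# Inside a Scrapy callback this might be used as follows (not executed here):
#
#     if not response.xpath('//get/site/logo'):
#         meta = retry_meta(response.meta)
#         if meta is not None:
#             yield Request(url=response.url, dont_filter=True, meta=meta)
```

When retry_meta returns None, the callback simply gives up on that URL instead of scheduling another request.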
Source Distribution
Hashes for scrapy-proxies-tool-0.4.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1b5e2614b91625f52906b6f990b86efcafc3141eb3886f0de5978bbe63523915
MD5 | 42faf711cde6fd74f28394bdca1b93a8
BLAKE2b-256 | d69a529796b44339608ff26026831bf8ebf123a804e001535ef8239f75329416