Rotating proxies for Scrapy
scrapy-rotating-proxies
This package provides a Scrapy middleware to use rotating proxies, check that they are alive and adjust crawling speed.
License is MIT.
Installation
pip install scrapy-rotating-proxies
Usage
Add ROTATING_PROXY_LIST option with a list of proxies to settings.py:
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
You can load it from a file if needed:
def load_lines(path):
    with open(path, 'rb') as f:
        return [line.strip() for line in
                f.read().decode('utf8').splitlines()
                if line.strip()]

ROTATING_PROXY_LIST = load_lines('/my/path/proxies.txt')
Then add rotating_proxies middlewares to your DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}
After this, all requests will be proxied using one of the proxies from ROTATING_PROXY_LIST.
Requests with “proxy” set in their meta are not handled by scrapy-rotating-proxies. To disable proxying for a request set request.meta['proxy'] = None; to set proxy explicitly use request.meta['proxy'] = "<my-proxy-address>".
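The rules above can be summarized in a small sketch (the function name and return labels are illustrative, not part of the library's API; only the meta['proxy'] convention comes from the middleware):

```python
# Sketch: how a request's meta['proxy'] value determines whether
# scrapy-rotating-proxies handles the request.
def proxy_handling(meta):
    """Return how the middleware treats a request with this meta dict."""
    if 'proxy' not in meta:
        return 'rotated'    # middleware assigns a proxy from ROTATING_PROXY_LIST
    if meta['proxy'] is None:
        return 'direct'     # proxying disabled for this request
    return 'explicit'       # user-set proxy; middleware leaves it alone
```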
Concurrency
By default, all standard Scrapy concurrency options (DOWNLOAD_DELAY, AUTOTHROTTLE_..., CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become per-proxy for proxied requests when RotatingProxyMiddleware is enabled. For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN=2, the spider will make at most 2 concurrent connections to each proxy, regardless of the request url domain.
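For example, a settings fragment like the following (values are illustrative) would limit the spider to 2 concurrent requests and a 0.5 second delay per proxy, rather than per domain:

```python
# settings.py (sketch): with RotatingProxyMiddleware enabled,
# these limits apply per proxy instead of per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # at most 2 concurrent requests to each proxy
DOWNLOAD_DELAY = 0.5                 # at least 0.5 s between requests to one proxy
```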
Customization
scrapy-rotating-proxies keeps track of working and non-working proxies, and re-checks non-working ones from time to time.
Detection of a non-working proxy is site-specific. By default, scrapy-rotating-proxies uses a simple heuristic: if the response status code is not 200, the response body is empty, or there was an exception, then the proxy is considered dead.
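The default heuristic can be sketched as follows (this mirrors the description above, not the library's exact code; the function name is hypothetical):

```python
# Sketch of the default ban heuristic: a proxy is considered dead on a
# non-200 status, an empty body, or any exception during the request.
def looks_like_ban(status, body, exception=None):
    if exception is not None:
        return True
    return status != 200 or not body
```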
You can override the ban detection method by passing a path to a custom BanDetectionPolicy in the ROTATING_PROXY_BAN_POLICY option, e.g.:
# settings.py
ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyBanPolicy'
The policy must be a class with response_is_ban and exception_is_ban methods. These methods can return True (ban detected), False (not a ban) or None (unknown). It can be convenient to subclass and modify default BanDetectionPolicy:
# myproject/policy.py
from rotating_proxies.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # use default rules, but also consider HTTP 200 responses
        # a ban if there is a 'captcha' word in the response body.
        ban = super(MyPolicy, self).response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # override the method completely: don't take exceptions into account
        return None
Instead of creating a policy you can also implement response_is_ban and exception_is_ban methods as spider methods, for example:
class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None
It is important to get these rules right, because the appropriate action differs for a failed request versus a bad proxy: if the proxy is to blame, it makes sense to retry the request with a different proxy.
Non-working proxies could become alive again after some time. scrapy-rotating-proxies uses a randomized exponential backoff for these checks: the first check happens soon; if it still fails, the next check is delayed further, etc. Use ROTATING_PROXY_BACKOFF_BASE to adjust the initial delay (by default it is random, from 0 to 5 minutes).
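A randomized exponential backoff of this kind can be sketched as follows (the function name, doubling factor, and cap are illustrative assumptions; the library's exact formula may differ):

```python
import random

# Illustrative sketch: each failed re-check roughly doubles the maximum
# delay before the next one, with full random jitter.
def backoff_delay(attempt, base=300.0, cap=3600.0):
    """Delay in seconds before re-checking a dead proxy.
    base mirrors ROTATING_PROXY_BACKOFF_BASE; cap is an assumption."""
    max_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, max_delay)
```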
Settings
ROTATING_PROXY_LIST - a list of proxies to choose from;
ROTATING_PROXY_LOGSTATS_INTERVAL - stats logging interval in seconds, 30 by default;
ROTATING_PROXY_CLOSE_SPIDER - when True, the spider is stopped if there are no alive proxies. If False (default), all dead proxies are re-checked when no alive proxies remain.
ROTATING_PROXY_PAGE_RETRY_TIMES - the number of times to retry downloading a page using a different proxy. After this many retries, the failure is considered a page failure, not a proxy failure. Think of it this way: every improperly detected ban costs you ROTATING_PROXY_PAGE_RETRY_TIMES alive proxies. Default: 5.
It is possible to change this option per-request using the max_proxies_to_try request.meta key - for example, you can use a higher value for certain pages if you're sure they should work.
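A minimal sketch of the per-request override (the helper function is hypothetical; only the 'max_proxies_to_try' meta key comes from the library):

```python
# Hypothetical helper: build a request meta dict with an optional
# per-request override of ROTATING_PROXY_PAGE_RETRY_TIMES.
def retry_meta(max_proxies_to_try=None):
    """Pass the result as scrapy.Request(url, meta=retry_meta(...))."""
    meta = {}
    if max_proxies_to_try is not None:
        meta['max_proxies_to_try'] = max_proxies_to_try
    return meta
```

For example, `scrapy.Request(url, meta=retry_meta(25))` would allow up to 25 proxies to be tried for that page.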
ROTATING_PROXY_BACKOFF_BASE - base backoff time, in seconds. Default is 300 (i.e. 5 min).
ROTATING_PROXY_BAN_POLICY - path to a ban detection policy. Default is 'rotating_proxies.policy.BanDetectionPolicy'.
FAQ
Q: Where to get proxy lists? How to write and maintain ban rules?
A: It is up to you to find proxies and maintain proper ban rules for web sites; scrapy-rotating-proxies doesn’t have anything built-in. There are commercial proxy services like https://crawlera.com/ which can integrate with Scrapy (see https://github.com/scrapy-plugins/scrapy-crawlera) and take care of all these details.
Contributing
source code: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
bug tracker: https://github.com/TeamHG-Memex/scrapy-rotating-proxies/issues
To run tests, install tox and run tox from the source checkout.
CHANGES
0.3 (2017-03-14)
redirects with empty bodies are no longer considered bans (thanks Diga Widyaprana).
ROTATING_PROXY_BAN_POLICY option allows customizing ban detection for all spiders.
0.2.3 (2017-03-03)
max_proxies_to_try request.meta key allows overriding the ROTATING_PROXY_PAGE_RETRY_TIMES option per-request.
0.2.2 (2017-03-01)
Update default ban detection rules: scrapy.exceptions.IgnoreRequest is not a ban.
0.2.1 (2017-02-08)
changed ROTATING_PROXY_PAGE_RETRY_TIMES default value - it is now 5.
0.2 (2017-02-07)
improved default ban detection rules;
log ban stats.
0.1 (2017-02-01)
Initial release
Hashes for scrapy-rotating-proxies-0.3.tar.gz (source distribution):
SHA256: 11bc36784bec7a0c56b2c290c80d7c2ed23c3875bd2271737b2fb443f8cb5f85
MD5: d16f4f7ea4dbbde693f9eaf08f406612
BLAKE2b-256: 86163e91efe3b958f3a3697e7a51e669ed2d9282e0300ab027a90b3d98927095
Hashes for scrapy_rotating_proxies-0.3-py2.py3-none-any.whl (built distribution):
SHA256: 4eb97b668aec81d406425766d4fae79117a0eb81bd616fb04663d88139a32c9b
MD5: c0f397345d427eaeea51d02a1d4adeec
BLAKE2b-256: 37b682bb99ad8c0f626918cefa048949fc30454d0bdd1b87e796e4be8862b3c4