Overview
Scrapy is a great framework for web crawling. This package provides a highly customizable way to handle exceptions raised in the downloader middleware because of proxy failures, and uses a signal to notify other components to deal with the invalidated proxies (e.g. moving them to a blacklist, renewing the proxy pool).
There are two types of signals this package supports:
traditional signal, sync
deferred signal, async
Please refer to the Scrapy and Twisted documentation for details on both.
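The difference between the two can be sketched without Scrapy or Twisted installed: a sync handler does its work inline when the signal is sent, while a deferred (async) handler returns an object that completes later. `MiniDeferred`, the handler names, and the proxy URLs below are illustrative stand-ins, not this package's API:

```python
class MiniDeferred:
    """Tiny stand-in for twisted.internet.defer.Deferred (illustration only)."""
    def __init__(self):
        self._callbacks = []

    def addCallback(self, fn):
        self._callbacks.append(fn)
        return self

    def callback(self, result):
        # Fired later, e.g. by the event loop, once the async work is done.
        for fn in self._callbacks:
            result = fn(result)

blacklist = []

def sync_handler(proxy):
    # "traditional signal, sync": runs inline when the signal is sent
    blacklist.append(proxy)

def deferred_handler(proxy):
    # "deferred signal, async": returns a Deferred; the work happens
    # only when the Deferred is fired
    d = MiniDeferred()
    d.addCallback(lambda _: blacklist.append(proxy))
    return d

sync_handler('http://1.2.3.4:8080')          # blacklisted immediately
d = deferred_handler('http://5.6.7.8:8080')  # nothing happens yet
d.callback(None)                             # framework fires it later
```

After both handlers run, the blacklist holds both proxies; the only difference is when the work happened.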
Requirements
Scrapy
Tested on Python 3.5
Tested on Linux, but as a pure Python module it should work on any other platform with official Python and Twisted support
Installation
The quick way:
pip install -U scrapy-proxy-validation
Or put this middleware alongside your Scrapy project and import it directly.
Documentation
Set this middleware in DOWNLOADER_MIDDLEWARES in settings.py, for example:
from scrapy_proxy_validation.downloadermiddlewares.proxy_validation import Validation

DOWNLOADER_MIDDLEWARES.update({
    'scrapy_proxy_validation.downloadermiddlewares.proxy_validation.ProxyValidation': 751
})

SIGNALS = [Validation(exception='twisted.internet.error.ConnectionRefusedError',
                      signal='scrapy.signals.spider_closed'),
           Validation(exception='twisted.internet.error.ConnectionLost',
                      signal='scrapy.signals.spider_closed',
                      signal_deferred='scrapy.signals.spider_closed',
                      limit=5)]

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'
Settings Reference
SIGNALS
A list of Validation objects, each specifying the exception it handles, the sync signal it sends, the async (deferred) signal it sends, and the limit on how many times it fires.
RECYCLE_REQUEST
A function to recycle a request that has trouble with its proxy; it takes a request as input and returns a request as output.
Note: remember to set ``dont_filter`` to ``True``, or the duplicate filter will drop this request.
Built-in Functions
scrapy_proxy_validation.utils.recycle_request.recycle_request
This is a built-in function to recycle the request which has a problem with the proxy.
This function removes the ``proxy`` keyword from ``meta`` and sets ``dont_filter`` to ``True``.
To use this function, in settings.py:
RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'
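The function's behavior, as described above, can be sketched as follows. This is not the package's source; the `Request` class here is a minimal stand-in for `scrapy.Request`, used only so the sketch runs without Scrapy installed:

```python
class Request:
    """Minimal stand-in for scrapy.Request (illustration only)."""
    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = meta or {}
        self.dont_filter = dont_filter

def recycle_request(request):
    # Drop the failed proxy so the proxy middleware can assign a fresh
    # one, and bypass the duplicate filter so the retried request is
    # not silently discarded.
    request.meta.pop('proxy', None)
    request.dont_filter = True
    return request

req = recycle_request(Request('http://example.com',
                              meta={'proxy': 'http://0.0.0.0:8080'}))
print(req.meta, req.dont_filter)  # -> {} True
```

A custom RECYCLE_REQUEST function can follow the same shape, e.g. swapping in a fresh proxy from a pool instead of just removing the dead one.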
Note
There could be many different problems with proxies, so it will take some time to collect them all and add them to SIGNALS. Please be patient: this middleware is not a one-time solution for every case.
TODO
No ideas at the moment; please let me know if you have any!