Overview
Scrapy is a great framework for web crawling. This package provides a highly customizable way to deal with exceptions raised in the downloader middleware because of the proxy, and uses signals to notify other components so they can handle the invalidated proxies (e.g. moving them to a blacklist, renewing the proxy pool).
This package supports two types of signals:
traditional signal (sync)
deferred signal (async)
Please refer to the Scrapy and Twisted documentation for more details.
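As a rough, pure-Python illustration of the difference (no Scrapy or Twisted required; all names here are illustrative, not this package's API): a sync signal calls its handlers inline, while a deferred signal's handlers return awaitables that the sender waits on before continuing.

```python
# Minimal analogue of the two signal flavours. Names are illustrative only.
import asyncio

class SignalManager:
    def __init__(self):
        self.sync_handlers = []
        self.deferred_handlers = []

    def connect(self, handler, deferred=False):
        (self.deferred_handlers if deferred else self.sync_handlers).append(handler)

    def send(self, **kwargs):
        # Traditional (sync) signal: every handler runs inline.
        return [h(**kwargs) for h in self.sync_handlers]

    async def send_deferred(self, **kwargs):
        # Deferred (async) signal: handlers may be coroutines; the sender
        # waits for all of them to finish before continuing.
        return await asyncio.gather(*(h(**kwargs) for h in self.deferred_handlers))

def on_proxy_invalid(proxy):
    return f"blacklisted {proxy}"

async def renew_pool(proxy):
    await asyncio.sleep(0)  # stand-in for an async call to a proxy provider
    return f"renewed pool after {proxy}"

signals = SignalManager()
signals.connect(on_proxy_invalid)
signals.connect(renew_pool, deferred=True)
sync_results = signals.send(proxy="http://1.2.3.4:8080")
deferred_results = asyncio.run(signals.send_deferred(proxy="http://1.2.3.4:8080"))
```

In Scrapy itself, sync signals are dispatched with `send_catch_log` and deferred signals with `send_catch_log_deferred`, which is what allows handlers to perform asynchronous work.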
Requirements
Scrapy
Tested on Python 3.5
Tested on Linux, but as a pure Python module it should work on any other platform with official Python and Twisted support
Installation
The quick way:
pip install -U scrapy-proxy-validation
Alternatively, place this middleware directly inside your Scrapy project.
Documentation
Enable this middleware in DOWNLOADER_MIDDLEWARES in settings.py, for example:
from scrapy_proxy_validation.downloadermiddlewares.proxy_validation import Validation

DOWNLOADER_MIDDLEWARES.update({
    'scrapy_proxy_validation.downloadermiddlewares.proxy_validation.ProxyValidation': 751
})

SIGNALS = [Validation(exception='twisted.internet.error.ConnectionRefusedError',
                      signal='scrapy.signals.spider_closed'),
           Validation(exception='twisted.internet.error.ConnectionLost',
                      signal='scrapy.signals.spider_closed',
                      signal_deferred='scrapy.signals.spider_closed',
                      limit=5)]

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'
Settings Reference
SIGNALS
A list of Validation instances, each specifying the exception to catch, the sync signal to send, the deferred (async) signal to send, and the limit on how many times it can be triggered.
RECYCLE_REQUEST
A function to recycle a request that has trouble with its proxy; it takes a request as input and returns a request as output.
Note: remember to set ``dont_filter`` to ``True``, or the duplicate filter will drop this request.
Built-in Functions
scrapy_proxy_validation.utils.recycle_request.recycle_request
This is a built-in function to recycle a request that has a problem with its proxy.
It removes the ``proxy`` key from ``meta`` and sets ``dont_filter`` to ``True``.
To use this function, in settings.py:
RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'
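The behaviour described above can be sketched as follows. ``MiniRequest`` is a stand-in class so the sketch runs without Scrapy installed; a real implementation would operate on ``scrapy.http.Request``.

```python
# Sketch of a RECYCLE_REQUEST callable following the contract described above:
# request in, request out, with the failing proxy dropped and dont_filter set.

class MiniRequest:
    """Tiny stand-in for scrapy.http.Request (illustrative only)."""
    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = dict(meta or {})
        self.dont_filter = dont_filter

def recycle_request(request):
    request.meta.pop('proxy', None)  # drop the proxy that caused the failure
    request.dont_filter = True       # otherwise the duplicate filter drops the retry
    return request

req = recycle_request(MiniRequest('http://example.com',
                                  meta={'proxy': 'http://1.2.3.4:8080'}))
```

A custom function set via RECYCLE_REQUEST can follow the same shape, e.g. to also record the failed proxy before returning the request.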
Note
There could be many different proxy-related problems, so it will take some time to collect them all and add them to SIGNALS. Please be patient; this middleware is not a one-time solution for every case.
TODO
No ideas at the moment. Please let me know if you have any!
File details
Details for the file scrapy-proxy-validation-0.0.4.tar.gz.
File metadata
- Download URL: scrapy-proxy-validation-0.0.4.tar.gz
- Upload date:
- Size: 21.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f6e3d72bf3458fd712b4693073ad4f98c4e4db567c705a9106f3c1053c4369f |
| MD5 | 51ec3a913dfe488420607213fa3749b7 |
| BLAKE2b-256 | a39cdf6b9e97074b47405427ffc93e48746ec44e2004810a976fc695b60bcc07 |
File details
Details for the file scrapy_proxy_validation-0.0.4-py3-none-any.whl.
File metadata
- Download URL: scrapy_proxy_validation-0.0.4-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ebc6c9882d5e3ca47cab4d78a504aed8ae80915af0715310c82baf0e426a589 |
| MD5 | e6e8eada3ede813ddd969688e7cf4e8e |
| BLAKE2b-256 | 130ba436a8ea64215bf1b14ffc8c1b3097b01301e96011b78a9c33f136f2de72 |