
Overview

Scrapy is a great framework for web crawling. This package provides a highly customizable way to handle exceptions raised in the downloader middleware because of a proxy, and sends a signal to notify interested components so they can deal with the invalidated proxy (e.g. move it to a blacklist, renew the proxy pool).

There are two types of signals this package supports:

  • traditional signal, sync

  • deferred signal, async

Please refer to the Scrapy and Twisted documentation for details on signals and deferreds.
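The difference between the two can be illustrated with a small stand-in dispatcher. This is a simplified sketch using only the standard library, not Scrapy's actual signal API: a sync handler runs inline before the sender continues, while a deferred (async) handler returns an awaitable the sender waits on.

```python
import asyncio

# Stand-in handlers for an invalidated proxy. Names and behavior here
# are illustrative assumptions, not part of this package's API.

def sync_blacklist(proxy, blacklist):
    """Traditional (sync) handler: runs immediately, in-line."""
    blacklist.add(proxy)

async def deferred_renew(pool):
    """Deferred (async) handler: may do I/O before finishing."""
    await asyncio.sleep(0)  # e.g. fetch fresh proxies over the network
    pool.append('http://new-proxy:8080')

async def fire(proxy, blacklist, pool):
    # The sender invokes the sync handler directly ...
    sync_blacklist(proxy, blacklist)
    # ... and awaits the deferred handler before moving on.
    await deferred_renew(pool)

blacklist, pool = set(), []
asyncio.run(fire('http://bad-proxy:8080', blacklist, pool))
```

In Scrapy itself, the analogous dispatch is done through the crawler's signal manager (sync vs. deferred sending), with Twisted deferreds instead of asyncio.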

Requirements

  • Scrapy

  • Tested on Python 3.5

  • Tested on Linux; as a pure-Python module it should work on any other platform with official Python and Twisted support

Installation

The quick way:

pip install -U scrapy-proxy-validation

Alternatively, download the source and place the package alongside your Scrapy project.

Documentation

Enable this middleware in DOWNLOADER_MIDDLEWARES in settings.py, for example:

from scrapy_proxy_validation.downloadermiddlewares.proxy_validation import Validation

DOWNLOADER_MIDDLEWARES.update({
    'scrapy_proxy_validation.downloadermiddlewares.proxy_validation.ProxyValidation': 751
})

SIGNALS = [Validation(exception='twisted.internet.error.ConnectionRefusedError',
                      signal='scrapy.signals.spider_closed'),
           Validation(exception='twisted.internet.error.ConnectionLost',
                      signal='scrapy.signals.spider_closed',
                      signal_deferred='scrapy.signals.spider_closed',
                      limit=5)]

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Settings Reference

SIGNALS

A list of Validation instances, each specifying the exception to handle, the sync signal to send, the async (deferred) signal to send, and the limit on how many times the exception may occur.

RECYCLE_REQUEST

A function to recycle a request that failed because of its proxy; it takes the request as input and returns a (possibly modified) request.

Note: remember to set ``dont_filter`` to ``True``, or the duplicate filter will drop the recycled request.
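A custom recycle function can be sketched as follows. In a real project the argument is a scrapy.Request; the FakeRequest class below is only a stand-in that mimics the two pieces the function touches (``meta`` and ``replace``), so the sketch runs without Scrapy installed.

```python
from dataclasses import dataclass, field, replace as dc_replace

@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request (illustrative only)."""
    url: str
    meta: dict = field(default_factory=dict)
    dont_filter: bool = False

    def replace(self, **kwargs):
        # scrapy.Request.replace returns a copy with the given
        # attributes changed; dataclasses.replace does the same here.
        return dc_replace(self, **kwargs)

def recycle_request(request):
    """Drop the failed proxy and resubmit the request unfiltered."""
    request.meta.pop('proxy', None)           # remove the bad proxy
    return request.replace(dont_filter=True)  # bypass the duplicate filter

req = FakeRequest('http://example.com', meta={'proxy': 'http://1.2.3.4:8080'})
new = recycle_request(req)
```

Point RECYCLE_REQUEST at the import path of such a function; the built-in ``recycle_request`` described below does essentially this.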

Built-in Functions

scrapy_proxy_validation.utils.recycle_request.recycle_request

This built-in function recycles a request that had a problem with its proxy.

It removes the proxy key from the request's meta and sets dont_filter to True.

To use this function, in settings.py:

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Note

There are many different problems a proxy can cause, so it will take some time to collect them all and add them to SIGNALS. Please be patient: this middleware is not a one-time solution for every case.

TODO

No idea, please let me know if you have!
