Overview

Scrapy is a great framework for web crawling. This package provides a spider middleware that checks whether the spider has been blocked, using a detection function you can customize.

Requirements

  • Tested on Python 2.7 and Python 3.5; it should also work on other versions higher than Python 3.3

  • Tested on Linux; since it is a pure Python module, it should also work on any other platform that Python officially supports, e.g. Windows, Mac OS X, BSD

Installation

The quick way:

pip install scrapy-block-inspector

Alternatively, place this middleware package alongside your Scrapy project.

Documentation

Enable the Block Inspector spider middleware in settings.py, for example:

# -----------------------------------------------------------------------------
# BLOCK INSPECTOR
# -----------------------------------------------------------------------------

SPIDER_MIDDLEWARES.update({
    'scrapy_block_inspector.spidermiddlewares.block_inspector.BlockInspector': 500,
})
BLOCK_INSPECTOR = 'scrapy_project.spiders.spider.inspect_block'
BLOCK_SIGNALS = ['scrapy_rotated_proxy.signals.proxy_block']
BLOCK_SIGNALS_DEFERRED = ['scrapy_httpcache.signals.response_block']
RECYCLE_BLOCK_REQUEST = 'scrapy_project.utils.recycle_block_request'

This middleware adds a new stat to the stats collector, named ‘block_inspector/block’.
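As a rough illustration, the stat can be read back from the stats collector, for example in an extension or at spider close. The sketch below assumes that ‘block_inspector/block’ is a numeric counter (this README does not say so explicitly) and compares it with Scrapy's own ‘response_received_count’ stat. The helper name block_ratio is hypothetical.

```python
def block_ratio(stats):
    """Return the share of received responses that were flagged as blocked.

    `stats` is expected to behave like Scrapy's stats collector, i.e. to
    offer get_value(key, default).
    """
    # Assumption: the middleware stores a numeric count under this key.
    blocked = stats.get_value("block_inspector/block", 0)
    received = stats.get_value("response_received_count", 0)
    # Avoid dividing by zero when nothing has been downloaded yet.
    return blocked / received if received else 0.0
```

Inside a running crawl, the collector is available as crawler.stats, so the call would look like block_ratio(crawler.stats).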

Settings Reference

BLOCK_INSPECTOR

The import path of a function that the spider middleware calls to detect a block. The function should return True if the response indicates a block, and False otherwise.

The input of this function is the response.
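A minimal sketch of such a function is shown below. The status codes and text markers are hypothetical values picked for illustration, and the function assumes a text response that exposes status and text, as Scrapy's HTML responses do.

```python
from types import SimpleNamespace

# Hypothetical values chosen for illustration; tune them for the target site.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied")


def inspect_block(response):
    """Return True if the response looks like a block page, False otherwise."""
    if response.status in BLOCK_STATUS_CODES:
        return True
    # Case-insensitive search for known block-page phrases in the body.
    body = response.text.lower()
    return any(marker in body for marker in BLOCK_MARKERS)


# Quick check with stand-in objects that mimic the two attributes used above.
blocked = SimpleNamespace(status=429, text="")
normal = SimpleNamespace(status=200, text="<html>Welcome</html>")
print(inspect_block(blocked), inspect_block(normal))
```

If this function lived in scrapy_project/spiders/spider.py, the setting would be the dotted path 'scrapy_project.spiders.spider.inspect_block', as in the example settings.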

BLOCK_SIGNALS

When a block is detected, this spider middleware can send signals through the crawler's signal manager so that other components (middlewares, extensions, stats, etc.) can react to it.

This should be a list.
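The sketch below shows one way a project could define its own block signal and listen for it. Scrapy signals are plain sentinel objects, and crawler.signals.connect is the standard way to subscribe. The names block_detected and BlockCounter are hypothetical, and because this README does not document which keyword arguments the middleware sends with the signal, the handler accepts any.

```python
# Hypothetical project-level signal; Scrapy signals are plain sentinel objects.
block_detected = object()


class BlockCounter:
    """Hypothetical extension that counts how often a block is signalled."""

    def __init__(self):
        self.blocks_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Subscribe the handler to the custom signal.
        crawler.signals.connect(ext.on_block, signal=block_detected)
        return ext

    def on_block(self, **kwargs):
        # Assumption: the arguments sent with the signal are unknown here,
        # so accept whatever is passed and only count the event.
        self.blocks_seen += 1
```

If this code lived in scrapy_project/signals.py, the setting would be BLOCK_SIGNALS = ['scrapy_project.signals.block_detected'], and the extension would also need to be enabled in the EXTENSIONS setting.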

BLOCK_SIGNALS_DEFERRED

If a signal is connected to a function or method that returns a Deferred object, list the signal in this setting instead.

This should be a list.

RECYCLE_BLOCK_REQUEST

The import path of a function that recycles the blocked request. Sometimes a blocked request needs further treatment before it is scheduled again, such as removing proxy-related keys from request.meta.

Note: this middleware adds ‘dont_filter=True’ automatically.

The input of this function is the request.
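A minimal sketch of a recycle function follows. It removes the proxy key that Scrapy's HTTP proxy middleware reads from request.meta. The README does not state what the function should return, so returning the request is an assumption, as is the PROXY_META_KEYS name.

```python
from types import SimpleNamespace

# Hypothetical list of proxy-related meta keys; extend it to match the
# proxy middleware in use.
PROXY_META_KEYS = ("proxy",)


def recycle_block_request(request):
    """Strip proxy-related keys so the retry can pick a fresh proxy."""
    for key in PROXY_META_KEYS:
        # pop with a default so a missing key is not an error.
        request.meta.pop(key, None)
    # Assumption: the middleware uses the returned request.
    return request


# Quick check with a stand-in object that only has a meta dict.
demo = SimpleNamespace(meta={"proxy": "http://10.0.0.1:8080", "depth": 2})
print(recycle_block_request(demo).meta)
```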

Built-in Functions To Inspect Block

inspect_block_google_recaptcha

This function checks for a Google reCAPTCHA block.

To use this inspector, in settings:

BLOCK_INSPECTOR = 'scrapy_block_inspector.utils.inspect_block_google_recaptcha.inspect_block'

NOTE

Please note: in Scrapy, an exception raised by the method process_spider_input is sent to the request's errback first, if an errback is defined. So make sure the BlockException defined by this middleware is re-raised in the errback function, so that the method process_spider_exception is triggered correctly.
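The sketch below shows an errback that lets the block exception through. The README does not give the import path of BlockException, so a local stand-in class is defined here; in a real project, import the class from scrapy_block_inspector instead. The name handle_error is hypothetical.

```python
class BlockException(Exception):
    """Stand-in for the middleware's BlockException (see the note above)."""


def handle_error(failure):
    """Errback that re-raises BlockException and handles everything else."""
    # failure.value holds the original exception instance.
    if isinstance(failure.value, BlockException):
        # Re-raise so the exception continues on to process_spider_exception.
        raise failure.value
    # Other failures (timeouts, DNS errors, ...) would be handled here,
    # for example by logging them.
    return None
```

The errback is attached in the usual way, e.g. scrapy.Request(url, callback=self.parse, errback=handle_error).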

TODO
