Project description
Overview
Scrapy is a great framework for web crawling. This package provides a spider middleware that inspects whether the spider has been blocked, in a highly customizable way.
Requirements
Tested on Python 2.7 and Python 3.5, but it should work on other versions from Python 3.3 onward.
Tested on Linux, but since it is a pure-Python module it should work on any platform with official Python support, e.g. Windows, macOS, BSD.
Installation
The quick way:
pip install scrapy-block-inspector
Alternatively, place this middleware's source alongside your Scrapy project.
Documentation
Enable Block Inspector as a spider middleware in settings.py, for example:
```python
# -----------------------------------------------------------------------------
# BLOCK INSPECTOR
# -----------------------------------------------------------------------------
SPIDER_MIDDLEWARES.update({
    'scrapy_block_inspector.spidermiddlewares.block_inspector.BlockInspector': 500,
})

BLOCK_INSPECTOR = 'scrapy_project.spiders.spider.inspect_block'
BLOCK_SIGNALS = ['scrapy_rotated_proxy.signals.proxy_block']
BLOCK_SIGNALS_DEFERRED = ['scrapy_httpcache.signals.response_block']
RECYCLE_BLOCK_REQUEST = 'scrapy_project.utils.recycle_block_request'
```
This middleware adds a new stat to the stats collector, named ‘block_inspector/block’.
Settings Reference
BLOCK_INSPECTOR
A function used by the spider middleware to inspect for a block: it returns True if the response is blocked, and False otherwise.
The input of this function is the response.
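As a sketch, such an inspector might look like the following (the function name, status codes, and body marker below are illustrative assumptions, not part of the package):

```python
def inspect_block(response):
    """Return True if the response looks blocked, False otherwise."""
    # Treat common anti-bot status codes as a block (illustrative choice).
    if response.status in (403, 429, 503):
        return True
    # A site-specific marker in the page body (hypothetical example).
    if b'Access Denied' in response.body:
        return True
    return False
```

Point BLOCK_INSPECTOR at this function's import path so the middleware can call it for every response.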
BLOCK_SIGNALS
When a block is detected, this spider middleware can send a signal through the crawler's signal manager so that other components (middlewares, extensions, stats, etc.) can perform related operations.
This should be a list.
BLOCK_SIGNALS_DEFERRED
If a signal is connected to a function or method that returns a Deferred, that signal should be listed in this setting instead.
This should be a list.
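For illustration, a Scrapy signal is just a unique sentinel object; whether it belongs in BLOCK_SIGNALS or BLOCK_SIGNALS_DEFERRED depends on whether its connected handlers return Deferreds. The module path and names below are hypothetical:

```python
# my_project/signals.py -- hypothetical module; a Scrapy signal is simply
# a unique sentinel object that handlers are connected to by identity.

proxy_block = object()     # handlers return plain values: list in BLOCK_SIGNALS
response_block = object()  # handlers return Deferreds: list in
                           # BLOCK_SIGNALS_DEFERRED instead
```

The settings would then reference them by import path, e.g. BLOCK_SIGNALS = ['my_project.signals.proxy_block'].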
RECYCLE_BLOCK_REQUEST
A function to recycle a blocked request. Sometimes a blocked request needs some further treatment before being recycled, such as removing proxy-related keys from request.meta.
Note: this middleware adds ‘dont_filter=True’ to the recycled request automatically.
The input of this function is the request.
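A minimal recycle function might look like this (the meta keys removed below are illustrative; adapt them to whatever proxy middleware you use):

```python
def recycle_block_request(request):
    """Return a copy of the blocked request with proxy state stripped."""
    # Drop proxy-related meta so a fresh proxy/slot can be assigned on retry;
    # the middleware itself adds dont_filter=True, so we need not set it here.
    meta = {k: v for k, v in request.meta.items()
            if k not in ('proxy', 'download_slot')}
    return request.replace(meta=meta)
```

Point RECYCLE_BLOCK_REQUEST at this function's import path.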
Built-in Functions To Inspect Block
inspect_block_google_recaptcha
This function checks for a Google reCAPTCHA block.
To use this inspector, in settings:
BLOCK_INSPECTOR = 'scrapy_block_inspector.utils.inspect_block_google_recaptcha.inspect_block'
NOTE
Please note: in Scrapy, an exception raised in the process_spider_input method is sent to the request's errback first, if one is defined. Make sure the BlockException defined by this middleware is re-raised in your errback function so that the process_spider_exception method is triggered correctly.
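A sketch of such an errback follows. The BlockException class here is a stand-in (in a real project, import the actual class from this middleware's package); failure.check and failure.raiseException are Twisted's Failure API:

```python
class BlockException(Exception):
    """Stand-in only: import the real class from scrapy_block_inspector."""

def errback(failure):
    # failure.check(cls) returns the exception class if the wrapped
    # exception is an instance of it, else None (Twisted Failure API).
    if failure.check(BlockException):
        # Re-raise so the middleware's process_spider_exception runs.
        failure.raiseException()
    # ... handle non-block errors here (log, retry, etc.) ...
```

Attach this errback when building requests, e.g. Request(url, callback=self.parse, errback=errback).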
TODO
Hashes for scrapy-block-inspector-0.0.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d83ad1ec75aca5ea31b475b4c6025cff05595acafe64abd30b12bc8ae56c901 |
| MD5 | e9515d8d047f37c1fe4bba6f9423373d |
| BLAKE2b-256 | 6b66c4e64355541587ebfc38215b23fc395560f75b9ddb8067dffe2b81c8b610 |

Hashes for scrapy_block_inspector-0.0.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | d3df02536e846faf7d1f03f54179e1f8e3de557699bffcd90162d302eba66fde |
| MD5 | 3830ed55e38293fbd43e1c30c52ef865 |
| BLAKE2b-256 | 137e1d51ca163c04eda0f40f5817c1bb010244dec11510289f514a3285baa152 |

Hashes for scrapy_block_inspector-0.0.2-py2-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d25e4b3adbe0160898036011ab86c052f4f4f8a1480cedb25bf2f1fae7a10ad |
| MD5 | 9577f94a207a9b39c53a9f78a6f5bc95 |
| BLAKE2b-256 | 272051d5730140b88445244128e543aefeae4344daab27d30bbf96a0a593424f |