Overview
Scrapy is a great framework for web crawling. This package provides a spider middleware that detects, in a highly customizable way, whether a spider has been blocked.
Requirements
Tested on Python 2.7 and Python 3.5, but it should work on other versions higher than Python 3.3.
Tested on Linux, but it is a pure Python module, so it should work on any other platform that Python officially supports, e.g. Windows, Mac OSX, BSD.
Installation
The quick way:
pip install scrapy-block-inspector
Or place this middleware alongside your Scrapy project.
Documentation
Enable Block Inspector as a spider middleware in settings.py, for example:
# -----------------------------------------------------------------------------
# BLOCK INSPECTOR
# -----------------------------------------------------------------------------
SPIDER_MIDDLEWARES.update({
    'scrapy_block_inspector.spidermiddlewares.block_inspector.BlockInspector': 500,
})
BLOCK_INSPECTOR = 'scrapy_project.spiders.spider.inspect_block'
BLOCK_SIGNALS = ['scrapy_rotated_proxy.signals.proxy_block']
BLOCK_SIGNALS_DEFERRED = ['scrapy_httpcache.signals.response_block']
RECYCLE_BLOCK_REQUEST = 'scrapy_project.utils.recycle_block_request'
This middleware adds a new stat to the stats collector, named ‘block_inspector/block’.
Settings Reference
BLOCK_INSPECTOR
The function the spider middleware calls to inspect a response for a block. It returns True if the response is blocked and False otherwise.
The input of this function is the response.
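As a rough illustration of the contract described above, an inspector could look like the sketch below. The status codes and marker phrases are assumptions chosen for the example, not values used by this package; adjust them to the site being crawled.

```python
# Hypothetical values: tune these to the target site.
BLOCK_STATUSES = {403, 429}
BLOCK_MARKERS = ("captcha", "access denied")


def inspect_block(response):
    """Return True if the response looks like a block page, otherwise False."""
    if response.status in BLOCK_STATUSES:
        return True
    # Non-text responses do not expose .text, so fall back to an empty string.
    body = getattr(response, "text", "") or ""
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

If this function lived in scrapy_project/spiders/spider.py, the setting would be BLOCK_INSPECTOR = 'scrapy_project.spiders.spider.inspect_block', as in the example settings.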
BLOCK_SIGNALS
When a block is detected, this spider middleware can send signals through the crawler's signal manager so that other components (middlewares, extensions, stats, etc.) can react accordingly.
This should be a list.
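For a project that defines its own block signal, the receiving side might be sketched as follows. The module layout, the signal name proxy_block, and the BlockCounter extension are all hypothetical; the arguments this middleware sends with the signal are not documented here, so the handler accepts any keyword arguments.

```python
# A Scrapy signal is just a unique object used as an identifier.
proxy_block = object()


class BlockCounter:
    """Hypothetical extension that counts how often a block is signalled."""

    def __init__(self):
        self.blocks_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # crawler.signals is Scrapy's SignalManager.
        crawler.signals.connect(ext.on_block, signal=proxy_block)
        return ext

    def on_block(self, **kwargs):
        # Accept whatever keyword arguments accompany the signal.
        self.blocks_seen += 1
```

With the signal defined in a module such as myproject/signals.py (an assumed path), the setting would list its import path: BLOCK_SIGNALS = ['myproject.signals.proxy_block'].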
BLOCK_SIGNALS_DEFERRED
If a signal is connected to a function or method that returns a Deferred object, put that signal in this setting instead.
This should be a list.
RECYCLE_BLOCK_REQUEST
A function that recycles the blocked request. Sometimes a blocked request needs further treatment before it is recycled, such as removing proxy-related keys from request.meta.
Note: this middleware adds ‘dont_filter=True’ automatically.
The input of this function is the request.
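A minimal sketch of such a function is shown below. It removes the 'proxy' key, which is the meta key read by Scrapy's HttpProxyMiddleware. Returning the request is an assumption about what the middleware expects, since only the input is documented here.

```python
def recycle_block_request(request):
    """Strip proxy state from a blocked request so it can be scheduled again."""
    # Drop the proxy that was blocked; a proxy middleware can assign a new one.
    request.meta.pop("proxy", None)
    # Assumption: the middleware expects the cleaned request back.
    return request
```

Placed in scrapy_project/utils.py, this matches the example setting RECYCLE_BLOCK_REQUEST = 'scrapy_project.utils.recycle_block_request'.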
Built-in Functions To Inspect Block
inspect_block_google_recaptcha
This function checks for a Google reCAPTCHA block.
To use this inspector, add to your settings:
BLOCK_INSPECTOR = 'scrapy_block_inspector.utils.inspect_block_google_recaptcha.inspect_block'
NOTE
Please note: in Scrapy, an exception raised by the method process_spider_input is sent to the request's errback first, if an errback is defined. Make sure the BlockException defined by this middleware is re-raised in the errback function so that the method process_spider_exception is triggered correctly.
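One way to follow this note is an errback that lets block errors through and handles everything else itself. In the sketch below, the BlockException class is a stand-in so the example is self-contained; in a real project, import the exception from this package instead (its import path is not given here). The function name handle_error is also hypothetical.

```python
class BlockException(Exception):
    """Stand-in for the middleware's BlockException; import the real one instead."""


def handle_error(failure):
    """Errback: let block errors propagate, handle everything else locally."""
    # failure is a twisted.python.failure.Failure; check() returns the matched
    # exception type, or None when nothing matches.
    if failure.check(BlockException):
        failure.raiseException()  # re-raise the original BlockException
    return None  # other errors stop here; log or retry them as needed
```

The errback is attached per request, for example scrapy.Request(url, callback=self.parse, errback=handle_error).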
TODO