Scrapy Toolkit
scrapy-zen
A toolkit for Scrapy that provides multiple output pipelines, monitoring capabilities, and enhanced request handling.
Features
- Unified download handler
- Request pre-processing and deduplication
- Item pre-processing and deduplication
- Spidermon integration for monitoring
- Support for Playwright, Browser Impersonation, and Zyte API
- Multiple output pipelines (Discord, Telegram, WebSocket, HTTP, gRPC, Synoptic)
Installation
pip install scrapy-zen[all]
Available extras:
- grpc: gRPC pipeline dependencies
- websocket: WebSocket pipeline dependencies
- monitoring: Spidermon integration
- playwright: Playwright support
- impersonate: Browser impersonation support
- zyte: Zyte API support
Configuration
settings.py
DB_EXPIRY_DAYS = 30 # Optional, defaults to 30 days
The following settings need to be configured in your .env file:
.env
# Database settings (required for deduplication)
DB_NAME = "your_db_name"
DB_USER = "your_db_user"
DB_PASS = "your_db_password"
DB_HOST = "localhost"
DB_PORT = "5432"
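Because deduplication depends on all five database values, it can help to check them before a crawl starts. The following is a minimal sketch, assuming the .env values end up in the process environment; `missing_db_settings` is a hypothetical helper for illustration and is not part of scrapy-zen.

```python
import os

# Names taken from the .env example above.
REQUIRED_DB_SETTINGS = ("DB_NAME", "DB_USER", "DB_PASS", "DB_HOST", "DB_PORT")


def missing_db_settings(env=None):
    """Return the required DB setting names that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_DB_SETTINGS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_db_settings()
    if missing:
        print("Missing database settings:", ", ".join(missing))
```

An empty list means every required value is present.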
Optional Pipeline Settings
Discord Pipeline
DISCORD_SERVER_URI = "your_discord_webhook_url"
Synoptic Pipeline
SYNOPTIC_SERVER_URI = "your_synoptic_server_url"
SYNOPTIC_STREAM_ID = "your_stream_id"
SYNOPTIC_API_KEY = "your_api_key"
Telegram Pipeline
TELEGRAM_SERVER_URI = "your_telegram_api_url"
TELEGRAM_TOKEN = "your_bot_token"
TELEGRAM_CHAT_ID = "your_chat_id"
gRPC Pipeline
GRPC_SERVER_URI = "your_grpc_server"
GRPC_TOKEN = "your_token"
GRPC_ID = "your_id"
GRPC_PROTO_MODULE = "your_proto_module"
WebSocket Pipeline
WS_SERVER_URI = "your_websocket_server_url"
HTTP Pipeline
HTTP_SERVER_URI = "your_http_server_url"
HTTP_TOKEN = "your_auth_token"
Zyte & Playwright Settings
settings.py
# Playwright settings
PLAYWRIGHT_ABORT_REQUEST = lambda req: req.resource_type == "image"
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
# Zyte API
ZYTE_ENABLED = True # Enable Zyte API integration
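The lambda above aborts image requests only. If more resource types should be skipped, a named function is easier to read and test. This is a sketch under the assumption that the setting is passed through to scrapy-playwright unchanged, so the callable receives a Playwright request object with a `resource_type` attribute; `BLOCKED_RESOURCE_TYPES` and `should_abort_request` are illustrative names.

```python
# Resource types to skip; "image", "media", and "font" are Playwright resource type names.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_abort_request(request):
    """Return True when the request should be aborted before it is sent."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


# In settings.py, replacing the lambda:
PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```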
Monitoring Settings
settings.py
# Spidermon settings
SPIDERMON_ENABLED = True
SPIDERMON_MAX_ERRORS = 0
SPIDERMON_MAX_CRITICALS = 0
SPIDERMON_MAX_DOWNLOADER_EXCEPTIONS = 0
SPIDERMON_UNWANTED_HTTP_CODES = {403: 0, 429: 0}
# Discord notifications
SPIDERMON_DISCORD_WEBHOOK_URL = "your_discord_webhook"
# Telegram notifications (disabled at the moment)
SPIDERMON_TELEGRAM_SENDER_TOKEN = "your_telegram_token"
SPIDERMON_TELEGRAM_RECIPIENTS = ["your_chat_id"]
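To make the threshold semantics concrete: with `SPIDERMON_UNWANTED_HTTP_CODES = {403: 0, 429: 0}`, a single 403 or 429 response is enough to fail the check. The sketch below is a simplified illustration of that comparison against Scrapy's per-status response counters, not Spidermon's actual implementation; `exceeded_http_codes` is a hypothetical helper.

```python
def exceeded_http_codes(stats, unwanted_codes):
    """Return {status_code: observed_count} for codes above their allowed maximum.

    `stats` is a mapping of Scrapy stat names to values, and
    `unwanted_codes` maps each status code to the maximum allowed count.
    """
    exceeded = {}
    for code, max_allowed in unwanted_codes.items():
        count = stats.get(f"downloader/response_status_count/{code}", 0)
        if count > max_allowed:
            exceeded[code] = count
    return exceeded


if __name__ == "__main__":
    sample_stats = {
        "downloader/response_status_count/200": 50,
        "downloader/response_status_count/403": 2,
    }
    print(exceeded_http_codes(sample_stats, {403: 0, 429: 0}))
```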
Addons
ZenAddon
Provides a plug-and-play experience by configuring all of the settings above except monitoring.
SpidermonAddon
Provides a plug-and-play experience by configuring the monitoring settings.
ADDONS = {
    "scrapy_zen.addons.ZenAddon": 1,
    "scrapy_zen.addons.SpidermonAddon": 2,
}
Usage
ADDONS = {
    "scrapy_zen.addons.ZenAddon": 1,
    "scrapy_zen.addons.SpidermonAddon": 2,
}
ITEM_PIPELINES = {
    "scrapy_zen.pipelines.PreProcessingPipeline": 100,
    "scrapy_zen.pipelines.DiscordPipeline": 200,
    "scrapy_zen.pipelines.TelegramPipeline": 300,
    "scrapy_zen.pipelines.WSPipeline": 400,
    "scrapy_zen.pipelines.GRPCPipeline": 500,
    "scrapy_zen.pipelines.HttpPipeline": 600,
    "scrapy_zen.pipelines.SynopticPipeline": 700,
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zen.middlewares.PreProcessingMiddleware": 100,
}
# Inside a spider callback
yield Request(
    url="http://example.com",
    meta={
        "_id": "unique_id",  # For deduplication
        "_dt": "2024-01-01",  # For date filtering
        "_dt_format": "%Y-%m-%d",  # Optional date format
    },
)
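The `_dt` and `_dt_format` keys suggest that requests can be filtered by date. The sketch below shows one way such a check could work; it is not scrapy-zen's implementation. The `is_recent` helper, the `max_age_days` parameter, and the fallback format used when `_dt_format` is omitted are all assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def is_recent(meta, max_age_days, today=None):
    """Return True if meta["_dt"] is at most `max_age_days` days before `today`."""
    today = today or date.today()
    # Assumption: fall back to ISO-style dates when no format is given.
    fmt = meta.get("_dt_format", "%Y-%m-%d")
    item_date = datetime.strptime(meta["_dt"], fmt).date()
    return today - item_date <= timedelta(days=max_age_days)


if __name__ == "__main__":
    meta = {"_id": "unique_id", "_dt": "2024-01-01", "_dt_format": "%Y-%m-%d"}
    print(is_recent(meta, max_age_days=30, today=date(2024, 1, 15)))
```

Passing `today` explicitly keeps the check deterministic in tests.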