Scrapy pipeline & extensions for AP Cloudy (logs, stats, requests, items)
Project description
APCloudy Pipeline
A high-performance Scrapy integration that sends items, requests, logs, and spider statistics to your backend using unified batch processing and secure HMAC-based authentication.
Features
- Unified Data Sending - All data (items, requests, logs, stats) sent in a single API call
- Batch Processing - Automatic batching every 10 items (configurable) or on spider close
- Complete Request Tracking - Logs both successful and failed requests with detailed error information
- Spider Statistics - Comprehensive spider performance metrics
- Log Forwarding - Captures spider, user, and Scrapy internal logs
- Secure Authentication - HMAC-SHA256 signature-based API communication
- High Performance - Thread-safe data collection with minimal overhead
- Zero Configuration - Works out of the box with sensible defaults
Installation
pip install apcloudy-pipeline
Configuration
Add the following settings to your Scrapy project's settings.py file:
# Required: API credentials
APCLOUDY_URL = "https://your-api.com" # Base URL (webhook path added automatically)
APCLOUDY_API_KEY = "your_public_api_key"
APCLOUDY_SECRET_KEY = "your_secret_key"
JOB_ID = 123 # Can also be passed via spider args
# Optional: Batch size (default: 10)
APCLOUDY_BATCH_SIZE = 10 # Send data every 10 items
# Required: Item Pipeline
ITEM_PIPELINES = {
    "apcloudy_pipeline.pipelines.APCloudyItemPipeline": 300,
}
# Required: Error Middleware (to catch failed requests)
DOWNLOADER_MIDDLEWARES = {
    'apcloudy_pipeline.middleware.APCloudyErrorMiddleware': 50,
    # ... your other middlewares
}
# Required: Extensions for request logging, logs, and stats
EXTENSIONS = {
    "apcloudy_pipeline.request_logger.APCloudyRequestLogger": 100,
    "apcloudy_pipeline.extensions.APCloudyLoggingExtension": 100,
    "apcloudy_pipeline.extensions.APCloudyStatsExtension": 100,
}
Architecture
Data Flow
┌──────────────────────────────────────────────────┐
│                 SPIDER EXECUTION                 │
└──────────────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
    ┌─────────┐    ┌──────────┐    ┌──────────┐
    │Requests │    │  Items   │    │   Logs   │
    └─────────┘    └──────────┘    └──────────┘
         │               │               │
         └───────────────┼───────────────┘
                         │
              ┌───────────────────┐
              │   DataCollector   │
              │   (Thread-Safe)   │
              └───────────────────┘
                         │
              ┌───────────────────┐
              │  Batch Trigger?   │
              │  • 10+ items      │
              │  • Spider closes  │
              └───────────────────┘
                         │
              ┌───────────────────┐
              │  APCloudyClient   │
              │   (HMAC Auth)     │
              └───────────────────┘
                         │
              ┌───────────────────┐
              │      Backend      │
              │   (Unified API)   │
              └───────────────────┘
API Payload Structure
All data is sent in a single unified payload:
{
  "job_id": "123",
  "data": {
    "requests": [
      {
        "url": "https://example.com/product",
        "method": "GET",
        "status_code": 200,
        "response_time": 1.23,
        "fingerprint": "f3045685b89f920b3faefc7d3df2d3c88bdab393",
        "error": null,
        "success": true
      },
      {
        "url": "https://example.com/error",
        "method": "GET",
        "status_code": 400,
        "response_time": 0.45,
        "fingerprint": "a1b2c3d4e5f6...",
        "error": "AskPablosAPIError: [400] Invalid JSON format",
        "success": false
      }
    ],
    "items": [
      {
        "title": "Product Name",
        "price": "99.99",
        "url": "https://example.com/product"
      }
    ],
    "logs": [
      {
        "level": "INFO",
        "message": "Spider started",
        "exception": null
      },
      {
        "level": "ERROR",
        "message": "Failed to parse page",
        "exception": "Traceback..."
      }
    ],
    "stats": {
      "item_scraped_count": 1,
      "request_count": 2,
      "response_received_count": 1,
      "downloader/exception_count": 1,
      "finish_time": "2026-01-06T11:28:46",
      "finish_reason": "finished"
    }
  }
}
Components
1. APCloudyItemPipeline
- Collects scraped items
- Triggers batch send every 10 items
- Sends remaining items on spider close
2. APCloudyRequestLogger
- Logs successful HTTP responses (including non-200 status codes)
- Tracks response time for each request
- Generates unique fingerprints for requests
3. APCloudyErrorMiddleware
- Captures failed requests and exceptions
- Extracts HTTP status codes from error messages
- Logs middleware errors (API errors, timeouts, network failures)
4. APCloudyLoggingExtension
- Captures all Python logging output
- Forwards spider, user, and Scrapy internal logs
- Includes exception tracebacks
5. APCloudyStatsExtension
- Collects comprehensive spider statistics
- Sent once at spider close
- Includes item counts, request metrics, timing, etc.
6. DataCollector
- Thread-safe central data storage
- Shared across all components
- Batches data before sending (see the sketch after this list)
7. APCloudyClient
- Handles HMAC-SHA256 authentication
- Sends unified data to backend
- Automatic signature generation
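The DataCollector's internal interface isn't part of the public API, but for intuition, a minimal sketch of the thread-safe buffering pattern described above might look like this (the class and method names are illustrative, not the package's actual implementation):

import threading

class SimpleDataCollector:
    """Illustrative thread-safe buffer for items, requests, and logs."""

    def __init__(self, batch_size=10):
        self._lock = threading.Lock()
        self._batch_size = batch_size
        self._items, self._requests, self._logs = [], [], []

    def add_item(self, item):
        # Append under the lock; return True when a batch is full,
        # signalling the caller to flush.
        with self._lock:
            self._items.append(item)
            return len(self._items) >= self._batch_size

    def drain(self):
        # Atomically swap out all buffered data for sending.
        with self._lock:
            data = {"items": self._items, "requests": self._requests, "logs": self._logs}
            self._items, self._requests, self._logs = [], [], []
        return data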
Authentication
The package uses HMAC-SHA256 for secure API communication:
Headers:
X-API-KEY: {your_public_key}
X-TIMESTAMP: {unix_timestamp}
X-SIGNATURE: {hmac_sha256(secret_key, timestamp + "." + json_body)}
Content-Type: application/json
Signature Calculation:
message = f"{timestamp}.{json_body}"
signature = HMAC_SHA256(secret_key, message)
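For example, a client could compute these headers in Python as follows. This is a minimal sketch: it assumes the secret key is a UTF-8 string, that json_body is the exact serialized payload sent on the wire, and that the signature is hex-encoded (the helper name build_auth_headers is illustrative):

import hashlib
import hmac
import json
import time

def build_auth_headers(api_key, secret_key, payload):
    # The signature covers the exact bytes sent on the wire, so this
    # serialization must match the request body byte-for-byte.
    json_body = json.dumps(payload)
    timestamp = str(int(time.time()))
    message = f"{timestamp}.{json_body}"
    signature = hmac.new(
        secret_key.encode("utf-8"),
        message.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()  # hex encoding assumed
    return {
        "X-API-KEY": api_key,
        "X-TIMESTAMP": timestamp,
        "X-SIGNATURE": signature,
        "Content-Type": "application/json",
    }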
Performance
- Reduced API Calls: 1 request per batch instead of N individual requests
- Configurable Batch Size: Set APCLOUDY_BATCH_SIZE in settings (default: 10 items)
- Thread-Safe: Handles concurrent data collection safely
- Minimal Overhead: Efficient data collection with locks
Requirements
- Python 3.8+
- Scrapy 2.0+
- requests
Advanced Configuration
Custom Batch Size
Control when data is sent by adjusting the batch size:
# Send every 50 items
APCLOUDY_BATCH_SIZE = 50
# Send every 100 items
APCLOUDY_BATCH_SIZE = 100
# Send immediately (batch size of 1)
APCLOUDY_BATCH_SIZE = 1
Note: Data is always sent when the spider closes, regardless of batch size.
Custom Job ID via Spider Args
scrapy crawl myspider -a JOB_ID=456
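Scrapy sets -a arguments as string attributes on the spider instance, so if your own code also needs the job ID, it can be read like this (a sketch only; this is not required for the pipeline to work):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # -a JOB_ID=456 arrives as the string '456'
        job_id = getattr(self, 'JOB_ID', None)
        self.logger.info("Running as job %s", job_id)
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        pass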
Backend Endpoint
The package automatically appends /api/webhook/consume to your base URL:
Base URL: https://your-api.com
Full endpoint: https://your-api.com/api/webhook/consume
Data is sent via:
POST {APCLOUDY_URL}/api/webhook/consume
Make sure your backend handles the unified payload structure at this endpoint.
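As a starting point, a receiving endpoint could look like the Flask sketch below. It is illustrative only: it assumes the signature is a hex-encoded HMAC-SHA256 over f"{timestamp}.{raw_body}" as described in the Authentication section, and the handler name is hypothetical:

import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
SECRET_KEY = "your_secret_key"  # must match APCLOUDY_SECRET_KEY

@app.post("/api/webhook/consume")
def consume():
    # Verify the HMAC over the raw request body before parsing it.
    raw_body = request.get_data(as_text=True)
    timestamp = request.headers.get("X-TIMESTAMP", "")
    signature = request.headers.get("X-SIGNATURE", "")
    expected = hmac.new(
        SECRET_KEY.encode("utf-8"),
        f"{timestamp}.{raw_body}".encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    if not hmac.compare_digest(expected, signature):  # constant-time compare
        abort(401)
    payload = request.get_json()
    data = payload.get("data", {})
    # ... persist data["items"], data["requests"], data["logs"], data["stats"]
    return {"status": "ok"}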
Example Spider
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'url': response.url
        }
No changes needed to your spider code! The pipeline automatically collects and sends all data.
Troubleshooting
Data Not Being Sent
- Check that all required settings are configured
- Verify API credentials are correct
- Ensure middleware and extensions are enabled
- Check logs for error messages
Failed Requests Not Being Logged
- Ensure APCloudyErrorMiddleware is added to DOWNLOADER_MIDDLEWARES
- Check middleware priority (should be < 100 to catch errors early)
Stats Not Included
- Verify APCloudyStatsExtension is enabled in EXTENSIONS
- Stats are only sent once at spider close
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
For issues and questions, please open an issue on GitHub.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file apcloudy_pipeline-0.1.4.tar.gz.
File metadata
- Download URL: apcloudy_pipeline-0.1.4.tar.gz
- Upload date:
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 47d4097686f3b36d188dafa2dd92b93b3c12b4f93507f4688267cf24fe25e2fe |
| MD5 | 242169b214eb77f3f2f08ab08965c150 |
| BLAKE2b-256 | d416d0ae0f16592bbb07588cdd7049e1307d1c81bfd2104b6d3881daadfaa4a6 |
File details
Details for the file apcloudy_pipeline-0.1.4-py3-none-any.whl.
File metadata
- Download URL: apcloudy_pipeline-0.1.4-py3-none-any.whl
- Upload date:
- Size: 13.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb1b3b28c94b8d4136ba86d3146f32deab682243855f61a95c277e24e442a00a |
| MD5 | d8d6a192a2c33c28e1a9e329438112e5 |
| BLAKE2b-256 | 5908bcdd41d19027b30a35e8a5fcbf8fa51564ab38cae181bbeb001486e301d8 |