Scrapy pipeline & extensions for APCloudy (logs, stats, requests, items)

APCloudy Pipeline

A high-performance Scrapy integration that sends items, requests, logs, and spider statistics to your backend using unified batch processing and secure HMAC-based authentication.


✨ Features

  • 📦 Unified Data Sending - All data (items, requests, logs, stats) sent in a single API call
  • 🚀 Batch Processing - Automatic batching every 10 items or on spider close
  • 🌐 Complete Request Tracking - Logs both successful and failed requests with detailed error information
  • 📊 Spider Statistics - Comprehensive spider performance metrics
  • 🧾 Log Forwarding - Captures spider, user, and Scrapy internal logs
  • 🔐 Secure Authentication - HMAC-SHA256 signature-based API communication
  • ⚡ High Performance - Thread-safe data collection with minimal overhead
  • 🎯 Zero Configuration - Works out of the box with sensible defaults

📦 Installation

pip install apcloudy-pipeline

โš™๏ธ Configuration

Add the following settings to your Scrapy project's settings.py file:

# Required: API credentials
APCLOUDY_URL = "https://your-api.com"  # Base URL (webhook path added automatically)
APCLOUDY_API_KEY = "your_public_api_key"
APCLOUDY_SECRET_KEY = "your_secret_key"
JOB_ID = 123  # Can also be passed via spider args

# Optional: Batch size (default: 10)
APCLOUDY_BATCH_SIZE = 10  # Send data every 10 items

# Required: Item Pipeline
ITEM_PIPELINES = {
    "apcloudy_pipeline.pipelines.APCloudyItemPipeline": 300,
}

# Required: Error Middleware (to catch failed requests)
DOWNLOADER_MIDDLEWARES = {
    'apcloudy_pipeline.middleware.APCloudyErrorMiddleware': 50,
    # ... your other middlewares
}

# Required: Extensions for request logging, logs, and stats
EXTENSIONS = {
    "apcloudy_pipeline.request_logger.APCloudyRequestLogger": 100,
    "apcloudy_pipeline.extensions.APCloudyLoggingExtension": 100,
    "apcloudy_pipeline.extensions.APCloudyStatsExtension": 100,
}

๐Ÿ—๏ธ Architecture

Data Flow

┌───────────────────────────────────────────────────────────┐
│                     SPIDER EXECUTION                      │
└───────────────────────────────────────────────────────────┘
                              ↓
     ┌────────────────────────┼──────────────────────┐
     ↓                        ↓                      ↓
┌──────────┐            ┌──────────┐            ┌──────────┐
│ Requests │            │  Items   │            │   Logs   │
└──────────┘            └──────────┘            └──────────┘
     ↓                        ↓                      ↓
     └────────────────────────┼──────────────────────┘
                              ↓
                   ┌──────────────────┐
                   │  DataCollector   │
                   │  (Thread-Safe)   │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │ Batch Trigger?   │
                   │ • 10+ items      │
                   │ • Spider closes  │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │  APCloudyClient  │
                   │  (HMAC Auth)     │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │      Backend     │
                   │   (Unified API)  │
                   └──────────────────┘

📡 API Payload Structure

All data is sent in a single unified payload:

{
  "job_id": "123",
  "data": {
    "requests": [
      {
        "url": "https://example.com/product",
        "method": "GET",
        "status_code": 200,
        "response_time": 1.23,
        "fingerprint": "f3045685b89f920b3faefc7d3df2d3c88bdab393",
        "error": null,
        "success": true
      },
      {
        "url": "https://example.com/error",
        "method": "GET",
        "status_code": 400,
        "response_time": 0.45,
        "fingerprint": "a1b2c3d4e5f6...",
        "error": "AskPablosAPIError: [400] Invalid JSON format",
        "success": false
      }
    ],
    "items": [
      {
        "title": "Product Name",
        "price": "99.99",
        "url": "https://example.com/product"
      }
    ],
    "logs": [
      {
        "level": "INFO",
        "message": "Spider started",
        "exception": null
      },
      {
        "level": "ERROR",
        "message": "Failed to parse page",
        "exception": "Traceback..."
      }
    ],
    "stats": {
      "item_scraped_count": 1,
      "request_count": 2,
      "response_received_count": 1,
      "downloader/exception_count": 1,
      "finish_time": "2026-01-06T11:28:46",
      "finish_reason": "finished"
    }
  }
}

🔧 Components

1. APCloudyItemPipeline

  • Collects scraped items
  • Triggers batch send every 10 items
  • Sends remaining items on spider close

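To make the batching behavior concrete, here is a minimal sketch of how such a pipeline can be structured. It is illustrative only; the real implementation ships as apcloudy_pipeline.pipelines.APCloudyItemPipeline, and the class and helper names below are hypothetical:

import threading

class BatchingItemPipeline:
    """Illustrative sketch only -- not the package source."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.items = []
        self.lock = threading.Lock()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(batch_size=crawler.settings.getint("APCLOUDY_BATCH_SIZE", 10))

    def process_item(self, item, spider):
        with self.lock:
            self.items.append(dict(item))
            if len(self.items) >= self.batch_size:
                self._flush(spider)
        return item  # pass the item along unchanged

    def close_spider(self, spider):
        with self.lock:
            self._flush(spider)  # send whatever is left

    def _flush(self, spider):
        if self.items:
            spider.logger.debug("Flushing batch of %d items", len(self.items))
            # hand the batch to the shared collector / API client here
            self.items.clear()
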
2. APCloudyRequestLogger

  • Logs completed HTTP responses, including those with non-200 status codes
  • Tracks response time for each request
  • Generates unique fingerprints for requests

3. APCloudyErrorMiddleware

  • Captures failed requests and exceptions
  • Extracts HTTP status codes from error messages
  • Logs middleware errors (API errors, timeouts, network failures)

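Scrapy's hook for this kind of capture is the downloader middleware's process_exception method; a simplified sketch of the idea (illustrative only, with a commented-out hand-off standing in for the package's collector):

class ErrorCaptureMiddleware:
    """Illustrative sketch only -- not the package source."""

    def process_exception(self, request, exception, spider):
        record = {
            "url": request.url,
            "method": request.method,
            "error": f"{type(exception).__name__}: {exception}",
            "success": False,
        }
        spider.logger.debug("Captured failed request: %r", record)
        # collector.add_request(record)  # hypothetical hand-off to DataCollector
        return None  # let normal exception processing continue
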
4. APCloudyLoggingExtension

  • Captures all Python logging output
  • Forwards spider, user, and Scrapy internal logs
  • Includes exception tracebacks

5. APCloudyStatsExtension

  • Collects comprehensive spider statistics
  • Sends them once at spider close
  • Includes item counts, request metrics, timing, etc.

6. DataCollector

  • Thread-safe central data storage
  • Shared across all components
  • Batches data before sending

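A minimal sketch of what such a collector might look like (illustrative; the shipped class may differ, and the method names here are hypothetical):

import threading

class DataCollector:
    """Illustrative sketch only: one lock guards all buffers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests, self._items, self._logs = [], [], []
        self._stats = {}

    def add_item(self, item):
        with self._lock:
            self._items.append(item)

    def set_stats(self, stats):
        with self._lock:
            self._stats = dict(stats)

    def drain(self):
        # Atomically take everything collected so far for one batch send
        with self._lock:
            batch = {
                "requests": self._requests,
                "items": self._items,
                "logs": self._logs,
                "stats": self._stats,
            }
            self._requests, self._items, self._logs = [], [], []
            return batch
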
7. APCloudyClient

  • Handles HMAC-SHA256 authentication
  • Sends unified data to backend
  • Automatic signature generation

๐Ÿ” Authentication

The package uses HMAC-SHA256 for secure API communication:

Headers:

X-API-KEY: {your_public_key}
X-TIMESTAMP: {unix_timestamp}
X-SIGNATURE: {hmac_sha256(secret_key, timestamp + "." + json_body)}
Content-Type: application/json

Signature Calculation:

message = f"{timestamp}.{json_body}"
signature = HMAC_SHA256(secret_key, message)
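
The client performs this signing automatically. If you need to reproduce a signature outside the package (for testing a backend, say), the calculation looks like this in plain Python; the hex-encoded digest and the example payload are assumptions here, not guaranteed implementation details:

import hashlib
import hmac
import json
import time

import requests

secret_key = "your_secret_key"
payload = {"job_id": "123", "data": {"requests": [], "items": [], "logs": [], "stats": {}}}

# Sign "{timestamp}.{json_body}" with the secret key (hex digest assumed)
json_body = json.dumps(payload)
timestamp = str(int(time.time()))
message = f"{timestamp}.{json_body}"
signature = hmac.new(secret_key.encode(), message.encode(), hashlib.sha256).hexdigest()

response = requests.post(
    "https://your-api.com/api/webhook/consume",
    data=json_body,
    headers={
        "X-API-KEY": "your_public_api_key",
        "X-TIMESTAMP": timestamp,
        "X-SIGNATURE": signature,
        "Content-Type": "application/json",
    },
)
print(response.status_code)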

🚀 Performance

  • Reduced API Calls: 1 request per batch instead of N individual requests
  • Configurable Batch Size: Set APCLOUDY_BATCH_SIZE in settings (default: 10 items)
  • Thread-Safe: Handles concurrent data collection safely
  • Minimal Overhead: Efficient data collection with locks

📋 Requirements

  • Python 3.8+
  • Scrapy 2.0+
  • requests

๐Ÿ› ๏ธ Advanced Configuration

Custom Batch Size

Control when data is sent by adjusting the batch size:

# Send every 50 items
APCLOUDY_BATCH_SIZE = 50

# Send every 100 items
APCLOUDY_BATCH_SIZE = 100

# Send immediately (batch size of 1)
APCLOUDY_BATCH_SIZE = 1

Note: Data is always sent when the spider closes, regardless of batch size.

Custom Job ID via Spider Args

scrapy crawl myspider -a JOB_ID=456
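
Scrapy passes -a arguments to the spider as string attributes, so the job id supplied this way is visible on the spider object. A tiny illustrative example (the spider name here is hypothetical):

import scrapy

class JobAwareSpider(scrapy.Spider):
    name = "job_aware"

    def start_requests(self):
        # -a JOB_ID=456 arrives as a string attribute on the spider
        self.logger.info("Running with JOB_ID=%s", getattr(self, "JOB_ID", None))
        return []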

Backend Endpoint

The package automatically appends /api/webhook/consume to your base URL:

Base URL: https://your-api.com
Full endpoint: https://your-api.com/api/webhook/consume

Data is sent via:

POST {APCLOUDY_URL}/api/webhook/consume

Make sure your backend handles the unified payload structure at this endpoint.
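
For reference, here is a minimal sketch of a receiving endpoint using Flask (Flask is an arbitrary choice, not a package requirement; the verification mirrors the signature scheme above and assumes a hex-encoded digest):

import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
SECRET_KEY = b"your_secret_key"  # must match APCLOUDY_SECRET_KEY

@app.route("/api/webhook/consume", methods=["POST"])
def consume():
    body = request.get_data(as_text=True)
    timestamp = request.headers.get("X-TIMESTAMP", "")
    sent_signature = request.headers.get("X-SIGNATURE", "")

    # Recompute HMAC-SHA256 over "{timestamp}.{json_body}" and compare
    expected = hmac.new(SECRET_KEY, f"{timestamp}.{body}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sent_signature):
        abort(401)

    data = request.get_json()["data"]
    # ... persist data["items"], data["requests"], data["logs"], data["stats"]
    return {"status": "ok"}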


๐Ÿ“ Example Spider

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'url': response.url
        }

No changes needed to your spider code! The pipeline automatically collects and sends all data.


๐Ÿ› Troubleshooting

Data Not Being Sent

  1. Check that all required settings are configured
  2. Verify API credentials are correct
  3. Ensure middleware and extensions are enabled
  4. Check logs for error messages

Failed Requests Not Being Logged

  1. Ensure APCloudyErrorMiddleware is added to DOWNLOADER_MIDDLEWARES
  2. Check middleware priority (should be < 100 to catch errors early)

Stats Not Included

  1. Verify APCloudyStatsExtension is enabled in EXTENSIONS
  2. Stats are only sent once at spider close

📄 License

This project is licensed under the MIT License.


๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📧 Support

For issues and questions, please open an issue on GitHub.
