Scrapy pipeline & extensions for APCloudy (logs, stats, requests, items)

APCloudy Pipeline

A high-performance Scrapy integration that sends items, requests, logs, and spider statistics to your backend using unified batch processing and secure HMAC-based authentication.


✨ Features

  • 📦 Unified Data Sending - All data (items, requests, logs, stats) sent in a single API call
  • 🚀 Batch Processing - Automatic batching every 10 items or on spider close
  • 🌐 Complete Request Tracking - Logs both successful and failed requests with detailed error information
  • 📊 Spider Statistics - Comprehensive spider performance metrics
  • 🧾 Log Forwarding - Captures spider, user, and Scrapy internal logs
  • 🔐 Secure Authentication - HMAC-SHA256 signature-based API communication
  • ⚡ High Performance - Thread-safe data collection with minimal overhead
  • 🎯 Zero Configuration - Works out of the box with sensible defaults

📦 Installation

pip install apcloudy-pipeline

โš™๏ธ Configuration

Add the following settings to your Scrapy project's settings.py file:

# Required: API credentials
APCLOUDY_URL = "https://your-api.com"  # Base URL (webhook path added automatically)
APCLOUDY_API_KEY = "your_public_api_key"
APCLOUDY_SECRET_KEY = "your_secret_key"
JOB_ID = 123  # Can also be passed via spider args

# Optional: Batch size (default: 10)
APCLOUDY_BATCH_SIZE = 10  # Send data every 10 items

# Required: Item Pipeline
ITEM_PIPELINES = {
    "apcloudy_pipeline.pipelines.APCloudyItemPipeline": 300,
}

# Required: Error Middleware (to catch failed requests)
DOWNLOADER_MIDDLEWARES = {
    'apcloudy_pipeline.middleware.APCloudyErrorMiddleware': 50,
    # ... your other middlewares
}

# Required: Extensions for request logging, logs, and stats
EXTENSIONS = {
    "apcloudy_pipeline.request_logger.APCloudyRequestLogger": 100,
    "apcloudy_pipeline.extensions.APCloudyLoggingExtension": 100,
    "apcloudy_pipeline.extensions.APCloudyStatsExtension": 100,
}

๐Ÿ—๏ธ Architecture

Data Flow

┌───────────────────────────────────────────────────────────┐
│                     SPIDER EXECUTION                      │
└───────────────────────────────────────────────────────────┘
                              ↓
     ┌────────────────────────┼──────────────────────┐
     ↓                        ↓                      ↓
┌──────────┐            ┌──────────┐            ┌──────────┐
│ Requests │            │  Items   │            │   Logs   │
└──────────┘            └──────────┘            └──────────┘
     ↓                        ↓                      ↓
     └────────────────────────┼──────────────────────┘
                              ↓
                   ┌──────────────────┐
                   │  DataCollector   │
                   │  (Thread-Safe)   │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │ Batch Trigger?   │
                   │ • 10+ items      │
                   │ • Spider closes  │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │  APCloudyClient  │
                   │  (HMAC Auth)     │
                   └──────────────────┘
                              ↓
                   ┌──────────────────┐
                   │      Backend     │
                   │   (Unified API)  │
                   └──────────────────┘

📡 API Payload Structure

All data is sent in a single unified payload:

{
  "job_id": "123",
  "data": {
    "requests": [
      {
        "url": "https://example.com/product",
        "method": "GET",
        "status_code": 200,
        "response_time": 1.23,
        "fingerprint": "f3045685b89f920b3faefc7d3df2d3c88bdab393",
        "error": null,
        "success": true
      },
      {
        "url": "https://example.com/error",
        "method": "GET",
        "status_code": 400,
        "response_time": 0.45,
        "fingerprint": "a1b2c3d4e5f6...",
        "error": "AskPablosAPIError: [400] Invalid JSON format",
        "success": false
      }
    ],
    "items": [
      {
        "title": "Product Name",
        "price": "99.99",
        "url": "https://example.com/product"
      }
    ],
    "logs": [
      {
        "level": "INFO",
        "message": "Spider started",
        "exception": null
      },
      {
        "level": "ERROR",
        "message": "Failed to parse page",
        "exception": "Traceback..."
      }
    ],
    "stats": {
      "item_scraped_count": 1,
      "request_count": 2,
      "response_received_count": 1,
      "downloader/exception_count": 1,
      "finish_time": "2026-01-06T11:28:46",
      "finish_reason": "finished"
    }
  }
}

🔧 Components

1. APCloudyItemPipeline

  • Collects scraped items
  • Triggers batch send every 10 items
  • Sends remaining items on spider close

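To make the batching behavior concrete, here is a minimal sketch of how such a pipeline can be structured. It is illustrative only; the real implementation ships as apcloudy_pipeline.pipelines.APCloudyItemPipeline, and the class and helper names below are hypothetical:

import threading

class BatchingItemPipeline:
    """Illustrative sketch only -- not the package source."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.items = []
        self.lock = threading.Lock()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(batch_size=crawler.settings.getint("APCLOUDY_BATCH_SIZE", 10))

    def process_item(self, item, spider):
        with self.lock:
            self.items.append(dict(item))
            if len(self.items) >= self.batch_size:
                self._flush(spider)
        return item  # pass the item along unchanged

    def close_spider(self, spider):
        with self.lock:
            self._flush(spider)  # send whatever is left

    def _flush(self, spider):
        if self.items:
            spider.logger.debug("Flushing batch of %d items", len(self.items))
            # hand the batch to the shared collector / API client here
            self.items.clear()
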
2. APCloudyRequestLogger

  • Logs completed HTTP responses, including those with non-200 status codes
  • Tracks response time for each request
  • Generates unique fingerprints for requests

3. APCloudyErrorMiddleware

  • Captures failed requests and exceptions
  • Extracts HTTP status codes from error messages
  • Logs middleware errors (API errors, timeouts, network failures)

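Scrapy's hook for this kind of capture is the downloader middleware's process_exception method; a simplified sketch of the idea (illustrative only, with a commented-out hand-off standing in for the package's collector):

class ErrorCaptureMiddleware:
    """Illustrative sketch only -- not the package source."""

    def process_exception(self, request, exception, spider):
        record = {
            "url": request.url,
            "method": request.method,
            "error": f"{type(exception).__name__}: {exception}",
            "success": False,
        }
        spider.logger.debug("Captured failed request: %r", record)
        # collector.add_request(record)  # hypothetical hand-off to DataCollector
        return None  # let normal exception processing continue
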
4. APCloudyLoggingExtension

  • Captures all Python logging output
  • Forwards spider, user, and Scrapy internal logs
  • Includes exception tracebacks

5. APCloudyStatsExtension

  • Collects comprehensive spider statistics
  • Sends them once at spider close
  • Includes item counts, request metrics, timing, etc.

6. DataCollector

  • Thread-safe central data storage
  • Shared across all components
  • Batches data before sending

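A minimal sketch of what such a collector might look like (illustrative; the shipped class may differ, and the method names here are hypothetical):

import threading

class DataCollector:
    """Illustrative sketch only: one lock guards all buffers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests, self._items, self._logs = [], [], []
        self._stats = {}

    def add_item(self, item):
        with self._lock:
            self._items.append(item)

    def set_stats(self, stats):
        with self._lock:
            self._stats = dict(stats)

    def drain(self):
        # Atomically take everything collected so far for one batch send
        with self._lock:
            batch = {
                "requests": self._requests,
                "items": self._items,
                "logs": self._logs,
                "stats": self._stats,
            }
            self._requests, self._items, self._logs = [], [], []
            return batch
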
7. APCloudyClient

  • Handles HMAC-SHA256 authentication
  • Sends unified data to backend
  • Automatic signature generation

๐Ÿ” Authentication

The package uses HMAC-SHA256 for secure API communication:

Headers:

X-API-KEY: {your_public_key}
X-TIMESTAMP: {unix_timestamp}
X-SIGNATURE: {hmac_sha256(secret_key, timestamp + "." + json_body)}
Content-Type: application/json

Signature Calculation:

message = f"{timestamp}.{json_body}"
signature = HMAC_SHA256(secret_key, message)
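
The client performs this signing automatically. If you need to reproduce a signature outside the package (for testing a backend, say), the calculation looks like this in plain Python; the hex-encoded digest and the example payload are assumptions here, not guaranteed implementation details:

import hashlib
import hmac
import json
import time

import requests

secret_key = "your_secret_key"
payload = {"job_id": "123", "data": {"requests": [], "items": [], "logs": [], "stats": {}}}

# Sign "{timestamp}.{json_body}" with the secret key (hex digest assumed)
json_body = json.dumps(payload)
timestamp = str(int(time.time()))
message = f"{timestamp}.{json_body}"
signature = hmac.new(secret_key.encode(), message.encode(), hashlib.sha256).hexdigest()

response = requests.post(
    "https://your-api.com/api/webhook/consume",
    data=json_body,
    headers={
        "X-API-KEY": "your_public_api_key",
        "X-TIMESTAMP": timestamp,
        "X-SIGNATURE": signature,
        "Content-Type": "application/json",
    },
)
print(response.status_code)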

🚀 Performance

  • Reduced API Calls: 1 request per batch instead of N individual requests
  • Configurable Batch Size: Set APCLOUDY_BATCH_SIZE in settings (default: 10 items)
  • Thread-Safe: Handles concurrent data collection safely
  • Minimal Overhead: Efficient data collection with locks

📋 Requirements

  • Python 3.8+
  • Scrapy 2.0+
  • requests

๐Ÿ› ๏ธ Advanced Configuration

Custom Batch Size

Control when data is sent by adjusting the batch size:

# Send every 50 items
APCLOUDY_BATCH_SIZE = 50

# Send every 100 items
APCLOUDY_BATCH_SIZE = 100

# Send immediately (batch size of 1)
APCLOUDY_BATCH_SIZE = 1

Note: Data is always sent when the spider closes, regardless of batch size.

Custom Job ID via Spider Args

scrapy crawl myspider -a JOB_ID=456
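
Scrapy passes -a arguments to the spider as string attributes, so the job id supplied this way is visible on the spider object. A tiny illustrative example (the spider name here is hypothetical):

import scrapy

class JobAwareSpider(scrapy.Spider):
    name = "job_aware"

    def start_requests(self):
        # -a JOB_ID=456 arrives as a string attribute on the spider
        self.logger.info("Running with JOB_ID=%s", getattr(self, "JOB_ID", None))
        return []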

Backend Endpoint

The package automatically appends /api/webhook/consume to your base URL:

Base URL: https://your-api.com
Full endpoint: https://your-api.com/api/webhook/consume

Data is sent via:

POST {APCLOUDY_URL}/api/webhook/consume

Make sure your backend handles the unified payload structure at this endpoint.
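
For reference, here is a minimal sketch of a receiving endpoint using Flask (Flask is an arbitrary choice, not a package requirement; the verification mirrors the signature scheme above and assumes a hex-encoded digest):

import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
SECRET_KEY = b"your_secret_key"  # must match APCLOUDY_SECRET_KEY

@app.route("/api/webhook/consume", methods=["POST"])
def consume():
    body = request.get_data(as_text=True)
    timestamp = request.headers.get("X-TIMESTAMP", "")
    sent_signature = request.headers.get("X-SIGNATURE", "")

    # Recompute HMAC-SHA256 over "{timestamp}.{json_body}" and compare
    expected = hmac.new(SECRET_KEY, f"{timestamp}.{body}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sent_signature):
        abort(401)

    data = request.get_json()["data"]
    # ... persist data["items"], data["requests"], data["logs"], data["stats"]
    return {"status": "ok"}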


๐Ÿ“ Example Spider

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'url': response.url
        }

No changes needed to your spider code! The pipeline automatically collects and sends all data.


๐Ÿ› Troubleshooting

Data Not Being Sent

  1. Check that all required settings are configured
  2. Verify API credentials are correct
  3. Ensure middleware and extensions are enabled
  4. Check logs for error messages

Failed Requests Not Being Logged

  1. Ensure APCloudyErrorMiddleware is added to DOWNLOADER_MIDDLEWARES
  2. Check middleware priority (should be < 100 to catch errors early)

Stats Not Included

  1. Verify APCloudyStatsExtension is enabled in EXTENSIONS
  2. Stats are only sent once at spider close

📄 License

This project is licensed under the MIT License.


๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📧 Support

For issues and questions, please open an issue on GitHub.
