Unified PDF processing pipeline with resume capability, multi-storage backends (GCS/Drive), distributed locking, and production-ready deployment.

PDF Processing Worker

A comprehensive, scalable PDF processing system with support for Google Cloud Storage (GCS) and Google Drive backends, featuring resume capability, distributed locking, and production-ready deployment options.

🚀 Features

  • 🔄 Resume Capability: Can resume from where it left off after crashes or interruptions
  • ⚡ Concurrent Processing: File-level and page-level concurrency with intelligent backpressure
  • 🗄️ Multi-Storage Backends: Support for both GCS and Google Drive via pluggable storage interface
  • 🔒 Distributed Locking: Prevents duplicate processing across multiple instances
  • 📊 Comprehensive Logging: JSON logs, dead letter queue, and Supabase integration
  • ✅ PDF Validation: Validates PDF integrity before processing
  • 🚦 Rate Limiting: Global Gemini API throttling and storage operation limits
  • 🛡️ Graceful Shutdown: Proper cleanup on termination signals
  • 🏥 Health Monitoring: Built-in health checks and monitoring endpoints
  • 📈 Auto-scaling: Kubernetes HPA for dynamic scaling
  • 🐳 Container Ready: Docker and Kubernetes deployment configurations
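
To make the PDF Validation feature above more concrete, here is a minimal sketch of a cheap structural check. It is an assumption about what "validates PDF integrity" could mean, not the package's actual implementation: the function name `looks_like_pdf` is hypothetical, and the check only looks for the `%PDF-` header and an `%%EOF` marker near the end of the file. A real validator would likely parse the document with a PDF library.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check (hypothetical helper, not the package's API).

    Returns True when the bytes start with the PDF header and carry an
    end-of-file marker within the last 1024 bytes. This catches truncated
    uploads and non-PDF files, but it does not prove the file is parseable.
    """
    if not data.startswith(b"%PDF-"):
        return False
    return b"%%EOF" in data[-1024:]
```

A worker could run a check like this right after download and route failures to the dead letter queue instead of sending them to OCR.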

🏗️ Architecture

The system consists of:

  1. Unified Worker: Single worker supporting both GCS and Google Drive backends
  2. Storage Interface: Pluggable storage abstraction layer
  3. OCR Engine: Gemini API integration with intelligent rate limiting
  4. Resume System: Persistent progress tracking and resume capability
  5. Distributed Locking: Redis-based or file-based locking to prevent duplicates
  6. Comprehensive Logging: Multi-output logging system with structured JSON logs
  7. Health Monitoring: Built-in health checks and metrics endpoints
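
The pluggable storage layer described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of storage_interface.py: the names `StorageBackend`, `InMemoryStorage`, `list_pdfs`, `download`, `upload`, and `process_all` are all hypothetical, and the in-memory backend stands in for GCS or Drive so the example runs without credentials.

```python
from typing import Dict, List, Protocol


class StorageBackend(Protocol):
    """Hypothetical interface that a GCS or Drive backend would satisfy."""

    def list_pdfs(self) -> List[str]: ...
    def download(self, name: str) -> bytes: ...
    def upload(self, name: str, data: bytes) -> None: ...


class InMemoryStorage:
    """Stand-in backend for illustration; it is neither GCS nor Drive."""

    def __init__(self, source: Dict[str, bytes]) -> None:
        self.source = dict(source)
        self.dest: Dict[str, bytes] = {}

    def list_pdfs(self) -> List[str]:
        # Only names ending in .pdf (case-insensitive) count as work items.
        return sorted(n for n in self.source if n.lower().endswith(".pdf"))

    def download(self, name: str) -> bytes:
        return self.source[name]

    def upload(self, name: str, data: bytes) -> None:
        self.dest[name] = data


def process_all(storage: StorageBackend) -> List[str]:
    """Worker body that depends only on the interface, never on a backend."""
    done = []
    for name in storage.list_pdfs():
        data = storage.download(name)
        storage.upload(name, data)  # placeholder for OCR plus result upload
        done.append(name)
    return done
```

Because `process_all` only touches the three interface methods, swapping backends would come down to constructing a different object, which is presumably what STORAGE_BACKEND selects.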

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Google Cloud Storage bucket OR Google Drive folders
  • Gemini API key
  • Service account credentials (GCS) OR OAuth2 credentials (Drive)
  • Redis instance (for distributed locking)

Installation

  1. Clone and set up:
git clone <repository-url>
cd gcs-pdf-processing
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
  2. Configure environment:
cp .env.example .env
# Edit .env with your configuration

Configuration

Create a .env file with your settings:

# API Keys
GEMINI_API_KEY=your_gemini_api_key

# Google Cloud Storage (for GCS backend)
GOOGLE_APPLICATION_CREDENTIALS=secrets/gcs-service-account.json
GCS_BUCKET_NAME=your-bucket-name
GCS_SOURCE_PREFIX=source/
GCS_DEST_PREFIX=processed/

# Google Drive (for Drive backend)
GOOGLE_DRIVE_CREDENTIALS=secrets/drive-oauth2-credentials.json
DRIVE_SOURCE_FOLDER_ID=your_source_folder_id
DRIVE_DEST_FOLDER_ID=your_dest_folder_id

# Redis (for distributed locking)
REDIS_URL=redis://localhost:6379/0

# Supabase (optional, for persistent error logging)
SUPABASE_URL=your_supabase_url
SUPABASE_API_KEY=your_supabase_api_key

# Worker Configuration
POLL_INTERVAL=30
MAX_CONCURRENT_FILES=3
MAX_CONCURRENT_WORKERS=8
GEMINI_GLOBAL_CONCURRENCY=10
MAX_RETRIES=3

🎯 Usage

Local Development

# Run GCS worker
dist-gcs-worker

# Run Drive worker  
dist-drive-worker

# Run API server
dist-gcs-api

Docker Deployment

# Build and run with Docker Compose
docker-compose up -d

# Scale workers
docker-compose up -d --scale pdf-worker-gcs=3 --scale pdf-worker-drive=2

Kubernetes Deployment

# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets.yaml
kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/worker-deployment.yaml
kubectl apply -f k8s/api-deployment.yaml
kubectl apply -f k8s/hpa.yaml

📁 Project Structure

├── src/dist_gcs_pdf_processing/
│   ├── unified_worker.py      # 🎯 Main unified worker (use this)
│   ├── storage_interface.py   # 🗄️ Storage abstraction layer
│   ├── gcs_utils.py           # ☁️ GCS operations
│   ├── drive_utils_oauth2.py  # 📁 Drive operations
│   ├── ocr.py                 # 🔍 OCR processing
│   ├── config.py              # ⚙️ Configuration
│   ├── env.py                 # 🌍 Environment setup
│   └── shared.py              # 🔧 Shared utilities
├── k8s/                       # ☸️ Kubernetes manifests
├── docker-compose.yml         # 🐳 Docker Compose config
├── Dockerfile                 # 🐳 Docker configuration
└── README_DEPLOYMENT.md       # 📚 Deployment guide

🔧 Configuration Options

| Variable | Description | Default | Notes |
|---|---|---|---|
| STORAGE_BACKEND | Storage backend (gcs/drive) | gcs | Determines which storage to use |
| POLL_INTERVAL | Polling interval in seconds | 30 | How often to check for new files |
| MAX_CONCURRENT_FILES | Max concurrent files | 3 | Files processed simultaneously |
| MAX_CONCURRENT_WORKERS | Max concurrent workers | 8 | Pages processed simultaneously |
| GEMINI_GLOBAL_CONCURRENCY | Global Gemini API concurrency | 10 | Global API rate limiting |
| MAX_RETRIES | Max retries per page | 3 | Retry failed pages |
| REDIS_URL | Redis connection URL | None | For distributed locking |
| WORKER_INSTANCE_ID | Unique worker instance ID | Auto-generated | For logging and locking |
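
As a rough illustration of how these options and their defaults fit together, the sketch below reads them from the environment. It is not the package's config.py: `WorkerConfig` and `load_config` are hypothetical names, and the validation of STORAGE_BACKEND is an assumption. Only the variable names and default values come from the options listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    """Hypothetical container for the documented options."""

    storage_backend: str
    poll_interval: int
    max_concurrent_files: int
    max_concurrent_workers: int
    gemini_global_concurrency: int
    max_retries: int
    redis_url: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build a config from environment variables, using documented defaults."""
    backend = env.get("STORAGE_BACKEND", "gcs").lower()
    if backend not in ("gcs", "drive"):
        raise ValueError(f"STORAGE_BACKEND must be 'gcs' or 'drive', got {backend!r}")
    return WorkerConfig(
        storage_backend=backend,
        poll_interval=int(env.get("POLL_INTERVAL", "30")),
        max_concurrent_files=int(env.get("MAX_CONCURRENT_FILES", "3")),
        max_concurrent_workers=int(env.get("MAX_CONCURRENT_WORKERS", "8")),
        gemini_global_concurrency=int(env.get("GEMINI_GLOBAL_CONCURRENCY", "10")),
        max_retries=int(env.get("MAX_RETRIES", "3")),
        # An unset or empty REDIS_URL maps to None, matching the listed default.
        redis_url=env.get("REDIS_URL") or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_config({"STORAGE_BACKEND": "drive"})`.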

📊 Monitoring & Logging

Health Checks

  • Worker Health: Checks for log file existence
  • API Health: HTTP endpoint at /health
  • Redis Health: Redis ping command

Logging

  • Structured Logs: JSON format in logs/json/
  • Dead Letter Queue: Failed files in logs/dead_letter/
  • Progress Tracking: Resume state in logs/progress/
  • Supabase Integration: Persistent error logging
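
The progress tracking mentioned above is what makes resuming possible. The following is a minimal sketch of one way it could work, assuming one JSON file per PDF that records the pages already finished. The file layout, the `done_pages` key, and the function names are assumptions for illustration; the actual format under logs/progress/ is not documented here.

```python
import json
import os
from pathlib import Path
from typing import List, Set


def load_done_pages(progress_file: Path) -> Set[int]:
    """Return the set of page numbers already processed (empty if no file)."""
    if not progress_file.exists():
        return set()
    with progress_file.open() as fh:
        return set(json.load(fh)["done_pages"])


def mark_page_done(progress_file: Path, page: int) -> None:
    """Record a finished page, writing via a temp file and an atomic rename."""
    done = load_done_pages(progress_file)
    done.add(page)
    progress_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = progress_file.with_name(progress_file.name + ".tmp")
    tmp.write_text(json.dumps({"done_pages": sorted(done)}))
    os.replace(tmp, progress_file)  # a crash mid-write leaves the old file intact


def pages_to_process(progress_file: Path, total_pages: int) -> List[int]:
    """Pages (1-based) still outstanding after a restart."""
    done = load_done_pages(progress_file)
    return [p for p in range(1, total_pages + 1) if p not in done]
```

After a crash, a restarted worker would call `pages_to_process` and skip everything already recorded, so only the unfinished pages are sent to OCR again.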

Metrics

  • Prometheus Metrics: Available at /metrics endpoint
  • Resource Usage: CPU, memory, network
  • Processing Metrics: Files processed, pages processed, errors

🚀 Deployment Options

1. Docker Compose (Recommended for Development)

docker-compose up -d

2. Kubernetes (Recommended for Production)

kubectl apply -f k8s/

3. Individual Containers

docker run -d --name pdf-worker --env-file .env pdf-worker:latest

🔍 Troubleshooting

Common Issues

  1. Redis Connection Failed

    # Check Redis status
    kubectl get pods -l app=redis -n pdf-processing
    
  2. Authentication Errors

    # Check secrets
    kubectl get secret pdf-worker-secrets -n pdf-processing -o yaml
    
  3. Duplicate Processing

    # Check Redis locks
    redis-cli keys "pdf_processing:*"
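
When chasing duplicate processing, it can help to keep in mind what a per-file lock guarantees. The sketch below shows the file-based variant of the idea using an exclusive create, so only one worker can claim a given PDF. It is an assumption-laden illustration: `try_acquire_lock`, `release_lock`, and the `.lock` naming are hypothetical, and unlike a Redis lock with an expiry it has no protection against stale locks left by a crashed worker.

```python
import os
from pathlib import Path


def _lock_path(lock_dir: Path, file_name: str) -> Path:
    # Flatten path separators so "source/a.pdf" maps to a single lock file.
    return lock_dir / (file_name.replace("/", "_") + ".lock")


def try_acquire_lock(lock_dir: Path, file_name: str, worker_id: str) -> bool:
    """Claim a file for this worker; returns False if someone else holds it."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so the
        # create-and-check happens as one step.
        fd = os.open(_lock_path(lock_dir, file_name), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(worker_id)  # record the owner to help with debugging
    return True


def release_lock(lock_dir: Path, file_name: str) -> None:
    """Release the claim; a missing lock file is not an error."""
    _lock_path(lock_dir, file_name).unlink(missing_ok=True)
```

If two instances still process the same file, likely suspects are lock files on storage that is not shared between instances, or locks being released before the upload has finished.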
    

Debug Commands

# Check worker status
kubectl describe pod <pod-name> -n pdf-processing

# View logs
kubectl logs -f <pod-name> -n pdf-processing

# Execute shell in pod
kubectl exec -it <pod-name> -n pdf-processing -- /bin/bash

📈 Scaling Strategies

Horizontal Scaling

  1. Kubernetes HPA: Automatic scaling based on CPU/memory
  2. Manual Scaling: kubectl scale deployment
  3. Docker Compose: docker-compose up --scale

Vertical Scaling

  1. Resource Limits: Adjust CPU/memory limits
  2. Concurrency: Increase MAX_CONCURRENT_FILES
  3. Workers: Increase MAX_CONCURRENT_WORKERS
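
To show how the concurrency settings above might interact, here is a sketch built on asyncio semaphores. Several things are assumptions: the real worker may use threads rather than asyncio, MAX_CONCURRENT_WORKERS is treated as a per-file page limit, and `fake_ocr`, `process_file`, and `process_batch` are hypothetical names. The OCR call is a stub so the example runs without a Gemini API key.

```python
import asyncio
from typing import Dict, List

# Values taken from the documented defaults.
MAX_CONCURRENT_FILES = 3
MAX_CONCURRENT_WORKERS = 8
GEMINI_GLOBAL_CONCURRENCY = 10


async def fake_ocr(file_name: str, page: int) -> str:
    """Stub standing in for a Gemini OCR request."""
    await asyncio.sleep(0)
    return f"{file_name}:{page}"


async def process_file(
    name: str,
    pages: int,
    file_slots: asyncio.Semaphore,
    api_slots: asyncio.Semaphore,
) -> List[str]:
    # Assumed to be a per-file limit on pages in flight.
    page_slots = asyncio.Semaphore(MAX_CONCURRENT_WORKERS)

    async def one_page(page: int) -> str:
        # A page needs both a per-file slot and a global API slot.
        async with page_slots, api_slots:
            return await fake_ocr(name, page)

    async with file_slots:  # at most MAX_CONCURRENT_FILES files at once
        return await asyncio.gather(*(one_page(p) for p in range(1, pages + 1)))


async def process_batch(files: Dict[str, int]) -> Dict[str, List[str]]:
    """Process a mapping of file name to page count, honoring all limits."""
    file_slots = asyncio.Semaphore(MAX_CONCURRENT_FILES)
    api_slots = asyncio.Semaphore(GEMINI_GLOBAL_CONCURRENCY)
    results = await asyncio.gather(
        *(process_file(n, p, file_slots, api_slots) for n, p in files.items())
    )
    return dict(zip(files, results))
```

Under these assumptions, raising MAX_CONCURRENT_FILES or MAX_CONCURRENT_WORKERS only helps until GEMINI_GLOBAL_CONCURRENCY becomes the bottleneck, since every page request still has to take a global API slot.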

🛡️ Security Considerations

  1. Secrets Management: Use Kubernetes secrets or external secret management
  2. Network Policies: Implement network segmentation
  3. RBAC: Configure proper role-based access control
  4. Image Security: Scan images for vulnerabilities
  5. Resource Limits: Prevent resource exhaustion attacks

📚 Documentation

  • Deployment Guide: Comprehensive deployment instructions
  • API Documentation: API endpoints and usage
  • Configuration Reference: Detailed configuration options
  • Troubleshooting Guide: Common issues and solutions

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🆘 Support
