Unified PDF processing pipeline with resume capability, multi-storage backends (GCS/Drive), distributed locking, and production-ready deployment.

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Software Development :: Libraries :: Application Frameworks

Project description

PDF Processing Worker

A comprehensive, scalable PDF processing system with support for Google Cloud Storage (GCS) and Google Drive backends, featuring resume capability, distributed locking, and production-ready deployment options.

🚀 Features

🔄 Resume Capability: Can resume from where it left off after crashes or interruptions
⚡ Concurrent Processing: File-level and page-level concurrency with intelligent backpressure
🗄️ Multi-Storage Backends: Support for both GCS and Google Drive via pluggable storage interface
🔒 Distributed Locking: Prevents duplicate processing across multiple instances
📊 Comprehensive Logging: JSON logs, dead letter queue, and Supabase integration
✅ PDF Validation: Validates PDF integrity before processing
🚦 Rate Limiting: Global Gemini API throttling and storage operation limits
🛡️ Graceful Shutdown: Proper cleanup on termination signals
🏥 Health Monitoring: Built-in health checks and monitoring endpoints
📈 Auto-scaling: Kubernetes HPA for dynamic scaling
🐳 Container Ready: Docker and Kubernetes deployment configurations
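The PDF validation feature above is not described in detail. As a rough illustration only, a cheap integrity check could look for the PDF header and end-of-file marker before a file is sent to OCR. The function name `looks_like_pdf` and the 1024-byte windows are assumptions for this sketch, not the package's actual implementation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural check (hypothetical helper, not the package's API).

    Passing this check does not prove the file is a valid PDF; it only
    rejects files that are obviously truncated or of the wrong type.
    """
    # A PDF starts with a "%PDF-" header; tolerate a little leading junk.
    has_header = b"%PDF-" in data[:1024]
    # A complete PDF ends with a "%%EOF" marker near the end of the file.
    has_eof = b"%%EOF" in data[-1024:]
    return has_header and has_eof
```

A truncated upload typically fails the `%%EOF` test, which makes this a useful first filter before a more thorough parse.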

🏗️ Architecture

The system consists of:

Unified Worker: Single worker supporting both GCS and Google Drive backends
Storage Interface: Pluggable storage abstraction layer
OCR Engine: Gemini API integration with intelligent rate limiting
Resume System: Persistent progress tracking and resume capability
Distributed Locking: Redis-based or file-based locking to prevent duplicates
Comprehensive Logging: Multi-output logging system with structured JSON logs
Health Monitoring: Built-in health checks and metrics endpoints
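To make the idea of a pluggable storage layer concrete, here is a minimal sketch of what such an abstraction might look like. The names `StorageBackend`, `InMemoryBackend`, `list_pdfs`, `download`, `upload`, and `process_all` are hypothetical and chosen for illustration; the real interface lives in storage_interface.py and may differ.

```python
from typing import Dict, List, Protocol


class StorageBackend(Protocol):
    """Hypothetical interface that a GCS or Drive backend could satisfy."""

    def list_pdfs(self) -> List[str]: ...
    def download(self, name: str) -> bytes: ...
    def upload(self, name: str, data: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in backend for tests; keeps source and destination in dicts."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self.source = dict(files)
        self.dest: Dict[str, bytes] = {}

    def list_pdfs(self) -> List[str]:
        return sorted(n for n in self.source if n.lower().endswith(".pdf"))

    def download(self, name: str) -> bytes:
        return self.source[name]

    def upload(self, name: str, data: bytes) -> None:
        self.dest[name] = data


def process_all(backend: StorageBackend) -> int:
    """Copy every PDF from source to destination; return how many were handled."""
    count = 0
    for name in backend.list_pdfs():
        # A real worker would run OCR here instead of copying bytes unchanged.
        backend.upload(name, backend.download(name))
        count += 1
    return count
```

Because the worker loop depends only on the interface, the same code path can serve either backend, and an in-memory stand-in keeps tests free of cloud credentials.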

🚀 Quick Start

Prerequisites

Python 3.11+
Google Cloud Storage bucket OR Google Drive folders
Gemini API key
Service account credentials (GCS) OR OAuth2 credentials (Drive)
Redis instance (for Redis-based distributed locking; the Architecture section also lists a file-based alternative)

Installation

Clone and setup:

git clone <repository-url>
cd gcs-pdf-processing
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

Configure environment:

cp .env.example .env
# Edit .env with your configuration

Configuration

Create a .env file with your settings:

# API Keys
GEMINI_API_KEY=your_gemini_api_key

# Google Cloud Storage (for GCS backend)
GOOGLE_APPLICATION_CREDENTIALS=secrets/gcs-service-account.json
GCS_BUCKET_NAME=your-bucket-name
GCS_SOURCE_PREFIX=source/
GCS_DEST_PREFIX=processed/

# Google Drive (for Drive backend)
GOOGLE_DRIVE_CREDENTIALS=secrets/drive-oauth2-credentials.json
DRIVE_SOURCE_FOLDER_ID=your_source_folder_id
DRIVE_DEST_FOLDER_ID=your_dest_folder_id

# Redis (for distributed locking)
REDIS_URL=redis://localhost:6379/0

# Supabase (optional, for persistent error logging)
SUPABASE_URL=your_supabase_url
SUPABASE_API_KEY=your_supabase_api_key

# Worker Configuration
POLL_INTERVAL=30
MAX_CONCURRENT_FILES=3
MAX_CONCURRENT_WORKERS=8
GEMINI_GLOBAL_CONCURRENCY=10
MAX_RETRIES=3
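For readers who want to see how these variables might be consumed, the following is a sketch of a typed settings loader. The `WorkerSettings` class and `load_settings` function are hypothetical names; the package's own config.py may be organised differently. The fallback values mirror the sample values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerSettings:
    """Hypothetical container for the worker-related variables."""

    poll_interval: int
    max_concurrent_files: int
    max_concurrent_workers: int
    gemini_global_concurrency: int
    max_retries: int
    redis_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> WorkerSettings:
    """Read settings from a mapping, falling back to the sample values."""

    def as_int(key: str, default: int) -> int:
        # int() raises ValueError on malformed input, which surfaces
        # configuration mistakes at startup rather than mid-run.
        return int(env.get(key, default))

    return WorkerSettings(
        poll_interval=as_int("POLL_INTERVAL", 30),
        max_concurrent_files=as_int("MAX_CONCURRENT_FILES", 3),
        max_concurrent_workers=as_int("MAX_CONCURRENT_WORKERS", 8),
        gemini_global_concurrency=as_int("GEMINI_GLOBAL_CONCURRENCY", 10),
        max_retries=as_int("MAX_RETRIES", 3),
        redis_url=env.get("REDIS_URL"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.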

🎯 Usage

Local Development

# Run GCS worker
dist-gcs-worker

# Run Drive worker  
dist-drive-worker

# Run API server
dist-gcs-api

Docker Deployment

# Build and run with Docker Compose
docker-compose up -d

# Scale workers
docker-compose up -d --scale pdf-worker-gcs=3 --scale pdf-worker-drive=2

Kubernetes Deployment

# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets.yaml
kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/worker-deployment.yaml
kubectl apply -f k8s/api-deployment.yaml
kubectl apply -f k8s/hpa.yaml

📁 Project Structure

├── src/dist_gcs_pdf_processing/
│   ├── unified_worker.py      # 🎯 Main unified worker (use this)
│   ├── storage_interface.py   # 🗄️ Storage abstraction layer
│   ├── gcs_utils.py          # ☁️ GCS operations
│   ├── drive_utils_oauth2.py # 📁 Drive operations
│   ├── ocr.py                # 🔍 OCR processing
│   ├── config.py             # ⚙️ Configuration
│   ├── env.py                # 🌍 Environment setup
│   └── shared.py             # 🔧 Shared utilities
├── k8s/                      # ☸️ Kubernetes manifests
├── docker-compose.yml        # 🐳 Docker Compose config
├── Dockerfile               # 🐳 Docker configuration
└── README_DEPLOYMENT.md     # 📚 Deployment guide

🔧 Configuration Options

Variable	Description	Default	Notes
`STORAGE_BACKEND`	Storage backend (gcs/drive)	gcs	Determines which storage to use
`POLL_INTERVAL`	Polling interval in seconds	30	How often to check for new files
`MAX_CONCURRENT_FILES`	Max concurrent files	3	Files processed simultaneously
`MAX_CONCURRENT_WORKERS`	Max concurrent page workers	8	Pages processed simultaneously
`GEMINI_GLOBAL_CONCURRENCY`	Global Gemini API concurrency	10	Caps simultaneous Gemini API requests
`MAX_RETRIES`	Max retries per page	3	Retry failed pages
`REDIS_URL`	Redis connection URL	None	For distributed locking
`WORKER_INSTANCE_ID`	Unique worker instance ID	Auto-generated	For logging and locking
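The interaction between the page-level worker limit and the global Gemini limit can be sketched as a thread pool combined with a shared semaphore. This is an illustration under assumptions: the helper names `call_with_global_limit` and `ocr_pages` are hypothetical, the `ocr_call` argument stands in for whatever function actually calls Gemini, and the semaphore shown here only limits calls within one process.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Values taken from the defaults in the table above.
MAX_CONCURRENT_WORKERS = 8
GEMINI_GLOBAL_CONCURRENCY = 10

# One semaphore shared by every file this process is working on.
_gemini_slots = threading.BoundedSemaphore(GEMINI_GLOBAL_CONCURRENCY)


def call_with_global_limit(ocr_call: Callable[[int], str], page: int) -> str:
    """Run one OCR call while holding a global slot (blocks when all are busy)."""
    with _gemini_slots:
        return ocr_call(page)


def ocr_pages(ocr_call: Callable[[int], str], pages: List[int]) -> List[str]:
    """OCR the pages of one file concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_WORKERS) as pool:
        return list(pool.map(lambda p: call_with_global_limit(ocr_call, p), pages))
```

With three files in flight and eight page workers each, up to 24 threads could want an API slot at once; the semaphore is what would hold the number of simultaneous calls at 10 in this sketch.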

📊 Monitoring & Logging

Health Checks

Worker Health: Checks for log file existence
API Health: HTTP endpoint at /health
Redis Health: Redis ping command
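As an illustration of the worker health check, a probe based on the log file might look like the sketch below. The source only states that the check looks for the log file's existence; the freshness threshold (`max_age_seconds`) and the function name `worker_is_healthy` are assumptions added here to show how a stalled worker could also be detected.

```python
import time
from pathlib import Path


def worker_is_healthy(log_path: Path, max_age_seconds: float = 300.0) -> bool:
    """Return True if the log file exists and was modified recently.

    Hypothetical helper: the freshness test is an addition for illustration.
    """
    try:
        mtime = log_path.stat().st_mtime
    except FileNotFoundError:
        # No log file at all: the worker never started or lost its volume.
        return False
    return (time.time() - mtime) <= max_age_seconds
```

A container liveness probe could call a small script wrapping this function and exit non-zero when it returns False.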

Logging

Structured Logs: JSON format in logs/json/
Dead Letter Queue: Failed files in logs/dead_letter/
Progress Tracking: Resume state in logs/progress/
Supabase Integration: Persistent error logging
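The resume state kept under logs/progress/ is not specified further. One plausible shape, shown purely as a sketch, is a small JSON file per PDF that records which pages are finished. The file layout (`{"done_pages": [...]}`) and the function names below are assumptions, not the package's actual format.

```python
import json
import os
from pathlib import Path
from typing import List, Set


def load_done_pages(progress_file: Path) -> Set[int]:
    """Return the set of pages already processed (empty if no state exists)."""
    if not progress_file.exists():
        return set()
    return set(json.loads(progress_file.read_text())["done_pages"])


def mark_page_done(progress_file: Path, page: int) -> None:
    """Record one finished page, replacing the state file atomically."""
    done = load_done_pages(progress_file)
    done.add(page)
    progress_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = progress_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"done_pages": sorted(done)}))
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # leaves either the old state or the new one, never a partial file.
    os.replace(tmp, progress_file)


def pages_remaining(progress_file: Path, total_pages: int) -> List[int]:
    """List the 1-based page numbers that still need processing."""
    done = load_done_pages(progress_file)
    return [p for p in range(1, total_pages + 1) if p not in done]
```

After a crash, a restarted worker would call `pages_remaining` and continue with only the unfinished pages.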

Metrics

Prometheus Metrics: Available at /metrics endpoint
Resource Usage: CPU, memory, network
Processing Metrics: Files processed, pages processed, errors

🚀 Deployment Options

1. Docker Compose (Recommended for Development)

docker-compose up -d

2. Kubernetes (Recommended for Production)

kubectl apply -f k8s/

3. Individual Containers

docker run -d --name pdf-worker --env-file .env pdf-worker:latest

🔍 Troubleshooting

Common Issues

Redis Connection Failed

# Check Redis status
kubectl get pods -l app=redis -n pdf-processing

Authentication Errors

# Check secrets (values in the output are base64-encoded)
kubectl get secret pdf-worker-secrets -n pdf-processing -o yaml

Duplicate Processing

# Check Redis locks (uses SCAN, which avoids blocking Redis the way KEYS can on large keyspaces)
redis-cli --scan --pattern "pdf_processing:*"
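When investigating duplicate processing, it helps to recall what the lock has to guarantee: only one worker may claim a given file at a time. The sketch below shows the file-based variant using an atomic exclusive create. The `file_lock` helper and the `.lock` naming are hypothetical, and the approach assumes all workers share one filesystem and that stale locks from crashed workers are cleaned up separately. A Redis-based lock would typically rely on SET with the NX option and an expiry instead.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def file_lock(lock_dir: Path, file_name: str) -> Iterator[bool]:
    """Yield True if this process acquired the lock, False if it is held."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / (file_name.replace("/", "_") + ".lock")
    try:
        # O_CREAT | O_EXCL fails atomically if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        fd = None
    if fd is None:
        yield False
        return
    try:
        # Store the owner's PID to help with manual debugging.
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        yield True
    finally:
        # Release the lock even if processing raised an exception.
        os.unlink(lock_path)
```

A worker would wrap per-file processing in `with file_lock(...) as acquired:` and skip the file when `acquired` is False.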

Debug Commands

# Check worker status
kubectl describe pod <pod-name> -n pdf-processing

# View logs
kubectl logs -f <pod-name> -n pdf-processing

# Execute shell in pod
kubectl exec -it <pod-name> -n pdf-processing -- /bin/bash

📈 Scaling Strategies

Horizontal Scaling

Kubernetes HPA: Automatic scaling based on CPU/memory
Manual Scaling: kubectl scale deployment
Docker Compose: docker-compose up --scale

Vertical Scaling

Resource Limits: Adjust CPU/memory limits
Concurrency: Increase MAX_CONCURRENT_FILES
Workers: Increase MAX_CONCURRENT_WORKERS

🛡️ Security Considerations

Secrets Management: Use Kubernetes secrets or external secret management
Network Policies: Implement network segmentation
RBAC: Configure proper role-based access control
Image Security: Scan images for vulnerabilities
Resource Limits: Prevent resource exhaustion attacks

📚 Documentation

Deployment Guide: Comprehensive deployment instructions
API Documentation: API endpoints and usage
Configuration Reference: Detailed configuration options
Troubleshooting Guide: Common issues and solutions

🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

📄 License

MIT License - see LICENSE file for details.

🆘 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Wiki


Release history

2.0.0: Sep 14, 2025 (this version)
1.0.0: Jul 4, 2025
0.1.0: Jul 3, 2025

Download files

Source Distribution: dist_gcs_pdf_processing-2.0.0.tar.gz (30.3 kB), uploaded Sep 14, 2025
Built Distribution: dist_gcs_pdf_processing-2.0.0-py3-none-any.whl (26.2 kB), uploaded Sep 14, 2025, Python 3

File details

Details for the file dist_gcs_pdf_processing-2.0.0.tar.gz.

File metadata

Download URL: dist_gcs_pdf_processing-2.0.0.tar.gz
Upload date: Sep 14, 2025
Size: 30.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for dist_gcs_pdf_processing-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c063abda43652cc3a91838fefa8c3f356349f25b72a45f368739da0126f110c8`
MD5	`de41f9304aa5d256b4ea103f0b457f03`
BLAKE2b-256	`686e6f0f47ddde7966dd52a93589b418bd6c89d88eab19e8107df32f70c71c96`


File details

Details for the file dist_gcs_pdf_processing-2.0.0-py3-none-any.whl.

File metadata

Download URL: dist_gcs_pdf_processing-2.0.0-py3-none-any.whl
Upload date: Sep 14, 2025
Size: 26.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.5

File hashes

Hashes for dist_gcs_pdf_processing-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`66d0e7dfdb3a70923de03fcba94e80c9b076dda15f9973b516202245e1645ed3`
MD5	`a77d10ffac3f0e9474866d2571addd0b`
BLAKE2b-256	`91b60f5d05abb8119ff628cbb75eb3f56bbc31c31387bf536e3456f83015190d`


dist-gcs-pdf-processing 2.0.0
