Unified PDF processing pipeline with resume capability, multi-storage backends (GCS/Drive), distributed locking, and production-ready deployment.
PDF Processing Worker
A comprehensive, scalable PDF processing system with support for Google Cloud Storage (GCS) and Google Drive backends, featuring resume capability, distributed locking, and production-ready deployment options.
Features
- Resume Capability: Can resume from where it left off after crashes or interruptions
- Concurrent Processing: File-level and page-level concurrency with intelligent backpressure
- Multi-Storage Backends: Support for both GCS and Google Drive via pluggable storage interface
- Distributed Locking: Prevents duplicate processing across multiple instances
- Comprehensive Logging: JSON logs, dead letter queue, and Supabase integration
- PDF Validation: Validates PDF integrity before processing
- Rate Limiting: Global Gemini API throttling and storage operation limits
- Graceful Shutdown: Proper cleanup on termination signals
- Health Monitoring: Built-in health checks and monitoring endpoints
- Auto-scaling: Kubernetes HPA for dynamic scaling
- Container Ready: Docker and Kubernetes deployment configurations
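The README does not say how PDF integrity is validated. As a rough illustration only, a cheap pre-flight check could look for the `%PDF-` header near the start of the file and the `%%EOF` marker near the end. `looks_like_pdf` is a hypothetical helper, not part of the package, and a real validator would likely also parse the file with a PDF library.

```python
def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Heuristic structural check (an assumption, not the package's own logic).

    A well-formed PDF starts with a "%PDF-" header and ends with an "%%EOF"
    marker. Truncated uploads usually lose the trailer, so checking both ends
    catches the most common kind of corruption cheaply.
    """
    has_header = b"%PDF-" in data[:window]
    has_trailer = b"%%EOF" in data[-window:]
    return has_header and has_trailer


if __name__ == "__main__":
    sample = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
    print(looks_like_pdf(sample))       # complete file
    print(looks_like_pdf(sample[:20]))  # truncated file, trailer missing
```

A check like this only rejects obviously broken files; it says nothing about whether individual pages can be rendered.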
Architecture
The system consists of:
- Unified Worker: Single worker supporting both GCS and Google Drive backends
- Storage Interface: Pluggable storage abstraction layer
- OCR Engine: Gemini API integration with intelligent rate limiting
- Resume System: Persistent progress tracking and resume capability
- Distributed Locking: Redis-based or file-based locking to prevent duplicates
- Comprehensive Logging: Multi-output logging system with structured JSON logs
- Health Monitoring: Built-in health checks and metrics endpoints
Quick Start
Prerequisites
- Python 3.11+
- Google Cloud Storage bucket OR Google Drive folders
- Gemini API key
- Service account credentials (GCS) OR OAuth2 credentials (Drive)
- Redis instance (for distributed locking)
Installation
- Clone and set up:
git clone <repository-url>
cd gcs-pdf-processing
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
- Configure environment:
cp .env.example .env
# Edit .env with your configuration
Configuration
Create a .env file with your settings:
# API Keys
GEMINI_API_KEY=your_gemini_api_key
# Google Cloud Storage (for GCS backend)
GOOGLE_APPLICATION_CREDENTIALS=secrets/gcs-service-account.json
GCS_BUCKET_NAME=your-bucket-name
GCS_SOURCE_PREFIX=source/
GCS_DEST_PREFIX=processed/
# Google Drive (for Drive backend)
GOOGLE_DRIVE_CREDENTIALS=secrets/drive-oauth2-credentials.json
DRIVE_SOURCE_FOLDER_ID=your_source_folder_id
DRIVE_DEST_FOLDER_ID=your_dest_folder_id
# Redis (for distributed locking)
REDIS_URL=redis://localhost:6379/0
# Supabase (optional, for persistent error logging)
SUPABASE_URL=your_supabase_url
SUPABASE_API_KEY=your_supabase_api_key
# Worker Configuration
POLL_INTERVAL=30
MAX_CONCURRENT_FILES=3
MAX_CONCURRENT_WORKERS=8
GEMINI_GLOBAL_CONCURRENCY=10
MAX_RETRIES=3
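With the values above, three files at eight pages each could request up to 3 × 8 = 24 simultaneous OCR calls, so the global limit of 10 is what actually bounds API traffic. How the worker combines these limits internally is not documented here. The sketch below assumes one semaphore per limit and treats MAX_CONCURRENT_WORKERS as a per-file page limit; the function names are hypothetical and `asyncio.sleep` stands in for the Gemini call.

```python
import asyncio

MAX_CONCURRENT_FILES = 3
MAX_CONCURRENT_WORKERS = 8       # assumed to apply per file
GEMINI_GLOBAL_CONCURRENCY = 10


async def run(files: dict[str, int]) -> int:
    """Process every page of every file; return the peak number of in-flight calls."""
    file_slots = asyncio.Semaphore(MAX_CONCURRENT_FILES)
    api_slots = asyncio.Semaphore(GEMINI_GLOBAL_CONCURRENCY)
    in_flight = 0
    peak = 0

    async def ocr_page(page_slots: asyncio.Semaphore) -> None:
        nonlocal in_flight, peak
        # A page needs both a per-file slot and a global API slot.
        async with page_slots, api_slots:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the real OCR request
            in_flight -= 1

    async def process_file(page_count: int) -> None:
        async with file_slots:
            page_slots = asyncio.Semaphore(MAX_CONCURRENT_WORKERS)
            await asyncio.gather(*(ocr_page(page_slots) for _ in range(page_count)))

    await asyncio.gather(*(process_file(n) for n in files.values()))
    return peak


if __name__ == "__main__":
    peak = asyncio.run(run({"a.pdf": 20, "b.pdf": 20, "c.pdf": 20, "d.pdf": 20}))
    print(f"peak concurrent API calls: {peak}")  # never above GEMINI_GLOBAL_CONCURRENCY
```

Under these assumptions, raising MAX_CONCURRENT_FILES or MAX_CONCURRENT_WORKERS beyond the point where their product exceeds GEMINI_GLOBAL_CONCURRENCY adds queued work but no extra API throughput.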
Usage
Local Development
# Run GCS worker
dist-gcs-worker
# Run Drive worker
dist-drive-worker
# Run API server
dist-gcs-api
Docker Deployment
# Build and run with Docker Compose
docker-compose up -d
# Scale workers
docker-compose up -d --scale pdf-worker-gcs=3 --scale pdf-worker-drive=2
Kubernetes Deployment
# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/secrets.yaml
kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/worker-deployment.yaml
kubectl apply -f k8s/api-deployment.yaml
kubectl apply -f k8s/hpa.yaml
Project Structure
├── src/dist_gcs_pdf_processing/
│   ├── unified_worker.py      # Main unified worker (use this)
│   ├── storage_interface.py   # Storage abstraction layer
│   ├── gcs_utils.py           # GCS operations
│   ├── drive_utils_oauth2.py  # Drive operations
│   ├── ocr.py                 # OCR processing
│   ├── config.py              # Configuration
│   ├── env.py                 # Environment setup
│   └── shared.py              # Shared utilities
├── k8s/                       # Kubernetes manifests
├── docker-compose.yml         # Docker Compose config
├── Dockerfile                 # Docker configuration
└── README_DEPLOYMENT.md       # Deployment guide
Configuration Options
| Variable | Description | Default | Notes |
|---|---|---|---|
| STORAGE_BACKEND | Storage backend (gcs/drive) | gcs | Determines which storage to use |
| POLL_INTERVAL | Polling interval in seconds | 30 | How often to check for new files |
| MAX_CONCURRENT_FILES | Max concurrent files | 3 | Files processed simultaneously |
| MAX_CONCURRENT_WORKERS | Max concurrent workers | 8 | Pages processed simultaneously |
| GEMINI_GLOBAL_CONCURRENCY | Global Gemini API concurrency | 10 | Global API rate limiting |
| MAX_RETRIES | Max retries per page | 3 | Retry failed pages |
| REDIS_URL | Redis connection URL | None | For distributed locking |
| WORKER_INSTANCE_ID | Unique worker instance ID | Auto-generated | For logging and locking |
Monitoring & Logging
Health Checks
- Worker Health: Checks for log file existence
- API Health: HTTP endpoint at /health
- Redis Health: Redis ping command
Logging
- Structured Logs: JSON format in logs/json/
- Dead Letter Queue: Failed files in logs/dead_letter/
- Progress Tracking: Resume state in logs/progress/
- Supabase Integration: Persistent error logging
Metrics
- Prometheus Metrics: Available at the /metrics endpoint
- Resource Usage: CPU, memory, network
- Processing Metrics: Files processed, pages processed, errors
Deployment Options
1. Docker Compose (Recommended for Development)
docker-compose up -d
2. Kubernetes (Recommended for Production)
kubectl apply -f k8s/
3. Individual Containers
docker run -d --name pdf-worker --env-file .env pdf-worker:latest
Troubleshooting
Common Issues
- Redis Connection Failed
# Check Redis status
kubectl get pods -l app=redis -n pdf-processing
- Authentication Errors
# Check secrets
kubectl get secret pdf-worker-secrets -n pdf-processing -o yaml
- Duplicate Processing
# Check Redis locks
redis-cli keys "pdf_processing:*"
Debug Commands
# Check worker status
kubectl describe pod <pod-name> -n pdf-processing
# View logs
kubectl logs -f <pod-name> -n pdf-processing
# Execute shell in pod
kubectl exec -it <pod-name> -n pdf-processing -- /bin/bash
Scaling Strategies
Horizontal Scaling
- Kubernetes HPA: Automatic scaling based on CPU/memory
- Manual Scaling: kubectl scale deployment
- Docker Compose: docker-compose up --scale
Vertical Scaling
- Resource Limits: Adjust CPU/memory limits
- Concurrency: Increase MAX_CONCURRENT_FILES
- Workers: Increase MAX_CONCURRENT_WORKERS
Security Considerations
- Secrets Management: Use Kubernetes secrets or external secret management
- Network Policies: Implement network segmentation
- RBAC: Configure proper role-based access control
- Image Security: Scan images for vulnerabilities
- Resource Limits: Prevent resource exhaustion attacks
Documentation
- Deployment Guide: Comprehensive deployment instructions
- API Documentation: API endpoints and usage
- Configuration Reference: Detailed configuration options
- Troubleshooting Guide: Common issues and solutions
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Wiki