# PDF-YAML Pipeline
GPU-accelerated document processing pipeline that converts PDF/HWP documents to structured YAML format, optimized for LLM training data preparation.
## Table of Contents

- Features
- System Requirements
- Quick Start
- Configuration
- Usage
- Output Format
- Architecture
- Testing
- Troubleshooting
- API Reference
- License
## Features
- PDF Parsing: High-quality PDF text extraction using Docling
- HWP Support: Korean HWP/HWPX document parsing
- Scan Detection: Automatic triage of scanned vs digital PDFs
- GPU Acceleration: CUDA-optimized processing with multi-GPU support
- CPU Mode: Works without GPU (slower but functional)
- Table Extraction: Structured table data with cell-level bounding boxes
- Redis Queue: Distributed worker architecture for scalability
- Fault Tolerant: Automatic retry, dead letter queue, lock management
- YAML Output: Structured, LLM-friendly output format
## System Requirements

### Minimum (CPU Mode)
| Component | Requirement |
|---|---|
| OS | Linux (Ubuntu 20.04+), macOS, Windows with WSL2 |
| Docker | 20.10+ with Docker Compose v2 |
| RAM | 8GB |
| Disk | 10GB free space |
### Recommended (GPU Mode)
| Component | Requirement |
|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) |
| Docker | 20.10+ with Docker Compose v2 |
| NVIDIA Driver | 525+ |
| NVIDIA Container Toolkit | Installed and configured |
| GPU | RTX 3060 (12GB) or better |
| RAM | 16GB+ |
| Disk | 20GB free space |
### Verify GPU Setup

```bash
# Check NVIDIA driver
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi
```
## Quick Start
### Option 1: One-Click Setup (Recommended)

```bash
git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline
./setup.sh
```

The setup script automatically:

- Detects GPU availability (falls back to CPU mode if no GPU)
- Creates the `.env` configuration file
- Creates the `data/` and `data/output/` directories
- Builds the Docker image (~10-20 min on first run)
- Optionally downloads a sample PDF for testing
### Option 2: Manual Setup

```bash
# 1. Clone the repository
git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline

# 2. Create configuration
cp .env.example .env
mkdir -p data/output

# 3. (Optional) Edit .env for your environment
#    For CPU mode: set DOCLING_DEVICE=cpu

# 4. Build the Docker image
docker build -t pdf-pipeline:latest .

# 5. Start the pipeline
docker compose up -d
```
## Configuration

### All Environment Variables

Edit `.env` to customize:

| Variable | Default | Description |
|---|---|---|
| **Data Paths** | | |
| `DATA_PATH` | `./data` | Input directory for PDF/HWP files |
| `OUTPUT_PATH` | `./data/output` | Output directory for YAML files |
| **GPU Settings** | | |
| `CUDA_VISIBLE_DEVICES` | `0` | GPU device ID(s), e.g., `0` or `0,1` |
| `DOCLING_DEVICE` | `cuda` | `cuda` for GPU, `cpu` for CPU-only |
| **Worker Settings** | | |
| `THREADS_PER_WORKER` | `2` | CPU threads per worker |
| `WORKER_MEMORY` | `8G` | Memory limit per worker |
| **Processing** | | |
| `MAX_PDF_PAGES` | `100` | Max pages per PDF (truncates larger docs) |
| `TIMEOUT` | `600` | Processing timeout in seconds |
| `MAX_RETRIES` | `1` | Retry count before moving to DLQ |
| `SAFE_MODE` | `true` | Skip problematic files instead of crashing |
| `OCR_ENABLED` | `false` | Enable OCR for scanned documents |
| `LOG_LEVEL` | `INFO` | Logging verbosity (DEBUG/INFO/WARNING/ERROR) |
| **Redis** | | |
| `REDIS_HOST` | `redis` | Redis hostname (use default in Docker) |
| `REDIS_PORT` | `6379` | Redis port |
| `REDIS_PASSWORD` | (empty) | Redis password (optional) |
### CPU Mode (No GPU Required)

```bash
# In .env:
DOCLING_DEVICE=cpu
```

Note: CPU mode is approximately 5-10x slower than GPU mode.
### Worker Configuration

Adjust resources in `.env`:

```bash
# Workers use this many CPU threads
THREADS_PER_WORKER=2

# Memory per worker
WORKER_MEMORY=8G
```

Scale workers:

```bash
# 1 worker (default, lower memory usage)
docker compose up -d

# 2 workers (2x throughput, requires more memory)
docker compose --profile scale up -d
```
Recommended settings by system:
| System RAM | GPU VRAM | Workers | Threads | Memory |
|---|---|---|---|---|
| 8GB | None (CPU) | 1 | 2 | 4G |
| 16GB | 8GB | 1 | 2 | 8G |
| 32GB | 12GB+ | 2 | 4 | 12G |
| 64GB+ | 24GB+ | 2 | 8 | 16G |
## Usage

### Basic Workflow

```bash
# 1. Add PDF files to the data directory
cp /path/to/documents/*.pdf ./data/

# 2. Start the pipeline
docker compose up -d

# 3. Initialize the queue (scans data/ for new files)
docker compose run --rm queue-init

# 4. Monitor progress
docker compose logs -f worker-0

# 5. Check results
ls ./data/output/
```
### Useful Commands

```bash
# View worker logs
docker compose logs -f worker-0

# Check queue status
docker compose run --rm queue-monitor

# Stop the pipeline
docker compose down

# Restart with fresh state
docker compose down -v  # Warning: clears Redis data
docker compose up -d

# Add more files to a running pipeline
cp more_files/*.pdf ./data/
docker compose run --rm queue-init
```
### Processing States

Files move through these states:

```
file:queue → file:processing → file:done
                    ↓
             file:failed (DLQ)
```
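The state transitions above, including the `MAX_RETRIES` path to the dead letter queue, can be sketched in plain Python. This is an in-memory stand-in for illustration only; the real pipeline keeps these structures in Redis, and the `FileStateMachine` class is hypothetical:

```python
class FileStateMachine:
    """In-memory sketch of the pipeline's file states.
    Attribute names mirror the Redis keys (file:queue, file:processing,
    file:done, file:failed)."""

    def __init__(self, max_retries=1):
        self.max_retries = max_retries
        self.queue = []          # file:queue (FIFO)
        self.processing = []     # file:processing
        self.done = set()        # file:done
        self.failed = set()      # file:failed (DLQ)
        self.retries = {}

    def enqueue(self, path):
        if path not in self.queue and path not in self.done:
            self.queue.append(path)

    def claim(self):
        # A worker atomically moves a file from queue to processing
        if not self.queue:
            return None
        path = self.queue.pop(0)
        self.processing.append(path)
        return path

    def complete(self, path):
        self.processing.remove(path)
        self.done.add(path)

    def fail(self, path):
        self.processing.remove(path)
        if self.retries.get(path, 0) < self.max_retries:
            self.retries[path] = self.retries.get(path, 0) + 1
            self.queue.append(path)   # retry
        else:
            self.failed.add(path)     # dead letter queue
```

With `max_retries=1` (the default), a file that fails twice ends up in the DLQ rather than being retried forever.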
## Output Format

### YAML Structure

Each processed PDF creates a `.yaml` file:

```yaml
document:
  source_path: "example.pdf"
  format: "pdf"
  parser: "docling"
  text_extractor: "pypdfium2"
  page_count: 15
  original_pages: 15      # Total pages in original PDF
  truncated: false        # true if exceeded MAX_PDF_PAGES
  max_pages_limit: null   # Limit that caused truncation (if any)
  ocr_enabled: false
  table_extraction: true
  encrypted: false
content:
  paragraphs:
    - "First paragraph text extracted from the document..."
    - "Second paragraph continues here with more content..."
    - "Each paragraph is a separate list item."
  tables:
    - page: 1
      bbox: [100, 200, 500, 400]  # [x1, y1, x2, y2]
      cells:
        - text: "Header 1"
          row: 0
          col: 0
          bbox: [100, 200, 200, 220]
          confidence: 0.95
        - text: "Value 1"
          row: 1
          col: 0
          bbox: [100, 220, 200, 240]
          confidence: 0.98
assets:
  images: []  # Image metadata if extracted
```
### Key Fields Explained

| Field | Type | Description |
|---|---|---|
| `document.page_count` | int | Number of pages actually processed |
| `document.original_pages` | int | Total pages in the source PDF |
| `document.truncated` | bool | `true` if the document was cut at MAX_PDF_PAGES |
| `content.paragraphs` | list | Extracted text paragraphs in reading order |
| `tables[].cells` | list | Cell-level table data with positions |
| `tables[].cells[].confidence` | float | OCR confidence score (0.0-1.0) |
| `tables[].cells[].bbox` | list | Bounding box [x1, y1, x2, y2] in pixels |
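For LLM training data preparation, the parsed output can be flattened back into plain text. A minimal sketch, assuming the schema shown above (tables nested under `content`) after loading the YAML into a dict; the `to_training_text` helper is hypothetical, not part of the pipeline's API:

```python
def to_training_text(doc: dict) -> str:
    """Flatten one parsed document into a single plain-text
    training sample: paragraphs first, then tables linearized
    row-major with tab-separated cells."""
    parts = list(doc["content"]["paragraphs"])
    for table in doc["content"].get("tables", []):
        # Group cells by row, then emit rows in (row, col) order
        rows = {}
        for cell in table["cells"]:
            rows.setdefault(cell["row"], {})[cell["col"]] = cell["text"]
        for r in sorted(rows):
            parts.append("\t".join(rows[r][c] for c in sorted(rows[r])))
    return "\n\n".join(parts)
```

How cell-level structure is linearized (tabs here, versus e.g. markdown tables) is a modeling choice; the bbox and confidence fields make it easy to filter low-quality OCR cells first.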
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      PDF-YAML Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐               │
│  │  Input   │───▶│  Redis   │───▶│ Workers  │               │
│  │  (data/) │    │  Queue   │    │ (GPU/CPU)│               │
│  └──────────┘    └──────────┘    └────┬─────┘               │
│                                       │                     │
│                       ┌───────────────┴──────────────┐      │
│                       ▼                              ▼      │
│                  ┌──────────┐                   ┌──────────┐│
│                  │  Triage  │                   │  Parser  ││
│                  │(PDF type)│                   │(Docling) ││
│                  └────┬─────┘                   └────┬─────┘│
│                       │                              │      │
│                ┌──────┴──────┐                       │      │
│                ▼             ▼                       ▼      │
│           ┌─────────┐  ┌──────────┐             ┌──────────┐│
│           │ Digital │  │ Scanned  │             │   YAML   ││
│           │   PDF   │  │ (→ OCR)  │             │ Converter││
│           └─────────┘  └──────────┘             └────┬─────┘│
│                                                      │      │
│                                                      ▼      │
│                                                 ┌──────────┐│
│                                                 │  Output  ││
│                                                 │(data/out)││
│                                                 └──────────┘│
└─────────────────────────────────────────────────────────────┘
```
### Components
| Component | Description |
|---|---|
| Redis | Job queue, state management, distributed locking |
| Worker | PDF processing using Docling + pypdfium2 |
| Triage | Classifies PDFs as digital or scanned |
| Parser | Extracts text, tables, images |
| Converter | Transforms parsed data to YAML format |
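The triage component's exact rules live in the pipeline code; a common heuristic for the digital-vs-scanned decision is counting extractable text characters per page, which the sketch below illustrates (the `triage` function and its threshold are assumptions, not the pipeline's actual implementation):

```python
def triage(chars_per_page: list[int], min_chars: int = 50) -> str:
    """Classify a PDF as 'digital' or 'scanned' from the number of
    extractable text characters on each page. Pages below min_chars
    are treated as image-only; if most pages are image-only, the
    document is routed to the OCR branch."""
    if not chars_per_page:
        return "scanned"
    text_pages = sum(1 for c in chars_per_page if c >= min_chars)
    return "digital" if text_pages >= len(chars_per_page) / 2 else "scanned"
```

Triaging before parsing matters for throughput: digital PDFs skip OCR entirely, which is the expensive path (and disabled by default via `OCR_ENABLED=false`).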
File Flow
- PDFs placed in
data/directory queue-initscans and adds files to Redis queue- Workers pull files from queue with distributed locks
- Docling parses PDF structure (GPU-accelerated)
- YAML files written to
data/output/ - Completed files tracked in Redis
file:doneset
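Workers claim files under per-file distributed locks (the `file:lock:{hash}` keys). In Redis this is typically `SET key token NX EX ttl` plus an owner check on release; the in-memory sketch below mimics those semantics, and the `LockTable` class is illustrative, not the pipeline's code:

```python
import time

class LockTable:
    """In-memory sketch of per-file distributed locks with TTL.
    Each lock stores an owner token so only the worker that
    acquired it can release it; expired locks can be taken over."""

    def __init__(self):
        self._locks = {}  # key -> (token, expiry)

    def acquire(self, key, token, ttl=600, now=None):
        now = time.monotonic() if now is None else now
        held = self._locks.get(key)
        if held and held[1] > now:
            return False                  # another worker holds a live lock
        self._locks[key] = (token, now + ttl)
        return True

    def release(self, key, token):
        held = self._locks.get(key)
        if held and held[0] == token:     # only the owner may release
            del self._locks[key]
            return True
        return False
```

The TTL is what makes the pipeline fault tolerant: if a worker crashes mid-file, its lock expires and another worker can claim the file instead of it staying stuck in `file:processing` forever.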
## Testing

```bash
# Install test dependencies (run outside Docker)
pip install pytest fakeredis

# Run all tests
pytest tests/ -v

# Run a specific test suite
pytest tests/test_lock_operations.py -v   # Lock atomicity tests
pytest tests/test_deduplicator.py -v      # Deduplication tests
pytest tests/test_yaml_converter.py -v    # Output format tests
```
## Troubleshooting

### "CUDA not available" or GPU errors

Solution: Switch to CPU mode:

```bash
# In .env:
DOCLING_DEVICE=cpu
```

### Docker build fails

Solution 1: Build without cache:

```bash
docker build --no-cache -t pdf-pipeline:latest .
```

Solution 2: Check disk space:

```bash
df -h
docker system prune -a  # Warning: removes all unused images
```

### Out of memory (OOM)

Solution 1: Reduce the page limit:

```bash
# In .env:
MAX_PDF_PAGES=50
```

Solution 2: Reduce worker memory:

```bash
# In .env:
WORKER_MEMORY=4G
```

Solution 3: Use a single worker only:

```bash
docker compose up -d  # Don't use --profile scale
```

### Pipeline stuck / no output

Check Redis:

```bash
docker compose logs redis
docker compose restart redis
```

Check worker logs:

```bash
docker compose logs worker-0
```

Reset the queue:

```bash
docker compose down
docker compose up -d
docker compose run --rm queue-init
```

### Permission denied on data/

```bash
chmod -R 755 data/
# Or for Docker volume issues:
sudo chown -R $USER:$USER data/
```

### Files not being processed

Ensure the files are in the queue:

```bash
docker compose run --rm queue-init --pattern "**/*.pdf"
```

### Worker crashes repeatedly

Check for a problematic PDF in the failed set; with `SAFE_MODE=true`, the pipeline will skip it on restart.
## API Reference (For Programmatic Use)

### Direct Python Usage

```python
from src.pipeline.parsers.docling_yaml_adapter import DoclingYAMLAdapter

# Initialize the parser
adapter = DoclingYAMLAdapter(
    ocr_enabled=False,
    table_extraction=True,
)

# Parse a PDF
result = adapter.parse("/path/to/document.pdf")

# Access results
print(result["document"]["page_count"])
print(result["content"]["paragraphs"])
print(result["tables"])
```
### Redis Queue Keys

| Key | Type | Description |
|---|---|---|
| `file:queue` | List | Files waiting to be processed |
| `file:queue:set` | Set | Deduplication set for the queue |
| `file:processing` | List | Files currently being processed |
| `file:done` | Set | Successfully processed files |
| `file:failed` | Set | Failed files (dead letter queue) |
| `file:lock:{hash}` | String | Per-file distributed lock |
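Why both `file:queue` and `file:queue:set`? The list preserves FIFO order while the set gives O(1) duplicate checks, so re-running `queue-init` never enqueues the same file twice. A minimal in-memory sketch of that pairing (illustrative; the `DedupQueue` class is not part of the pipeline's API):

```python
class DedupQueue:
    """Sketch of the file:queue / file:queue:set pairing: a list
    for FIFO order plus a set so the same path is never enqueued
    twice while it is still waiting."""

    def __init__(self):
        self._list = []
        self._set = set()

    def push(self, path):
        if path in self._set:
            return False          # duplicate: already queued
        self._set.add(path)
        self._list.append(path)
        return True

    def pop(self):
        if not self._list:
            return None
        path = self._list.pop(0)
        self._set.discard(path)   # keep the set in sync with the list
        return path
```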
## License
MIT License - see LICENSE file.