Skip to main content

GPU-accelerated document processing: PDF/HWP → YAML for LLM training data

Project description

PDF-YAML Pipeline

GPU-accelerated document processing pipeline that converts PDF/HWP documents to structured YAML format, optimized for LLM training data preparation.

Table of Contents

Features

  • PDF Parsing: High-quality PDF text extraction using Docling
  • HWP Support: Korean HWP/HWPX document parsing
  • Scan Detection: Automatic triage of scanned vs digital PDFs
  • GPU Acceleration: CUDA-optimized processing with multi-GPU support
  • CPU Mode: Works without GPU (slower but functional)
  • Table Extraction: Structured table data with cell-level bounding boxes
  • Redis Queue: Distributed worker architecture for scalability
  • Fault Tolerant: Automatic retry, dead letter queue, lock management
  • YAML Output: Structured, LLM-friendly output format

System Requirements

Minimum (CPU Mode)

Component Requirement
OS Linux (Ubuntu 20.04+), macOS, Windows with WSL2
Docker 20.10+ with Docker Compose v2
RAM 8GB
Disk 10GB free space

Recommended (GPU Mode)

Component Requirement
OS Linux (Ubuntu 20.04+ recommended)
Docker 20.10+ with Docker Compose v2
NVIDIA Driver 525+
NVIDIA Container Toolkit Installed and configured
GPU RTX 3060 (12GB) or better
RAM 16GB+
Disk 20GB free space

Verify GPU Setup

# Check NVIDIA driver
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0-base-ubuntu22.04 nvidia-smi

Quick Start

Option 1: One-Click Setup (Recommended)

git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline
./setup.sh

The setup script automatically:

  1. Detects GPU availability (falls back to CPU mode if no GPU)
  2. Creates .env configuration file
  3. Creates data/ and data/output/ directories
  4. Builds the Docker image (~10-20 min on first run)
  5. Optionally downloads a sample PDF for testing

Option 2: Manual Setup

# 1. Clone the repository
git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline

# 2. Create configuration
cp .env.example .env
mkdir -p data/output

# 3. (Optional) Edit .env for your environment
# For CPU mode: set DOCLING_DEVICE=cpu

# 4. Build Docker image
docker build -t pdf-pipeline:latest .

# 5. Start the pipeline
docker compose up -d

Configuration

All Environment Variables

Edit .env to customize:

Variable Default Description
Data Paths
DATA_PATH ./data Input directory for PDF/HWP files
OUTPUT_PATH ./data/output Output directory for YAML files
GPU Settings
CUDA_VISIBLE_DEVICES 0 GPU device ID(s), e.g., 0 or 0,1
DOCLING_DEVICE cuda cuda for GPU, cpu for CPU-only
Worker Settings
THREADS_PER_WORKER 2 CPU threads per worker
WORKER_MEMORY 8G Memory limit per worker
Processing
MAX_PDF_PAGES 100 Max pages per PDF (truncates larger docs)
TIMEOUT 600 Processing timeout in seconds
MAX_RETRIES 1 Retry count before moving to DLQ
SAFE_MODE true Skip problematic files instead of crashing
OCR_ENABLED false Enable OCR for scanned documents
LOG_LEVEL INFO Logging verbosity (DEBUG/INFO/WARNING/ERROR)
Redis
REDIS_HOST redis Redis hostname (use default in Docker)
REDIS_PORT 6379 Redis port
REDIS_PASSWORD (empty) Redis password (optional)

CPU Mode (No GPU Required)

# In .env:
DOCLING_DEVICE=cpu

Note: CPU mode is approximately 5-10x slower than GPU mode.

Worker Configuration

Adjust resources in .env:

# Workers use this many CPU threads
THREADS_PER_WORKER=2

# Memory per worker
WORKER_MEMORY=8G

Scale workers:

# 1 worker (default, lower memory usage)
docker compose up -d

# 2 workers (2x throughput, requires more memory)
docker compose --profile scale up -d

Recommended settings by system:

System RAM GPU VRAM Workers Threads Memory
8GB None (CPU) 1 2 4G
16GB 8GB 1 2 8G
32GB 12GB+ 2 4 12G
64GB+ 24GB+ 2 8 16G

Usage

Basic Workflow

# 1. Add PDF files to data directory
cp /path/to/documents/*.pdf ./data/

# 2. Start the pipeline
docker compose up -d

# 3. Initialize the queue (scans data/ for new files)
docker compose run --rm queue-init

# 4. Monitor progress
docker compose logs -f worker-0

# 5. Check results
ls ./data/output/

Useful Commands

# View worker logs
docker compose logs -f worker-0

# Check queue status
docker compose run --rm queue-monitor

# Stop pipeline
docker compose down

# Restart with fresh state
docker compose down -v  # Warning: clears Redis data
docker compose up -d

# Add more files to running pipeline
cp more_files/*.pdf ./data/
docker compose run --rm queue-init

Processing States

Files move through these states:

file:queue → file:processing → file:done
                    ↓
              file:failed (DLQ)

Output Format

YAML Structure

Each processed PDF creates a .yaml file:

document:
  source_path: "example.pdf"
  format: "pdf"
  parser: "docling"
  text_extractor: "pypdfium2"
  page_count: 15
  original_pages: 15        # Total pages in original PDF
  truncated: false          # true if exceeded MAX_PDF_PAGES
  max_pages_limit: null     # Limit that caused truncation (if any)
  ocr_enabled: false
  table_extraction: true
  encrypted: false

content:
  paragraphs:
    - "First paragraph text extracted from the document..."
    - "Second paragraph continues here with more content..."
    - "Each paragraph is a separate list item."

tables:
  - page: 1
    bbox: [100, 200, 500, 400]  # [x1, y1, x2, y2]
    cells:
      - text: "Header 1"
        row: 0
        col: 0
        bbox: [100, 200, 200, 220]
        confidence: 0.95
      - text: "Value 1"
        row: 1
        col: 0
        bbox: [100, 220, 200, 240]
        confidence: 0.98

assets:
  images: []  # Image metadata if extracted

Key Fields Explained

Field Type Description
document.page_count int Number of pages actually processed
document.original_pages int Total pages in source PDF
document.truncated bool true if document was cut at MAX_PDF_PAGES
content.paragraphs list Extracted text paragraphs in reading order
tables[].cells list Cell-level table data with positions
tables[].cells[].confidence float OCR confidence score (0.0-1.0)
tables[].cells[].bbox list Bounding box [x1, y1, x2, y2] in pixels

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PDF-YAML Pipeline                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐               │
│  │  Input   │───▶│  Redis   │───▶│ Workers  │               │
│  │  (data/) │    │  Queue   │    │ (GPU/CPU)│               │
│  └──────────┘    └──────────┘    └────┬─────┘               │
│                                       │                      │
│                       ┌───────────────┴───────────────┐      │
│                       ▼                               ▼      │
│                ┌──────────┐                    ┌──────────┐  │
│                │  Triage  │                    │  Parser  │  │
│                │(PDF type)│                    │(Docling) │  │
│                └────┬─────┘                    └────┬─────┘  │
│                     │                               │        │
│              ┌──────┴──────┐                        │        │
│              ▼             ▼                        ▼        │
│        ┌─────────┐  ┌──────────┐            ┌──────────┐    │
│        │ Digital │  │ Scanned  │            │  YAML    │    │
│        │  PDF    │  │ (→ OCR)  │            │ Converter│    │
│        └─────────┘  └──────────┘            └────┬─────┘    │
│                                                  │           │
│                                                  ▼           │
│                                           ┌──────────┐       │
│                                           │  Output  │       │
│                                           │(data/out)│       │
│                                           └──────────┘       │
└─────────────────────────────────────────────────────────────┘

Components

Component Description
Redis Job queue, state management, distributed locking
Worker PDF processing using Docling + pypdfium2
Triage Classifies PDFs as digital or scanned
Parser Extracts text, tables, images
Converter Transforms parsed data to YAML format

File Flow

  1. PDFs placed in data/ directory
  2. queue-init scans and adds files to Redis queue
  3. Workers pull files from queue with distributed locks
  4. Docling parses PDF structure (GPU-accelerated)
  5. YAML files written to data/output/
  6. Completed files tracked in Redis file:done set

Testing

# Install test dependencies (run outside Docker)
pip install pytest fakeredis

# Run all tests
pytest tests/ -v

# Run specific test suite
pytest tests/test_lock_operations.py -v    # Lock atomicity tests
pytest tests/test_deduplicator.py -v       # Deduplication tests
pytest tests/test_yaml_converter.py -v     # Output format tests

Troubleshooting

"CUDA not available" or GPU errors

Solution: Switch to CPU mode:

# In .env:
DOCLING_DEVICE=cpu

Docker build fails

Solution 1: Build without cache:

docker build --no-cache -t pdf-pipeline:latest .

Solution 2: Check disk space:

df -h
docker system prune -a  # Warning: removes all unused images

Out of memory (OOM)

Solution 1: Reduce page limit:

# In .env:
MAX_PDF_PAGES=50

Solution 2: Reduce worker memory:

# In .env:
WORKER_MEMORY=4G

Solution 3: Use single worker only:

docker compose up -d  # Don't use --profile scale

Pipeline stuck / no output

Check Redis:

docker compose logs redis
docker compose restart redis

Check worker logs:

docker compose logs worker-0

Reset queue:

docker compose down
docker compose up -d
docker compose run --rm queue-init

Permission denied on data/

chmod -R 755 data/
# Or for Docker volume issues:
sudo chown -R $USER:$USER data/

Files not being processed

Ensure files are in the queue:

docker compose run --rm queue-init --pattern "**/*.pdf"

Worker crashes repeatedly

Check for problematic PDF in failed set - the pipeline will skip it on restart with SAFE_MODE=true.

API Reference (For Programmatic Use)

Direct Python Usage

from src.pipeline.parsers.docling_yaml_adapter import DoclingYAMLAdapter

# Initialize parser
adapter = DoclingYAMLAdapter(
    ocr_enabled=False,
    table_extraction=True
)

# Parse a PDF
result = adapter.parse("/path/to/document.pdf")

# Access results
print(result["document"]["page_count"])
print(result["content"]["paragraphs"])
print(result["tables"])

Redis Queue Keys

Key Type Description
file:queue List Files waiting to be processed
file:queue:set Set Deduplication set for queue
file:processing List Files currently being processed
file:done Set Successfully processed files
file:failed Set Failed files (dead letter queue)
file:lock:{hash} String Per-file distributed lock

License

MIT License - see LICENSE file.

Acknowledgments

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf_yaml_pipeline-0.1.0.tar.gz (117.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdf_yaml_pipeline-0.1.0-py3-none-any.whl (153.7 kB view details)

Uploaded Python 3

File details

Details for the file pdf_yaml_pipeline-0.1.0.tar.gz.

File metadata

  • Download URL: pdf_yaml_pipeline-0.1.0.tar.gz
  • Upload date:
  • Size: 117.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pdf_yaml_pipeline-0.1.0.tar.gz
Algorithm Hash digest
SHA256 13192365678cadae89b55a3ea9ac3ec8e3edb4b0911b4d7a2de53a4988be96ac
MD5 06da4ce9b45623e12a16ad70e0b55200
BLAKE2b-256 edc549c1b0d393852f212387d4b56b61bcfe12f291dd49ad264b2dedec655782

See more details on using hashes here.

File details

Details for the file pdf_yaml_pipeline-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pdf_yaml_pipeline-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6688b399ed1897e6e7a67302810864a450d672029b36b0ce1404e47f0566a3c6
MD5 9d5b6c301d6c070e0cf3a1b0084e5479
BLAKE2b-256 06049f3078edddce378ed1eebd4e70b4ac331a3d2ea158e4cdbdfebc5e2f34c3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page