GPU-accelerated document processing: PDF/HWP → YAML for LLM training data

PDF-YAML Pipeline

GPU-accelerated document processing pipeline that converts PDF/HWP documents to structured YAML format, optimized for LLM training data preparation.

Features
System Requirements
Quick Start
Configuration
Usage
Output Format
Architecture
Testing
Troubleshooting
License

Features

PDF Parsing: High-quality PDF text extraction using Docling
HWP Support: Korean HWP/HWPX document parsing
Scan Detection: Automatic triage of scanned vs digital PDFs
GPU Acceleration: CUDA-optimized processing with multi-GPU support
CPU Mode: Works without GPU (slower but functional)
Table Extraction: Structured table data with cell-level bounding boxes
Redis Queue: Distributed worker architecture for scalability
Fault Tolerant: Automatic retry, dead letter queue, lock management
YAML Output: Structured, LLM-friendly output format

System Requirements

Minimum (CPU Mode)

Component	Requirement
OS	Linux (Ubuntu 20.04+), macOS, Windows with WSL2
Docker	20.10+ with Docker Compose v2
RAM	8GB
Disk	10GB free space

Recommended (GPU Mode)

Component	Requirement
OS	Linux (Ubuntu 20.04+ recommended)
Docker	20.10+ with Docker Compose v2
NVIDIA Driver	525+
NVIDIA Container Toolkit	Installed and configured
GPU	RTX 3060 (12GB) or better
RAM	16GB+
Disk	20GB free space

Verify GPU Setup

# Check NVIDIA driver
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

Quick Start

Option 1: One-Click Setup (Recommended)

git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline
./setup.sh

The setup script automatically:

Detects GPU availability (falls back to CPU mode if no GPU)
Creates .env configuration file
Creates data/ and data/output/ directories
Builds the Docker image (~10-20 min on first run)
Optionally downloads a sample PDF for testing

Option 2: Manual Setup

# 1. Clone the repository
git clone https://github.com/seunghyuoffice-design/pdf-yaml-pipeline.git
cd pdf-yaml-pipeline

# 2. Create configuration
cp .env.example .env
mkdir -p data/output

# 3. (Optional) Edit .env for your environment
# For CPU mode: set DOCLING_DEVICE=cpu

# 4. Build Docker image
docker build -t pdf-pipeline:latest .

# 5. Start the pipeline
docker compose up -d

Configuration

All Environment Variables

Edit .env to customize:

Variable	Default	Description
Data Paths
`DATA_PATH`	`./data`	Input directory for PDF/HWP files
`OUTPUT_PATH`	`./data/output`	Output directory for YAML files
GPU Settings
`CUDA_VISIBLE_DEVICES`	`0`	GPU device ID(s), e.g., `0` or `0,1`
`DOCLING_DEVICE`	`cuda`	`cuda` for GPU, `cpu` for CPU-only
Worker Settings
`THREADS_PER_WORKER`	`2`	CPU threads per worker
`WORKER_MEMORY`	`8G`	Memory limit per worker
Processing
`MAX_PDF_PAGES`	`100`	Max pages per PDF (truncates larger docs)
`TIMEOUT`	`600`	Processing timeout in seconds
`MAX_RETRIES`	`1`	Retry count before moving to DLQ
`SAFE_MODE`	`true`	Skip problematic files instead of crashing
`OCR_ENABLED`	`false`	Enable OCR for scanned documents
`LOG_LEVEL`	`INFO`	Logging verbosity (DEBUG/INFO/WARNING/ERROR)
Redis
`REDIS_HOST`	`redis`	Redis hostname (use default in Docker)
`REDIS_PORT`	`6379`	Redis port
`REDIS_PASSWORD`	(empty)	Redis password (optional)
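
As an illustration of how these variables and their defaults fit together, here is a minimal Python sketch that reads a subset of them from the environment. The PipelineSettings class and the load_settings helper are hypothetical names used only for this example; the pipeline's own configuration code may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container for a subset of the documented variables."""
    data_path: str
    output_path: str
    docling_device: str
    max_pdf_pages: int
    timeout: int
    max_retries: int
    safe_mode: bool
    ocr_enabled: bool


def _as_bool(value: str) -> bool:
    # Accept the usual spellings of "true" found in .env files.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Read settings from env, falling back to the defaults in the table above."""
    env = os.environ if env is None else env
    device = env.get("DOCLING_DEVICE", "cuda")
    if device not in {"cuda", "cpu"}:
        raise ValueError(f"DOCLING_DEVICE must be 'cuda' or 'cpu', got {device!r}")
    return PipelineSettings(
        data_path=env.get("DATA_PATH", "./data"),
        output_path=env.get("OUTPUT_PATH", "./data/output"),
        docling_device=device,
        max_pdf_pages=int(env.get("MAX_PDF_PAGES", "100")),
        timeout=int(env.get("TIMEOUT", "600")),
        max_retries=int(env.get("MAX_RETRIES", "1")),
        safe_mode=_as_bool(env.get("SAFE_MODE", "true")),
        ocr_enabled=_as_bool(env.get("OCR_ENABLED", "false")),
    )
```

Calling load_settings() with no arguments reads os.environ; passing a plain dict makes the function easy to exercise in tests.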

CPU Mode (No GPU Required)

# In .env:
DOCLING_DEVICE=cpu

Note: CPU mode is approximately 5-10x slower than GPU mode.

Worker Configuration

Adjust resources in .env:

# Workers use this many CPU threads
THREADS_PER_WORKER=2

# Memory per worker
WORKER_MEMORY=8G

Scale workers:

# 1 worker (default, lower memory usage)
docker compose up -d

# 2 workers (up to ~2x throughput, requires more memory)
docker compose --profile scale up -d

Recommended settings by system:

System RAM	GPU VRAM	Workers	Threads	Memory
8GB	None (CPU)	1	2	4G
16GB	8GB	1	2	8G
32GB	12GB+	2	4	12G
64GB+	24GB+	2	8	16G
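
The table above can also be read as a lookup. The sketch below encodes its four rows in Python; recommend_plan and WorkerPlan are hypothetical names, and falling back to the next lower row when a system sits between two rows is an assumption of this example, not documented behavior.

```python
from typing import NamedTuple, Optional


class WorkerPlan(NamedTuple):
    workers: int
    threads_per_worker: int   # value for THREADS_PER_WORKER
    worker_memory: str        # value for WORKER_MEMORY


def recommend_plan(ram_gb: int, vram_gb: Optional[int]) -> WorkerPlan:
    """Return the highest table row whose RAM and VRAM thresholds are both met."""
    vram = vram_gb or 0  # None means CPU-only
    if ram_gb >= 64 and vram >= 24:
        return WorkerPlan(2, 8, "16G")
    if ram_gb >= 32 and vram >= 12:
        return WorkerPlan(2, 4, "12G")
    if ram_gb >= 16 and vram >= 8:
        return WorkerPlan(1, 2, "8G")
    # Smallest configuration: the 8GB / CPU row of the table.
    return WorkerPlan(1, 2, "4G")
```

For example, recommend_plan(32, 12) yields two workers with four threads and 12G each, matching the third row.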

Usage

Basic Workflow

# 1. Add PDF files to data directory
cp /path/to/documents/*.pdf ./data/

# 2. Start the pipeline
docker compose up -d

# 3. Initialize the queue (scans data/ for new files)
docker compose run --rm queue-init

# 4. Monitor progress
docker compose logs -f worker-0

# 5. Check results
ls ./data/output/

Useful Commands

# View worker logs
docker compose logs -f worker-0

# Check queue status
docker compose run --rm queue-monitor

# Stop pipeline
docker compose down

# Restart with fresh state
docker compose down -v  # Warning: clears Redis data
docker compose up -d

# Add more files to running pipeline
cp more_files/*.pdf ./data/
docker compose run --rm queue-init

Processing States

Files move through these states:

file:queue → file:processing → file:done
                    ↓
              file:failed (DLQ)
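
To make the transitions concrete, here is a small in-memory Python model of the state diagram. It is only a sketch: the real pipeline keeps these states in Redis, and the reading of MAX_RETRIES used here (one initial attempt plus MAX_RETRIES retries before a file lands in file:failed) is an assumption. run_queue is a hypothetical name.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Set, Tuple

MAX_RETRIES = 1  # documented default


def run_queue(
    paths: Iterable[str],
    process: Callable[[str], None],
    max_retries: int = MAX_RETRIES,
) -> Tuple[Set[str], Set[str]]:
    """Drain the queue and return the (done, failed) sets."""
    queue = deque(paths)             # stands in for file:queue
    failures: Dict[str, int] = {}    # failed attempts per file
    done: Set[str] = set()           # stands in for file:done
    failed: Set[str] = set()         # stands in for file:failed (DLQ)

    while queue:
        path = queue.popleft()       # file:queue -> file:processing
        try:
            process(path)
        except Exception:
            failures[path] = failures.get(path, 0) + 1
            if failures[path] > max_retries:
                failed.add(path)     # retries exhausted -> file:failed
            else:
                queue.append(path)   # back to file:queue for a retry
        else:
            done.add(path)           # file:processing -> file:done
    return done, failed
```

With the default of 1, a file that always raises is attempted twice and then moved to the failed set, while a file that fails once and then succeeds ends up in done.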

Output Format

YAML Structure

Each processed document produces one .yaml file:

document:
  source_path: "example.pdf"
  format: "pdf"
  parser: "docling"
  text_extractor: "pypdfium2"
  page_count: 15
  original_pages: 15        # Total pages in original PDF
  truncated: false          # true if exceeded MAX_PDF_PAGES
  max_pages_limit: null     # Limit that caused truncation (if any)
  ocr_enabled: false
  table_extraction: true
  encrypted: false

content:
  paragraphs:
    - "First paragraph text extracted from the document..."
    - "Second paragraph continues here with more content..."
    - "Each paragraph is a separate list item."

tables:
  - page: 1
    bbox: [100, 200, 500, 400]  # [x1, y1, x2, y2]
    cells:
      - text: "Header 1"
        row: 0
        col: 0
        bbox: [100, 200, 200, 220]
        confidence: 0.95
      - text: "Value 1"
        row: 1
        col: 0
        bbox: [100, 220, 200, 240]
        confidence: 0.98

assets:
  images: []  # Image metadata if extracted
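
Because each cell carries its own row and col index, a table entry can be turned back into a row-major grid. The following Python sketch assumes the YAML has already been loaded into a dict (for instance with PyYAML's yaml.safe_load) and that indices are zero-based as in the example above; cells_to_grid is a hypothetical helper.

```python
from typing import Any, Dict, List


def cells_to_grid(table: Dict[str, Any]) -> List[List[str]]:
    """Rebuild a row-major grid of cell texts from one entry of `tables`."""
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells) + 1
    n_cols = max(cell["col"] for cell in cells) + 1
    # Positions with no cell in the YAML stay as empty strings.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]
    return grid
```

Applied to the sample table above, this gives a two-row, one-column grid holding "Header 1" and "Value 1".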

Key Fields Explained

Field	Type	Description
`document.page_count`	int	Number of pages actually processed
`document.original_pages`	int	Total pages in source PDF
`document.truncated`	bool	`true` if document was cut at MAX_PDF_PAGES
`content.paragraphs`	list	Extracted text paragraphs in reading order
`tables[].cells`	list	Cell-level table data with positions
`tables[].cells[].confidence`	float	OCR confidence score (0.0-1.0)
`tables[].cells[].bbox`	list	Bounding box [x1, y1, x2, y2] in pixels
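
The relationship between page_count, original_pages, truncated, and max_pages_limit can be written down in a few lines. This Python sketch is inferred from the field descriptions and the sample output (where max_pages_limit is null for an untruncated document); how the pipeline actually computes these fields is an assumption, and truncation_fields is a hypothetical name.

```python
from typing import Any, Dict, Optional


def truncation_fields(original_pages: int, max_pages: Optional[int]) -> Dict[str, Any]:
    """Derive the truncation-related document fields from a page limit."""
    truncated = max_pages is not None and original_pages > max_pages
    return {
        # Pages actually processed: capped at the limit when truncated.
        "page_count": max_pages if truncated else original_pages,
        "original_pages": original_pages,
        "truncated": truncated,
        # Only reported when the limit actually cut the document.
        "max_pages_limit": max_pages if truncated else None,
    }
```

A 15-page PDF under the default limit of 100 reproduces the values in the sample YAML; a 250-page PDF would report page_count 100 and truncated true.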

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    PDF-YAML Pipeline                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐               │
│  │  Input   │───▶│  Redis   │───▶│ Workers  │               │
│  │  (data/) │    │  Queue   │    │ (GPU/CPU)│               │
│  └──────────┘    └──────────┘    └────┬─────┘               │
│                                       │                      │
│                       ┌───────────────┴───────────────┐      │
│                       ▼                               ▼      │
│                ┌──────────┐                    ┌──────────┐  │
│                │  Triage  │                    │  Parser  │  │
│                │(PDF type)│                    │(Docling) │  │
│                └────┬─────┘                    └────┬─────┘  │
│                     │                               │        │
│              ┌──────┴──────┐                        │        │
│              ▼             ▼                        ▼        │
│        ┌─────────┐  ┌──────────┐            ┌──────────┐    │
│        │ Digital │  │ Scanned  │            │  YAML    │    │
│        │  PDF    │  │ (→ OCR)  │            │ Converter│    │
│        └─────────┘  └──────────┘            └────┬─────┘    │
│                                                  │           │
│                                                  ▼           │
│                                           ┌──────────┐       │
│                                           │  Output  │       │
│                                           │(data/out)│       │
│                                           └──────────┘       │
└─────────────────────────────────────────────────────────────┘

Components

Component	Description
Redis	Job queue, state management, distributed locking
Worker	PDF processing using Docling + pypdfium2
Triage	Classifies PDFs as digital or scanned
Parser	Extracts text, tables, images
Converter	Transforms parsed data to YAML format

File Flow

PDFs placed in data/ directory
queue-init scans and adds files to Redis queue
Workers pull files from queue with distributed locks
Docling parses PDF structure (GPU-accelerated)
YAML files written to data/output/
Completed files tracked in Redis file:done set
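
The first two steps of this flow (scan the input directory, then enqueue only files not seen before) could be sketched in Python as follows. The set passed in stands in for the Redis deduplication set; scan_new_files is a hypothetical helper, and keying files by their path relative to the data directory is an assumption of this example.

```python
from pathlib import Path
from typing import List, Set, Union


def scan_new_files(
    data_dir: Union[str, Path],
    already_queued: Set[str],
    pattern: str = "**/*.pdf",
) -> List[str]:
    """Return files under data_dir that match pattern and are not yet queued.

    Newly found files are added to already_queued, so a second scan of the
    same directory returns nothing until more files arrive.
    """
    root = Path(data_dir)
    new_files: List[str] = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        if key in already_queued:
            continue
        already_queued.add(key)
        new_files.append(key)
    return new_files
```

Running the scan again after copying more PDFs into the directory returns only the new arrivals, which mirrors re-running queue-init on a live pipeline.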

Testing

# Install test dependencies (run outside Docker)
pip install pytest fakeredis

# Run all tests
pytest tests/ -v

# Run specific test suite
pytest tests/test_lock_operations.py -v    # Lock atomicity tests
pytest tests/test_deduplicator.py -v       # Deduplication tests
pytest tests/test_yaml_converter.py -v     # Output format tests

Troubleshooting

"CUDA not available" or GPU errors

Solution: Switch to CPU mode:

# In .env:
DOCLING_DEVICE=cpu

Docker build fails

Solution 1: Build without cache:

docker build --no-cache -t pdf-pipeline:latest .

Solution 2: Check disk space:

df -h
docker system prune -a  # Warning: removes all unused images

Out of memory (OOM)

Solution 1: Reduce page limit:

# In .env:
MAX_PDF_PAGES=50

Solution 2: Reduce worker memory:

# In .env:
WORKER_MEMORY=4G

Solution 3: Use single worker only:

docker compose up -d  # Don't use --profile scale

Pipeline stuck / no output

Check Redis:

docker compose logs redis
docker compose restart redis

Check worker logs:

docker compose logs worker-0

Restart and re-enqueue (use docker compose down -v instead if Redis state should also be cleared):

docker compose down
docker compose up -d
docker compose run --rm queue-init

Permission denied on data/

chmod -R 755 data/
# Or for Docker volume issues:
sudo chown -R $USER:$USER data/

Files not being processed

Ensure files are in the queue:

docker compose run --rm queue-init --pattern "**/*.pdf"

Worker crashes repeatedly

Check the file:failed set in Redis for a problematic PDF. With SAFE_MODE=true, the pipeline skips such files on restart instead of crashing.

API Reference (For Programmatic Use)

Direct Python Usage

from src.pipeline.parsers.docling_yaml_adapter import DoclingYAMLAdapter

# Initialize parser
adapter = DoclingYAMLAdapter(
    ocr_enabled=False,
    table_extraction=True
)

# Parse a PDF
result = adapter.parse("/path/to/document.pdf")

# Access results
print(result["document"]["page_count"])
print(result["content"]["paragraphs"])
print(result["tables"])
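
Since the output is meant for LLM training data, one possible follow-up is to flatten a parsed result into a single JSON Lines record. The sketch below relies only on the result keys shown above; to_training_record and the record layout (source, truncated, text) are hypothetical choices for illustration, not part of the package.

```python
import json
from typing import Any, Dict


def to_training_record(result: Dict[str, Any]) -> str:
    """Serialize one parsed document as a single JSON line."""
    document = result.get("document", {})
    paragraphs = result.get("content", {}).get("paragraphs", [])
    record = {
        "source": document.get("source_path"),
        "truncated": bool(document.get("truncated", False)),
        # Drop empty paragraphs and join the rest with blank lines.
        "text": "\n\n".join(p.strip() for p in paragraphs if p and p.strip()),
    }
    # ensure_ascii=False keeps non-ASCII text (e.g. Korean from HWP files) readable.
    return json.dumps(record, ensure_ascii=False)
```

Writing one such line per document yields a .jsonl file that most training data loaders can stream.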

Redis Queue Keys

Key	Type	Description
`file:queue`	List	Files waiting to be processed
`file:queue:set`	Set	Deduplication set for queue
`file:processing`	List	Files currently being processed
`file:done`	Set	Successfully processed files
`file:failed`	Set	Failed files (dead letter queue)
`file:lock:{hash}`	String	Per-file distributed lock
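
Given the key types in this table, queue depth can be read with list-length and set-cardinality calls. The Python sketch below works with any client object exposing llen and scard, which is how redis-py names the LLEN and SCARD commands; queue_status and the RedisLike protocol are hypothetical names for this example.

```python
from typing import Dict, Protocol


class RedisLike(Protocol):
    """Minimal client interface needed here (redis-py's Redis class fits)."""

    def llen(self, name: str) -> int: ...
    def scard(self, name: str) -> int: ...


def queue_status(client: RedisLike) -> Dict[str, int]:
    """Count entries per pipeline state using the documented keys."""
    return {
        "queued": client.llen("file:queue"),           # List
        "processing": client.llen("file:processing"),  # List
        "done": client.scard("file:done"),             # Set
        "failed": client.scard("file:failed"),         # Set
    }
```

With redis-py installed, passing something like redis.Redis(host="localhost", port=6379) would report live counts; the host and port depend on how Redis is exposed in your deployment.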

License

MIT License - see LICENSE file.

Acknowledgments

Docling - PDF parsing engine
pypdfium2 - Text extraction
Redis - Queue management

Release

Version 0.1.0, released Jan 19, 2026.

