Vexy PDF Werk

Transform PDFs into high-quality, accessible formats with AI-enhanced processing

Vexy PDF Werk (VPW) is a Python package that converts PDF documents into multiple high-quality formats using modern tools and optional AI enhancement. Transform your PDFs into PDF/A archives, paginated Markdown, ePub books, and structured bibliographic metadata.

Features

🔧 Modern PDF Processing

  • PDF/A conversion for long-term archival
  • OCR enhancement using OCRmyPDF
  • Quality optimization with qpdf

📚 Multiple Output Formats

  • Paginated Markdown documents with smart naming
  • ePub generation from Markdown
  • Structured bibliographic YAML metadata
  • Preserves original PDF alongside enhanced versions

🤖 Optional AI Enhancement

  • Text correction using Claude or Gemini CLI
  • Content structure optimization
  • Fallback to proven traditional methods

⚙️ Flexible Architecture

  • Multiple conversion backends (Marker, MarkItDown, Docling, basic)
  • Platform-appropriate configuration storage
  • Robust error handling with graceful fallbacks

Quick Start

Installation

# Install from PyPI
pip install vexy-pdf-werk

# Or install in development mode
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
pip install -e .

Basic Usage

import vexy_pdf_werk

# Process a PDF with default settings
config = vexy_pdf_werk.Config(name="default", value="process")
result = vexy_pdf_werk.process_data(["document.pdf"], config=config)

CLI Usage (Coming Soon)

# Process a PDF into all formats
vpw process document.pdf

# Process with specific formats only
vpw process document.pdf --formats pdfa,markdown

# Enable AI enhancement
vpw process document.pdf --ai-enabled --ai-provider claude
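Since the CLI is not released yet, the following is only a sketch of how the flags shown above could be parsed with `argparse`. The parser layout, the `parse_formats` helper, and the defaults are assumptions made for illustration, not the project's actual implementation.

```python
import argparse

# Format names taken from the example configuration in this README.
KNOWN_FORMATS = ("pdfa", "markdown", "epub", "yaml")


def parse_formats(value: str) -> list[str]:
    """Split a comma-separated --formats value and reject unknown names."""
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in formats if item not in KNOWN_FORMATS]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown format(s): {', '.join(unknown)}")
    return formats


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `vpw process` examples."""
    parser = argparse.ArgumentParser(prog="vpw")
    subcommands = parser.add_subparsers(dest="command", required=True)
    process = subcommands.add_parser("process", help="convert one PDF")
    process.add_argument("pdf")
    process.add_argument("--formats", type=parse_formats, default=list(KNOWN_FORMATS))
    process.add_argument("--ai-enabled", action="store_true")
    process.add_argument("--ai-provider", choices=("claude", "gemini"), default="claude")
    return parser


args = build_parser().parse_args(["process", "document.pdf", "--formats", "pdfa,markdown"])
print(args.pdf, args.formats, args.ai_enabled)
```

Validating `--formats` at parse time means a typo such as `--formats pdf` fails before any processing starts.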

Output Structure

VPW creates organized output with consistent naming:

output/
├── document_enhanced.pdf    # PDF/A version
├── 000--introduction.md     # Paginated Markdown files
├── 001--chapter-one.md
├── 002--conclusions.md
├── document.epub            # Generated ePub
└── metadata.yaml            # Bibliographic data
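The Markdown file names in the tree above appear to combine a zero-padded index with a slug of the section title. A minimal sketch of that naming scheme follows; `page_filename` is a hypothetical helper, and the exact slug rules VPW uses may differ.

```python
import re


def page_filename(index: int, title: str) -> str:
    """Build a name such as '001--chapter-one.md' from an index and a title."""
    # Lowercase the title and collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Fall back to a generic slug when the title has no usable characters (an assumption).
    return f"{index:03d}--{slug or 'page'}.md"


for number, heading in enumerate(["Introduction", "Chapter One", "Conclusions"]):
    print(page_filename(number, heading))
```

The zero-padded prefix keeps the files in reading order when they are sorted by name.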

System Requirements

Required Dependencies

  • Python 3.10+
  • tesseract-ocr
  • qpdf
  • ghostscript

Optional Dependencies

  • pandoc (for ePub generation)
  • marker-pdf (advanced PDF conversion)
  • markitdown (Microsoft's document converter)
  • docling (IBM's document understanding)
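Because the conversion backends are optional, a caller may want to detect which ones are importable and fall back to the basic converter otherwise. The sketch below illustrates one way to do that. The import names in `BACKEND_MODULES` and the preference order are assumptions, and `pick_backend` is a hypothetical helper rather than part of VPW's API.

```python
from importlib.util import find_spec
from typing import Callable

# Assumed top-level import names for the optional packages, in assumed preference order.
BACKEND_MODULES = {"marker": "marker", "markitdown": "markitdown", "docling": "docling"}


def is_installed(module: str) -> bool:
    """Return True when the module can be imported in this environment."""
    return find_spec(module) is not None


def pick_backend(requested: str = "auto",
                 available: Callable[[str], bool] = is_installed) -> str:
    """Resolve a backend name, falling back to 'basic' when nothing else is usable."""
    if requested == "basic":
        return "basic"
    candidates = list(BACKEND_MODULES) if requested == "auto" else [requested]
    for name in candidates:
        if name not in BACKEND_MODULES:
            raise ValueError(f"unknown backend: {name}")
        if available(BACKEND_MODULES[name]):
            return name
    return "basic"


print(pick_backend("auto"))
```

Passing the availability check as a parameter keeps the selection logic testable without installing any of the optional packages.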

Installation Commands

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng qpdf ghostscript pandoc

macOS:

brew install tesseract tesseract-lang qpdf ghostscript pandoc

Windows:

choco install tesseract qpdf ghostscript pandoc
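After installing, it can help to confirm that the required executables are on `PATH`. This is a small sketch using `shutil.which`; the executable names are assumptions (Ghostscript is usually `gs` on Linux and macOS, but `gswin64c` on Windows), and `missing_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names for the required system dependencies.
REQUIRED_TOOLS = ("tesseract", "qpdf", "gs")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All required tools found.")
```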

Configuration

VPW stores configuration in platform-appropriate directories:

  • Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
  • Windows: %APPDATA%\vexy-pdf-werk\config.toml
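The two locations above can be derived from the platform name and the environment. The following sketch shows the idea; `config_path` is a hypothetical helper written for illustration, and VPW itself may resolve the directory differently.

```python
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath

APP_DIR = "vexy-pdf-werk"


def config_path(platform: str, env, home: str) -> str:
    """Return the config.toml location listed in this README for a platform."""
    if platform.startswith("win"):
        # Windows: %APPDATA%\vexy-pdf-werk\config.toml
        return str(PureWindowsPath(env["APPDATA"]) / APP_DIR / "config.toml")
    # Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
    return str(PurePosixPath(home) / ".config" / APP_DIR / "config.toml")


print(config_path(sys.platform, os.environ, str(Path.home())))
```

Taking the platform, environment, and home directory as arguments makes both branches easy to exercise on any operating system.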

Example Configuration

[processing]
ocr_language = "eng"
pdf_quality = "high"
force_ocr = false

[conversion]
markdown_backend = "auto"  # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true

[ai]
enabled = false
provider = "claude"  # claude, gemini
correction_enabled = false

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
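A configuration file in this shape can be read with the standard TOML parser. The sketch below parses a trimmed copy of the example and checks the two enumerated options against the values listed in its comments. `load_settings` is a hypothetical helper, not a documented VPW function.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport, needed on Python 3.10

CONFIG_TEXT = """
[conversion]
markdown_backend = "auto"
paginate_markdown = true

[ai]
enabled = false
provider = "claude"

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
"""

ALLOWED_BACKENDS = {"auto", "marker", "markitdown", "docling", "basic"}
ALLOWED_PROVIDERS = {"claude", "gemini"}


def load_settings(text: str) -> dict:
    """Parse TOML text and validate the enumerated options."""
    data = tomllib.loads(text)
    backend = data.get("conversion", {}).get("markdown_backend", "auto")
    provider = data.get("ai", {}).get("provider", "claude")
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported markdown_backend: {backend}")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unsupported ai provider: {provider}")
    return data


settings = load_settings(CONFIG_TEXT)
print(settings["output"]["formats"])
```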

Development

This project uses modern Python tooling:

  • Package Management: uv + hatch
  • Code Quality: ruff + mypy
  • Testing: pytest
  • Version Control: git-tag-based semver with hatch-vcs

Development Setup

# Install uv and hatch
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install hatch

# Clone and setup
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk

# Run tests using hatch (automatically manages environment)
hatch run test

# Run linting and formatting
hatch run lint

# Type checking
hatch run type-check

# Or run individual commands
hatch run python -c "import vexy_pdf_werk; print(vexy_pdf_werk.__version__)"

Architecture

VPW follows a modular pipeline architecture:

PDF Input → Analysis → OCR Enhancement → Content Extraction → Format Generation → Multi-Format Output
                          ↓
                   Optional AI Enhancement
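The pipeline in the diagram can be pictured as an ordered list of stages, with the AI step included only when it is enabled. This is a toy sketch under stated assumptions: the stage names come from the diagram, the position of the AI step (right after OCR) is a guess based on where the arrow sits, and the stage functions only record that they ran.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def make_stage(name: str) -> Stage:
    """Create a placeholder stage that appends its name to the document's step log."""
    def stage(doc: dict) -> dict:
        return {**doc, "steps": doc["steps"] + [name]}
    return stage


def build_pipeline(ai_enabled: bool) -> list[Stage]:
    """Assemble the stage list, inserting the optional AI step after OCR."""
    stages = [make_stage("analysis"), make_stage("ocr_enhancement")]
    if ai_enabled:
        stages.append(make_stage("ai_enhancement"))
    stages += [make_stage("content_extraction"), make_stage("format_generation")]
    return stages


def run(path: str, ai_enabled: bool = False) -> dict:
    """Pass a document record through every stage in order."""
    doc = {"source": path, "steps": []}
    for stage in build_pipeline(ai_enabled):
        doc = stage(doc)
    return doc


print(run("document.pdf", ai_enabled=True)["steps"])
```

Keeping each stage as a plain function from document to document lets optional steps be added or skipped without touching the others.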

Core Components

  • PDF Processor: Handles OCR and PDF/A conversion
  • Content Extractors: Multiple backends for PDF-to-Markdown
  • Format Generators: Creates ePub and metadata outputs
  • AI Integrations: Optional LLM enhancement services
  • Configuration System: Platform-aware settings management

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the code quality standards
  4. Run tests and linting
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on proven tools: qpdf, OCRmyPDF, tesseract
  • Optional integration with AI services through the Claude and Gemini CLIs
  • Inspired by the need for better PDF accessibility and archival

Project Status: Under active development

For detailed implementation specifications, see the spec/ directory.
