Vexy PDF Werk

Transform PDFs into high-quality, accessible formats with AI-enhanced processing

Vexy PDF Werk (VPW) is a Python package that converts PDF documents into multiple high-quality formats using modern tools and optional AI enhancement. Transform your PDFs into PDF/A archives, paginated Markdown, ePub books, and structured bibliographic metadata.

  • SPEC.md is the full specification

Features

🔧 Modern PDF Processing

  • PDF/A conversion for long-term archival.
  • OCR enhancement using OCRmyPDF.
  • Quality optimization with qpdf.
  • In-depth PDF analysis to detect text, images, and scanned documents.

📚 Multiple Output Formats

  • Paginated Markdown documents with smart naming and YAML frontmatter.
  • ePub generation from Markdown content.
  • Structured bibliographic YAML metadata, including estimated word count and content preview.
  • Preserves original PDF alongside enhanced versions.

🤖 Optional AI Enhancement (Future)

  • Text correction using Claude or Gemini CLI.
  • Content structure optimization.
  • Fallback to proven traditional methods.

⚙️ Flexible Architecture

  • Pluggable conversion backends (Marker, MarkItDown, Docling, basic); the basic backend is the one currently implemented.
  • Platform-appropriate configuration storage (~/.config/vexy-pdf-werk/config.toml).
  • Robust error handling with graceful fallbacks.
  • Command-line interface for easy integration into workflows.

Quick Start

Installation

# Install from PyPI
pip install vexy-pdf-werk

# Or install in development mode
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
pip install -e .

CLI Usage

The primary way to use Vexy PDF Werk is through its command-line interface, vpw.

Process a PDF

# Process a PDF into all default formats (pdfa, markdown, epub, yaml)
vpw process document.pdf

# Specify output directory and formats
vpw process document.pdf --output_dir ./my-output --formats "markdown,epub"

# Enable verbose logging for debugging
vpw process document.pdf --verbose
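
When vpw is driven from a script rather than typed by hand, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of VPW: the helper name build_vpw_command is hypothetical, and it uses only the subcommand and flags shown above.

```python
import shlex


def build_vpw_command(pdf_path, output_dir=None, formats=None, verbose=False):
    """Build an argument list for `vpw process` (hypothetical helper, not a VPW API)."""
    cmd = ["vpw", "process", str(pdf_path)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if formats:
        # The CLI examples pass formats as one comma-separated string.
        cmd += ["--formats", ",".join(formats)]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    cmd = build_vpw_command("document.pdf", "./my-output", ["markdown", "epub"])
    # Print a shell-safe rendering of the command for inspection.
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once vpw is installed and on the PATH.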

Manage Configuration

# Display the current configuration
vpw config --show

# Create a default configuration file if one doesn't exist
vpw config --init

Output Structure

VPW creates organized output with consistent naming:

output/
├── document_enhanced.pdf    # PDF/A version
├── 000--introduction.md     # Paginated Markdown files
├── 001--chapter-one.md
├── 002--conclusions.md
├── document.epub            # Generated ePub
└── metadata.yaml            # Bibliographic data
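
The Markdown file names above appear to combine a zero-padded page index with a slug derived from a title. The sketch below reproduces that pattern under that assumption; page_filename and its slug rules are hypothetical, and VPW's actual naming logic may differ.

```python
import re


def page_filename(index, title, width=3):
    """Return a name like '000--introduction.md' (illustrative, not VPW's own code)."""
    # Lowercase the title and collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Fall back to a generic slug when the title has no usable characters.
    return f"{index:0{width}d}--{slug or 'page'}.md"


if __name__ == "__main__":
    for i, title in enumerate(["Introduction", "Chapter One", "Conclusions"]):
        print(page_filename(i, title))
```

Zero-padding keeps the files in reading order when a directory listing is sorted alphabetically.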

System Requirements

Required Dependencies

  • Python 3.10+
  • tesseract-ocr
  • qpdf
  • ghostscript

Optional Dependencies

  • pandoc (for ePub generation)
  • marker-pdf (advanced PDF conversion)
  • markitdown (Microsoft's document converter)
  • docling (IBM's document understanding)

Installation Commands

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng qpdf ghostscript pandoc

macOS:

brew install tesseract tesseract-lang qpdf ghostscript pandoc

Windows:

choco install tesseract qpdf ghostscript pandoc
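
After installing, a quick way to confirm that the required tools are reachable is to look them up on the PATH. This is a minimal sketch and not something VPW ships; the executable names are assumptions (for example, Ghostscript is usually gs on Linux and macOS but has a different name on Windows).

```python
import shutil

# Package name -> assumed executable name. Adjust for your platform.
REQUIRED_TOOLS = {
    "tesseract-ocr": "tesseract",
    "qpdf": "qpdf",
    "ghostscript": "gs",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the package names whose executable cannot be found on the PATH."""
    return [package for package, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```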

Configuration

VPW stores configuration in platform-appropriate directories:

  • Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
  • Windows: %APPDATA%\vexy-pdf-werk\config.toml

You can initialize a default configuration file by running vpw config --init.
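
To locate the file from your own scripts, the two documented locations can be mirrored in a few lines. This is only a sketch of the paths listed above; default_config_path is a hypothetical helper, and VPW may resolve its configuration directory differently internally.

```python
import os
import sys
from pathlib import Path


def default_config_path(platform=sys.platform, env=os.environ, home=None):
    """Mirror the documented config locations (hypothetical helper, not a VPW API)."""
    if platform.startswith("win"):
        # Windows: %APPDATA%\vexy-pdf-werk\config.toml
        base = Path(env["APPDATA"])
    else:
        # Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
        base = (Path(home) if home is not None else Path.home()) / ".config"
    return base / "vexy-pdf-werk" / "config.toml"


if __name__ == "__main__":
    path = default_config_path()
    print(path, "(exists)" if path.exists() else "(not created yet)")
```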

Example Configuration

[processing]
ocr_language = "eng"
pdf_quality = "high"  # high, medium, low
force_ocr = false
deskew = true
rotate_pages = true

[conversion]
markdown_backend = "auto"  # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true
extract_tables = true

[ai]
enabled = false
provider = "claude"  # claude, gemini, custom
correction_enabled = false
enhancement_enabled = false
max_tokens = 4000

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
filename_template = "{stem}_{format}.{ext}"
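
Before running a long job, it can be useful to sanity-check a hand-edited config file. The sketch below parses TOML with the standard library and checks two keys against the values listed in the comments above. It is independent of VPW's own pydantic-based loading; check_config is a hypothetical helper, and tomllib requires Python 3.11+ (the tomli backport is assumed on 3.10).

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport, assumed installed on Python 3.10

# Allowed values taken from the example configuration above.
KNOWN_BACKENDS = {"auto", "marker", "markitdown", "docling", "basic"}
KNOWN_FORMATS = {"pdfa", "markdown", "epub", "yaml"}

SAMPLE = """
[conversion]
markdown_backend = "auto"
paginate_markdown = true

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
"""


def check_config(text):
    """Parse TOML text and return a list of human-readable problems (empty if none)."""
    config = tomllib.loads(text)
    problems = []
    backend = config.get("conversion", {}).get("markdown_backend", "auto")
    if backend not in KNOWN_BACKENDS:
        problems.append(f"unknown markdown_backend: {backend!r}")
    for fmt in config.get("output", {}).get("formats", []):
        if fmt not in KNOWN_FORMATS:
            problems.append(f"unknown output format: {fmt!r}")
    return problems


if __name__ == "__main__":
    print(check_config(SAMPLE) or "config looks fine")
```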

Architecture

VPW follows a modular pipeline architecture:

PDF Input → Analysis → OCR Enhancement → Content Extraction → Format Generation → Multi-Format Output
                          ↓
                   Optional AI Enhancement
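
The flow in the diagram can be pictured as an ordered list of stages, with the AI step spliced in only when enabled. The following is a toy sketch rather than VPW's implementation: the stage functions are placeholders that only record their names, and placing the AI step directly after OCR enhancement is an assumption read off the diagram's branch.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def stage(name: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def run_stage(state: dict) -> dict:
        return {**state, "steps": state["steps"] + [name]}
    return run_stage


def build_pipeline(ai_enabled: bool = False) -> list[Stage]:
    """Assemble the stages in diagram order; the AI step is optional."""
    stages = [stage("analysis"), stage("ocr_enhancement")]
    if ai_enabled:
        # Assumed position: right after OCR, where the diagram branches.
        stages.append(stage("ai_enhancement"))
    stages += [stage("content_extraction"), stage("format_generation")]
    return stages


def process(pdf_path: str, ai_enabled: bool = False) -> dict:
    """Thread a state dict through every stage in order."""
    state = {"input": pdf_path, "steps": []}
    for step in build_pipeline(ai_enabled):
        state = step(state)
    return state


if __name__ == "__main__":
    print(process("document.pdf")["steps"])
    print(process("document.pdf", ai_enabled=True)["steps"])
```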

Core Components

  • PDFProcessor: Handles OCR, PDF/A conversion, and analysis of the PDF file. It uses ocrmypdf and qpdf for robust processing.
  • MarkdownGenerator: Converts the processed PDF into Markdown. It is designed to support several backends (only basic is currently implemented) and can create paginated or single-file output.
  • EpubCreator: Generates an ePub file from the Markdown content, creating chapters for each page.
  • MetadataExtractor: Extracts comprehensive metadata from the PDF and the processing results, saving it to a metadata.yaml file. This includes file info, PDF properties, and content summaries like word count.
  • cli.py: Provides the command-line interface using fire, allowing for easy configuration and execution of the processing pipeline.
  • config.py: Manages the application's configuration using pydantic and toml, with support for environment variable overrides.
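
The idea of several conversion backends with a graceful fallback can be sketched as trying each backend in order and keeping the first result. This is an illustration under assumptions, not VPW's code: BackendUnavailable, convert_with_fallback, and the two stub converters are hypothetical names, and how VPW actually orders or selects backends is not specified here.

```python
class BackendUnavailable(Exception):
    """Raised by a converter that cannot run (for example, a missing optional package)."""


def marker_stub(pdf_path):
    # Stand-in for an optional backend that is not installed.
    raise BackendUnavailable("marker-pdf is not installed")


def basic_stub(pdf_path):
    # Stand-in for the always-available basic backend.
    return f"# {pdf_path}\n"


def convert_with_fallback(pdf_path, backends):
    """Try each (name, converter) pair in order; return (name, markdown) from the first success."""
    errors = []
    for name, convert in backends:
        try:
            return name, convert(pdf_path)
        except BackendUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend succeeded: " + "; ".join(errors))


if __name__ == "__main__":
    used, markdown = convert_with_fallback(
        "document.pdf", [("marker", marker_stub), ("basic", basic_stub)]
    )
    print(f"converted with {used!r} backend")
```

Catching only a dedicated exception type keeps genuine bugs in a converter visible instead of silently falling through to the next backend.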

Development

This project uses modern Python tooling:

  • Package Management: uv + hatch (use uv run to run the tool; use hatch for other operations, such as hatch test)
  • Code Quality: ruff + mypy
  • Testing: pytest
  • Version Control: git-tag-based semver with hatch-vcs

Development Setup

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
uv venv --python 3.12
uv sync --all-extras

# Run tests
hatch run test

# Run linting
hatch run lint:fmt

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the code quality standards
  4. Run tests and linting
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Acknowledgments

  • Built on proven tools: qpdf, OCRmyPDF, tesseract
  • Planned integration with AI services (Claude, Gemini)
  • Inspired by the need for better PDF accessibility and archival

Project Status: Under active development

For detailed implementation specifications, see the spec/ directory.
