Vexy PDF Werk

Transform PDFs into high-quality, accessible formats with AI-enhanced processing

Vexy PDF Werk (VPW) is a Python package that converts PDF documents into multiple high-quality formats using modern tools and optional AI enhancement. Transform your PDFs into PDF/A archives, paginated Markdown, ePub books, and structured bibliographic metadata.

See SPEC.md for the full specification.

Features

🔧 Modern PDF Processing

PDF/A conversion for long-term archival.
OCR enhancement using OCRmyPDF.
Quality optimization with qpdf.
In-depth PDF analysis to detect text, images, and scanned documents.

📚 Multiple Output Formats

Paginated Markdown documents with smart naming and YAML frontmatter.
ePub generation from Markdown content.
Structured bibliographic YAML metadata, including estimated word count and content preview.
Preserves original PDF alongside enhanced versions.
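As a rough illustration of the frontmatter and content summary described above, the sketch below prepends a small YAML header to a page of Markdown and computes a word count and preview. The field names (`title`, `page`, `word_count`, `preview`) and the 200-character preview length are assumptions made for this example, not VPW's documented schema.

```python
import json


def summarize(text, preview_chars=200):
    """Return an estimated word count and a short single-line preview."""
    words = text.split()
    preview = " ".join(words)[:preview_chars]
    return {"word_count": len(words), "preview": preview}


def with_frontmatter(body, page, title):
    """Prepend a minimal YAML frontmatter block to a Markdown page.

    json.dumps is used only to quote the title safely; a double-quoted
    JSON string is also a valid YAML scalar.
    """
    header = "\n".join(["---", f"title: {json.dumps(title)}", f"page: {page}", "---"])
    return f"{header}\n\n{body}"


if __name__ == "__main__":
    page_text = "Introduction\n\nThis page was extracted from a PDF."
    print(with_frontmatter(page_text, page=0, title="Introduction"))
    print(summarize(page_text))
```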

🤖 Optional AI Enhancement (Future)

Text correction using Claude or Gemini CLI.
Content structure optimization.
Fallback to proven traditional methods.

⚙️ Flexible Architecture

Multiple conversion backends (Marker, MarkItDown, Docling, basic); only the basic backend is currently implemented.
Platform-appropriate configuration storage (for example, ~/.config/vexy-pdf-werk/config.toml on Linux and macOS).
Robust error handling with graceful fallbacks.
Command-line interface for easy integration into workflows.
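One way to picture the backend fallback is a resolver that picks the first installed backend and otherwise drops back to basic. This is a sketch, not VPW's actual selection logic: the preference order and the import names (`marker`, `markitdown`, `docling`) are assumptions for illustration.

```python
import importlib.util

# (backend name, importable module) pairs in an assumed order of preference.
BACKEND_MODULES = (
    ("marker", "marker"),
    ("markitdown", "markitdown"),
    ("docling", "docling"),
)


def _installed(module):
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def choose_backend(requested="auto", is_installed=_installed):
    """Resolve a backend name, falling back to 'basic' when a backend is missing."""
    if requested == "basic":
        return "basic"
    if requested != "auto":
        module = dict(BACKEND_MODULES).get(requested)
        if module is None:
            raise ValueError(f"unknown backend: {requested!r}")
        return requested if is_installed(module) else "basic"
    for name, module in BACKEND_MODULES:
        if is_installed(module):
            return name
    return "basic"


if __name__ == "__main__":
    print(choose_backend("auto"))
```

The `is_installed` parameter exists only so the lookup can be replaced in tests.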

Quick Start

Installation

# Install from PyPI
pip install vexy-pdf-werk

# Or install in development mode
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
pip install -e .

CLI Usage

The primary way to use Vexy PDF Werk is through its command-line interface, vpw.

Process a PDF

# Process a PDF into all default formats (pdfa, markdown, epub, yaml)
vpw process document.pdf

# Specify output directory and formats
vpw process document.pdf --output_dir ./my-output --formats "markdown,epub"

# Enable verbose logging for debugging
vpw process document.pdf --verbose
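To call the CLI from a script, a thin wrapper around `subprocess` is usually enough. The sketch below uses only the flags shown above (`--output_dir`, `--formats`, `--verbose`); the helper names `build_vpw_command` and `run_vpw` are hypothetical and not part of the package.

```python
import shutil
import subprocess
from pathlib import Path


def build_vpw_command(pdf, output_dir=None, formats=None, verbose=False):
    """Build the argument list for `vpw process` from the documented flags."""
    cmd = ["vpw", "process", str(pdf)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if formats:
        cmd += ["--formats", ",".join(formats)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_vpw(pdf, **options):
    """Run vpw if it is on PATH; raises CalledProcessError on a non-zero exit."""
    if shutil.which("vpw") is None:
        raise FileNotFoundError("vpw not found on PATH; install vexy-pdf-werk first")
    return subprocess.run(build_vpw_command(pdf, **options), check=True)


if __name__ == "__main__":
    print(build_vpw_command(Path("document.pdf"), "./my-output", ["markdown", "epub"]))
```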

Manage Configuration

# Display the current configuration
vpw config --show

# Create a default configuration file if one doesn't exist
vpw config --init

Output Structure

VPW creates organized output with consistent naming:

output/
├── document_enhanced.pdf    # PDF/A version
├── 000--introduction.md     # Paginated Markdown files
├── 001--chapter-one.md
├── 002--conclusions.md
├── document.epub            # Generated ePub
└── metadata.yaml            # Bibliographic data
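The page file names in the listing follow a zero-padded index, a double hyphen, and a lowercase slug. A minimal sketch of that pattern follows; the slug rules (lowercase, runs of non-alphanumerics collapsed to one hyphen) are inferred from the example names, and the `page` fallback for empty titles is an assumption.

```python
import re


def page_filename(index, title):
    """Build a name such as '001--chapter-one.md' from a page index and title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{index:03d}--{slug or 'page'}.md"


if __name__ == "__main__":
    for i, title in enumerate(["Introduction", "Chapter One", "Conclusions"]):
        print(page_filename(i, title))
```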

System Requirements

Required Dependencies

Python 3.10+
tesseract-ocr
qpdf
ghostscript

Optional Dependencies

pandoc (for ePub generation)
marker-pdf (advanced PDF conversion)
markitdown (Microsoft's document converter)
docling (IBM's document understanding)

Installation Commands

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng qpdf ghostscript pandoc

macOS:

brew install tesseract tesseract-lang qpdf ghostscript pandoc

Windows:

choco install tesseract qpdf ghostscript pandoc
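After installing, it can help to confirm that the external tools are actually on PATH. The check below is a sketch: the executable names are assumptions (`gs` is Ghostscript's usual name on Linux and macOS, while Windows builds typically ship a differently named executable), and VPW itself may detect its dependencies another way.

```python
import shutil

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ("tesseract", "qpdf", "gs")
OPTIONAL_TOOLS = ("pandoc",)


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("missing required:", missing_tools(REQUIRED_TOOLS))
    print("missing optional:", missing_tools(OPTIONAL_TOOLS))
```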

Configuration

VPW stores configuration in platform-appropriate directories:

Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
Windows: %APPDATA%\vexy-pdf-werk\config.toml

You can initialize a default configuration file by running vpw config --init.
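The two documented locations can be resolved with a few lines of standard-library code. This sketch mirrors the paths listed above and nothing more; the function name `config_path` and its parameters are hypothetical, and VPW may resolve the directory differently internally.

```python
import os
import sys
from pathlib import Path, PurePosixPath, PureWindowsPath


def config_path(platform=sys.platform, env=os.environ, home=None):
    """Return the documented config.toml location for the given platform."""
    if platform.startswith("win"):
        appdata = env.get("APPDATA")
        if appdata is None:
            raise KeyError("APPDATA is not set")
        return PureWindowsPath(appdata) / "vexy-pdf-werk" / "config.toml"
    base = PurePosixPath(home if home is not None else Path.home().as_posix())
    return base / ".config" / "vexy-pdf-werk" / "config.toml"


if __name__ == "__main__":
    print(config_path())
```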

Example Configuration

[processing]
ocr_language = "eng"
pdf_quality = "high" # high, medium, low
force_ocr = false
deskew = true
rotate_pages = true

[conversion]
markdown_backend = "auto"  # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true
extract_tables = true

[ai]
enabled = false
provider = "claude"  # claude, gemini, custom
correction_enabled = false
enhancement_enabled = false
max_tokens = 4000

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
filename_template = "{stem}_{format}.{ext}"
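A configuration like the one above can be read with the standard-library TOML parser (`tomllib`, Python 3.11+; the `tomli` backport is the usual substitute on 3.10). The sketch parses an excerpt and expands `filename_template` with `str.format`. That expansion is an assumption about how the placeholders are filled, and the real output names may differ (the Output Structure listing shows `document_enhanced.pdf`, for instance).

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:  # Python 3.10: requires `pip install tomli`
    import tomli as tomllib

EXAMPLE = """
[conversion]
markdown_backend = "auto"

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
filename_template = "{stem}_{format}.{ext}"
"""

config = tomllib.loads(EXAMPLE)
template = config["output"]["filename_template"]

# Assumed placeholder semantics: stem = input name, format = output kind.
name = template.format(stem="document", format="pdfa", ext="pdf")

if __name__ == "__main__":
    print(config["output"]["formats"])
    print(name)
```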

Architecture

VPW follows a modular pipeline architecture:

PDF Input → Analysis → OCR Enhancement → Content Extraction → Format Generation → Multi-Format Output
                          ↓
                   Optional AI Enhancement
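The diagram can be read as an ordered list of stages with one optional step. The sketch below models only that ordering; the stage functions are placeholders that record their own names, and placing AI enhancement directly after OCR enhancement is an interpretation of the diagram rather than a documented guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str
    log: list = field(default_factory=list)  # names of the stages that have run


def stage(name):
    """Create a placeholder stage that only records that it ran."""
    def run(doc):
        doc.log.append(name)
        return doc
    return run


def build_pipeline(ai_enabled=False):
    """Return the stages in diagram order, with AI enhancement as an option."""
    stages = [stage("analysis"), stage("ocr_enhancement")]
    if ai_enabled:
        stages.append(stage("ai_enhancement"))
    stages += [stage("content_extraction"), stage("format_generation")]
    return stages


def process(source, ai_enabled=False):
    doc = Document(source)
    for run in build_pipeline(ai_enabled):
        doc = run(doc)
    return doc


if __name__ == "__main__":
    print(process("document.pdf", ai_enabled=True).log)
```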

Core Components

PDFProcessor: Handles OCR, PDF/A conversion, and analysis of the PDF file. It uses ocrmypdf and qpdf for robust processing.
MarkdownGenerator: Converts the processed PDF into Markdown. It supports different backends (currently basic is implemented) and can create paginated or single-file output.
EpubCreator: Generates an ePub file from the Markdown content, creating chapters for each page.
MetadataExtractor: Extracts comprehensive metadata from the PDF and the processing results, saving it to a metadata.yaml file. This includes file info, PDF properties, and content summaries like word count.
cli.py: Provides the command-line interface using fire, allowing for easy configuration and execution of the processing pipeline.
config.py: Manages the application's configuration using pydantic and toml, with support for environment variable overrides.
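To make the idea of environment variable overrides concrete, here is a small stand-in that uses a plain dataclass instead of pydantic. The `VPW_PROCESSING_` prefix and the variable names are hypothetical; the source states only that overrides are supported, not how they are named.

```python
import os
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ProcessingConfig:
    ocr_language: str = "eng"
    force_ocr: bool = False


def apply_env(config, env=os.environ, prefix="VPW_PROCESSING_"):
    """Return a copy of `config` with values overridden from the environment.

    The prefix is a made-up convention for this sketch.
    """
    updates = {}
    for f in fields(config):
        raw = env.get(prefix + f.name.upper())
        if raw is None:
            continue
        if isinstance(getattr(config, f.name), bool):
            updates[f.name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            updates[f.name] = raw
    return replace(config, **updates)


if __name__ == "__main__":
    print(apply_env(ProcessingConfig(), {"VPW_PROCESSING_OCR_LANGUAGE": "deu"}))
```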

Development

This project uses modern Python tooling:

Package Management: uv + hatch (use uv run for running; use hatch for other operations, such as hatch test)
Code Quality: ruff + mypy
Testing: pytest
Version Control: git-tag-based semver with hatch-vcs

Development Setup

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
uv venv --python 3.12
uv sync --all-extras

# Run tests
hatch run test

# Run linting
hatch run lint:fmt

Contributing

1. Fork the repository.
2. Create a feature branch (git checkout -b feature/amazing-feature).
3. Make your changes following the code quality standards.
4. Run tests and linting.
5. Commit your changes (git commit -m 'Add amazing feature').
6. Push to the branch (git push origin feature/amazing-feature).
7. Open a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

Fontlab Ltd - Initial work - Vexy Art

Acknowledgments

Built on proven tools: qpdf, OCRmyPDF, tesseract
Integration with cutting-edge AI services
Inspired by the need for better PDF accessibility and archival

Project Status: Under active development

For detailed implementation specifications, see the spec/ directory.

Release history

1.1.9 (Sep 15, 2025)
1.1.8.dev0, pre-release (Sep 15, 2025)
1.1.7.dev0, pre-release (Sep 15, 2025)
1.1.5.dev0, pre-release (Sep 15, 2025)
1.1.4 (Sep 14, 2025)
1.1.4.dev0, pre-release (Sep 14, 2025), this version
1.1.2.dev0, pre-release (Sep 14, 2025)

