Vexy PDF Werk
Transform PDFs into high-quality, accessible formats with AI-enhanced processing
Vexy PDF Werk (VPW) is a Python package that converts PDF documents into multiple high-quality formats using modern tools and optional AI enhancement. Transform your PDFs into PDF/A archives, paginated Markdown, ePub books, and structured bibliographic metadata.
SPEC.md is the full specification.
Features
🔧 Modern PDF Processing
- PDF/A conversion for long-term archival
- OCR enhancement using OCRmyPDF
- Quality optimization with qpdf
- In-depth PDF analysis to detect text, images, and scanned documents.
📚 Multiple Output Formats
- Paginated Markdown documents with smart naming and YAML frontmatter.
- ePub generation from Markdown content.
- Structured bibliographic YAML metadata, including estimated word count and content preview.
- Preserves original PDF alongside enhanced versions.
🤖 Optional AI Enhancement (Future)
- Text correction using Claude or Gemini CLI.
- Content structure optimization.
- Fallback to proven traditional methods.
⚙️ Flexible Architecture
- Multiple conversion backends (Marker, MarkItDown, Docling, basic).
- Platform-appropriate configuration storage (~/.config/vexy-pdf-werk/config.toml).
- Robust error handling with graceful fallbacks.
- Command-line interface for easy integration into workflows.
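The idea of several conversion backends with a graceful fallback can be sketched as follows. This is an illustration only, not VPW's actual selection code: the function `pick_backend`, the preference order, and the import names `marker`, `markitdown`, and `docling` are all assumptions.

```python
from importlib.util import find_spec
from typing import Callable

# Preference order is an assumption; the import names are guesses at what
# each optional package installs, not confirmed VPW internals.
BACKEND_MODULES = [
    ("marker", "marker"),
    ("markitdown", "markitdown"),
    ("docling", "docling"),
]


def _is_installed(module: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        return False


def pick_backend(
    requested: str = "auto",
    is_installed: Callable[[str], bool] = _is_installed,
) -> str:
    """Return the backend to use, falling back to "basic" when needed."""
    if requested == "basic":
        return "basic"
    for name, module in BACKEND_MODULES:
        # "auto" accepts the first installed backend; a named request
        # only accepts that specific backend.
        if requested in ("auto", name) and is_installed(module):
            return name
    return "basic"
```

Passing the availability check in as a parameter keeps the fallback logic testable without installing any of the optional packages.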
Quick Start
Installation
# Install from PyPI
pip install vexy-pdf-werk
# Or install in development mode
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
pip install -e .
CLI Usage
The primary way to use Vexy PDF Werk is through its command-line interface, vpw.
Process a PDF
# Process a PDF into all default formats (pdfa, markdown, epub, yaml)
vpw process document.pdf
# Specify output directory and formats
vpw process document.pdf --output_dir ./my-output --formats "markdown,epub"
# Enable verbose logging for debugging
vpw process document.pdf --verbose
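For workflows driven from Python, the same commands can be assembled and run with `subprocess`. The sketch below uses only the flags shown above (`--output_dir`, `--formats`, `--verbose`). The helper names `build_vpw_command` and `run_vpw` are hypothetical and not part of the package.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def build_vpw_command(
    pdf: Path,
    output_dir: Optional[Path] = None,
    formats: Optional[Sequence[str]] = None,
    verbose: bool = False,
) -> list[str]:
    """Assemble a `vpw process` command using only the flags shown above."""
    cmd = ["vpw", "process", str(pdf)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if formats:
        # The CLI examples pass formats as one comma-separated string.
        cmd += ["--formats", ",".join(formats)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_vpw(pdf: Path, **options) -> int:
    """Run vpw and return its exit code (requires vpw on PATH)."""
    return subprocess.run(build_vpw_command(pdf, **options), check=False).returncode
```

Building the argument list separately from running it makes the command easy to log or inspect before execution.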
Manage Configuration
# Display the current configuration
vpw config --show
# Create a default configuration file if one doesn't exist
vpw config --init
Output Structure
VPW creates organized output with consistent naming:
output/
├── document_enhanced.pdf # PDF/A version
├── 000--introduction.md # Paginated Markdown files
├── 001--chapter-one.md
├── 002--conclusions.md
├── document.epub # Generated ePub
└── metadata.yaml # Bibliographic data
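The Markdown file names in the listing follow a zero-padded index, a double hyphen, and a lowercase slug. The sketch below reproduces that pattern. The function `page_filename` and its slug rules are assumptions inferred from the listing, not the package's own implementation.

```python
import re


def page_filename(index: int, title: str, width: int = 3) -> str:
    """Build a name like '000--introduction.md' from a page index and title."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Fall back to a generic slug when the title has no usable characters.
    slug = slug or "page"
    return f"{index:0{width}d}--{slug}.md"
```

Zero-padding keeps the files in page order when a directory listing sorts them by name.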
System Requirements
Required Dependencies
- Python 3.10+
- tesseract-ocr
- qpdf
- ghostscript
Optional Dependencies
- pandoc (for ePub generation)
- marker-pdf (advanced PDF conversion)
- markitdown (Microsoft's document converter)
- docling (IBM's document understanding)
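Before processing, it can help to check which external tools are on `PATH`. The sketch below uses `shutil.which` for that. The executable names are assumptions (Ghostscript is typically `gs` on Linux and macOS but may be `gswin64c` on Windows), and `missing_tools` is a hypothetical helper.

```python
import shutil
from typing import Callable, Optional

# Executable names are assumptions: Ghostscript is usually `gs` on
# Linux/macOS, while Windows builds often ship `gswin64c` instead.
REQUIRED_TOOLS = {"tesseract": "tesseract", "qpdf": "qpdf", "ghostscript": "gs"}
OPTIONAL_TOOLS = {"pandoc": "pandoc"}


def missing_tools(
    tools: dict[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    print("Missing required tools:", missing_tools(REQUIRED_TOOLS) or "none")
    print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS) or "none")
```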
Installation Commands
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng qpdf ghostscript pandoc
macOS:
brew install tesseract tesseract-lang qpdf ghostscript pandoc
Windows:
choco install tesseract qpdf ghostscript pandoc
Configuration
VPW stores configuration in platform-appropriate directories:
- Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
- Windows: %APPDATA%\vexy-pdf-werk\config.toml
You can initialize a default configuration file by running vpw config --init.
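The two locations above can be resolved in Python roughly as follows. This is a sketch under stated assumptions: `config_path` is a hypothetical helper, the fallback used when `APPDATA` is unset is a guess, and variables such as `XDG_CONFIG_HOME` are not handled because the README does not mention them.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional

APP_DIR = "vexy-pdf-werk"


def config_path(
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Return the expected config.toml location for the given platform."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if platform.startswith("win") and env.get("APPDATA"):
        base = Path(env["APPDATA"])
    elif platform.startswith("win"):
        # Assumed fallback when APPDATA is not set.
        base = home / "AppData" / "Roaming"
    else:
        # Linux and macOS share ~/.config according to the list above.
        base = home / ".config"
    return base / APP_DIR / "config.toml"
```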
Example Configuration
[processing]
ocr_language = "eng"
pdf_quality = "high" # high, medium, low
force_ocr = false
deskew = true
rotate_pages = true
[conversion]
markdown_backend = "auto" # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true
extract_tables = true
[ai]
enabled = false
provider = "claude" # claude, gemini, custom
correction_enabled = false
enhancement_enabled = false
max_tokens = 4000
[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
filename_template = "{stem}_{format}.{ext}"
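A file like this can be read and sanity-checked from Python before a run. The sketch below parses a shortened copy of the example and validates the `[output].formats` list. `load_formats` is a hypothetical helper, and treating the four listed formats as the only valid values is an assumption based on the example.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for Python 3.10

EXAMPLE = """
[processing]
ocr_language = "eng"
force_ocr = false

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
"""

# Assumed to be the full set of valid formats, based on the example above.
DEFAULT_FORMATS = ["pdfa", "markdown", "epub", "yaml"]


def load_formats(toml_text: str) -> list[str]:
    """Parse TOML text and return the validated [output].formats list."""
    data = tomllib.loads(toml_text)
    formats = data.get("output", {}).get("formats", list(DEFAULT_FORMATS))
    unknown = [f for f in formats if f not in DEFAULT_FORMATS]
    if unknown:
        raise ValueError(f"unknown output formats: {unknown}")
    return formats


if __name__ == "__main__":
    print(load_formats(EXAMPLE))
```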
Architecture
VPW follows a modular pipeline architecture:
PDF Input → Analysis → OCR Enhancement → Content Extraction → Format Generation → Multi-Format Output
↓
Optional AI Enhancement
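One way to picture this pipeline is as a list of stage functions applied in order, with the optional AI stage included only when enabled. The sketch below is a toy model: the `Job` fields, the stage bodies, and the position of the AI stage (after content extraction) are all assumptions, and none of it is VPW's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    """State passed between stages; the fields are illustrative only."""
    pdf_name: str
    needs_ocr: bool = False
    text: str = ""
    log: List[str] = field(default_factory=list)


Stage = Callable[[Job], Job]


def analyze(job: Job) -> Job:
    # Stand-in: a real analysis would inspect the PDF for a text layer.
    job.log.append("analysis")
    return job


def ocr(job: Job) -> Job:
    # OCR only runs when the document was flagged as needing it.
    if job.needs_ocr:
        job.log.append("ocr")
    return job


def extract(job: Job) -> Job:
    job.text = f"text of {job.pdf_name}"
    job.log.append("extraction")
    return job


def ai_enhance(job: Job) -> Job:
    job.log.append("ai")
    return job


def generate(job: Job) -> Job:
    job.log.append("generation")
    return job


def build_pipeline(ai_enabled: bool) -> List[Stage]:
    """Return the ordered stages, adding the AI step only when enabled."""
    stages: List[Stage] = [analyze, ocr, extract]
    if ai_enabled:
        stages.append(ai_enhance)
    stages.append(generate)
    return stages


def run(job: Job, stages: List[Stage]) -> Job:
    """Apply each stage in order and return the final job state."""
    for stage in stages:
        job = stage(job)
    return job
```

Because each stage has the same signature, stages can be added, removed, or reordered without changing the runner.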
Core Components
- PDFProcessor: Handles OCR, PDF/A conversion, and analysis of the PDF file. It uses ocrmypdf and qpdf for robust processing.
- MarkdownGenerator: Converts the processed PDF into Markdown. It supports different backends (currently only basic is implemented) and can create paginated or single-file output.
- EpubCreator: Generates an ePub file from the Markdown content, creating chapters for each page.
- MetadataExtractor: Extracts comprehensive metadata from the PDF and the processing results, saving it to a metadata.yaml file. This includes file info, PDF properties, and content summaries like word count.
- cli.py: Provides the command-line interface using fire, allowing for easy configuration and execution of the processing pipeline.
- config.py: Manages the application's configuration using pydantic and toml, with support for environment variable overrides.
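Environment variable overrides on top of file-based configuration can be sketched with plain dictionaries. This sketch deliberately uses only the standard library instead of pydantic. The `VPW_` prefix and the `SECTION__KEY` naming scheme are hypothetical and not confirmed by the project.

```python
from typing import Any, Dict, Mapping

PREFIX = "VPW_"  # hypothetical prefix, not confirmed by the project


def _coerce(raw: str, current: Any) -> Any:
    """Convert an environment string to the type of the existing value."""
    # bool is checked first because bool is a subclass of int.
    if isinstance(current, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(current, int):
        return int(raw)
    return raw


def apply_env_overrides(
    config: Mapping[str, Mapping[str, Any]],
    env: Mapping[str, str],
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of config with VPW_<SECTION>__<KEY> values applied."""
    merged = {section: dict(values) for section, values in config.items()}
    for name, raw in env.items():
        if not name.startswith(PREFIX):
            continue
        rest = name[len(PREFIX):].lower()
        if "__" not in rest:
            continue
        section, key = rest.split("__", 1)
        # Only known keys are overridden; unknown names are ignored.
        if section in merged and key in merged[section]:
            merged[section][key] = _coerce(raw, merged[section][key])
    return merged
```

For example, under this assumed scheme, setting `VPW_AI__ENABLED=true` would flip `enabled` in the `[ai]` section while leaving the file on disk untouched.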
Development
This project uses modern Python tooling:
- Package Management: uv + hatch (use uv run to run code, and hatch for other operations, such as hatch test)
- Code Quality: ruff + mypy
- Testing: pytest
- Version Control: git-tag-based semver with hatch-vcs
Development Setup
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
uv venv --python 3.12
uv sync --all-extras
# Run tests
hatch run test
# Run linting
hatch run lint:fmt
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes following the code quality standards
- Run tests and linting
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Fontlab Ltd - Initial work - Vexy Art
Acknowledgments
- Built on proven tools: qpdf, OCRmyPDF, tesseract
- Integration with cutting-edge AI services
- Inspired by the need for better PDF accessibility and archival
Project Status: Under active development
For detailed implementation specifications, see the spec/ directory.
File details
Details for the file vexy_pdf_werk-1.1.4.tar.gz.
File metadata
- Download URL: vexy_pdf_werk-1.1.4.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6114688fafc89db5d3d0062469fb4c8cd1e51f38c1b3b6f35906f1fb6244294 |
| MD5 | e87d1186fcdf7c3f69c53c3bb725043a |
| BLAKE2b-256 | 9f796649bd0de7e906fdcc5cb226a7b219bf6187d673dfdac09f45f9d85855e0 |
File details
Details for the file vexy_pdf_werk-1.1.4-py3-none-any.whl.
File metadata
- Download URL: vexy_pdf_werk-1.1.4-py3-none-any.whl
- Upload date:
- Size: 40.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 918d95324bf2ea3925a8a06b13326b0ff81ca906f9fe72f6dd6252a5529486d4 |
| MD5 | 409793a5238656108c89de29b654cd85 |
| BLAKE2b-256 | c1ab41d08494112eb15476c18b3d18c6104c704b2e25808ca0cfb13e77f49f03 |