Vexy PDF Werk
Transform PDFs into high-quality, accessible formats with AI-enhanced processing
Vexy PDF Werk (VPW) is a Python package that converts PDF documents into multiple high-quality formats using modern tools and optional AI enhancement. Transform your PDFs into PDF/A archives, paginated Markdown, ePub books, and structured bibliographic metadata.
Features
🔧 Modern PDF Processing
- PDF/A conversion for long-term archival
- OCR enhancement using OCRmyPDF
- Quality optimization with qpdf
📚 Multiple Output Formats
- Paginated Markdown documents with smart naming
- ePub generation from Markdown
- Structured bibliographic YAML metadata
- Preserves original PDF alongside enhanced versions
🤖 Optional AI Enhancement
- Text correction using Claude or Gemini CLI
- Content structure optimization
- Fallback to proven traditional methods
⚙️ Flexible Architecture
- Multiple conversion backends (Marker, MarkItDown, Docling, basic)
- Platform-appropriate configuration storage
- Robust error handling with graceful fallbacks
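As a rough illustration of how an "auto" backend choice could work, the sketch below checks which optional converter packages are importable and picks the first one found. The function name `pick_backend` and the preference order are assumptions made for this example, not VPW's actual logic; only the import names `marker`, `markitdown`, and `docling` come from the respective packages.

```python
import importlib.util
from typing import Callable

# Backend name -> importable module name. The preference order is an assumption.
BACKEND_MODULES = {
    "marker": "marker",          # installed by the marker-pdf package
    "markitdown": "markitdown",
    "docling": "docling",
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(available: Callable[[str], bool] = is_importable) -> str:
    """Return the first installed backend, or 'basic' if none is installed."""
    for backend, module_name in BACKEND_MODULES.items():
        if available(module_name):
            return backend
    return "basic"


print(pick_backend())
```

Passing the availability check in as a parameter keeps the selection logic testable without installing any of the optional packages.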
Quick Start
Installation
# Install from PyPI
pip install vexy-pdf-werk
# Or install in development mode
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
pip install -e .
Basic Usage
import vexy_pdf_werk
# Process a PDF with default settings
config = vexy_pdf_werk.Config(name="default", value="process")
result = vexy_pdf_werk.process_data(["document.pdf"], config=config)
CLI Usage (Coming Soon)
# Process a PDF into all formats
vpw process document.pdf
# Process with specific formats only
vpw process document.pdf --formats pdfa,markdown
# Enable AI enhancement
vpw process document.pdf --ai-enabled --ai-provider claude
Output Structure
VPW creates organized output with consistent naming:
output/
├── document_enhanced.pdf # PDF/A version
├── 000--introduction.md # Paginated Markdown files
├── 001--chapter-one.md
├── 002--conclusions.md
├── document.epub # Generated ePub
└── metadata.yaml # Bibliographic data
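The naming pattern in the listing above (a zero-padded index, a double hyphen, and a lowercase slug) can be reproduced with a few lines of Python. This is a minimal sketch under that assumption; `slugify` and `page_filename` are hypothetical helpers, and VPW's own naming rules may differ.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase a title and reduce it to ASCII words joined by hyphens."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug or "untitled"


def page_filename(index: int, title: str, width: int = 3) -> str:
    """Build a name such as '000--introduction.md' (hypothetical helper)."""
    return f"{index:0{width}d}--{slugify(title)}.md"


titles = ["Introduction", "Chapter One", "Conclusions"]
for i, title in enumerate(titles):
    print(page_filename(i, title))
```

Zero-padding the index keeps the files in reading order when they are sorted alphabetically.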
System Requirements
Required Dependencies
- Python 3.10+
- tesseract-ocr
- qpdf
- ghostscript
Optional Dependencies
- pandoc (for ePub generation)
- marker-pdf (advanced PDF conversion)
- markitdown (Microsoft's document converter)
- docling (IBM's document understanding)
Installation Commands
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng qpdf ghostscript pandoc
macOS:
brew install tesseract tesseract-lang qpdf ghostscript pandoc
Windows:
choco install tesseract qpdf ghostscript pandoc
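After installing, it can help to confirm that the external tools are actually on the PATH. The sketch below uses the standard library's `shutil.which` for that. The executable names are assumptions: Ghostscript is usually `gs` on Linux and macOS, while Windows builds often ship `gswin64c` instead.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ("tesseract", "qpdf", "gs")
OPTIONAL_TOOLS = ("pandoc",)


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tool names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


print("Missing required tools:", missing_tools(REQUIRED_TOOLS))
print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS))
```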
Configuration
VPW stores configuration in platform-appropriate directories:
- Linux/macOS: ~/.config/vexy-pdf-werk/config.toml
- Windows: %APPDATA%\vexy-pdf-werk\config.toml
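For scripts that need to locate this file, the two locations above can be resolved roughly as follows. This is a sketch only: `config_path` is a hypothetical helper, not part of the VPW API, and it mirrors the paths stated in this README rather than any platform library's conventions.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath

APP_DIR = "vexy-pdf-werk"


def config_path(platform=None, env=None, home=None) -> str:
    """Return the expected config.toml location for the given platform."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    if platform.startswith("win"):
        base = env.get("APPDATA")
        if not base:
            raise RuntimeError("APPDATA is not set")
        return str(PureWindowsPath(base) / APP_DIR / "config.toml")
    # Linux and macOS both use ~/.config according to this README.
    home = os.path.expanduser("~") if home is None else home
    return str(PurePosixPath(home) / ".config" / APP_DIR / "config.toml")


print(config_path())
```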
Example Configuration
[processing]
ocr_language = "eng"
pdf_quality = "high"
force_ocr = false
[conversion]
markdown_backend = "auto" # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true
[ai]
enabled = false
provider = "claude" # claude, gemini
correction_enabled = false
[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
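The example configuration is plain TOML, so it can be read with the standard library's `tomllib` (available from Python 3.11; on Python 3.10 the third-party `tomli` package offers the same API). The sketch below parses the example and checks two fields against the values named in its comments. The `validate` function is a hypothetical illustration, not VPW's own validation.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.10
    import tomli as tomllib  # third-party backport with the same API

EXAMPLE = """
[processing]
ocr_language = "eng"
pdf_quality = "high"
force_ocr = false

[conversion]
markdown_backend = "auto"  # auto, marker, markitdown, docling, basic
paginate_markdown = true
include_images = true

[ai]
enabled = false
provider = "claude"  # claude, gemini
correction_enabled = false

[output]
formats = ["pdfa", "markdown", "epub", "yaml"]
preserve_original = true
output_directory = "./output"
"""

ALLOWED_BACKENDS = {"auto", "marker", "markitdown", "docling", "basic"}
ALLOWED_PROVIDERS = {"claude", "gemini"}


def validate(config: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    backend = config.get("conversion", {}).get("markdown_backend", "auto")
    if backend not in ALLOWED_BACKENDS:
        problems.append(f"unknown markdown_backend: {backend!r}")
    provider = config.get("ai", {}).get("provider", "claude")
    if provider not in ALLOWED_PROVIDERS:
        problems.append(f"unknown ai provider: {provider!r}")
    return problems


config = tomllib.loads(EXAMPLE)
print(config["output"]["formats"])
print(validate(config))
```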
Development
This project uses modern Python tooling:
- Package Management: uv + hatch
- Code Quality: ruff + mypy
- Testing: pytest
- Version Control: git-tag-based semver with hatch-vcs
Development Setup
# Install uv and hatch
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install hatch
# Clone and setup
git clone https://github.com/vexyart/vexy-pdf-werk
cd vexy-pdf-werk
# Run tests using hatch (automatically manages environment)
hatch run test
# Run linting and formatting
hatch run lint
# Type checking
hatch run type-check
# Or run individual commands
hatch run python -c "import vexy_pdf_werk; print(vexy_pdf_werk.__version__)"
Architecture
VPW follows a modular pipeline architecture:
PDF Input → Analysis → OCR Enhancement → Content Extraction → Format Generation → Multi-Format Output
↓
Optional AI Enhancement
Core Components
- PDF Processor: Handles OCR and PDF/A conversion
- Content Extractors: Multiple backends for PDF-to-Markdown
- Format Generators: Creates ePub and metadata outputs
- AI Integrations: Optional LLM enhancement services
- Configuration System: Platform-aware settings management
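The idea of several content extractors with graceful fallback can be sketched as a loop that tries each backend in turn and returns the first successful result. Everything here is illustrative: `convert_with_fallback` and the two stand-in backends are hypothetical and do not reflect VPW's internal interfaces.

```python
from typing import Callable, Sequence

# A backend is a (name, function) pair; the function maps a PDF path to Markdown.
Backend = tuple[str, Callable[[str], str]]


def convert_with_fallback(pdf_path: str, backends: Sequence[Backend]) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, markdown) for the first success."""
    errors = []
    for name, convert in backends:
        try:
            return name, convert(pdf_path)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def failing_backend(pdf_path: str) -> str:
    """Stand-in for an advanced backend that is not installed."""
    raise ImportError("backend not installed")


def basic_backend(pdf_path: str) -> str:
    """Stand-in for a simple extractor that always works."""
    return f"# Text of {pdf_path}"


used, markdown = convert_with_fallback(
    "document.pdf", [("marker", failing_backend), ("basic", basic_backend)]
)
print(used, markdown)
```

Collecting the per-backend error messages makes the final failure easier to diagnose when no backend succeeds.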
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes following the code quality standards
- Run tests and linting
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Fontlab Ltd - Initial work - Vexy Art
Acknowledgments
- Built on proven tools: qpdf, OCRmyPDF, tesseract
- Integration with cutting-edge AI services
- Inspired by the need for better PDF accessibility and archival
Project Status: Under active development
For detailed implementation specifications, see the spec/ directory.