vhtml (Visual HyperText Markup Language) - Optical character recognition and HTML layout analysis library
       _   _ _____ __  __ _
__   _| | | |_   _|  \/  | |
\ \ / / |_| | | | | |\/| | |
 \ V /|  _  | | | | |  | | |___
  \_/ |_| |_| |_| |_|  |_|_____|
Visual HTML Generator - Convert PDFs to structured HTML with OCR
A modular system for converting PDF documents to structured HTML with advanced OCR and layout analysis capabilities.
Table of Contents
- Features
- Quick Start
- Architecture
- Installation
- Usage
- Documentation
- Contributing
- License
Features
Core Capabilities
- PDF to image conversion with preprocessing (denoise, deskew, enhance)
- Advanced document layout analysis and segmentation
- Multi-language OCR support (Polish, English, German, and more)
- Automatic document type detection
- Modern, responsive HTML output
Advanced Features
- Batch processing for multiple documents
- Metadata extraction and preservation
- Modular architecture for easy extension
- High-performance processing with parallelization
- Mobile-responsive output templates
- Searchable text output with confidence scoring
Integration
- Docker support for easy deployment
- Comprehensive test suite
- Well-documented Python API
- Plugin system for custom processors
Quick Start
Prerequisites
- Python 3.8+
- Tesseract OCR
- Poppler utilities (poppler-utils)
- Git (for development)
System Setup (Ubuntu/Debian)
# Install system dependencies
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-pol \
tesseract-ocr-eng \
tesseract-ocr-deu \
poppler-utils \
python3-pip \
python3-venv
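Once the packages above are installed, the OCR language packs (pol, eng, deu) can be confirmed from Python. The following is a minimal sketch that relies on Tesseract's `--list-langs` flag; the helper names `parse_langs` and `missing_langs` are illustrative and are not part of vhtml.

```python
import shutil
import subprocess
from typing import Set


def parse_langs(listing: str) -> Set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = (line.strip() for line in listing.splitlines())
    # Skip blank lines and the 'List of available languages ...' header line.
    return {line for line in lines if line and not line.startswith("List of")}


def missing_langs(required: Set[str]) -> Set[str]:
    """Return the required language codes that Tesseract does not report."""
    if shutil.which("tesseract") is None:
        return set(required)  # Tesseract itself is not on PATH
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the listing to stderr, so fall back to it.
    return set(required) - parse_langs(proc.stdout or proc.stderr)


if __name__ == "__main__":
    print("Missing language packs:", sorted(missing_langs({"pol", "eng", "deu"})))
```

An empty result means every requested language pack is visible to Tesseract.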
Architecture
High-Level Overview
graph TD
A[PDF Input] --> B[PDF Processor]
B --> C[Layout Analyzer]
C --> D[OCR Engine]
D --> E[HTML Generator]
E --> F[Structured HTML Output]
G[Configuration] --> B
G --> C
G --> D
G --> E
H[Plugins] -->|Extend| B
H -->|Customize| C
H -->|Enhance| D
H -->|Theme| E
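The data flow in the graph above (PDF Processor, Layout Analyzer, OCR Engine, HTML Generator) can be pictured as a chain of stages, each consuming the previous stage's output. The sketch below only illustrates that idea: the `Pipeline` class and the stand-in stage functions are hypothetical and do not reflect vhtml's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pipeline:
    """Chains stages in order; each stage receives the previous stage's output."""
    stages: List[Callable] = field(default_factory=list)

    def add(self, stage: Callable) -> "Pipeline":
        self.stages.append(stage)
        return self  # returning self allows chained .add() calls

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data


# Stand-in stages (hypothetical): real ones would rasterize, segment, and OCR.
def pdf_processor(pdf_path):
    return [f"{pdf_path}#page1", f"{pdf_path}#page2"]


def layout_analyzer(pages):
    return [{"page": page, "blocks": ["header", "body"]} for page in pages]


def ocr_engine(layouts):
    return [{**item, "text": f"text of {item['page']}"} for item in layouts]


def html_generator(results):
    body = "".join(f"<section>{item['text']}</section>" for item in results)
    return f"<html><body>{body}</body></html>"


pipeline = (
    Pipeline()
    .add(pdf_processor)
    .add(layout_analyzer)
    .add(ocr_engine)
    .add(html_generator)
)
html = pipeline.run("invoice.pdf")
print(html)
```

Because every stage is just a callable, a plugin in this sketch would amount to inserting or swapping one entry in `stages`.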
Component Interaction
+----------------+     +-----------------+     +---------------+
|                |     |                 |     |               |
|   PDF Input    |---->|  PDF Processor  |---->|  Page Images  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                       |
                                                       v
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
|  HTML Output   |<----| HTML Generator  |<----|  OCR Results  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                       ^
                                                       |
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
| Configuration  |---->| Layout Analyzer |---->|  Page Layout  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +---------------+
Installation
Using Poetry (Recommended)
# 1. Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml
# 2. Install Python dependencies
poetry install
# 3. Install system dependencies (if not already installed)
make install-deps
# 4. Verify installation
make validate
Using Docker
# Build the Docker image
docker build -t vhtml .
# Run the container
docker run -v $(pwd)/invoices:/app/invoices -v $(pwd)/output:/app/output vhtml \
python -m vhtml.main /app/invoices/sample.pdf -o /app/output
Validate Installation
To verify that all dependencies are correctly installed:
# Run validation script
make validate
# Or directly
python scripts/validate_installation.py
# Expected output:
# ✓ Python version: 3.8+
# ✓ Tesseract found: v5.0.0
# ✓ Poppler utils installed
# ✓ All Python dependencies satisfied
# ✓ Test document processed successfully
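For a quick check without the Makefile, the first three items can be approximated from Python. This is a minimal sketch, not the contents of scripts/validate_installation.py: the function name `check_prerequisites` is an assumption, and it treats `pdftoppm` (one of the tools shipped in poppler-utils) as evidence that Poppler is installed.

```python
import shutil
import sys
from typing import Dict


def check_prerequisites(which=shutil.which, version=sys.version_info) -> Dict[str, bool]:
    """Report whether each prerequisite from the Quick Start list looks available."""
    return {
        "python>=3.8": tuple(version[:2]) >= (3, 8),
        "tesseract": which("tesseract") is not None,
        # pdftoppm is one of the command-line tools shipped in poppler-utils.
        "poppler-utils": which("pdftoppm") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real system.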
Usage
Command Line Interface
# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory
# Process a directory of PDF files (batch mode)
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory
# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
# Specify output format (html/mhtml)
poetry run python -m vhtml.main document.pdf --format mhtml
# Use specific OCR language
poetry run python -m vhtml.main document.pdf --lang pol+eng
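When the CLI is driven from another script, the flags shown above can be assembled programmatically. The helper below is a sketch: `build_command` is a hypothetical name, and it assumes the flags (`-b`, `-o`, `-v`, `--format`, `--lang`) can be combined freely, which the examples above do not confirm.

```python
import subprocess
from typing import List, Optional


def build_command(
    source: str,
    output_dir: Optional[str] = None,
    batch: bool = False,
    open_browser: bool = False,
    fmt: Optional[str] = None,
    lang: Optional[List[str]] = None,
) -> List[str]:
    """Assemble the argument list for `python -m vhtml.main` from the documented flags."""
    cmd = ["poetry", "run", "python", "-m", "vhtml.main", source]
    if batch:
        cmd.append("-b")
    if output_dir:
        cmd += ["-o", output_dir]
    if open_browser:
        cmd.append("-v")
    if fmt:
        cmd += ["--format", fmt]
    if lang:
        cmd += ["--lang", "+".join(lang)]  # ['pol', 'eng'] becomes 'pol+eng'
    return cmd


cmd = build_command("invoices/", output_dir="output", batch=True, lang=["pol", "eng"])
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run the conversion
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.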
Python API
from vhtml import DocumentAnalyzer
# Initialize with custom settings
analyzer = DocumentAnalyzer(
    languages=['pol', 'eng'],  # OCR languages
    output_format='html',      # 'html' or 'mhtml'
    debug_mode=False           # Enable debug output
)
# Process a single document
result = analyzer.process("document.pdf", "output_dir")
print(f"Generated: {result.output_path}")
print(f"Metadata: {result.metadata}")
# Batch processing
results = analyzer.process_batch("input_dir", "output_dir")
for result in results:
    print(f"Processed: {result.input_path} -> {result.output_path}")
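How `process_batch` behaves when one document fails is not described here. If a single corrupt PDF should not abort the whole run, one option is to call `process` per file and collect failures separately. The sketch below assumes only the `process(input_path, output_dir)` call shown above; `process_each` is a hypothetical helper, not part of vhtml.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def process_each(analyzer, input_dir: str, output_dir: str) -> Tuple[List[object], Dict[str, str]]:
    """Call analyzer.process() per PDF, collecting results and error messages separately."""
    results: List[object] = []
    errors: Dict[str, str] = {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            results.append(analyzer.process(str(pdf), output_dir))
        except Exception as exc:  # broad on purpose: one bad file should not stop the run
            errors[pdf.name] = str(exc)
    return results, errors
```

Usage would look like `results, errors = process_each(analyzer, "input_dir", "output_dir")`, after which `errors` maps each failed file name to its error message.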
Example: Extract Text from PDF
from vhtml import PDFProcessor, OCREngine
# Load and preprocess PDF
processor = PDFProcessor()
pages = processor.process("document.pdf")
# Perform OCR
ocr = OCREngine(languages=['eng'])
for page_num, page_image in enumerate(pages):
    text = ocr.extract_text(page_image)
    print(f"Page {page_num + 1}:\n{text}\n{'='*50}")
Documentation
Core Components
- PDF Processor - Handles PDF to image conversion
- Layout Analyzer - Analyzes document structure
- OCR Engine - Performs text recognition
- HTML Generator - Creates structured HTML output
Development Workflow
graph LR
A[Clone Repository] --> B[Install Dependencies]
B --> C[Run Tests]
C --> D[Make Changes]
D --> E[Run Linters]
E --> F[Update Tests]
F --> G[Commit Changes]
G --> H[Create Pull Request]
Common Tasks
# Run tests
make test
# Format code
make format
# Run linters
make lint
# Generate documentation
make docs
# Build package
make build
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Acknowledgments
- Tesseract OCR for text recognition
- Poppler for PDF processing
- All contributors who helped improve this project
Examples
Generate Standalone HTML
Generate a standalone HTML file with all images, JS, and JSON embedded:
poetry run python examples/pdf2html.py
- Input: Folder with HTML, images, JS, and JSON (e.g., output/mhtml_example/Invoice-30392B3C-0001)
- Output: Standalone HTML (e.g., output/html_example/Invoice-30392B3C-0001_standalone.html)
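One common way to make an HTML file standalone is to replace local image references with base64 `data:` URIs. The sketch below shows that technique for `<img>` tags only; it is not taken from examples/pdf2html.py, and the function names `to_data_uri` and `inline_images` are illustrative.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(path: Path) -> str:
    """Encode a file as a base64 data: URI, guessing the MIME type from its extension."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with embedded data URIs."""

    def swap(match):
        src = match.group(2)
        # Leave remote URLs and already-embedded images untouched.
        if "://" in src or src.startswith("data:"):
            return match.group(0)
        target = base_dir / src
        if not target.is_file():
            return match.group(0)  # missing file: keep the original reference
        return f"{match.group(1)}{to_data_uri(target)}{match.group(3)}"

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', swap, html)
```

A regular expression is enough for a sketch; a real converter would more likely walk the document with an HTML parser and would also handle scripts, stylesheets, and JSON.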
Generate MHTML (Web Archive)
Generate a fully self-contained MHTML file for browser archiving:
poetry run python examples/pdf2mhtml.py
- Input: PDF(s) in invoices/ (or other test files)
- Output: MHTML file (e.g., output/mhtml_example/Invoice-30392B3C-0001.mhtml)
- See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
- Both scripts demonstrate how to use the vHTML API for document conversion and archiving.
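An MHTML file is a MIME `multipart/related` message in which the page and each resource carry a `Content-Location` header. The sketch below builds such a message with Python's standard `email` package. It is an illustration of the format rather than the code in the example scripts: `build_mhtml`, the subject line, and the PNG-only handling are assumptions, and browser compatibility of the result has not been verified.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Dict


def build_mhtml(html: str, images: Dict[str, bytes], page_url: str = "index.html") -> str:
    """Bundle an HTML page and its PNG images into one multipart/related document."""
    archive = MIMEMultipart("related")
    archive["Subject"] = "vHTML export"  # hypothetical title
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)
    for name, data in images.items():
        part = MIMEImage(data, _subtype="png")  # subtype given explicitly, no sniffing
        part["Content-Location"] = name  # should match the src used in the HTML
        archive.attach(part)
    return archive.as_string()


if __name__ == "__main__":
    page = '<html><body><img src="logo.png"></body></html>'
    print(build_mhtml(page, {"logo.png": b"\x89PNG"})[:200])
```

The returned string can be written to a file with the `.mhtml` extension.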
Core Components
- PDFProcessor: Handles PDF to image conversion and preprocessing
- LayoutAnalyzer: Analyzes document layout and segments content blocks
- OCREngine: Performs OCR with language detection and confidence scoring
- HTMLGenerator: Generates HTML with embedded images and styling
- DocumentAnalyzer: Integrates all components into a complete workflow
Project Structure
vhtml/
├── vhtml/
│   ├── core/
│   │   ├── pdf_processor.py
│   │   ├── layout_analyzer.py
│   │   ├── ocr_engine.py
│   │   └── html_generator.py
│   └── main.py
├── scripts/
│   ├── validate_installation.py
│   └── test_integration.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── IMPLEMENTATION.md
│   └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md
Development
# Setup development environment
make setup
# Run tests
make test
# Format code
make format
# Lint code
make lint
# Build package
make build
Documentation
For more detailed information, see the documentation files in docs/: ARCHITECTURE.md, IMPLEMENTATION.md, and PROJECT_STRUCTURE.md.