vhtml (Visual HyperText Markup Language) - Optical character recognition and HTML layout analysis library
vHTML - Visual HTML Generator
A modular system for converting PDF documents to HTML with OCR and layout analysis.
Features
- PDF to image conversion with preprocessing (denoise, deskew)
- Document layout analysis and segmentation
- OCR with multi-language support (Polish, English, German)
- Language detection and confidence scoring
- HTML generation with embedded images and metadata
- Batch processing capabilities
- Command-line interface
Installation
Prerequisites
- Python 3.8+
- Tesseract OCR
- Poppler utilities
Using Poetry (Recommended)
# Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml
# Install with Poetry
make install
Manual Installation
# Install system dependencies
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-eng tesseract-ocr-deu poppler-utils
# Install Python dependencies
pip install poetry
poetry install
Validate Installation
To verify that all dependencies are correctly installed:
make validate
or
python scripts/validate_installation.py
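As a rough illustration of what such a check involves, the sketch below looks for the prerequisites listed above. It is not the project's validate_installation.py; the function name `missing_prerequisites` is hypothetical, and the executable names `tesseract` and `pdftoppm` (one of the Poppler utilities) are assumptions about how the tools are installed on your system.

```python
import shutil
import sys

# Assumed executable names: "tesseract" for Tesseract OCR and "pdftoppm"
# for the Poppler utilities. Adjust them if your system differs.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_prerequisites(tools=REQUIRED_TOOLS, min_python=(3, 8)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # The running interpreter must be at least the documented minimum.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # Each external tool must be discoverable on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    return problems
```

Calling `missing_prerequisites()` and printing each returned entry gives a quick report before running the full validation.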
Usage
Command Line Interface
# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory
# Process a directory of PDF files
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory
# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
Integration Test
# Run the integration test with your PDF file
poetry run python scripts/test_integration.py /path/to/document.pdf -v
Python API
from vhtml.main import DocumentAnalyzer
# Initialize the analyzer
analyzer = DocumentAnalyzer()
# Process a document
html_path = analyzer.analyze_document("document.pdf", "output_dir")
# Print the path to the generated HTML
print(f"Generated HTML: {html_path}")
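For several files, the same call can be wrapped in a loop. The helper below is a minimal sketch, not part of the vHTML API: `convert_all` is a hypothetical name, and it takes the conversion function as an argument so that it only relies on the `analyze_document(pdf_path, output_dir)` call shown above.

```python
from pathlib import Path


def convert_all(pdf_dir, out_dir, analyze):
    """Call analyze(pdf_path, out_dir) for every *.pdf file in pdf_dir.

    Returns (results, failures): results maps each PDF file name to the value
    analyze returned, failures maps each PDF file name to its error message.
    """
    results, failures = {}, {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf.name] = analyze(str(pdf), str(out_dir))
        except Exception as exc:
            # Record the error and continue so one bad file does not stop the batch.
            failures[pdf.name] = str(exc)
    return results, failures


# Possible usage with the API shown above (requires vhtml to be installed):
#   from vhtml.main import DocumentAnalyzer
#   results, failures = convert_all("invoices", "output", DocumentAnalyzer().analyze_document)
```

Passing the function in keeps the loop independent of how the analyzer is configured, and the command-line `-b` option remains the simpler choice when no custom error handling is needed.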
Examples
Generate Standalone HTML
Generate a standalone HTML file with all images, JS, and JSON embedded:
poetry run python examples/pdf2html.py
- Input: Folder with HTML, images, JS, and JSON (e.g., output/mhtml_example/Invoice-30392B3C-0001)
- Output: Standalone HTML (e.g., output/html_example/Invoice-30392B3C-0001_standalone.html)
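One common way to make an HTML file standalone is to replace image references with base64 data URIs. The sketch below shows that general technique only; it is not taken from examples/pdf2html.py, the name `inline_images` is hypothetical, and the regular expression is a simplification that handles only double-quoted `src` attributes.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> with a double-quoted src attribute.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)
# Matches a URL scheme prefix such as "https:" or "data:".
HAS_SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data URIs."""
    base_dir = Path(base_dir)

    def replace(match):
        src = match.group(2)
        # Leave remote URLs and existing data URIs untouched.
        if HAS_SCHEME.match(src):
            return match.group(0)
        path = base_dir / src
        # Leave references to missing files untouched as well.
        if not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Scripts and JSON can be embedded in a similar way by reading each file and placing its content inside a `<script>` element; a real HTML parser is a safer choice than a regular expression for anything beyond simple generated markup.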
Generate MHTML (Web Archive)
Generate a fully self-contained MHTML file for browser archiving:
poetry run python examples/pdf2mhtml.py
- Input: PDF(s) in invoices/ (or other test files)
- Output: MHTML file (e.g., output/mhtml_example/Invoice-30392B3C-0001.mhtml)
- See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
- Both scripts demonstrate how to use the vHTML API for document conversion and archiving.
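For background, an MHTML file is a MIME multipart/related message: the HTML page is the first part and each resource follows as its own part, identified by a Content-Location header. The sketch below builds that container with Python's standard email package. It is an illustration of the format rather than the code in the example scripts; `build_mhtml` and the `base_url` default are hypothetical, and whether a particular browser opens the result is not something this sketch guarantees.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html, images, base_url="file:///document/"):
    """Pack an HTML string and its images into a multipart/related message.

    images maps a file name to a (data_bytes, subtype) pair, for example
    {"logo.png": (png_bytes, "png")}.
    """
    root = MIMEMultipart("related", type="text/html")

    # The HTML page goes first; it is the root resource of the archive.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = base_url + "index.html"
    root.attach(page)

    # Each image becomes its own part, addressed by Content-Location.
    for name, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = base_url + name
        root.attach(part)

    return root.as_bytes()
```

Writing the returned bytes to a file with an .mhtml extension yields the archive; relative references in the HTML resolve against the Content-Location values.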
Core Components
- PDFProcessor: Handles PDF to image conversion and preprocessing
- LayoutAnalyzer: Analyzes document layout and segments content blocks
- OCREngine: Performs OCR with language detection and confidence scoring
- HTMLGenerator: Generates HTML with embedded images and styling
- DocumentAnalyzer: Integrates all components into a complete workflow
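The way these components fit together can be pictured as a chain of stages. The sketch below is a simplified model of that flow, not the library's code: `Block`, `run_pipeline`, `render_html`, and the four callables are hypothetical stand-ins, since the method names of the real classes are not documented here.

```python
import html
from dataclasses import dataclass


@dataclass
class Block:
    kind: str          # for example "heading" or "paragraph"
    text: str          # recognized text for the block
    confidence: float  # OCR confidence between 0 and 1


def render_html(blocks):
    """Render blocks as HTML elements annotated with their confidence."""
    tags = {"heading": "h2"}
    parts = []
    for block in blocks:
        tag = tags.get(block.kind, "p")
        parts.append(
            f'<{tag} data-confidence="{block.confidence:.2f}">'
            f"{html.escape(block.text)}</{tag}>"
        )
    return "\n".join(parts)


def run_pipeline(pdf_path, to_images, find_regions, recognize, render=render_html):
    """Chain the stages: PDF -> page images -> regions -> text blocks -> HTML.

    Each callable stands in for one component: to_images for PDFProcessor,
    find_regions for LayoutAnalyzer, recognize for OCREngine, and render
    for HTMLGenerator.
    """
    blocks = []
    for image in to_images(pdf_path):
        for region in find_regions(image):
            blocks.append(recognize(region))
    return render(blocks)
```

Keeping each stage behind a plain callable makes it possible to test the flow with stubs and to swap one stage without touching the others.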
Project Structure
vhtml/
├── vhtml/
│ ├── core/
│ │ ├── pdf_processor.py
│ │ ├── layout_analyzer.py
│ │ ├── ocr_engine.py
│ │ └── html_generator.py
│ └── main.py
├── scripts/
│ ├── validate_installation.py
│ └── test_integration.py
├── docs/
│ ├── ARCHITECTURE.md
│ ├── IMPLEMENTATION.md
│ └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md
Development
# Setup development environment
make setup
# Run tests
make test
# Format code
make format
# Lint code
make lint
# Build package
make build
Documentation
For more detailed information, see the documentation files:
- docs/ARCHITECTURE.md
- docs/IMPLEMENTATION.md
- docs/PROJECT_STRUCTURE.md
License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.