vhtml (Visual HyperText Markup Language) - Optical character recognition and HTML layout analysis library

These details have not been verified by PyPI

Project description

vHTML - Visual HTML Generator

A modular system for converting PDF documents to HTML with OCR and layout analysis.

Features

PDF to image conversion with preprocessing (denoise, deskew)
Document layout analysis and segmentation
OCR with multi-language support (Polish, English, German)
Language detection and confidence scoring
HTML generation with embedded images and metadata
Batch processing capabilities
Command-line interface

Installation

Prerequisites

Python 3.8+
Tesseract OCR
Poppler utilities

Using Poetry (Recommended)

# Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml

# Install with Poetry
make install

Manual Installation

# Install system dependencies
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-eng tesseract-ocr-deu poppler-utils

# Install Python dependencies
pip install poetry
poetry install

Validate Installation

To verify that all dependencies are correctly installed, run either of the following:

make validate

python scripts/validate_installation.py
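As a rough illustration of what such a check involves, the sketch below looks for the prerequisites listed earlier. It is not the project's validate_installation.py. The function name `find_missing` and the choice of `tesseract` and `pdftoppm` as the executables to look for are assumptions made for this example.

```python
import shutil
import sys

# Assumed executables: `tesseract` for Tesseract OCR and `pdftoppm`
# (shipped with poppler-utils) for Poppler utilities.
REQUIRED_TOOLS = {"Tesseract OCR": "tesseract", "Poppler utilities": "pdftoppm"}


def find_missing(tools=REQUIRED_TOOLS, min_python=(3, 8)):
    """Return the human-readable names of prerequisites that are missing."""
    missing = []
    if sys.version_info < min_python:
        missing.append("Python %d.%d+" % min_python)
    for name, executable in tools.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(executable) is None:
            missing.append(name)
    return missing


# Usage:
#   problems = find_missing()
#   print("OK" if not problems else "Missing: " + ", ".join(problems))
```

An empty list means every prerequisite was found on PATH. The check does not verify that the Polish, English, and German Tesseract language packs are installed.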

Usage

Command Line Interface

# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory

# Process a directory of PDF files
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory

# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
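The same commands can be driven from a script. The sketch below only assembles the flags shown above (`-o`, `-b`, `-v`). The helper names `build_command` and `run` are hypothetical, and it assumes the script runs in the environment where vhtml is installed and that the CLI exits with a non-zero status on failure.

```python
import subprocess
import sys


def build_command(source, output_dir=None, batch=False, view=False):
    """Build the argument list for `python -m vhtml.main` from the documented flags."""
    cmd = [sys.executable, "-m", "vhtml.main", str(source)]
    if batch:
        cmd.append("-b")  # treat `source` as a directory of PDF files
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if view:
        cmd.append("-v")  # open the result in a browser
    return cmd


def run(source, **options):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_command(source, **options), check=True)


# Usage:
#   run("/path/to/pdf_directory", batch=True, output_dir="output_directory")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.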

Integration Test

# Run the integration test with your PDF file
poetry run python scripts/test_integration.py /path/to/document.pdf -v

Python API

from vhtml.main import DocumentAnalyzer

# Initialize the analyzer
analyzer = DocumentAnalyzer()

# Process a document
html_path = analyzer.analyze_document("document.pdf", "output_dir")

# Print the path to the generated HTML
print(f"Generated HTML: {html_path}")
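For several files, the call above can be wrapped in a loop. This is a minimal sketch: `convert_directory` is a hypothetical helper, not part of vHTML, and it relies only on the `analyze_document(pdf_path, output_dir)` call shown above. That the call raises an exception on failure is an assumption.

```python
from pathlib import Path


def convert_directory(analyzer, pdf_dir, output_dir):
    """Convert every *.pdf in pdf_dir; return (results, failures) keyed by file name."""
    results, failures = {}, {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf.name] = analyzer.analyze_document(str(pdf), str(output_dir))
        except Exception as exc:
            # Record the error and continue so one bad file does not stop the batch.
            failures[pdf.name] = exc
    return results, failures


# Usage (requires vhtml to be installed):
#   from vhtml.main import DocumentAnalyzer
#   done, failed = convert_directory(DocumentAnalyzer(), "invoices", "output_dir")
```

The analyzer is passed in as an argument, so the loop can be exercised with a stub object before running it against real documents.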

Examples

Generate Standalone HTML

Generate a standalone HTML file with all images, JS, and JSON embedded:

poetry run python examples/pdf2html.py

Input: Folder with HTML, images, JS, and JSON (e.g., output/mhtml_example/Invoice-30392B3C-0001)
Output: Standalone HTML (e.g., output/html_example/Invoice-30392B3C-0001_standalone.html)
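To make the idea of embedding concrete, the sketch below inlines local images as base64 data URIs. It illustrates the general technique only and is not the code in the example script. The function name `inline_images` is hypothetical, and the regular expression handles only double-quoted `src` attributes.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Captures the text before the URL, the URL itself, and the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data URIs."""
    base_dir = Path(base_dir)

    def replace(match):
        src = match.group(2)
        # Leave remote URLs and existing data URIs untouched.
        if "://" in src or src.startswith("data:"):
            return match.group(0)
        path = base_dir / src
        if not path.is_file():
            return match.group(0)  # missing file: keep the original reference
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{data}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

Scripts and JSON would need analogous handling, for example by inlining their contents into `<script>` elements. An HTML parser would be more reliable than a regular expression for arbitrary markup.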

Generate MHTML (Web Archive)

Generate a fully self-contained MHTML file for browser archiving:

poetry run python examples/pdf2mhtml.py

Input: PDF(s) in invoices/ (or other test files)
Output: MHTML file (e.g., output/mhtml_example/Invoice-30392B3C-0001.mhtml)
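An MHTML file is a MIME `multipart/related` message holding the page and its resources. The sketch below builds one with Python's standard `email` package. It is an illustration under stated assumptions, not the example script's implementation: `build_mhtml`, the `file:///` base URL, and the `index.html` page name are all hypothetical, and how a given browser resolves the embedded resources has not been verified here.

```python
import mimetypes
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html, resources, base_url="file:///"):
    """Pack an HTML string and a {relative_name: bytes} mapping into MHTML text."""
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = "vHTML archive"

    # The first part is the page itself.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = base_url + "index.html"
    archive.attach(page)

    # Each resource becomes a base64-encoded part addressed by Content-Location.
    for name, payload in resources.items():
        mime = mimetypes.guess_type(name)[0] or "application/octet-stream"
        maintype, subtype = mime.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(payload)
        encoders.encode_base64(part)
        part["Content-Location"] = base_url + name
        archive.attach(part)

    return archive.as_string()
```

Writing the returned string to a file with an `.mhtml` extension yields a single-file archive.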

See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
Both scripts demonstrate how to use the vHTML API for document conversion and archiving.

Core Components

PDFProcessor: Handles PDF to image conversion and preprocessing
LayoutAnalyzer: Analyzes document layout and segments content blocks
OCREngine: Performs OCR with language detection and confidence scoring
HTMLGenerator: Generates HTML with embedded images and styling
DocumentAnalyzer: Integrates all components into a complete workflow
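The way these components fit together can be sketched as a four-stage pipeline. The code below is a conceptual outline only: `Block`, `run_pipeline`, and the four callables are hypothetical stand-ins for the roles described above, and the real classes' method names and signatures are not shown in this document.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Block:
    """One segmented region of a page together with its recognized text."""
    page: int
    region: object
    text: str = ""


def run_pipeline(
    pdf_path: str,
    to_images: Callable[[str], Sequence[object]],   # role of PDFProcessor
    segment: Callable[[object], Sequence[object]],  # role of LayoutAnalyzer
    recognize: Callable[[object], str],             # role of OCREngine
    render: Callable[[List[Block]], str],           # role of HTMLGenerator
) -> str:
    """Chain the four stages in order, the role DocumentAnalyzer is described as playing."""
    blocks: List[Block] = []
    for page_number, image in enumerate(to_images(pdf_path), start=1):
        for region in segment(image):
            blocks.append(Block(page_number, region, recognize(region)))
    return render(blocks)
```

Because each stage is passed in as a plain callable, any one of them can be replaced with a stub when testing the others.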

Project Structure

vhtml/
├── vhtml/
│   ├── core/
│   │   ├── pdf_processor.py
│   │   ├── layout_analyzer.py
│   │   ├── ocr_engine.py
│   │   └── html_generator.py
│   └── main.py
├── scripts/
│   ├── validate_installation.py
│   └── test_integration.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── IMPLEMENTATION.md
│   └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md

Development

# Setup development environment
make setup

# Run tests
make test

# Format code
make format

# Lint code
make lint

# Build package
make build

Documentation

For more detailed information, see the documentation files:

docs/ARCHITECTURE.md
docs/IMPLEMENTATION.md
docs/PROJECT_STRUCTURE.md

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Project details

Release history

0.2.8: Jun 18, 2025
0.2.7: Jun 18, 2025
0.2.6: Jun 18, 2025
0.2.5: Jun 18, 2025
0.2.4: Jun 18, 2025
0.2.3: Jun 18, 2025 (this version)
0.2.2: Jun 18, 2025
0.2.1: Jun 18, 2025
0.2.0: Jun 18, 2025
0.1.7: Jun 18, 2025
0.1.6: Jun 18, 2025
0.1.5: Jun 18, 2025
0.1.3: Jun 18, 2025

Download files

Source Distribution

vhtml-0.2.3.tar.gz (27.5 kB)

Uploaded Jun 18, 2025 Source

Built Distribution

vhtml-0.2.3-py3-none-any.whl (32.4 kB)

Uploaded Jun 18, 2025 Python 3

File details

Details for the file vhtml-0.2.3.tar.gz.

File metadata

Download URL: vhtml-0.2.3.tar.gz
Upload date: Jun 18, 2025
Size: 27.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.3 Linux/6.14.0-15-generic

File hashes

Hashes for vhtml-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`1d2239cd8d20a817887eef3b130d38793061bcbb71c554ae236b2e2716c5531d`
MD5	`c52cc4935695288bb32f4117d2093174`
BLAKE2b-256	`bc93a5d4bf16c739a9d27479cc78744dff32abedd01048e8a1e1888d5f39156b`
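A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. The helper names `sha256_of` and `verify` are hypothetical, and the file path in the usage comment is an assumed download location.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, with the digest published for vhtml-0.2.3.tar.gz:
#   verify("vhtml-0.2.3.tar.gz",
#          "1d2239cd8d20a817887eef3b130d38793061bcbb71c554ae236b2e2716c5531d")
```

Reading in chunks keeps memory use constant regardless of file size.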

File details

Details for the file vhtml-0.2.3-py3-none-any.whl.

File metadata

Download URL: vhtml-0.2.3-py3-none-any.whl
Upload date: Jun 18, 2025
Size: 32.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.13.3 Linux/6.14.0-15-generic

File hashes

Hashes for vhtml-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f60b768979aee487f0f3845a8bab77dd54e7cc936c317050e4378210262d38e6`
MD5	`cd7e898dc7c6aa974818c6c3d550b773`
BLAKE2b-256	`c984de5335066e167165dbe658b6144d70a91367adf786df530cdd773f346113`
