
vhtml (Visual HyperText Markup Language): an optical character recognition and HTML layout analysis library


vHTML - Visual HTML Generator

A modular system for converting PDF documents to HTML with OCR and layout analysis.

Features

  • PDF to image conversion with preprocessing (denoise, deskew)
  • Document layout analysis and segmentation
  • OCR with multi-language support (Polish, English, German)
  • Language detection and confidence scoring
  • HTML generation with embedded images and metadata
  • Batch processing capabilities
  • Command-line interface

Installation

Prerequisites

  • Python 3.8+
  • Tesseract OCR
  • Poppler utilities

Using Poetry (Recommended)

# Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml

# Install with Poetry
make install

Manual Installation

# Install system dependencies
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-eng tesseract-ocr-deu poppler-utils

# Install Python dependencies
pip install poetry
poetry install

Validate Installation

To verify that all dependencies are correctly installed:

make validate

or

python scripts/validate_installation.py
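
If neither command is available, the prerequisites can be checked by hand. The following is a minimal sketch and not the contents of scripts/validate_installation.py. It assumes the Tesseract executable is named `tesseract` and that Poppler provides `pdftoppm`. The helper names `missing_tools` and `python_ok` are hypothetical.

```python
import shutil
import sys

# Assumed executable names: `tesseract` (Tesseract OCR) and `pdftoppm` (Poppler).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(minimum=(3, 8), version=None):
    """Check the interpreter version against the documented minimum (3.8)."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    problems = missing_tools()
    if not python_ok():
        problems.append("python>=3.8")
    if problems:
        sys.exit("Missing prerequisites: " + ", ".join(problems))
    print("All prerequisites found")
```

The `which` and `version` parameters exist only so the checks can be exercised without touching the real system.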

Usage

Command Line Interface

# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory

# Process a directory of PDF files
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory

# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
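
To call the CLI from another Python program, the argument list can be assembled and passed to `subprocess.run`. This is a sketch only. It uses the `-o`, `-b`, and `-v` flags shown above, assumes the current interpreter has vhtml installed (so `poetry run` is omitted), and the helper name `build_command` is hypothetical.

```python
import subprocess
import sys


def build_command(source, output_dir=None, batch=False, view=False):
    """Assemble the argument list for `python -m vhtml.main`."""
    command = [sys.executable, "-m", "vhtml.main", str(source)]
    if batch:
        command.append("-b")  # treat `source` as a directory of PDFs
    if output_dir is not None:
        command += ["-o", str(output_dir)]
    if view:
        command.append("-v")  # open the result in a browser
    return command


if __name__ == "__main__":
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_command("/path/to/document.pdf", "output_directory"), check=True)
```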

Integration Test

# Run the integration test with your PDF file
poetry run python scripts/test_integration.py /path/to/document.pdf -v

Python API

from vhtml.main import DocumentAnalyzer

# Initialize the analyzer
analyzer = DocumentAnalyzer()

# Process a document
html_path = analyzer.analyze_document("document.pdf", "output_dir")

# Print the path to the generated HTML
print(f"Generated HTML: {html_path}")
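
The same call can be repeated over a folder of PDFs. The sketch below relies only on `analyze_document(pdf_path, output_dir)` returning the HTML path, as shown above. Two things are assumptions rather than documented behavior: giving each document its own output subdirectory, and failures being reported as exceptions. The helper name `convert_directory` is hypothetical.

```python
from pathlib import Path


def convert_directory(analyzer, pdf_dir, output_dir):
    """Convert every *.pdf in pdf_dir; return (results, failures) keyed by file name."""
    results, failures = {}, {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: one output subdirectory per document, named after the PDF.
        target = Path(output_dir) / pdf_path.stem
        try:
            results[pdf_path.name] = analyzer.analyze_document(str(pdf_path), str(target))
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[pdf_path.name] = str(exc)
    return results, failures


if __name__ == "__main__":
    from vhtml.main import DocumentAnalyzer

    converted, failed = convert_directory(DocumentAnalyzer(), "invoices", "output")
    print(f"Converted {len(converted)} file(s), {len(failed)} failure(s)")
```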

Examples

Generate Standalone HTML

Generate a standalone HTML file with all images, JS, and JSON embedded:

poetry run python examples/pdf2html.py
  • Input: Folder with HTML, images, JS, and JSON (e.g., output/mhtml_example/Invoice-30392B3C-0001)
  • Output: Standalone HTML (e.g., output/html_example/Invoice-30392B3C-0001_standalone.html)
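
One common way to make an HTML file standalone is to replace image references with base64 data URIs. The sketch below illustrates that general technique and is not the code in the example script. The function name `inline_images` is hypothetical, and it handles only double-quoted `src` attributes that point to local files.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> and captures the text before, inside, and after the value.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Return html with local <img> sources replaced by base64 data URIs."""
    base_dir = Path(base_dir)

    def to_data_uri(match):
        src = match.group(2)
        image_path = base_dir / src
        # Leave remote, already-inlined, and missing images untouched.
        if src.startswith(("data:", "http://", "https://")) or not image_path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(image_path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(image_path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(to_data_uri, html)
```

Scripts and JSON could be embedded in a similar way by replacing external references with inline `<script>` elements.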

Generate MHTML (Web Archive)

Generate a fully self-contained MHTML file for browser archiving:

poetry run python examples/pdf2mhtml.py
  • Input: PDF(s) in invoices/ (or other test files)
  • Output: MHTML file (e.g., output/mhtml_example/Invoice-30392B3C-0001.mhtml)

  • See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
  • Both scripts demonstrate how to use the vHTML API for document conversion and archiving.
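
For orientation, an MHTML file is a MIME `multipart/related` message whose parts carry `Content-Location` headers. The sketch below builds one with the standard library `email` package. It shows the container format only and is not how the example scripts work. The function name `build_mhtml` and the default `page_url` are hypothetical, and whether a given browser opens the result is not guaranteed.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html, images, page_url="file:///index.html"):
    """Pack an HTML page and its images into MHTML bytes.

    `images` maps the location used in the HTML (for example "logo.png")
    to a (subtype, data) pair such as ("png", b"...").
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = "vHTML export"

    # The first part is the root HTML document.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Each image becomes its own base64-encoded part.
    for location, (subtype, data) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = location
        archive.attach(part)

    return archive.as_bytes()
```

The returned bytes can be written to a file with an `.mhtml` extension.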

Core Components

  • PDFProcessor: Handles PDF to image conversion and preprocessing
  • LayoutAnalyzer: Analyzes document layout and segments content blocks
  • OCREngine: Performs OCR with language detection and confidence scoring
  • HTMLGenerator: Generates HTML with embedded images and styling
  • DocumentAnalyzer: Integrates all components into a complete workflow
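
The flow between these components can be pictured as a four-stage pipeline. The sketch below is purely illustrative: the stage callables, the `Block` type, and `run_pipeline` are hypothetical stand-ins, not the actual classes or method signatures in vhtml/core.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical content block produced by the OCR stage."""
    kind: str
    text: str = ""
    confidence: float = 0.0


def run_pipeline(pdf_path, to_images, find_blocks, read_text, render_html):
    """Chain the stages: PDF -> page images -> layout regions -> OCR blocks -> HTML."""
    html_pages = []
    for page in to_images(pdf_path):            # role of PDFProcessor
        regions = find_blocks(page)             # role of LayoutAnalyzer
        blocks = [read_text(page, region) for region in regions]  # role of OCREngine
        html_pages.append(render_html(blocks))  # role of HTMLGenerator
    return "\n".join(html_pages)
```

Passing the stages as plain callables keeps the sketch independent of the real constructors, which are not described here.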

Project Structure

vhtml/
├── vhtml/
│   ├── core/
│   │   ├── pdf_processor.py
│   │   ├── layout_analyzer.py
│   │   ├── ocr_engine.py
│   │   └── html_generator.py
│   └── main.py
├── scripts/
│   ├── validate_installation.py
│   └── test_integration.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── IMPLEMENTATION.md
│   └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md

Development

# Setup development environment
make setup

# Run tests
make test

# Format code
make format

# Lint code
make lint

# Build package
make build

Documentation

For more detailed information, see the documentation files:

  • docs/ARCHITECTURE.md
  • docs/IMPLEMENTATION.md
  • docs/PROJECT_STRUCTURE.md

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

