vhtml (Visual HyperText Markup Language) - an optical character recognition (OCR) and HTML layout analysis library

        _   _ _____ __  __ _     
 __   _| | | |_   _|  \/  | |    
 \ \ / / |_| | | | | |\/| | |    
  \ V /|  _  | | | | |  | | |___ 
   \_/ |_| |_| |_| |_|  |_|_____|                                                                                           
Visual HTML Generator - Convert PDFs to structured HTML with OCR

A modular system for converting PDF documents to structured HTML with advanced OCR and layout analysis capabilities.

✨ Features

Core Capabilities

  • 🖨️ PDF to image conversion with preprocessing (denoise, deskew, enhance)
  • 🔍 Advanced document layout analysis and segmentation
  • 🌍 Multi-language OCR support (Polish, English, German, more)
  • 🏷️ Automatic document type detection
  • 🖥️ Modern, responsive HTML output

Advanced Features

  • 🔄 Batch processing for multiple documents
  • 📊 Metadata extraction and preservation
  • 🧩 Modular architecture for easy extension
  • 🚀 High-performance processing with parallelization
  • 📱 Mobile-responsive output templates
  • 🔍 Searchable text output with confidence scoring

Integration

  • 🐳 Docker support for easy deployment
  • 🧪 Comprehensive test suite
  • 📦 Well-documented Python API
  • 🔌 Plugin system for custom processors

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Tesseract OCR
  • Poppler utilities (poppler-utils)
  • Git (for development)

System Setup (Ubuntu/Debian)

# Install system dependencies
sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-pol \
    tesseract-ocr-eng \
    tesseract-ocr-deu \
    poppler-utils \
    python3-pip \
    python3-venv

🏗️ Architecture

High-Level Overview

graph TD
    A[PDF Input] --> B[PDF Processor]
    B --> C[Layout Analyzer]
    C --> D[OCR Engine]
    D --> E[HTML Generator]
    E --> F[Structured HTML Output]
    
    G[Configuration] --> B
    G --> C
    G --> D
    G --> E
    
    H[Plugins] -->|Extend| B
    H -->|Customize| C
    H -->|Enhance| D
    H -->|Theme| E
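
The diagram shows plugins attaching to individual stages. One way such a hook could be wired is a small registry keyed by stage name. This is only a sketch: `PLUGINS`, `register`, and `run_stage` are hypothetical names for illustration and are not taken from the vhtml API.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical registry: stage name -> list of plugin callables.
# None of these names come from the vhtml API.
PLUGINS: Dict[str, List[Callable[[Any], Any]]] = defaultdict(list)


def register(stage: str) -> Callable[[Callable[[Any], Any]], Callable[[Any], Any]]:
    """Decorator that attaches a plugin to a named pipeline stage."""
    def decorator(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
        PLUGINS[stage].append(func)
        return func
    return decorator


def run_stage(stage: str, data: Any) -> Any:
    """Pass data through every plugin registered for the stage, in order."""
    for plugin in PLUGINS[stage]:
        data = plugin(data)
    return data


@register("ocr")
def strip_whitespace(text: str) -> str:
    # Example plugin: trim leading and trailing whitespace from OCR text.
    return text.strip()


@register("ocr")
def collapse_spaces(text: str) -> str:
    # Example plugin: squeeze runs of whitespace into single spaces.
    return " ".join(text.split())


print(run_stage("ocr", "  Invoice   No.  42 \n"))  # prints: Invoice No. 42
```

A stage with no registered plugins returns its input unchanged, so the core pipeline would behave the same with or without extensions.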

Component Interaction

+----------------+     +-----------------+     +---------------+
|                |     |                 |     |               |
|   PDF Input    |---->|  PDF Processor  |---->| Page Images   |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                    |
                                                    v
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
|  HTML Output   |<----|  HTML Generator |<----|  OCR Results  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                    ^
                                                    |
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
| Configuration  |---->| Layout Analyzer |---->| Page Layout   |
|                |     |                 |     |               |
+----------------+     +-----------------+     +---------------+
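
The data flow above can be sketched as a chain of plain functions. Every function here is a stand-in that fabricates placeholder data rather than reading a real PDF, and the `Page` and `Block` types are assumptions made for illustration, not vhtml classes.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Block:
    """A region of a page plus its recognized text (hypothetical type)."""
    kind: str
    text: str = ""


@dataclass
class Page:
    number: int
    blocks: List[Block] = field(default_factory=list)


def pdf_to_pages(pdf_path: str, page_count: int) -> List[Page]:
    # Stand-in for the PDF Processor: real code would rasterize each page.
    return [Page(number=i + 1) for i in range(page_count)]


def analyze_layout(page: Page) -> Page:
    # Stand-in for the Layout Analyzer: pretend every page has two blocks.
    page.blocks = [Block("header"), Block("body")]
    return page


def run_ocr(page: Page) -> Page:
    # Stand-in for the OCR Engine: fill each block with placeholder text.
    for block in page.blocks:
        block.text = f"{block.kind} text of page {page.number}"
    return page


def to_html(pages: List[Page]) -> str:
    # Stand-in for the HTML Generator: one <section> per page.
    parts = []
    for page in pages:
        parts.append(f'<section data-page="{page.number}">')
        for block in page.blocks:
            parts.append(f'  <div class="{block.kind}">{escape(block.text)}</div>')
        parts.append("</section>")
    return "\n".join(parts)


def convert(pdf_path: str, page_count: int = 1) -> str:
    pages = pdf_to_pages(pdf_path, page_count)
    pages = [run_ocr(analyze_layout(page)) for page in pages]
    return to_html(pages)


print(convert("document.pdf", page_count=2))
```

Keeping each stage a function of its input makes it straightforward to swap one stage out or test it in isolation.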

🔧 Installation

Using Poetry (Recommended)

# 1. Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml

# 2. Install Python dependencies
poetry install

# 3. Install system dependencies (if not already installed)
make install-deps

# 4. Verify installation
make validate

Using Docker

# Build the Docker image
docker build -t vhtml .

# Run the container
docker run -v "$(pwd)/invoices:/app/invoices" -v "$(pwd)/output:/app/output" vhtml \
    python -m vhtml.main /app/invoices/sample.pdf -o /app/output

🧪 Validate Installation

To verify that all dependencies are correctly installed:

# Run validation script
make validate

# Or directly
python scripts/validate_installation.py

# Expected output:
# ✓ Python version: 3.8+
# ✓ Tesseract found: v5.0.0
# ✓ Poppler utils installed
# ✓ All Python dependencies satisfied
# ✓ Test document processed successfully
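
For readers who want a quick check without the Makefile, the prerequisite part of such a validation might look roughly like the sketch below. It is not the contents of scripts/validate_installation.py. The executable names are assumptions: `tesseract` for Tesseract OCR and `pdftoppm` as a representative Poppler utility.

```python
import shutil
import sys
from typing import Dict, Tuple

# Assumed executable names; adjust if your platform packages them differently.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def check_prerequisites(min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Return a name -> bool map showing which prerequisites are satisfied."""
    results = {"python": sys.version_info >= min_python}
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

This only confirms that the tools are on `PATH`; it does not verify versions or installed OCR language packs.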

💻 Usage

Command Line Interface

# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory

# Process a directory of PDF files (batch mode)
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory

# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v

# Specify output format (html/mhtml)
poetry run python -m vhtml.main document.pdf --format mhtml

# Use specific OCR language
poetry run python -m vhtml.main document.pdf --lang pol+eng
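
To make the flag set easier to see at a glance, here is a sketch of an `argparse` parser that accepts the options shown above. It is not the actual `vhtml.main` implementation: the `dest` names and all default values are assumptions. The `+` separator in `--lang` follows Tesseract's convention for combining languages.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the CLI examples; defaults and dest names are assumed.
    parser = argparse.ArgumentParser(prog="vhtml")
    parser.add_argument("input", help="PDF file, or a directory when -b is given")
    parser.add_argument("-o", dest="output", default="output", help="output directory")
    parser.add_argument("-b", dest="batch", action="store_true", help="batch mode")
    parser.add_argument("-v", dest="view", action="store_true", help="open result in browser")
    parser.add_argument("--format", choices=["html", "mhtml"], default="html")
    parser.add_argument("--lang", default="eng", help="languages joined by '+', e.g. pol+eng")
    return parser


def parse_languages(value: str) -> List[str]:
    """Split a Tesseract-style language string such as 'pol+eng' into a list."""
    return [code for code in value.split("+") if code]


args = build_parser().parse_args(["document.pdf", "--lang", "pol+eng", "-o", "out"])
print(args.input, args.output, args.format, parse_languages(args.lang))
# prints: document.pdf out html ['pol', 'eng']
```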

Python API

from vhtml import DocumentAnalyzer

# Initialize with custom settings
analyzer = DocumentAnalyzer(
    languages=['pol', 'eng'],  # OCR languages
    output_format='html',      # 'html' or 'mhtml'
    debug_mode=False          # Enable debug output
)

# Process a single document
result = analyzer.process("document.pdf", "output_dir")
print(f"Generated: {result.output_path}")
print(f"Metadata: {result.metadata}")

# Batch processing
results = analyzer.process_batch("input_dir", "output_dir")
for result in results:
    print(f"Processed: {result.input_path} -> {result.output_path}")
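
When a directory is large, it can help to run conversions concurrently and to keep going when one file fails. The sketch below takes any `convert` callable, so it makes no claims about how `process_batch` works internally. `process_directory` is a hypothetical helper, not part of vhtml.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Tuple


def process_directory(
    input_dir: str,
    convert: Callable[[Path], str],
    max_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run `convert` on every *.pdf in input_dir; return (successes, failures)."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    done: Dict[str, str] = {}
    failed: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, pdf): pdf for pdf in pdfs}
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                done[pdf.name] = future.result()
            except Exception as exc:  # record the failure and keep going
                failed[pdf.name] = exc
    return done, failed
```

With the API above, `convert` could be something like `lambda p: analyzer.process(str(p), "output_dir").output_path`, assuming the analyzer is safe to call from several threads, which the documentation does not state.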

Example: Extract Text from PDF

from vhtml import PDFProcessor, OCREngine

# Load and preprocess PDF
processor = PDFProcessor()
pages = processor.process("document.pdf")

# Perform OCR
ocr = OCREngine(languages=['eng'])
for page_num, page_image in enumerate(pages):
    text = ocr.extract_text(page_image)
    print(f"Page {page_num + 1}:\n{text}\n{'='*50}")

🔄 Development Workflow

graph LR
    A[Clone Repository] --> B[Install Dependencies]
    B --> C[Run Tests]
    C --> D[Make Changes]
    D --> E[Run Linters]
    E --> F[Update Tests]
    F --> G[Commit Changes]
    G --> H[Create Pull Request]

Common Tasks

# Run tests
make test

# Format code
make format

# Run linters
make lint

# Generate documentation
make docs

# Build package
make build

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

🙏 Acknowledgments

  • Tesseract OCR for text recognition
  • Poppler for PDF processing
  • All contributors who helped improve this project

Made with ❤️ by the vHTML Team

Examples

Generate Standalone HTML

Generate a standalone HTML file with all images, JS, and JSON embedded:

poetry run python examples/pdf2html.py
  • Input: Folder with HTML, images, JS, and JSON (e.g., output/mhtml_example/Invoice-30392B3C-0001)
  • Output: Standalone HTML (e.g., output/html_example/Invoice-30392B3C-0001_standalone.html)
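
The core idea behind a standalone file is replacing references to local assets with inline data. The sketch below shows this for images only, using base64 data URIs. It is not the code in examples/pdf2html.py, and `inline_images` is a hypothetical helper name. JS and JSON embedding would need analogous handling.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> and captures the prefix, the path, and the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> paths with base64 data URIs."""
    def replace(match):
        src = match.group(2)
        # Leave already-inlined and remote images untouched.
        if src.startswith(("data:", "http://", "https://")):
            return match.group(0)
        path = base_dir / src
        if not path.is_file():
            return match.group(0)  # missing file: keep the original reference
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

A regular expression is enough for a sketch. For arbitrary HTML, a real parser would be the more robust choice.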

Generate MHTML (Web Archive)

Generate a fully self-contained MHTML file for browser archiving:

poetry run python examples/pdf2mhtml.py
  • Input: PDF(s) in invoices/ (or other test files)
  • Output: MHTML file (e.g., output/mhtml_example/Invoice-30392B3C-0001.mhtml)

  • See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
  • Both scripts demonstrate how to use the vHTML API for document conversion and archiving.
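
For background, MHTML is built on MIME `multipart/related`: the HTML page and each resource become separate parts of one message. The sketch below builds that container with Python's standard `email` package. It is not the code in examples/pdf2mhtml.py, and `build_mhtml` is a hypothetical helper. It assumes PNG images, and whether a given browser opens the result has not been verified.

```python
from email.message import EmailMessage
from typing import Dict


def build_mhtml(html: str, images: Dict[str, bytes], subject: str = "document") -> bytes:
    """Pack an HTML page and its PNG images into one multipart/related message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # Root part: the HTML document itself.
    msg.set_content(html, subtype="html", headers=["Content-Location: index.html"])
    # Each image becomes a related part addressed by its relative path.
    for location, data in images.items():
        msg.add_related(
            data,
            maintype="image",
            subtype="png",
            headers=[f"Content-Location: {location}"],
        )
    return msg.as_bytes()
```

The `Content-Location` header on each part is what lets an `<img src="images/p1.png">` reference in the HTML resolve to the embedded bytes.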

Core Components

  • PDFProcessor: Handles PDF to image conversion and preprocessing
  • LayoutAnalyzer: Analyzes document layout and segments content blocks
  • OCREngine: Performs OCR with language detection and confidence scoring
  • HTMLGenerator: Generates HTML with embedded images and styling
  • DocumentAnalyzer: Integrates all components into a complete workflow
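
Confidence scoring is most useful when downstream code acts on it, for example by dropping low-confidence words from a search index. The sketch below assumes a per-word confidence on a 0-100 scale. The `Word` type and both helpers are hypothetical and do not reflect how OCREngine actually reports results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Word:
    text: str
    confidence: float  # assumed scale: 0 (no confidence) to 100 (certain)


def confident_text(words: Iterable[Word], threshold: float = 60.0) -> str:
    """Join only the words whose confidence meets the threshold."""
    return " ".join(w.text for w in words if w.confidence >= threshold)


def mean_confidence(words: Iterable[Word]) -> float:
    """Average confidence across words; 0.0 for an empty input."""
    items = list(words)
    return sum(w.confidence for w in items) / len(items) if items else 0.0


words = [Word("Invoice", 96), Word("N0.", 41), Word("42", 88)]
print(confident_text(words))   # prints: Invoice 42
print(mean_confidence(words))  # prints: 75.0
```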

Project Structure

vhtml/
├── vhtml/
│   ├── core/
│   │   ├── pdf_processor.py
│   │   ├── layout_analyzer.py
│   │   ├── ocr_engine.py
│   │   └── html_generator.py
│   └── main.py
├── scripts/
│   ├── validate_installation.py
│   └── test_integration.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── IMPLEMENTATION.md
│   └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md

Development

# Setup development environment
make setup

# Run tests
make test

# Format code
make format

# Lint code
make lint

# Build package
make build

Documentation

For more detailed information, see the documentation files in docs/ (ARCHITECTURE.md, IMPLEMENTATION.md, and PROJECT_STRUCTURE.md).
