Intelligent HTML content extraction and merge tool for bidirectional document transformation

These details have not been verified by PyPI

Project links

Project description

HTMLAdapt: Intelligent HTML Content Extraction and Merge Tool

HTMLAdapt is a Python-based tool for bidirectional HTML document transformation that preserves structural integrity while enabling seamless content modification through an intermediate representation. Perfect for translation workflows, content editing, and HTML processing where maintaining original formatting and styling is critical.

Why HTMLAdapt?

When working with complex HTML documents that need translation or content editing, traditional approaches fall short:

Manual editing risks breaking structure and styling
Simple find-replace can't handle complex markup patterns
Existing tools lose crucial formatting and hierarchical relationships
Translation tools often mangle HTML or require extensive post-processing

HTMLAdapt solves these challenges with intelligent algorithms that understand HTML structure and preserve it through the entire edit-merge cycle.

How It Works

HTMLAdapt implements a sophisticated two-phase workflow:

1. Extract Phase

Transforms a complex original HTML document into two complementary representations:

Superset Document: The original HTML enhanced with unique IDs on all text-containing elements
Subset Document: A lightweight version containing only translatable content with preserved IDs

from htmladapt import HTMLExtractMergeTool

tool = HTMLExtractMergeTool()
superset_html, subset_html = tool.extract(original_html)

2. Merge Phase

Intelligently recombines edited content with the original structure using advanced reconciliation algorithms:

final_html = tool.merge(
    edited_subset_html,
    original_subset_html,
    superset_html,
    original_html
)

Key Features

Perfect Structure Preservation

Maintains all original HTML structure, CSS classes, JavaScript references, and formatting while allowing content modification.

Intelligent Element Matching

Uses multiple sophisticated algorithms to match content between versions:

Perfect ID matching for unchanged elements
Hash-based signatures for content similarity
Fuzzy matching for modified text
LLM integration for ambiguous cases

High Performance

Optimized for large documents with:

lxml parser for speed (2-3x faster than alternatives)
O(n) hash-based matching for most cases
Memory-efficient processing
Configurable performance profiles

AI-Powered Conflict Resolution

Integrates with Large Language Models to resolve complex matching scenarios that pure algorithms cannot handle.

Robust Error Handling

Handles malformed HTML, deeply nested structures, and edge cases gracefully with comprehensive fallback mechanisms.

Installation

pip install htmladapt

Or install with LLM support:

pip install htmladapt[llm]

Quick Start

Basic Usage

from htmladapt import HTMLExtractMergeTool

# Initialize the tool
tool = HTMLExtractMergeTool(id_prefix="trans_")

# Step 1: Extract content from original HTML
original_html = open('document.html', 'r').read()
superset_html, subset_html = tool.extract(original_html)

# Step 2: Edit the subset (translate, modify content, etc.)
# This is where you would integrate your translation workflow
edited_subset = subset_html.replace('Hello', 'Hola').replace('World', 'Mundo')

# Step 3: Merge edited content back into original structure
final_html = tool.merge(
    edited_subset,      # Your edited content
    subset_html,        # Original subset for comparison
    superset_html,      # Enhanced original with IDs
    original_html       # Original document
)

# Save the result
with open('translated_document.html', 'w') as f:
    f.write(final_html)

Advanced Configuration

from htmladapt import HTMLExtractMergeTool, ProcessingConfig

# Custom configuration
config = ProcessingConfig(
    id_prefix="my_prefix_",
    similarity_threshold=0.8,
    enable_llm_resolution=True,
    llm_model="gpt-4o-mini",
    performance_profile="accurate"  # fast|balanced|accurate
)

tool = HTMLExtractMergeTool(config=config)

With LLM Integration

import os
from htmladapt import HTMLExtractMergeTool, LLMReconciler

# Set up LLM for conflict resolution
llm = LLMReconciler(
    api_key=os.environ['OPENAI_API_KEY'],
    model="gpt-4o-mini"
)

tool = HTMLExtractMergeTool(llm_reconciler=llm)

# The tool will automatically use LLM for ambiguous matches
final_html = tool.merge(edited_subset, subset_html, superset_html, original_html)

Use Cases

Website Translation

Translate website content while preserving all CSS classes, JavaScript functionality, and visual design.

# Extract translatable content
superset, subset = tool.extract(webpage_html)

# Send subset to translation service
translated_subset = translation_service.translate(subset, target_lang='es')

# Merge back maintaining all original styling
localized_webpage = tool.merge(translated_subset, subset, superset, webpage_html)

Content Management

Edit HTML content in a simplified interface while maintaining complex original structure.

# Extract editable content for CMS
_, editable_content = tool.extract(article_html)

# User edits in simplified interface
edited_content = cms.edit_interface(editable_content)

# Merge back preserving article layout and styling
updated_article = tool.merge(edited_content, editable_content, superset, article_html)

Documentation Maintenance

Update technical documentation while preserving code highlighting, navigation, and styling.

# Extract documentation text
superset, docs_text = tool.extract(documentation_html)

# Update content while preserving code blocks and formatting
updated_text = update_documentation(docs_text)

# Merge maintaining syntax highlighting and navigation
final_docs = tool.merge(updated_text, docs_text, superset, documentation_html)

Architecture Deep Dive

HTMLAdapt uses a multi-layered approach to ensure reliable HTML processing:

Layer 1: Robust HTML Parsing

Primary: BeautifulSoup with lxml backend for performance
Fallback: html.parser for malformed HTML
Error Recovery: Automatic tag closure and structure repair

Layer 2: Intelligent ID Generation

Base36 encoding for compact, collision-free IDs
Hierarchical numbering for traceability
Collision detection and prevention

Layer 3: Multi-Strategy Matching

Perfect Matching: Identical ID preservation (fastest)
Hash Matching: Content signature comparison (fast)
Fuzzy Matching: Similarity scoring with difflib (accurate)
LLM Matching: Semantic understanding for edge cases (most accurate)

Layer 4: Structural Analysis

LCS algorithms for sequence reordering detection
Tree diff algorithms for hierarchical changes
Conflict identification for manual resolution

Layer 5: Smart Reconciliation

Three-way merge logic from version control systems
Contextual conflict resolution using minimal LLM calls
Fallback heuristics for offline operation

Performance Characteristics

Document Size	Processing Time	Memory Usage	Recommended Profile
< 1MB	~100ms	4-8MB	balanced
1-10MB	~1-5s	20-80MB	fast
> 10MB	~5-30s	100-400MB	fast

Error Handling

HTMLAdapt gracefully handles common HTML issues:

Malformed tags: Automatic closure and repair
Deeply nested structures: Configurable depth limits
Large documents: Memory-efficient streaming
Encoding issues: Automatic detection and conversion
Missing elements: Intelligent fallback matching

Testing and Quality Assurance

HTMLAdapt includes comprehensive test suites:

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=htmladapt tests/

# Performance benchmarks
pytest tests/benchmarks/

Test categories:

Unit tests for individual components
Integration tests for end-to-end workflows
Performance tests with various document sizes
Edge case tests for malformed HTML
Round-trip tests to ensure content preservation

API Reference

Core Classes

`HTMLExtractMergeTool`

Main interface for extraction and merging operations.

Methods:

extract(html: str) -> Tuple[str, str]: Create superset and subset
merge(edited: str, subset: str, superset: str, original: str) -> str: Merge content

`ProcessingConfig`

Configuration object for customizing behavior.

Parameters:

id_prefix: str: Prefix for generated IDs (default: "auto_")
similarity_threshold: float: Minimum similarity for fuzzy matching (default: 0.7)
enable_llm_resolution: bool: Use LLM for conflicts (default: False)
performance_profile: str: Processing profile - fast|balanced|accurate (default: "balanced")

`LLMReconciler`

Interface for LLM-powered conflict resolution.

Parameters:

api_key: str: OpenAI API key
model: str: Model name (default: "gpt-4o-mini")
max_context_tokens: int: Maximum tokens per request (default: 1000)

Utility Functions

from htmladapt.utils import (
    validate_html,
    estimate_processing_time,
    optimize_for_size
)

# Validate HTML before processing
is_valid, issues = validate_html(html_content)

# Estimate processing requirements
time_estimate, memory_estimate = estimate_processing_time(html_content)

# Optimize large documents
optimized_html = optimize_for_size(html_content, target_size_mb=5)

Integration Examples

Flask Web Application

from flask import Flask, request, jsonify
from htmladapt import HTMLExtractMergeTool

app = Flask(__name__)
tool = HTMLExtractMergeTool()

@app.route('/extract', methods=['POST'])
def extract_content():
    html = request.json['html']
    superset, subset = tool.extract(html)
    return jsonify({
        'superset': superset,
        'subset': subset
    })

@app.route('/merge', methods=['POST'])
def merge_content():
    data = request.json
    result = tool.merge(
        data['edited'],
        data['subset'],
        data['superset'],
        data['original']
    )
    return jsonify({'result': result})

Django Integration

# models.py
from django.db import models

class Document(models.Model):
    original_html = models.TextField()
    superset_html = models.TextField()
    subset_html = models.TextField()

    def extract_content(self):
        from htmladapt import HTMLExtractMergeTool
        tool = HTMLExtractMergeTool()
        self.superset_html, self.subset_html = tool.extract(self.original_html)
        self.save()

    def merge_content(self, edited_html):
        from htmladapt import HTMLExtractMergeTool
        tool = HTMLExtractMergeTool()
        return tool.merge(
            edited_html,
            self.subset_html,
            self.superset_html,
            self.original_html
        )

Celery Background Processing

from celery import Celery
from htmladapt import HTMLExtractMergeTool

app = Celery('htmladapt_tasks')
tool = HTMLExtractMergeTool()

@app.task
def process_large_document(html_content, user_id):
    try:
        superset, subset = tool.extract(html_content)
        # Store results or notify user
        return {'status': 'success', 'subset_id': store_subset(subset)}
    except Exception as e:
        return {'status': 'error', 'message': str(e)}

@app.task
def merge_edited_content(edited_html, subset_html, superset_html, original_html):
    result = tool.merge(edited_html, subset_html, superset_html, original_html)
    return result

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/yourusername/htmladapt.git
cd htmladapt

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,test,llm]"

# Run tests
pytest

# Run type checking
mypy htmladapt/

# Format code
black htmladapt/
ruff check htmladapt/

Architecture for Contributors

The codebase is organized into logical modules:

htmladapt/
├── core/
│   ├── parser.py          # HTML parsing logic
│   ├── extractor.py       # Content extraction
│   ├── matcher.py         # Element matching algorithms
│   └── merger.py          # Content reconciliation
├── algorithms/
│   ├── id_generation.py   # ID generation strategies
│   ├── tree_diff.py       # Tree comparison algorithms
│   └── fuzzy_match.py     # Similarity scoring
├── llm/
│   ├── reconciler.py      # LLM integration
│   └── prompts.py         # Prompt templates
├── utils/
│   ├── html_utils.py      # HTML processing utilities
│   └── performance.py    # Performance optimization
└── tests/
    ├── unit/              # Unit tests
    ├── integration/       # Integration tests
    └── benchmarks/        # Performance tests

License

MIT License - see LICENSE file for details.

Support and Community

Documentation: https://htmladapt.readthedocs.io
Issues: GitHub Issues
Discussions: GitHub Discussions
Email: support@htmladapt.dev

Citation

If you use HTMLAdapt in academic research, please cite:

@software{htmladapt2024,
  title={HTMLAdapt: Intelligent HTML Content Extraction and Merge Tool},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/htmladapt}
}

HTMLAdapt - Making HTML content transformation intelligent, reliable, and effortless.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.0.12

Sep 25, 2025

1.0.11

Sep 25, 2025

1.0.10

Sep 25, 2025

1.0.9

Sep 25, 2025

1.0.8

Sep 25, 2025

This version

1.0.4

Sep 25, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

htmladapt-1.0.4.tar.gz (18.1 kB view details)

Uploaded Sep 25, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

htmladapt-1.0.4-py3-none-any.whl (23.0 kB view details)

Uploaded Sep 25, 2025 Python 3

File details

Details for the file htmladapt-1.0.4.tar.gz.

File metadata

Download URL: htmladapt-1.0.4.tar.gz
Upload date: Sep 25, 2025
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for htmladapt-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`de12528fb0889c1429c6e3d2e6239ba8198aceaa35ce57ba1021e36e177c7ec4`
MD5	`9df4b67be087ff1e65e1faee9f20cee1`
BLAKE2b-256	`9386f01e019b71631c6924c8b31744d020dae0c931f7992dc8b40cb5119388ef`

See more details on using hashes here.

File details

Details for the file htmladapt-1.0.4-py3-none-any.whl.

File metadata

Download URL: htmladapt-1.0.4-py3-none-any.whl
Upload date: Sep 25, 2025
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.15

File hashes

Hashes for htmladapt-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ec4f43a031ef2d29221de815b09f23e0da68ed23a0656b968675b0747b35104`
MD5	`db9f535343a7763af00df586566e5bf8`
BLAKE2b-256	`d0ebec1a2d0e5c70f2606e923cce82de6fc731ea0a0f7c603034be327f68f7a9`

See more details on using hashes here.

htmladapt 1.0.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

HTMLAdapt: Intelligent HTML Content Extraction and Merge Tool

Why HTMLAdapt?

How It Works

1. Extract Phase

2. Merge Phase

Key Features

Perfect Structure Preservation

Intelligent Element Matching

High Performance

AI-Powered Conflict Resolution

Robust Error Handling

Installation

Quick Start

Basic Usage

Advanced Configuration

With LLM Integration

Use Cases

Website Translation

Content Management

Documentation Maintenance

Architecture Deep Dive

Layer 1: Robust HTML Parsing

Layer 2: Intelligent ID Generation

Layer 3: Multi-Strategy Matching

Layer 4: Structural Analysis

Layer 5: Smart Reconciliation

Performance Characteristics

Error Handling

Testing and Quality Assurance

API Reference

Core Classes

HTMLExtractMergeTool

ProcessingConfig

LLMReconciler

Utility Functions

Integration Examples

Flask Web Application

Django Integration

Celery Background Processing

Contributing

Development Setup

Architecture for Contributors

License

Support and Community

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`HTMLExtractMergeTool`

`ProcessingConfig`

`LLMReconciler`