Intelligent HTML content extraction and merge tool for bidirectional document transformation
HTMLAdapt: HTML Content Extraction and Merge Tool
HTMLAdapt is a Python tool for bidirectional HTML document transformation that preserves structural integrity while enabling content modification through an intermediate representation. Useful for translation workflows, content editing, and HTML processing where maintaining original formatting and styling matters.
Why HTMLAdapt?
When working with complex HTML documents that need translation or content editing, traditional approaches often fail:
- Manual editing risks breaking structure and styling
- Simple find-replace can't handle complex markup
- Existing tools lose formatting and hierarchy
- Translation tools often mangle HTML
HTMLAdapt solves these problems with algorithms that understand HTML structure and preserve it through the entire edit-merge cycle.
How It Works
HTMLAdapt uses a two-phase workflow:
1. Extract Phase
Transforms the original HTML into two representations:
- Superset Document: Original HTML with unique IDs added to all text-containing elements
- Subset Document: Simplified version with only translatable content, preserving IDs
```python
from htmladapt import HTMLExtractMergeTool

tool = HTMLExtractMergeTool()
map_html, comp_html = tool.extract(original_html)
```
2. Merge Phase
Recombines edited content with original structure using reconciliation algorithms:
```python
final_html = tool.merge(
    edited_comp_html,
    original_comp_html,
    map_html,
    original_html,
)
```
Key Features
Structure Preservation
Maintains all original HTML structure, CSS classes, JavaScript references, and formatting during content modification.
Element Matching
Uses multiple strategies to match content between versions:
- Perfect ID matching for unchanged elements
- Hash-based signatures for content similarity
- Fuzzy matching for modified text
- LLM integration for ambiguous cases
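To illustrate the hash-based strategy, here is a minimal sketch of content-signature matching. The function names and the normalization rules (whitespace collapse, lowercasing) are illustrative assumptions, not HTMLAdapt's actual API:

```python
import hashlib

def content_signature(text: str) -> str:
    """Hash whitespace-normalized, lowercased text so identical content
    matches even when spacing or surrounding markup changed."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def match_by_signature(old_elements: dict, new_elements: dict) -> dict:
    """Map old element IDs to new element IDs whose text signatures agree."""
    new_by_sig = {content_signature(t): eid for eid, t in new_elements.items()}
    return {
        old_id: new_by_sig[content_signature(text)]
        for old_id, text in old_elements.items()
        if content_signature(text) in new_by_sig
    }

pairs = match_by_signature(
    {"xhq1": "Hello  World", "xhq2": "Unchanged"},
    {"n1": "hello world", "n2": "Edited text"},
)
print(pairs)  # {'xhq1': 'n1'}
```

Because the lookup is a dictionary keyed by hash, each element is matched in constant time, which is where the O(n) behavior mentioned below comes from.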
Performance
Optimized for large documents:
- lxml parser for speed (2-3x faster than alternatives)
- O(n) hash-based matching in most cases
- Memory-efficient processing
- Configurable performance profiles
AI Conflict Resolution
Integrates with Large Language Models to resolve complex matching scenarios that algorithms alone cannot handle.
Error Handling
Handles malformed HTML, deeply nested structures, and edge cases gracefully with fallback mechanisms.
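As a rough illustration of why a tolerant fallback parser matters: the stdlib `html.parser` (HTMLAdapt's fallback backend, per the Architecture section) accepts malformed markup without raising. This standalone sketch collects text from a fragment with unclosed tags:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Tolerant stdlib parser: gathers text nodes even from broken HTML."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> list:
    parser = TextCollector()
    parser.feed(html)  # does not raise on unclosed or mismatched tags
    return parser.chunks

print(extract_text("<p>Broken <b>markup"))  # ['Broken', 'markup']
```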
Installation
```bash
pip install htmladapt
```
Or with LLM support:
```bash
pip install htmladapt[llm]
```
Quick Start
Basic Usage
```python
from htmladapt import HTMLExtractMergeTool

# Initialize the tool
tool = HTMLExtractMergeTool(id_prefix="trans_")

# Step 1: Extract content
with open('document.html', 'r') as f:
    original_html = f.read()
map_html, comp_html = tool.extract(original_html)

# Step 2: Edit the subset
edited_comp_html = comp_html.replace('Hello', 'Hola').replace('World', 'Mundo')

# Step 3: Merge back
final_html = tool.merge(
    edited_comp_html,  # Edited content
    comp_html,         # Original subset for comparison
    map_html,          # Enhanced original with IDs
    original_html,     # Original document
)

# Save result
with open('translated_document.html', 'w') as f:
    f.write(final_html)
```
Advanced Configuration
```python
from htmladapt import HTMLExtractMergeTool, ProcessingConfig

# Custom configuration
config = ProcessingConfig(
    id_prefix="my_prefix_",
    simi_level=0.8,
    llm_use=True,
    model_llm="gpt-4o-mini",
    perf="accurate",  # fast|balanced|accurate
)

tool = HTMLExtractMergeTool(config=config)
```
With LLM Integration
```python
import os
from htmladapt import HTMLExtractMergeTool, LLMReconciler

# Set up LLM
llm = LLMReconciler(
    api_key=os.environ['OPENAI_API_KEY'],
    model="gpt-4o-mini",
)

tool = HTMLExtractMergeTool(llm_reconciler=llm)

# The LLM is invoked automatically for ambiguous matches
final_html = tool.merge(edited_comp_html, comp_html, map_html, original_html)
```
Use Cases
Website Translation
Translate content while preserving CSS classes, JavaScript, and design.
```python
# Extract content
superset, subset = tool.extract(webpage_html)

# Send to translation service
translated_subset = translation_service.translate(subset, target_lang='es')

# Merge back with styling intact
localized_webpage = tool.merge(translated_subset, subset, superset, webpage_html)
```
Content Management
Edit HTML in a simplified interface while maintaining complex structure.
```python
# Extract for CMS (keep the superset: merge needs it later)
superset, editable_content = tool.extract(article_html)

# User edits content
edited_content = cms.edit_interface(editable_content)

# Merge back with layout preserved
updated_article = tool.merge(edited_content, editable_content, superset, article_html)
```
Documentation Maintenance
Update docs while preserving code highlighting and navigation.
```python
# Extract text
superset, docs_text = tool.extract(documentation_html)

# Update content
updated_text = update_documentation(docs_text)

# Merge with formatting intact
final_docs = tool.merge(updated_text, docs_text, superset, documentation_html)
```
Architecture
HTMLAdapt uses a layered approach:
Layer 1: HTML Parsing
- Primary: BeautifulSoup with lxml backend
- Fallback: html.parser for malformed HTML
- Error Recovery: Automatic tag closure and structure repair
Layer 2: ID Generation
- Base36 encoding for compact IDs
- Hierarchical numbering for traceability
- Collision detection and prevention
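A minimal sketch of compact base36 ID generation with collision avoidance; the helper names are hypothetical and HTMLAdapt's internals may differ:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base36 digits: 0-9, a-z

def base36(n: int) -> str:
    """Encode a non-negative counter as a compact base36 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def make_id(counter: int, prefix: str = "xhq", seen: set = None) -> str:
    """Build a prefixed ID, skipping candidates that already exist
    in the document (collision detection)."""
    seen = seen or set()
    candidate = prefix + base36(counter)
    while candidate in seen:
        counter += 1
        candidate = prefix + base36(counter)
    return candidate

print(base36(1295))  # 'zz' -- two base36 digits cover 1296 elements
```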
Layer 3: Matching Strategies
- Perfect Matching: Identical ID preservation (fastest)
- Hash Matching: Content signature comparison (fast)
- Fuzzy Matching: Similarity scoring with difflib (accurate)
- LLM Matching: Semantic understanding for edge cases (most accurate)
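The fuzzy layer can be approximated with the stdlib `difflib`, which the project names for similarity scoring; the threshold and function names here are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on difflib's matching-block algorithm."""
    return SequenceMatcher(None, a, b).ratio()

def best_fuzzy_match(needle: str, candidates: list, threshold: float = 0.7):
    """Return the most similar candidate above the threshold, else None."""
    scored = [(similarity(needle, c), c) for c in candidates]
    score, best = max(scored)
    return best if score >= threshold else None

print(best_fuzzy_match("Hello world", ["Hola mundo", "Hello, world!"]))
```

With the default threshold of 0.7 (matching the `simi_level` default documented below), light punctuation edits still match while a full translation does not.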
Layer 4: Structural Analysis
- LCS algorithms for sequence reordering
- Tree diff algorithms for hierarchical changes
- Conflict identification for manual resolution
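As a sketch of the LCS idea: elements that fall outside a common subsequence of the old and new orderings are the ones that moved and need structural handling. This uses `difflib`'s matching blocks as a practical LCS approximation, not HTMLAdapt's actual routine:

```python
from difflib import SequenceMatcher

def detect_moves(old_order: list, new_order: list) -> list:
    """Return element IDs in new_order that are not part of a common
    subsequence with old_order -- i.e. candidates for reordering."""
    sm = SequenceMatcher(None, old_order, new_order)
    in_lcs = set()
    for block in sm.get_matching_blocks():
        in_lcs.update(new_order[block.b:block.b + block.size])
    return [eid for eid in new_order if eid not in in_lcs]

print(detect_moves(["a", "b", "c", "d"], ["a", "c", "b", "d"]))  # ['c']
```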
Layer 5: Reconciliation
- Three-way merge logic from version control
- Contextual conflict resolution with minimal LLM calls
- Fallback heuristics for offline operation
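The three-way rule borrowed from version control reduces, per matched element, to a small decision table. A self-contained sketch; the conflict branch is a placeholder for the LLM/heuristic step:

```python
def three_way_merge(base: str, ours: str, theirs: str):
    """Classic three-way logic for one element's text:
    if only one side changed, take it; if both changed identically,
    take either; if both changed differently, flag a conflict."""
    if ours == theirs:
        return ours, False        # same result on both sides
    if ours == base:
        return theirs, False      # only 'theirs' changed
    if theirs == base:
        return ours, False        # only 'ours' changed
    return base, True             # real conflict: keep base, mark it

print(three_way_merge("Hello", "Hola", "Hello"))    # ('Hola', False)
print(three_way_merge("Hello", "Hola", "Bonjour"))  # ('Hello', True)
```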
Performance
| Document Size | Processing Time | Memory Usage | Recommended Profile |
|---|---|---|---|
| < 1MB | ~100ms | 4-8MB | balanced |
| 1-10MB | ~1-5s | 20-80MB | fast |
| > 10MB | ~5-30s | 100-400MB | fast |
Error Handling
HTMLAdapt handles common issues:
- Malformed tags: Automatic closure and repair
- Deeply nested structures: Configurable depth limits
- Large documents: Memory-efficient streaming
- Encoding issues: Automatic detection and conversion
- Missing elements: Fallback matching
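Encoding detection and conversion can be sketched as a BOM check plus a fallback chain; the specific encodings tried here are illustrative assumptions:

```python
def decode_bytes(raw: bytes) -> str:
    """Detect a UTF-8 BOM, then try common encodings in order,
    falling back to replacement so bad bytes never abort processing."""
    if raw.startswith(b"\xef\xbb\xbf"):           # UTF-8 BOM
        return raw[3:].decode("utf-8")
    for enc in ("utf-8", "cp1252"):               # common web encodings
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort

print(decode_bytes("café".encode("utf-8")))  # café
```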
Testing
HTMLAdapt includes comprehensive test suites:
```bash
# Run all tests
pytest tests/

# Run with coverage
pytest --cov=htmladapt tests/

# Performance benchmarks
pytest tests/benchmarks/
```
Test categories:
- Unit tests for components
- Integration tests for workflows
- Performance tests with various document sizes
- Edge case tests for malformed HTML
- Round-trip tests for content preservation
API Reference
Core Classes
HTMLExtractMergeTool
Main interface for extraction and merging.
Methods:
- `extract(html: str) -> Tuple[str, str]`: Create the superset and subset documents
- `merge(edited: str, subset: str, superset: str, original: str) -> str`: Merge edited content back into the original
ProcessingConfig
Configuration object.
Parameters:
- `id_prefix: str`: ID prefix (default: `"xhq"`)
- `simi_level: float`: Minimum similarity for fuzzy matching (default: `0.7`)
- `llm_use: bool`: Use LLM for conflicts (default: `False`)
- `perf: str`: `fast|balanced|accurate` (default: `"balanced"`)
LLMReconciler
LLM conflict resolution interface.
Parameters:
- `api_key: str`: OpenAI API key
- `model: str`: Model name (default: `"gpt-4o-mini"`)
- `max_context_tokens: int`: Maximum tokens per request (default: `1000`)
Utility Functions
```python
from htmladapt.utils import (
    validate_html,
    estimate_processing_time,
    optimize_for_size,
)

# Validate HTML
is_valid, issues = validate_html(html_content)

# Estimate processing time and memory
time_estimate, memory_estimate = estimate_processing_time(html_content)

# Optimize large documents
optimized_html = optimize_for_size(html_content, target_size_mb=5)
```
Integration Examples
Flask Application
```python
from flask import Flask, request, jsonify
from htmladapt import HTMLExtractMergeTool

app = Flask(__name__)
tool = HTMLExtractMergeTool()

@app.route('/extract', methods=['POST'])
def extract_content():
    html = request.json['html']
    superset, subset = tool.extract(html)
    return jsonify({
        'superset': superset,
        'subset': subset,
    })

@app.route('/merge', methods=['POST'])
def merge_content():
    data = request.json
    result = tool.merge(
        data['edited'],
        data['subset'],
        data['superset'],
        data['original'],
    )
    return jsonify({'result': result})
```
Django Integration
```python
# models.py
from django.db import models

class Document(models.Model):
    original_html = models.TextField()
    map_html = models.TextField()
    comp_html = models.TextField()

    def extract_content(self):
        from htmladapt import HTMLExtractMergeTool
        tool = HTMLExtractMergeTool()
        self.map_html, self.comp_html = tool.extract(self.original_html)
        self.save()

    def merge_content(self, edited_comp_html):
        from htmladapt import HTMLExtractMergeTool
        tool = HTMLExtractMergeTool()
        return tool.merge(
            edited_comp_html,
            self.comp_html,
            self.map_html,
            self.original_html,
        )
```
Celery Processing
```python
from celery import Celery
from htmladapt import HTMLExtractMergeTool

app = Celery('htmladapt_tasks')
tool = HTMLExtractMergeTool()

@app.task
def process_large_document(html_content, user_id):
    try:
        superset, subset = tool.extract(html_content)
        # store_subset: application-specific persistence helper
        return {'status': 'success', 'comp_id': store_subset(subset)}
    except Exception as e:
        return {'status': 'error', 'message': str(e)}

@app.task
def merge_edited_content(edited_comp_html, comp_html, map_html, original_html):
    return tool.merge(edited_comp_html, comp_html, map_html, original_html)
```
Contributing
See CONTRIBUTING.md for guidelines.
Development Setup
```bash
# Clone repository
git clone https://github.com/yourusername/htmladapt.git
cd htmladapt

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,test,llm]"

# Run tests
pytest

# Run type checking
mypy htmladapt/

# Format and lint code
black htmladapt/
ruff check htmladapt/
```
Code Structure
```
htmladapt/
├── core/
│   ├── parser.py          # HTML parsing
│   ├── extractor.py       # Content extraction
│   ├── matcher.py         # Element matching
│   └── merger.py          # Content reconciliation
├── algorithms/
│   ├── id_generation.py   # ID generation
│   ├── tree_diff.py       # Tree comparison
│   └── fuzzy_match.py     # Similarity scoring
├── llm/
│   ├── reconciler.py      # LLM integration
│   └── prompts.py         # Prompt templates
├── utils/
│   ├── html_utils.py      # HTML utilities
│   └── performance.py     # Performance optimization
└── tests/
    ├── unit/              # Unit tests
    ├── integration/       # Integration tests
    └── benchmarks/        # Performance tests
```
License
MIT License - see LICENSE file.
Support
- Documentation: https://htmladapt.readthedocs.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@htmladapt.dev
Citation
For academic use:
```bibtex
@software{htmladapt2024,
  title = {HTMLAdapt: HTML Content Extraction and Merge Tool},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/htmladapt}
}
```