Advanced text alignment and semantic containment analysis tool

Semantic Comparer

Advanced text alignment and semantic containment analysis tool using modern Python practices.

Features

Semantic Alignment: Aligns paragraphs with the Smith-Waterman algorithm, scoring pairs by sentence-transformer embedding similarity
Modern CLI: Built with Typer and Rich for a clean, user-friendly interface
Async Processing: Asynchronous, non-blocking I/O operations
File Support: Accepts direct text input or file paths (prefixed with @)
Rich Output: Colorized, formatted results with detailed statistics
Type Safety: Full type annotations throughout

Installation

# Install dependencies
uv add rich typer aiofiles

# Install the package
uv pip install -e .

# Or run directly as a module
python -m semantic_comparer compare "text1" "text2"

Usage

Basic Comparison

# Compare two texts directly
python -m semantic_comparer compare "This is the first text." "This is the second text."

# Compare with custom parameters
python -m semantic_comparer compare \
  "First text content" \
  "Second text content" \
  --model paraphrase-multilingual-MiniLM-L12-v2 \
  --gap-penalty 0.3 \
  --similarity-threshold 0.5

File-based Comparison

# Compare text files (prefix with @)
python -m semantic_comparer compare @file1.txt @file2.txt

# Mix direct text and file
python -m semantic_comparer compare "Direct text" @file.txt
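
The `@` prefix convention above could be handled along these lines. This is a minimal sketch, not the package's actual code: `resolve_input` is a hypothetical helper name, and UTF-8 decoding is an assumption.

```python
from pathlib import Path


def resolve_input(arg: str) -> str:
    """Return file contents for '@path' arguments, else the argument itself.

    Hypothetical helper: the real CLI may resolve its inputs differently.
    """
    if arg.startswith("@"):
        # Strip the marker and read the file; UTF-8 is an assumption.
        return Path(arg[1:]).read_text(encoding="utf-8")
    return arg
```

With a helper like this, direct text and file arguments can be mixed freely, because each argument is resolved independently.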

Output Options

# Quiet mode (summary only)
python -m semantic_comparer compare text1 text2 --quiet

# Save detailed results to JSON
python -m semantic_comparer compare text1 text2 --output results.json

Command Line Options

Option	Short	Description	Default
`--model`	`-m`	Sentence transformer model	`paraphrase-multilingual-MiniLM-L12-v2`
`--gap-penalty`	`-g`	Penalty for gaps (0.0-1.0)	`0.3`
`--similarity-threshold`	`-t`	Minimum similarity for matches (0.0-1.0)	`0.5`
`--output`	`-o`	Output file for JSON results	None
`--quiet`	`-q`	Suppress detailed output	False
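
The 0.0-1.0 ranges in the table suggest a range check on both numeric options. A minimal sketch of such a check follows; `validate_unit_interval` is a hypothetical name, and treating both bounds as inclusive is an assumption, since the table does not say.

```python
def validate_unit_interval(name: str, value: float) -> float:
    """Return value unchanged if 0.0 <= value <= 1.0, else raise ValueError.

    Inclusive bounds are an assumption, not documented behaviour.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return value
```

In a Typer-based CLI the same constraint can usually be declared on the option itself, since `typer.Option` accepts `min` and `max` for numeric parameters. Whether this package does so is not stated.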

Understanding the Results

Alignment Types

✓ Match: Paragraphs that are semantically similar
⚠ Only in A/B: Paragraphs present in one text but not the other
✗ Unaligned: Paragraphs that couldn't be matched

Containment Score

The semantic containment score measures how much of text A's semantic content is found in text B:

0.0-0.4: Low similarity (Red)
0.4-0.7: Moderate similarity (Yellow)
0.7-1.0: High similarity (Green)
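
The bands above can be expressed as a small lookup. This is a sketch only: `score_band` is a hypothetical name, and because the listed ranges share their endpoints, assigning 0.4 and 0.7 to the higher band is an assumption.

```python
def score_band(score: float) -> tuple[str, str]:
    """Map a containment score in [0, 1] to a (label, color) pair.

    Boundary handling is an assumption: 0.4 and 0.7 start the higher band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score < 0.4:
        return ("Low", "red")
    if score < 0.7:
        return ("Moderate", "yellow")
    return ("High", "green")
```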

Advanced Usage

Custom Models

# Use a different sentence transformer model
python -m semantic_comparer compare text1 text2 --model all-mpnet-base-v2

Fine-tuning Parameters

# Stricter matching (higher threshold)
python -m semantic_comparer compare text1 text2 --similarity-threshold 0.8

# More lenient gap handling (lower penalty)
python -m semantic_comparer compare text1 text2 --gap-penalty 0.1

Development

Project Structure

semantic_comparer/
├── __init__.py          # Package initialization
├── core.py              # Core alignment logic
├── cli.py               # Command-line interface
└── utils.py             # Utility functions

Running Tests

# Install dev dependencies
uv add --dev pytest pytest-asyncio black isort mypy ruff

# Run tests
pytest

# Format code
black .
isort .

# Type checking
mypy .

# Linting
ruff check .

Technical Details

Algorithm

The tool adapts the Smith-Waterman local alignment algorithm to semantic similarity, in five steps:

Text Segmentation: Split texts into paragraphs
Embedding Generation: Convert paragraphs to semantic vectors
Similarity Calculation: Compute cosine similarity between vectors
Dynamic Programming: Apply Smith-Waterman for optimal alignment
Score Calculation: Weighted containment score based on matches
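
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: toy vectors stand in for sentence-transformer embeddings, the similarity is shifted by the threshold so that weak pairs score negatively, and the containment formula (sum of matched similarities divided by the number of paragraphs in A) is one plausible weighting. The names `cosine`, `smith_waterman`, and `containment` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def smith_waterman(
    a: Sequence[Vector],
    b: Sequence[Vector],
    gap_penalty: float = 0.3,
    threshold: float = 0.5,
) -> list[tuple[int, int, float]]:
    """Return (index_a, index_b, similarity) matches on the best local alignment."""
    n, m = len(a), len(b)
    sim = [[cosine(x, y) for y in b] for x in a]
    # h[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: pairs below the threshold contribute a negative score.
            diag = h[i - 1][j - 1] + (sim[i - 1][j - 1] - threshold)
            up = h[i - 1][j] - gap_penalty
            left = h[i][j - 1] - gap_penalty
            h[i][j] = max(0.0, diag, up, left)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs: list[tuple[int, int, float]] = []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0.0:
        diag = h[i - 1][j - 1] + (sim[i - 1][j - 1] - threshold)
        if h[i][j] == diag:
            # Record only pairs that clear the similarity threshold.
            if sim[i - 1][j - 1] >= threshold:
                pairs.append((i - 1, j - 1, sim[i - 1][j - 1]))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap_penalty:
            i -= 1  # gap: paragraph of A has no partner in B
        else:
            j -= 1  # gap: paragraph of B has no partner in A
    pairs.reverse()
    return pairs


def containment(pairs: Sequence[tuple[int, int, float]], n_a: int) -> float:
    """Assumed weighting: matched similarity summed over paragraphs of A."""
    return sum(s for _, _, s in pairs) / n_a if n_a else 0.0
```

Lowering `gap_penalty` lets the alignment bridge unmatched paragraphs more cheaply, while raising `threshold` makes fewer pairs count as matches, which mirrors the effect of the two CLI options described earlier.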

Performance

Async Processing: Non-blocking I/O operations
Memory Efficient: Streaming file processing for large texts
Progress Tracking: Real-time progress indicators
Error Handling: Robust error handling with user-friendly messages
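
Non-blocking file input can be sketched as follows. The package lists aiofiles as a dependency, but this example deliberately swaps in the standard library's `asyncio.to_thread` so that it runs without extra installs. `read_text` and `load_both` are hypothetical names, and UTF-8 decoding is an assumption.

```python
import asyncio
from pathlib import Path


async def read_text(path: Path) -> str:
    """Read a file in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(path.read_text, encoding="utf-8")


async def load_both(path_a: Path, path_b: Path) -> tuple[str, str]:
    """Read both inputs concurrently and return their contents in order."""
    text_a, text_b = await asyncio.gather(read_text(path_a), read_text(path_b))
    return text_a, text_b
```

A caller would run this with `asyncio.run(load_both(Path("file1.txt"), Path("file2.txt")))`.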

Security

Input Validation: Comprehensive parameter validation
File Safety: Secure file operations with size limits
Text Sanitization: Removal of problematic characters
Error Isolation: Graceful error handling without data exposure
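
A size limit and a sanitization pass might look like the sketch below. Everything here is illustrative: the 10 MiB limit, the choice to strip Unicode control characters except newline and tab, and the names `check_file_size` and `sanitize_text` are assumptions, not taken from the package.

```python
import unicodedata
from pathlib import Path

MAX_FILE_BYTES = 10 * 1024 * 1024  # hypothetical limit, not from the package


def check_file_size(path: Path, limit: int = MAX_FILE_BYTES) -> None:
    """Raise ValueError if the file is larger than the limit in bytes."""
    size = path.stat().st_size
    if size > limit:
        raise ValueError(f"{path} is {size} bytes; limit is {limit}")


def sanitize_text(text: str) -> str:
    """Drop control characters (Unicode category Cc) except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

Checking the size before reading avoids loading an oversized file into memory just to reject it.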

Contributing

Fork the repository
Create a feature branch
Make your changes with proper type annotations
Add tests for new functionality
Ensure code passes linting and type checking
Submit a pull request

License

MIT License. See the LICENSE file for details.

Release history

0.1.8: Jul 18, 2025
0.1.7: Jul 18, 2025
0.1.6: Jul 18, 2025
0.1.3: Jul 18, 2025
0.1.2 (this version): Jul 18, 2025
0.1.1: Jul 18, 2025
