
Semantic Comparer

Advanced text alignment and semantic containment analysis tool using modern Python practices.

Features

  • Semantic Alignment: Uses Smith-Waterman algorithm with sentence transformers for intelligent text comparison
  • Modern CLI: Built with Typer and Rich for a polished, user-friendly interface
  • Async Processing: High-performance asynchronous operations
  • File Support: Direct text input or file-based processing
  • Rich Output: Colorized, formatted results with detailed statistics
  • Type Safety: Full type annotations and modern Python practices

Installation

# Install dependencies
uv add rich typer aiofiles

# Install the package
uv pip install -e .

# Or run directly as a module
python -m semantic_comparer compare "text1" "text2"

Usage

Basic Comparison

# Compare two texts directly
python -m semantic_comparer compare "This is the first text." "This is the second text."

# Compare with custom parameters
python -m semantic_comparer compare \
  "First text content" \
  "Second text content" \
  --model paraphrase-multilingual-MiniLM-L12-v2 \
  --gap-penalty 0.3 \
  --similarity-threshold 0.5

File-based Comparison

# Compare text files (prefix with @)
python -m semantic_comparer compare @file1.txt @file2.txt

# Mix direct text and file
python -m semantic_comparer compare "Direct text" @file.txt
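
The @ prefix convention above can be resolved with a small helper. The sketch below is illustrative only (the function name is hypothetical, not part of the package's API), but it captures the idea: arguments starting with @ are treated as file paths, everything else as literal text.

```python
from pathlib import Path

def resolve_argument(arg: str) -> str:
    """Return the text itself, or the file contents when prefixed with '@'.

    Hypothetical helper illustrating the '@file' convention; the package's
    internal implementation may differ (e.g. it reads files asynchronously).
    """
    if arg.startswith("@"):
        return Path(arg[1:]).read_text(encoding="utf-8")
    return arg
```
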

Output Options

# Quiet mode (summary only)
python -m semantic_comparer compare text1 text2 --quiet

# Save detailed results to JSON
python -m semantic_comparer compare text1 text2 --output results.json

Command Line Options

Option                  Short  Description                                Default
--model                 -m     Sentence transformer model                 paraphrase-multilingual-MiniLM-L12-v2
--gap-penalty           -g     Penalty for gaps (0.0-1.0)                 0.3
--similarity-threshold  -t     Minimum similarity for matches (0.0-1.0)   0.5
--output                -o     Output file for JSON results               None
--quiet                 -q     Suppress detailed output                   False

Understanding the Results

Alignment Types

  • ✓ Match: Paragraphs that are semantically similar
  • ⚠ Only in A/B: Paragraphs present in one text but not the other
  • ✗ Unaligned: Paragraphs that couldn't be matched

Containment Score

The semantic containment score measures how much of text A's semantic content is found in text B:

  • 0.0-0.4: Low similarity (Red)
  • 0.4-0.7: Moderate similarity (Yellow)
  • 0.7-1.0: High similarity (Green)
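
The color bands above map directly onto two threshold checks. A minimal sketch (the function name is illustrative, and the handling of the exact boundary values 0.4 and 0.7 is an assumption, since the ranges above overlap at those points):

```python
def containment_band(score: float) -> str:
    """Classify a containment score into the color bands shown in the output."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score < 0.4:
        return "low"       # shown in red
    if score < 0.7:
        return "moderate"  # shown in yellow
    return "high"          # shown in green
```
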

Advanced Usage

Custom Models

# Use a different sentence transformer model
python -m semantic_comparer compare text1 text2 --model all-mpnet-base-v2

Fine-tuning Parameters

# Stricter matching (higher threshold)
python -m semantic_comparer compare text1 text2 --similarity-threshold 0.8

# More lenient gap handling (lower penalty)
python -m semantic_comparer compare text1 text2 --gap-penalty 0.1

Development

Project Structure

semantic_comparer/
├── __init__.py          # Package initialization
├── core.py              # Core alignment logic
├── cli.py               # Command-line interface
└── utils.py             # Utility functions

Running Tests

# Install dev dependencies
uv add --dev pytest pytest-asyncio black isort mypy ruff

# Run tests
pytest

# Format code
black .
isort .

# Type checking
mypy .

# Linting
ruff check .

Technical Details

Algorithm

The tool uses the Smith-Waterman algorithm adapted for semantic similarity:

  1. Text Segmentation: Split texts into paragraphs
  2. Embedding Generation: Convert paragraphs to semantic vectors
  3. Similarity Calculation: Compute cosine similarity between vectors
  4. Dynamic Programming: Apply Smith-Waterman for optimal alignment
  5. Score Calculation: Weighted containment score based on matches
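
Step 4 can be sketched in a few lines, assuming a precomputed paragraph-similarity matrix from steps 1-3. This is a generic Smith-Waterman sketch, not the package's exact scoring; in particular, centering the match reward on `threshold` is an illustrative choice.

```python
def smith_waterman_score(sim, gap_penalty=0.3, threshold=0.5):
    """Best local-alignment score over a paragraph similarity matrix.

    sim[i][j] is the cosine similarity between paragraph i of text A and
    paragraph j of text B. Pairs scoring above `threshold` reward the
    alignment; gaps cost `gap_penalty`.
    """
    n, m = len(sim), len(sim[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                                # restart (local alignment)
                H[i - 1][j - 1] + (sim[i - 1][j - 1] - threshold),  # match / mismatch
                H[i - 1][j] - gap_penalty,                          # paragraph only in A
                H[i][j - 1] - gap_penalty,                          # paragraph only in B
            )
            best = max(best, H[i][j])
    return best
```

For two identical three-paragraph texts the similarity matrix is close to the identity, and the score accumulates positive match rewards along the diagonal.
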

Performance

  • Async Processing: Non-blocking I/O operations
  • Memory Efficient: Streaming file processing for large texts
  • Progress Tracking: Real-time progress indicators
  • Error Handling: Robust error handling with user-friendly messages
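
The non-blocking I/O idea can be illustrated with plain asyncio (the package itself lists aiofiles as a dependency; the helper below and its name are a generic sketch, not the package's loader):

```python
import asyncio
from pathlib import Path

async def read_texts(paths):
    """Read multiple files concurrently without blocking the event loop."""
    return await asyncio.gather(
        *(asyncio.to_thread(Path(p).read_text, encoding="utf-8") for p in paths)
    )
```
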

Security

  • Input Validation: Comprehensive parameter validation
  • File Safety: Secure file operations with size limits
  • Text Sanitization: Removal of problematic characters
  • Error Isolation: Graceful error handling without data exposure
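
A sanitizer of the kind described might drop control characters while keeping ordinary whitespace. The sketch below is a hypothetical illustration, not the package's implementation:

```python
def sanitize_text(text: str) -> str:
    """Drop non-printable control characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
```
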

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with proper type annotations
  4. Add tests for new functionality
  5. Ensure code passes linting and type checking
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

