
SimilarityText

PyPI version Python 3.7+ License: MIT

Advanced text similarity and classification using AI and Machine Learning

SimilarityText is a powerful Python library that leverages state-of-the-art AI and traditional NLP techniques to measure semantic similarity between texts and classify documents. With support for transformer models, machine learning classifiers, and 50+ languages, it's the perfect tool for modern NLP applications.


🌟 Key Features

🎯 Text Similarity

  • Classic TF-IDF: Fast and efficient lexical similarity
  • Neural Transformers: State-of-the-art semantic understanding using BERT-based models
  • Cross-lingual: Compare texts across different languages
  • Auto-method Selection: Automatically chooses the best available method

🏷️ Text Classification

  • Word Frequency: Simple baseline method
  • Machine Learning: SVM and Naive Bayes classifiers with TF-IDF features
  • Deep Learning: Transformer-based classification for maximum accuracy
  • Confidence Scores: Get prediction probabilities for all methods

🌍 Multilingual Support

  • 50+ languages supported out of the box
  • 17 languages with advanced stemming (Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish)
  • Automatic language detection with graceful fallbacks
  • Cross-lingual transformers for multilingual tasks

🚀 Easy to Use

  • Simple, intuitive API
  • sklearn-compatible interface (predict, predict_proba)
  • Extensive documentation and examples
  • Backward compatible (v0.2.0 code still works)

📦 Installation

Basic Installation

pip install SimilarityText

This installs the core library with TF-IDF and ML classification support.

Advanced Installation (with Transformers)

For state-of-the-art neural network support:

pip install SimilarityText[transformers]

Or install dependencies separately:

pip install sentence-transformers torch transformers

From Source

git clone https://github.com/fabiocax/SimilarityText.git
cd SimilarityText
pip install -e .

🚀 Quick Start

Text Similarity

from similarity import Similarity

# Initialize (downloads required NLTK data on first run)
sim = Similarity()

# Calculate similarity between two texts
score = sim.similarity(
    'The cat is sleeping on the couch',
    'A feline is resting on the sofa'
)
print(f"Similarity: {score:.2f}")  # Output: ~0.75

Text Classification

from similarity import Classification

# Prepare training data
training_data = [
    {"class": "positive", "word": "I love this product! Amazing quality."},
    {"class": "positive", "word": "Excellent service, highly recommend!"},
    {"class": "negative", "word": "Terrible experience, very disappointed."},
    {"class": "negative", "word": "Poor quality, waste of money."},
]

# Train classifier
classifier = Classification(use_ml=True)  # Use ML for better accuracy
classifier.learning(training_data)

# Classify new text
text = "This is absolutely wonderful! Best purchase ever."
predicted_class, confidence = classifier.calculate_score(
    text,
    return_confidence=True
)
print(f"Class: {predicted_class}, Confidence: {confidence:.2f}")
# Output: Class: positive, Confidence: 0.89

📚 Comprehensive Guide

Similarity Methods

1. TF-IDF Method (Default - Fast)

from similarity import Similarity

sim = Similarity(
    language='english',      # Target language
    langdetect=False,        # Disable automatic language detection
    quiet=True               # Suppress output
)

# Compare texts
score = sim.similarity('Python programming', 'Java programming')
print(f"TF-IDF Score: {score:.4f}")

Best for: Quick comparisons, large-scale batch processing, production systems with latency constraints
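Under the hood, TF-IDF similarity is the cosine between TF-IDF weighted term vectors. A minimal stdlib sketch of the idea follows; `tfidf_cosine` is an illustrative helper, not the library's actual implementation (which builds on NLTK preprocessing):

```python
import math
from collections import Counter

def tfidf_cosine(text_a, text_b):
    """Cosine similarity of TF-IDF vectors over a two-document corpus."""
    docs = [text_a.lower().split(), text_b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    # IDF: terms shared by both texts are down-weighted
    idf = {t: math.log(2 / sum(t in d for d in docs)) + 1 for t in vocab}
    def vec(doc):
        tf = Counter(doc)
        return {t: tf[t] * idf[t] for t in vocab}
    va, vb = vec(docs[0]), vec(docs[1])
    dot = sum(va[t] * vb[t] for t in vocab)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(f"{tfidf_cosine('Python programming', 'Java programming'):.4f}")
```

Identical texts score 1.0 and texts with no shared terms score 0.0, which is exactly why TF-IDF is fast but blind to synonyms ("feline" vs. "cat").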

2. Transformer Method (Most Accurate)

from similarity import Similarity

sim = Similarity(
    use_transformers=True,
    model_name='paraphrase-multilingual-MiniLM-L12-v2'  # Default model
)

# Compare texts with deep semantic understanding
score = sim.similarity(
    'The quick brown fox jumps over the lazy dog',
    'A fast auburn fox leaps above an idle canine'
)
print(f"Transformer Score: {score:.4f}")

# Cross-lingual comparison
score = sim.similarity(
    'I love artificial intelligence',
    'Eu amo inteligência artificial'  # Portuguese
)
print(f"Cross-lingual Score: {score:.4f}")

Best for: Semantic understanding, cross-lingual tasks, when accuracy is critical
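Transformer-based similarity is typically the cosine between the two sentence embeddings, so texts in different languages score high when the model maps their meanings to nearby vectors. A stdlib sketch of that final step, using tiny made-up vectors in place of real model output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 3-d "embeddings" of a sentence and its translation;
# real models produce vectors with hundreds of dimensions.
emb_en = [0.90, 0.10, 0.20]
emb_pt = [0.88, 0.12, 0.20]
print(f"{cosine_similarity(emb_en, emb_pt):.4f}")
```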

3. Method Selection

sim = Similarity(use_transformers=True)

# Auto: Uses transformers if available, falls back to TF-IDF
score = sim.similarity(text1, text2, method='auto')

# Force TF-IDF
score = sim.similarity(text1, text2, method='tfidf')

# Force transformers
score = sim.similarity(text1, text2, method='transformer')

Similarity Parameters

Similarity(
    update=True,              # Download NLTK data on initialization
    language='english',       # Default language for processing
    langdetect=False,         # Enable automatic language detection
    nltk_downloads=[],        # Additional NLTK packages to download
    quiet=True,              # Suppress informational messages
    use_transformers=False,   # Enable transformer models
    model_name='paraphrase-multilingual-MiniLM-L12-v2'  # Transformer model
)
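The langdetect option enables automatic detection with graceful fallbacks: short or undetectable texts fall back to the configured language. The behavior can be pictured as follows; `detect_language`, `detector`, and `min_length` are illustrative names for this sketch, not the library's API:

```python
def detect_language(text, detector=None, default='english', min_length=20):
    """Return a detected language name, falling back to `default`
    when the text is too short or detection raises an error."""
    if detector is None or len(text.strip()) < min_length:
        return default
    try:
        return detector(text)
    except Exception:
        return default

print(detect_language("hi"))  # too short to detect reliably -> 'english'
```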

Classification Methods

1. Word Frequency Method (Baseline)

from similarity import Classification

classifier = Classification(
    language='english',
    use_ml=False  # Disable ML, use word frequency
)

classifier.learning(training_data)
predicted_class = classifier.calculate_score("Sample text")

Best for: Simple categorization, learning the basics, baseline comparisons
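Conceptually, the word-frequency baseline scores each class by how often the sentence's words appear in that class's training examples, then picks the highest-scoring class. A stdlib sketch of the idea; `train` and `classify` are illustrative names, not the library's API:

```python
from collections import Counter, defaultdict

def train(training_data):
    # Count word occurrences per class, mirroring the {"class", "word"} format.
    counts = defaultdict(Counter)
    for example in training_data:
        counts[example["class"]].update(example["word"].lower().split())
    return counts

def classify(counts, sentence):
    words = sentence.lower().split()
    scores = {cls: sum(cnt[w] for w in words) for cls, cnt in counts.items()}
    return max(scores, key=scores.get)

model = train([
    {"class": "positive", "word": "great excellent love"},
    {"class": "negative", "word": "terrible awful waste"},
])
print(classify(model, "what a great product"))  # -> positive
```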

2. Machine Learning Method (Recommended)

classifier = Classification(
    language='english',
    use_ml=True  # Enable SVM/Naive Bayes
)

classifier.learning(training_data)

# Get prediction with confidence
predicted_class, confidence = classifier.calculate_score(
    "Sample text",
    return_confidence=True
)

# sklearn-like interface
predicted = classifier.predict("Sample text")
probabilities = classifier.predict_proba("Sample text")
print(f"Probabilities: {probabilities}")

Best for: Production systems, when you have training data, balanced accuracy/speed
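The ML method pairs TF-IDF features with an SVM or Naive Bayes classifier. In scikit-learn terms, the general approach looks roughly like this (a sketch, not SimilarityText's exact pipeline or hyperparameters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["love it great quality", "excellent amazing service",
         "terrible awful experience", "bad waste of money"]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features + Naive Bayes, exposing predict / predict_proba
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["amazing quality, love it"])[0])       # -> positive
print(model.predict_proba(["amazing quality, love it"])[0])  # class probabilities
```

Because the pipeline exposes predict and predict_proba, the sklearn-like interface described above comes essentially for free.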

3. Transformer Method (Highest Accuracy)

classifier = Classification(
    language='english',
    use_transformers=True,
    model_name='paraphrase-multilingual-MiniLM-L12-v2'
)

classifier.learning(training_data)
predicted_class, confidence = classifier.calculate_score(
    "Sample text",
    return_confidence=True
)

Best for: Maximum accuracy, semantic understanding, sufficient compute resources

Classification Parameters

Classification(
    language='english',      # Language for text processing
    use_ml=True,            # Enable ML classifiers (SVM/Naive Bayes)
    use_transformers=False, # Enable transformer-based classification
    model_name='paraphrase-multilingual-MiniLM-L12-v2'  # Model name
)

🎯 Complete Examples

Example 1: Semantic Similarity Comparison

from similarity import Similarity

# Initialize both methods
sim_classic = Similarity()
sim_neural = Similarity(use_transformers=True)

# Test pairs
pairs = [
    ("The car is red", "The automobile is crimson"),
    ("Python is a programming language", "Java is used for coding"),
    ("I love machine learning", "Machine learning is fascinating"),
]

print("Method Comparison:")
print("-" * 60)
for text1, text2 in pairs:
    score_tfidf = sim_classic.similarity(text1, text2)
    score_neural = sim_neural.similarity(text1, text2)

    print(f"\nText A: {text1}")
    print(f"Text B: {text2}")
    print(f"TF-IDF:      {score_tfidf:.4f}")
    print(f"Transformer: {score_neural:.4f}")
    print(f"Difference:  {abs(score_neural - score_tfidf):.4f}")

Example 2: Sentiment Analysis

from similarity import Classification

# Training data
training_data = [
    {"class": "positive", "word": "excellent product quality amazing"},
    {"class": "positive", "word": "love it best purchase ever"},
    {"class": "positive", "word": "highly recommend great service"},
    {"class": "negative", "word": "terrible waste of money disappointed"},
    {"class": "negative", "word": "poor quality broke immediately"},
    {"class": "negative", "word": "awful experience never again"},
    {"class": "neutral", "word": "okay average nothing special"},
    {"class": "neutral", "word": "it works as expected"},
]

# Train classifier
classifier = Classification(use_ml=True)
classifier.learning(training_data)

# Test reviews
reviews = [
    "This is the best thing I've ever bought!",
    "Complete disaster, total waste of money.",
    "It's fine, does what it says.",
    "Absolutely fantastic, exceeded expectations!",
]

print("Sentiment Analysis Results:")
print("-" * 60)
for review in reviews:
    sentiment, confidence = classifier.calculate_score(
        review,
        return_confidence=True
    )
    print(f"\nReview: {review}")
    print(f"Sentiment: {sentiment.upper()}")
    print(f"Confidence: {confidence:.2f}")

Example 3: Multilingual Document Classification

from similarity import Classification

# Multilingual training data
training_data = [
    {"class": "technology", "word": "artificial intelligence machine learning"},
    {"class": "technology", "word": "inteligência artificial aprendizado de máquina"},
    {"class": "technology", "word": "intelligence artificielle apprentissage automatique"},
    {"class": "sports", "word": "football soccer championship tournament"},
    {"class": "sports", "word": "futebol campeonato torneio"},
    {"class": "sports", "word": "football championnat tournoi"},
]

# Use transformer for multilingual understanding
classifier = Classification(use_transformers=True)
classifier.learning(training_data)

# Test in different languages
test_texts = [
    "Deep learning neural networks are fascinating",  # English
    "O campeonato de futebol foi emocionante",       # Portuguese
    "L'intelligence artificielle change le monde",    # French
]

print("Multilingual Classification:")
print("-" * 60)
for text in test_texts:
    category, confidence = classifier.calculate_score(
        text,
        return_confidence=True
    )
    print(f"\nText: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2f}")

🔬 Performance Comparison

Similarity Methods

| Method | Speed | Accuracy | Cross-lingual | Memory | Best Use Case |
|--------|-------|----------|---------------|--------|---------------|
| TF-IDF | ⚡⚡⚡ Very Fast | ⭐⭐⭐ Good | ❌ No | Low | Quick comparisons, batch processing |
| Transformers | ⚡ Slow | ⭐⭐⭐⭐⭐ Excellent | ✅ Yes | High | Semantic understanding, cross-lingual |

Classification Methods

| Method | Speed | Accuracy | Training Time | Memory | Best Use Case |
|--------|-------|----------|---------------|--------|---------------|
| Word Frequency | ⚡⚡⚡ Very Fast | ⭐⭐ Fair | Instant | Very Low | Baseline, simple tasks |
| ML (SVM) | ⚡⚡ Fast | ⭐⭐⭐⭐ Very Good | Fast | Low | Production systems |
| Transformers | ⚡ Slow | ⭐⭐⭐⭐⭐ Excellent | Medium | High | Maximum accuracy |

Benchmark Results

Tested on Intel i7, 16GB RAM, using 1000 text pairs:

Similarity Benchmarks:
├── TF-IDF:       0.05s (20,000 pairs/sec)
├── Transformers: 2.30s (435 pairs/sec)

Classification Benchmarks (100 documents):
├── Word Frequency: 0.02s
├── ML (SVM):      0.15s
├── Transformers:  1.80s
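Figures like these depend heavily on hardware and model load time, so it is worth reproducing them locally. A small harness for doing so; `time_pairs` is an illustrative helper, not part of the library:

```python
import time

def time_pairs(score_fn, pairs):
    """Time score_fn over (text_a, text_b) pairs;
    return (elapsed_seconds, pairs_per_second)."""
    start = time.perf_counter()
    for a, b in pairs:
        score_fn(a, b)
    elapsed = time.perf_counter() - start
    return elapsed, (len(pairs) / elapsed if elapsed > 0 else float("inf"))

# Trivial stand-in scorer; swap in sim.similarity (with method='tfidf'
# or method='transformer') to benchmark the real methods.
pairs = [("alpha beta", "beta gamma")] * 1000
elapsed, rate = time_pairs(lambda a, b: len(set(a.split()) & set(b.split())), pairs)
print(f"{elapsed:.4f}s ({rate:,.0f} pairs/sec)")
```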

📖 Available Transformer Models

Recommended Models

| Model | Size | Speed | Languages | Best For |
|-------|------|-------|-----------|----------|
| paraphrase-multilingual-MiniLM-L12-v2 | 418MB | Fast | 50+ | General purpose (default) |
| all-MiniLM-L6-v2 | 80MB | Very Fast | EN | English-only, speed critical |
| paraphrase-mpnet-base-v2 | 420MB | Medium | EN | English, highest accuracy |
| distiluse-base-multilingual-cased-v2 | 480MB | Medium | 50+ | Multilingual, good balance |
| all-mpnet-base-v2 | 420MB | Medium | EN | English, semantic search |

Usage

# Use a specific model
sim = Similarity(
    use_transformers=True,
    model_name='all-MiniLM-L6-v2'  # Fast English model
)

Browse all models: https://www.sbert.net/docs/pretrained_models.html


🌐 Supported Languages

Full language support includes:

European: English, Portuguese, Spanish, French, German, Italian, Dutch, Russian, Polish, Romanian, Hungarian, Czech, Swedish, Danish, Finnish, Norwegian, Turkish, Greek

Asian: Chinese, Japanese, Korean, Arabic, Hebrew, Thai, Vietnamese, Indonesian

Others: Hindi, Bengali, Tamil, Urdu, Persian, and 30+ more

Advanced stemming available for: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish


📊 API Reference

See API.md for complete API documentation.

Similarity Class

class Similarity:
    def __init__(self, update=True, language='english', langdetect=False,
                 nltk_downloads=[], quiet=True, use_transformers=False,
                 model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        """Initialize Similarity analyzer"""

    def similarity(self, text_a, text_b, method='auto'):
        """Calculate similarity between two texts (returns float 0.0-1.0)"""

    def detectlang(self, text):
        """Detect language of text (returns language name)"""

Classification Class

class Classification:
    def __init__(self, language='english', use_ml=True, use_transformers=False,
                 model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        """Initialize classifier"""

    def learning(self, training_data):
        """Train classifier with list of {"class": str, "word": str} dicts"""

    def calculate_score(self, sentence, return_confidence=False):
        """Classify sentence, optionally return confidence"""

    def predict(self, sentence):
        """Predict class (sklearn-compatible)"""

    def predict_proba(self, sentence):
        """Get class probabilities (sklearn-compatible)"""

🆕 What's New in v0.3.0

🎯 Major Features

  • Transformer support: State-of-the-art neural models via sentence-transformers
  • 🧠 ML classifiers: SVM and Naive Bayes with TF-IDF
  • 🌍 Better multilingual: Improved language handling with 17 stemmers
  • 📊 Confidence scores: Get prediction probabilities
  • 🔧 Flexible API: sklearn-like interface with predict() and predict_proba()

🐛 Critical Bug Fixes

  • Fixed typo: requeriments.txt → requirements.txt
  • Fixed RSLPStemmer being used for all languages (now language-aware)
  • Fixed crashes when stopwords unavailable for languages
  • Fixed language detection failures on short texts
  • Fixed exception messages for better debugging
  • Added punkt_tab to NLTK downloads for compatibility

🔄 Backwards Compatibility

All v0.2.0 code continues to work without modifications. New features are opt-in.

See CHANGELOG.md for complete version history.


📝 Examples

Explore the example/ directory:

  • example.py: Basic TF-IDF similarity examples
  • exemplo2.py: Classification examples
  • example_advanced.py: Advanced AI features with transformers and comparisons

Run examples:

python example/example.py
python example/example_advanced.py

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Quick Start for Contributors

# Clone repository
git clone https://github.com/fabiocax/SimilarityText.git
cd SimilarityText

# Install in development mode
pip install -e .[transformers]

# Run examples
python example/example_advanced.py

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


👤 Author

Fabio Alberti



🙏 Acknowledgments

  • sentence-transformers: For providing excellent pre-trained models
  • scikit-learn: For robust ML algorithms
  • NLTK: For comprehensive NLP tools
  • All contributors and users of this library

⭐ Star History

If you find this project useful, please consider giving it a star on GitHub!



📈 Roadmap

  • Add more pre-trained models
  • Batch processing API
  • GPU acceleration support
  • REST API server
  • Caching mechanisms
  • More language-specific optimizations
  • Integration with popular frameworks (FastAPI, Flask)

Made with ❤️ using Python and AI

