SimilarityText
Advanced text similarity and classification using AI and Machine Learning
SimilarityText is a powerful Python library that leverages state-of-the-art AI and traditional NLP techniques to measure semantic similarity between texts and classify documents. With support for transformer models, machine learning classifiers, and 50+ languages, it's the perfect tool for modern NLP applications.
🌟 Key Features
🎯 Text Similarity
- Classic TF-IDF: Fast and efficient lexical similarity
- Neural Transformers: State-of-the-art semantic understanding using BERT-based models
- Cross-lingual: Compare texts across different languages
- Auto-method Selection: Automatically chooses the best available method
🏷️ Text Classification
- Word Frequency: Simple baseline method
- Machine Learning: SVM and Naive Bayes classifiers with TF-IDF features
- Deep Learning: Transformer-based classification for maximum accuracy
- Confidence Scores: Get prediction probabilities for all methods
🌍 Multilingual Support
- 50+ languages supported out of the box
- 17 languages with advanced stemming (Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish)
- Automatic language detection with graceful fallbacks
- Cross-lingual transformers for multilingual tasks
🚀 Easy to Use
- Simple, intuitive API
- sklearn-compatible interface (predict, predict_proba)
- Extensive documentation and examples
- Backward compatible (v0.2.0 code still works)
📦 Installation
Basic Installation
pip install SimilarityText
This installs the core library with TF-IDF and ML classification support.
Advanced Installation (with Transformers)
For state-of-the-art neural network support:
pip install "SimilarityText[transformers]"
Or install dependencies separately:
pip install sentence-transformers torch transformers
From Source
git clone https://github.com/fabiocax/SimilarityText.git
cd SimilarityText
pip install -e .
🚀 Quick Start
Text Similarity
from similarity import Similarity
# Initialize (downloads required NLTK data on first run)
sim = Similarity()
# Calculate similarity between two texts
score = sim.similarity(
'The cat is sleeping on the couch',
'A feline is resting on the sofa'
)
print(f"Similarity: {score:.2f}")  # Example output: ~0.75 (varies with method and model)
Text Classification
from similarity import Classification
# Prepare training data
training_data = [
{"class": "positive", "word": "I love this product! Amazing quality."},
{"class": "positive", "word": "Excellent service, highly recommend!"},
{"class": "negative", "word": "Terrible experience, very disappointed."},
{"class": "negative", "word": "Poor quality, waste of money."},
]
# Train classifier
classifier = Classification(use_ml=True) # Use ML for better accuracy
classifier.learning(training_data)
# Classify new text
text = "This is absolutely wonderful! Best purchase ever."
predicted_class, confidence = classifier.calculate_score(
text,
return_confidence=True
)
print(f"Class: {predicted_class}, Confidence: {confidence:.2f}")
# Example output: Class: positive, Confidence: 0.89 (exact confidence will vary)
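Labelled data often arrives as (text, label) pairs rather than in the list-of-dicts format shown above. The helper below is a minimal sketch of a converter; `to_training_data` is a hypothetical name used for illustration and is not part of SimilarityText.

```python
def to_training_data(samples):
    """Convert (text, label) pairs into [{"class": ..., "word": ...}, ...].

    Hypothetical helper, not part of SimilarityText. It only reshapes data
    into the dict format that the Quick Start passes to learning().
    """
    training_data = []
    for text, label in samples:
        # Reject empty inputs early so they do not silently skew training.
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"empty or non-string text for label {label!r}")
        training_data.append({"class": str(label), "word": text.strip()})
    return training_data


samples = [
    ("Excellent service, highly recommend!", "positive"),
    ("Poor quality, waste of money.", "negative"),
]
training_data = to_training_data(samples)
```

The resulting list can then be passed to `classifier.learning(training_data)` as in the example above.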
📚 Comprehensive Guide
Similarity Methods
1. TF-IDF Method (Default - Fast)
from similarity import Similarity
sim = Similarity(
language='english', # Target language
langdetect=False, # Set to True to auto-detect the language
quiet=True # Suppress output
)
# Compare texts
score = sim.similarity('Python programming', 'Java programming')
print(f"TF-IDF Score: {score:.4f}")
Best for: Quick comparisons, large-scale batch processing, production systems with latency constraints
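To make "lexical similarity" concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the library's implementation: the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for illustration, and it skips the stemming and stop-word handling a real pipeline would apply.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_similarity(text_a, text_b):
    """Score two texts by the overlap of their TF-IDF weighted terms."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    return cosine(vec_a, vec_b)
```

Because the score depends only on shared terms, two paraphrases with no words in common score 0.0 under this sketch, which is the gap the transformer method is meant to close.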
2. Transformer Method (Most Accurate)
from similarity import Similarity
sim = Similarity(
use_transformers=True,
model_name='paraphrase-multilingual-MiniLM-L12-v2' # Default model
)
# Compare texts with deep semantic understanding
score = sim.similarity(
'The quick brown fox jumps over the lazy dog',
'A fast auburn fox leaps above an idle canine'
)
print(f"Transformer Score: {score:.4f}")
# Cross-lingual comparison
score = sim.similarity(
'I love artificial intelligence',
'Eu amo inteligência artificial' # Portuguese
)
print(f"Cross-lingual Score: {score:.4f}")
Best for: Semantic understanding, cross-lingual tasks, when accuracy is critical
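Embedding-based similarity is commonly computed as the cosine between two dense sentence vectors. The sketch below shows only that final step, on made-up 4-dimensional vectors; a real model produces hundreds of dimensions. Clamping the result to the 0.0 to 1.0 range is an assumption made to match the documented return range, not a confirmed detail of the library.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def clamp_unit(score):
    """Clamp a raw cosine (which may be negative) into [0.0, 1.0]."""
    return max(0.0, min(1.0, score))


# Made-up vectors for illustration only; they are not real model output.
english = [0.8, 0.1, 0.3, 0.5]
portuguese = [0.7, 0.2, 0.3, 0.6]
unrelated = [-0.5, 0.9, -0.2, 0.0]

close = clamp_unit(cosine_similarity(english, portuguese))
far = clamp_unit(cosine_similarity(english, unrelated))
```

A multilingual model aims to place translations near each other in this vector space, which is why the cross-lingual comparison above can work without any shared vocabulary.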
3. Method Selection
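from similarity import Similarity
sim = Similarity(use_transformers=True)
# Sample inputs so the snippet runs on its own
text1 = 'The cat is sleeping on the couch'
text2 = 'A feline is resting on the sofa'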
# Auto: Uses transformers if available, falls back to TF-IDF
score = sim.similarity(text1, text2, method='auto')
# Force TF-IDF
score = sim.similarity(text1, text2, method='tfidf')
# Force transformers
score = sim.similarity(text1, text2, method='transformer')
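The fallback behaviour described for method='auto' can be sketched as a small resolver. This is an illustration of the idea, not the library's code: `resolve_method` and `transformers_installed` are hypothetical names, and raising an error when the transformer method is forced but unavailable is an assumed behaviour.

```python
import importlib.util


def transformers_installed():
    """True if the sentence-transformers package can be found for import."""
    return importlib.util.find_spec("sentence_transformers") is not None


def resolve_method(method="auto", use_transformers=False, available=None):
    """Map a requested method to the one that would actually run.

    `available` can be passed explicitly (handy for tests); by default it is
    detected from the current environment.
    """
    if available is None:
        available = transformers_installed()
    if method == "tfidf":
        return "tfidf"
    if method == "transformer":
        if not available:
            raise RuntimeError(
                "transformer method requested but sentence-transformers is not installed"
            )
        return "transformer"
    if method == "auto":
        # Prefer transformers only when enabled and installed; else fall back.
        return "transformer" if (use_transformers and available) else "tfidf"
    raise ValueError(f"unknown method: {method!r}")
```

Checking availability once and passing it in keeps the decision easy to test without installing the optional dependencies.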
Similarity Parameters
Similarity(
update=True, # Download NLTK data on initialization
language='english', # Default language for processing
langdetect=False, # Enable automatic language detection
nltk_downloads=[], # Additional NLTK packages to download
quiet=True, # Suppress informational messages
use_transformers=False, # Enable transformer models
model_name='paraphrase-multilingual-MiniLM-L12-v2' # Transformer model
)
Classification Methods
1. Word Frequency Method (Baseline)
from similarity import Classification
classifier = Classification(
language='english',
use_ml=False # Disable ML, use word frequency
)
classifier.learning(training_data)
predicted_class = classifier.calculate_score("Sample text")
Best for: Simple categorization, interpretable results, baseline comparisons
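As a rough picture of what a word-frequency baseline does, the sketch below counts how often each training word appears per class and scores a sentence by summing those counts. It is an assumption about the general technique, not the library's actual algorithm; the function names are hypothetical and the tokenizer is a plain whitespace split.

```python
from collections import Counter, defaultdict


def train_word_frequency(training_data):
    """Build per-class word counts from [{"class": ..., "word": ...}, ...]."""
    counts = defaultdict(Counter)
    for item in training_data:
        counts[item["class"]].update(item["word"].lower().split())
    return counts


def classify_word_frequency(counts, sentence):
    """Return (best_class, scores), scoring each class by matched word counts."""
    tokens = sentence.lower().split()
    scores = {
        label: sum(freq[token] for token in tokens)
        for label, freq in counts.items()
    }
    # On a tie, max() keeps the first class seen during training.
    best = max(scores, key=scores.get)
    return best, scores


training_data = [
    {"class": "positive", "word": "excellent product quality amazing"},
    {"class": "negative", "word": "terrible waste of money disappointed"},
]
counts = train_word_frequency(training_data)
label, scores = classify_word_frequency(counts, "amazing quality")
```

Every score traces back to specific matched words, which makes this kind of baseline easy to inspect even though it ignores word order and meaning.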
2. Machine Learning Method (Recommended)
classifier = Classification(
language='english',
use_ml=True # Enable SVM/Naive Bayes
)
classifier.learning(training_data)
# Get prediction with confidence
predicted_class, confidence = classifier.calculate_score(
"Sample text",
return_confidence=True
)
# sklearn-like interface
predicted = classifier.predict("Sample text")
probabilities = classifier.predict_proba("Sample text")
print(f"Probabilities: {probabilities}")
Best for: Production systems, when you have training data, balanced accuracy/speed
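To show where class probabilities can come from, here is a small multinomial Naive Bayes with Laplace smoothing written in plain Python. It is a teaching sketch only: it works on raw word counts rather than TF-IDF features, `TinyNaiveBayes` is a hypothetical class, and it returns a dict of probabilities, which may differ from the shape the library's predict_proba returns.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, training_data):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for item in training_data:
            self.class_counts[item["class"]] += 1
            self.word_counts[item["class"]].update(item["word"].lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_proba(self, sentence):
        # Words never seen in training carry no evidence, so drop them.
        tokens = [t for t in sentence.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            log_p = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                log_p += math.log(
                    (self.word_counts[label][token] + 1)
                    / (total_words + len(self.vocab))
                )
            log_scores[label] = log_p
        # Normalize in log space, shifting by the max for numerical stability.
        peak = max(log_scores.values())
        exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def predict(self, sentence):
        proba = self.predict_proba(sentence)
        return max(proba, key=proba.get)


model = TinyNaiveBayes().fit([
    {"class": "positive", "word": "excellent product quality amazing"},
    {"class": "positive", "word": "love it best purchase ever"},
    {"class": "negative", "word": "terrible waste of money disappointed"},
    {"class": "negative", "word": "poor quality broke immediately"},
])
proba = model.predict_proba("amazing excellent")
```

With so little training data the probabilities are coarse; the point is only that the confidence value is a normalized score across classes.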
3. Transformer Method (Highest Accuracy)
classifier = Classification(
language='english',
use_transformers=True,
model_name='paraphrase-multilingual-MiniLM-L12-v2'
)
classifier.learning(training_data)
predicted_class, confidence = classifier.calculate_score(
"Sample text",
return_confidence=True
)
Best for: Maximum accuracy, semantic understanding, sufficient compute resources
Classification Parameters
Classification(
language='english', # Language for text processing
use_ml=True, # Enable ML classifiers (SVM/Naive Bayes)
use_transformers=False, # Enable transformer-based classification
model_name='paraphrase-multilingual-MiniLM-L12-v2' # Model name
)
🎯 Complete Examples
Example 1: Semantic Similarity Comparison
from similarity import Similarity
# Initialize both methods
sim_classic = Similarity()
sim_neural = Similarity(use_transformers=True)
# Test pairs
pairs = [
("The car is red", "The automobile is crimson"),
("Python is a programming language", "Java is used for coding"),
("I love machine learning", "Machine learning is fascinating"),
]
print("Method Comparison:")
print("-" * 60)
for text1, text2 in pairs:
    score_tfidf = sim_classic.similarity(text1, text2)
    score_neural = sim_neural.similarity(text1, text2)
    print(f"\nText A: {text1}")
    print(f"Text B: {text2}")
    print(f"TF-IDF: {score_tfidf:.4f}")
    print(f"Transformer: {score_neural:.4f}")
    print(f"Difference: {abs(score_neural - score_tfidf):.4f}")
Example 2: Sentiment Analysis
from similarity import Classification
# Training data
training_data = [
{"class": "positive", "word": "excellent product quality amazing"},
{"class": "positive", "word": "love it best purchase ever"},
{"class": "positive", "word": "highly recommend great service"},
{"class": "negative", "word": "terrible waste of money disappointed"},
{"class": "negative", "word": "poor quality broke immediately"},
{"class": "negative", "word": "awful experience never again"},
{"class": "neutral", "word": "okay average nothing special"},
{"class": "neutral", "word": "it works as expected"},
]
# Train classifier
classifier = Classification(use_ml=True)
classifier.learning(training_data)
# Test reviews
reviews = [
"This is the best thing I've ever bought!",
"Complete disaster, total waste of money.",
"It's fine, does what it says.",
"Absolutely fantastic, exceeded expectations!",
]
print("Sentiment Analysis Results:")
print("-" * 60)
for review in reviews:
    sentiment, confidence = classifier.calculate_score(
        review,
        return_confidence=True
    )
    print(f"\nReview: {review}")
    print(f"Sentiment: {sentiment.upper()}")
    print(f"Confidence: {confidence:.2f}")
Example 3: Multilingual Document Classification
from similarity import Classification
# Multilingual training data
training_data = [
{"class": "technology", "word": "artificial intelligence machine learning"},
{"class": "technology", "word": "inteligência artificial aprendizado de máquina"},
{"class": "technology", "word": "intelligence artificielle apprentissage automatique"},
{"class": "sports", "word": "football soccer championship tournament"},
{"class": "sports", "word": "futebol campeonato torneio"},
{"class": "sports", "word": "football championnat tournoi"},
]
# Use transformer for multilingual understanding
classifier = Classification(use_transformers=True)
classifier.learning(training_data)
# Test in different languages
test_texts = [
"Deep learning neural networks are fascinating", # English
"O campeonato de futebol foi emocionante", # Portuguese
"L'intelligence artificielle change le monde", # French
]
print("Multilingual Classification:")
print("-" * 60)
for text in test_texts:
    category, confidence = classifier.calculate_score(
        text,
        return_confidence=True
    )
    print(f"\nText: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.2f}")
🔬 Performance Comparison
Similarity Methods
| Method | Speed | Accuracy | Cross-lingual | Memory | Best Use Case |
|---|---|---|---|---|---|
| TF-IDF | ⚡⚡⚡ Very Fast | ⭐⭐⭐ Good | ❌ No | Low | Quick comparisons, batch processing |
| Transformers | ⚡ Slow | ⭐⭐⭐⭐⭐ Excellent | ✅ Yes | High | Semantic understanding, cross-lingual |
Classification Methods
| Method | Speed | Accuracy | Training Time | Memory | Best Use Case |
|---|---|---|---|---|---|
| Word Frequency | ⚡⚡⚡ Very Fast | ⭐⭐ Fair | Instant | Very Low | Baseline, simple tasks |
| ML (SVM) | ⚡⚡ Fast | ⭐⭐⭐⭐ Very Good | Fast | Low | Production systems |
| Transformers | ⚡ Slow | ⭐⭐⭐⭐⭐ Excellent | Medium | High | Maximum accuracy |
Benchmark Results
Tested on Intel i7, 16GB RAM, using 1000 text pairs:
Similarity Benchmarks:
├── TF-IDF: 0.05s (20,000 pairs/sec)
└── Transformers: 2.30s (435 pairs/sec)
Classification Benchmarks (100 documents):
├── Word Frequency: 0.02s
├── ML (SVM): 0.15s
└── Transformers: 1.80s
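Timings like these depend heavily on hardware and model, so it is worth measuring on your own data. The harness below is a minimal sketch using only the standard library; `benchmark` and `pairs_per_second` are hypothetical helpers, and `func` stands in for any two-argument scoring function such as a Similarity instance's similarity method.

```python
import time


def pairs_per_second(n_pairs, seconds):
    """Throughput for n_pairs processed in the given number of seconds."""
    return n_pairs / seconds


def benchmark(func, pairs, repeats=3):
    """Return (best elapsed seconds, pairs per second) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text_a, text_b in pairs:
            func(text_a, text_b)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    throughput = pairs_per_second(len(pairs), best) if best > 0 else float("inf")
    return best, throughput


# Stand-in scoring function so the harness can run without the library.
def dummy_score(text_a, text_b):
    return float(text_a == text_b)


elapsed, rate = benchmark(dummy_score, [("a", "b")] * 1000)
```

Taking the best of several runs reduces noise from one-off costs such as model loading or cache warm-up.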
📖 Available Transformer Models
Recommended Models
| Model | Size | Speed | Languages | Best For |
|---|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 418MB | Fast | 50+ | General purpose (default) |
| all-MiniLM-L6-v2 | 80MB | Very Fast | EN | English-only, speed critical |
| paraphrase-mpnet-base-v2 | 420MB | Medium | EN | English, highest accuracy |
| distiluse-base-multilingual-cased-v2 | 480MB | Medium | 50+ | Multilingual, good balance |
| all-mpnet-base-v2 | 420MB | Medium | EN | English, semantic search |
Usage
# Use a specific model
sim = Similarity(
use_transformers=True,
model_name='all-MiniLM-L6-v2' # Fast English model
)
Browse all models: https://www.sbert.net/docs/pretrained_models.html
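If the model needs to be chosen programmatically, the table of recommended models can be encoded as data. The sketch below picks the smallest listed model that satisfies a language and size constraint; `pick_model` is a hypothetical helper, and the sizes are copied from the table rather than measured.

```python
# (model name, size in MB, multilingual?) as listed under Recommended Models.
MODELS = [
    ("paraphrase-multilingual-MiniLM-L12-v2", 418, True),
    ("all-MiniLM-L6-v2", 80, False),
    ("paraphrase-mpnet-base-v2", 420, False),
    ("distiluse-base-multilingual-cased-v2", 480, True),
    ("all-mpnet-base-v2", 420, False),
]


def pick_model(multilingual, max_size_mb=None):
    """Return the smallest listed model meeting the constraints.

    When `multilingual` is False, multilingual models are still eligible
    because they can also process English text.
    """
    candidates = [
        (size, name)
        for name, size, multi in MODELS
        if (multi or not multilingual)
        and (max_size_mb is None or size <= max_size_mb)
    ]
    if not candidates:
        raise ValueError("no listed model satisfies the constraints")
    return min(candidates)[1]


english_model = pick_model(multilingual=False)
multilingual_model = pick_model(multilingual=True)
```

The returned name can be passed as `model_name` in the usage example above. Smallest size is only one possible criterion; accuracy needs may point to a larger model.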
🌐 Supported Languages
Full language support includes:
European: English, Portuguese, Spanish, French, German, Italian, Dutch, Russian, Polish, Romanian, Hungarian, Czech, Swedish, Danish, Finnish, Norwegian, Turkish, Greek
Asian: Chinese, Japanese, Korean, Arabic, Hebrew, Thai, Vietnamese, Indonesian
Others: Hindi, Bengali, Tamil, Urdu, Persian, and 30+ more
Advanced stemming available for: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
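One way to picture a graceful fallback for languages without a stemmer is to stem only when the language appears in the list above and otherwise pass tokens through unchanged. This is a sketch of that idea, not the library's code: the helper names are hypothetical, and `stem` is any callable you supply.

```python
# Languages named in the stemming list above, lowercased for lookup.
STEMMING_LANGUAGES = frozenset({
    "arabic", "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian",
    "spanish", "swedish", "turkish",
})


def stemming_language(language):
    """Return the normalized name if stemming is listed for it, else None."""
    name = language.strip().lower()
    return name if name in STEMMING_LANGUAGES else None


def preprocess(tokens, language, stem):
    """Apply `stem` per token when the language is listed; else keep tokens."""
    if stemming_language(language) is None:
        # Fallback: no stemmer for this language, so leave tokens untouched.
        return list(tokens)
    return [stem(token) for token in tokens]


# Toy stemmer for demonstration: strips a trailing "ning" from each token.
def toy_stem(token):
    return token[:-4] if token.endswith("ning") else token


stemmed = preprocess(["running", "cat"], "English", toy_stem)
untouched = preprocess(["running", "cat"], "Japanese", toy_stem)
```

Falling back to unstemmed tokens keeps processing from failing on unsupported languages, at the cost of weaker matching between word forms.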
📊 API Reference
See API.md for complete API documentation.
Similarity Class
class Similarity:
    def __init__(self, update=True, language='english', langdetect=False,
                 nltk_downloads=[], quiet=True, use_transformers=False,
                 model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        """Initialize Similarity analyzer"""
    def similarity(self, text_a, text_b, method='auto'):
        """Calculate similarity between two texts (returns float 0.0-1.0)"""
    def detectlang(self, text):
        """Detect language of text (returns language name)"""
Classification Class
class Classification:
    def __init__(self, language='english', use_ml=True, use_transformers=False,
                 model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        """Initialize classifier"""
    def learning(self, training_data):
        """Train classifier with list of {"class": str, "word": str} dicts"""
    def calculate_score(self, sentence, return_confidence=False):
        """Classify sentence, optionally return confidence"""
    def predict(self, sentence):
        """Predict class (sklearn-compatible)"""
    def predict_proba(self, sentence):
        """Get class probabilities (sklearn-compatible)"""
🆕 What's New in v0.3.0
🎯 Major Features
- ✨ Transformer support: State-of-the-art neural models via sentence-transformers
- 🧠 ML classifiers: SVM and Naive Bayes with TF-IDF
- 🌍 Better multilingual: Improved language handling with 17 stemmers
- 📊 Confidence scores: Get prediction probabilities
- 🔧 Flexible API: sklearn-like interface with predict() and predict_proba()
🐛 Critical Bug Fixes
- Fixed typo: requeriments.txt → requirements.txt
- Fixed RSLPStemmer being used for all languages (now language-aware)
- Fixed crashes when stopwords unavailable for languages
- Fixed language detection failures on short texts
- Fixed exception messages for better debugging
- Added punkt_tab to NLTK downloads for compatibility
🔄 Backwards Compatibility
All v0.2.0 code continues to work without modifications. New features are opt-in.
See CHANGELOG.md for complete version history.
📝 Examples
Explore the example/ directory:
- example.py: Basic TF-IDF similarity examples
- exemplo2.py: Classification examples
- example_advanced.py: Advanced AI features with transformers and comparisons
Run examples:
python example/example.py
python example/example_advanced.py
🤝 Contributing
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
Quick Start for Contributors
# Clone repository
git clone https://github.com/fabiocax/SimilarityText.git
cd SimilarityText
# Install in development mode
pip install -e ".[transformers]"
# Run examples
python example/example_advanced.py
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👤 Author
Fabio Alberti
- Email: fabiocax@gmail.com
- GitHub: @fabiocax
🔗 Links
- GitHub: https://github.com/fabiocax/SimilarityText
- PyPI: https://pypi.org/project/SimilarityText/
- Documentation: https://github.com/fabiocax/SimilarityText/blob/main/README.md
- Issues: https://github.com/fabiocax/SimilarityText/issues
🙏 Acknowledgments
- sentence-transformers: For providing excellent pre-trained models
- scikit-learn: For robust ML algorithms
- NLTK: For comprehensive NLP tools
- All contributors and users of this library
⭐ Star History
If you find this project useful, please consider giving it a star on GitHub!
📈 Roadmap
- Add more pre-trained models
- Batch processing API
- GPU acceleration support
- REST API server
- Caching mechanisms
- More language-specific optimizations
- Integration with popular frameworks (FastAPI, Flask)
Made with ❤️ using Python and AI