Embeddings Evaluator
A Python package for analyzing and comparing embedding models through pairwise cosine similarity distributions.
Features
- Pairwise cosine similarity distribution analysis
- Statistical measures:
  - Mean (μ)
  - Standard deviation (σ)
  - Median (m)
  - Peak location and amplitude
- Multi-model comparison visualization
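The measures listed above can be sketched with plain NumPy. This is a minimal illustration, not the package's actual implementation: the function names `pairwise_cosine_similarities` and `summarize` are hypothetical, and it assumes the distribution is taken over all distinct document pairs (the strict upper triangle of the similarity matrix, so self-similarity is excluded).

```python
import numpy as np


def pairwise_cosine_similarities(embeddings):
    """Return cosine similarities for all distinct pairs of rows (i < j)."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize each row so a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T
    # Keep the strict upper triangle: each unordered pair once, no self-pairs.
    i, j = np.triu_indices(len(unit), k=1)
    return sims[i, j]


def summarize(similarities):
    """Mean, standard deviation and median of a 1-D array of similarities."""
    return {
        "mean": float(np.mean(similarities)),
        "std": float(np.std(similarities)),
        "median": float(np.median(similarities)),
    }


# Three 2-D vectors: the first two are orthogonal, the third lies between them.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = pairwise_cosine_similarities(vectors)
stats = summarize(sims)
print(sims)
print(stats)
```

Note that the full similarity matrix grows quadratically with the number of documents, so this direct approach is only practical for moderately sized collections.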
Installation
pip install -r requirements.txt
Usage
import numpy as np
from embeddings_evaluator import plot_model_comparison
from embeddings_evaluator.comparison import save_comparison_plot
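# Load your embeddings into a dictionary mapping model name -> array.
# embeddings_a and embeddings_b are placeholders for your own numpy arrays,
# each of shape (n_docs, embedding_dim).
embeddings_dict = {
    "Model A": embeddings_a,
    "Model B": embeddings_b,
}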
# Generate comparison plot
fig = plot_model_comparison(embeddings_dict)
save_comparison_plot(fig, 'comparison.png')
Example with Faiss Indices
import faiss
import numpy as np
from embeddings_evaluator import plot_model_comparison
from embeddings_evaluator.comparison import save_comparison_plot
# Load embeddings from faiss indices
def load_faiss_embeddings(index_path):
    index = faiss.read_index(index_path)
    if isinstance(index, faiss.IndexFlatL2):
        num_vectors = index.ntotal
        dimension = index.d
        embeddings = np.zeros((num_vectors, dimension), dtype=np.float32)
        for i in range(num_vectors):
            embeddings[i] = index.reconstruct(i)
        return embeddings
    raise ValueError("Unsupported index type")
# Load multiple models
embeddings_dict = {}
for size in [250, 500, 1000, 2000, 4000]:
    embeddings = load_faiss_embeddings(f"faiss_embeddings/{size}/index.faiss")
    # Normalize for cosine similarity
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
    embeddings_dict[f"Model {size}"] = embeddings
# Generate visualization
fig = plot_model_comparison(embeddings_dict)
save_comparison_plot(fig, 'model_comparison.png')
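The normalization step in the example divides each row by its L2 norm, which produces NaN values if an index happens to contain an all-zero vector. A hedged sketch of a guarded variant follows; the helper name `normalize_rows` and the `eps` threshold are assumptions for illustration and are not part of the package.

```python
import numpy as np


def normalize_rows(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clamp tiny norms so zero vectors do not trigger a division by zero.
    return embeddings / np.maximum(norms, eps)


raw = np.array([[3.0, 4.0], [0.0, 0.0]], dtype=np.float32)
unit = normalize_rows(raw)
print(unit)
```

Using `keepdims=True` keeps the norms as a column vector, so the division broadcasts row by row without the explicit `[:, np.newaxis]` reshaping.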
Output
The tool provides:
- Statistical measures for each model:
  - Mean cosine similarity (μ)
  - Standard deviation (σ)
  - Median (m)
  - Peak location and amplitude
- Visualization:
  - Overlaid probability density histograms
  - Statistical annotations
  - Peak coordinates
  - Vertical lines at mean values
  - [0,1] bounded cosine similarity range
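One way to read the peak coordinates off a probability density histogram is to take the center and height of its tallest bin. The sketch below assumes exactly that, over the [0, 1] range; the function name `histogram_peak` and the bin count of 50 are hypothetical choices, and the package may locate peaks differently (for example after smoothing).

```python
import numpy as np


def histogram_peak(similarities, bins=50, value_range=(0.0, 1.0)):
    """Return (location, amplitude) of the tallest bin of a density histogram."""
    density, edges = np.histogram(
        similarities, bins=bins, range=value_range, density=True
    )
    centers = 0.5 * (edges[:-1] + edges[1:])
    top = int(np.argmax(density))
    return float(centers[top]), float(density[top])


# Synthetic similarities clustered around 0.3, clipped to the plotted range.
rng = np.random.default_rng(0)
sample = np.clip(rng.normal(loc=0.3, scale=0.05, size=10_000), 0.0, 1.0)
location, amplitude = histogram_peak(sample)
print(f"peak at {location:.2f}, density {amplitude:.2f}")
```

Because `density=True` normalizes the histogram to integrate to one, the amplitude is a probability density rather than a count and can exceed 1 for narrow distributions.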
Requirements
- numpy
- pandas
- plotly
- scipy
- faiss-cpu (for faiss index support)