XMPro Vector Insight is a Python library for analyzing and visualizing embedding data. It provides tools for calculating various similarity measures (cosine, Euclidean, inner product) between embeddings and reference vectors, applying different scaling techniques, and performing basic statistical analysis on embedding data. The library is designed with flexibility in mind, allowing for easy extension and customization of its core functionalities.
XMCluster
XMCluster is a Python library designed for performing density-based clustering on embedding data using DBSCAN (Density-Based Spatial Clustering of Applications with Noise). It provides a flexible and extensible framework for conducting cluster analysis, optimizing clustering parameters, and analyzing cluster characteristics.
Features
- DBSCAN Clustering: Perform density-based clustering on your embedding data.
- Cluster Statistics: Get basic statistics about the clusters, including the number of clusters, number of noise points, and cluster sizes.
- Cluster Quality Metrics: Calculate various cluster quality metrics such as Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index.
- Optimal Parameter Finding: Find the optimal epsilon parameter for DBSCAN using a grid search and the Silhouette score.
- Cluster Centroids: Calculate the centroids of each cluster.
- Intra-cluster Distances: Calculate the average distance of points to their cluster centroid.
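For orientation, the first two features correspond roughly to the scikit-learn operations below. This is a minimal sketch that calls scikit-learn directly rather than XMCluster. How XMCluster implements these steps internally is not documented here, and the variable names (`points`, `sizes`, `n_noise`) are illustrative assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Six 2-D points, the same sample values as in the Usage section.
points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)

# fit_predict returns one label per point; scikit-learn marks noise with -1.
labels = DBSCAN(eps=3, min_samples=2).fit_predict(points)

# Basic statistics: cluster count, noise count, and the size of each cluster.
sizes = Counter(int(label) for label in labels if label != -1)
n_clusters = len(sizes)
n_noise = int(np.sum(labels == -1))

print("Labels:", labels.tolist())
print("Clusters:", n_clusters, "Noise points:", n_noise, "Sizes:", dict(sizes))
```

With these points, the first three form one cluster, the next two form a second, and the distant point `[25, 80]` is labelled as noise.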
Installation
Install XMCluster using pip:
pip install xmcluster
Usage
Here's a basic example of how to use XMCluster:
from xmcluster import DBSCANAnalyzer, ClusterMetric
# Sample embeddings
embeddings = {
'key1': [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]],
'key2': [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
}
# Create a DBSCANAnalyzer instance
analyzer = DBSCANAnalyzer(embeddings)
# Perform DBSCAN clustering
cluster_labels = analyzer.perform_dbscan(eps=3, min_samples=2)
# Get cluster statistics
stats = analyzer.get_cluster_statistics()
# Calculate silhouette score
silhouette = analyzer.calculate_cluster_metric(ClusterMetric.SILHOUETTE)
# Find optimal epsilon
optimal_eps = analyzer.find_optimal_epsilon(min_samples=2, eps_range=(0.1, 10.0), n_steps=20)
print("Cluster Labels:", cluster_labels)
print("Cluster Statistics:", stats)
print("Silhouette Scores:", silhouette)
print("Optimal Epsilon:", optimal_eps)
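To make the idea behind the epsilon search concrete, here is a hedged sketch of a grid search over `eps` scored by the Silhouette score, written against scikit-learn directly. The helper `grid_search_eps` is hypothetical and is not XMCluster's implementation. Excluding noise points before scoring and skipping candidates with fewer than two clusters are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)


def grid_search_eps(X, min_samples, eps_range, n_steps):
    """Return (best_eps, best_score), or (None, None) if no candidate can be scored."""
    best_eps, best_score = None, None
    for eps in np.linspace(eps_range[0], eps_range[1], n_steps):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # assumption: noise points are left out of the score
        n_clusters = len(set(labels[mask].tolist()))
        # silhouette_score needs at least 2 clusters and fewer clusters than samples.
        if n_clusters < 2 or n_clusters >= mask.sum():
            continue
        score = silhouette_score(X[mask], labels[mask])
        # Strict comparison: on ties, the smallest eps tried first is kept.
        if best_score is None or score > best_score:
            best_eps, best_score = float(eps), float(score)
    return best_eps, best_score


best_eps, best_score = grid_search_eps(
    points, min_samples=2, eps_range=(0.1, 10.0), n_steps=20
)
print("Best eps:", best_eps, "Silhouette:", best_score)
```

A grid search like this only evaluates the `n_steps` candidate values, so the result is the best value on the grid rather than a true optimum.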
Advanced Usage
Cluster Centroids and Intra-cluster Distances
centroids = analyzer.get_cluster_centroids()
intra_distances = analyzer.get_intra_cluster_distances()
print("Cluster Centroids:", centroids)
print("Intra-cluster Distances:", intra_distances)
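As a rough illustration of what these two quantities mean, the sketch below computes a centroid and the average point-to-centroid distance per cluster with NumPy only. It assumes Euclidean distance and a label array in which `-1` marks noise (scikit-learn's convention). The dictionary return shape is an assumption for this example and may differ from what XMCluster returns.

```python
import numpy as np

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, -1])  # example labels; -1 marks a noise point

centroids = {}
intra_distances = {}
for label in sorted(set(labels.tolist()) - {-1}):
    members = points[labels == label]
    # The centroid is the coordinate-wise mean of the cluster's members.
    centroid = members.mean(axis=0)
    centroids[label] = centroid
    # Average Euclidean distance from each member to its centroid.
    intra_distances[label] = float(np.linalg.norm(members - centroid, axis=1).mean())

print("Centroids:", {k: v.tolist() for k, v in centroids.items()})
print("Intra-cluster distances:", intra_distances)
```

A smaller intra-cluster distance indicates a tighter cluster. Noise points are skipped because they belong to no cluster.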
Different Cluster Quality Metrics
calinski_harabasz = analyzer.calculate_cluster_metric(ClusterMetric.CALINSKI_HARABASZ)
davies_bouldin = analyzer.calculate_cluster_metric(ClusterMetric.DAVIES_BOULDIN)
print("Calinski-Harabasz Index:", calinski_harabasz)
print("Davies-Bouldin Index:", davies_bouldin)
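For readers who want to cross-check these values, the same three metrics are available in scikit-learn. The sketch below calls scikit-learn directly and assumes noise points (label `-1`) are removed before scoring. Whether XMCluster treats noise the same way is not stated in this README.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, -1])  # example labels; -1 marks a noise point

# Assumption: score only the points that were assigned to a cluster.
mask = labels != -1
X, y = points[mask], labels[mask]

silhouette = silhouette_score(X, y)
calinski_harabasz = calinski_harabasz_score(X, y)
davies_bouldin = davies_bouldin_score(X, y)

print("Silhouette:", silhouette)
print("Calinski-Harabasz:", calinski_harabasz)
print("Davies-Bouldin:", davies_bouldin)
```

When comparing results, note that the Silhouette score ranges from -1 to 1 and higher is better, a higher Calinski-Harabasz index is better, and a lower Davies-Bouldin index is better.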
Dependencies
- numpy
- scikit-learn
Contributing
We welcome contributions! Please see our contributing guidelines for more details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For any queries or support, please contact [your contact information].
File details
Details for the file xmcluster-0.0.1.tar.gz.
File metadata
- Download URL: xmcluster-0.0.1.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2200ccc20225a8b4bfd80329729511e912ed5410afc88a39c1f38d6cacba327f |
| MD5 | 1255c1ca0bf6ae987a54220227ac01f0 |
| BLAKE2b-256 | ee815286c970f25df3b72bdd6c00ca43db95cf7034018520c7957f9c09b9053b |
File details
Details for the file xmcluster-0.0.1-py3-none-any.whl.
File metadata
- Download URL: xmcluster-0.0.1-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5b08c47952737de5b8f8c96f2d0d52b206e16ec194f4af637839ed165cb72d4 |
| MD5 | 826ba974de88f22eb5c57a4dec6b6812 |
| BLAKE2b-256 | cd5dab95e6f760e8c17ac12fd8f251cf00fb4b5aa6d9e1ef2d13dabae36cd455 |