XMPro Vector Insight is a Python library for analyzing and visualizing embedding data. It provides tools for calculating various similarity measures (cosine, Euclidean, inner product) between embeddings and reference vectors, applying different scaling techniques, and performing basic statistical analysis on embedding data. The library is designed with flexibility in mind, allowing for easy extension and customization of its core functionalities.

XMCluster

XMCluster is a Python library for density-based clustering of embedding data with DBSCAN (Density-Based Spatial Clustering of Applications with Noise). It provides an extensible framework for running cluster analysis, tuning clustering parameters, and analyzing cluster characteristics.

Features

  • DBSCAN Clustering: Perform density-based clustering on your embedding data.
  • Cluster Statistics: Get basic statistics about the clusters, including the number of clusters, number of noise points, and cluster sizes.
  • Cluster Quality Metrics: Calculate various cluster quality metrics such as Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index.
  • Optimal Parameter Finding: Find the optimal epsilon parameter for DBSCAN using a grid search and the Silhouette score.
  • Cluster Centroids: Calculate the centroids of each cluster.
  • Intra-cluster Distances: Calculate the average distance of points to their cluster centroid.
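
To make the first two features concrete, here is a minimal sketch that runs DBSCAN directly with scikit-learn (one of the listed dependencies) and derives the same kind of statistics. It is not XMCluster's own code; the variable names are illustrative assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Six 2-D points: two tight groups and one far-away outlier.
points = np.array(
    [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float
)

# scikit-learn labels noise points with -1.
labels = DBSCAN(eps=3, min_samples=2).fit_predict(points)

# Basic statistics: cluster sizes, number of clusters, number of noise points.
cluster_sizes = Counter(int(label) for label in labels if label != -1)
n_clusters = len(cluster_sizes)
n_noise = int(np.sum(labels == -1))

print(labels.tolist())      # [0, 0, 0, 1, 1, -1]
print(dict(cluster_sizes))  # {0: 3, 1: 2}
print(n_clusters, n_noise)  # 2 1
```

With eps=3 the two groups stay separate and the point at (25, 80) is too far from everything else to join a cluster, so it is reported as noise.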

Installation

Install XMCluster using pip:

pip install xmcluster

Usage

Here's a basic example of how to use XMCluster:

from xmcluster import DBSCANAnalyzer, ClusterMetric

# Sample embeddings
embeddings = {
    'key1': [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]],
    'key2': [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
}

# Create a DBSCANAnalyzer instance
analyzer = DBSCANAnalyzer(embeddings)

# Perform DBSCAN clustering
cluster_labels = analyzer.perform_dbscan(eps=3, min_samples=2)

# Get cluster statistics
stats = analyzer.get_cluster_statistics()

# Calculate silhouette score
silhouette = analyzer.calculate_cluster_metric(ClusterMetric.SILHOUETTE)

# Find optimal epsilon
optimal_eps = analyzer.find_optimal_epsilon(min_samples=2, eps_range=(0.1, 10.0), n_steps=20)

print("Cluster Labels:", cluster_labels)
print("Cluster Statistics:", stats)
print("Silhouette Scores:", silhouette)
print("Optimal Epsilon:", optimal_eps)
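
The idea behind a grid search over epsilon can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the implementation behind find_optimal_epsilon: the helper name grid_search_eps is hypothetical, noise points are excluded before scoring, and ties keep the smallest epsilon.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def grid_search_eps(points, min_samples, eps_range, n_steps):
    """Hypothetical helper: return (best_eps, best_score) over an eps grid."""
    X = np.asarray(points, dtype=float)
    best_eps, best_score = None, -1.0
    for eps in np.linspace(eps_range[0], eps_range[1], n_steps):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels != -1  # assumption: score non-noise points only
        n_clusters = len(np.unique(labels[clustered]))
        # The Silhouette score is only defined for
        # 2 <= n_clusters <= n_points - 1.
        if n_clusters < 2 or n_clusters >= clustered.sum():
            continue
        score = silhouette_score(X[clustered], labels[clustered])
        if score > best_score:  # strict '>' keeps the smallest eps on ties
            best_eps, best_score = float(eps), float(score)
    return best_eps, best_score


points = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
best_eps, best_score = grid_search_eps(points, 2, (0.1, 10.0), 20)
print("Best eps:", best_eps, "Silhouette:", best_score)
```

Grid values that yield fewer than two clusters are skipped, because the Silhouette score cannot be computed for them.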

Advanced Usage

Cluster Centroids and Intra-cluster Distances

centroids = analyzer.get_cluster_centroids()
intra_distances = analyzer.get_intra_cluster_distances()

print("Cluster Centroids:", centroids)
print("Intra-cluster Distances:", intra_distances)
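
For reference, both quantities can be computed from raw points and labels with NumPy. The sketch below assumes Euclidean distance and skips noise points (label -1); the function name centroids_and_spread is hypothetical, and XMCluster's return format may differ.

```python
import numpy as np


def centroids_and_spread(points, labels):
    """Return ({label: centroid}, {label: mean distance to centroid})."""
    X = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    centroids, spread = {}, {}
    for label in np.unique(labels):
        if label == -1:  # noise points belong to no cluster
            continue
        members = X[labels == label]
        center = members.mean(axis=0)
        centroids[int(label)] = center
        # Average Euclidean distance of each member to its centroid.
        distances = np.linalg.norm(members - center, axis=1)
        spread[int(label)] = float(distances.mean())
    return centroids, spread


points = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
labels = [0, 0, 0, 1, 1, -1]
centroids, spread = centroids_and_spread(points, labels)
print(centroids[1], spread[1])  # [8.  7.5] 0.5
```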

Different Cluster Quality Metrics

calinski_harabasz = analyzer.calculate_cluster_metric(ClusterMetric.CALINSKI_HARABASZ)
davies_bouldin = analyzer.calculate_cluster_metric(ClusterMetric.DAVIES_BOULDIN)

print("Calinski-Harabasz Index:", calinski_harabasz)
print("Davies-Bouldin Index:", davies_bouldin)
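
All three metrics are available in scikit-learn, which helps when interpreting the values. The sketch below calls them directly and assumes noise points are excluded before scoring, which may not match how XMCluster handles them. Higher is better for the Silhouette score (range -1 to 1) and the Calinski-Harabasz index; lower is better for the Davies-Bouldin index.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

points = np.array(
    [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float
)
labels = DBSCAN(eps=3, min_samples=2).fit_predict(points)

# Assumption: drop noise points (label -1) before computing the metrics.
clustered = labels != -1
X, y = points[clustered], labels[clustered]

scores = {
    "silhouette": float(silhouette_score(X, y)),
    "calinski_harabasz": float(calinski_harabasz_score(X, y)),
    "davies_bouldin": float(davies_bouldin_score(X, y)),
}
print(scores)
```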

Dependencies

  • numpy
  • scikit-learn

Contributing

We welcome contributions! Please see our contributing guidelines for more details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any queries or support, please contact [your contact information].

Download files

Source Distribution

xmcluster-0.0.1.tar.gz (7.2 kB view details)

Uploaded Source

Built Distribution

xmcluster-0.0.1-py3-none-any.whl (4.9 kB view details)

Uploaded Python 3

File details

Details for the file xmcluster-0.0.1.tar.gz.

File metadata

  • Download URL: xmcluster-0.0.1.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for xmcluster-0.0.1.tar.gz
Algorithm Hash digest
SHA256 2200ccc20225a8b4bfd80329729511e912ed5410afc88a39c1f38d6cacba327f
MD5 1255c1ca0bf6ae987a54220227ac01f0
BLAKE2b-256 ee815286c970f25df3b72bdd6c00ca43db95cf7034018520c7957f9c09b9053b

File details

Details for the file xmcluster-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: xmcluster-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for xmcluster-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f5b08c47952737de5b8f8c96f2d0d52b206e16ec194f4af637839ed165cb72d4
MD5 826ba974de88f22eb5c57a4dec6b6812
BLAKE2b-256 cd5dab95e6f760e8c17ac12fd8f251cf00fb4b5aa6d9e1ef2d13dabae36cd455
