High-performance PAM (k-medoids) clustering implemented in Rust with Python bindings

RustPAM - High-Performance PAM Clustering in Rust

License: MIT

RustPAM is a Rust reimplementation of OneBatchPAM, a k-medoids clustering algorithm. It uses the Rayon parallelization framework and aims to provide better performance and maintainability than the original Cython version.

Features

  • 🚀 High Performance: Core algorithm implemented in Rust with zero-cost abstractions
  • Parallelization: Data parallelism based on Rayon, fully utilizing multi-core CPUs
  • 🔧 Engineering Excellence: Built with PyO3 + maturin, seamlessly integrating into Python ecosystem
  • 📦 User Friendly: scikit-learn compatible API
  • 🎯 Memory Efficient: Batch sampling reduces memory footprint

Installation

Install from Source

# Clone the repository
git clone <repository-url>
cd rustpam

# Build and install with maturin
pip install maturin
maturin develop --release

Install with pip (after building)

pip install rustpam

Requirements

  • Python >= 3.8
  • NumPy >= 1.20
  • scikit-learn >= 1.0
  • Rust (required for building)
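
Before building, it can help to confirm that the interpreter and the listed packages are visible. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of rustpam, and it checks only that each module can be found, not the minimum versions listed above.

```python
import sys
from importlib.util import find_spec

# Minimum Python version taken from the Requirements list.
MIN_PYTHON = (3, 8)

# Import names to look for; "sklearn" is the import name of scikit-learn.
MODULES = ("numpy", "sklearn", "rustpam")


def check_environment(modules=MODULES, min_python=MIN_PYTHON):
    """Return a dict mapping each check to True (satisfied) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```

A `missing` entry for `rustpam` is expected until `maturin develop --release` or `pip install rustpam` has completed.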

Quick Start

import numpy as np
from rustpam import OneBatchPAM

# Generate sample data
X = np.random.randn(1000, 10).astype(np.float32)

# Create model
model = OneBatchPAM(
    n_medoids=5,
    distance='euclidean',
    max_iter=100,
    random_state=42,
    n_threads=4  # Use 4 threads
)

# Fit model
model.fit(X)

# Get cluster centers and labels
centers = model.cluster_centers_
labels = model.labels_

# Predict new data
X_new = np.random.randn(100, 10).astype(np.float32)
new_labels = model.predict(X_new)

print(f"Medoid indices: {model.medoid_indices_}")
print(f"Inertia: {model.inertia_:.4f}")
print(f"Iterations: {model.n_iter_}")

API Documentation

OneBatchPAM

Parameters:

  • n_medoids (int, default=10): Number of clusters
  • distance (str, default='euclidean'): Distance metric, supports all scikit-learn distances
  • batch_size ('auto' or int, default='auto'): Number of samples in the batch used to approximate the distance matrix
  • weighting (bool, default=True): Whether to use cluster size weighting
  • max_iter (int, default=100): Maximum number of iterations
  • tol (float, default=1e-6): Convergence tolerance
  • n_jobs (int or None, default=None): Parallelism for sklearn distance computation
  • random_state (int or None, default=None): Random seed
  • n_threads (int or None, default=None): Number of threads for Rust core

Attributes:

  • medoid_indices_: Indices of selected medoids
  • labels_: Cluster label for each sample
  • inertia_: Objective function value
  • dist_to_nearest_medoid_: Distance to nearest medoid
  • n_iter_: Actual number of iterations
  • cluster_centers_: Medoid feature vectors
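
The attributes above are related: once the medoid indices are known, the others can be derived from the data. The sketch below illustrates that relationship in pure Python. It is not RustPAM's code; `summarize_clustering` is a hypothetical helper, and treating the objective as the plain sum of Euclidean distances to the nearest medoid is an assumption (the real value may differ, for example when weighting is enabled).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize_clustering(X, medoid_indices):
    """Derive centers, labels, nearest distances and objective from medoids."""
    centers = [X[i] for i in medoid_indices]       # analogous to cluster_centers_
    labels, nearest = [], []
    for point in X:
        dists = [euclidean(point, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        labels.append(best)                        # analogous to labels_
        nearest.append(dists[best])                # analogous to dist_to_nearest_medoid_
    inertia = sum(nearest)                         # assumed definition of inertia_
    return centers, labels, nearest, inertia


X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, nearest, inertia = summarize_clustering(X, [0, 2])
print(labels)   # [0, 0, 1, 1]
print(inertia)  # 2.0
```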

Methods:

  • fit(X): Fit the model
  • predict(X): Predict cluster labels
  • fit_predict(X): Fit and return medoid indices

Performance Comparison

Compared to the original Cython implementation, RustPAM offers:

  1. Better Parallel Scalability: Rayon's work-stealing scheduler is more efficient than OpenMP
  2. Memory Safety: Rust's ownership system rules out data races and memory errors such as use-after-free in safe code
  3. Easier Maintenance: Type system and modern toolchain improve code quality
  4. Cross-Platform: Better Windows/macOS/Linux support
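
The parallel scalability point rests on swap evaluations being independent of one another, so they can be mapped across workers and reduced to a minimum. The sketch below shows that map-and-reduce shape in Python with a thread pool. It is only a structural illustration: `swap_cost` and `best_swap_parallel` are hypothetical names, the code is not RustPAM's Rayon implementation, and CPython threads share the GIL, so this version does not show the speed-up that native threads can give.

```python
from concurrent.futures import ThreadPoolExecutor


def swap_cost(task):
    """Cost of the medoid set obtained by replacing one slot with a candidate."""
    D, medoids, slot, cand = task
    trial = medoids[:slot] + [cand] + medoids[slot + 1:]
    cost = sum(min(D[m][j] for m in trial) for j in range(len(D[0])))
    return cost, slot, cand


def best_swap_parallel(D, medoids, n_threads=4):
    """Evaluate every (slot, candidate) swap independently and keep the cheapest."""
    tasks = [
        (D, medoids, slot, cand)
        for slot in range(len(medoids))
        for cand in range(len(D))
        if cand not in medoids
    ]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Tuples compare by cost first, then slot, then candidate,
        # so ties are resolved deterministically.
        return min(pool.map(swap_cost, tasks))


points = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in points] for a in points]
print(best_swap_parallel(D, [0, 1]))  # (4, 0, 4)
```

Here the best move replaces the medoid at slot 0 with point index 4, which gives one medoid per group.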

Algorithm Description

OneBatchPAM is an optimized variant of PAM (Partitioning Around Medoids):

  1. Batch Sampling: Approximates the objective using distances to a sampled batch instead of the full pairwise distance matrix
  2. Greedy Swap: Finds the best medoid swap pair in each iteration
  3. Parallelization: Independent evaluation steps can be executed in parallel
  4. Weighting: Optional cluster size weighting improves stability for small samples
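
The first three steps can be sketched in a few lines of pure Python. This is an illustrative toy under stated assumptions, not the algorithm as implemented in RustPAM: `one_batch_pam_sketch` and `batch_cost` are hypothetical names, the swap search is a brute-force loop, and both the weighting step and the `tol` convergence check are omitted.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_cost(D, medoids):
    """Sum over batch columns of the distance to the closest medoid."""
    return sum(min(D[m][j] for m in medoids) for j in range(len(D[0])))


def one_batch_pam_sketch(X, k, batch_size, max_iter=100, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # Step 1, batch sampling: distances from every point to one fixed
    # sample, an n x batch_size matrix instead of n x n.
    batch = rng.sample(range(n), batch_size)
    D = [[euclidean(X[i], X[j]) for j in batch] for i in range(n)]

    medoids = rng.sample(range(n), k)
    cost = batch_cost(D, medoids)
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        # Step 2, greedy swap: try every (medoid slot, candidate) pair.
        # Step 3: each evaluation is independent, so this double loop
        # is the part that could be run in parallel.
        for slot in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:slot] + [cand] + medoids[slot + 1:]
                trial_cost = batch_cost(D, trial)
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (slot, cand)
        if best_swap is None:
            break  # no swap improves the batch objective
        slot, cand = best_swap
        medoids[slot] = cand
        cost = best_cost
    return sorted(medoids), cost


X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
medoids, cost = one_batch_pam_sketch(X, k=2, batch_size=len(X))
print(medoids, round(cost, 3))
```

With the batch covering all eight points, the search ends with one medoid in each square of four points. A smaller `batch_size` trades accuracy of the objective for fewer distance computations.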

Development

# Install development dependencies
pip install maturin pytest numpy scikit-learn

# Development mode build
maturin develop

# Run tests
pytest tests/

# Release mode build
maturin build --release

Project Structure

rustpam/
├── src/
│   └── lib.rs           # Rust core implementation
├── rustpam/
│   ├── __init__.py      # Python package initialization
│   └── onebatchpam.py   # Python wrapper layer
├── Cargo.toml           # Rust dependencies
├── pyproject.toml       # Python project configuration
└── README.md

Tech Stack

  • Rust: Core algorithm implementation
  • PyO3: Python-Rust bindings
  • maturin: Build system
  • ndarray: Rust array library
  • rayon: Data parallelism framework
  • numpy: Python array interface

License

MIT License

Contributing

Contributions are welcome! Please submit Issues or Pull Requests.

Acknowledgments

This project is based on the original Cython implementation, rewritten in Rust to provide better performance and maintainability.

Contact

For questions or suggestions, please submit a GitHub Issue.
