High-performance PAM (k-medoids) clustering implemented in Rust with Python bindings
RustPAM - High-Performance PAM Clustering in Rust
RustPAM is a Rust reimplementation of OneBatchPAM (k-medoids clustering) using modern engineering practices and the Rayon parallelization framework, providing better performance and maintainability than the original Cython version.
Features
- 🚀 High Performance: Core algorithm implemented in Rust with zero-cost abstractions
- ⚡ Parallelization: Data parallelism based on Rayon, fully utilizing multi-core CPUs
- 🔧 Engineering Excellence: Built with PyO3 + maturin, seamlessly integrating into Python ecosystem
- 📦 User Friendly: scikit-learn compatible API
- 🎯 Memory Efficient: Batch sampling reduces memory footprint
Installation
Install from Source
# Clone the repository
git clone <repository-url>
cd rustpam
# Build and install with maturin
pip install maturin
maturin develop --release
Install with pip (after building)
pip install rustpam
Requirements
- Python >= 3.8
- NumPy >= 1.20
- scikit-learn >= 1.0
- Rust (required for building)
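After installing, it can help to confirm which of the required packages are actually visible to the interpreter. The helper below is a minimal sketch using only the standard library; the function name `installed_versions` is hypothetical and not part of RustPAM.

```python
import sys
from importlib import metadata

# Distribution names taken from the Requirements list above, plus rustpam itself.
REQUIRED = ("numpy", "scikit-learn", "rustpam")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# The README asks for Python >= 3.8.
python_ok = sys.version_info >= (3, 8)
print("Python >= 3.8:", python_ok)
for name, version in installed_versions().items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

The script only reports versions; comparing them against the minimums listed above is left to the reader.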
Quick Start
import numpy as np
from rustpam import OneBatchPAM
# Generate sample data
X = np.random.randn(1000, 10).astype(np.float32)
# Create model
model = OneBatchPAM(
    n_medoids=5,
    distance='euclidean',
    max_iter=100,
    random_state=42,
    n_threads=4,  # Use 4 threads
)
# Fit model
model.fit(X)
# Get cluster centers and labels
centers = model.cluster_centers_
labels = model.labels_
# Predict new data
X_new = np.random.randn(100, 10).astype(np.float32)
new_labels = model.predict(X_new)
print(f"Medoid indices: {model.medoid_indices_}")
print(f"Inertia: {model.inertia_:.4f}")
print(f"Iterations: {model.n_iter_}")
API Documentation
OneBatchPAM
Parameters:
- n_medoids (int, default=10): Number of clusters
- distance (str, default='euclidean'): Distance metric, supports all scikit-learn distances
- batch_size ('auto' or int, default='auto'): Batch size
- weighting (bool, default=True): Whether to use cluster size weighting
- max_iter (int, default=100): Maximum number of iterations
- tol (float, default=1e-6): Convergence tolerance
- n_jobs (int or None, default=None): Parallelism for sklearn distance computation
- random_state (int or None, default=None): Random seed
- n_threads (int or None, default=None): Number of threads for Rust core
Attributes:
- medoid_indices_: Indices of selected medoids
- labels_: Cluster label for each sample
- inertia_: Objective function value
- dist_to_nearest_medoid_: Distance to nearest medoid
- n_iter_: Actual number of iterations
- cluster_centers_: Medoid feature vectors
Methods:
- fit(X): Fit the model
- predict(X): Predict cluster labels
- fit_predict(X): Fit and return medoid indices
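To make the relationship between these attributes concrete, here is a small pure-Python sketch of nearest-medoid assignment. It is an illustration under assumptions, not RustPAM's code: the helper `assign_to_medoids` is hypothetical, Euclidean distance is hard-coded, and treating the objective as the plain sum of distances to the nearest medoid is an assumption (the README only calls `inertia_` the "objective function value").

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_medoids(X, medoids):
    """Return (labels, distances): nearest medoid index and distance per row of X."""
    labels, dists = [], []
    for x in X:
        d = [euclidean(x, m) for m in medoids]
        k = min(range(len(medoids)), key=d.__getitem__)
        labels.append(k)
        dists.append(d[k])
    return labels, dists


X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
medoid_indices = [0, 2]                      # plays the role of medoid_indices_
centers = [X[i] for i in medoid_indices]     # plays the role of cluster_centers_
labels, dist_to_nearest = assign_to_medoids(X, centers)
inertia = sum(dist_to_nearest)               # assumed definition of the objective
print(labels, dist_to_nearest, inertia)
```

Calling the same helper on unseen points with the fitted medoid vectors mirrors what a `predict(X)` step does conceptually: each new point receives the label of its nearest medoid.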
Performance Comparison
Compared to the original Cython implementation, RustPAM offers:
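- Better Parallel Scalability: Rayon's work-stealing scheduler balances uneven workloads across threads, which can scale better than the OpenMP loops used in the Cython version
- Memory Safety: Rust's ownership system rules out data races and use-after-free errors in safe code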
- Easier Maintenance: Type system and modern toolchain improve code quality
- Cross-Platform: Better Windows/macOS/Linux support
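No benchmark figures are given here, so the points above are best checked on your own data and hardware. The timing harness below is a minimal sketch; `time_fit` is a hypothetical helper, not part of RustPAM, and it accepts any zero-argument callable.

```python
import statistics
import time


def time_fit(fit_fn, repeats=5):
    """Run fit_fn() several times and return (median_seconds, best_seconds)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)


# Stand-in workload so the sketch runs without rustpam installed.
median_s, best_s = time_fit(lambda: sum(i * i for i in range(10_000)))
print(f"median {median_s:.6f}s, best {best_s:.6f}s")
```

With RustPAM installed, passing something like `lambda: OneBatchPAM(n_medoids=5, n_threads=t).fit(X)` for several values of `t` gives a rough picture of how fitting time scales with thread count.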
Algorithm Description
OneBatchPAM is an optimized variant of PAM (Partitioning Around Medoids):
- Batch Sampling: Approximates the objective on a sampled batch instead of computing the full distance matrix
- Greedy Swap: Finds the best medoid swap pair in each iteration
- Parallelization: Independent evaluation steps can be executed in parallel
- Weighting: Optional cluster size weighting improves stability for small samples
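The batch sampling and greedy swap steps can be sketched in a few lines of pure Python. This is a simplified illustration of the idea, not the Rust implementation: the function name `one_batch_pam_sketch` is hypothetical, Euclidean distance is hard-coded, swaps are evaluated by brute force, and the optional weighting step is left out because its exact form is not described here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_objective(dist, medoids):
    """Sum, over batch points, of the distance to the nearest medoid."""
    n_batch = len(dist[0])
    return sum(min(dist[m][j] for m in medoids) for j in range(n_batch))


def one_batch_pam_sketch(X, n_medoids, batch_size, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # Batch sampling: store an n x batch_size distance matrix, not n x n.
    batch = rng.sample(range(n), batch_size)
    dist = [[euclidean(X[i], X[j]) for j in batch] for i in range(n)]

    medoids = rng.sample(range(n), n_medoids)
    best = batch_objective(dist, medoids)
    n_iter = 0
    for _ in range(max_iter):
        n_iter += 1
        best_swap, best_value = None, best
        # Greedy swap: evaluate every (medoid slot, candidate) pair.
        # Each evaluation is independent, which is what makes it parallelizable.
        for slot in range(n_medoids):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:slot] + [cand] + medoids[slot + 1:]
                value = batch_objective(dist, trial)
                if value < best_value:
                    best_value, best_swap = value, (slot, cand)
        # Stop when no swap improves the objective by more than tol.
        if best_swap is None or best - best_value <= tol:
            break
        medoids[best_swap[0]] = best_swap[1]
        best = best_value
    return medoids, best, n_iter


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, objective, n_iter = one_batch_pam_sketch(X, n_medoids=2, batch_size=6)
print(sorted(medoids), objective, n_iter)
```

The brute-force inner loop recomputes the objective for every trial swap, which is fine for a toy example but far slower than an implementation that updates swap gains incrementally.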
Development
# Install development dependencies
pip install maturin pytest numpy scikit-learn
# Development mode build
maturin develop
# Run tests
pytest tests/
# Release mode build
maturin build --release
Project Structure
rustpam/
├── src/
│ └── lib.rs # Rust core implementation
├── rustpam/
│ ├── __init__.py # Python package initialization
│ └── onebatchpam.py # Python wrapper layer
├── Cargo.toml # Rust dependencies
├── pyproject.toml # Python project configuration
└── README.md
Tech Stack
- Rust: Core algorithm implementation
- PyO3: Python-Rust bindings
- maturin: Build system
- ndarray: Rust array library
- rayon: Data parallelism framework
- numpy: Python array interface
License
MIT License
Contributing
Contributions are welcome! Please submit Issues or Pull Requests.
Acknowledgments
This project is based on the original Cython implementation, rewritten in Rust to provide better performance and maintainability.
Contact
For questions or suggestions, please submit a GitHub Issue.