PyTorch and RAPIDS (cuVS/cuML) accelerated dimensionality reduction
DiRe Rapids
GPU-accelerated implementation of DiRe using PyTorch and optionally NVIDIA RAPIDS for massive-scale datasets.
Installation
From Repository (development)
# Clone the repository
git clone https://github.com/sashakolpakov/dire-rapids.git
cd dire-rapids
# Basic installation (CPU + PyTorch)
pip install -e .
# With CUDA support
pip install -e .[cuda]
# For development (includes testing and dev tools)
pip install -e .[dev]
With RAPIDS Support (Optional, GPU only)
First, install RAPIDS following the official RAPIDS installation instructions.
# Then install dire-rapids with RAPIDS support
pip install -e .[rapids]
From PyPI (stable)
# Install with the same optional extras as above, e.g.:
pip install dire-rapids[cuda]
Quick Start 
You can import the standard or the memory-efficient backend for DiRe. A dataset is also needed; we use higher-dimensional blobs as a simple visual test.
from dire_rapids import DiRePyTorch, DiRePyTorchMemoryEfficient
from sklearn.datasets import make_blobs
The standard backend works for the example below, but not necessarily for a much larger (100x) dataset.
# Generate sample data
X, _ = make_blobs(n_samples=1_000, centers=12, n_features=10, random_state=42)
# Standard PyTorch implementation
reducer = DiRePyTorch(n_components=2, n_neighbors=16, verbose=True)
X_embedded = reducer.fit_transform(X)
The memory-efficient version gets you there as well (how soon depends on the hardware).
reducer = DiRePyTorchMemoryEfficient(n_components=2, n_neighbors=16, verbose=True)
X_embedded = reducer.fit_transform(X)
Custom Distance Metrics
DiRe Rapids supports custom distance metrics for k-nearest-neighbor computation, while the layout forces remain Euclidean for optimal embedding quality:
# Using L1 (Manhattan) distance for k-NN
reducer = DiRePyTorch(
    metric='(x - y).abs().sum(-1)',
    n_neighbors=32,
    verbose=True
)
X_embedded = reducer.fit_transform(X)
# Using cosine distance for k-NN
def cosine_distance(x, y):
    return 1 - (x * y).sum(-1) / (x.norm(dim=-1, keepdim=True) * y.norm(dim=-1, keepdim=True) + 1e-8)
reducer = DiRePyTorch(
    metric=cosine_distance,
    n_neighbors=32
)
X_embedded = reducer.fit_transform(X)
# Custom metrics work with all backends
reducer = DiRePyTorchMemoryEfficient(
    metric='(x - y).abs().sum(-1)',  # L1 distance
    use_fp16=True,
    n_neighbors=32
)
X_embedded = reducer.fit_transform(X)
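As a sanity check, the two example metrics can be written in plain NumPy (for illustration only; the library evaluates the torch-tensor equivalents shown above):

```python
import numpy as np

# Plain-NumPy versions of the two example metrics, for illustration only;
# the library evaluates the torch-tensor expressions shown above.
def l1_distance(x, y):
    return np.abs(x - y).sum(-1)

def cosine_distance_np(x, y):
    return 1 - (x * y).sum(-1) / (
        np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8
    )

x = np.array([[1.0, 0.0]])
y = np.array([[0.0, 1.0]])
print(l1_distance(x, y))         # [2.]
print(cosine_distance_np(x, y))  # orthogonal vectors: distance ~ 1
```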
Supported Metric Types:
- None or 'euclidean'/'l2': fast built-in Euclidean distance (default)
- String expressions: evaluated tensor expressions (e.g., '(x - y).abs().sum(-1)' for L1)
- Callable functions: custom Python functions taking (x, y) tensors
Running the memory-efficient backend with verbose=True on a larger dataset (100,000 samples, 100 features) produces output similar to:
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
2025-09-04 16:03:54.409 | INFO | dire_rapids.dire_cuvs:<module>:25 - cuVS available - GPU-accelerated k-NN enabled
2025-09-04 16:03:59.060 | INFO | dire_rapids.dire_cuvs:<module>:36 - cuML available - GPU-accelerated PCA enabled
2025-09-04 16:03:59.581 | INFO | dire_rapids.dire_pytorch:__init__:105 - Using CUDA device: Tesla T4
2025-09-04 16:03:59.581 | INFO | dire_rapids.dire_pytorch_memory_efficient:__init__:89 - Memory-efficient mode enabled
2025-09-04 16:03:59.582 | INFO | dire_rapids.dire_pytorch_memory_efficient:__init__:91 - FP16 enabled for k-NN computation
2025-09-04 16:03:59.583 | INFO | dire_rapids.dire_pytorch_memory_efficient:__init__:93 - PyKeOps repulsion enabled (threshold: 50000 points)
2025-09-04 16:03:59.598 | INFO | dire_rapids.dire_pytorch_memory_efficient:fit_transform:302 - Memory-efficient processing: 100000 samples, 100 features
2025-09-04 16:03:59.599 | INFO | dire_rapids.dire_pytorch_memory_efficient:fit_transform:306 - Large dataset (100000 > 50000): using random sampling for repulsion
2025-09-04 16:03:59.614 | INFO | dire_rapids.dire_pytorch:fit_transform:476 - Processing 100000 samples with 100 features
2025-09-04 16:03:59.619 | INFO | dire_rapids.dire_pytorch:_find_ab_params:123 - Found kernel params: a=1.8956, b=0.8006
2025-09-04 16:03:59.619 | INFO | dire_rapids.dire_pytorch_memory_efficient:_compute_knn:109 - Forcing FP16 for large dataset (100000 samples, 100D)
2025-09-04 16:03:59.834 | INFO | dire_rapids.dire_pytorch_memory_efficient:_compute_knn:123 - Memory-efficient k-NN: chunk_size=11790, FP16=True
2025-09-04 16:03:59.834 | INFO | dire_rapids.dire_pytorch:_compute_knn:138 - Computing 16-NN graph for 100000 points in 100D...
2025-09-04 16:03:59.835 | INFO | dire_rapids.dire_pytorch:_compute_knn:150 - Using FP16 for k-NN (2x memory, faster on H100/A100)
2025-09-04 16:03:59.893 | INFO | dire_rapids.dire_pytorch:_compute_knn:166 - Using PyTorch for k-NN
2025-09-04 16:03:59.893 | INFO | dire_rapids.dire_pytorch:_compute_knn:186 - Using chunk size: 23580 (GPU memory: 14.6GB, dtype: torch.float16)
2025-09-04 16:03:59.894 | INFO | dire_rapids.dire_pytorch:_compute_knn:197 - Processing chunk 1/5
2025-09-04 16:04:00.665 | INFO | dire_rapids.dire_pytorch:_compute_knn:197 - Processing chunk 2/5
2025-09-04 16:04:00.962 | INFO | dire_rapids.dire_pytorch:_compute_knn:197 - Processing chunk 3/5
2025-09-04 16:04:01.259 | INFO | dire_rapids.dire_pytorch:_compute_knn:197 - Processing chunk 4/5
2025-09-04 16:04:01.556 | INFO | dire_rapids.dire_pytorch:_compute_knn:197 - Processing chunk 5/5
2025-09-04 16:04:01.636 | INFO | dire_rapids.dire_pytorch:_compute_knn:237 - k-NN graph computed: shape (100000, 16)
2025-09-04 16:04:01.833 | INFO | dire_rapids.dire_pytorch:_initialize_embedding:243 - Initializing with PCA
2025-09-04 16:04:01.908 | INFO | dire_rapids.dire_pytorch_memory_efficient:_optimize_layout:253 - Memory-efficient optimization for 100000 points...
2025-09-04 16:04:01.921 | INFO | dire_rapids.dire_pytorch_memory_efficient:_optimize_layout:259 - Initial GPU memory: 0.01/15.8 GB
2025-09-04 16:04:02.097 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_compute_forces:207 - Using random sampling for repulsion
2025-09-04 16:04:02.272 | INFO | dire_rapids.dire_pytorch_memory_efficient:_optimize_layout:272 - Iteration 0/128, avg force: 14.770476
2025-09-04 16:04:02.288 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_optimize_layout:281 - GPU memory: 0.01 GB
2025-09-04 16:04:02.295 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_compute_forces:207 - Using random sampling for repulsion
2025-09-04 16:04:02.313 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_compute_forces:207 - Using random sampling for repulsion
2025-09-04 16:04:02.330 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_compute_forces:207 - Using random sampling for repulsion
2025-09-04 16:04:02.347 | DEBUG | dire_rapids.dire_pytorch_memory_efficient:_compute_forces:207 - Using random sampling for repulsion
The final result is the expected image of 2D blobs.
Available Backends
- DiRePyTorch: Standard PyTorch implementation with adaptive chunking
- DiRePyTorchMemoryEfficient: Memory-optimized version with:
  - FP16 support for 2x memory savings
  - Point-by-point force computation
  - More aggressive memory management
  - PyKeOps LazyTensors for efficient repulsion (when available)
- DiReCuVS: RAPIDS cuVS backend for massive-scale datasets
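The adaptive chunking used by the PyTorch backends can be sketched in a few lines. This is an illustration of the idea in NumPy, not the library's code (which computes distances with torch on the GPU and derives chunk sizes from available memory):

```python
import numpy as np

# Minimal sketch of chunked k-NN: process query points in chunks so that
# only a chunk-by-n distance matrix is ever held in memory at once.
def knn_chunked(X, k, chunk_size=1024):
    n = X.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared distances from this chunk to all points
        d2 = ((chunk[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        # Keep the k nearest, dropping each point itself (distance 0)
        neighbors[start:start + chunk_size] = np.argsort(d2, axis=1)[:, 1:k + 1]
    return neighbors

X = np.random.default_rng(42).normal(size=(500, 8))
print(knn_chunked(X, k=16).shape)  # (500, 16)
```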
Auto Backend Selection
Use the create_dire() function for automatic backend selection based on available hardware:
from dire_rapids import create_dire
# Auto-select optimal backend
reducer = create_dire(
    n_neighbors=32,
    metric='(x - y).abs().sum(-1)',  # Custom L1 metric
    verbose=True
)
X_embedded = reducer.fit_transform(X)
# Force memory-efficient backend
reducer = create_dire(
    memory_efficient=True,
    use_fp16=True,
    metric=cosine_distance  # Custom callable metric
)
X_embedded = reducer.fit_transform(X)
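The "2x memory savings" promised by use_fp16 is simply half precision taking half the bytes; a quick check (shapes chosen to match the 100,000 x 100 example above):

```python
import numpy as np

# The same matrix stored in half precision takes half the bytes,
# which is where the 2x k-NN memory saving of use_fp16 comes from.
X32 = np.zeros((100_000, 100), dtype=np.float32)
X16 = X32.astype(np.float16)
print(X32.nbytes // 2**20, "MiB vs", X16.nbytes // 2**20, "MiB")  # 38 MiB vs 19 MiB
```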
ReducerRunner Framework
General-purpose framework for running dimensionality reduction algorithms with automatic data loading and reducer comparison. See benchmarking/dire_rapids_benchmarks.ipynb for complete examples.
Quick Start with ReducerRunner
from dire_rapids.utils import ReducerRunner, ReducerConfig
from dire_rapids import create_dire
# Create a configuration
config = ReducerConfig(
    name="DiRe",
    reducer_class=create_dire,
    reducer_kwargs={"n_neighbors": 16},
    visualize=True,
    max_points=10000  # Max points for visualization (uses WebGL, subsamples if larger)
)
# Run on various datasets
runner = ReducerRunner(config=config)
result = runner.run("sklearn:blobs")
result = runner.run("dire:sphere_uniform", dataset_kwargs={"n_features": 10, "n_samples": 1000})
Data Sources
- sklearn:name - sklearn datasets (blobs, digits, iris, wine, moons, swiss_roll, etc.)
- openml:name - OpenML datasets by name or ID
- cytof:name - CyTOF datasets (levine13, levine32)
- dire:name - DiRe geometric datasets (disk_uniform, sphere_uniform, ellipsoid_uniform)
- file:path - Local files (.csv, .npy, .npz, .parquet)
Comparing Multiple Reducers
from benchmarking.compare_reducers import compare_reducers, print_comparison_summary
from dire_rapids.utils import ReducerConfig
from dire_rapids import create_dire
# Compare default reducers (DiRe, cuML UMAP, cuML TSNE)
results = compare_reducers("sklearn:blobs", metrics=['distortion', 'context'])
print_comparison_summary(results)
# Compare specific reducers
from cuml import UMAP
reducers = [
    ReducerConfig("DiRe", create_dire, {"n_neighbors": 16}),
    ReducerConfig("UMAP", UMAP, {"n_neighbors": 15})
]
results = compare_reducers("sklearn:digits", reducers=reducers)
Metrics Module
Evaluation metrics for dimensionality reduction quality:
from dire_rapids.metrics import evaluate_embedding
# Full evaluation
results = evaluate_embedding(data, layout, labels, compute_topology=True)
print(f"Stress: {results['local']['stress']:.4f}")
print(f"SVM accuracy: {results['context']['svm'][1]:.4f}")
print(f"DTW β₀: {results['topology']['metrics']['dtw_beta0']:.6f}")
print(f"DTW β₁: {results['topology']['metrics']['dtw_beta1']:.6f}")
Metrics:
- Distortion: stress, neighborhood preservation
- Context: SVM/kNN classification accuracy
- Topology: DTW distances between Betti curves (β₀, β₁)
See METRICS_README.md and examples/metrics_swiss_roll.py.
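The stress component of the distortion metrics can be sketched as a normalized difference between pairwise distance matrices. This is an assumed textbook definition for illustration; see METRICS_README.md for the exact formulas the library uses:

```python
import numpy as np

# Sketch of normalized stress between the pairwise distances of the
# original data (D_high) and of the embedding (D_low); assumed
# definition, the library's exact formula may differ.
def stress(D_high, D_low):
    return np.sqrt(((D_high - D_low) ** 2).sum() / (D_high ** 2).sum())

rng = np.random.default_rng(0)
D = rng.random((50, 50))
print(stress(D, D))  # identical distances: stress = 0.0
```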
Testing
# Run CPU tests (CI)
pytest tests/test_cpu_basic.py tests/test_reducer_runner.py -v
# Run all tests
pytest tests/ -v
# Test comprehensive suite
python tests/test_comprehensive.py
Citation
If you use this work, please cite it as:
BibTeX:
@misc{kolpakov-rivin-2025dimensionality,
  title={Dimensionality reduction for homological stability and global structure preservation},
  author={Kolpakov, Alexander and Rivin, Igor},
  year={2025},
  eprint={2503.03156},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.03156}
}
APA Style:
Kolpakov, A., & Rivin, I. (2025). Dimensionality reduction for homological stability and global structure preservation. arXiv preprint arXiv:2503.03156. https://arxiv.org/abs/2503.03156
Requirements
- Python 3.8-3.12
- PyTorch 2.0+
- PyKeOps 2.1+
- NumPy, SciPy, scikit-learn
- (Optional) CUDA 12.x+ for GPU acceleration
- (Optional) RAPIDS 23.08+ for cuVS backend