GPU-accelerated subsampling of large datasets for Machine Learning Interatomic Potentials (MLIPs).
Project description
SubDataPy
SubDataPy is a GPU-accelerated Python toolkit for subsampling large datasets, designed for Machine Learning Interatomic Potentials (MLIPs). It implements statistical subsampling methods — random, leverage score, and Cook's distance — that operate on grouped rows (atomic configurations with energy + force rows).
Subsampling has been shown to not only increase training speed by reducing dataset size, but also potentially increase testing accuracy by removing redundant data.
Table of Contents
- Key Features
- Installation
- Quick Start
- Core Concepts
- Data Loading
- Subsampling Methods
- Training and Error Evaluation
- Batch Evaluation
- Multi-GPU and Distributed Usage
- Generating Descriptor Matrices
- API Reference
- Examples
- Testing
- Mathematical Reference
- Citation
Key Features
- Three subsampling strategies: Random, Leverage Score, Cook's Distance (one-step and stepwise)
- Configuration-aware: Treats groups of rows (1 energy + N force rows per atomic configuration) as atomic units — never splits a configuration
- Block mode: For leverage and Cook's, aggregate scores across all rows of a configuration rather than using only the energy row
- GPU-accelerated: All linear algebra (SVD, QR, least squares, Woodbury updates) runs on CUDA GPUs via PyTorch
- Distributed TSQR: Multi-GPU support via
torch.distributed(NCCL backend) for datasets that don't fit in GPU memory - Flexible data input: Accepts
.npyfile paths, NumPy arrays, Pandas DataFrames, or PyTorch tensors - Batch evaluation:
create_subsample_errors_dataframe()sweeps over multiple subsample fractions with repeats and returns a structured Pandas DataFrame
Installation
pip install subdatapy
Note on PyTorch:
pip install subdatapypulls the default PyTorch wheel from PyPI — a CPU build on macOS/Windows, and the CUDA-enabled build on Linux x86_64. On some machines (e.g. if you need a specific CUDA or ROCm version), you may have to install PyTorch first, following the official instructions, and then runpip install subdatapy.
For development (tests and coverage):
git clone https://github.com/IlgarBaghishov/SubDataPy.git
cd SubDataPy
# Install in editable mode with test dependencies
pip install -e ".[test]"
Requirements
- Python >= 3.11
- PyTorch >= 2.8 (with CUDA support recommended)
- NumPy >= 1.20
- Pandas >= 1.0
Quick Start
import numpy as np
from subdatapy.subsampler import RandomSubSampler, LeverageSubSampler, CookSubSampler
# Load data (or pass file paths directly)
X = np.load("X.npy") # (n_rows, n_features) descriptor matrix
y = np.load("y.npy") # (n_rows,) targets (energies + forces)
w = np.load("w.npy") # (n_rows,) weights
config_idxs = np.load("config_idxs.npy") # (n_rows,) integer config IDs
# Create a random subsample at 50% of configurations
rs = RandomSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2,
seed=42,
device='cuda',
)
rs.create_subsample(subsample_fraction=0.5, seed=123)
rs.train_subsample()
errors = rs.compute_subsample_errors(verbose=True)
# Returns: (sub_e_train, sub_f_train, e_train, f_train, e_test, f_test)
Core Concepts
Configurations and Grouping
In MLIP training, each atomic configuration produces one energy row and N force rows (where N = 3 * number of atoms). SubDataPy always selects or removes entire configurations, never individual rows.
config_idxs: An integer array of lengthn_rowsmapping each row to its configuration ID. All rows with the same ID belong to the same configuration.enrow_mask: A boolean array of lengthn_rowswhereTruemarks energy rows. If not provided, SubDataPy uses the first row of each configuration as the energy row.
Block vs Non-Block Mode
- Non-block mode (
block=False, default): Leverage scores and Cook's distances are computed using only the energy row of each configuration. Faster, good for initial exploration. - Block mode (
block=True): Scores are computed using all rows (energy + forces) of each configuration as a block. More statistically principled but more expensive.
Weighted Least Squares
All fitting uses Weighted Least Squares (WLS). The weight vector w allows different importance for energy vs force rows (e.g., w_energy=100, w_force=1). Internally, the design matrix and targets are pre-multiplied by weights before solving.
Data Flow
- Data loads to CPU (from
.npyfiles, NumPy arrays, or DataFrames) - An intercept column of ones is optionally prepended to X (
intercept=True) - Train/test split moves subsets to GPU (
device='cuda') - All linear algebra runs on GPU in
torch.float64(double precision)
Data Loading
SubDataPy accepts multiple input formats for X, y, w, config_idxs, and enrow_mask:
# File paths (.npy)
rs = RandomSubSampler(X="path/to/X.npy", y="path/to/y.npy", ...)
# NumPy arrays
rs = RandomSubSampler(X=np_array, y=np_array, ...)
# Pandas DataFrames
rs = RandomSubSampler(X=df_features, y=df_targets, ...)
# PyTorch tensors
rs = RandomSubSampler(X=torch_tensor, y=torch_tensor, ...)
All inputs are converted to torch.float64 tensors internally.
Subsampling Methods
Random Subsampling
Uniformly random selection of configurations. Serves as a baseline.
from subdatapy.subsampler import RandomSubSampler
rs = RandomSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
enrow_mask=enrow_mask, # optional, auto-detected if None
test_fraction=0.2, # fraction of configs for test set
seed=42, # seed for train/test split
intercept=True, # prepend intercept column (default)
device='cuda', # 'cuda' or 'cpu'
)
# Create subsample: keep 50% of training configurations
mask = rs.create_subsample(subsample_fraction=0.5, seed=123)
# mask is a boolean tensor over ALL rows (train + test)
# Train on the subsample
rs.train_subsample()
# Evaluate errors
e_sub, f_sub, e_train, f_train, e_test, f_test = rs.compute_subsample_errors(verbose=True)
Leverage Score Subsampling
Samples configurations proportional to their statistical leverage scores. Configurations with higher leverage (more influential on the fit) are more likely to be selected.
from subdatapy.subsampler import LeverageSubSampler
ls = LeverageSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2,
seed=42,
device='cuda',
# Leverage-specific parameters:
block=False, # True: sum leverage over all rows per config
# False: use energy row leverage only
# Factorization method:
factorization='auto', # 'auto', 'svd', or 'qr'
# auto: SVD for single-GPU, QR if chunked/distributed
n_chunks=None, # number of TSQR chunks (for large matrices)
# Pre-computed SVD factors (optional, skip SVD computation):
U=None, S=None, Vh=None,
)
ls.create_subsample(subsample_fraction=0.3, seed=123)
ls.train_subsample()
errors = ls.compute_subsample_errors()
# Access leverage scores after create_subsample:
print(ls.leverage_scores) # (n_configs,) tensor
How it works:
- Compute weighted X:
X_w = diag(w) @ X - Compute SVD:
U, S, Vh = svd(X_w)(or QR factorization) - Row leverage scores:
h_i = sum(U[i,:]^2)(orh_i = ||R^{-T} x_i||^2for QR) - Block mode: sum leverage scores within each configuration
- Non-block mode: use the energy row's leverage score
- Sample configurations proportional to normalized leverage scores
Cook's Distance Subsampling
Selects configurations by their influence on the model fit, measured by Cook's distance. Supports two approaches:
One-Step Cook's Distance
Computes Cook's distance for each configuration in a single pass, then samples proportional to scores (or selects top-k).
from subdatapy.subsampler import CookSubSampler
# One-step (non-block only)
cs = CookSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2,
seed=42,
device='cuda',
stepwise=False, # one-step mode
sampling=True, # True: sample proportional to Cook's scores
# False: select top-k configurations
)
cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors()
# Access Cook's distance scores:
print(cs.onestep_en_cooks) # (n_configs,) tensor
One-step formula (energy rows only):
D_i = e_i^2 * h_i / (1 - h_i)^2
where e_i is the residual and h_i is the leverage score of the i-th energy row.
Stepwise Cook's Distance
Greedy iterative method: starts from a small initial subset, then adds (ascending) or removes (descending) one configuration per iteration based on which has the highest Cook's distance influence on the current model.
# Stepwise ascending with random initial subset
cs = CookSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2,
seed=42,
device='cuda',
stepwise=True, # stepwise mode
ascending=True, # True: grow subset from small initial
# False: shrink from large initial
# Initial subset:
initial_subsampler="random", # "random" or "leverage"
initial_subsample_fraction=0.05, # start with 5% of configs
# Block mode:
block=False, # True: block Cook's (all rows per config)
# False: energy-row Cook's only
# Factorization for initial (X'X)^{-1}:
factorization='auto', # 'auto', 'svd', or 'qr'
# auto: SVD single-process, QR if chunked/distributed
# Update method per iteration:
update_method='auto', # 'auto', 'woodbury', or 'qr'
# auto: QR for block (stable), Woodbury for non-block (fast)
# Chunked/distributed:
n_chunks=None, # TSQR chunks (for large data, or multi-GPU)
tree_reduction_threshold=10, # reduce R matrices after this many
)
cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors()
Stepwise algorithm:
- Create initial subset (random or leverage-based, at
initial_subsample_fraction) - Compute initial factorization:
(X'X)^{-1}via SVD or QR - For each iteration until target fraction is reached:
- Compute coefficients:
beta = (X'X)^{-1} X'y - Compute Cook's distance for each candidate configuration
- Select the configuration with highest (ascending) or lowest (descending) Cook's distance
- Update
(X'X)^{-1}incrementally via Woodbury identity or QR re-factorization
- Compute coefficients:
- Return the final subset mask
Non-block Cook's formula (per energy row):
D_i = e_i^2 * h_i / (1 + h_i)
Block Cook's formula (per configuration group c):
D_c = r_c' @ inv(I +/- H_c) @ H_c @ r_c
where H_c = X_c @ (X'X)^{-1} @ X_c' is the block leverage matrix and r_c = X_c @ beta - y_c.
Training and Error Evaluation
Full Training (No Subsampling)
from subdatapy.data import BaseData
bd = BaseData(
X, y=y, w=w,
config_idxs=config_idxs,
enrow_mask=enrow_mask,
intercept=True,
device='cuda',
)
bd.train_test_split(test_fraction=0.2, seed=42)
# Standard least squares
bd.train() # method='lstsq' (default)
# Or QR-based (supports chunking for large matrices)
bd.train(method='qr', n_chunks=4)
# Evaluate
e_train, f_train, e_test, f_test = bd.compute_errors(verbose=True)
# Access coefficients
print(bd.coeffs.shape) # (n_features + 1, 1) if intercept=True
Subsample Training
After creating a subsample with any subsampler:
# Train on the subsample
subsampler.train_subsample()
# Evaluate errors on subsampled training, full training, and test sets
e_sub, f_sub, e_train, f_train, e_test, f_test = subsampler.compute_subsample_errors(verbose=True)
The 6 returned RMSE values are:
- Subsampled Training Energy RMSE — energy rows in the subsample
- Subsampled Training Force RMSE — force rows in the subsample
- Entire Training Energy RMSE — all training energy rows
- Entire Training Force RMSE — all training force rows
- Testing Energy RMSE — test set energy rows
- Testing Force RMSE — test set force rows
Batch Evaluation
Sweep over multiple subsample fractions with repeated trials in a single call:
errors_df = subsampler.create_subsample_errors_dataframe(
subsample_fractions_list=[0.05, 0.1, 0.2, 0.5],
repeat_count_list=5, # 5 repeats per fraction (int applies to all)
seed=42,
verbose=False,
)
# errors_df is a Pandas DataFrame with MultiIndex columns:
# Level 0: Error Type (e.g., "Testing Energy RMSE")
# Level 1: Subsample Fraction (e.g., 0.05, 0.1, 0.2, 0.5)
# Rows are repeats.
print(errors_df["Testing Energy RMSE"])
For stepwise Cook's distance, each successive fraction in subsample_fractions_list reuses the previous subset as initial state (the greedy loop continues rather than restarting), allowing efficient evaluation of the subsampling curve.
Variable repeats per fraction:
# More repeats for small fractions (high variance), fewer for large fractions
errors_df = subsampler.create_subsample_errors_dataframe(
subsample_fractions_list=[0.05, 0.1, 0.2, 0.5],
repeat_count_list=[10, 10, 5, 5], # must be non-increasing
seed=42,
)
Multi-GPU and Distributed Usage
For datasets that exceed single-GPU memory, SubDataPy supports chunked processing and multi-GPU distribution via torch.distributed with the NCCL backend.
Single-GPU Chunking
Stream data through GPU in chunks (no distributed setup needed):
cs = CookSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2, seed=42,
device='cuda',
stepwise=True, ascending=True,
factorization='qr', # must use QR for chunking
n_chunks=8, # process in 8 sequential chunks
)
cs.create_subsample(subsample_fraction=0.5, seed=123)
This keeps only one chunk on GPU at a time, reducing peak memory from O(n*p) to O(n/8 * p).
Multi-GPU (torchrun)
# 4 GPUs on one node
torchrun --nproc_per_node=4 my_script.py
# Two nodes, one rank per node, each rank using all 4 local GPUs via
# the `local_devices` parameter (see Partitioned Mode below)
srun -N 2 --ntasks-per-node=1 --gpus-per-task=4 \
torchrun --nnodes=2 --nproc_per_node=1 my_script.py
# my_script.py
import torch.distributed as dist
from subdatapy.subsampler import CookSubSampler
dist.init_process_group(backend='nccl')
# Under torchrun, SubDataPy requires partitioned mode — pass .npy file
# paths for X and config_idxs so BaseData auto-enables it. Passing
# in-memory tensors while distributed raises ValueError.
cs = CookSubSampler(
X="data/X.npy", y="data/y.npy", w="data/w.npy",
config_idxs="data/config_idxs.npy",
test_fraction=0.2, seed=42,
device='cuda',
stepwise=True, ascending=True,
factorization='qr',
n_chunks=16, # 16 total chunks, 4 per GPU
)
cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors(verbose=True) # all ranks participate
dist.destroy_process_group()
Partitioned Mode (OOM-free distributed)
When BaseData receives .npy file paths for both X and config_idxs under torch.distributed, it auto-enters partitioned mode:
- Rank 0 scans the
config_idxsfile and computes partition boundaries at configuration boundaries (never mid-config) viapartition.compute_partition_ranges. The ranges are broadcast to all ranks. - Each rank
mmap-loads only its slice ofX,y,w,config_idxs,enrow_mask— memory isD/world_sizeper rank. - Train/test split uses the global config-id list so every rank agrees on which configs are test. Subsamplers use a global
unique_config_idxs_train(built viapartition.build_global_config_ids) so sampling with the same seed yields the same chosen configs across ranks. - TSQR runs in
partitioned=Truemode: each rank QRs its local rows; rank 0 gathers per-rank R matrices and does a final QR; R is broadcast back to all ranks for downstream use. Only small tensors (p×p R, p×1 XTy, and the single winning config's rows per stepwise iteration) travel across NCCL.
# Target: 2 nodes × 1 rank × 4 local GPUs = 8 GPUs, 2 CPU-RAM partitions.
# Each rank sees all 4 local GPUs and round-robins chunks across them.
cs = CookSubSampler(
X="data/X.npy", y="data/y.npy", w="data/w.npy",
config_idxs="data/config_idxs.npy",
enrow_mask="data/enrow_mask.npy",
device='cuda:0',
local_devices=['cuda:0','cuda:1','cuda:2','cuda:3'], # round-robin TSQR chunks
stepwise=True, ascending=True,
factorization='qr',
)
local_devices is auto-detected from torch.cuda.device_count() if not supplied. When LOCAL_WORLD_SIZE==1 (one rank per node), it defaults to every visible GPU; when multiple ranks share a node, it defaults to [device] and each rank gets one GPU.
_is_partitioned is the flag set by the auto-detection. You can force it via the partitioned_override= kwarg (used internally by CookSubSampler to inherit the parent's partition state in nested samplers).
Partitioned mode requires world_size > 1. Calling tsqr_r(partitioned=True) under a single process raises ValueError.
TSQR Execution Modes
n_chunks |
Setup | Mode | Description |
|---|---|---|---|
| None | single-process, one local_devices |
Single-pass | QR(X) directly on the device |
| N | single-process | Chunked | Stream N chunks through the primary device (or round-robin across local_devices) |
| None / per-rank N | partitioned multi-rank (torchrun + file paths) | Partitioned TSQR | Each rank QRs its local partition; rank 0 gathers and reduces across ranks |
Only two NCCL collectives are needed: gather (R matrices) and reduce (X'y sum). All communication is GPU-to-GPU. Replicated-distributed mode — where every rank holds the full matrix and TSQR merely splits the row range — was removed because partitioned is strictly more memory-efficient.
Distributed Leverage Scores
from subdatapy.subsampler import LeverageSubSampler
ls = LeverageSubSampler(
X, y=y, w=w,
config_idxs=config_idxs,
test_fraction=0.2, seed=42,
device='cuda',
factorization='qr',
n_chunks=8,
)
ls.create_subsample(subsample_fraction=0.3, seed=123)
Distributed Full Training
from subdatapy.data import BaseData
bd = BaseData(X, y=y, w=w, config_idxs=config_idxs, device='cuda')
bd.train_test_split(test_fraction=0.2, seed=42)
bd.train(method='qr', n_chunks=8)
Generating Descriptor Matrices
SubDataPy works with pre-computed descriptor matrices. For MLIP workflows, you typically generate these using a descriptor calculator like FitSNAP.
An example FitSNAP helper script is provided in tools/generate_matrices/FitSNAP/qSNAP.py. It:
- Reads atomic structures and DFT energies/forces from HDF5 files
- Computes SNAP bispectrum descriptors via FitSNAP
- Saves
X.npy,y.npy,w.npy,config_idxs.npyfor use with SubDataPy
For benchmarking with synthetic data, see examples/mpi_benchmark/generate_data.py.
API Reference
BaseData
Base class for data handling. All subsamplers inherit from this.
from subdatapy.data import BaseData
bd = BaseData(X, y=None, w=None, config_idxs=None, enrow_mask=None,
intercept=True, device='cuda')
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
X |
array-like or str | required | Feature matrix (n_rows, n_features). .npy path, NumPy array, DataFrame, or tensor. |
y |
array-like or str | None |
Target vector (n_rows,). Same formats as X. |
w |
array-like or str | None |
Weight vector (n_rows,). Defaults to all ones. |
config_idxs |
array-like or str | None |
Configuration ID per row (n_rows,). If None, each row is its own config. |
enrow_mask |
array-like or str | None |
Boolean mask for energy rows (n_rows,). If None, first row per config is energy. |
intercept |
bool | True |
Prepend a column of ones to X. |
device |
str | 'cuda' |
PyTorch device for computation. |
Methods:
| Method | Description |
|---|---|
train_test_split(test_fraction, seed, test_mask) |
Split data into train/test sets by configuration. |
train(method='lstsq', n_chunks=None) |
Weighted least squares. method='qr' uses TSQR. |
compute_errors(verbose=True) |
Returns (e_train, f_train, e_test, f_test) RMSE tuple. |
RandomSubSampler
from subdatapy.subsampler import RandomSubSampler
rs = RandomSubSampler(X, y=None, w=None, test_fraction=0.0, seed=None,
test_mask=None, config_idxs=None, enrow_mask=None,
intercept=True, device='cuda')
Inherits all BaseData parameters plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
test_fraction |
float | 0.0 |
Fraction of configurations for the test set. |
seed |
int | None |
Random seed for train/test split. |
test_mask |
array-like | None |
Explicit boolean test mask (overrides test_fraction). |
Methods:
| Method | Description |
|---|---|
create_subsample(subsample_fraction, seed=None) |
Create a subsample. Returns a boolean mask over all rows. |
train_subsample() |
Train WLS on the subsampled data. |
compute_subsample_errors(verbose=False) |
Returns 6-tuple of RMSE values (sub-train, full-train, test for energy + force). |
create_subsample_errors_dataframe(subsample_fractions_list, repeat_count_list=1, seed=None, verbose=False) |
Sweep fractions with repeats. Returns MultiIndex DataFrame. |
LeverageSubSampler
from subdatapy.subsampler import LeverageSubSampler
ls = LeverageSubSampler(X, y=None, w=None, test_fraction=0.0, seed=None,
config_idxs=None, enrow_mask=None, intercept=True,
device='cuda', block=False, U=None, S=None, Vh=None,
factorization='auto', n_chunks=None)
Inherits all RandomSubSampler parameters plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
block |
bool | False |
Sum leverage scores across all rows per configuration. |
U, S, Vh |
tensors | None |
Pre-computed SVD factors to skip SVD computation. |
factorization |
str | 'auto' |
'auto', 'svd', or 'qr'. Auto selects SVD unless chunked/distributed. |
n_chunks |
int | None |
Number of TSQR chunks for QR factorization. |
Attributes set after create_subsample():
| Attribute | Description |
|---|---|
leverage_scores |
Per-configuration leverage scores (n_configs,) tensor. |
CookSubSampler
from subdatapy.subsampler import CookSubSampler
cs = CookSubSampler(X, y, w=None, test_fraction=0.0, seed=None, test_mask=None,
config_idxs=None, enrow_mask=None, intercept=True,
device='cuda', block=False, stepwise=False, sampling=True,
ascending=True, initial_subsampler="random",
initial_subsample_fraction=1, U=None, S=None, Vh=None,
n_chunks=None, factorization='auto', update_method='auto',
tree_reduction_threshold=10)
Inherits all RandomSubSampler parameters plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
stepwise |
bool | False |
Stepwise greedy selection (True) or one-step scoring (False). |
block |
bool | False |
Block Cook's distance using all rows per config. |
sampling |
bool | True |
One-step only: sample proportional to Cook's (True) or top-k (False). |
ascending |
bool | True |
Stepwise only: grow subset (True) or shrink (False). |
initial_subsampler |
str | "random" |
Stepwise only: "random" or "leverage" initial subset. |
initial_subsample_fraction |
float | 1 |
Stepwise only: fraction for the initial subset. |
U, S, Vh |
tensors | None |
Pre-computed SVD factors (one-step mode). |
factorization |
str | 'auto' |
'auto', 'svd', or 'qr'. For initial (X'X)^{-1}. |
update_method |
str | 'auto' |
'auto', 'woodbury', or 'qr'. Per-iteration update. Auto = QR for block, Woodbury for non-block. |
n_chunks |
int | None |
TSQR chunks for initial factorization and chunked processing. |
tree_reduction_threshold |
int | 10 |
Reduce accumulated R matrices after this many chunks. |
Attributes set after create_subsample():
| Attribute | Description |
|---|---|
onestep_en_cooks |
One-step Cook's scores (n_configs,). Only set if stepwise=False. |
XTX_inv |
Current (X'X)^{-1} at end of stepwise. |
XTy |
Current X'y at end of stepwise. |
R_final |
QR R factor (only when using QR factorization). |
linalg Module
Pure functions for distributed linear algebra. Used internally by subsamplers but available for direct use.
from subdatapy import linalg
TSQR Functions:
| Function | Description |
|---|---|
tsqr_r(X, device, n_chunks, tree_reduction_threshold) |
Compute R factor of QR(X). Returns R on rank 0, None on others. |
tsqr_r_xty(X, y, device, n_chunks, tree_reduction_threshold) |
Compute R and X'y simultaneously. Returns (R, X'y) on rank 0. |
Inverse/Solve Functions:
| Function | Description |
|---|---|
xtx_inv_from_r(R, device) |
(X'X)^{-1} = R^{-1} R^{-T}. Inversion on CPU in float64. |
xtx_inv_from_svd(X, device) |
(X'X)^{-1} via SVD with singular value filtering. Returns (inv, U, S, Vh). |
solve_from_r_xty(R, XTy) |
Solve beta = R^{-1} R^{-T} X'y via two triangular solves. |
Update Functions:
| Function | Description |
|---|---|
woodbury_update(XTX_inv, X_change, y_change, XTy, ascending) |
Rank-k Woodbury/Sherman-Morrison update for adding or removing rows. |
qr_update_add(R, X_new, y_new, XTy, device) |
QR-based update: append rows and re-factorize. Returns (R_new, inv_new, XTy_new). |
Leverage:
| Function | Description |
|---|---|
leverage_scores_from_r(X, R, device) |
h_i = ||R^{-T} x_i||^2. QR-based leverage scores. |
Distributed Helpers:
| Function | Description |
|---|---|
get_rank() |
Current process rank (0 if not distributed). |
get_world_size() |
World size (1 if not distributed). |
is_distributed() |
True if torch.distributed is active with world_size > 1. |
Examples
Jupyter Notebooks
examples/Be_qSNAP_8/example1.ipynb— Beryllium qSNAP descriptor subsamplingexamples/lithium/example1.ipynb— Lithium dataset subsampling
Benchmark Script
The examples/mpi_benchmark/ directory contains a comprehensive benchmark:
# Generate synthetic data (2000 configs, 200 features)
python examples/mpi_benchmark/generate_data.py
# Run single-GPU benchmarks
python examples/mpi_benchmark/benchmark.py --method random --subsample_fraction 0.5
python examples/mpi_benchmark/benchmark.py --method leverage --block
python examples/mpi_benchmark/benchmark.py --method cooks --stepwise --ascending
# Run distributed benchmark
torchrun --nproc_per_node=4 examples/mpi_benchmark/benchmark.py \
--method cooks --stepwise --ascending --factorization qr
# Run the full benchmark suite (NERSC Perlmutter)
sbatch examples/mpi_benchmark/run_all.sh
Testing
# Run all tests
pytest
# With verbose output
pytest -v
# With coverage
pytest --cov=subdatapy --cov-report=term-missing
# Run only linalg tests
pytest tests/test_linalg.py -v
# Run only subsampler tests
pytest tests/test_subsamplers.py -v
Tests require a CUDA GPU by default. On CPU-only systems, tests run with device='cpu' (different expected RMSE values are used).
Mathematical Reference
For detailed derivations, equations, and code cross-references for all algorithms implemented in SubDataPy, see MATH.md.
Citation
If you use SubDataPy in your research, please cite:
@software{subdatapy,
author = {Ilgar Baghishov},
title = {SubDataPy: A Python Toolkit for Subsampling Large Datasets for MLIPs},
url = {https://github.com/IlgarBaghishov/SubDataPy},
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file subdatapy-0.1.0.tar.gz.
File metadata
- Download URL: subdatapy-0.1.0.tar.gz
- Upload date:
- Size: 46.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c2921ea3092674634770c96f0de54a1a54101a70485e118729de16dc11afbea8
|
|
| MD5 |
f2591b4127b0b13e28848664353c14a6
|
|
| BLAKE2b-256 |
6246ab8138a34cb7c4c1b63ab6dd22460ef6cf6c054a1bea34d2f65bf673046d
|
File details
Details for the file subdatapy-0.1.0-py3-none-any.whl.
File metadata
- Download URL: subdatapy-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9226dfa45018197099d149fc54e2f2da9ca7275732f2dba7ae1d567d06027748
|
|
| MD5 |
90e3feb00d14c20edc5d8f8496ac1a37
|
|
| BLAKE2b-256 |
62e629bb5e6da0c3f83e7d2f54749b81faff5b8ea0099e15b497597d147b5875
|