GPU-accelerated subsampling of large datasets for Machine Learning Interatomic Potentials (MLIPs).

These details have not been verified by PyPI

Project links

Project description

SubDataPy

SubDataPy is a GPU-accelerated Python toolkit for subsampling large datasets, designed for Machine Learning Interatomic Potentials (MLIPs). It implements statistical subsampling methods — random, leverage score, and Cook's distance — that operate on grouped rows (atomic configurations with energy + force rows).

Subsampling has been shown to not only increase training speed by reducing dataset size, but also potentially increase testing accuracy by removing redundant data.

Key Features
Installation
Quick Start
Core Concepts
Data Loading
Subsampling Methods
Training and Error Evaluation
Batch Evaluation
Multi-GPU and Distributed Usage
Generating Descriptor Matrices
API Reference
Examples
Testing
Mathematical Reference
Citation

Key Features

Three subsampling strategies: Random, Leverage Score, Cook's Distance (one-step and stepwise)
Configuration-aware: Treats groups of rows (1 energy + N force rows per atomic configuration) as atomic units — never splits a configuration
Block mode: For leverage and Cook's, aggregate scores across all rows of a configuration rather than using only the energy row
GPU-accelerated: All linear algebra (SVD, QR, least squares, Woodbury updates) runs on CUDA GPUs via PyTorch
Distributed TSQR: Multi-GPU support via torch.distributed (NCCL backend) for datasets that don't fit in GPU memory
Flexible data input: Accepts .npy file paths, NumPy arrays, Pandas DataFrames, or PyTorch tensors
Batch evaluation: create_subsample_errors_dataframe() sweeps over multiple subsample fractions with repeats and returns a structured Pandas DataFrame

Installation

pip install subdatapy

Note on PyTorch: pip install subdatapy pulls the default PyTorch wheel from PyPI — a CPU build on macOS/Windows, and the CUDA-enabled build on Linux x86_64. On some machines (e.g. if you need a specific CUDA or ROCm version), you may have to install PyTorch first, following the official instructions, and then run pip install subdatapy.

For development (tests and coverage):

git clone https://github.com/IlgarBaghishov/SubDataPy.git
cd SubDataPy

# Install in editable mode with test dependencies
pip install -e ".[test]"

Requirements

Python >= 3.11
PyTorch >= 2.8 (with CUDA support recommended)
NumPy >= 1.20
Pandas >= 1.0

Quick Start

import numpy as np
from subdatapy.subsampler import RandomSubSampler, LeverageSubSampler, CookSubSampler

# Load data (or pass file paths directly)
X = np.load("X.npy")            # (n_rows, n_features) descriptor matrix
y = np.load("y.npy")            # (n_rows,) targets (energies + forces)
w = np.load("w.npy")            # (n_rows,) weights
config_idxs = np.load("config_idxs.npy")  # (n_rows,) integer config IDs

# Create a random subsample at 50% of configurations
rs = RandomSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2,
    seed=42,
    device='cuda',
)
rs.create_subsample(subsample_fraction=0.5, seed=123)
rs.train_subsample()
errors = rs.compute_subsample_errors(verbose=True)
# Returns: (sub_e_train, sub_f_train, e_train, f_train, e_test, f_test)

Core Concepts

Configurations and Grouping

In MLIP training, each atomic configuration produces one energy row and N force rows (where N = 3 * number of atoms). SubDataPy always selects or removes entire configurations, never individual rows.

config_idxs: An integer array of length n_rows mapping each row to its configuration ID. All rows with the same ID belong to the same configuration.
enrow_mask: A boolean array of length n_rows where True marks energy rows. If not provided, SubDataPy uses the first row of each configuration as the energy row.

Block vs Non-Block Mode

Non-block mode (block=False, default): Leverage scores and Cook's distances are computed using only the energy row of each configuration. Faster, good for initial exploration.
Block mode (block=True): Scores are computed using all rows (energy + forces) of each configuration as a block. More statistically principled but more expensive.

Weighted Least Squares

All fitting uses Weighted Least Squares (WLS). The weight vector w allows different importance for energy vs force rows (e.g., w_energy=100, w_force=1). Internally, the design matrix and targets are pre-multiplied by weights before solving.

Data Flow

Data loads to CPU (from .npy files, NumPy arrays, or DataFrames)
An intercept column of ones is optionally prepended to X (intercept=True)
Train/test split moves subsets to GPU (device='cuda')
All linear algebra runs on GPU in torch.float64 (double precision)

Data Loading

SubDataPy accepts multiple input formats for X, y, w, config_idxs, and enrow_mask:

# File paths (.npy)
rs = RandomSubSampler(X="path/to/X.npy", y="path/to/y.npy", ...)

# NumPy arrays
rs = RandomSubSampler(X=np_array, y=np_array, ...)

# Pandas DataFrames
rs = RandomSubSampler(X=df_features, y=df_targets, ...)

# PyTorch tensors
rs = RandomSubSampler(X=torch_tensor, y=torch_tensor, ...)

All inputs are converted to torch.float64 tensors internally.

Subsampling Methods

Random Subsampling

Uniformly random selection of configurations. Serves as a baseline.

from subdatapy.subsampler import RandomSubSampler

rs = RandomSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    enrow_mask=enrow_mask,       # optional, auto-detected if None
    test_fraction=0.2,           # fraction of configs for test set
    seed=42,                     # seed for train/test split
    intercept=True,              # prepend intercept column (default)
    device='cuda',               # 'cuda' or 'cpu'
)

# Create subsample: keep 50% of training configurations
mask = rs.create_subsample(subsample_fraction=0.5, seed=123)
# mask is a boolean tensor over ALL rows (train + test)

# Train on the subsample
rs.train_subsample()

# Evaluate errors
e_sub, f_sub, e_train, f_train, e_test, f_test = rs.compute_subsample_errors(verbose=True)

Leverage Score Subsampling

Samples configurations proportional to their statistical leverage scores. Configurations with higher leverage (more influential on the fit) are more likely to be selected.

from subdatapy.subsampler import LeverageSubSampler

ls = LeverageSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2,
    seed=42,
    device='cuda',

    # Leverage-specific parameters:
    block=False,              # True: sum leverage over all rows per config
                              # False: use energy row leverage only

    # Factorization method:
    factorization='auto',     # 'auto', 'svd', or 'qr'
                              # auto: SVD for single-GPU, QR if chunked/distributed
    n_chunks=None,            # number of TSQR chunks (for large matrices)

    # Pre-computed SVD factors (optional, skip SVD computation):
    U=None, S=None, Vh=None,
)

ls.create_subsample(subsample_fraction=0.3, seed=123)
ls.train_subsample()
errors = ls.compute_subsample_errors()

# Access leverage scores after create_subsample:
print(ls.leverage_scores)  # (n_configs,) tensor

How it works:

Compute weighted X: X_w = diag(w) @ X
Compute SVD: U, S, Vh = svd(X_w) (or QR factorization)
Row leverage scores: h_i = sum(U[i,:]^2) (or h_i = ||R^{-T} x_i||^2 for QR)
Block mode: sum leverage scores within each configuration
Non-block mode: use the energy row's leverage score
Sample configurations proportional to normalized leverage scores

Cook's Distance Subsampling

Selects configurations by their influence on the model fit, measured by Cook's distance. Supports two approaches:

One-Step Cook's Distance

Computes Cook's distance for each configuration in a single pass, then samples proportional to scores (or selects top-k).

from subdatapy.subsampler import CookSubSampler

# One-step (non-block only)
cs = CookSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2,
    seed=42,
    device='cuda',

    stepwise=False,           # one-step mode
    sampling=True,            # True: sample proportional to Cook's scores
                              # False: select top-k configurations
)

cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors()

# Access Cook's distance scores:
print(cs.onestep_en_cooks)  # (n_configs,) tensor

One-step formula (energy rows only):

D_i = e_i^2 * h_i / (1 - h_i)^2

where e_i is the residual and h_i is the leverage score of the i-th energy row.

Stepwise Cook's Distance

Greedy iterative method: starts from a small initial subset, then adds (ascending) or removes (descending) one configuration per iteration based on which has the highest Cook's distance influence on the current model.

# Stepwise ascending with random initial subset
cs = CookSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2,
    seed=42,
    device='cuda',

    stepwise=True,            # stepwise mode
    ascending=True,           # True: grow subset from small initial
                              # False: shrink from large initial

    # Initial subset:
    initial_subsampler="random",    # "random" or "leverage"
    initial_subsample_fraction=0.05, # start with 5% of configs

    # Block mode:
    block=False,              # True: block Cook's (all rows per config)
                              # False: energy-row Cook's only

    # Factorization for initial (X'X)^{-1}:
    factorization='auto',     # 'auto', 'svd', or 'qr'
                              # auto: SVD single-process, QR if chunked/distributed

    # Update method per iteration:
    update_method='auto',     # 'auto', 'woodbury', or 'qr'
                              # auto: QR for block (stable), Woodbury for non-block (fast)

    # Chunked/distributed:
    n_chunks=None,            # TSQR chunks (for large data, or multi-GPU)
    tree_reduction_threshold=10,  # reduce R matrices after this many
)

cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors()

Stepwise algorithm:

Create initial subset (random or leverage-based, at initial_subsample_fraction)
Compute initial factorization: (X'X)^{-1} via SVD or QR
For each iteration until target fraction is reached:
- Compute coefficients: beta = (X'X)^{-1} X'y
- Compute Cook's distance for each candidate configuration
- Select the configuration with highest (ascending) or lowest (descending) Cook's distance
- Update (X'X)^{-1} incrementally via Woodbury identity or QR re-factorization
Return the final subset mask

Non-block Cook's formula (per energy row):

D_i = e_i^2 * h_i / (1 + h_i)

Block Cook's formula (per configuration group c):

D_c = r_c' @ inv(I +/- H_c) @ H_c @ r_c

where H_c = X_c @ (X'X)^{-1} @ X_c' is the block leverage matrix and r_c = X_c @ beta - y_c.

Training and Error Evaluation

Full Training (No Subsampling)

from subdatapy.data import BaseData

bd = BaseData(
    X, y=y, w=w,
    config_idxs=config_idxs,
    enrow_mask=enrow_mask,
    intercept=True,
    device='cuda',
)
bd.train_test_split(test_fraction=0.2, seed=42)

# Standard least squares
bd.train()                    # method='lstsq' (default)

# Or QR-based (supports chunking for large matrices)
bd.train(method='qr', n_chunks=4)

# Evaluate
e_train, f_train, e_test, f_test = bd.compute_errors(verbose=True)

# Access coefficients
print(bd.coeffs.shape)  # (n_features + 1, 1) if intercept=True

Subsample Training

After creating a subsample with any subsampler:

# Train on the subsample
subsampler.train_subsample()

# Evaluate errors on subsampled training, full training, and test sets
e_sub, f_sub, e_train, f_train, e_test, f_test = subsampler.compute_subsample_errors(verbose=True)

The 6 returned RMSE values are:

Subsampled Training Energy RMSE — energy rows in the subsample
Subsampled Training Force RMSE — force rows in the subsample
Entire Training Energy RMSE — all training energy rows
Entire Training Force RMSE — all training force rows
Testing Energy RMSE — test set energy rows
Testing Force RMSE — test set force rows

Batch Evaluation

Sweep over multiple subsample fractions with repeated trials in a single call:

errors_df = subsampler.create_subsample_errors_dataframe(
    subsample_fractions_list=[0.05, 0.1, 0.2, 0.5],
    repeat_count_list=5,        # 5 repeats per fraction (int applies to all)
    seed=42,
    verbose=False,
)

# errors_df is a Pandas DataFrame with MultiIndex columns:
#   Level 0: Error Type (e.g., "Testing Energy RMSE")
#   Level 1: Subsample Fraction (e.g., 0.05, 0.1, 0.2, 0.5)
# Rows are repeats.

print(errors_df["Testing Energy RMSE"])

For stepwise Cook's distance, each successive fraction in subsample_fractions_list reuses the previous subset as initial state (the greedy loop continues rather than restarting), allowing efficient evaluation of the subsampling curve.

Variable repeats per fraction:

# More repeats for small fractions (high variance), fewer for large fractions
errors_df = subsampler.create_subsample_errors_dataframe(
    subsample_fractions_list=[0.05, 0.1, 0.2, 0.5],
    repeat_count_list=[10, 10, 5, 5],  # must be non-increasing
    seed=42,
)

Multi-GPU and Distributed Usage

For datasets that exceed single-GPU memory, SubDataPy supports chunked processing and multi-GPU distribution via torch.distributed with the NCCL backend.

Single-GPU Chunking

Stream data through GPU in chunks (no distributed setup needed):

cs = CookSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2, seed=42,
    device='cuda',
    stepwise=True, ascending=True,
    factorization='qr',       # must use QR for chunking
    n_chunks=8,               # process in 8 sequential chunks
)
cs.create_subsample(subsample_fraction=0.5, seed=123)

This keeps only one chunk on GPU at a time, reducing peak memory from O(n*p) to O(n/8 * p).

Multi-GPU (torchrun)

# 4 GPUs on one node
torchrun --nproc_per_node=4 my_script.py

# Two nodes, one rank per node, each rank using all 4 local GPUs via
# the `local_devices` parameter (see Partitioned Mode below)
srun -N 2 --ntasks-per-node=1 --gpus-per-task=4 \
    torchrun --nnodes=2 --nproc_per_node=1 my_script.py

# my_script.py
import torch.distributed as dist
from subdatapy.subsampler import CookSubSampler

dist.init_process_group(backend='nccl')

# Under torchrun, SubDataPy requires partitioned mode — pass .npy file
# paths for X and config_idxs so BaseData auto-enables it. Passing
# in-memory tensors while distributed raises ValueError.
cs = CookSubSampler(
    X="data/X.npy", y="data/y.npy", w="data/w.npy",
    config_idxs="data/config_idxs.npy",
    test_fraction=0.2, seed=42,
    device='cuda',
    stepwise=True, ascending=True,
    factorization='qr',
    n_chunks=16,               # 16 total chunks, 4 per GPU
)
cs.create_subsample(subsample_fraction=0.5, seed=123)
cs.train_subsample()
errors = cs.compute_subsample_errors(verbose=True)  # all ranks participate

dist.destroy_process_group()

Partitioned Mode (OOM-free distributed)

When BaseData receives .npy file paths for both X and config_idxs under torch.distributed, it auto-enters partitioned mode:

Rank 0 scans the config_idxs file and computes partition boundaries at configuration boundaries (never mid-config) via partition.compute_partition_ranges. The ranges are broadcast to all ranks.
Each rank mmap-loads only its slice of X, y, w, config_idxs, enrow_mask — memory is D/world_size per rank.
Train/test split uses the global config-id list so every rank agrees on which configs are test. Subsamplers use a global unique_config_idxs_train (built via partition.build_global_config_ids) so sampling with the same seed yields the same chosen configs across ranks.
TSQR runs in partitioned=True mode: each rank QRs its local rows; rank 0 gathers per-rank R matrices and does a final QR; R is broadcast back to all ranks for downstream use. Only small tensors (p×p R, p×1 XTy, and the single winning config's rows per stepwise iteration) travel across NCCL.

# Target: 2 nodes × 1 rank × 4 local GPUs = 8 GPUs, 2 CPU-RAM partitions.
# Each rank sees all 4 local GPUs and round-robins chunks across them.
cs = CookSubSampler(
    X="data/X.npy", y="data/y.npy", w="data/w.npy",
    config_idxs="data/config_idxs.npy",
    enrow_mask="data/enrow_mask.npy",
    device='cuda:0',
    local_devices=['cuda:0','cuda:1','cuda:2','cuda:3'],  # round-robin TSQR chunks
    stepwise=True, ascending=True,
    factorization='qr',
)

local_devices is auto-detected from torch.cuda.device_count() if not supplied. When LOCAL_WORLD_SIZE==1 (one rank per node), it defaults to every visible GPU; when multiple ranks share a node, it defaults to [device] and each rank gets one GPU.

_is_partitioned is the flag set by the auto-detection. You can force it via the partitioned_override= kwarg (used internally by CookSubSampler to inherit the parent's partition state in nested samplers).

Partitioned mode requires world_size > 1. Calling tsqr_r(partitioned=True) under a single process raises ValueError.

TSQR Execution Modes

`n_chunks`	Setup	Mode	Description
None	single-process, one `local_devices`	Single-pass	`QR(X)` directly on the device
N	single-process	Chunked	Stream N chunks through the primary device (or round-robin across `local_devices`)
None / per-rank N	partitioned multi-rank (torchrun + file paths)	Partitioned TSQR	Each rank QRs its local partition; rank 0 gathers and reduces across ranks

Only two NCCL collectives are needed: gather (R matrices) and reduce (X'y sum). All communication is GPU-to-GPU. Replicated-distributed mode — where every rank holds the full matrix and TSQR merely splits the row range — was removed because partitioned is strictly more memory-efficient.

Distributed Leverage Scores

from subdatapy.subsampler import LeverageSubSampler

ls = LeverageSubSampler(
    X, y=y, w=w,
    config_idxs=config_idxs,
    test_fraction=0.2, seed=42,
    device='cuda',
    factorization='qr',
    n_chunks=8,
)
ls.create_subsample(subsample_fraction=0.3, seed=123)

Distributed Full Training

from subdatapy.data import BaseData

bd = BaseData(X, y=y, w=w, config_idxs=config_idxs, device='cuda')
bd.train_test_split(test_fraction=0.2, seed=42)
bd.train(method='qr', n_chunks=8)

Generating Descriptor Matrices

SubDataPy works with pre-computed descriptor matrices. For MLIP workflows, you typically generate these using a descriptor calculator like FitSNAP.

An example FitSNAP helper script is provided in tools/generate_matrices/FitSNAP/qSNAP.py. It:

Reads atomic structures and DFT energies/forces from HDF5 files
Computes SNAP bispectrum descriptors via FitSNAP
Saves X.npy, y.npy, w.npy, config_idxs.npy for use with SubDataPy

For benchmarking with synthetic data, see examples/mpi_benchmark/generate_data.py.

API Reference

BaseData

Base class for data handling. All subsamplers inherit from this.

from subdatapy.data import BaseData

bd = BaseData(X, y=None, w=None, config_idxs=None, enrow_mask=None,
              intercept=True, device='cuda')

Parameters:

Parameter	Type	Default	Description
`X`	array-like or str	required	Feature matrix (n_rows, n_features). `.npy` path, NumPy array, DataFrame, or tensor.
`y`	array-like or str	`None`	Target vector (n_rows,). Same formats as X.
`w`	array-like or str	`None`	Weight vector (n_rows,). Defaults to all ones.
`config_idxs`	array-like or str	`None`	Configuration ID per row (n_rows,). If None, each row is its own config.
`enrow_mask`	array-like or str	`None`	Boolean mask for energy rows (n_rows,). If None, first row per config is energy.
`intercept`	bool	`True`	Prepend a column of ones to X.
`device`	str	`'cuda'`	PyTorch device for computation.

Methods:

Method	Description
`train_test_split(test_fraction, seed, test_mask)`	Split data into train/test sets by configuration.
`train(method='lstsq', n_chunks=None)`	Weighted least squares. `method='qr'` uses TSQR.
`compute_errors(verbose=True)`	Returns `(e_train, f_train, e_test, f_test)` RMSE tuple.

RandomSubSampler

from subdatapy.subsampler import RandomSubSampler

rs = RandomSubSampler(X, y=None, w=None, test_fraction=0.0, seed=None,
                      test_mask=None, config_idxs=None, enrow_mask=None,
                      intercept=True, device='cuda')

Inherits all BaseData parameters plus:

Parameter	Type	Default	Description
`test_fraction`	float	`0.0`	Fraction of configurations for the test set.
`seed`	int	`None`	Random seed for train/test split.
`test_mask`	array-like	`None`	Explicit boolean test mask (overrides `test_fraction`).

Methods:

Method	Description
`create_subsample(subsample_fraction, seed=None)`	Create a subsample. Returns a boolean mask over all rows.
`train_subsample()`	Train WLS on the subsampled data.
`compute_subsample_errors(verbose=False)`	Returns 6-tuple of RMSE values (sub-train, full-train, test for energy + force).
`create_subsample_errors_dataframe(subsample_fractions_list, repeat_count_list=1, seed=None, verbose=False)`	Sweep fractions with repeats. Returns MultiIndex DataFrame.

LeverageSubSampler

from subdatapy.subsampler import LeverageSubSampler

ls = LeverageSubSampler(X, y=None, w=None, test_fraction=0.0, seed=None,
                        config_idxs=None, enrow_mask=None, intercept=True,
                        device='cuda', block=False, U=None, S=None, Vh=None,
                        factorization='auto', n_chunks=None)

Inherits all RandomSubSampler parameters plus:

Parameter	Type	Default	Description
`block`	bool	`False`	Sum leverage scores across all rows per configuration.
`U, S, Vh`	tensors	`None`	Pre-computed SVD factors to skip SVD computation.
`factorization`	str	`'auto'`	`'auto'`, `'svd'`, or `'qr'`. Auto selects SVD unless chunked/distributed.
`n_chunks`	int	`None`	Number of TSQR chunks for QR factorization.

Attributes set after create_subsample():

Attribute	Description
`leverage_scores`	Per-configuration leverage scores (n_configs,) tensor.

CookSubSampler

from subdatapy.subsampler import CookSubSampler

cs = CookSubSampler(X, y, w=None, test_fraction=0.0, seed=None, test_mask=None,
                    config_idxs=None, enrow_mask=None, intercept=True,
                    device='cuda', block=False, stepwise=False, sampling=True,
                    ascending=True, initial_subsampler="random",
                    initial_subsample_fraction=1, U=None, S=None, Vh=None,
                    n_chunks=None, factorization='auto', update_method='auto',
                    tree_reduction_threshold=10)

Inherits all RandomSubSampler parameters plus:

Parameter	Type	Default	Description
`stepwise`	bool	`False`	Stepwise greedy selection (True) or one-step scoring (False).
`block`	bool	`False`	Block Cook's distance using all rows per config.
`sampling`	bool	`True`	One-step only: sample proportional to Cook's (True) or top-k (False).
`ascending`	bool	`True`	Stepwise only: grow subset (True) or shrink (False).
`initial_subsampler`	str	`"random"`	Stepwise only: `"random"` or `"leverage"` initial subset.
`initial_subsample_fraction`	float	`1`	Stepwise only: fraction for the initial subset.
`U, S, Vh`	tensors	`None`	Pre-computed SVD factors (one-step mode).
`factorization`	str	`'auto'`	`'auto'`, `'svd'`, or `'qr'`. For initial (X'X)^{-1}.
`update_method`	str	`'auto'`	`'auto'`, `'woodbury'`, or `'qr'`. Per-iteration update. Auto = QR for block, Woodbury for non-block.
`n_chunks`	int	`None`	TSQR chunks for initial factorization and chunked processing.
`tree_reduction_threshold`	int	`10`	Reduce accumulated R matrices after this many chunks.

Attributes set after create_subsample():

Attribute	Description
`onestep_en_cooks`	One-step Cook's scores (n_configs,). Only set if `stepwise=False`.
`XTX_inv`	Current (X'X)^{-1} at end of stepwise.
`XTy`	Current X'y at end of stepwise.
`R_final`	QR R factor (only when using QR factorization).

linalg Module

Pure functions for distributed linear algebra. Used internally by subsamplers but available for direct use.

from subdatapy import linalg

TSQR Functions:

Function	Description
`tsqr_r(X, device, n_chunks, tree_reduction_threshold)`	Compute R factor of QR(X). Returns R on rank 0, None on others.
`tsqr_r_xty(X, y, device, n_chunks, tree_reduction_threshold)`	Compute R and X'y simultaneously. Returns (R, X'y) on rank 0.

Inverse/Solve Functions:

Function	Description
`xtx_inv_from_r(R, device)`	(X'X)^{-1} = R^{-1} R^{-T}. Inversion on CPU in float64.
`xtx_inv_from_svd(X, device)`	(X'X)^{-1} via SVD with singular value filtering. Returns (inv, U, S, Vh).
`solve_from_r_xty(R, XTy)`	Solve beta = R^{-1} R^{-T} X'y via two triangular solves.

Update Functions:

Function	Description
`woodbury_update(XTX_inv, X_change, y_change, XTy, ascending)`	Rank-k Woodbury/Sherman-Morrison update for adding or removing rows.
`qr_update_add(R, X_new, y_new, XTy, device)`	QR-based update: append rows and re-factorize. Returns (R_new, inv_new, XTy_new).

Leverage:

Function	Description
`leverage_scores_from_r(X, R, device)`	h_i = \|\|R^{-T} x_i\|\|^2. QR-based leverage scores.

Distributed Helpers:

Function	Description
`get_rank()`	Current process rank (0 if not distributed).
`get_world_size()`	World size (1 if not distributed).
`is_distributed()`	True if torch.distributed is active with world_size > 1.

Examples

Jupyter Notebooks

examples/Be_qSNAP_8/example1.ipynb — Beryllium qSNAP descriptor subsampling
examples/lithium/example1.ipynb — Lithium dataset subsampling

Benchmark Script

The examples/mpi_benchmark/ directory contains a comprehensive benchmark:

# Generate synthetic data (2000 configs, 200 features)
python examples/mpi_benchmark/generate_data.py

# Run single-GPU benchmarks
python examples/mpi_benchmark/benchmark.py --method random --subsample_fraction 0.5
python examples/mpi_benchmark/benchmark.py --method leverage --block
python examples/mpi_benchmark/benchmark.py --method cooks --stepwise --ascending

# Run distributed benchmark
torchrun --nproc_per_node=4 examples/mpi_benchmark/benchmark.py \
    --method cooks --stepwise --ascending --factorization qr

# Run the full benchmark suite (NERSC Perlmutter)
sbatch examples/mpi_benchmark/run_all.sh

Testing

# Run all tests
pytest

# With verbose output
pytest -v

# With coverage
pytest --cov=subdatapy --cov-report=term-missing

# Run only linalg tests
pytest tests/test_linalg.py -v

# Run only subsampler tests
pytest tests/test_subsamplers.py -v

Tests require a CUDA GPU by default. On CPU-only systems, tests run with device='cpu' (different expected RMSE values are used).

Mathematical Reference

For detailed derivations, equations, and code cross-references for all algorithms implemented in SubDataPy, see MATH.md.

Citation

If you use SubDataPy in your research, please cite:

@software{subdatapy,
  author = {Ilgar Baghishov},
  title = {SubDataPy: A Python Toolkit for Subsampling Large Datasets for MLIPs},
  url = {https://github.com/IlgarBaghishov/SubDataPy},
}

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jun 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subdatapy-0.1.0.tar.gz (46.5 kB view details)

Uploaded Jun 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

subdatapy-0.1.0-py3-none-any.whl (34.0 kB view details)

Uploaded Jun 7, 2026 Python 3

File details

Details for the file subdatapy-0.1.0.tar.gz.

File metadata

Download URL: subdatapy-0.1.0.tar.gz
Upload date: Jun 7, 2026
Size: 46.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for subdatapy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c2921ea3092674634770c96f0de54a1a54101a70485e118729de16dc11afbea8`
MD5	`f2591b4127b0b13e28848664353c14a6`
BLAKE2b-256	`6246ab8138a34cb7c4c1b63ab6dd22460ef6cf6c054a1bea34d2f65bf673046d`

See more details on using hashes here.

File details

Details for the file subdatapy-0.1.0-py3-none-any.whl.

File metadata

Download URL: subdatapy-0.1.0-py3-none-any.whl
Upload date: Jun 7, 2026
Size: 34.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for subdatapy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9226dfa45018197099d149fc54e2f2da9ca7275732f2dba7ae1d567d06027748`
MD5	`90e3feb00d14c20edc5d8f8496ac1a37`
BLAKE2b-256	`62e629bb5e6da0c3f83e7d2f54749b81faff5b8ea0099e15b497597d147b5875`

See more details on using hashes here.

subdatapy 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SubDataPy

Table of Contents

Key Features

Installation

Requirements

Quick Start

Core Concepts

Configurations and Grouping

Block vs Non-Block Mode

Weighted Least Squares

Data Flow

Data Loading

Subsampling Methods

Random Subsampling

Leverage Score Subsampling

Cook's Distance Subsampling

One-Step Cook's Distance

Stepwise Cook's Distance

Training and Error Evaluation

Full Training (No Subsampling)

Subsample Training

Batch Evaluation

Multi-GPU and Distributed Usage

Single-GPU Chunking

Multi-GPU (torchrun)

Partitioned Mode (OOM-free distributed)

TSQR Execution Modes

Distributed Leverage Scores

Distributed Full Training

Generating Descriptor Matrices

API Reference

BaseData

RandomSubSampler

LeverageSubSampler

CookSubSampler

linalg Module

Examples

Jupyter Notebooks

Benchmark Script

Testing

Mathematical Reference

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes