A Python package for fast subsampling of flow cytometry and scRNA-seq data, with a focus on retaining rare cell populations.

SParseSampler (SPS)

SParseSampler (SPS) is a Python package for efficient subsampling of large-scale single-cell RNA-seq and flow cytometry datasets while preserving rare cell populations. The method employs an unsupervised approach that maintains dataset structure and rare cell types without requiring explicit labels.

Key Features

Core Method

- PCA-based dimensionality reduction with automatic parameter selection via an EVR-based heuristic
- Variance-weighted binning in the reduced dimensional space
- Iterative cell selection prioritizing sparsest bins
- Preserves rare populations without requiring cell type labels
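
The steps above can be illustrated with a minimal NumPy sketch. This is not the package's implementation: the function name `sparse_bin_sample`, the rule that component j receives ceil(K × EVR_j) equal-width bins, and the random tie-breaking inside the last category are assumptions made for illustration only.

```python
import numpy as np


def sparse_bin_sample(X, size, feature_index=12, seed=0):
    """Illustrative sketch: pick `size` row indices, favouring sparse PCA bins."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    if not 0 <= feature_index < min(X.shape):
        raise ValueError("feature_index must be smaller than min(n_samples, n_features)")
    n_components = feature_index + 1  # p = feature_index + 1

    # PCA via SVD of the centred data.
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    evr = s**2 / np.sum(s**2)  # explained variance ratio per component
    scores = Xc @ vt[:n_components].T

    # Assumed binning rule: component j gets ceil(K * EVR_j) equal-width bins,
    # so high-variance components are split more finely.
    K = 2.0 / evr[feature_index]
    n_bins = np.maximum(2, np.ceil(K * evr[:n_components] - 1e-9)).astype(int)

    # Assign every point a bin id along each retained component.
    bin_ids = np.empty(scores.shape, dtype=np.int64)
    for j in range(n_components):
        col = scores[:, j]
        edges = np.linspace(col.min(), col.max(), n_bins[j] + 1)
        bin_ids[:, j] = np.clip(np.digitize(col, edges[1:-1]), 0, n_bins[j] - 1)

    # Group points by occupied bin and count the occupants of each bin.
    _, inverse, counts = np.unique(
        bin_ids, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()

    # Visit points from the sparsest bins to the densest ones. Shuffling first
    # means the final, partially used category is subsampled at random.
    order = rng.permutation(len(X))
    order = order[np.argsort(counts[inverse[order]], kind="stable")]
    return np.sort(order[:size])
```

The sketch needs no labels: points in sparsely populated bins are taken first, which is why small, isolated populations tend to survive the subsampling.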

Performance Benefits

- Computational efficiency comparable to random sampling
- Superior rare cell retention compared to existing methods (Hopper, Atomic Sketch)
- Performance comparable to scSampler with improved speed
- Successfully tested on datasets up to 34 million cells
- Validated at multiple rarity levels (1%, 0.5%, 0.1%)

Installation

pip install sparsesampler

Technical Details

Parameters

SParseSampler selects its parameters automatically with a heuristic based on the explained variance ratio (EVR) of the principal components. A single setting, the EVR index (feature_index), determines both the number of principal components (p) and the Bin Resolution Factor (K).

EVR Index (feature_index)
- Default: 12
- Controls both the number of principal components and the bin resolution
- The number of principal components p is set to feature_index + 1
- The Bin Resolution Factor K is computed as K = 2 / EVR_i, where i = feature_index and EVR_i is the explained variance ratio of the i-th principal component
- Recommended ranges:
  - Flow cytometry data: EVR indices 7–20
  - scRNA-seq data: EVR indices 12–25
- The default value of 12 lies within both recommended ranges

Sample Size (size)
- Number of cells to subsample from the dataset

Seed (seed)
- Random seed for reproducibility
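
As a rough illustration of the EVR-based heuristic described above, the following sketch derives p and K from a data matrix. The helper name `evr_parameters` is hypothetical, and treating feature_index as a 0-based position in the EVR array is an assumption (suggested by p = feature_index + 1); the package itself may compute these values differently.

```python
import numpy as np


def evr_parameters(X, feature_index=12):
    """Return (p, K, evr) using p = feature_index + 1 and K = 2 / EVR_i."""
    X = np.asarray(X, dtype=float)
    if not 0 <= feature_index < min(X.shape):
        raise ValueError("feature_index must be smaller than min(n_samples, n_features)")

    # Singular values of the centred data give the PCA variances.
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    evr = s**2 / np.sum(s**2)  # explained variance ratio, in descending order

    p = feature_index + 1          # number of principal components
    K = 2.0 / evr[feature_index]   # Bin Resolution Factor (0-based index assumed)
    return p, K, evr
```

Because the EVR values are non-increasing, a larger feature_index can only keep K the same or make it larger, so raising the index adds components and never coarsens the bin resolution.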

Supported Data Types

- Single-cell RNA sequencing data
- Flow cytometry data

Validation

Benchmarking

- Comprehensive comparison against existing subsampling methods (Hopper, Atomic Sketch, scSampler) and random sampling
- Validated on large-scale datasets:
  - MCC dataset (scRNA-seq): 3.2M cells, 3,065 genes
  - LCMV dataset (flow cytometry): 34M cells, 31 features
- Consistent performance across varying dataset sizes and rarity levels (1%, 0.5%, 0.1%)
- Downstream validation: Random Forest classifiers trained on SPS-subsampled data achieve substantially higher F1 scores for rare cell types than classifiers trained on randomly subsampled data
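
One simple way to check rare-cell retention on your own labeled data is sketched below. The helper `rare_retention` and the quantities it reports are illustrative assumptions, not the benchmark code behind the results above; labels are used only for evaluation, never for sampling.

```python
import numpy as np


def rare_retention(labels, indices, rare_label):
    """Summarise how many cells of `rare_label` survive in a subsample."""
    labels = np.asarray(labels)
    indices = np.asarray(indices)
    is_rare = labels == rare_label

    n_rare_total = int(is_rare.sum())
    n_rare_kept = int(is_rare[indices].sum())
    return {
        "rare_total": n_rare_total,
        "rare_kept": n_rare_kept,
        # Fraction of all rare cells that made it into the subsample.
        "recall": n_rare_kept / n_rare_total if n_rare_total else float("nan"),
        # Fraction of the subsample that is made up of rare cells.
        "share_of_sample": n_rare_kept / len(indices) if len(indices) else float("nan"),
    }
```

Running the same helper on indices drawn uniformly at random (for example with `numpy.random.default_rng(seed).choice(n, size, replace=False)`) gives a baseline to compare against.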

PCA Runtime

- Flow cytometry (LCMV, 34M cells, 31 features): ~11 seconds
- scRNA-seq (MCC, 3.2M cells, 3,065 genes): ~5 minutes

Usage

import numpy as np
import pandas as pd
import scanpy as sc
from scipy import sparse

import sparsesampler.sampling as sps

# Load your data as an (n_samples × n_features) array. Pick ONE of the options below.

# Option 1: from a NumPy array
X = np.load('your_data.npy')

# Option 2: from a CSV file
X = pd.read_csv('your_data.csv').to_numpy()

# Option 3: scRNA-seq data in AnnData format
adata = sc.read_h5ad('your_data.h5ad')
X = adata.X.toarray() if sparse.issparse(adata.X) else adata.X  # densify sparse matrices

# Run SParseSampler with default parameters (EVR index = 12)
indices, elapsed_time = sps.sample(X=X, size=100000)

# Run with a custom EVR index (e.g., for flow cytometry data)
indices, elapsed_time = sps.sample(X=X, size=50000, feature_index=8)

# Get the subsampled data
X_sampled = X[indices]

Preprocessing Recommendations

We recommend applying standard quality control filtering prior to SPS, including:

- Removal of cells with abnormally high/low UMI counts
- Filtering cells with high mitochondrial gene percentages
- Doublet detection and removal (e.g., using Scrublet or DoubletFinder)
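
The first two filters above could look roughly like the NumPy sketch below. The helper `qc_mask` and every threshold in it are placeholders chosen for illustration, not recommendations from the SPS authors; suitable cut-offs depend on the dataset. Doublet detection is not covered here and is best left to a dedicated tool.

```python
import numpy as np


def qc_mask(counts, mito_gene_mask, min_umis=500, max_umis=50000, max_mito_frac=0.2):
    """Boolean mask of cells passing UMI-count and mitochondrial-fraction filters.

    counts: (n_cells, n_genes) raw UMI count matrix (dense).
    mito_gene_mask: boolean array of length n_genes marking mitochondrial genes.
    """
    counts = np.asarray(counts, dtype=float)
    mito_gene_mask = np.asarray(mito_gene_mask, dtype=bool)

    total = counts.sum(axis=1)                    # UMIs per cell
    mito = counts[:, mito_gene_mask].sum(axis=1)  # mitochondrial UMIs per cell
    # Avoid division by zero for empty cells; their fraction is reported as 0.
    mito_frac = np.divide(mito, total, out=np.zeros_like(total), where=total > 0)

    return (total >= min_umis) & (total <= max_umis) & (mito_frac <= max_mito_frac)
```

Applying the mask before sampling (`X_filtered = X[qc_mask(...)]`) keeps low-quality cells from occupying sparse bins, where they could otherwise be mistaken for rare populations.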

Visualization

The following animation shows how points are selected from a 2D toy dataset using PCA binning. Bins (grid cells) are grouped into categories by the number of points they contain, and points are selected category by category (bins with 1 point, then bins with 2 points, and so on). The process is visualized step by step:

- All points start as skyblue.
- When a category is considered, its bins are highlighted in yellow and the points in those bins are shown in gray for visibility.
- Selected points turn red and remain red in all subsequent frames.
- The process continues until the target number of points is reached.

Sampling Process Animation

To generate the animation yourself, run:

python docs/generate_visualization.py

Citation

If you use SParseSampler in your research, please cite:

Citation details will be added when available.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

- 1.1.0 (this version): Feb 7, 2026
- 1.0.0: Jul 7, 2025
- 0.1.0: Jun 3, 2025

Download files

Source Distribution

sparsesampler-1.1.0.tar.gz (7.0 kB)

Uploaded Feb 7, 2026 Source

Built Distribution

sparsesampler-1.1.0-py3-none-any.whl (7.0 kB)

Uploaded Feb 7, 2026 Python 3

File details

Details for the file sparsesampler-1.1.0.tar.gz.

File metadata

Download URL: sparsesampler-1.1.0.tar.gz
Upload date: Feb 7, 2026
Size: 7.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for sparsesampler-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`15440f5c08cd9873b37cd1c71557d47aee9c6170ba0ff2ae39f71fb6b166c88a`
MD5	`c00869390db20364f46a21d956664da0`
BLAKE2b-256	`e4fdcfbd3ba635f4f047881eff5f716c17ce75c942e7c24e94d7d92b73ac7d19`

File details

Details for the file sparsesampler-1.1.0-py3-none-any.whl.

File metadata

Download URL: sparsesampler-1.1.0-py3-none-any.whl
Upload date: Feb 7, 2026
Size: 7.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for sparsesampler-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0530b98d947019915969610ee0ca52ebc144e61b711f7ab486342d887936306b`
MD5	`3585711685305889cef90e992ceac940`
BLAKE2b-256	`41fd0c24516ec70f2948f6c254e3fee01fca22cb6658963c908f88a41d20059e`
