Resampling-Enhanced Sparse LDA for ordinal outcomes

RE-sLDA: Resampling-Enhanced Sparse LDA for Ordinal Outcomes

This repository contains the implementation of RE-sLDA, a framework designed to enhance feature selection stability and accuracy when dealing with high-dimensional data and ordinal outcomes. By integrating resampling techniques with Sparse Linear Discriminant Analysis (sLDA), this method identifies robust biomarker signatures that standard sparse models often miss due to selection instability.

Key Features

Ordinal Outcome Optimization: Specifically tuned for categorical outcomes with a natural ordering (e.g., disease severity, treatment response grades).
Resampling-Based Stability: Utilizes bootstrap-based resampling to calculate Variable Inclusion Probabilities (VIP), ensuring the selected features are not artifacts of a single data split.
Parallel Computing Support: Fully integrated with multiprocessing for high-performance execution on multi-core machines.

Installation

From PyPI (recommended)

pip install RE-sLDA

This installs the re_slda Python package and a re-slda command-line entry point. A virtual environment is recommended to avoid dependency conflicts.

From source

Clone the repository, then install in editable mode:

git clone https://github.com/your-org/RE-sLDA.git
cd RE-sLDA
pip install -e .

Usage

RE-sLDA can be driven two ways: as importable functions inside a notebook/script, or as a command-line tool.

Option A — Import into a notebook

import pandas as pd
import re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")    # has header
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv",
                header=None).values.squeeze()

# Bootstrapping — returns a DataFrame, one row per iteration
results = re_slda.run_bootstrapping(X, Y, iters=200, base_seed=42)

# Per-feature Variable Inclusion Probability
vip = re_slda.compute_vip(results)
vip.head(20)

run_subsampling is called the same way and also returns a DataFrame. Pass out_prefix="MyRun" to either function to also write a timestamped CSV to output/. The public API is:

Function	Purpose
`re_slda.run_bootstrapping(X, Y, varnames=None, *, iters=200, ...)`	Resampling-with-replacement pipeline. Returns a DataFrame.
`re_slda.run_subsampling(X, Y, varnames=None, *, iters=200, ...)`	Train/test-split pipeline. Returns a DataFrame.
`re_slda.compute_vip(results)`	Variable Inclusion Probabilities from a results DataFrame.
`re_slda.predict_asda_ordinal(model, X_new)`	Predict ordinal labels from a fitted ordASDA model.
`re_slda.ordASDA(...)`	Low-level ordinal sparse LDA fitter.

Option B — Command line

After pip install, the re-slda console script is available:

re-slda bootstrapping
re-slda subsampling

The legacy invocation still works for users who prefer to clone the repo:

python pipeline.py bootstrapping
python pipeline.py subsampling

Both commands accept --iters, --out-prefix, --save-dir, --seed, --x, and --y. Run re-slda --help for details.

Tutorial: Walkthrough with the Included Glioma Dataset

This repository ships with a real example dataset so you can run the full pipeline before applying RE-sLDA to your own data. The walkthrough below covers each step, what the pipeline does during it, and how to read the results.

Step 1 — Inspect the example data

Two files are provided in datasets/:

File	Role	Shape	Description
`use_glio_data_filter1000.csv`	Feature matrix X	175 samples × 1000 features	Pre-filtered gene-expression features for glioma patients. First row is the feature names (`V4391`, `V708`, …).
`use_glio_dataY_filter1000.csv`	Response vector Y	175 values	Ordinal tumor-grade labels (e.g. `1`, `2`, `3`, `4`). No header row.

You can preview the data with any spreadsheet tool or:

head -2 datasets/use_glio_data_filter1000.csv
head -5 datasets/use_glio_dataY_filter1000.csv

Step 2 — Run the bootstrapping pipeline

Bootstrapping is the recommended starting point, as it produces Variable Inclusion Probabilities (VIPs) for every feature. Run it either from the command line:

re-slda bootstrapping            # after `pip install RE-sLDA`
# or, from a source checkout:
python pipeline.py bootstrapping

or directly from a notebook:

import pandas as pd, re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv", header=None).values.squeeze()
results = re_slda.run_bootstrapping(X, Y, iters=200)

What happens during this run (using the defaults):

The script draws 200 bootstrap replicates of the rows of X/Y (with replacement).
For each replicate it samples a random predictor subspace of size subspace_size = 5, repeated until ~80% of the predictor pool has been covered (target_unique_prob = 0.8).
Inside each subspace it fits an ordinal sLDA model with 4-fold cross-validation to pick optimal_lambda.
Selected variables are accumulated and held-out MAE / Accuracy are recorded.
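
The resampling steps above can be sketched in plain NumPy. This is only an illustration of the idea, not the package's code: the helper names draw_bootstrap and draw_subspaces are hypothetical, the pool of 800 predictors assumes the default 80% of the 1000 example columns, and treating the rows that were never drawn (out-of-bag rows) as the held-out set is an assumption, since the description does not say how held-out rows are chosen.

```python
import numpy as np


def draw_bootstrap(n_samples, rng):
    """Row indices sampled with replacement, plus the rows never drawn."""
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Assumption: rows missing from the replicate serve as held-out data.
    out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)
    return in_bag, out_of_bag


def draw_subspaces(pool, subspace_size=5, target_unique_prob=0.8, rng=None):
    """Sample random subspaces until the target share of the pool is covered."""
    rng = np.random.default_rng() if rng is None else rng
    pool = np.asarray(pool)
    seen = set()
    subspaces = []
    while len(seen) < target_unique_prob * len(pool):
        # Each subspace holds distinct predictors; subspaces may overlap.
        subspace = rng.choice(pool, size=subspace_size, replace=False)
        subspaces.append(subspace)
        seen.update(subspace.tolist())
    return subspaces, len(seen) / len(pool)


rng = np.random.default_rng(42)
in_bag, out_of_bag = draw_bootstrap(175, rng)
subspaces, coverage = draw_subspaces(np.arange(800), rng=rng)
print(len(in_bag), len(out_of_bag), len(subspaces), round(coverage, 3))
```

Because subspaces overlap more and more as coverage grows, reaching 80% of the pool takes noticeably more than 800 / 5 draws, which is one reason a single replicate is not cheap.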

Expect a runtime of several minutes on a modern multi-core laptop. The console will print progress as each replicate completes.

Step 3 — Run the subsampling pipeline (optional, for comparison)

re-slda subsampling

or from a notebook:

results = re_slda.run_subsampling(X, Y, iters=200)

Subsampling replaces the bootstrap with repeated train/test splits (test_ratio = 0.20, n_subspaces = 5, 200 iterations). It is useful as a sanity check: features that appear stable under both schemes are the most trustworthy.
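
For intuition, one way to produce repeated train/test splits with a 20% hold-out is shown below. It is a sketch under assumptions: the function name subsample_split is hypothetical, and the split here is a plain random permutation with no stratification by grade, which may differ from what the package does.

```python
import numpy as np


def subsample_split(n_samples, test_ratio=0.20, rng=None):
    """Return (train_idx, test_idx) for one random split without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_samples)
    n_test = int(round(test_ratio * n_samples))
    return order[n_test:], order[:n_test]


rng = np.random.default_rng(0)
# Three splits of the 175 example samples; a real run would use `iters` of them.
splits = [subsample_split(175, rng=rng) for _ in range(3)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Unlike a bootstrap replicate, every training set here contains each sample at most once, so the two schemes perturb the data in different ways.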

Step 4 — Locate the output

When invoked from the command line, results are written to output/ with a timestamped filename:

output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv      # bootstrapping
output/CVlam_Glio_Subspace_subsampling_<mmddHHMM>.csv   # subsampling

From a notebook nothing is written to disk by default — the call returns a DataFrame. Pass out_prefix="MyRun" (and optionally save_dir=...) to also write a timestamped CSV.

Step 5 — Interpret the results

Each CSV contains one row per resampling iteration with these columns:

Column	Meaning	How to read it
`Selected_Variables`	Comma-separated features chosen on that iteration	Tally these across rows. Features appearing in many rows are stable. The fraction of rows in which a feature appears is its Variable Inclusion Probability (VIP).
`optimal_lambda`	Cross-validated regularisation strength	A tight distribution suggests the regularisation surface is well-behaved; very high variance suggests an under-determined problem.
`MAE`	Mean Absolute Error on held-out data	Lower is better. Because Y is ordinal, MAE is the primary performance metric. It penalises a "grade 4 predicted as grade 2" more than a one-step error.
`Accuracy`	Exact-match classification accuracy on held-out data	Use as a secondary metric; ordinal models often have modest accuracy but small MAE.
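
The difference between MAE and Accuracy is easiest to see on a toy example. The labels below are made up for illustration: both sets of predictions get the same number of grades exactly right, but one misses by a single grade and the other by two.

```python
import numpy as np

y_true = np.array([1, 2, 3, 4, 4, 2])
near = np.array([1, 2, 3, 3, 4, 3])  # two misses, each one grade off
far = np.array([1, 2, 3, 2, 4, 4])   # two misses, each two grades off


def mae(y, y_hat):
    """Mean absolute difference between true and predicted grades."""
    return float(np.mean(np.abs(y - y_hat)))


def accuracy(y, y_hat):
    """Share of predictions that match the true grade exactly."""
    return float(np.mean(y == y_hat))


for name, y_hat in [("near", near), ("far", far)]:
    print(name, round(accuracy(y_true, y_hat), 3), round(mae(y_true, y_hat), 3))
```

Accuracy is identical for the two predictors, while MAE is twice as large for the one that misses by two grades.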

A typical post-processing pattern in Python:

import pandas as pd, re_slda

# If you ran from a notebook you already have `results`; otherwise read the CSV:
results = pd.read_csv("output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv")

# 1. Per-feature VIP across the 200 iterations
vip = re_slda.compute_vip(results)
print(vip.head(20))

# 2. Predictive performance summary
print(results[["MAE", "Accuracy", "optimal_lambda"]].describe())
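
If you want to see what the VIP tally amounts to, it can be reproduced by hand from the Selected_Variables column. The sketch below is not the implementation of re_slda.compute_vip, whose return format is not documented here; vip_from_results is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def vip_from_results(results, column="Selected_Variables"):
    """Fraction of iterations (rows) in which each feature was selected."""
    per_iter = (
        results[column]
        .fillna("")
        .str.split(",")
        # A set per row, so a feature counts at most once per iteration.
        .apply(lambda names: {n.strip() for n in names if n.strip()})
    )
    counts = pd.Series([n for names in per_iter for n in names]).value_counts()
    return (counts / len(results)).sort_values(ascending=False)


toy = pd.DataFrame(
    {"Selected_Variables": ["V4391,V708", "V4391", "V708,V12,V4391", "V4391"]}
)
vip_by_hand = vip_from_results(toy)
print(vip_by_hand)
```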

Rules of thumb for the example dataset:

Features with VIP ≥ 0.6 are the candidate stable signature.
Compare the top VIP features from the bootstrapping and subsampling outputs. The intersection is the most reliable.
Median MAE on the glioma example should sit well below 1.0 (i.e. on average predictions are off by less than one tumor grade).
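
The first two rules can be combined in a few lines. This sketch assumes each VIP table is available as a pandas Series indexed by feature name, which may not match what compute_vip returns; stable_signature is a hypothetical helper and the VIP values are invented.

```python
import pandas as pd


def stable_signature(vip_a, vip_b, threshold=0.6):
    """Features whose VIP meets the threshold under both resampling schemes."""
    keep_a = set(vip_a[vip_a >= threshold].index)
    keep_b = set(vip_b[vip_b >= threshold].index)
    return sorted(keep_a & keep_b)


# Made-up VIPs for three features under each scheme.
vip_boot = pd.Series({"V4391": 0.92, "V708": 0.71, "V12": 0.40})
vip_sub = pd.Series({"V4391": 0.88, "V708": 0.55, "V12": 0.65})

signature = stable_signature(vip_boot, vip_sub)
print(signature)
```

Features that clear the threshold under only one scheme are dropped here; they are worth a second look rather than outright dismissal.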

Step 6 — Adapt to your own data

Once you are familiar with the framework, swap in your own data (see Setup Dataset below) and tune the parameters described in Parameter Configuration.

Parameter Configuration

All tunable parameters are exposed as keyword arguments to re_slda.run_bootstrapping / re_slda.run_subsampling (and as flags on the re-slda CLI). Pass them at the call site — no need to edit pipeline.py.

Parameter	Pipeline	Effect
`iters`	both	Number of resampling iterations. More iterations → more stable VIPs, longer runtime.
`cv_folds` / `n_folds_cv`	both	Inner CV folds used to pick `lambda`.
`predictor_subset`	both	Pool of candidate features per iteration (default 80% of columns).
`subspace_size`	bootstrapping	Size of each randomly sampled predictor subspace.
`target_unique_prob`	bootstrapping	Target unique-sample coverage; controls bootstrap scale.
`n_subspaces`	subsampling	Number of subspaces per train/test split.
`test_ratio`	subsampling	Hold-out fraction for each split.
`base_seed`	both	Random seed for reproducibility.
`out_prefix`, `save_dir`	both	If set, also write a timestamped CSV to disk.

Setup Dataset

A sample dataset is already provided in datasets/. To run on your own data:

From a notebook: load any DataFrame / NumPy array and pass it directly — re_slda.run_bootstrapping(X, Y) accepts both.
From the CLI: point re-slda at your files with --x path/to/X.csv --y path/to/Y.csv.

Dataset Requirements:

Files must be in .csv format
Y (response) dataset
- Single column, no header
- Contains ordinal labels only
X (feature) dataset
- Rows represent samples
- Columns represent features
- First row must contain feature (variable) names

Important: The number of rows in X must match the number of entries in Y.
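
A quick pre-flight check can catch a mismatch before a long run starts. The sketch below loads X and Y the same way as the examples above, but from small in-memory CSVs; check_inputs is a hypothetical helper and is not part of re_slda.

```python
import io

import numpy as np
import pandas as pd


def check_inputs(X, Y):
    """Return a list of problems found; an empty list means the basics hold."""
    Y = np.asarray(Y)
    problems = []
    if Y.ndim != 1:
        problems.append(f"Y should be one-dimensional, got shape {Y.shape}")
    elif len(X) != len(Y):
        problems.append(f"X has {len(X)} rows but Y has {len(Y)} entries")
    if pd.isna(Y.ravel()).any():
        problems.append("Y contains missing labels")
    return problems


# Toy stand-ins for the real files: X has a header row, Y does not.
x_csv = "V1,V2\n0.1,1.2\n0.4,0.9\n0.3,1.1\n"
y_csv = "1\n2\n3\n"

X = pd.read_csv(io.StringIO(x_csv))
Y = pd.read_csv(io.StringIO(y_csv), header=None).values.squeeze()

print(check_inputs(X, Y))
print(check_inputs(X, Y[:2]))
```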

Output

When run from the command line, either pipeline writes a CSV file to the output directory. From a notebook, a file is written only if out_prefix is set.

Output files are saved in the output/ directory by default; use save_dir (or --save-dir) to change it
The filename prefix is set with out_prefix (or --out-prefix)

Output Naming

Output files are timestamped so that successive runs can be told apart and do not overwrite earlier results. The filename format is:

<prefix>_<pipeline>_<mmddHHMM>.csv

Example:

BS_Glios_group_bootstrapping_01251445.csv
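
The naming scheme is simple to reproduce if you need to locate or generate matching filenames in your own scripts. The helper output_path below is hypothetical and only mirrors the documented format.

```python
from datetime import datetime
from pathlib import Path


def output_path(prefix, pipeline, save_dir="output", now=None):
    """Build <save_dir>/<prefix>_<pipeline>_<mmddHHMM>.csv."""
    now = datetime.now() if now is None else now
    stamp = now.strftime("%m%d%H%M")  # month, day, hour, minute
    return Path(save_dir) / f"{prefix}_{pipeline}_{stamp}.csv"


path = output_path(
    "BS_Glios_group", "bootstrapping", now=datetime(2026, 1, 25, 14, 45)
)
print(path.name)  # BS_Glios_group_bootstrapping_01251445.csv
```

The stamp carries neither year nor seconds, so two runs with the same prefix started within the same minute would map to the same name.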

Notes on Interpretation

Note: The Selected_Variables column reflects the final set of features chosen in that iteration and will vary across iterations and executions due to resampling and randomness. This variation is the signal the framework exploits; aggregate across iterations to obtain the VIP.

Important: Performance metrics are computed on held-out data and may vary depending on the random seed and parameter configuration. Always report summaries (median, IQR) across iterations rather than a single number.
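
A median-and-IQR summary takes only a few lines of pandas. The sketch assumes a results table with the MAE and Accuracy columns described earlier; summarise is a hypothetical helper and the numbers are invented.

```python
import pandas as pd


def summarise(results, columns=("MAE", "Accuracy")):
    """Median, quartiles and IQR of each metric across iterations."""
    q = results[list(columns)].quantile([0.25, 0.5, 0.75])
    return pd.DataFrame(
        {
            "median": q.loc[0.5],
            "q25": q.loc[0.25],
            "q75": q.loc[0.75],
            "IQR": q.loc[0.75] - q.loc[0.25],
        }
    )


# Five made-up iterations standing in for a real results table.
toy_results = pd.DataFrame(
    {
        "MAE": [0.4, 0.5, 0.6, 0.7, 0.8],
        "Accuracy": [0.50, 0.55, 0.60, 0.65, 0.70],
    }
)
summary = summarise(toy_results)
print(summary)
```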

Reference

Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse Discriminant Analysis. Technometrics.

Release history

0.1.1 (this version): May 23, 2026
0.1.0: May 23, 2026
