RE-sLDA: Resampling-Enhanced Sparse LDA for Ordinal Outcomes
This repository contains the implementation of RE-sLDA, a framework designed to enhance feature selection stability and accuracy when dealing with high-dimensional data and ordinal outcomes. By integrating resampling techniques with Sparse Linear Discriminant Analysis (sLDA), this method identifies robust biomarker signatures that standard sparse models often miss due to selection instability.
Key Features
- Ordinal Outcome Optimization: Specifically tuned for categorical outcomes with a natural ordering (e.g., disease severity, treatment response grades).
- Resampling-Based Stability: Utilizes bootstrap-based resampling to calculate Variable Inclusion Probabilities (VIP), ensuring the selected features are not artifacts of a single data split.
- Parallel Computing Support: Fully integrated with multiprocessing for high-performance execution on multi-core machines.
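To make the VIP idea concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the function name `inclusion_probabilities` and the `selections` data are hypothetical, and the only thing taken from this README is the definition of VIP as the fraction of resampling iterations in which a feature is selected.

```python
from collections import Counter

def inclusion_probabilities(selections):
    """Return {feature: fraction of iterations in which it was selected}."""
    n_iters = len(selections)
    counts = Counter()
    for selected in selections:
        # Count each feature at most once per iteration.
        counts.update(set(selected))
    return {feature: count / n_iters for feature, count in counts.items()}

# Hypothetical selections from four resampling iterations.
selections = [
    ["V4391", "V708"],
    ["V4391"],
    ["V4391", "V12"],
    ["V708", "V4391"],
]
vip = inclusion_probabilities(selections)
```

In this toy example V4391 is chosen in every iteration, while V12 appears only once and would be treated as unstable.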
Installation
From PyPI (recommended)
pip install RE-sLDA
This installs the re_slda Python package and a re-slda command-line entry point. A virtual environment is recommended to avoid dependency conflicts.
From source
Clone the repository, then install in editable mode:
git clone https://github.com/ryan-wng/RE-sLDA.git
cd RE-sLDA
pip install -e .
Usage
RE-sLDA can be driven two ways: as importable functions inside a notebook/script, or as a command-line tool.
Option A — Import into a notebook
import pandas as pd
import re_slda
X = pd.read_csv("datasets/use_glio_data_filter1000.csv") # has header
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv",
header=None).values.squeeze()
# Bootstrapping — returns a DataFrame, one row per iteration
results = re_slda.run_bootstrapping(X, Y, iters=200, base_seed=42)
# Per-feature Variable Inclusion Probability
vip = re_slda.compute_vip(results)
vip.head(20)
run_subsampling is called the same way and returns a DataFrame with the same layout. Pass out_prefix="MyRun" to either function to also write a timestamped CSV to output/. The public API is:
| Function | Purpose |
|---|---|
| re_slda.run_bootstrapping(X, Y, varnames=None, *, iters=200, ...) | Resampling-with-replacement pipeline. Returns a DataFrame. |
| re_slda.run_subsampling(X, Y, varnames=None, *, iters=200, ...) | Train/test-split pipeline. Returns a DataFrame. |
| re_slda.compute_vip(results) | Variable Inclusion Probabilities from a results DataFrame. |
| re_slda.predict_asda_ordinal(model, X_new) | Predict ordinal labels from a fitted ordASDA model. |
| re_slda.ordASDA(...) | Low-level ordinal sparse LDA fitter. |
Option B — Command line
After pip install, the re-slda console script is available:
re-slda bootstrapping
re-slda subsampling
The legacy invocation still works for users who prefer to clone the repo:
python pipeline.py bootstrapping
python pipeline.py subsampling
Both commands accept --iters, --out-prefix, --save-dir, --seed, --x, and --y. Run re-slda --help for details.
Tutorial: Walkthrough with the Included Glioma Dataset
This repository ships with a real example dataset so you can run the full pipeline before applying RE-sLDA to your own data. The walkthrough below explains each step, how everything works, and how to read the results.
Step 1 — Inspect the example data
Two files are provided in datasets/:
| File | Role | Shape | Description |
|---|---|---|---|
| use_glio_data_filter1000.csv | Feature matrix X | 175 samples × 1000 features | Pre-filtered gene-expression features for glioma patients. First row is the feature names (V4391, V708, …). |
| use_glio_dataY_filter1000.csv | Response vector Y | 175 values | Ordinal tumor-grade labels (e.g. 1, 2, 3, 4). No header row. |
You can preview the data with any spreadsheet tool or:
head -2 datasets/use_glio_data_filter1000.csv
head -5 datasets/use_glio_dataY_filter1000.csv
Step 2 — Run the bootstrapping pipeline
Bootstrapping is the recommended starting point, as it produces Variable Inclusion Probabilities (VIPs) for every feature. Either from the command line:
re-slda bootstrapping # after `pip install RE-sLDA`
# or, from a source checkout:
python pipeline.py bootstrapping
or directly from a notebook:
import pandas as pd, re_slda
X = pd.read_csv("datasets/use_glio_data_filter1000.csv")
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv", header=None).values.squeeze()
results = re_slda.run_bootstrapping(X, Y, iters=200)
What happens during this run (using the defaults):
- The script draws 200 bootstrap replicates of the rows of X/Y (with replacement).
- For each replicate it samples a random predictor subspace of size subspace_size = 5, repeated until ~80% of the predictor pool has been covered (target_unique_prob = 0.8).
- Inside each subspace it fits an ordinal sLDA model with 4-fold cross-validation to pick optimal_lambda.
- Selected variables are accumulated and held-out MAE / Accuracy are recorded.
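The resampling steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code in pipeline.py: the helper names `bootstrap_indices` and `sample_subspaces` are hypothetical, the 100-feature pool is made up, and treating the out-of-bag rows as the held-out set is an assumption this README does not confirm.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement; also return the rows never drawn."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def sample_subspaces(pool, subspace_size, target_coverage, rng):
    """Sample random subspaces until target_coverage of the pool has been drawn."""
    covered, subspaces = set(), []
    while len(covered) < target_coverage * len(pool):
        subspace = rng.sample(pool, subspace_size)  # no repeats within a subspace
        subspaces.append(subspace)
        covered.update(subspace)
    return subspaces, covered

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_indices(175, rng)

pool = [f"V{i}" for i in range(1, 101)]  # hypothetical pool of 100 features
subspaces, covered = sample_subspaces(pool, subspace_size=5, target_coverage=0.8, rng=rng)
```

Because subspaces overlap at random, reaching 80% coverage takes noticeably more than 80 / 5 = 16 draws, which is one reason each replicate involves many model fits.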
Expect a runtime of several minutes on a modern multi-core laptop. The console will print progress as each replicate completes.
Step 3 — Run the subsampling pipeline (optional, for comparison)
re-slda subsampling
or from a notebook:
results = re_slda.run_subsampling(X, Y, iters=200)
Subsampling replaces the bootstrap with repeated train/test splits (test_ratio = 0.20, n_subspaces = 5, 200 iterations). It is useful as a sanity check: features that appear stable under both schemes are the most trustworthy.
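For comparison with the bootstrap, a repeated train/test split can be sketched as follows. The helper `train_test_split_indices` is hypothetical and uses a plain random shuffle; whether the package stratifies splits by grade is not stated here, so this sketch does not assume it.

```python
import random

def train_test_split_indices(n_rows, test_ratio, rng):
    """Shuffle row indices and hold out a test_ratio fraction (no replacement)."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_ratio)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

rng = random.Random(0)
# Three independent splits of the 175 glioma samples with test_ratio = 0.20.
splits = [train_test_split_indices(175, 0.20, rng) for _ in range(3)]
train_idx, test_idx = splits[0]
```

Unlike a bootstrap replicate, each training set here contains every row at most once, so the two schemes perturb the data in different ways.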
Step 4 — Locate the output
When invoked from the command line, results are written to output/ with a timestamped filename:
output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv # bootstrapping
output/CVlam_Glio_Subspace_subsampling_<mmddHHMM>.csv # subsampling
From a notebook nothing is written to disk by default — the call returns a DataFrame. Pass out_prefix="MyRun" (and optionally save_dir=...) to also write a timestamped CSV.
Step 5 — Interpret the results
Each CSV contains one row per resampling iteration with these columns:
| Column | Meaning | How to read it |
|---|---|---|
| Selected_Variables | Comma-separated features chosen on that iteration | Tally these across rows. Features appearing in many rows are stable. The fraction of rows in which a feature appears is its Variable Inclusion Probability (VIP). |
| optimal_lambda | Cross-validated regularisation strength | A tight distribution suggests the regularisation surface is well-behaved; very high variance suggests an under-determined problem. |
| MAE | Mean Absolute Error on held-out data | Lower is better. Because Y is ordinal, MAE is the primary performance metric. It penalises a "grade 4 predicted as grade 2" more than a one-step error. |
| Accuracy | Exact-match classification accuracy on held-out data | Use as a secondary metric; ordinal models often have modest accuracy but small MAE. |
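A small worked example shows why MAE and Accuracy can disagree on ordinal labels. The labels and predictions below are invented for illustration, and the two helper functions are plain re-implementations of the standard definitions rather than functions exported by re_slda.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted grades."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true grade exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 2, 3, 4]
near_miss = [1, 2, 3, 3]  # grade 4 predicted as grade 3
far_miss = [1, 2, 3, 2]   # grade 4 predicted as grade 2

mae_near = mean_absolute_error(y_true, near_miss)
mae_far = mean_absolute_error(y_true, far_miss)
acc_near = accuracy(y_true, near_miss)
acc_far = accuracy(y_true, far_miss)
```

Both prediction vectors get the same accuracy (3 of 4 correct), but the two-grade error doubles the MAE, which is the behaviour that makes MAE the more informative metric for ordinal outcomes.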
A typical post-processing pattern in Python:
import pandas as pd, re_slda
# If you ran from a notebook you already have `results`; otherwise read the CSV:
results = pd.read_csv("output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv")
# 1. Per-feature VIP across the 200 iterations
vip = re_slda.compute_vip(results)
print(vip.head(20))
# 2. Predictive performance summary
print(results[["MAE", "Accuracy", "optimal_lambda"]].describe())
Rules of thumb for the example dataset:
- Features with VIP ≥ 0.6 are the candidate stable signature.
- Compare the top VIP features from the bootstrapping and subsampling outputs. The intersection is the most reliable.
- Median MAE on the glioma example should sit well below 1.0 (i.e., in a typical iteration, predictions are off by less than one tumor grade on average).
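The first two rules of thumb amount to a threshold followed by a set intersection. The sketch below uses plain dictionaries with invented VIP values; the return type of re_slda.compute_vip is not documented in this README, so converting its output to a feature-to-VIP mapping is left as an assumption, and `stable_features` is a hypothetical helper.

```python
def stable_features(vip, threshold=0.6):
    """Return the set of features whose VIP meets the threshold."""
    return {feature for feature, prob in vip.items() if prob >= threshold}

# Hypothetical VIPs from the two pipelines.
vip_bootstrap = {"V4391": 0.92, "V708": 0.71, "V12": 0.35}
vip_subsample = {"V4391": 0.88, "V708": 0.55, "V12": 0.64}

# Features that clear the threshold under both resampling schemes.
shared = stable_features(vip_bootstrap) & stable_features(vip_subsample)
```

Here only V4391 survives both schemes; V708 and V12 each pass under one scheme but not the other and would warrant a closer look rather than automatic inclusion.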
Step 6 — Adapt to your own data
Once you are familiar with the framework, swap in your own data (see Setup Dataset below) and tune the parameters described in Parameter Configuration.
Parameter Configuration
All tunable parameters are exposed as keyword arguments to re_slda.run_bootstrapping / re_slda.run_subsampling (and as flags on the re-slda CLI). Pass them at the call site — no need to edit pipeline.py.
| Parameter | Pipeline | Effect |
|---|---|---|
| iters | both | Number of resampling iterations. More iterations → more stable VIPs, longer runtime. |
| cv_folds / n_folds_cv | both | Inner CV folds used to pick lambda. |
| predictor_subset | both | Pool of candidate features per iteration (default 80% of columns). |
| subspace_size | bootstrapping | Size of each randomly sampled predictor subspace. |
| target_unique_prob | bootstrapping | Target unique-sample coverage; controls bootstrap scale. |
| n_subspaces | subsampling | Number of subspaces per train/test split. |
| test_ratio | subsampling | Hold-out fraction for each split. |
| base_seed | both | Random seed for reproducibility. |
| out_prefix, save_dir | both | If set, also write a timestamped CSV to disk. |
Setup Dataset
A sample dataset is already provided in datasets/. To run on your own data:
- From a notebook: load any DataFrame / NumPy array and pass it directly; re_slda.run_bootstrapping(X, Y) accepts both.
- From the CLI: point re-slda at your files with --x path/to/X.csv --y path/to/Y.csv.
Dataset Requirements:
- Files must be in .csv format
- Y (response) dataset
  - Single column, no header
  - Contains ordinal labels only
- X (feature) dataset
  - Rows represent samples
  - Columns represent features
  - First row must contain feature (variable) names
Important: The number of rows in X must match the number of entries in Y.
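Before launching a long run, it can help to check a new dataset against these requirements. The following is a minimal standard-library sketch, not part of re_slda: `check_dataset` is a hypothetical helper, the inline CSV text stands in for real files, and the check assumes ordinal labels are integer-coded.

```python
import csv
import io

def check_dataset(x_file, y_file):
    """Check X/Y CSV contents against the requirements listed above."""
    x_rows = [row for row in csv.reader(x_file) if row]
    feature_names, samples = x_rows[0], x_rows[1:]
    y_rows = [row for row in csv.reader(y_file) if row]

    if any(len(row) != len(feature_names) for row in samples):
        raise ValueError("each X row needs one value per feature name")
    if any(len(row) != 1 for row in y_rows):
        raise ValueError("Y must be a single column")
    if len(samples) != len(y_rows):
        raise ValueError(f"X has {len(samples)} rows but Y has {len(y_rows)} entries")

    # Assumption: labels are integer-coded; int() raises ValueError otherwise.
    levels = sorted({int(row[0]) for row in y_rows})
    return feature_names, levels

# Tiny in-memory stand-ins for X.csv (with header) and Y.csv (no header).
x_csv = io.StringIO("V4391,V708\n0.1,0.2\n0.3,0.4\n0.5,0.6\n")
y_csv = io.StringIO("1\n2\n2\n")
names, levels = check_dataset(x_csv, y_csv)
```

With real data, pass open file handles instead, for example `open("path/to/X.csv", newline="")`.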
Output
When run from the command line, or from Python with out_prefix set, either pipeline writes a CSV file of results.
- By default, output files are saved in the output/ directory (change this with --save-dir or save_dir)
- The filename prefix can be changed with --out-prefix or out_prefix
Output Naming
Output files are timestamped so that successive runs are easy to tell apart and do not overwrite earlier results. The filename format is:
<prefix>_<pipeline>_<mmddHHMM>.csv
Example:
BS_Glios_group_bootstrapping_01251445.csv
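The naming pattern can be reproduced with `datetime.strftime`, which is handy when a script needs to predict or match output filenames. This is a sketch of the documented format only; `output_path` is a hypothetical helper, not a function exported by re_slda, and the year in the example is arbitrary because the stamp does not include one.

```python
from datetime import datetime
from pathlib import Path

def output_path(prefix, pipeline, save_dir="output", now=None):
    """Build <save_dir>/<prefix>_<pipeline>_<mmddHHMM>.csv."""
    stamp = (now or datetime.now()).strftime("%m%d%H%M")
    return Path(save_dir) / f"{prefix}_{pipeline}_{stamp}.csv"

# 25 January, 14:45 reproduces the example filename above.
path = output_path("BS_Glios_group", "bootstrapping", now=datetime(2025, 1, 25, 14, 45))
```

Note that the stamp has minute resolution and no year, so two runs with the same prefix started within the same minute would map to the same filename.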
Notes on Interpretation
Note: The Selected_Variables column reflects the final set of features chosen for that run and may vary across executions due to resampling and randomness. This variation is the signal the framework exploits; aggregate across iterations to obtain the VIP.
Important: Performance metrics are computed on held-out data and may vary depending on the random seed and parameter configuration. Always report summaries (median, IQR) across iterations rather than a single number.
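One way to produce the recommended summary is with the standard library's `statistics` module. The MAE values below are invented, and `median_and_iqr` is a hypothetical helper; with a real results DataFrame the same numbers could be taken from the MAE column.

```python
import statistics

def median_and_iqr(values):
    """Return (median, interquartile range) of a list of per-iteration metrics."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1

# Hypothetical held-out MAE from five resampling iterations.
mae_per_iteration = [0.50, 0.55, 0.60, 0.65, 0.70]
median, iqr = median_and_iqr(mae_per_iteration)
```

Reporting the pair (here a median of 0.60 with an IQR of about 0.10) conveys both the typical error and how much it moves between resamples.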
Reference
Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse Discriminant Analysis. Technometrics.