Resampling-Enhanced Sparse LDA for ordinal outcomes

RE-sLDA: Resampling-Enhanced Sparse LDA for Ordinal Outcomes

This repository contains the implementation of RE-sLDA, a framework designed to enhance feature selection stability and accuracy when dealing with high-dimensional data and ordinal outcomes. By integrating resampling techniques with Sparse Linear Discriminant Analysis (sLDA), this method identifies robust biomarker signatures that standard sparse models often miss due to selection instability.

Key Features

  • Ordinal Outcome Optimization: Specifically tuned for categorical outcomes with a natural ordering (e.g., disease severity, treatment response grades).

  • Resampling-Based Stability: Utilizes bootstrap-based resampling to calculate Variable Inclusion Probabilities (VIP), ensuring the selected features are not artifacts of a single data split.

  • Parallel Computing Support: Fully integrated with multiprocessing for high-performance execution on multi-core machines.


Installation

From PyPI (recommended)

pip install RE-sLDA

This installs the re_slda Python package and a re-slda command-line entry point. A virtual environment is recommended to avoid dependency conflicts.

From source

Clone the repository, then install in editable mode:

git clone https://github.com/your-org/RE-sLDA.git
cd RE-sLDA
pip install -e .

Usage

RE-sLDA can be driven two ways: as importable functions inside a notebook/script, or as a command-line tool.

Option A — Import into a notebook

import pandas as pd
import re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")    # has header
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv",
                header=None).values.squeeze()

# Bootstrapping — returns a DataFrame, one row per iteration
results = re_slda.run_bootstrapping(X, Y, iters=200, base_seed=42)

# Per-feature Variable Inclusion Probability
vip = re_slda.compute_vip(results)
vip.head(20)

run_subsampling is called the same way and also returns a DataFrame with one row per iteration. Pass out_prefix="MyRun" to either function to also write a timestamped CSV to output/. The public API is:

Function | Purpose
re_slda.run_bootstrapping(X, Y, varnames=None, *, iters=200, ...) | Resampling-with-replacement pipeline. Returns a DataFrame.
re_slda.run_subsampling(X, Y, varnames=None, *, iters=200, ...) | Train/test-split pipeline. Returns a DataFrame.
re_slda.compute_vip(results) | Variable Inclusion Probabilities from a results DataFrame.
re_slda.predict_asda_ordinal(model, X_new) | Predict ordinal labels from a fitted ordASDA model.
re_slda.ordASDA(...) | Low-level ordinal sparse LDA fitter.

Option B — Command line

After pip install, the re-slda console script is available:

re-slda bootstrapping
re-slda subsampling

The legacy invocation still works for users who prefer to clone the repo:

python pipeline.py bootstrapping
python pipeline.py subsampling

Both commands accept --iters, --out-prefix, --save-dir, --seed, --x, and --y. Run re-slda --help for details.


Tutorial: Walkthrough with the Included Glioma Dataset

This repository ships with a real example dataset so you can run the full pipeline before applying RE-sLDA to your own data. The walkthrough below covers each step, what the pipeline does at that step, and how to read the results.

Step 1 — Inspect the example data

Two files are provided in datasets/:

File | Role | Shape | Description
use_glio_data_filter1000.csv | Feature matrix X | 175 samples × 1000 features | Pre-filtered gene-expression features for glioma patients. First row is the feature names (V4391, V708, …).
use_glio_dataY_filter1000.csv | Response vector Y | 175 values | Ordinal tumor-grade labels (e.g. 1, 2, 3, 4). No header row.

You can preview the data with any spreadsheet tool or:

head -2 datasets/use_glio_data_filter1000.csv
head -5 datasets/use_glio_dataY_filter1000.csv

Step 2 — Run the bootstrapping pipeline

Bootstrapping is the recommended starting point, as it produces Variable Inclusion Probabilities (VIPs) for every feature. Run it either from the command line:

re-slda bootstrapping            # after `pip install RE-sLDA`
# or, from a source checkout:
python pipeline.py bootstrapping

or directly from a notebook:

import pandas as pd, re_slda

X = pd.read_csv("datasets/use_glio_data_filter1000.csv")
Y = pd.read_csv("datasets/use_glio_dataY_filter1000.csv", header=None).values.squeeze()
results = re_slda.run_bootstrapping(X, Y, iters=200)

What happens during this run (using the defaults):

  1. The script draws 200 bootstrap replicates of the rows of X/Y (with replacement).
  2. For each replicate it samples a random predictor subspace of size subspace_size = 5, repeated until ~80% of the predictor pool has been covered (target_unique_prob = 0.8).
  3. Inside each subspace it fits an ordinal sLDA model with 4-fold cross-validation to pick optimal_lambda.
  4. Selected variables are accumulated and held-out MAE / Accuracy are recorded.
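
The resampling in steps 1 and 2 can be sketched with the standard library alone. This is an illustration of the idea, not the package's internal code: the helpers bootstrap_rows and draw_subspaces are hypothetical names, the stopping rule is one reading of the coverage target described above, and treating the out-of-bag rows as the held-out set is an assumption.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    # Assumption: rows never drawn serve as held-out data for this replicate.
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag

def draw_subspaces(features, subspace_size, target_coverage, rng):
    """Sample random feature subsets until the target share of features is covered."""
    covered, subspaces = set(), []
    while len(covered) < target_coverage * len(features):
        subspace = rng.sample(features, subspace_size)
        subspaces.append(subspace)
        covered.update(subspace)
    return subspaces, covered

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_rows(175, rng)
features = [f"V{i}" for i in range(1000)]  # placeholder feature names
subspaces, covered = draw_subspaces(features, subspace_size=5,
                                    target_coverage=0.8, rng=rng)
print(len(in_bag), len(out_of_bag), len(subspaces), len(covered))
```

In the real pipeline an ordinal sLDA model would be fitted inside each subspace; the sketch only shows how rows and predictors are drawn.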

Expect a runtime of several minutes on a modern multi-core laptop. The console will print progress as each replicate completes.

Step 3 — Run the subsampling pipeline (optional, for comparison)

re-slda subsampling

or from a notebook:

results = re_slda.run_subsampling(X, Y, iters=200)

Subsampling replaces the bootstrap with repeated train/test splits (test_ratio = 0.20, n_subspaces = 5, 200 iterations). It is useful as a sanity check: features that appear stable under both schemes are the most trustworthy.
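
A repeated train/test split of this kind can be sketched as follows. It is only an illustration: split_rows is a hypothetical helper, and the package may stratify or seed its splits differently.

```python
import random

def split_rows(n_rows, test_ratio, rng):
    """Shuffle row indices and hold out a test_ratio share of them."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_ratio)
    return indices[n_test:], indices[:n_test]  # (train, test)

rng = random.Random(42)
splits = [split_rows(175, 0.20, rng) for _ in range(200)]
train, test = splits[0]
print(len(train), len(test))  # 140 35
```

Unlike a bootstrap replicate, each row appears at most once per split, so the training set contains no duplicated samples.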

Step 4 — Locate the output

When invoked from the command line, results are written to output/ with a timestamped filename:

output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv      # bootstrapping
output/CVlam_Glio_Subspace_subsampling_<mmddHHMM>.csv   # subsampling

From a notebook nothing is written to disk by default — the call returns a DataFrame. Pass out_prefix="MyRun" (and optionally save_dir=...) to also write a timestamped CSV.

Step 5 — Interpret the results

Each CSV contains one row per resampling iteration with these columns:

Column | Meaning | How to read it
Selected_Variables | Comma-separated features chosen on that iteration | Tally these across rows. Features appearing in many rows are stable. The fraction of rows in which a feature appears is its Variable Inclusion Probability (VIP).
optimal_lambda | Cross-validated regularisation strength | A tight distribution suggests the regularisation surface is well-behaved; very high variance suggests an under-determined problem.
MAE | Mean Absolute Error on held-out data | Lower is better. Because Y is ordinal, MAE is the primary performance metric. It penalises a "grade 4 predicted as grade 2" more than a one-step error.
Accuracy | Exact-match classification accuracy on held-out data | Use as a secondary metric; ordinal models often have modest accuracy but small MAE.
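
The difference between the two metrics is easiest to see on a toy example. The sketch below assumes the ordinal labels are coded as consecutive integers; the two helper functions are written out here for illustration and are not part of the re_slda API.

```python
def mean_absolute_error(y_true, y_pred):
    """Average distance, in grades, between true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 2, 3, 4]
near_miss = [1, 2, 3, 3]  # one prediction off by one grade
far_miss = [1, 2, 3, 1]   # one prediction off by three grades

print(accuracy(y_true, near_miss), mean_absolute_error(y_true, near_miss))  # 0.75 0.25
print(accuracy(y_true, far_miss), mean_absolute_error(y_true, far_miss))    # 0.75 0.75
```

Both prediction sets have the same accuracy, but MAE separates the near miss from the far miss, which is why it is treated as the primary metric for ordinal outcomes.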

A typical post-processing pattern in Python:

import pandas as pd, re_slda

# If you ran from a notebook you already have `results`; otherwise read the CSV:
results = pd.read_csv("output/BS_Glios_group_bootstrapping_<mmddHHMM>.csv")

# 1. Per-feature VIP across the 200 iterations
vip = re_slda.compute_vip(results)
print(vip.head(20))

# 2. Predictive performance summary
print(results[["MAE", "Accuracy", "optimal_lambda"]].describe())
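
To make the VIP definition concrete, here is a hand-rolled tally over a few toy rows. It is a sketch of the definition given in the table above, not the implementation of re_slda.compute_vip; the helper vip_from_rows is hypothetical, and it assumes each Selected_Variables entry is a comma-separated string of feature names.

```python
from collections import Counter

def vip_from_rows(selected_rows):
    """Fraction of iterations in which each feature was selected."""
    counts = Counter()
    for row in selected_rows:
        # Use a set so a feature counts at most once per iteration.
        names = {name.strip() for name in row.split(",") if name.strip()}
        counts.update(names)
    return {name: n / len(selected_rows) for name, n in counts.items()}

# Toy Selected_Variables values, one string per iteration.
rows = ["V4391, V708", "V4391", "V708, V12, V4391", "V4391"]
vip = vip_from_rows(rows)
for name, p in sorted(vip.items(), key=lambda kv: -kv[1]):
    print(name, p)
```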

Rules of thumb for the example dataset:

  • Features with VIP ≥ 0.6 are the candidate stable signature.
  • Compare the top VIP features from the bootstrapping and subsampling outputs. The intersection is the most reliable.
  • Median MAE on the glioma example should sit well below 1.0 (i.e. on average predictions are off by less than one tumor grade).
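
The first two rules can be combined in a few lines. The VIP values below are invented for illustration, and stable_features is a hypothetical helper; in practice the two dictionaries would come from the bootstrapping and subsampling results.

```python
def stable_features(vip, threshold=0.6):
    """Names whose inclusion probability meets the threshold."""
    return {name for name, p in vip.items() if p >= threshold}

# Toy VIP values, invented for illustration only.
vip_bootstrap = {"V4391": 0.92, "V708": 0.71, "V12": 0.64, "V99": 0.10}
vip_subsample = {"V4391": 0.88, "V708": 0.55, "V12": 0.67, "V99": 0.05}

consensus = stable_features(vip_bootstrap) & stable_features(vip_subsample)
print(sorted(consensus))  # ['V12', 'V4391']
```

Here V708 passes the threshold under bootstrapping only, so it is left out of the consensus set.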

Step 6 — Adapt to your own data

Once you are familiar with the framework, swap in your own data (see Setup Dataset below) and tune the parameters described in Parameter Configuration.


Parameter Configuration

All tunable parameters are exposed as keyword arguments to re_slda.run_bootstrapping / re_slda.run_subsampling (and as flags on the re-slda CLI). Pass them at the call site — no need to edit pipeline.py.

Parameter | Pipeline | Effect
iters | both | Number of resampling iterations. More iterations → more stable VIPs, longer runtime.
cv_folds / n_folds_cv | both | Inner CV folds used to pick lambda.
predictor_subset | both | Pool of candidate features per iteration (default 80% of columns).
subspace_size | bootstrapping | Size of each randomly sampled predictor subspace.
target_unique_prob | bootstrapping | Target unique-sample coverage; controls bootstrap scale.
n_subspaces | subsampling | Number of subspaces per train/test split.
test_ratio | subsampling | Hold-out fraction for each split.
base_seed | both | Random seed for reproducibility.
out_prefix, save_dir | both | If set, also write a timestamped CSV to disk.

Setup Dataset

A sample dataset is already provided in datasets/. To run on your own data:

  • From a notebook: load any DataFrame / NumPy array and pass it directly — re_slda.run_bootstrapping(X, Y) accepts both.
  • From the CLI: point re-slda at your files with --x path/to/X.csv --y path/to/Y.csv.

Dataset Requirements:

  • Files must be in .csv format
  • Y (response) dataset
    • Single column, no header
    • Contains ordinal labels only
  • X (feature) dataset
    • Rows represent samples
    • Columns represent features
    • First row must contain feature (variable) names

Important: The number of rows in X must match the number of entries in Y.
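
A quick pre-flight check of these requirements can save a failed run. The sketch below uses only the standard library; check_dataset is a hypothetical helper, not part of re_slda, and it assumes the ordinal labels are coded as integers.

```python
import csv
import io

def check_dataset(x_file, y_file):
    """Check the X/Y layout rules above; return (n_samples, n_features)."""
    x_rows = list(csv.reader(x_file))
    header, samples = x_rows[0], x_rows[1:]
    y_rows = [row for row in csv.reader(y_file) if row]
    if any(len(row) != 1 for row in y_rows):
        raise ValueError("Y must have exactly one column")
    # int() raises ValueError if Y has a header row or a non-integer label.
    labels = [int(row[0]) for row in y_rows]
    if any(len(row) != len(header) for row in samples):
        raise ValueError("every X row must have one value per feature name")
    if len(samples) != len(labels):
        raise ValueError(f"X has {len(samples)} rows but Y has {len(labels)} entries")
    return len(samples), len(header)

# In-memory stand-ins for X.csv and Y.csv (toy values).
x_csv = io.StringIO("V4391,V708\n0.1,0.2\n0.3,0.4\n0.5,0.6\n")
y_csv = io.StringIO("1\n2\n3\n")
print(check_dataset(x_csv, y_csv))  # (3, 2)
```

With real files, pass handles opened with open(path, newline="") instead of the StringIO objects.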


Output

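After running either the bootstrapping or subsampling pipeline from the command line, the framework writes a CSV file to the output directory. From a notebook, a file is written only when out_prefix is set.

  • Files are saved in the output/ directory unless another location is given with save_dir (--save-dir on the CLI)
  • The filename prefix is set with out_prefix (--out-prefix on the CLI)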

Output Naming

Output files are timestamped so that results from earlier runs are not overwritten. The filename format is:

<prefix>_<pipeline>_<mmddHHMM>.csv

Example:

BS_Glios_group_bootstrapping_01251445.csv
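
The naming pattern can be reproduced with datetime.strftime, which is handy when locating or globbing result files from a script. The helper output_name is hypothetical, and the year in the example is arbitrary because the stamp does not include it.

```python
from datetime import datetime

def output_name(prefix, pipeline, when):
    """Build <prefix>_<pipeline>_<mmddHHMM>.csv for a given datetime."""
    return f"{prefix}_{pipeline}_{when.strftime('%m%d%H%M')}.csv"

print(output_name("BS_Glios_group", "bootstrapping", datetime(2025, 1, 25, 14, 45)))
# BS_Glios_group_bootstrapping_01251445.csv
```

Because the stamp has no year or seconds, two runs started within the same minute with the same prefix map to the same filename.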

Notes on Interpretation

Note: The Selected_Variables column reflects the final set of features chosen for that run and may vary across executions due to resampling and randomness. This variation is the signal the framework exploits; aggregate across iterations to obtain the VIP.

Important: Performance metrics are computed on held-out data and may vary depending on the random seed and parameter configuration. Always report summaries (median, IQR) across iterations rather than a single number.
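
A median and interquartile range can be computed with the standard library (Python 3.8 or later). The MAE values below are invented for illustration, and median_iqr is a hypothetical helper; with a results DataFrame the input would be the MAE column converted to a list.

```python
import statistics

def median_iqr(values):
    """Return (median, IQR) using the default 'exclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, q3 - q1

# Toy per-iteration MAE values.
mae_per_iteration = [0.50, 0.60, 0.55, 0.70, 0.65, 0.45, 0.60, 0.58]
median, iqr = median_iqr(mae_per_iteration)
print(f"MAE median={median:.3f}, IQR={iqr:.3f}")
```

Other tools may use a different quartile convention, so state the method alongside the reported IQR.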


Reference

Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse Discriminant Analysis. Technometrics.

