Skip to main content

ElasticNet stability selection with the Shah & Samworth (2013) r-concave PFER bound, for protein-level biomarker discovery from DIA-NN output.

Project description

stably

PyPI Python License

ElasticNet stability selection with the Shah & Samworth (2013) r-concave PFER bound, for protein-level biomarker discovery from DIA-NN output.

The selection-probability threshold π is derived per-run from the r-concave tail bound (S&S 2013, equation 8) using the data-dependent quantity θ = q/p. This is tighter than the Meinshausen & Buhlmann (2010) worst-case bound and so retains more features for the same PFER budget. The package operates on protein-level DIA-NN matrices (pg_matrix); peptide-level input is rejected explicitly.


Installation

Requires Python ≥ 3.10.

pip install stably

To also install test dependencies:

pip install "stably[test]"

To install the latest development version directly from GitHub:

pip install git+https://github.com/byrnedaniel5-eng/stably.git

Or for an editable install when developing the package itself:

git clone https://github.com/byrnedaniel5-eng/stably.git
pip install -e ./stably

Verify the install:

python -c "import stably; print(stably.__version__)"
# 0.4.1

Dependencies

Pulled in automatically by pip install: numpy, pandas, scipy, scikit-learn, joblib, matplotlib, seaborn, pyyaml, openpyxl.


Input data

Two files are required:

1. Protein-level abundance matrix (pg_matrix)

A CSV produced by DIA-NN. The first 6 columns are protein-level metadata (Protein.Group, Protein.Ids, Protein.Names, Genes, First.Protein.Description, N.Proteotypic.Sequences); every column from index 6 onwards is a sample, named by the run/biobank ID.

Peptide-level matrices (pr_matrix, detectable by a Stripped.Sequence column) are rejected at load time — see the docstring on io.py for why.

2. Sample-label spreadsheet

An .xlsx file with at least:

  • a column matching the sample IDs used as column headers in the pg_matrix (e.g. Biobank Number)
  • a group column with two values matching case_name and control_name from the config (e.g. Group containing PDAC / Non-cancer)

Samples present in the pg_matrix but missing from the label file are dropped with a warning.


Configuration

Runs are driven by a YAML config that you supply per-analysis. The fastest way to start is to drop a fully-commented template into your analysis directory:

python -m stably init

This writes config_stably.yaml next to where you ran the command. Edit at minimum data_file, label_file, case_name, control_name, and output_dir. The template is also browsable at examples/config_template.yaml.

The most important keys:

Key Meaning
data_file, label_file Paths to the two input files
case_name, control_name Labels in the group column to map to y=1 / y=0
group_column, sample_id_column Column names in the label file
min_proteotypic_peptides Drops proteins with N.Proteotypic.Sequences below this (default 2)
max_missing Per-protein NaN fraction allowed after sample alignment
imputation_strategy knn, minimum, mean, or median
log_transform Apply log2(x+1) before scaling
max_candidates (q) Number of top-
stability_iterations 2 × B; S&S §3.4.1 recommends B ≤ 50 so leave at 100 unless you understand the r-concavity assumption
stability_pfer Absolute PFER budget E[V]
l1_ratio ElasticNet mixing parameter (1 = lasso, 0 = ridge)
output_dir Results destination

The threshold π is not specified directly; it is derived per-run from (q, p, B, PFER) via the r-concave bound after preprocessing fixes p.


Running on real data

From the directory containing your config (and data files referenced by it):

python -m stably --config config_stably.yaml

This prints class balance, q/p, B, the r-concave vs M&B threshold comparison, calibrated C_ref, the top-20 stable features by selection probability π̂, and writes results to output_dir. Relative paths in the config are resolved from the current working directory.


Toy example with synthetic data

Use this to sanity-check the install on a tiny fabricated dataset (no DIA-NN run required). It produces a pg_matrix-shaped CSV, an Excel label file, and a config, then runs the pipeline.

Save as toy_run.py next to the package and run with python toy_run.py:

"""Minimal end-to-end smoke test for stably on synthetic data."""
import numpy as np
import pandas as pd
import yaml
from pathlib import Path

rng = np.random.default_rng(0)
n_samples_per_group = 20
n_proteins = 200
n_true_signal = 10  # proteins with a real case/control difference

work = Path("toy_workspace")
work.mkdir(exist_ok=True)

# 1. Build a synthetic pg_matrix-shaped CSV.
sample_ids = [f"S{i:03d}" for i in range(2 * n_samples_per_group)]
y = np.array([0] * n_samples_per_group + [1] * n_samples_per_group)

X = rng.normal(loc=20.0, scale=1.0, size=(n_proteins, len(sample_ids)))
# Inject a real signal into the first n_true_signal proteins (up in cases).
X[:n_true_signal, y == 1] += rng.normal(loc=2.0, scale=0.3,
                                        size=(n_true_signal, n_samples_per_group))

metadata = pd.DataFrame({
    "Protein.Group":              [f"P{i:05d}" for i in range(n_proteins)],
    "Protein.Ids":                [f"P{i:05d}" for i in range(n_proteins)],
    "Protein.Names":              [f"PROT{i}_HUMAN" for i in range(n_proteins)],
    "Genes":                      [f"GENE{i}" for i in range(n_proteins)],
    "First.Protein.Description":  [f"Synthetic protein {i}" for i in range(n_proteins)],
    "N.Proteotypic.Sequences":    rng.integers(2, 8, size=n_proteins),
})
pg_matrix = pd.concat([metadata, pd.DataFrame(X, columns=sample_ids)], axis=1)
pg_matrix.to_csv(work / "toy.pg_matrix.csv", index=False)

# 2. Build a label spreadsheet matching the sample IDs.
labels = pd.DataFrame({
    "Biobank Number": sample_ids,
    "Group":          ["Non-cancer" if yi == 0 else "PDAC" for yi in y],
})
labels.to_excel(work / "toy_labels.xlsx", index=False)

# 3. Write a config.
cfg = {
    "data_file":               str(work / "toy.pg_matrix.csv"),
    "label_file":              str(work / "toy_labels.xlsx"),
    "control_name":            "Non-cancer",
    "case_name":               "PDAC",
    "group_column":            "Group",
    "sample_id_column":        "Biobank Number",
    "random_state":            42,
    "max_sample_missing":      0.7,
    "max_missing":             0.0,
    "log_transform":           False,   # data is already on a log-like scale
    "imputation_strategy":     "knn",
    "knn_neighbors":           5,
    "min_proteotypic_peptides": 2,
    "max_iterations":          5000,
    "max_candidates":          20,
    "stability_iterations":    100,     # 2 × B = 2 × 50
    "stability_pfer":          5,
    "l1_ratio":                0.3,
    "n_jobs":                  -1,
    "output_dir":              str(work / "toy_results"),
}
cfg_path = work / "toy_config.yaml"
with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)

# 4. Run the pipeline programmatically.
from stably import (
    Config, load_data, save_results, create_visualizations,
    full_dataset_stability_elasticnet,
)

config = Config.from_yaml(cfg_path)
np.random.seed(config.RANDOM_STATE)

X_raw, y, sample_ids, protein_metadata = load_data(config)
results = full_dataset_stability_elasticnet(X_raw, y, config)

if results is None:
    print("No stable features found.")
else:
    save_results(results, protein_metadata, config)
    create_visualizations(results, protein_metadata, config)

    sel_probs = results["feature_stability"]["selection_probabilities"]
    stable    = results["feature_stability"]["stable_features"]
    top = sorted(stable, key=lambda f: sel_probs[f], reverse=True)[:10]
    print("\nTop stable features:")
    for f in top:
        gene = protein_metadata.loc[f, "Genes"]
        print(f"  {gene:>10s}  π̂ = {sel_probs[f]:.3f}")

If the install is healthy you should see most of the first 10 injected proteins (GENE0GENE9) show up in the stable list. The toy dataset is small enough to finish in a few seconds on a laptop.


Outputs

output_dir will contain four timestamped artefacts per run:

File Contents
stable_features_<ts>.csv One row per measured protein with selection_probability, cohens_d, direction, n_case, n_control, imputation_rate. Sorted by π̂. The "stable" subset is everything with π̂ ≥ threshold (recorded in the manifest).
stable_features_<ts>.schema.json Per-column descriptions for the CSV.
full_results_<ts>.pkl Pickled results dict (feature_stability, stability_selection, preprocessing, sample_counts).
run_manifest_<ts>.json Reproducibility manifest: package + dependency versions, input SHA-256 hashes, upstream DIA-NN log provenance, resolved config, derived parameters (C_ref, threshold, M&B threshold, θ, B, PFER), preprocessing summary, convergence stats.

create_visualizations additionally writes selection-probability and threshold-comparison plots to the same directory.


Programmatic API

The CLI is just a thin wrapper. The same pipeline from Python:

from stably import (
    Config, load_data, save_results, create_visualizations,
    full_dataset_stability_elasticnet,
)

config = Config.from_yaml("config_stably.yaml")
X_raw, y, sample_ids, protein_metadata = load_data(config)
results = full_dataset_stability_elasticnet(X_raw, y, config)
save_results(results, protein_metadata, config)
create_visualizations(results, protein_metadata, config)

Lower-level entry points exposed at the package root:

  • Preprocessor(config).fit_transform(X, y) — leakage-safe missing-filter + log + impute + z-score; reusable on held-out data via .transform.
  • stability_selection_elasticnet(X, y, config) — returns (stable_features, threshold, selection_probs, C_ref, threshold_info).
  • compute_rconcave_threshold(q, p, B, pfer) and compute_D(...) — the threshold maths in isolation.
  • calculate_cohens_d(X, y, features) — signed effect size on the preprocessed matrix.

Tests

pytest stably/tests          # fast suite
pytest stably/tests --runslow  # includes the empirical PFER calibration test

The suite covers the r-concave bound (test_rconcave.py), an empirical PFER check (test_pfer_empirical.py), Cohen's d edge cases, peptide-input rejection, preprocessing leakage guarantees, and the end-to-end CSV/manifest schema.


Theory — one paragraph

For each subsample the procedure fits an ElasticNet logistic regression at a fixed regularisation C_ref (calibrated once on the full data so each subsample yields ≤ q non-zero coefficients), then records the top-q features by |β|. With B complementary pairs (2B subsamples) the selection probability π̂ for each feature is the empirical proportion of subsamples in which it appeared. A feature is called stable when π̂ ≥ π, where π is the smallest threshold for which the r-concave bound (Shah & Samworth 2013, eq. 8)

E[V] ≤ min{ D(θ², 2τ-1, B, -½), D(θ, τ, 2B, -¼) } × p

is below the user-specified PFER budget. θ = q/p is data-dependent and so π must be computed after preprocessing has fixed p. The bound reduces to M&B's 1/(2τ-1) bound when r-concavity is dropped, hence the manifest reports both for diagnostic purposes.


Citation

Shah, R. D. & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. JRSS B, 75(1):55–80.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stably-0.5.0.tar.gz (44.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

stably-0.5.0-py3-none-any.whl (48.1 kB view details)

Uploaded Python 3

File details

Details for the file stably-0.5.0.tar.gz.

File metadata

  • Download URL: stably-0.5.0.tar.gz
  • Upload date:
  • Size: 44.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for stably-0.5.0.tar.gz
Algorithm Hash digest
SHA256 631774e2c94d676edcc1cedea58a4078951cfb10d348572b487a93c9158f668e
MD5 e1d83ce7acf231349804ceec708a160e
BLAKE2b-256 4b75bfe3865fbdb7776193fa0ca875c56edb786e4d15def939f54d3f1219e042

See more details on using hashes here.

File details

Details for the file stably-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: stably-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 48.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for stably-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 920a127e9a5045311b4a59efa1eee6c89093e0771e78e0e653978c27e257c169
MD5 01c3e14eeb169f38b9a131f4d9395202
BLAKE2b-256 73ef1d5cdca0a0f8dc3f6b5969f26f4c966fb5518b4b697856be594698e1b175

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page