ElasticNet stability selection with the Shah & Samworth (2013) r-concave PFER bound, for protein-level biomarker discovery from DIA-NN output.

Project description

stably

ElasticNet stability selection with the Shah & Samworth (2013) r-concave PFER bound, for protein-level biomarker discovery from DIA-NN output.

The selection-probability threshold π is derived per-run from the r-concave tail bound (S&S 2013, equation 8) using the data-dependent quantity θ = q/p. This is tighter than the Meinshausen & Buhlmann (2010) worst-case bound and so retains more features for the same PFER budget. The package operates on protein-level DIA-NN matrices (pg_matrix); peptide-level input is rejected explicitly.

Installation

Requires Python ≥ 3.10.

pip install stably

To also install test dependencies:

pip install "stably[test]"

To install the latest development version directly from GitHub:

pip install git+https://github.com/byrnedaniel5-eng/stably.git

Or for an editable install when developing the package itself:

git clone https://github.com/byrnedaniel5-eng/stably.git
pip install -e ./stably

Verify the install:

python -c "import stably; print(stably.__version__)"
# 0.4.1

Dependencies

Pulled in automatically by pip install: numpy, pandas, scipy, scikit-learn, joblib, matplotlib, seaborn, pyyaml, openpyxl.

Input data

Two files are required:

1. Protein-level abundance matrix (`pg_matrix`)

A CSV produced by DIA-NN. The first 6 columns are protein-level metadata (Protein.Group, Protein.Ids, Protein.Names, Genes, First.Protein.Description, N.Proteotypic.Sequences); every column from index 6 onwards is a sample, named by the run/biobank ID.

Peptide-level matrices (pr_matrix, detectable by a Stripped.Sequence column) are rejected at load time — see the docstring on io.py for why.

2. Sample-label spreadsheet

An .xlsx file with at least:

a column matching the sample IDs used as column headers in the pg_matrix (e.g. Biobank Number)
a group column with two values matching case_name and control_name from the config (e.g. Group containing PDAC / Non-cancer)

Samples present in the pg_matrix but missing from the label file are dropped with a warning.

Configuration

Runs are driven by a YAML config that you supply per-analysis. The fastest way to start is to drop a fully-commented template into your analysis directory:

python -m stably init

This writes config_stably.yaml next to where you ran the command. Edit at minimum data_file, label_file, case_name, control_name, and output_dir. The template is also browsable at examples/config_template.yaml.

The most important keys:

Key	Meaning
`data_file`, `label_file`	Paths to the two input files
`case_name`, `control_name`	Labels in the group column to map to y=1 / y=0
`group_column`, `sample_id_column`	Column names in the label file
`min_proteotypic_peptides`	Drops proteins with `N.Proteotypic.Sequences` below this (default 2)
`max_missing`	Per-protein NaN fraction allowed after sample alignment
`imputation_strategy`	`knn`, `minimum`, `mean`, or `median`
`log_transform`	Apply log2(x+1) before scaling
`max_candidates` (q)	Number of top-
`stability_iterations`	2 × B; S&S §3.4.1 recommends B ≤ 50 so leave at 100 unless you understand the r-concavity assumption
`stability_pfer`	Absolute PFER budget E[V]
`l1_ratio`	ElasticNet mixing parameter (1 = lasso, 0 = ridge)
`output_dir`	Results destination

The threshold π is not specified directly; it is derived per-run from (q, p, B, PFER) via the r-concave bound after preprocessing fixes p.

Running on real data

From the directory containing your config (and data files referenced by it):

python -m stably --config config_stably.yaml

This prints class balance, q/p, B, the r-concave vs M&B threshold comparison, calibrated C_ref, the top-20 stable features by selection probability π̂, and writes results to output_dir. Relative paths in the config are resolved from the current working directory.

Toy example with synthetic data

Use this to sanity-check the install on a tiny fabricated dataset (no DIA-NN run required). It produces a pg_matrix-shaped CSV, an Excel label file, and a config, then runs the pipeline.

Save as toy_run.py next to the package and run with python toy_run.py:

"""Minimal end-to-end smoke test for stably on synthetic data."""
import numpy as np
import pandas as pd
import yaml
from pathlib import Path

rng = np.random.default_rng(0)
n_samples_per_group = 20
n_proteins = 200
n_true_signal = 10  # proteins with a real case/control difference

work = Path("toy_workspace")
work.mkdir(exist_ok=True)

# 1. Build a synthetic pg_matrix-shaped CSV.
sample_ids = [f"S{i:03d}" for i in range(2 * n_samples_per_group)]
y = np.array([0] * n_samples_per_group + [1] * n_samples_per_group)

X = rng.normal(loc=20.0, scale=1.0, size=(n_proteins, len(sample_ids)))
# Inject a real signal into the first n_true_signal proteins (up in cases).
X[:n_true_signal, y == 1] += rng.normal(loc=2.0, scale=0.3,
                                        size=(n_true_signal, n_samples_per_group))

metadata = pd.DataFrame({
    "Protein.Group":              [f"P{i:05d}" for i in range(n_proteins)],
    "Protein.Ids":                [f"P{i:05d}" for i in range(n_proteins)],
    "Protein.Names":              [f"PROT{i}_HUMAN" for i in range(n_proteins)],
    "Genes":                      [f"GENE{i}" for i in range(n_proteins)],
    "First.Protein.Description":  [f"Synthetic protein {i}" for i in range(n_proteins)],
    "N.Proteotypic.Sequences":    rng.integers(2, 8, size=n_proteins),
})
pg_matrix = pd.concat([metadata, pd.DataFrame(X, columns=sample_ids)], axis=1)
pg_matrix.to_csv(work / "toy.pg_matrix.csv", index=False)

# 2. Build a label spreadsheet matching the sample IDs.
labels = pd.DataFrame({
    "Biobank Number": sample_ids,
    "Group":          ["Non-cancer" if yi == 0 else "PDAC" for yi in y],
})
labels.to_excel(work / "toy_labels.xlsx", index=False)

# 3. Write a config.
cfg = {
    "data_file":               str(work / "toy.pg_matrix.csv"),
    "label_file":              str(work / "toy_labels.xlsx"),
    "control_name":            "Non-cancer",
    "case_name":               "PDAC",
    "group_column":            "Group",
    "sample_id_column":        "Biobank Number",
    "random_state":            42,
    "max_sample_missing":      0.7,
    "max_missing":             0.0,
    "log_transform":           False,   # data is already on a log-like scale
    "correlation_threshold":   0.95,
    "imputation_strategy":     "knn",
    "knn_neighbors":           5,
    "min_proteotypic_peptides": 2,
    "max_iterations":          5000,
    "max_candidates":          20,
    "stability_iterations":    100,     # 2 × B = 2 × 50
    "stability_pfer":          5,
    "l1_ratio":                0.3,
    "n_jobs":                  -1,
    "output_dir":              str(work / "toy_results"),
}
cfg_path = work / "toy_config.yaml"
with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)

# 4. Run the pipeline programmatically.
from stably import (
    Config, load_data, save_results, create_visualizations,
    full_dataset_stability_elasticnet,
)

config = Config.from_yaml(cfg_path)
np.random.seed(config.RANDOM_STATE)

X_raw, y, sample_ids, protein_metadata = load_data(config)
results = full_dataset_stability_elasticnet(X_raw, y, config)

if results is None:
    print("No stable features found.")
else:
    save_results(results, protein_metadata, config)
    create_visualizations(results, protein_metadata, config)

    sel_probs = results["feature_stability"]["selection_probabilities"]
    stable    = results["feature_stability"]["stable_features"]
    top = sorted(stable, key=lambda f: sel_probs[f], reverse=True)[:10]
    print("\nTop stable features:")
    for f in top:
        gene = protein_metadata.loc[f, "Genes"]
        print(f"  {gene:>10s}  π̂ = {sel_probs[f]:.3f}")

If the install is healthy you should see most of the first 10 injected proteins (GENE0–GENE9) show up in the stable list. The toy dataset is small enough to finish in a few seconds on a laptop.

Outputs

output_dir will contain four timestamped artefacts per run:

File	Contents
`stable_features_<ts>.csv`	One row per measured protein with `selection_probability`, `cohens_d`, `direction`, `n_case`, `n_control`, `imputation_rate`. Sorted by π̂. The "stable" subset is everything with π̂ ≥ threshold (recorded in the manifest).
`stable_features_<ts>.schema.json`	Per-column descriptions for the CSV.
`full_results_<ts>.pkl`	Pickled `results` dict (feature_stability, stability_selection, preprocessing, sample_counts).
`run_manifest_<ts>.json`	Reproducibility manifest: package + dependency versions, input SHA-256 hashes, upstream DIA-NN log provenance, resolved config, derived parameters (`C_ref`, threshold, M&B threshold, θ, B, PFER), preprocessing summary, convergence stats.

create_visualizations additionally writes selection-probability and threshold-comparison plots to the same directory.

Programmatic API

The CLI is just a thin wrapper. The same pipeline from Python:

from stably import (
    Config, load_data, save_results, create_visualizations,
    full_dataset_stability_elasticnet,
)

config = Config.from_yaml("config_stably.yaml")
X_raw, y, sample_ids, protein_metadata = load_data(config)
results = full_dataset_stability_elasticnet(X_raw, y, config)
save_results(results, protein_metadata, config)
create_visualizations(results, protein_metadata, config)

Lower-level entry points exposed at the package root:

Preprocessor(config).fit_transform(X, y) — leakage-safe missing-filter + log + impute + z-score; reusable on held-out data via .transform.
stability_selection_elasticnet(X, y, config) — returns (stable_features, threshold, selection_probs, C_ref, threshold_info).
compute_rconcave_threshold(q, p, B, pfer) and compute_D(...) — the threshold maths in isolation.
calculate_cohens_d(X, y, features) — signed effect size on the preprocessed matrix.

Tests

pytest stably/tests          # fast suite
pytest stably/tests --runslow  # includes the empirical PFER calibration test

The suite covers the r-concave bound (test_rconcave.py), an empirical PFER check (test_pfer_empirical.py), Cohen's d edge cases, peptide-input rejection, preprocessing leakage guarantees, and the end-to-end CSV/manifest schema.

Theory — one paragraph

For each subsample the procedure fits an ElasticNet logistic regression at a fixed regularisation C_ref (calibrated once on the full data so each subsample yields ≤ q non-zero coefficients), then records the top-q features by |β|. With B complementary pairs (2B subsamples) the selection probability π̂ for each feature is the empirical proportion of subsamples in which it appeared. A feature is called stable when π̂ ≥ π, where π is the smallest threshold for which the r-concave bound (Shah & Samworth 2013, eq. 8)

E[V] ≤ min{ D(θ², 2τ-1, B, -½), D(θ, τ, 2B, -¼) } × p

is below the user-specified PFER budget. θ = q/p is data-dependent and so π must be computed after preprocessing has fixed p. The bound reduces to M&B's 1/(2τ-1) bound when r-concavity is dropped, hence the manifest reports both for diagnostic purposes.

Citation

Shah, R. D. & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. JRSS B, 75(1):55–80.

Project details

Release history Release notifications | RSS feed

0.5.0

May 11, 2026

0.4.2

May 8, 2026

This version

0.4.1

May 8, 2026

0.4.0

May 8, 2026

0.3.0

May 8, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stably-0.4.1.tar.gz (41.2 kB view details)

Uploaded May 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

stably-0.4.1-py3-none-any.whl (44.2 kB view details)

Uploaded May 8, 2026 Python 3

File details

Details for the file stably-0.4.1.tar.gz.

File metadata

Download URL: stably-0.4.1.tar.gz
Upload date: May 8, 2026
Size: 41.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for stably-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`540f775aefc65c28d8cee2b1026b6ff9360405f8051e2934d559679eff737034`
MD5	`9d3e4782a72df843aeaf04fbdee78115`
BLAKE2b-256	`9aef5ac75ebc7a68598f74d4ab1a36a4749e5fff305d807269a6ae605de6669a`

See more details on using hashes here.

File details

Details for the file stably-0.4.1-py3-none-any.whl.

File metadata

Download URL: stably-0.4.1-py3-none-any.whl
Upload date: May 8, 2026
Size: 44.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for stably-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8668a7dcc7b6423ea972941c42be2bc7450ef921b2b09b7d94bb729369da218f`
MD5	`474491bdd5a97df35c9548a8c27da88f`
BLAKE2b-256	`f519de6d070f4b0a918864f0275dc08509b8cbafe1dfd40c0f1d3795dbaf9aee`

See more details on using hashes here.

stably 0.4.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

stably

Installation

Dependencies

Input data

1. Protein-level abundance matrix (`pg_matrix`)

2. Sample-label spreadsheet

Configuration

Running on real data

Toy example with synthetic data

Outputs

Programmatic API

Tests

Theory — one paragraph

Citation

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

stably 0.4.1

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

stably

Installation

Dependencies

Input data

1. Protein-level abundance matrix (pg_matrix)

2. Sample-label spreadsheet

Configuration

Running on real data

Toy example with synthetic data

Outputs

Programmatic API

Tests

Theory — one paragraph

Citation

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. Protein-level abundance matrix (`pg_matrix`)