Weighted structured nonconvex sparse models (Python + Rust)

skein

Weighted structured nonconvex sparse models. Rust core + Python API.

Documentation: the docs site has the full conceptual reference (penalties, datafits, weights, backends), porting guides for glmnet / ncvreg / grpreg, worked examples, and an auto-generated API reference. Built with Sphinx + Furo and hosted on Read the Docs (config in .readthedocs.yaml); preview locally with sphinx-build -b html docs docs/_build/html. CI builds it with -W (warnings = errors) on every PR.

skein targets a niche that's well-served in R (grpreg, ncvreg) but missing in Python at production quality: nonconvex group-structured penalties (group MCP, group SCAD, sparse-group nonconvex) with first-class support for weights along three axes — per-sample, per-feature, and per-group.
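To make the weight axes concrete, here is a minimal NumPy sketch of a weighted least-squares loss plus a group MCP penalty. It assumes one common formulation (a 1/(2n) loss scaling and group weights that multiply λ, as in grpreg-style group multipliers); it is not taken from skein's source, the function names are hypothetical, and per-feature weights are left out to keep it short.

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP value: lam*t - t^2/(2*gamma) while t <= gamma*lam, constant afterwards."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2.0 * gamma),
                    0.5 * gamma * lam**2)

def weighted_group_mcp_objective(X, y, beta, groups, lam, gamma,
                                 sample_weight=None, group_weight=None):
    """Hypothetical helper: weighted LS loss + group MCP on group norms."""
    n = X.shape[0]
    labels = np.unique(groups)
    s = np.ones(n) if sample_weight is None else np.asarray(sample_weight, float)
    w = np.ones(labels.size) if group_weight is None else np.asarray(group_weight, float)
    resid = y - X @ beta
    loss = 0.5 * np.sum(s * resid**2) / n          # per-sample weights enter the loss
    penalty = 0.0
    for w_g, g in zip(w, labels):
        norm_g = np.linalg.norm(beta[groups == g])
        penalty += float(mcp(norm_g, w_g * lam, gamma))  # per-group weight scales lambda
    return loss + penalty

# Tiny example: two groups of two features each.
X = np.eye(4)
y = np.zeros(4)
beta = np.array([3.0, 4.0, 0.0, 0.0])
groups = np.array([0, 0, 1, 1])
obj = weighted_group_mcp_objective(X, y, beta, groups, lam=1.0, gamma=3.0)
obj_weighted = weighted_group_mcp_objective(
    X, y, beta, groups, lam=1.0, gamma=3.0, group_weight=[2.0, 1.0]
)
```

In the unweighted call the first group's norm (5) is past γλ = 3, so its penalty is flat at γλ²/2; doubling that group's weight moves the threshold to 6 and the penalty is active again.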

Status

v0.7. Full nonconvex / group / GLM stack is in place; M5 model selection + inference + threaded CV folds shipped; M11 graphical lasso family (single + joint) shipped; M12 hardening pass closed the penalty / datafit unit-test gaps and added a CI smoke job for the PyO3 layer. See ROADMAP.md for the full plan and the M13 performance findings opened by the benches/v2 release-profile run.

Done so far:

  • Solvers — production CD core (path solver, strong rule + KKT verification, gap-safe screening, Anderson acceleration, M13.1 saturation bypass); group block-CD with LLA outer loop for nonconvex group penalties (M13.4 Phase 2.3 weight-space short-circuit); Rayon-parallel group sweeps; operator-norm Lipschitz via power iteration.
  • Datafits — least squares, binomial logistic, Poisson (log link, with offsets), Cox PH (Breslow + Efron ties), multinomial softmax, Huber. All glued together by a GlmDatafit trait that exposes a weighted-LS surrogate; the M1/M2 inner solvers absorb every GLM unchanged.
  • Penalties — lasso, MCP, SCAD, elastic net, bridge |β|^q, group lasso, group MCP, group SCAD, group elastic net, sparse-group lasso, sparse-group MCP, sparse-group SCAD. Per-feature and per-group weights honored throughout.
  • Design-matrix backends — DenseMatrix, SparseCSC, lazy Standardized<D>, MmapMatrix (f64 + f32), row-block Chunked<C>, Augmented<D>, MultiTaskDesign<D> — all behind one trait, freely composable.
  • Python — sklearn-compatible estimators for every (datafit × penalty) combination (~150 classes); type stubs; warm-started λ-paths; standardization with original-scale coef_ / intercept_ recovery on dense and sparse.
  • Model selection + inference (M5) — K-fold CV across every *PathCV class (threaded folds via PyO3 GIL release, ~2.3–2.5× speedup); AIC/BIC/EBIC tuning; stability selection (MB bootstrap); debiased / desparsified lasso for LS + binomial + Poisson with Wald CIs and p-values.
  • Graphical models (M11) — sparse precision matrix estimation (GraphicalLasso / GraphicalMCP / GraphicalSCAD) and joint estimation across K related populations (JointGraphicalLasso / JointGraphicalMCP, Danaher–Wang–Witten 2014 group form via ADMM), with EBIC tuning and bootnet-style bootstrap edge stability. Nonconvex penalties on edges close the shrinkage-bias gap that sklearn.covariance.GraphicalLasso and R's glasso / qgraph / bootnet leave open.
  • Distribution + docs (M8) + hardening (M12) — CI + cibuildwheel + Read the Docs + Sphinx site (concepts + R-porting + extending + examples + API ref) + R numerical regression suite vs glmnet / ncvreg / grpreg + stable Rust API contract. M12 added penalty + datafit unit-test coverage, an integration test directory, a CI smoke job for the PyO3 layer, and an R-fixture gate.
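As an illustration of the "operator-norm Lipschitz via power iteration" item in the solver list, the sketch below estimates the Lipschitz constant of the least-squares gradient, ‖X‖₂² / n, by power iteration on XᵀX. The 1/n scaling, the stopping rule, and the function name are assumptions for illustration; skein's Rust implementation may differ.

```python
import numpy as np

def lipschitz_power_iteration(X, n_iter=200, tol=1e-10, seed=0):
    """Estimate ||X||_2^2 / n by power iteration on X^T X (hypothetical helper)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iter):
        u = X.T @ (X @ v)              # apply X^T X without forming it
        new_eig = np.linalg.norm(u)    # converges to the largest eigenvalue of X^T X
        if new_eig == 0.0:
            return 0.0
        v = u / new_eig
        converged = abs(new_eig - eig) <= tol * new_eig
        eig = new_eig
        if converged:
            break
    return eig / n

X = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
L = lipschitz_power_iteration(X)
```

Only matrix-vector products with X and Xᵀ are needed, which is why this kind of estimate works with sparse or memory-mapped designs as well as dense ones.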

Coming next: the M13 performance workstream — group_mcp LLA overhead at medium scale (Phase 2.3 shipped, native group-MCP BCD scoped as M13.4b), per-λ fixed-cost cut for convex Lasso (M13.2), and the publication benchmark suite at benches/v2/ that backs the software paper.
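For readers unfamiliar with the LLA (local linear approximation) outer loop mentioned above, the standard construction replaces the nonconvex penalty by its tangent at the current iterate, which turns each outer step into a weighted group lasso. The sketch below computes those weights for group MCP under that textbook formulation; the function names are hypothetical and this is not skein's internal code.

```python
import numpy as np

def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP penalty at t >= 0: max(0, lam - t / gamma)."""
    return np.maximum(0.0, lam - np.abs(t) / gamma)

def lla_group_weights(beta, groups, lam, gamma):
    """Per-group lasso weights for the next LLA step, from the current beta."""
    labels = np.unique(groups)
    norms = np.array([np.linalg.norm(beta[groups == g]) for g in labels])
    return labels, mcp_derivative(norms, lam, gamma)

beta = np.array([3.0, 4.0, 0.6, 0.8, 0.0, 0.0])
groups = np.array([0, 0, 1, 1, 2, 2])
labels, weights = lla_group_weights(beta, groups, lam=1.0, gamma=3.0)
```

A group whose norm exceeds γλ gets weight zero and is left unpenalized in the next inner solve, a group at zero gets the full λ, and groups in between are shrunk less than under a plain group lasso.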

Layout

crates/skein-core/   pure Rust: traits + algorithms (no Python)
crates/skein-py/     PyO3 bindings (cdylib → skein_glm._core)
python/skein_glm/    sklearn-compatible estimators + ABCs for extensions
tests/               pytest suite (Rust extension required)
benches/             v1 cross-package harness (skein vs sklearn / skglm / celer / glmnet / ncvreg / grpreg)
benches/v2/          publication-quality Snakemake suite backing the paper
crates/skein-core/benches/   internal Rust criterion microbenches
paper/               figure + table bundle regenerated by benches/v2
docs/                Sphinx site (Read the Docs)

The Rust traits (DesignMatrix, Datafit, GlmDatafit, Penalty, GroupPenalty) and their Python ABC mirrors (skein_glm.penalties.Penalty, etc.) are the extension surface for downstream per-paper projects.

Quick start

import numpy as np
from skein_glm import MCPPathRegressor, LogisticGroupMCPPathRegressor, CoxMCPRegressor

# Nonconvex sparse least squares with a λ-path.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.5, -2.0, 0.8]) + 0.1 * rng.standard_normal(n)
model = MCPPathRegressor(gamma=3.0, n_lambdas=50, standardize=True).fit(X, y)
print(model.coefs_[-1, :5], model.intercepts_[-1])

# Logistic + group MCP via LLA, with sklearn-style predict/predict_proba.
groups = np.repeat(np.arange(p // 5), 5)  # 5 features per group
y_bin = (X[:, :3].sum(axis=1) > 0).astype(float)
clf = LogisticGroupMCPPathRegressor(groups=groups, gamma=3.0, n_lambdas=20).fit(X, y_bin)
proba = clf.predict_proba(X)  # shape (n, n_lambdas)

# Cox PH with right-censored survival data.
time = rng.exponential(1.0 / np.exp(X[:, :3].sum(axis=1)))
event = rng.uniform(size=n) < 0.7
cox = CoxMCPRegressor(lambda_=0.01, gamma=3.0).fit(X, time, event.astype(float))
risk = cox.predict(X)  # prognostic index η

Every regressor follows the same (datafit) × (penalty) × ({,Path}Regressor) naming scheme. The path variants warm-start across λ; their coefs_ / intercepts_ (where applicable) are 2D arrays indexed by λ.
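The naming scheme can be spelled out with a small helper. This is an illustrative sketch only: `estimator_name` is a hypothetical function, and the rule that least squares uses an empty datafit prefix is inferred from the three class names in the quick start, so other combinations may not follow it exactly.

```python
def estimator_name(penalty, datafit_prefix="", path=False):
    """Compose <datafit prefix><penalty>[Path]Regressor (hypothetical helper)."""
    return f"{datafit_prefix}{penalty}{'Path' if path else ''}Regressor"

names = [
    estimator_name("MCP", path=True),                                  # least squares, path
    estimator_name("GroupMCP", datafit_prefix="Logistic", path=True),  # logistic, path
    estimator_name("MCP", datafit_prefix="Cox"),                       # Cox PH, single lambda
]
print(names)
```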

Performance

skein is benchmarked against sklearn / skglm / celer / glmnet / ncvreg on shared λ-grids via the harness under benches/. Headline numbers (Apple M1, 16 GB; median of N timed trials after a warm-up):

Each scenario is run in two regimes that name what the solution does at the tail of the λ-path, not the path geometry:

  • dense — λ_min/λ_max = 1e-3; the active set saturates near the smallest λ (typical "I want the full path including the over-fit tail" usage).
  • sparse — λ_min/λ_max = 5e-2; the path stops near support recovery, support stays small throughout.
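The two regimes differ only in the ratio between the smallest and largest λ. A minimal sketch of such a grid follows, assuming the glmnet-style convention of a log-spaced grid starting at λ_max = max|Xᵀ(y − ȳ)| / n; whether the bench harness builds its grids exactly this way is an assumption, and `lambda_grid` is a hypothetical name.

```python
import numpy as np

def lambda_grid(X, y, ratio, n_lambdas=50):
    """Log-spaced lambdas from lambda_max down to ratio * lambda_max."""
    n = X.shape[0]
    # Smallest lambda at which the lasso/MCP solution (with intercept) is all zeros.
    lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n
    return np.geomspace(lam_max, ratio * lam_max, n_lambdas)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :3] @ np.array([1.5, -2.0, 0.8]) + 0.1 * rng.standard_normal(200)

dense_grid = lambda_grid(X, y, ratio=1e-3)    # "dense" regime
sparse_grid = lambda_grid(X, y, ratio=5e-2)   # "sparse" regime
```

Both grids start at the same λ_max; the dense one simply continues much further toward zero, which is where the active set grows and per-λ cost rises.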

| scenario | size | skein | next-fastest comparator |
|---|---|---|---|
| Lasso LS — dense | medium (n=10k, p=1k) | 1.17 s | sklearn 0.125 s |
| Lasso LS — sparse | medium | 0.78 s | sklearn 0.099 s |
| MCP LS — dense | medium | 1.37 s | skglm 3.35 s |
| MCP LS — sparse | medium | 0.75 s | ncvreg 1.17 s |
| MCP LS — dense | large (n=100k, p=10k) | 510 s | skglm 666 s |
| MCP LS — sparse | large | 497 s | skglm 702 s |
| SCAD LS — dense | medium | 1.78 s | ncvreg 7.99 s |
| SCAD LS — sparse | medium | 0.90 s | ncvreg 1.86 s |

skein is the fastest on every nonconvex row across every size; on convex lasso/LS the sklearn Cython lasso_path remains the floor at ~8–9× faster on the medium bench. See docs/benchmarks/mcp_ls.md and docs/benchmarks/scad_ls.md for the full nonconvex write-ups (correctness matrices + methodology + per-size tables) and docs/perf/lasso_ls_profile.md for the lasso/LS profiling work that drove M10.
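The ratios behind these statements can be recomputed directly from the table. The snippet below only does arithmetic on the timings quoted above; a ratio above 1 means skein is faster than the comparator, below 1 means it is slower.

```python
# (skein seconds, comparator name, comparator seconds), copied from the table.
timings = {
    "Lasso LS, dense, medium": (1.17, "sklearn", 0.125),
    "Lasso LS, sparse, medium": (0.78, "sklearn", 0.099),
    "MCP LS, dense, medium": (1.37, "skglm", 3.35),
    "MCP LS, sparse, medium": (0.75, "ncvreg", 1.17),
    "MCP LS, dense, large": (510.0, "skglm", 666.0),
    "MCP LS, sparse, large": (497.0, "skglm", 702.0),
    "SCAD LS, dense, medium": (1.78, "ncvreg", 7.99),
    "SCAD LS, sparse, medium": (0.90, "ncvreg", 1.86),
}

ratios = {name: other_s / skein_s for name, (skein_s, _, other_s) in timings.items()}

for name, ratio in ratios.items():
    comparator = timings[name][1]
    print(f"{name}: comparator/skein = {ratio:.2f} (vs {comparator})")
```

On the two lasso rows the inverse ratio (skein time over sklearn time) comes out at roughly 9.4 and 7.9, which is where the "~8–9×" figure comes from.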

Reproduce with python benches/run.py --scenarios mcp_ls mcp_ls_sparse --sizes small,medium. The publication-quality benchmark suite under benches/v2/ drives the paper figures and tables; see docs/benchmarks/index.md for the layered overview.

Build

# Rust core only (fast iteration on algorithms)
cargo test -p skein-core --lib

# Full Python package (requires maturin in your env)
maturin develop --release
pytest

See docs/installation.md for from-source and development installs, and CLAUDE.md for the contributor quickstart (pre-PR checks, solver-change pre-flight protocol, etc.).

License

MIT.
