COMBSS: continuous optimisation for best subset selection in generalised linear models (linear, logistic, multinomial)

These details have not been verified by PyPI

Project links

Project description

COMBSS Logo

COMBSS: Best Subset Selection for Generalised Linear Models

Python implementation of COMBSS (Continuous Optimisation for Best Subset Selection) for generalised linear models. COMBSS reformulates the NP-hard discrete subset selection problem as a continuous optimisation over the hypercube [0,1]^p, making it scalable to high-dimensional settings with p >> n.

Supported model types:

Linear regression (continuous response)
Binary logistic regression (two-class classification)
Multinomial logistic regression (multi-class, C > 2)

References

Moka, Liquet, Zhu & Muller (2024). COMBSS: best subset selection via continuous optimization. Statistics and Computing.

Mathur, Liquet, Muller & Moka (2026). Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications. arXiv preprint.

Installation

pip install combss

Quick Start

Linear regression

import combss
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=50, n_informative=5,
                       noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4)

# Frank-Wolfe method (default) — with validation data
model = combss.linear.model()
model.fit(X_train, y_train, X_val=X_val, y_val=y_val, q=10)

print("Best subset:", model.subset)
print("Best MSE:", model.mse)

# All subsets for k = 1, ..., q
for k, feat in enumerate(model.subset_list, 1):
    print(f"k={k:2d}  features={feat.tolist()}")

Without validation data, subset_list is still available but subset, mse, and coef_ will be None:

model = combss.linear.model()
model.fit(X_train, y_train, q=10)

for k, feat in enumerate(model.subset_list, 1):
    print(f"k={k:2d}  features={feat.tolist()}")

To use the original Adam + dynamic-lambda method from v1.x:

model = combss.linear.model()
model.fit(X_train, y_train, X_val=X_val, y_val=y_val, q=10, method='original')

print("Best subset:", model.subset)
print("Validation MSE:", model.mse)
print("Optimal lambda:", model.lambda_)

Binary logistic regression

import combss

# X_train, X_val : (n, p) feature matrices
# y_train, y_val : binary labels {0, 1}

model = combss.logistic.model()
model.fit(X_train, y_train, X_val=X_val, y_val=y_val, q=15)

print("Best subset:", model.subset)
print("Best accuracy:", model.accuracy)

for k, feat in enumerate(model.subset_list, 1):
    print(f"k={k:2d}  features={feat.tolist()}")

Multinomial logistic regression

import combss
import numpy as np
import pandas as pd

# Example with the Khan SRBCT dataset (4 tumour classes, 2308 genes)
train = pd.read_csv('data/Khan_train.csv')
test  = pd.read_csv('data/Khan_test.csv')

y_train = train.iloc[:, 0].values.astype(int)
X_train = train.iloc[:, 1:].values.astype(float)
y_test  = test.iloc[:, 0].values.astype(int)
X_test  = test.iloc[:, 1:].values.astype(float)

C = len(np.unique(y_train))

model = combss.multinomial.model()
model.fit(X_train, y_train, X_val=X_test, y_val=y_test, q=20, C=C)

print("Best subset:", model.subset)
print("Best accuracy:", model.accuracy)

for k, feat in enumerate(model.subset_list, 1):
    print(f"k={k:2d}  features={feat.tolist()}")

Lambda selection via LOOCV

Select the ridge penalty using leave-one-out cross-validation:

import combss

# Classification
best_lam, best_lam_per_k, cv_df = combss.cv.select_lambda(
    X_train, y_train,
    q=15, C=4,
    model_type='multinomial',
)

# Linear regression
best_lam, best_lam_per_k, cv_df = combss.cv.select_lambda(
    X_train, y_train,
    q=15,
    model_type='linear',
)

API Reference

`combss.linear.model`

Best subset selection for linear regression.

model = combss.linear.model()
model.fit(X_train, y_train, q=10)                          # FW (default), no validation
model.fit(X_train, y_train, X_val, y_val, q=10)            # FW with validation
model.fit(X_train, y_train, X_val, y_val, q=10,            # Original method
          method='original')

Frank-Wolfe method parameters (method='fw', default)

Sparsity is controlled by k (model size): COMBSS returns selected features for each k = 1, ..., q. The lam_ridge parameter is an optional ridge regularisation on the coefficients in the inner solver, unrelated to the sparsity penalty lambda used in the original method.

Parameter	Default	Description
`q`	min(n,p)	Maximum subset size
`Niter`	25	Number of homotopy iterations
`lam_ridge`	0	Ridge penalty on coefficients in the inner solver
`alpha`	0.01	Frank-Wolfe step size
`scale`	True	Column-normalise X
`mandatory_features`	None	1-indexed features always included
`inner_tol`	1e-4	Inner solver tolerance
`patience`	20	Early stopping: stop after this many k with no improvement. Requires validation data. Set to None to disable
`min_k`	20	Minimum k values before early stopping can trigger. Automatically disabled if min_k + patience > q
`verbose`	True	Print progress

Original method parameters (method='original')

Sparsity is controlled by lambda: a grid of lambda values is searched, and each lambda yields a different subset. The best subset is selected by validation MSE.

Parameter	Default	Description
`q`	min(n,p)	Maximum subset size
`nlam`	50	Number of lambda values in dynamic grid
`scaling`	True	Enable feature scaling
`tau`	0.5	Threshold parameter
`delta_frac`	1	n/delta in objective function
`eta`	0.001	Truncation parameter
`gd_patience`	10	Patience for Adam termination
`gd_maxiter`	1000	Maximum Adam iterations
`gd_tol`	1e-5	Adam convergence tolerance
`cg_maxiter`	n	Conjugate gradient max iterations
`cg_tol`	1e-5	Conjugate gradient tolerance

Output attributes — Frank-Wolfe method

Attribute	Description
`subset`	Best subset (0-indexed). Requires validation data, otherwise None
`coef_`	Regression coefficients (length p, zeros for unselected). Requires validation data, otherwise None
`mse`	Validation MSE for best subset. Requires validation data, otherwise None
`lam_ridge`	Ridge penalty value used
`subset_list`	List of subsets for k = 1, ..., q (0-indexed). May be shorter if early stopping triggered

Output attributes — Original method

Attribute	Description
`subset`	Best subset (0-indexed)
`coef_`	Regression coefficients (length p, zeros for unselected)
`mse`	Validation MSE for best subset
`lambda_`	Optimal lambda value
`subset_list`	List of subsets across the lambda grid (0-indexed)
`lambda_list`	Lambda grid values

`combss.logistic.model`

Best subset selection for binary logistic regression.

model = combss.logistic.model()
model.fit(X_train, y_train, q=15)                          # Without validation
model.fit(X_train, y_train, X_val, y_val, q=15)            # With validation

Labels y must be binary {0, 1}. Parameters are the same as the Frank-Wolfe method in combss.linear.model.

Output attributes

Attribute	Description
`subset`	Best subset (0-indexed). Requires validation data, otherwise None
`coef_`	Logistic coefficients (length p, zeros for unselected). Requires validation data, otherwise None
`accuracy`	Validation accuracy for best subset. Requires validation data, otherwise None
`lam_ridge`	Ridge penalty value used
`subset_list`	List of subsets for k = 1, ..., q (0-indexed)

`combss.multinomial.model`

Best subset selection for multinomial logistic regression.

model = combss.multinomial.model()
model.fit(X_train, y_train, q=20, C=4)                     # Without validation
model.fit(X_train, y_train, X_val, y_val, q=20, C=4)       # With validation

Labels y must be in {1, ..., C}. Parameters are the same as the Frank-Wolfe method in combss.linear.model, plus:

Additional Parameter	Default	Description
`C`	inferred	Number of classes

Output attributes

Attribute	Description
`subset`	Best subset (0-indexed). Requires validation data, otherwise None
`coef_`	Multinomial coefficients (shape (C, p), zeros for unselected). Requires validation data, otherwise None
`accuracy`	Validation accuracy for best subset. Requires validation data, otherwise None
`lam_ridge`	Ridge penalty value used
`subset_list`	List of subsets for k = 1, ..., q (0-indexed)

`combss.cv.select_lambda`

LOOCV-based ridge penalty selection.

best_lam, best_lam_per_k, results_df = combss.cv.select_lambda(
    X, y, q, C=None, lambda_grid=None, model_type='multinomial', ...
)

Parameter	Default	Description
`X`	--	Feature matrix (n, p), no intercept
`y`	--	Response / labels
`q`	--	Maximum subset size
`C`	None	Number of classes (required for classification)
`lambda_grid`	[0] + logspace(-3,1,10)	Candidate ridge penalty values
`Niter`	50	Homotopy iterations
`model_type`	'multinomial'	'logit', 'multinomial', or 'linear'
`lambda_refit`	0	Ridge penalty for LOOCV refit

Returns: (best_lambda, best_lambda_per_k, results_df)

Classification: results_df has columns lambda, k, loocv_acc, selected
Linear: results_df has columns lambda, k, loocv_mse, selected

`combss.metrics`

Performance metrics for variable selection evaluation.

result = combss.metrics.performance_metrics(X, beta_true, beta_pred)
# result keys: pe, mcc, acc, sens, spec, f1, prec

Algorithm Overview

The Frank-Wolfe homotopy algorithm operates as follows:

Initialise t = (k/p, ..., k/p) -- centroid of the k-sparse simplex T_k
For i = 1, ..., N (homotopy loop):
- Set delta_i on a geometric schedule from delta_min to delta_max
- Solve the ridge-penalised GLM inner problem at current t
- Compute the Danskin gradient (no Hessian required)
- Find the Frank-Wolfe vertex (k smallest gradient components)
- Update: t <- (1 - alpha) t + alpha s
Select the k features with the largest final t values

The penalty schedule is auto-calibrated from the spectral norm of X, so no tuning is needed in practice.

Intercept Handling

The intercept is handled internally and is not subject to selection. For the Frank-Wolfe method, an intercept column is prepended automatically. For the original method, the intercept is treated the same as other features.

Dependencies

Python 3.8+
NumPy (>= 1.21.0)
SciPy (>= 1.7.0)
scikit-learn (>= 1.0.0)
pandas (>= 1.3.0)

Developers

Sarat Moka
Anant Mathur
Hua Yang Hu

Contributing

Contributions are welcome! Please open an issue or submit a pull request at github.com/saratmoka/combss.

License

Apache 2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

2.0.0

Apr 24, 2026

1.1.4

Jul 7, 2025

1.1.3

Jul 5, 2025

1.1.2

Jul 5, 2025

1.1.1

Jul 5, 2025

1.1.0

Jul 5, 2025

1.0.3

May 22, 2025

1.0.2

Nov 18, 2024

1.0.1

Nov 18, 2024

1.0.0

Nov 15, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

combss-2.0.0.tar.gz (42.0 kB view details)

Uploaded Apr 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

combss-2.0.0-py2.py3-none-any.whl (35.4 kB view details)

Uploaded Apr 24, 2026 Python 2Python 3

File details

Details for the file combss-2.0.0.tar.gz.

File metadata

Download URL: combss-2.0.0.tar.gz
Upload date: Apr 24, 2026
Size: 42.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for combss-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`9f492fb1c3f1b81df1792a3de8ce72bb8e2081474c0ef2025a99fb4c87613432`
MD5	`d9e0ea2ca7f84984c221576613810874`
BLAKE2b-256	`6dd501e9388f73f553004fe8a7cb77d21c76138970890bd7b9c4bfeb8fc842cc`

See more details on using hashes here.

File details

Details for the file combss-2.0.0-py2.py3-none-any.whl.

File metadata

Download URL: combss-2.0.0-py2.py3-none-any.whl
Upload date: Apr 24, 2026
Size: 35.4 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for combss-2.0.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`103c22b632cb5294fd969d86dc2bcd79385e978a364b81fe546131c4728b0aa3`
MD5	`89c3c75f40db2d2eb63b8506277b05b5`
BLAKE2b-256	`42757e6763fa8323e95fe6237ee92c97af074a8b7c1a936d01477e523003f7b0`

See more details on using hashes here.

combss 2.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

COMBSS: Best Subset Selection for Generalised Linear Models

References

Installation

Quick Start

Linear regression

Binary logistic regression

Multinomial logistic regression

Lambda selection via LOOCV

API Reference

combss.linear.model

Frank-Wolfe method parameters (method='fw', default)

Original method parameters (method='original')

Output attributes — Frank-Wolfe method

Output attributes — Original method

combss.logistic.model

Output attributes

combss.multinomial.model

Output attributes

combss.cv.select_lambda

combss.metrics

Algorithm Overview

Intercept Handling

Dependencies

Developers

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`combss.linear.model`

`combss.logistic.model`

`combss.multinomial.model`

`combss.cv.select_lambda`

`combss.metrics`