GEMSS: Gaussian Ensemble for Multiple Sparse Solutions

This repository implements Bayesian sparse feature selection using variational inference with Gaussian mixture models (paper). The main objective is to recover all sparse feature subsets (supports) that explain the response in high-dimensional regression or classification tasks.
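As a rough orientation, a Gaussian-mixture variational family over the coefficient vector can be written in the generic form below. The symbols (K components, weights α_k, means μ_k, diagonal variances σ_k²) are assumptions introduced for illustration and need not match the exact parameterization used in the paper.

```latex
q(\beta) = \sum_{k=1}^{K} \alpha_k \,
  \mathcal{N}\!\left(\beta \mid \mu_k, \operatorname{diag}(\sigma_k^2)\right),
\qquad \alpha_k \ge 0, \quad \sum_{k=1}^{K} \alpha_k = 1

\mathcal{L}(q) = \mathbb{E}_{q}\!\left[\log p(y \mid X, \beta)\right]
  + \mathbb{E}_{q}\!\left[\log p(\beta)\right]
  - \mathbb{E}_{q}\!\left[\log q(\beta)\right]
```

Variational inference maximizes the evidence lower bound L(q) over the mixture parameters. Intuitively, each mixture component is free to concentrate on a different sparse support, which is how a single fit can represent several candidate solutions.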

To make this tool accessible to non-coders (typically domain experts), we provide a user-friendly application that covers the entire exploratory GEMSS workflow.

Motivation

In many real-world problems, e.g. in life sciences, datasets with far more features than samples are common because collecting new data points is costly or impractical. In these situations, there are often several distinct, sparse combinations of features that can explain the observed outcomes, each corresponding to a different underlying mechanism or hypothesis. Moreover, in many cases, the quality of a combination of predictors can be assessed only ex-post by utilizing advanced domain knowledge.

Traditional feature selection methods typically identify only a single solution to a classification or regression problem, overlooking the ambiguity and the potential for multiple valid interpretations. This project addresses that gap by providing a Bayesian framework that systematically recovers all plausible sparse solutions, enabling a more complete understanding of the data and supporting the exploration and comparison of alternative explanatory hypotheses.

When to use GEMSS

Instead of finding just one "best" set of features, GEMSS discovers several of the most likely feature combinations that predict your target variable comparably well. This is valuable when:

You have precious few samples and many more features.
Multiple underlying mechanisms might explain your data.
You are striving for an interpretable model.
You want to engineer a multitude of nonlinear and combined features from your original set for exploratory purposes.
Your features are correlated.
There is domain knowledge to be mined (a human in the loop).
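The situation described in the list above can be reproduced on toy data. The following is a minimal sketch using only NumPy, not the GEMSS API: with fewer samples than features and one strongly correlated feature pair, two different sparse supports fit the response almost equally well. All sizes, coefficients, and the helper name support_r2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # fewer samples than features

X = rng.normal(size=(n_samples, n_features))
# Make feature 1 a near-copy of feature 0 (a strongly correlated pair).
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n_samples)
# The response is generated from features 0 and 2 only.
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.05 * rng.normal(size=n_samples)


def support_r2(support):
    """R^2 of a least-squares fit restricted to the given feature columns."""
    coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    residual = y - X[:, support] @ coef
    return 1.0 - np.sum(residual**2) / np.sum((y - y.mean()) ** 2)


r2_a = support_r2([0, 2])  # the generating support
r2_b = support_r2([1, 2])  # alternative support using the correlated copy
print(f"R^2 with features (0, 2): {r2_a:.4f}")
print(f"R^2 with features (1, 2): {r2_b:.4f}")
```

A method that returns a single solution would report only one of these supports. Deciding which one is more meaningful is where domain knowledge comes in.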

When NOT to use GEMSS

When the number of features you are looking for exceeds approximately 10-20.
Inside automated modeling pipelines.

Features

GEMSS provides a comprehensive framework for Bayesian feature selection with the following capabilities:

Multiple sparse solutions: Recovers diverse sparse feature sets rather than a single solution
Missing data: Native handling without imputation
Flexible priors: Structured spike-and-slab (default), Student-t, vanilla spike-and-slab
Variational inference: PyTorch-based optimization
Diversity regularization: Optional Jaccard penalty for enforcing solution diversity
Solution evaluation by predictive models: Predictive modeling metrics for various algorithms from Scikit-learn, XGBoost and (optionally) TabPFN. Supports custom stratification.
Visualization: Interactive plots and comprehensive diagnostics
Modular configuration: JSON-based dataset/algorithm/postprocessing settings
Batch experiments: Parameter sweeps and tiered validation suites
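To make the diversity idea concrete, the sketch below computes the Jaccard similarity between feature supports, which is the quantity a Jaccard penalty discourages from being large. It is a plain set-based illustration. How GEMSS evaluates the penalty during optimization is not described here, and the function names and feature labels are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(support_a, support_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature supports."""
    a, b = set(support_a), set(support_b)
    if not a and not b:
        return 1.0  # convention: two empty supports are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(supports):
    """Average Jaccard similarity over all pairs of supports."""
    pairs = list(combinations(supports, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical solutions (sets of selected feature names).
solutions = [
    {"f1", "f2", "f3"},
    {"f2", "f3", "f7"},
    {"f8", "f9"},
]
print(jaccard_similarity(solutions[0], solutions[1]))
print(mean_pairwise_jaccard(solutions))
```

A low mean pairwise similarity indicates that the recovered solutions rely on largely different features.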

Citation

If you use GEMSS in your research, please cite the preprint paper:

@misc{henclova2026gemssvariationalbayesianmethod,
      title={GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems}, 
      author={Kateřina Henclová and Václav Šmídl},
      year={2026},
      eprint={2602.08913},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.08913}, 
}

Repository structure

The repository is organized into core packages, interactive notebooks, batch experiment scripts, and configuration files:

gemss/
  technical_report.pdf       # Preprint paper on GEMSS
  app/                       # Interactive marimo app
    gemss_explorer.py          # GEMSS explorer app
    results/                   # App outputs
  data/                      # User datasets and preprocessed real-world examples
    preprocessed_datasets/     # Ready-to-use benchmark datasets
  notebooks/                 # Interactive demos and analysis
    demo.ipynb               # End-to-end synthetic demo
    explore_custom_dataset.ipynb      # Custom data workflow
    tabpfn_evaluation_example.ipynb   # TabPFN evaluation demo
    tabpfn_evaluate_custom_dataset_results.ipynb # Evaluate saved solutions with TabPFN
    analyze_experiment_results/       # Experiment analysis (development)
    results/                          # Artifacts from the notebook runs
  scripts/                   # Batch experiments
    run_experiment.py        # Single experiment
    run_sweep.ps1            # Parameter sweeps
    run_tiers.ps1            # Tiered benchmark suite
    experiment_parameters.json    # 128-experiment design
    results/                 # Outputs and logs from the scripted experiments
  gemss/                     # Core package
    config/                  # JSON configuration files
    data_handling/           # Data generation and preprocessing
    feature_selection/       # Variational inference core
    postprocessing/          # Solution extraction and evaluation
    diagnostics/             # Performance diagnostics (WIP)
    experiment_assessment/   # Result analysis utilities
    utils/                   # Persistence and visualization

Package installation

This project uses uv for dependency management. uv replaces tools like pip.

1. Install uv

If you do not have uv installed, run one of the following commands:

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

2. Set up the environment

Navigate to the repository root and sync the environment. This command will create a virtual environment and install all dependencies (including the gemss package itself) defined in pyproject.toml.

uv sync

⚠️ Troubleshooting (important for Windows users): uv is incompatible with the Windows Store Python distribution due to file system restrictions.

If uv sync fails with an error like Failed to build ... Failed to create temporary virtualenv ... The file cannot be accessed by the system. (os error 1920), uv is most likely trying to use the Windows Store Python installation.

To resolve this issue:

Check which Python installations you have:

Get-Command python -All | Select-Object Source

Install Python 3.13 using uv (if not already installed):
```
uv python install 3.13
```
Find the uv-managed Python path:
```
uv python list
```
Look for a path like C:\Users\<YourUser>\AppData\Roaming\uv\python\cpython-3.13.11-windows-x86_64-none\python.exe

Run uv sync with the correct Python:

uv sync --python C:\Users\<YourUser>\AppData\Roaming\uv\python\cpython-<YourVersion>-windows-x86_64-none\python.exe

Alternatively, uninstall the Windows Store Python and install Python 3.13 from python.org, then run uv sync without the --python flag.
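When the list of interpreters is long, a small helper can flag the problematic ones. The sketch below assumes that Windows Store installations resolve to a path containing a WindowsApps directory, which is where the Store places its Python executables and aliases. The function name and the example paths are hypothetical.

```python
import sys
from pathlib import PureWindowsPath


def looks_like_store_python(executable: str) -> bool:
    """Heuristic: Store-installed Python lives under a 'WindowsApps' folder."""
    parts = [part.lower() for part in PureWindowsPath(executable).parts]
    return "windowsapps" in parts


# Hypothetical example paths, similar to what Get-Command may report.
store_path = r"C:\Users\me\AppData\Local\Microsoft\WindowsApps\python.exe"
uv_path = r"C:\Users\me\AppData\Roaming\uv\python\some-version\python.exe"

print(looks_like_store_python(store_path))
print(looks_like_store_python(uv_path))
# Check the interpreter that is running this script.
print(looks_like_store_python(sys.executable))
```

Any interpreter flagged by this check is one to avoid passing to uv sync --python.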

3. Register the Jupyter kernel (optional)

Note: This step is only required if you plan to use notebooks. The marimo app doesn't need kernel registration.

To run Jupyter notebooks with the correct Python environment, register the kernel:

uv run python -m ipykernel install --user --name=gemss --display-name="Python (gemss)"

This makes the environment available to Jupyter as a kernel. When opening a notebook, select "Python (gemss)" from the kernel picker.

To verify the kernel is registered, run:

uv run jupyter kernelspec list
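The same check can be scripted. The sketch below parses the JSON emitted by jupyter kernelspec list --json and looks for the gemss kernel name. The sample string is a shortened, made-up stand-in for real output, and the function name is hypothetical.

```python
import json


def kernel_is_registered(kernelspec_json: str, name: str = "gemss") -> bool:
    """Return True if `name` appears among the listed kernel specs."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    return name in specs


# Shortened, illustrative stand-in for the output of:
#   uv run jupyter kernelspec list --json
sample_output = """
{
  "kernelspecs": {
    "python3": {"resource_dir": "/path/to/python3", "spec": {"display_name": "Python 3"}},
    "gemss": {"resource_dir": "/path/to/gemss", "spec": {"display_name": "Python (gemss)"}}
  }
}
"""

print(kernel_is_registered(sample_output))
```

In practice, the JSON string would come from capturing the command's standard output, for example with subprocess.run(..., capture_output=True, text=True).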

Quick start

GEMSS can be applied to both custom datasets and synthetic data for validation and benchmarking.

GEMSS Explorer: an interactive application (recommended)

The easiest way to use GEMSS is through the interactive marimo app:

uv run marimo run app/gemss_explorer.py

The app provides a complete guided workflow from data upload through solution recovery to downstream modeling. Several curated datasets are located in the data directory for a quick trial.

For detailed documentation, data requirements, and workflow overview, see app/README.md.

Jupyter notebooks

For more control and customization, use the Jupyter notebooks:

notebooks/demo.ipynb — a walkthrough with synthetic data
notebooks/explore_custom_dataset.ipynb — custom data workflow
notebooks/README.md — detailed documentation

Launch notebooks with:

uv run jupyter notebook notebooks/demo.ipynb

Ready-to-use datasets

The data/preprocessed_datasets/ folder contains real-world benchmark datasets ready for immediate use:

MTBLS1 (diabetes): 132 samples, 222 metabolite features
MTBLS2 (arabidopsis): 16 samples, 41 metabolite features (extremely low sample size)
MTBLS12968 (PCOS/preterm): 149 samples, 488 features (multi-task)
Colonoscopy (lesion classification): 152 samples, 698 image-derived features

These datasets work with both the GEMSS Explorer app and Jupyter notebooks. For details, see data/README.md.
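To see how strongly each dataset falls into the "more features than samples" regime, the sample and feature counts listed above can be turned into a features-per-sample ratio. This is simple arithmetic on the published dimensions and does not load the data files.

```python
# (samples, features) as listed for each preprocessed dataset
datasets = {
    "MTBLS1": (132, 222),
    "MTBLS2": (16, 41),
    "MTBLS12968": (149, 488),
    "Colonoscopy": (152, 698),
}

# Features per sample: values above 1 mean more features than samples.
ratios = {name: p / n for name, (n, p) in datasets.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    n, p = datasets[name]
    print(f"{name}: n={n}, p={p}, p/n={ratio:.2f}")
```

All four datasets have more features than samples, which is the setting GEMSS targets.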

Validation experiments

A comprehensive experimental framework validates GEMSS across diverse data scenarios, from clean baseline conditions to challenging high-dimensional and noisy settings. One can review and replicate these experiments.

There are 128 experiments organized in 7 tiers:

Tier 1: Baseline (18): clean data, n < p
Tier 2: High-dimensional (9): p ≥ 1000, n << p
Tier 3: Sample-rich (14): n ≥ p
Tier 4: Adversities (22): high noise and missing data
Tier 5: Jaccard penalty (28): diversity effects
Tier 6: Regression (29): continuous response
Tier 7: Class imbalance (8): unbalanced labels

Specific research questions are then answered by 47 cross-tier test cases.
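As a quick sanity check, the per-tier counts listed above add up to the stated total of 128 experiments:

```python
# Number of experiments per tier, as listed in the tier overview.
tier_sizes = {
    "Tier 1: Baseline": 18,
    "Tier 2: High-dimensional": 9,
    "Tier 3: Sample-rich": 14,
    "Tier 4: Adversities": 22,
    "Tier 5: Jaccard penalty": 28,
    "Tier 6: Regression": 29,
    "Tier 7: Class imbalance": 8,
}

total = sum(tier_sizes.values())
print(f"{len(tier_sizes)} tiers, {total} experiments in total")
```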

Running experiments

To make sure the correct Python dependencies are used, run scripts with uv run python rather than the bare python command.

# Single experiment
uv run python scripts/run_experiment.py

Batch experiments (PowerShell): for the PowerShell scripts, it is often easier to activate the environment first:

# Activate environment (Windows)
.venv\Scripts\activate.ps1

Then run:

# Parameter sweeps (custom parameter setting)
.\scripts\run_sweep.ps1

# The benchmark (128 experiments)
.\scripts\run_tiers.ps1                          # Full suite
.\scripts\run_tiers.ps1 -tiers @("1","4")        # Selected tiers

Result analysis

notebooks/analyze_experiment_results/tier_level_analysis.ipynb — tier-level performance
notebooks/analyze_experiment_results/analysis_per_testcase.ipynb — cross-tier research questions

For more details, see the dedicated documentation: scripts/README.md

License

The GEMSS algorithm is licensed under the MIT License.

Release history

1.1.0 (this version): Feb 23, 2026
1.0.2: Feb 10, 2026
1.0.1: Feb 9, 2026

