Python machine and deep learning API to impute missing genotypes

PG-SUI

Population Genomic Supervised and Unsupervised Imputation.

About PG-SUI

PG-SUI is a Python 3 API that uses machine learning to impute missing values in population genomic SNP data. It provides several supervised and unsupervised machine learning algorithms, along with some simpler non-machine-learning imputers.

Below is some general information and a basic tutorial. For more detailed information, see our API Documentation.

Unsupervised Imputation Methods

Unsupervised imputers include two custom neural network models:

Variational Autoencoder (VAE) [1]
- VAE models are trained to reconstruct their input (i.e., the genotypes) [1]. To use a VAE for imputation, the missing values are masked and the model is trained to reconstruct only the known values. Once trained, the model is used to predict the missing values.
Autoencoder [2]
- A standard autoencoder that is trained to reconstruct its own input [2]. As with the VAE, missing values are masked and the model is trained only on known values. Predictions are then made for the missing values.
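
The masking idea described above can be illustrated with a small, library-free sketch. It is not PG-SUI's implementation: the `MISSING` sentinel, the 0/1/2 genotype encoding, and the use of mean squared error are assumptions made only for illustration. The point is that missing entries contribute nothing to the reconstruction loss, so the model is only ever scored on known genotypes.

```python
# Hypothetical sentinel for a missing genotype call (assumption for this sketch).
MISSING = -1


def masked_mse(observed, reconstructed):
    """Mean squared error computed over observed (non-missing) entries only."""
    total = 0.0
    count = 0
    for obs_row, rec_row in zip(observed, reconstructed):
        for obs, rec in zip(obs_row, rec_row):
            if obs == MISSING:
                continue  # masked entry: excluded from the loss
            total += (obs - rec) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed genotypes to score")
    return total / count


# Two samples, three SNP sites, genotypes encoded as 0/1/2 (assumed encoding).
genotypes = [[0, 1, MISSING], [2, MISSING, 1]]
# Made-up model output; values at the missing positions are deliberately far off.
reconstruction = [[0.0, 1.5, 9.0], [2.0, -4.0, 1.0]]

loss = masked_mse(genotypes, reconstruction)
print(loss)
```

The poor reconstructions at the two missing positions do not affect the loss, which is what allows those positions to be predicted freely after training.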

See the below diagram for an overview of implemented features for each model.

Supervised Imputation Methods

Supervised methods use scikit-learn's IterativeImputer, which is based on the MICE (Multivariate Imputation by Chained Equations) algorithm [3]. It iterates over each SNP site (i.e., feature) and uses the N nearest neighbor features to inform the imputation. The number of nearest features can be adjusted by users. IterativeImputer currently works with the following scikit-learn classifiers:

ImputeRandomForest
ImputeHistGradientBoosting

See the scikit-learn documentation for more information on IterativeImputer and each of the classifiers.
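
To make the "N nearest features" idea concrete, here is a small sketch that ranks candidate predictor sites for one target SNP. It is a simplification, not the code PG-SUI or scikit-learn runs: scikit-learn documents nearness as the absolute correlation coefficient between feature pairs and draws neighbors with probability proportional to it, whereas this sketch takes a deterministic top-N ranking. The function names and the toy matrix are hypothetical, and the matrix is assumed to be fully observed (for example, after an initial fill).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic site carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def nearest_features(matrix, target, n_nearest):
    """Indices of the n_nearest sites most correlated (in absolute value) with target."""
    columns = list(zip(*matrix))  # one tuple per SNP site
    scores = [
        (abs(pearson(columns[target], columns[j])), j)
        for j in range(len(columns))
        if j != target
    ]
    # Highest absolute correlation first; ties broken by site index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:n_nearest]]


# Rows are samples, columns are SNP sites, genotypes encoded 0/1/2 (assumed).
genotypes = [
    [0, 0, 2, 1],
    [1, 1, 1, 1],
    [2, 2, 0, 1],
    [0, 1, 2, 0],
]

neighbors = nearest_features(genotypes, target=0, n_nearest=2)
print(neighbors)
```

In a chained-equations loop, a classifier for the target site would then be fitted on only these neighbor sites, and the process repeated for every site with missing calls.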

Non-Machine Learning (Deterministic) Methods

We also include several deterministic options for imputing missing data, including:

Per-population mode per SNP site
Overall mode per SNP site
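
The difference between the two mode options can be sketched in plain Python. This is an illustration under stated assumptions, not PG-SUI's code: `impute_mode`, the `"N"` missing marker, and the fallback to the overall mode when a population has no observed calls at a site are all hypothetical choices.

```python
from collections import Counter

MISSING = "N"  # assumed missing-call marker for this sketch


def column_mode(calls):
    """Most frequent observed call, or MISSING if nothing was observed."""
    observed = [call for call in calls if call != MISSING]
    if not observed:
        return MISSING
    return Counter(observed).most_common(1)[0][0]


def impute_mode(genotypes, populations=None):
    """Fill missing calls with the per-site mode, overall or per population."""
    imputed = [list(row) for row in genotypes]
    for site in range(len(genotypes[0])):
        overall = column_mode([row[site] for row in genotypes])
        pop_modes = {}
        if populations is not None:
            for pop in set(populations):
                pop_modes[pop] = column_mode(
                    [row[site] for row, p in zip(genotypes, populations) if p == pop]
                )
        for i, row in enumerate(genotypes):
            if row[site] != MISSING:
                continue
            fill = overall
            # Prefer the population mode when that population has observed calls.
            if populations is not None and pop_modes[populations[i]] != MISSING:
                fill = pop_modes[populations[i]]
            imputed[i][site] = fill
    return imputed


genotypes = [
    ["A", "C"],
    ["A", "N"],
    ["N", "C"],
    ["G", "T"],
    ["G", "T"],
    ["G", "N"],
    ["G", "T"],
]
pops = ["p1", "p1", "p1", "p2", "p2", "p2", "p2"]

overall_filled = impute_mode(genotypes)
per_pop_filled = impute_mode(genotypes, populations=pops)
print(overall_filled[2], per_pop_filled[2])
```

For the third sample, the overall mode at the first site is G (driven by the larger population), while the per-population mode is A, which is why the two options can give different answers in structured data.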

Installing PG-SUI

PG-SUI supports both pip and conda distributions. Both are kept current with up-to-date releases.

Installation with Pip

To install PG-SUI with pip, do the following. It is strongly recommended to install pg-sui in a virtual environment.

python3 -m venv .pgsui-venv
source .pgsui-venv/bin/activate
pip install pg-sui

Installation with Anaconda

To install PG-SUI with Anaconda, do the following:

conda create -n pgsui-env python=3.12
conda activate pgsui-env
conda install -c btmartin721 pg-sui

Docker Container

We also maintain a Docker image that comes with PG-SUI preinstalled. This can be useful for automated workflows such as Nextflow or Snakemake.

docker pull pg-sui:latest

Optional macOS GUI

PG-SUI ships an optional Electron GUI (Graphical User Interface) wrapper around the Python CLI. The GUI currently supports macOS only.

Install the Python-side extras (FastAPI/uvicorn helper) if you want to serve from Python: pip install pg-sui[gui]
Install Node.js and fetch the app dependencies: pgsui-gui-setup
Launch the graphical interface: pgsui-gui

The GUI shells out to the same CLI underneath, so presets, overrides, and YAML configs behave identically.

Input Data

You can read your input files as a GenotypeData object from the SNPio package. SNPio supports the VCF, PHYLIP, STRUCTURE, and GENEPOP input file formats.

# Import snpio. Automatically installed with pg-sui.
from snpio import VCFReader

# Read in VCF alignment.
# SNPio also supports PHYLIP, STRUCTURE, and GENEPOP input file formats.
data = VCFReader(
    filename="pgsui/example_data/phylogen_subset14K.vcf.gz",
    popmapfile="pgsui/example_data/popmaps/phylogen_nomx.popmap", # optional
    force_popmap=True, # optional
)

Supported Imputation Methods

There are several supported algorithms PG-SUI uses to impute missing data. Each one can be run by calling the corresponding class. You must provide a GenotypeData instance as the first positional argument.

You can import all the supported methods with the following:

from pgsui import ImputeVAE, ImputeAutoencoder, ImputeRefAllele, ImputeMostFrequent, ImputeRandomForest, ImputeHistGradientBoosting

Unsupervised Imputers

The two unsupervised imputers can be run by initializing them with the SNPio GenotypeData object and then calling fit() and transform().

# Initialize the models, then fit and impute
vae = ImputeVAE(data) # Variational autoencoder
vae.fit()
vae_imputed = vae.transform()

ae = ImputeAutoencoder(data) # standard autoencoder
ae.fit()
ae_imputed = ae.transform()

The *_imputed objects are NumPy arrays of IUPAC single-character codes that are compatible with SNPio's GenotypeData objects.
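
For readers unfamiliar with IUPAC single-character genotype codes, the sketch below maps an unphased diploid genotype to its standard IUPAC nucleotide code. The mapping itself is the standard ambiguity-code table; the function name `to_iupac` and the treatment of any genotype containing `"N"` as missing are assumptions for illustration and may differ from how PG-SUI or SNPio handle these cases internally.

```python
# Standard IUPAC codes for unphased diploid genotypes (allele order is ignored).
IUPAC = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def to_iupac(allele1, allele2):
    """Return the IUPAC code for a diploid genotype; 'N' marks missing data."""
    if "N" in (allele1, allele2):
        return "N"  # assumption: any missing allele makes the call missing
    return IUPAC[frozenset((allele1, allele2))]


codes = [to_iupac("A", "G"), to_iupac("T", "C"), to_iupac("C", "C"), to_iupac("N", "A")]
print(codes)
```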

Supervised Imputers

Various supervised imputation options are supported, and these use the same API design.

# Supervised IterativeImputer classifiers

# Random Forest
rf = ImputeRandomForest(data)
rf.fit()
imputed_rf = rf.transform()

# HistGradientBoosting
hgb = ImputeHistGradientBoosting(data)
hgb.fit()
imputed_hgb = hgb.transform()

Non-machine learning methods

The following deterministic methods are supported. ImputeMostFrequent supports the mode-per-population or overall (global) mode options to inform imputation.

# Per-population, per-locus mode
pop_mode = ImputeMostFrequent(data, by_populations=True)
pop_mode.fit()
imputed_pop_mode = pop_mode.transform()

# Per-locus mode
mode = ImputeMostFrequent(data, by_populations=False)
mode.fit()
imputed_mode = mode.transform()

Or, always replace missing values with the reference allele.

ref = ImputeRefAllele(data)
ref.fit()
imputed_ref = ref.transform()

Command-Line Interface

Run the PG-SUI CLI with pg-sui (installed alongside the library). The CLI follows the same precedence model as the Python API:

code defaults < preset (--preset) < YAML (--config) < explicit CLI flags < --set key=value.
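
The precedence chain above amounts to merging configuration layers from lowest to highest priority. The following is a minimal sketch of that idea, not PG-SUI's actual resolver: the function names, the nested key names (`tune`, `sim`, and so on), and the literal parsing of `--set` values are all hypothetical.

```python
import ast
import copy


def deep_merge(base, override):
    """Return a new dict where values in override win over values in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_set(config, assignment):
    """Apply one 'dotted.key=value' override in place (highest precedence)."""
    dotted, raw = assignment.split("=", 1)
    try:
        value = ast.literal_eval(raw)  # parses True, 0.3, 100, ...
    except (ValueError, SyntaxError):
        value = raw  # keep anything else as a plain string
    keys = dotted.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def resolve(defaults, preset, yaml_cfg, cli_flags, set_overrides):
    """Merge layers in increasing order of precedence."""
    config = {}
    for layer in (defaults, preset, yaml_cfg, cli_flags):
        config = deep_merge(config, layer)
    for assignment in set_overrides:
        apply_set(config, assignment)
    return config


# Hypothetical layer contents, for illustration only.
resolved = resolve(
    defaults={"tune": {"enabled": False, "n_trials": 50}, "sim": {"prop": 0.1}},
    preset={"tune": {"n_trials": 100}},
    yaml_cfg={"sim": {"prop": 0.2}},
    cli_flags={"sim": {"prop": 0.3}},
    set_overrides=["tune.enabled=True"],
)
print(resolved)
```

Each later layer only overrides the keys it actually names, so a preset can change one tuning value without discarding the remaining defaults.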

Recent releases add explicit switches for the simulated-missingness workflow shared by the neural and supervised models:

--sim-strategy selects one of random, random_weighted, random_weighted_inv, nonrandom, nonrandom_weighted.
--sim-prop sets the proportion of observed calls to temporarily mask when building the evaluation set.

Example:

pg-sui \
  --input data.vcf.gz \
  --popmap pops.popmap \
  --models ImputeVAE ImputeAutoencoder \
  --preset balanced \
  --sim-strategy random_weighted_inv \
  --sim-prop 0.3 \
  --prefix ae_and_vae \
  --n-jobs 4 \
  --tune-n-trials 100 \
  --set tune.enabled=True

CLI overrides cascade into every selected model, so a single invocation can evaluate multiple imputers with a consistent simulation strategy and output prefix.
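
The simulated-missingness workflow controlled by --sim-prop can be pictured as follows. This sketch covers only a uniform random strategy and is not PG-SUI's implementation: the `simulate_missing` function, the `MISSING` sentinel, and the 0/1/2 encoding are assumptions. A proportion of the observed calls is hidden, and the hidden true values are kept so that imputed values can later be scored against them.

```python
import random

MISSING = -1  # assumed sentinel for a missing genotype call


def simulate_missing(genotypes, prop, seed=0):
    """Mask a proportion of the observed calls; return the masked copy and the truth."""
    if not 0.0 < prop < 1.0:
        raise ValueError("prop must be between 0 and 1 (exclusive)")
    observed = [
        (i, j)
        for i, row in enumerate(genotypes)
        for j, call in enumerate(row)
        if call != MISSING
    ]
    n_mask = round(prop * len(observed))
    rng = random.Random(seed)  # seeded for reproducibility
    chosen = rng.sample(observed, n_mask)

    masked = [list(row) for row in genotypes]
    truth = {}
    for i, j in chosen:
        truth[(i, j)] = genotypes[i][j]  # remember the true call for evaluation
        masked[i][j] = MISSING
    return masked, truth


# Three samples, four sites, two calls already missing (10 observed calls).
genotypes = [
    [0, 1, 2, MISSING],
    [1, 1, 0, 2],
    [MISSING, 0, 2, 1],
]

masked, truth = simulate_missing(genotypes, prop=0.3, seed=42)
print(len(truth), "observed calls were masked for evaluation")
```

Only calls that were originally observed are ever masked, so accuracy is always measured against known genotypes.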

STRUCTURE inputs accept a few extra flags for parsing metadata:

pg-sui \
  --input data.str \
  --format structure \
  --structure-has-popids \
  --structure-allele-start-col 2 \
  --structure-allele-encoding '{"1":"A","2":"C","3":"G","4":"T","-9":"N"}'
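
The JSON passed to --structure-allele-encoding maps STRUCTURE integer allele codes to nucleotides. The sketch below shows what such a mapping does to a small block of allele codes; the `decode_alleles` helper and the sample rows are hypothetical, and how PG-SUI applies the mapping internally is not shown in this document.

```python
import json


def decode_alleles(rows, encoding_json):
    """Translate integer STRUCTURE allele codes to nucleotides using a JSON mapping."""
    mapping = json.loads(encoding_json)  # JSON object keys are always strings
    return [[mapping[str(code)] for code in row] for row in rows]


# Same encoding string as in the command above.
encoding = '{"1":"A","2":"C","3":"G","4":"T","-9":"N"}'

# Hypothetical allele rows: -9 is the conventional STRUCTURE missing-data code.
rows = [[1, 3, -9], [4, 4, 2]]

decoded = decode_alleles(rows, encoding)
print(decoded)
```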

References

[1] Kingma, D.P., & Welling, M. (2013). Auto-encoding variational Bayes. In: Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1312.6114 [stat.ML].
[2] Hinton, G.E., & Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
[3] van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45, 1-67.

Release history

This version: 1.7.8 (Feb 15, 2026)
