Process genomes with ease

snputils: A Python Library for Processing Genetic Variation and Population Structure

License: BSD-3

snputils is a Python package for processing and analyzing genomic datasets. It handles the complexities of different genome file formats and operations efficiently, and provides robust tools for sequencing and ancestry data, with a focus on performance, ease of use, and advanced visualization.

Developed as a collaboration between Stanford University's Department of Biomedical Data Science, the UC Santa Cruz Genomics Institute, and other collaborators worldwide.

Note: snputils is under active development. While the core API is stabilizing, we are continuously adding features, optimizing performance, and expanding format support.

Installation

Basic installation using pip:

pip install snputils

Optionally, for PyTorch-backed features, install with the [torch] extra:

pip install 'snputils[torch]'
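
To confirm which installation you ended up with, a small check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper name `describe_install` is hypothetical and not part of snputils.

```python
from importlib import metadata, util


def describe_install(package: str = "snputils") -> dict:
    """Report whether `package` is installed and whether PyTorch is importable.

    Hypothetical helper for illustration; it is not part of snputils.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "package": package,
        "version": version,
        # find_spec returns None when the module cannot be located
        "torch_available": util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    print(describe_install())
```

If `torch_available` is false, the PyTorch-backed features are presumably unavailable until the `[torch]` extra is installed.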

Key Features

Ease of Use

snputils offers a simple, intuitive API for quickly loading, processing, and visualizing genomic data. For example, reading a whole-genome VCF file takes two lines:

import snputils as su
snpobj = su.read_snp("path/to/file.vcf.gz")

Similarly, reading BED or PGEN filesets is straightforward:

snpobj = su.read_snp("path/to/file.pgen")

Working with ancestry files, performing processing operations, and creating visualizations is just as straightforward. See the tutorial notebooks for examples.

File Format Support

snputils aims to provide the fastest available readers and writers for various genomic data formats:

  • VCF: Support for .vcf and .vcf.gz files
  • BGEN: Support for .bgen files
  • PLINK1: Support for .bed, .bim, .fam filesets
  • PLINK2: Support for .pgen, .pvar, .psam filesets
  • Local Ancestry: Handle .msp and FLARE .anc.vcf.gz local ancestry formats
  • Admixture: Read and write .Q and .P files
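
When a pipeline receives mixed inputs, it can be handy to classify paths by the extensions listed above before deciding how to read them. The sketch below is illustrative only: `SUFFIX_TO_FORMAT` and `guess_format` are hypothetical names, and this is not how snputils dispatches readers internally.

```python
from pathlib import Path
from typing import Optional

# Suffix -> format family, copied from the list of supported formats.
# Illustrative only; this is not snputils's internal dispatch table.
SUFFIX_TO_FORMAT = {
    ".vcf": "VCF",
    ".bgen": "BGEN",
    ".bed": "PLINK1", ".bim": "PLINK1", ".fam": "PLINK1",
    ".pgen": "PLINK2", ".pvar": "PLINK2", ".psam": "PLINK2",
    ".msp": "Local Ancestry",
    ".Q": "Admixture", ".P": "Admixture",
}


def guess_format(path: str) -> Optional[str]:
    """Guess the format family of `path` from its file name alone."""
    name = Path(path).name
    # FLARE output also ends in .vcf.gz, so it must be checked first
    if name.endswith(".anc.vcf.gz"):
        return "Local Ancestry"
    if name.endswith(".vcf.gz"):
        return "VCF"
    # Unknown suffixes fall through to None
    return SUFFIX_TO_FORMAT.get(Path(name).suffix)


if __name__ == "__main__":
    for example in ("path/to/file.vcf.gz", "path/to/file.pgen", "notes.txt"):
        print(example, "->", guess_format(example))
```

Matching is case-sensitive here, so `.Q` and `.P` are recognized but `.q` and `.p` are not.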

Processing & Analysis Tools

  • Basic Data Manipulation

    • Filter variants and samples, correct SNP flips, and filter out ambiguous SNPs
    • Compute cohort and ancestry-specific allele frequencies via SNPObject.allele_freq(...), or in streaming mode for memory efficiency with snputils.stats.allele_freq_stream(...)
    • Standardized querying across genotype, local ancestry, global ancestry, and IBD data
  • Dimensionality Reduction

    • Standard PCA with optional GPU acceleration
    • Missing-data PCA (mdPCA)
    • Multi-array ancestry-specific MDS (maasMDS)
  • Population Genetic Statistics

    • Compute $D$, $f_2$, $f_3$, $f_4$, the $f_4$-ratio, and $F_{ST}$ (Hudson, Weir-Cockerham, and Tsallis $F_{q}$)
    • Includes block jackknife standard errors and optional ancestry masking
  • Identity-by-Descent (IBD) & Relatedness

    • Read hap-IBD and ancIBD outputs into a unified format
    • Fast filtering and ancestry-restricted segment trimming using local ancestry
  • Association Analysis

    • Admixture Mapping: Locus-by-locus regression of local ancestry dosage on traits
    • GWAS: Variant-level association testing on SNP dosages for binary and quantitative traits
  • Admixture Simulation

    • Simulation: Lightweight haplotype-based simulation of admixed mosaics from real founder haplotypes
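
To make a few of the quantities above concrete, here is a pure-Python sketch of cohort allele frequency, a strand-ambiguity check, and Hudson's $F_{ST}$ computed as a ratio of sums over variants (the estimator described by Bhatia et al., 2013). It assumes diploid genotypes coded as alternate-allele counts with `None` for missing calls. All function names are hypothetical, and this is not the snputils implementation, which may differ in conventions and is far more efficient.

```python
from typing import Optional, Sequence

# One variant: alternate-allele counts per sample (0, 1 or 2); None is a missing call.
Genotypes = Sequence[Optional[int]]


def allele_freq(genotypes: Genotypes) -> Optional[float]:
    """Alternate-allele frequency for one diploid variant, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # no called genotypes, so the frequency is undefined
    return sum(called) / (2 * len(called))


def is_strand_ambiguous(ref: str, alt: str) -> bool:
    """A/T and C/G SNPs read the same on both strands, so flips cannot be detected."""
    return {ref.upper(), alt.upper()} in ({"A", "T"}, {"C", "G"})


def hudson_fst(pop1: Sequence[Genotypes], pop2: Sequence[Genotypes]) -> float:
    """Hudson F_ST as a ratio of sums over variants (one entry per variant)."""
    num = den = 0.0
    for g1, g2 in zip(pop1, pop2):
        c1 = [g for g in g1 if g is not None]
        c2 = [g for g in g2 if g is not None]
        n1, n2 = 2 * len(c1), 2 * len(c2)  # sampled chromosomes per population
        if n1 < 2 or n2 < 2:
            continue  # skip variants with no called genotypes in a population
        p1, p2 = sum(c1) / n1, sum(c2) / n2
        # Squared frequency difference minus the finite-sample correction terms
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den else float("nan")


if __name__ == "__main__":
    print(allele_freq([0, 1, 2, None]))
    print(is_strand_ambiguous("A", "T"), is_strand_ambiguous("A", "G"))
    print(hudson_fst([[0, 0]], [[2, 2]]))
```

Summing numerators and denominators across variants before dividing, rather than averaging per-variant ratios, is the usual choice because it is less sensitive to rare variants.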

Visualization

  • Interactive global ancestry bar plots
  • Detailed scatter plots of PCA, mdPCA, and maasMDS
  • Admixture mapping Manhattan plots
  • Local ancestry visualization
    • Chromosome painting (with Tagore)
    • Dataset-level

Performance

  • Fast file I/O through built-in methods or optimized wrappers (e.g., Pgenlib for PLINK files)
  • Memory-efficient operations using NumPy and Polars
  • Optional GPU acceleration via PyTorch for computationally intensive tasks
  • Support for large-scale genomic datasets through efficient memory management

Our benchmark, which compares reading time and peak memory on chromosome 22 data across different tools, shows snputils outperforming existing tools. See the benchmark directory for detailed methodology and results.

The snputils package is continuously updated with new features and improvements.

Command-Line Interface

Installing the package provides a snputils command for common file-backed workflows:

snputils --help
snputils --version

Available subcommands include:

  • pca: run standard PCA and save coordinates/components and a scatter plot.
  • mdpca: run missing-data PCA and save an embedding table.
  • maasmds: run ancestry-specific MDS and save an embedding table.
  • admixture-map: run admixture mapping from phenotype and local ancestry files.
  • gwas: run variant-level association testing from phenotype and genotype files.
  • simulate: simulate admixed haplotype batches from phased founder haplotypes.
  • plot-manhattan and plot-qq: render association result visualizations.

The Python API remains the full surface for low-level readers/writers, object manipulation, IBD filtering and trimming, f-statistics, allele-frequency helpers, custom visualizations, and notebook-oriented workflows. Use the CLI when a workflow naturally starts from files and produces files; use Python when you need programmatic composition or in-memory objects.
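
When mixing the two approaches, a script can check that the command is available before shelling out to it. The sketch below relies only on the documented `snputils --version` invocation and the standard library; the helper name `snputils_cli_version` is hypothetical, and no subcommand flags are assumed.

```python
import shutil
import subprocess
from typing import Optional


def snputils_cli_version(executable: str = "snputils") -> Optional[str]:
    """Return the output of `<executable> --version`, or None if it is not on PATH.

    Hypothetical helper for illustration; it is not part of snputils.
    """
    path = shutil.which(executable)
    if path is None:
        return None  # the command is not installed or not on PATH
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some CLIs print the version to stderr, so fall back to it when stdout is empty
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    print(snputils_cli_version())
```

For the options each subcommand accepts, consult `snputils --help` rather than guessing flag names.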

Documentation & Support

  • Documentation: User guide, tutorials, and API reference at docs.snputils.org.
  • Examples & Tutorials: Browse the tutorials in the documentation or the source notebooks in docs/tutorials.
  • Issues & Community: Report bugs, ask questions, or request features via GitHub Issues.

Citation

If you use snputils in your research, please cite our paper:

@article{snputils2026,
    author    = {Bonet, David and Comajoan Cara, Marçal and Barrabés, Míriam and Smeriglio, Riccardo and Agrawal, Devang and Aounallah, Khaled and Geleta, Margarita and Dominguez Mantes, Albert and Thomassin, Christophe and Shanks, Cole and Huang, Edward C. and Franquesa Monés, Marc and Luis, Aina and Saurina, Joan and Perera, Maria and López, Cayetana and Sabat, Benet Oriol and Abante, Jordi and Moreno-Grau, Sonia and Mas Montserrat, Daniel and Ioannidis, Alexander G.},
    title     = {{snputils}: A High-Performance {Python} Library for Genetic Variation and Population Structure},
    year      = {2026},
    doi       = {10.64898/2026.02.28.708618},
    url       = {https://www.biorxiv.org/content/10.64898/2026.02.28.708618},
    journal   = {bioRxiv},
    publisher = {Cold Spring Harbor Laboratory},
}

Acknowledgments

We would like to thank the open-source packages that make snputils possible.
