
Process genomes with ease



snputils: A Python Library for Processing Genetic Variation and Population Structure

License: BSD-3

snputils is a Python package for processing and analyzing genomic datasets. It handles the complexities of different genome file formats and operations efficiently, and provides robust tools for working with sequencing and ancestry data, with a focus on performance, ease of use, and advanced visualization.

Developed as a collaboration among Stanford University's Department of Biomedical Data Science, the UC Santa Cruz Genomics Institute, and other collaborators worldwide.

Note: snputils is under active development. While the core API is stabilizing, we are continuously adding features, optimizing performance, and expanding format support.

Installation

Basic installation using pip:

pip install snputils

Optionally, for GPU-accelerated functionalities, install the package with the [gpu] extra:

pip install 'snputils[gpu]'
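
To check whether the optional GPU path is usable after installation, one option is to probe for PyTorch, which the Performance section names as the GPU backend. The sketch below assumes the [gpu] extra installs PyTorch; `gpu_backend_status` is a hypothetical helper written for illustration and is not part of snputils.

```python
import importlib.util


def gpu_backend_status() -> dict:
    """Report whether PyTorch is importable and whether it sees a CUDA device.

    Hypothetical helper for illustration; not part of the snputils API.
    """
    status = {"torch_installed": False, "cuda_available": False}
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec("torch") is None:
        return status
    try:
        import torch
    except Exception:
        # A present but broken install is treated as unavailable.
        return status
    status["torch_installed"] = True
    status["cuda_available"] = bool(torch.cuda.is_available())
    return status


print(gpu_backend_status())
```

If `cuda_available` is False, the CPU code paths remain the ones to rely on.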

Key Features

Ease of Use

snputils is designed to be user-friendly and intuitive, with a simple API for quickly loading, processing, and visualizing genomic data. For example, reading a whole-genome VCF file is as simple as:

import snputils as su
snpobj = su.read_snp("path/to/file.vcf.gz")

Similarly, reading BED or PGEN filesets is straightforward:

snpobj = su.read_snp("path/to/file.pgen")

Working with ancestry files, performing processing operations, and creating visualizations is just as straightforward. See the demos directory for examples.
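
To try the reader without a real dataset, a toy VCF can be written with the standard library alone. The sketch below is only an illustration: the file name, sample names, and variants are made up, and `write_toy_vcf` is a hypothetical helper, not part of snputils.

```python
import tempfile
from pathlib import Path

# A minimal phased VCF: three header lines, the column line, two variants.
TOY_VCF_LINES = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=22>",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "sample1", "sample2"]),
    "\t".join(["22", "100", "rs1", "A", "G", ".", "PASS", ".", "GT", "0|1", "1|1"]),
    "\t".join(["22", "250", "rs2", "C", "T", ".", "PASS", ".", "GT", "0|0", "0|1"]),
]


def write_toy_vcf(directory: Path) -> Path:
    """Write a two-sample, two-variant VCF into `directory` and return its path."""
    path = Path(directory) / "toy.vcf"
    path.write_text("\n".join(TOY_VCF_LINES) + "\n")
    return path


toy_path = write_toy_vcf(Path(tempfile.mkdtemp()))
print(toy_path)
```

The resulting path can then be tried with `su.read_snp(str(toy_path))`, in the same way as the calls shown above.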

File Format Support

snputils aims to provide the fastest available readers and writers for various genomic data formats:

  • VCF: Support for .vcf and .vcf.gz files
  • PLINK1: Support for .bed, .bim, .fam filesets
  • PLINK2: Support for .pgen, .pvar, .psam filesets
  • Local Ancestry: Support for the .msp local ancestry format
  • Admixture: Read and write .Q and .P files
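
The mapping from file extension to format family in the list above can be written down as a small lookup. This is a sketch only: `guess_format` and `FORMAT_BY_SUFFIX` are hypothetical names, and nothing here is claimed about how snputils itself selects a reader.

```python
from pathlib import Path

# Extension -> format family, taken from the list above (lower-cased).
FORMAT_BY_SUFFIX = {
    ".vcf": "VCF",
    ".bed": "PLINK1", ".bim": "PLINK1", ".fam": "PLINK1",
    ".pgen": "PLINK2", ".pvar": "PLINK2", ".psam": "PLINK2",
    ".msp": "Local Ancestry",
    ".q": "Admixture", ".p": "Admixture",
}


def guess_format(filename: str) -> str:
    """Return the format family for `filename`, or raise ValueError."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        # Only .vcf.gz appears in the list; other gzipped inputs are rejected here.
        if suffixes[-2:-1] != [".vcf"]:
            raise ValueError(f"unsupported compressed file: {filename}")
        return "VCF"
    if not suffixes or suffixes[-1] not in FORMAT_BY_SUFFIX:
        raise ValueError(f"unrecognized extension: {filename}")
    return FORMAT_BY_SUFFIX[suffixes[-1]]


for name in ["path/to/file.vcf.gz", "path/to/file.pgen", "cohort.3.Q"]:
    print(name, "->", guess_format(name))
```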

Processing & Analysis Tools

  • Basic Data Manipulation

    • Filter variants and samples, correct SNP flips, and filter out ambiguous SNPs
    • Standardized querying across genotype, local ancestry, global ancestry, and IBD data
  • Dimensionality Reduction

    • Standard PCA with optional GPU acceleration
    • Missing-data PCA (mdPCA)
    • Multi-array ancestry-specific MDS (maasMDS)
  • Population Genetic Statistics

    • Compute $D$, $f_2$, $f_3$, $f_4$, the $f_4$-ratio, and $F_{ST}$ (Hudson and Weir-Cockerham)
    • Includes block jackknife standard errors and optional ancestry masking
  • Identity-by-Descent (IBD) & Relatedness

    • Read hap-IBD and ancIBD outputs into a unified format
    • Fast filtering and ancestry-restricted segment trimming using local ancestry
  • Admixture Analysis & Simulation

    • Admixture Mapping: Locus-by-locus regression of local ancestry dosage on traits
    • Simulation: Lightweight haplotype-based simulation of admixed mosaics from real founder haplotypes
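
As a concrete illustration of one of the statistics listed above, Hudson's $F_{ST}$ can be computed from per-population allele counts. The sketch below follows the ratio-of-averages form of the Hudson estimator described by Bhatia et al. (2013); it is plain Python written for illustration, not snputils's implementation, and it omits block jackknife standard errors and ancestry masking. The function name and argument layout are assumptions.

```python
def hudson_fst(counts1, n1, counts2, n2):
    """Hudson F_ST across SNPs, as a ratio of summed numerators and denominators.

    counts1, counts2: per-SNP counts of one allele in populations 1 and 2.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each population needs at least two chromosomes")
    num = 0.0
    den = 0.0
    for a1, a2 in zip(counts1, counts2):
        p1 = a1 / n1
        p2 = a2 / n2
        # Squared frequency difference, corrected for finite sample size.
        num += (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
        # Probability that one draw from each population differs.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0:
        raise ValueError("no SNP is polymorphic between the two populations")
    return num / den


# Two SNPs fixed for different alleles in the two populations.
print(hudson_fst([10, 10], 10, [0, 0], 10))
```

Summing numerators and denominators before dividing, rather than averaging per-SNP ratios, keeps rare variants from dominating the estimate. With identical sample frequencies the estimate is slightly negative because of the sample-size correction.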

Visualization

  • Interactive global ancestry bar plots
  • Detailed scatter plots of PCA, mdPCA, and maasMDS
  • Admixture mapping Manhattan plots
  • Local ancestry visualization
    • Chromosome painting (with Tagore)
    • Dataset-level visualization

Performance

  • Fast file I/O through built-in methods or optimized wrappers (e.g., Pgenlib for PLINK files)
  • Memory-efficient operations using NumPy and Polars
  • Optional GPU acceleration via PyTorch for computationally intensive tasks
  • Support for large-scale genomic datasets through efficient memory management

In our benchmark of read performance on chromosome 22 data, snputils outperforms the existing tools it was compared against. See the benchmark directory for detailed methodology and results.
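
To time a reader on your own data, a small best-of-N harness is often enough. The sketch below uses only the standard library; `best_of` is a hypothetical helper and is not the methodology used in the project's benchmark directory.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Call `fn` `repeats` times and return the fastest wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is the run least disturbed by other system activity.
    return min(timings)


# Stand-in workload; swap in a real reader call to time file loading.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

Replacing the stand-in with `lambda: su.read_snp("path/to/file.vcf.gz")` would time the reader shown earlier; note that repeated reads of the same file may be served from the operating system's cache.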

The snputils package is continuously updated with new features and improvements.

Documentation & Support

  • Documentation: Comprehensive API reference at docs.snputils.org.
  • Examples & Tutorials: Check out our interactive notebooks in the demos directory.
  • Issues & Community: Report bugs, ask questions, or request features via GitHub Issues.

Acknowledgments

We would like to thank the open-source Python packages that make snputils possible: matplotlib, NumPy, pandas, Pgenlib, polars, pong, PyTorch, scikit-allel, scikit-learn, Tagore.

Citation

If you use snputils in your research, please cite:

Bonet, D.*, Comajoan Cara, M.*, Barrabés, M.*, Smeriglio, R., Agrawal, D., Dominguez Mantes, A., López, C., Thomassin, C., Calafell, A., Luis, A., Saurina, J., Franquesa, M., Perera, M., Geleta, M., Jaras, A., Sabat, B. O., Abante, J., Moreno-Grau, S., Mas Montserrat, D., Ioannidis, A. G., snputils: A Python library for processing diverse genomes. Annual Meeting of The American Society of Human Genetics, November 2024, Denver, Colorado, USA. *Equal contribution.

Journal paper coming soon!
