snputils: A Python library for processing diverse genomes

License: BSD-3

snputils is a Python package for processing and analyzing common and diverse genomic datasets. It handles the complexities of multiple genome file formats and operations efficiently, and provides tools for sequencing and ancestry data with a focus on performance, ease of use, and visualization.

snputils is developed in collaboration between Stanford University's Department of Biomedical Data Science, the UC Santa Cruz Genomics Institute, and other collaborators worldwide.

This is an early access release; parts of the code are likely to change significantly in the upcoming weeks.

Installation

Basic installation using pip:

pip install snputils

Optionally, for GPU-accelerated functionalities, install the package with the [gpu] extra:

pip install "snputils[gpu]"
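
After either install command, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library; the helper name `installed_version` is an illustrative assumption and is not part of snputils.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("snputils")
    print(f"snputils: {found or 'not installed'}")
```

Because the lookup reads package metadata rather than importing the package, it works even when optional dependencies such as the GPU extra are missing.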

Key Features

Ease of Use

snputils is designed to be user-friendly and intuitive, with a simple API that allows you to quickly load, process, and visualize genomic data. For example, reading a whole genome VCF file is as simple as:

import snputils as su
snpobj = su.read_snp("path/to/file.vcf.gz")

Similarly, reading BED or PGEN filesets is straightforward:

snpobj = su.read_snp("path/to/file.pgen")

Working with ancestry files, performing processing operations, and creating visualizations is just as straightforward. See the demos directory for examples.

File Format Support

snputils aims to provide the fastest available readers and writers for various genomic data formats:

  • VCF: Support for .vcf and .vcf.gz files
  • PLINK1: Support for .bed, .bim, .fam filesets
  • PLINK2: Support for .pgen, .pvar, .psam filesets
  • Local Ancestry: Handle .msp local ancestry format
  • Admixture: Read and write .Q and .P files
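
To make the extension list above concrete, here is a minimal sketch of how a file path could be mapped to one of these formats and, for PLINK filesets, expanded to its companion files. The mapping and both helper names (`guess_format`, `fileset_paths`) are hypothetical and written for illustration; they do not describe how snputils dispatches readers internally.

```python
from pathlib import Path
from typing import List

# Extensions come from the format list above; the table itself is an
# illustrative assumption, not snputils' internal dispatch logic.
FORMAT_BY_SUFFIX = {
    ".vcf": "VCF",
    ".bed": "PLINK1",
    ".bim": "PLINK1",
    ".fam": "PLINK1",
    ".pgen": "PLINK2",
    ".pvar": "PLINK2",
    ".psam": "PLINK2",
    ".msp": "Local Ancestry",
    ".Q": "Admixture",
    ".P": "Admixture",
}

# Formats that are stored as several files sharing one prefix.
FILESET_MEMBERS = {
    "PLINK1": (".bed", ".bim", ".fam"),
    "PLINK2": (".pgen", ".pvar", ".psam"),
}


def guess_format(path: str) -> str:
    """Guess the format name from the file extension (hypothetical helper)."""
    p = Path(path)
    suffix = p.suffix
    if suffix == ".gz":
        # Look through the compression suffix, e.g. file.vcf.gz -> .vcf
        suffix = Path(p.stem).suffix
    if suffix not in FORMAT_BY_SUFFIX:
        raise ValueError(f"unrecognized genomic file extension: {path!r}")
    return FORMAT_BY_SUFFIX[suffix]


def fileset_paths(path: str) -> List[str]:
    """Return every file expected alongside `path` for fileset formats."""
    members = FILESET_MEMBERS.get(guess_format(path))
    if members is None:
        return [path]
    # Strip only the final suffix so prefixes containing dots stay intact.
    prefix = str(Path(path).with_suffix(""))
    return [prefix + ext for ext in members]
```

For example, `fileset_paths("my.study.bed")` lists the `.bed`, `.bim`, and `.fam` files that share the `my.study` prefix, which is a quick way to check that a fileset is complete before reading it.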

Processing Tools

  • Basic Data Manipulation

    • Filter variants and samples
    • Correct SNP flips
    • Filter out ambiguous SNPs
  • Dimensionality Reduction

    • Standard PCA with optional GPU acceleration
    • Missing-DNA PCA (mdPCA)
    • Multi-array ancestry-specific MDS (maasMDS)
  • Admixture Mapping
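
As background for the basic data manipulation items above, the sketch below illustrates one common meaning of "ambiguous SNPs" and "flips": a SNP is strand-ambiguous when its two alleles are complements of each other (A/T or C/G), and a strand flip is a variant that matches a target only after complementing both alleles. This is a plain-Python illustration under those assumptions; the function names are hypothetical and this is not snputils' implementation or API.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# A variant here is (identifier, reference allele, alternate allele).
Variant = Tuple[str, str, str]


def is_strand_ambiguous(ref: str, alt: str) -> bool:
    """True for A/T and C/G SNPs, which look identical on both strands."""
    return COMPLEMENT.get(ref) == alt


def is_strand_flip(ref: str, alt: str, target_ref: str, target_alt: str) -> bool:
    """True if (ref, alt) equals the target only after complementing both."""
    if is_strand_ambiguous(ref, alt):
        # Alleles alone cannot reveal the strand of an ambiguous SNP.
        return False
    return (COMPLEMENT.get(ref), COMPLEMENT.get(alt)) == (target_ref, target_alt)


def drop_ambiguous(variants: List[Variant]) -> List[Variant]:
    """Keep only variants whose strand can be resolved from the alleles."""
    return [v for v in variants if not is_strand_ambiguous(v[1], v[2])]


if __name__ == "__main__":
    variants = [("rs1", "A", "G"), ("rs2", "A", "T"), ("rs3", "C", "G")]
    print(drop_ambiguous(variants))
```

Multi-base alleles such as indels are never reported as ambiguous here, because they are not keys in `COMPLEMENT`.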

Visualization

  • Interactive global ancestry bar plots
  • Detailed scatter plots of PCA, mdPCA, and maasMDS
  • Admixture mapping Manhattan plots
  • Local ancestry visualization
    • Chromosome painting (with Tagore)
    • Dataset-level

Performance

  • Fast file I/O through built-in methods or optimized wrappers (e.g., Pgenlib for PLINK files)
  • Memory-efficient operations using NumPy and Polars
  • Optional GPU acceleration via PyTorch for computationally intensive tasks
  • Support for large-scale genomic datasets through efficient memory management

Our benchmark compares reading performance on chromosome 22 data across different tools, with snputils outperforming the existing tools tested. See the benchmark directory for detailed methodology and results.
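
For readers who want a rough read-time measurement on their own data, a minimal timing harness might look like the sketch below. It is not the methodology used in the benchmark directory; `time_reader` and `summarize` are hypothetical helpers, and results will vary with disk speed and operating-system file caching.

```python
import statistics
import time
from typing import Any, Callable, Dict, List


def time_reader(read: Callable[[str], Any], path: str, repeats: int = 3) -> List[float]:
    """Call `read(path)` `repeats` times; return wall-clock seconds per call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read(path)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Report the best and median time, which are less noisy than the mean."""
    return {"best": min(timings), "median": statistics.median(timings)}
```

With snputils installed, the reader shown earlier can be passed in directly, for example `summarize(time_reader(su.read_snp, "path/to/file.vcf.gz"))`.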

The snputils package is continuously updated with new features and improvements. Future releases will include support for statistical computations, admixture simulations, command-line tools, and more.

Acknowledgments

We would like to thank the open-source Python packages that make snputils possible: matplotlib, NumPy, pandas, Pgenlib, polars, pong, PyTorch, scikit-allel, scikit-learn, Tagore.

Citation

If you use snputils in your research, please cite:

Bonet, D.*, Comajoan Cara, M.*, Barrabés, M.*, Smeriglio, R., Agrawal, D., Dominguez Mantes, A., López, C., Thomassin, C., Calafell, A., Luis, A., Saurina, J., Franquesa, M., Perera, M., Geleta, M., Jaras, A., Sabat, B. O., Abante, J., Moreno-Grau, S., Mas Montserrat, D., Ioannidis, A. G., snputils: A Python library for processing diverse genomes. Annual Meeting of The American Society of Human Genetics, November 2024, Denver, Colorado, USA. * Equal contribution.

Journal paper coming soon!
