A collection of tools for genotype quality control and analysis

GenoTools

Published in G3: https://www.biorxiv.org/content/10.1101/2024.03.26.586362v1.full.pdf

Documentation

You can find the full documentation with the following links:

Getting Started

GenoTools is a suite of automated genotype data processing steps written in Python. The core pipeline was built for quality control and ancestry estimation of data in the Global Parkinson's Genetics Program (GP2).

To install the most current version with pip:

pip install the-real-genotools

Alternatively, to install from GitHub:

git clone https://github.com/dvitale199/GenoTools.git
cd GenoTools
pip install .

You can pull the most current references by running:

genotools-download

By default, the reference panel is downloaded to ~/.genotools/ref, but it can be downloaded to a location of your choice with --destination.
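
Before running the pipeline, it can be useful to confirm that the references are where you expect them. The following is a minimal Python sketch and is not part of GenoTools: references_present is a hypothetical helper that only checks whether a directory exists and contains at least one entry. It assumes the default location stated above, ~/.genotools/ref, and makes no assumption about the file names inside it.

```python
from pathlib import Path

# Default download location stated in this README.
DEFAULT_REF_DIR = Path.home() / ".genotools" / "ref"


def references_present(ref_dir=DEFAULT_REF_DIR):
    """Return True if ref_dir is a directory containing at least one entry.

    Hypothetical helper for illustration; it does not validate the contents.
    """
    ref_dir = Path(ref_dir)
    return ref_dir.is_dir() and any(ref_dir.iterdir())


if __name__ == "__main__":
    if references_present():
        print(f"References found in {DEFAULT_REF_DIR}")
    else:
        print("No references found; consider running genotools-download")
```

If you downloaded with --destination, pass that directory to the helper instead of relying on the default.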

To download specific references/models, you can run the download with the following options:

genotools-download --ref 1kg_30x_hgdp_ashk_ref_panel --model nba_v1 --destination /path/to/download_directory/

Currently, 1kg_30x_hgdp_ashk_ref_panel is the only available reference panel. Available models are nba_v1 for the NeuroBooster array and neurochip_v1 for the NeuroChip array; both are in GRCh38. If you are using a different array, we suggest training a new model by running the standard command below. Please ensure that the reference panel and your genotypes are in the same build. If you're using our reference panel, your genotypes must be in GRCh38.

Modify the paths in the following command to run the standard GP2 pipeline:

genotools \
  --pfile /path/to/genotypes/for/qc \
  --out /path/to/qc/output \
  --ancestry \
  --ref_panel /path/to/reference/panel \
  --ref_labels /path/to/reference/ancestry/labels \
  --all_sample \
  --all_variant

This will find common SNPs between your genotype data and the reference panel, run PCA, UMAP-transform the PCs, and train a new XGBoost classifier specific to your data and reference panel.
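
To make the first of these steps concrete, here is a minimal sketch of finding variants shared between two datasets. It is an illustration only and not GenoTools' implementation: it assumes PLINK 1 .bim-style text lines (chromosome, variant ID, genetic distance, position, allele 1, allele 2) and matches variants by chromosome, position, and unordered allele pair. The names variant_keys and common_variants are hypothetical, and the example records are made up.

```python
def variant_keys(bim_lines):
    """Map (chrom, pos, unordered allele pair) -> variant ID for .bim-style lines."""
    keys = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, var_id, _cm, pos, a1, a2 = fields[:6]
        keys[(chrom, int(pos), frozenset((a1.upper(), a2.upper())))] = var_id
    return keys


def common_variants(geno_bim_lines, ref_bim_lines):
    """Return sorted (genotype ID, reference ID) pairs for shared variants."""
    geno = variant_keys(geno_bim_lines)
    ref = variant_keys(ref_bim_lines)
    shared = geno.keys() & ref.keys()
    return sorted((geno[k], ref[k]) for k in shared)


# Made-up example records, not real data.
geno = [
    "1\trs1\t0\t1000\tA\tG",
    "1\trs2\t0\t2000\tC\tT",
    "2\trs3\t0\t3000\tG\tA",
]
ref = [
    "1\t1:1000:A:G\t0\t1000\tG\tA",  # same site, alleles listed in the other order
    "1\t1:2000:C:G\t0\t2000\tC\tG",  # same position, different alleles
    "2\t2:3000:G:A\t0\t3000\tG\tA",
]
print(common_variants(geno, ref))
```

This sketch does not handle strand flips or multiallelic sites, so treat it as a way to reason about the overlap rather than a substitute for the pipeline's own matching.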

To run the pipeline using an existing model, add the --model option:

genotools \
  --pfile /path/to/genotypes/for/qc \
  --out /path/to/qc/output \
  --ancestry \
  --ref_panel /path/to/reference/panel \
  --ref_labels /path/to/reference/ancestry/labels \
  --all_sample \
  --all_variant \
  --model /path/to/nba_v1/model

To run the pipeline using the default nba_v1 model in a Docker container, add the --container flag:

genotools \
  --pfile /path/to/genotypes/for/qc \
  --out /path/to/qc/output \
  --ancestry \
  --ref_panel /path/to/reference/panel \
  --ref_labels /path/to/reference/ancestry/labels \
  --container \
  --all_sample \
  --all_variant

Note: add the --singularity flag to run containerized ancestry predictions on HPC.
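
The three invocations above differ only in a few optional flags, so they can be assembled programmatically when scripting many runs. The following Python sketch is one way to do that, under stated assumptions: build_genotools_cmd is a hypothetical helper, it uses only flags that appear in this README, and it passes --container and --singularity independently without deciding whether they should be combined.

```python
import shlex


def build_genotools_cmd(pfile, out, ref_panel, ref_labels,
                        model=None, container=False, singularity=False):
    """Assemble a genotools argument list from the flags shown in this README."""
    cmd = [
        "genotools",
        "--pfile", pfile,
        "--out", out,
        "--ancestry",
        "--ref_panel", ref_panel,
        "--ref_labels", ref_labels,
        "--all_sample",
        "--all_variant",
    ]
    if model is not None:
        cmd += ["--model", model]  # reuse an existing model instead of training
    if container:
        cmd.append("--container")
    if singularity:
        cmd.append("--singularity")
    return cmd


cmd = build_genotools_cmd(
    pfile="/path/to/genotypes/for/qc",
    out="/path/to/qc/output",
    ref_panel="/path/to/reference/panel",
    ref_labels="/path/to/reference/ancestry/labels",
    model="/path/to/nba_v1/model",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids quoting problems with paths that contain spaces; the list can be handed to subprocess.run(cmd, check=True) once genotools is installed.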

genotools accepts --pfile, --bfile, or --vcf. Any bfile or VCF input is converted to a pfile before any steps are run.
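
When scripting over mixed inputs, the right flag can be chosen from the file name. The sketch below is a hypothetical helper (input_flag is not part of GenoTools) that relies on the usual PLINK 1, PLINK 2, and VCF file extensions. It only returns the flag; whether each flag expects a full path or a prefix without extension is not stated here, so check the argument guide before building the value.

```python
def input_flag(path):
    """Guess which genotools input flag matches a file name (hypothetical helper)."""
    name = str(path).lower()
    if name.endswith((".vcf", ".vcf.gz")):
        return "--vcf"
    if name.endswith((".bed", ".bim", ".fam")):
        return "--bfile"  # PLINK 1 binary fileset
    if name.endswith((".pgen", ".pvar", ".psam")):
        return "--pfile"  # PLINK 2 fileset
    raise ValueError(f"Unrecognised genotype file extension: {path}")


for example in ("cohort.vcf.gz", "cohort.bed", "cohort.pgen"):
    print(example, "->", input_flag(example))
```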

Please consult the documentation links listed at the top of the README for the full argument guide, function guide, default pipeline overview, and guide to navigating the output JSON.

Acknowledgements

GenoTools was developed as the core genotype and WGS processing pipeline for the Global Parkinson's Genetics Program (GP2) at the Center for Alzheimer's and Related Dementias (CARD) at the National Institutes of Health.

This tool relies on PLINK, a whole genome association analysis toolset, for various genetic data processing functionalities. We gratefully acknowledge the developers of PLINK for their foundational contributions to the field of genetics. More about PLINK can be found at their website.
