pypgx

A Python package for pharmacogenomics research

Introduction

The main purpose of the PyPGx package is to provide a unified platform for pharmacogenomics (PGx) research.

The package is written in Python and provides both a command-line interface (CLI) and an application programming interface (API), whose documentation is available at Read the Docs.

PyPGx is compatible with both of the Genome Reference Consortium Human (GRCh) builds, GRCh37 (hg19) and GRCh38 (hg38).

There are currently 57 pharmacogenes in PyPGx:

ABCB1	CACNA1S	CFTR	CYP1A1	CYP1A2
CYP1B1	CYP2A6/CYP2A7	CYP2A13	CYP2B6/CYP2B7	CYP2C8
CYP2C9	CYP2C19	CYP2D6/CYP2D7	CYP2E1	CYP2F1
CYP2J2	CYP2R1	CYP2S1	CYP2W1	CYP3A4
CYP3A5	CYP3A7	CYP3A43	CYP4A11	CYP4A22
CYP4B1	CYP4F2	CYP17A1	CYP19A1	CYP26A1
DPYD	G6PD	GSTM1	GSTP1	GSTT1
IFNL3	NAT1	NAT2	NUDT15	POR
PTGIS	RYR1	SLC15A2	SLC22A2	SLCO1B1
SLCO1B3	SLCO2B1	SULT1A1	TBXAS1	TPMT
UGT1A1	UGT1A4	UGT2B7	UGT2B15	UGT2B17
VKORC1	XPC

Your contributions (e.g. feature ideas, pull requests) are most welcome.

Author: Seung-been “Steven” Lee
Email: sbstevenlee@gmail.com
License: MIT License

Installation

The following packages are required to run PyPGx:

Package	Anaconda	PyPI
fuc	✅	✅
scikit-learn	✅	✅
openjdk	✅	❌

There are various ways you can install PyPGx. The recommended way is via conda (Anaconda):

$ conda install -c bioconda pypgx

The above command automatically downloads and installs all the dependencies as well. Alternatively, you can use pip (PyPI) to install PyPGx and all of its dependencies except openjdk (i.e. the Java JDK must be installed separately):

$ pip install pypgx

Finally, you can clone the GitHub repository and then install PyPGx locally:

$ git clone https://github.com/sbslee/pypgx
$ cd pypgx
$ pip install .

The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the git checkout command. When you do this, please make sure your environment already has all the dependencies installed.

GRCh37 vs. GRCh38

When working with PGx data, it’s not uncommon to encounter a situation where you are handling GRCh37 data in one project but GRCh38 in another. You may be tempted to use tools like LiftOver to convert GRCh37 to GRCh38, or vice versa, but deep down you know it’s going to be a mess (and please don’t do this). The good news is, PyPGx supports both of the builds!

In many of the PyPGx actions, you can simply indicate which human genome build to use. For example, you can use assembly for the API and --assembly for the CLI. Note that GRCh37 will always be the default.

However, there is one important caveat to consider if your sequencing data is in GRCh38. That is, sequence reads must be aligned only to the main contigs (i.e. chr1, chr2, …, chrX, chrY), and not to the alternative (ALT) contigs such as chr1_KI270762v1_alt. This is because the presence of ALT contigs reduces the sensitivity of variant calling and many other analyses including SV detection. Therefore, if you have sequencing data in GRCh38, make sure it’s aligned to the main contigs only.

The only exception to the above rule is the GSTT1 gene, which is located on chr22 for GRCh37 but on chr22_KI270879v1_alt for GRCh38. This gene is known to have an extremely high rate of gene deletion polymorphism in the population and thus requires SV analysis. Therefore, if you are interested in genotyping this gene with GRCh38 data, then you must include that contig when performing read alignment. To this end, you can easily filter your reference FASTA file before read alignment so that it only contains the main contigs plus the ALT contig. If you don’t know how to do this, here’s one way using the fuc program (which should have already been installed along with PyPGx):

$ cat contigs.list
chr1
chr2
...
chrX
chrY
chr22_KI270879v1_alt
$ fuc fa-filter in.fa --contigs contigs.list > out.fa
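
If you would rather generate contigs.list than type it by hand, below is a minimal Python sketch. It assumes the main contigs are chr1 to chr22 plus chrX and chrY, as listed above. The function names (build_contig_list, write_contig_list) are made up for illustration and are not part of PyPGx or fuc.

```python
# Sketch: build the contig names for a filtered GRCh38 reference.
# Assumption: the main contigs are chr1..chr22, chrX and chrY, as in the
# contigs.list example above. These helpers are not part of PyPGx or fuc.
MAIN_CONTIGS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
GSTT1_ALT_CONTIG = "chr22_KI270879v1_alt"


def build_contig_list(include_gstt1_alt=True):
    """Return the contig names to keep in the filtered FASTA file."""
    contigs = list(MAIN_CONTIGS)
    if include_gstt1_alt:
        # Only needed when genotyping GSTT1 with GRCh38 data.
        contigs.append(GSTT1_ALT_CONTIG)
    return contigs


def write_contig_list(path, contigs):
    """Write one contig name per line."""
    with open(path, "w") as handle:
        for name in contigs:
            handle.write(name + "\n")
```

Calling write_contig_list("contigs.list", build_contig_list()) should produce a file that can be passed to the fuc command shown above.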

Archive file, semantic type, and metadata

In order to efficiently store and transfer data, PyPGx uses the ZIP archive file format (.zip) which supports lossless data compression. Each archive file created by PyPGx has a metadata file (metadata.txt) and a data file (e.g. data.tsv, data.vcf). A metadata file contains important information about the data file within the same archive, which is expressed as pairs of =-separated keys and values (e.g. Assembly=GRCh37):

Metadata	Description	Examples
Assembly	Reference genome assembly.	GRCh37, GRCh38
Control	Control gene.	VDR, chr1:10000-20000
Gene	Target gene.	CYP2D6, GSTT1
Platform	Genotyping platform.	WGS, Targeted, Chip
Program	Name of the phasing program.	Beagle, SHAPEIT
Samples	Samples used for inter-sample normalization.	NA07000,NA10854,NA11993
SemanticType	Semantic type of the archive.	CovFrame[CopyNumber], Model[CNV]
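
To make the key=value format concrete, below is a minimal sketch that reads metadata.txt from an archive using only the Python standard library. It is not how PyPGx itself loads archives. The read_metadata helper is hypothetical, and the sketch locates metadata.txt by file name because the exact directory layout inside the archive is an assumption here.

```python
import io
import zipfile


def read_metadata(archive):
    """Return the metadata of an archive as a dict (hypothetical helper).

    `archive` is a path or a file-like object. The member is found by its
    file name, so it may sit at the top level or inside a subdirectory.
    """
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if n.split("/")[-1] == "metadata.txt"]
        if not names:
            raise ValueError("no metadata.txt found in archive")
        text = zf.read(names[0]).decode("utf-8")
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first '=' only, in case a value contains '='.
        key, _, value = line.partition("=")
        metadata[key.strip()] = value.strip()
    return metadata


# Demonstration with a small in-memory archive. Its contents are made up
# from the example values in the table above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "example/metadata.txt",
        "Assembly=GRCh37\nSemanticType=CovFrame[DepthOfCoverage]\nPlatform=WGS\n",
    )
    zf.writestr("example/data.tsv", "")
buffer.seek(0)
print(read_metadata(buffer))
```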

Notably, all archive files have defined semantic types, which allows us to ensure that the data that is passed to a PyPGx command (CLI) or method (API) is meaningful for the operation that will be performed. Below is a list of currently defined semantic types:

CovFrame[CopyNumber]
- CovFrame for storing target gene’s per-base copy number which is computed from read depth with control statistics.
- Requires following metadata: Gene, Assembly, SemanticType, Platform, Control, Samples.
CovFrame[DepthOfCoverage]
- CovFrame for storing read depth for all target genes with SV.
- Requires following metadata: Assembly, SemanticType, Platform.
CovFrame[ReadDepth]
- CovFrame for storing read depth for single target gene.
- Requires following metadata: Gene, Assembly, SemanticType, Platform.
Model[CNV]
- Model for calling CNV in target gene.
- Requires following metadata: Gene, Assembly, SemanticType, Control.
SampleTable[Alleles]
- TSV file for storing target gene’s candidate star alleles for each sample.
- Requires following metadata: Platform, Gene, Assembly, SemanticType, Program.
SampleTable[CNVCalls]
- TSV file for storing target gene’s CNV call for each sample.
- Requires following metadata: Gene, Assembly, SemanticType, Control.
SampleTable[Genotypes]
- TSV file for storing target gene’s genotype call for each sample.
- Requires following metadata: Gene, Assembly, SemanticType.
SampleTable[Phenotypes]
- TSV file for storing target gene’s phenotype call for each sample.
- Requires following metadata: Gene, SemanticType.
SampleTable[Results]
- TSV file for storing various results for each sample.
- Requires following metadata: Gene, Assembly, SemanticType.
SampleTable[Statistics]
- TSV file for storing control gene’s various statistics on read depth for each sample. Used for converting target gene’s read depth to copy number.
- Requires following metadata: Control, Assembly, SemanticType, Platform.
VcfFrame[Consolidated]
- VcfFrame for storing target gene’s consolidated variant data.
- Requires following metadata: Platform, Gene, Assembly, SemanticType, Program.
VcfFrame[Imported]
- VcfFrame for storing target gene’s raw variant data.
- Requires following metadata: Platform, Gene, Assembly, SemanticType.
VcfFrame[Phased]
- VcfFrame for storing target gene’s phased variant data.
- Requires following metadata: Platform, Gene, Assembly, SemanticType, Program.
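
The idea of checking an archive against its semantic type can be sketched as follows. This is an illustration only, not PyPGx's internal implementation. The REQUIRED_METADATA table copies a subset of the requirements listed above, and the missing_metadata function is hypothetical.

```python
# Required metadata keys for a few of the semantic types listed above.
# The subset is chosen for brevity; the remaining types follow the same
# pattern.
REQUIRED_METADATA = {
    "CovFrame[CopyNumber]": {
        "Gene", "Assembly", "SemanticType", "Platform", "Control", "Samples",
    },
    "CovFrame[DepthOfCoverage]": {"Assembly", "SemanticType", "Platform"},
    "CovFrame[ReadDepth]": {"Gene", "Assembly", "SemanticType", "Platform"},
    "Model[CNV]": {"Gene", "Assembly", "SemanticType", "Control"},
    "SampleTable[Phenotypes]": {"Gene", "SemanticType"},
}


def missing_metadata(metadata):
    """Return a sorted list of required keys absent from `metadata`.

    `metadata` is a dict of key/value pairs as stored in metadata.txt.
    Raises KeyError if the semantic type is not in the table above.
    """
    semantic_type = metadata.get("SemanticType")
    if semantic_type not in REQUIRED_METADATA:
        raise KeyError(f"unknown semantic type: {semantic_type!r}")
    return sorted(REQUIRED_METADATA[semantic_type] - set(metadata))


complete = {
    "Assembly": "GRCh37",
    "SemanticType": "CovFrame[DepthOfCoverage]",
    "Platform": "WGS",
}
incomplete = {"Gene": "CYP2D6", "SemanticType": "CovFrame[ReadDepth]"}

print(missing_metadata(complete))
print(missing_metadata(incomplete))
```

An empty list means the archive carries everything its semantic type requires.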

Phenotype prediction

Many of the genes in PyPGx have a diplotype-phenotype table available from the Clinical Pharmacogenetics Implementation Consortium (CPIC). PyPGx will use this information to perform phenotype prediction. Note that there are two types of phenotype prediction:

Method 1. Diplotype-phenotype mapping: This method directly uses the diplotype-phenotype mapping as defined by CPIC. Using the CYP2B6 gene as an example, the diplotypes *6/*6, *1/*29, *1/*2, *1/*4, and *4/*4 correspond to Poor Metabolizer, Intermediate Metabolizer, Normal Metabolizer, Rapid Metabolizer, and Ultrarapid Metabolizer.
Method 2. Activity score: This method uses a standard unit of enzyme activity known as an activity score. Using the CYP2D6 gene as an example, the fully functional reference *1 allele is assigned a value of 1, decreased-function alleles such as *9 and *17 receive a value of 0.5, and nonfunctional alleles including *4 and *5 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with *1/*1, *1/*4, and *4/*5 diplotypes have an activity score of 2 (Normal Metabolizer), 1 (Intermediate Metabolizer), and 0 (Poor Metabolizer), respectively.
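
The two methods can be sketched in a few lines of Python. The sketch uses only the example alleles, values, and phenotypes given above, so it is far from a complete CPIC table. The function names are hypothetical and do not correspond to the PyPGx API.

```python
# Method 1: direct diplotype-to-phenotype lookup (CYP2B6 examples from the
# text). Keys are stored as sorted tuples so that allele order is ignored.
_CYP2B6_EXAMPLES = {
    ("*6", "*6"): "Poor Metabolizer",
    ("*1", "*29"): "Intermediate Metabolizer",
    ("*1", "*2"): "Normal Metabolizer",
    ("*1", "*4"): "Rapid Metabolizer",
    ("*4", "*4"): "Ultrarapid Metabolizer",
}
CYP2B6_DIPLOTYPE_TO_PHENOTYPE = {
    tuple(sorted(pair)): phenotype for pair, phenotype in _CYP2B6_EXAMPLES.items()
}


def lookup_phenotype(allele_a, allele_b):
    """Return the phenotype for a CYP2B6 diplotype, or None if not listed."""
    return CYP2B6_DIPLOTYPE_TO_PHENOTYPE.get(tuple(sorted((allele_a, allele_b))))


# Method 2: activity score (CYP2D6 example values from the text).
CYP2D6_ALLELE_VALUES = {"*1": 1.0, "*9": 0.5, "*17": 0.5, "*4": 0.0, "*5": 0.0}

# Only the three scores named in the text are mapped here. Other scores
# return None rather than guessing at thresholds.
SCORE_TO_PHENOTYPE = {
    2.0: "Normal Metabolizer",
    1.0: "Intermediate Metabolizer",
    0.0: "Poor Metabolizer",
}


def activity_score(allele_a, allele_b):
    """Sum the values assigned to both alleles of a CYP2D6 diplotype."""
    return CYP2D6_ALLELE_VALUES[allele_a] + CYP2D6_ALLELE_VALUES[allele_b]


def score_phenotype(allele_a, allele_b):
    """Return the phenotype for the diplotype's score, or None if unmapped."""
    return SCORE_TO_PHENOTYPE.get(activity_score(allele_a, allele_b))


print(lookup_phenotype("*29", "*1"))
print(activity_score("*1", "*4"), score_phenotype("*1", "*4"))
```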

Please visit the Genes page to see the list of genes with a CPIC diplotype-phenotype table and the prediction method used for each.

Getting help

For detailed documentation on the CLI and API, please refer to Read the Docs.

For getting help on the CLI:

$ pypgx -h

usage: pypgx [-h] [-v] COMMAND ...

positional arguments:
  COMMAND
    call-genotypes      Call genotypes for the target gene.
    call-phenotypes     Call phenotypes for the target gene.
    combine-results     Combine various results for the target gene.
    compare-genotypes   Calculate concordance rate between two genotype results.
    compute-control-statistics
                        Compute summary statistics for the control gene from BAM files.
    compute-copy-number
                        Compute copy number from read depth for the target gene.
    compute-target-depth
                        Compute read depth for the target gene from BAM files.
    create-consolidated-vcf
                        Create a consolidated VCF file.
    create-regions-bed  Create a BED file which contains all regions used by PyPGx.
    estimate-phase-beagle
                        Estimate haplotype phase of observed variants with the Beagle program.
    filter-samples      Filter Archive file for specified samples.
    import-read-depth   Import read depth data for the target gene.
    import-variants     Import variant data for the target gene.
    plot-bam-copy-number
                        Plot copy number profile from CovFrame[CopyNumber].
    plot-bam-read-depth
                        Plot read depth profile with BAM data.
    plot-cn-af          Plot both copy number profile and allele fraction profile in one figure.
    plot-vcf-allele-fraction
                        Plot allele fraction profile with VCF data.
    plot-vcf-read-depth
                        Plot read depth profile with VCF data.
    predict-alleles     Predict candidate star alleles based on observed variants.
    predict-cnv         Predict CNV for the target gene based on copy number data.
    prepare-depth-of-coverage
                        Prepare a depth of coverage file for all target genes with SV.
    print-metadata      Print the metadata of specified archive.
    run-chip-pipeline   Run PyPGx's genotyping pipeline for chip data.
    run-ngs-pipeline    Run PyPGx's genotyping pipeline for NGS data.
    test-cnv-caller     Test a CNV caller for the target gene.
    train-cnv-caller    Train a CNV caller for the target gene.

optional arguments:
  -h, --help            Show this help message and exit.
  -v, --version         Show the version number and exit.

For getting help on a specific command (e.g. call-genotypes):

$ pypgx call-genotypes -h

Below is the list of submodules available in the API:

core : The core submodule is the main suite of tools for PGx research.
genotype : The genotype submodule is primarily used to make final diplotype calls by interpreting candidate star alleles and/or detected structural variants.
pipeline : The pipeline submodule is used to provide convenient methods that combine multiple PyPGx actions and automatically handle semantic types.
plot : The plot submodule is used to plot various kinds of profiles such as read depth, copy number, and allele fraction.
utils : The utils submodule contains the main actions of PyPGx.

For getting help on a specific submodule (e.g. utils):

>>> from pypgx.api import utils
>>> help(utils)

For getting help on a specific method (e.g. predict_phenotype):

>>> import pypgx
>>> help(pypgx.predict_phenotype)

CLI examples

We can print the metadata of an archive file:

$ pypgx print-metadata grch37-depth-of-coverage.zip

The above will print:

Assembly=GRCh37
SemanticType=CovFrame[DepthOfCoverage]
Platform=WGS

We can run the NGS pipeline for the CYP2D6 gene:

$ pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline \
--variants grch37-variants.vcf.gz \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip

The above will create a number of archive files:

Saved VcfFrame[Imported] to: grch37-CYP2D6-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: grch37-CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline/results.zip

API examples

We can obtain allele function for the CYP2D6 gene:

>>> import pypgx
>>> pypgx.get_function('CYP2D6', '*1')
'Normal Function'
>>> pypgx.get_function('CYP2D6', '*4')
'No Function'
>>> pypgx.get_function('CYP2D6', '*22')
'Uncertain Function'
>>> pypgx.get_function('CYP2D6', '*140')
'Unknown Function'

We can predict phenotype for the CYP2D6 gene based on two haplotype calls:

>>> import pypgx
>>> pypgx.predict_phenotype('CYP2D6', '*4', '*5')   # Both alleles have no function
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*5', '*4')   # The order of alleles does not matter
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*22')  # *22 has uncertain function
'Indeterminate'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*1x2') # Gene duplication
'Ultrarapid Metabolizer'

Release history

0.24.0

Mar 31, 2024

0.23.0

Dec 24, 2023

0.22.0

Dec 10, 2023

0.21.0

Aug 25, 2023

0.20.0

Jan 11, 2023

0.19.0

Sep 13, 2022

0.18.0

Aug 12, 2022

0.17.0

Jul 12, 2022

0.16.0

Jun 8, 2022

0.15.0

May 3, 2022

0.14.0

Apr 2, 2022

0.13.0

Mar 1, 2022

0.12.0

Jan 29, 2022

0.11.0

Jan 1, 2022

0.10.1

Dec 20, 2021

0.10.0

Dec 19, 2021

This version

0.9.0

Dec 5, 2021

0.8.0

Nov 20, 2021

0.7.0

Oct 23, 2021

0.6.0

Oct 9, 2021

0.5.0

Oct 2, 2021

0.4.1

Sep 21, 2021

Download files

Source Distribution

pypgx-0.9.0.tar.gz (29.0 MB)

Uploaded Dec 5, 2021 Source

Built Distribution

pypgx-0.9.0-py3-none-any.whl (29.0 MB)

Uploaded Dec 5, 2021 Python 3

Hashes for pypgx-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`ead347f582f1514e63247c6696bd497549393e67cc9bc9174b0d7eb3bfa1d157`
MD5	`9591fa13d5c8a10ab87c1039531cf643`
BLAKE2b-256	`5846d398dc3684d48ad0838ec0d6cd8c7205b1711c17fb7aba287dfe552b5ef6`

Hashes for pypgx-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`357fd3750a274207c51cb382d23efda8740ca05cec91b74d1aa91129509173a9`
MD5	`4f69feee184ef7f9ddb02ff7d43d57f2`
BLAKE2b-256	`1ee0647e87043a8b1f3aabbb22ffef13853b359c27298a5023ad32710705478a`