A Python package for pharmacogenomics research
Introduction
The main purpose of the PyPGx package is to provide a unified platform for pharmacogenomics (PGx) research.
The package is written in Python and supports both a command line interface (CLI) and an application programming interface (API), whose documentation is available at Read the Docs.
Your contributions (e.g. feature ideas, pull requests) are most welcome.
Installation
The following packages are required to run PyPGx:
- fuc
- scikit-learn
There are various ways you can install PyPGx. The recommended way is via conda (Anaconda):
$ conda install -c bioconda pypgx
The above command will automatically download and install all the dependencies as well. Alternatively, you can use pip (PyPI) to install PyPGx and all of its dependencies:
$ pip install pypgx
Finally, you can clone the GitHub repository and then install PyPGx locally:
$ git clone https://github.com/sbslee/pypgx
$ cd pypgx
$ pip install .
The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the git checkout command. When you do this, please make sure your environment already has all the dependencies installed.
Archive file, semantic type, and metadata
In order to efficiently store and transfer data, PyPGx uses the ZIP archive file format (.zip) which supports lossless data compression. Each archive file created by PyPGx has a metadata file (metadata.txt) and a data file (e.g. data.tsv, data.vcf). A metadata file contains important information about the data file within the same archive, which is expressed as pairs of =-separated keys and values (e.g. Assembly=GRCh37):
Metadata | Description | Examples |
---|---|---|
Assembly | Reference genome assembly. | GRCh37, GRCh38 |
Control | Control gene. | VDR, chr1:10000-20000 |
Gene | Target gene. | CYP2D6, GSTT1 |
Platform | NGS platform. | WGS, Targeted |
Program | Name of the phasing program. | Beagle |
Samples | Samples used for inter-sample normalization. | NA07000,NA10854,NA11993 |
SemanticType | Semantic type of the archive. | CovFrame[CopyNumber], Model[CNV] |
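As an illustration of the archive layout described above, the following minimal sketch reads the =-separated pairs from metadata.txt using only the Python standard library. It does not use the PyPGx API. The function name read_metadata, the example/ folder name, and the choice to locate the file by its name anywhere in the archive are assumptions made for this example.

```python
import io
import zipfile


def read_metadata(archive):
    """Return the key=value pairs stored in an archive's metadata.txt."""
    with zipfile.ZipFile(archive) as zf:
        # Assumption: find metadata.txt by name, wherever it sits in the archive.
        name = next(n for n in zf.namelist() if n.endswith("metadata.txt"))
        text = zf.read(name).decode("utf-8")
    metadata = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # Skip blank lines.
        key, _, value = line.partition("=")
        metadata[key.strip()] = value.strip()
    return metadata


# Build a toy archive in memory so the example runs without real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(
        "example/metadata.txt",
        "Gene=CYP2D6\nAssembly=GRCh37\nSemanticType=VcfFrame[Imported]\n",
    )
    zf.writestr("example/data.vcf", "##fileformat=VCFv4.2\n")
buffer.seek(0)

metadata = read_metadata(buffer)
print(metadata)
```

For real archives, the print-metadata command listed under "Getting help" serves the same purpose.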
Notably, all archive files have defined semantic types, which allows us to ensure that the data that is passed to a PyPGx command (CLI) or method (API) is meaningful for the operation that will be performed. Below is a list of currently defined semantic types:
- CovFrame[CopyNumber]
CovFrame for storing target gene’s per-base copy number which is computed from read depth with control statistics.
Requires the following metadata: Gene, Assembly, SemanticType, Platform, Control, Samples.
- CovFrame[ReadDepth]
CovFrame for storing target gene’s per-base read depth which is computed from BAM files.
Requires the following metadata: Gene, Assembly, SemanticType, Platform.
- Model[CNV]
Model for calling CNV in target gene.
Requires the following metadata: Gene, Assembly, SemanticType, Control.
- SampleTable[Alleles]
TSV file for storing target gene’s candidate star alleles for each sample.
Requires the following metadata: Gene, Assembly, SemanticType, Program.
- SampleTable[CNVCalls]
TSV file for storing target gene’s CNV call for each sample.
Requires the following metadata: Gene, Assembly, SemanticType, Control.
- SampleTable[Genotypes]
TSV file for storing target gene’s genotype call for each sample.
Requires the following metadata: Gene, Assembly, SemanticType.
- SampleTable[Results]
TSV file for storing various results for each sample.
Requires the following metadata: Gene, Assembly, SemanticType.
- SampleTable[Statistics]
TSV file for storing control gene’s various statistics on read depth for each sample. Used for converting target gene’s read depth to copy number.
Requires the following metadata: Control, Assembly, SemanticType, Platform.
- VcfFrame[Consolidated]
VcfFrame for storing target gene’s consolidated variant data.
Requires the following metadata: Gene, Assembly, SemanticType, Program.
- VcfFrame[Imported]
VcfFrame for storing target gene’s raw variant data.
Requires the following metadata: Gene, Assembly, SemanticType.
- VcfFrame[Phased]
VcfFrame for storing target gene’s phased variant data.
Requires the following metadata: Gene, Assembly, SemanticType, Program.
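The required-metadata rules in the list above lend themselves to a simple lookup. The sketch below is not PyPGx's own validation code. It copies the requirements for three of the listed semantic types into a dictionary and reports which keys are missing from a metadata mapping. The names REQUIRED_METADATA and missing_metadata are hypothetical.

```python
# Required metadata keys for three of the semantic types listed above.
REQUIRED_METADATA = {
    "CovFrame[CopyNumber]": {
        "Gene", "Assembly", "SemanticType", "Platform", "Control", "Samples",
    },
    "CovFrame[ReadDepth]": {"Gene", "Assembly", "SemanticType", "Platform"},
    "VcfFrame[Phased]": {"Gene", "Assembly", "SemanticType", "Program"},
}


def missing_metadata(metadata):
    """Return the sorted list of required keys absent from `metadata`."""
    semantic_type = metadata.get("SemanticType")
    if semantic_type not in REQUIRED_METADATA:
        raise ValueError(f"Unknown semantic type: {semantic_type!r}")
    return sorted(REQUIRED_METADATA[semantic_type] - set(metadata))


complete = {
    "Gene": "CYP2D6",
    "Assembly": "GRCh37",
    "SemanticType": "VcfFrame[Phased]",
    "Program": "Beagle",
}
incomplete = {
    "Gene": "CYP2D6",
    "Assembly": "GRCh37",
    "SemanticType": "CovFrame[CopyNumber]",
    "Platform": "WGS",
}

print(missing_metadata(complete))    # No keys missing.
print(missing_metadata(incomplete))  # Control and Samples are missing.
```

A check of this kind is one way to reject an archive before running an operation that its semantic type does not support.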
Getting help
For detailed documentation on the CLI and API, please refer to Read the Docs.
For getting help on the CLI:
$ pypgx -h
usage: pypgx [-h] [-v] COMMAND ...
positional arguments:
COMMAND
call-genotypes Call genotypes for target gene.
combine-results Combine various results for the target gene.
compute-control-statistics
Compute various statistics for control gene with BAM data.
compute-copy-number
Compute copy number from read depth for target gene.
compute-target-depth
Compute read depth for target gene with BAM data.
create-consolidated-vcf
Create consolidated VCF.
create-read-depth-tsv
Compute read depth for target gene with BAM data.
create-regions-bed Create a BED file which contains all regions used by PyPGx.
estimate-phase-beagle
Estimate haplotype phase of observed variants with the Beagle program.
filter-samples Filter Archive file for specified samples.
import-read-depth Import read depth data for target gene.
import-variants Import variant data for target gene.
plot-bam-copy-number
Plot copy number profile with BAM data.
plot-bam-read-depth
Plot read depth profile with BAM data.
plot-vcf-allele-fraction
Plot allele fraction profile with VCF data.
plot-vcf-read-depth
Plot read depth profile with VCF data.
predict-alleles Predict candidate star alleles based on observed variants.
predict-cnv Predict CNV for target gene based on copy number data.
print-metadata Print the metadata of specified archive.
run-ngs-pipeline Run NGS pipeline for the target gene.
test-cnv-caller Test a CNV caller for the target gene.
train-cnv-caller Train a CNV caller for the target gene.
optional arguments:
-h, --help Show this help message and exit.
-v, --version Show the version number and exit.
For getting help on a specific command (e.g. call-genotypes):
$ pypgx call-genotypes -h
Below is the list of submodules available in the API:
genotype : The genotype submodule is a suite of tools for accurately predicting genotype calls.
pipeline : The pipeline submodule is used to provide convenient methods that combine multiple PyPGx actions and automatically handle semantic types.
plot : The plot submodule is used to plot various kinds of profiles such as read depth, copy number, and allele fraction.
utils : The utils submodule is the main suite of tools for PGx research.
For getting help on a specific submodule (e.g. utils):
>>> from pypgx.api import utils
>>> help(utils)
CLI examples
Run NGS pipeline for CYP2D6:
$ pypgx run-ngs-pipeline \
CYP2D6 \
CYP2D6-pipeline \
--vcf input.vcf \
--panel ref.vcf \
--tsv input.tsv \
--control-statistics control-statistics-VDR.zip
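When the same pipeline has to be run for several genes, it can be convenient to assemble the command from Python. The sketch below only builds the argument list shown above and prints it. The helper name build_ngs_pipeline_command is hypothetical, and treating every flag as optional is an assumption, so check `pypgx run-ngs-pipeline -h` for the actual requirements.

```python
import shlex


def build_ngs_pipeline_command(gene, output, vcf=None, panel=None,
                               tsv=None, control_statistics=None):
    """Build the argument list for `pypgx run-ngs-pipeline`."""
    command = ["pypgx", "run-ngs-pipeline", gene, output]
    options = {
        "--vcf": vcf,
        "--panel": panel,
        "--tsv": tsv,
        "--control-statistics": control_statistics,
    }
    # Assumption: a flag is simply left out when no value is given.
    for flag, value in options.items():
        if value is not None:
            command += [flag, value]
    return command


command = build_ngs_pipeline_command(
    "CYP2D6",
    "CYP2D6-pipeline",
    vcf="input.vcf",
    panel="ref.vcf",
    tsv="input.tsv",
    control_statistics="control-statistics-VDR.zip",
)
print(shlex.join(command))
```

With PyPGx installed, the list can be passed to `subprocess.run(command, check=True)` to execute it.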
API examples
Predict phenotype based on two haplotype calls:
>>> import pypgx
>>> pypgx.predict_phenotype('CYP2D6', '*4', '*5') # Both alleles have no function
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*5', '*4') # The order of alleles does not matter
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*22') # *22 has uncertain function
'Indeterminate'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*1x2') # Gene duplication
'Ultrarapid Metabolizer'
>>> pypgx.predict_phenotype('CYP2B6', '*1', '*4') # *4 has increased function
'Rapid Metabolizer'