GPU-based QTL mapper
tensorQTL
tensorQTL is a GPU-based QTL mapper, enabling ~200-300 fold faster cis- and trans-QTL mapping compared to CPU-based implementations.
If you use tensorQTL in your research, please cite the following paper: Taylor-Weiner, Aguet, et al., Genome Biol. 20:228, 2019.
Empirical beta-approximated p-values are computed as described in FastQTL (Ongen et al., 2016).
Install
You can install tensorQTL using pip:
pip3 install tensorqtl
or directly from this repository:
$ git clone git@github.com:broadinstitute/tensorqtl.git
$ cd tensorqtl
# set up virtual environment and install
$ virtualenv venv
$ source venv/bin/activate
(venv)$ pip install -r install/requirements.txt .
Requirements
tensorQTL requires an environment configured with a GPU. Instructions for setting up a virtual machine on Google Cloud Platform are provided here.
Input formats
tensorQTL requires three input files: genotypes, phenotypes, and covariates. Phenotypes must be provided in BED format (phenotypes x samples), and covariates as a text file (covariates x samples). Both are in the format used by FastQTL. Genotypes must currently be in PLINK format, and can be converted as follows:
plink2 --make-bed \
--output-chr chrM \
--vcf ${plink_prefix_path}.vcf.gz \
--out ${plink_prefix_path}
Examples
For examples illustrating cis- and trans-QTL mapping, please see tensorqtl_examples.ipynb.
Running tensorQTL from the command line
This section describes how to run tensorQTL from the command line. For a full list of options, run:
python3 -m tensorqtl --help
cis-QTL mapping
Phenotype-level summary statistics with empirical p-values:
python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
--covariates ${covariates_file} \
--mode cis
All variant-phenotype associations:
python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
--covariates ${covariates_file} \
--mode cis_nominal
This will generate a parquet file for each chromosome. These files can be read using pandas:
import pandas as pd
df = pd.read_parquet(file_name)
Conditionally independent cis-QTL (as described in GTEx Consortium, 2017):
python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
--covariates ${covariates_file} \
--cis_output ${cis_output_file} \
--mode cis_independent
trans-QTL mapping
python3 -m tensorqtl ${plink_prefix_path} ${expression_bed} ${prefix} \
--covariates ${covariates_file} \
--mode trans
For trans-QTL mapping, tensorQTL generates sparse output by default (associations with p-value < 1e-5). cis-associations are filtered out. The output is in parquet format, with four columns: phenotype_id, variant_id, pval, maf.
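As a rough illustration of working with this output, the sketch below builds a small in-memory table with the four columns listed above and selects the most significant variant for each phenotype. All IDs and values are made up, and the helper name top_association_per_phenotype is hypothetical; in practice the table would be loaded from the trans-QTL parquet file with pd.read_parquet.

```python
import pandas as pd

# Toy stand-in for the sparse trans-QTL output (made-up IDs and values).
trans_df = pd.DataFrame({
    "phenotype_id": ["geneA", "geneA", "geneB"],
    "variant_id": ["var1", "var2", "var3"],
    "pval": [2e-6, 5e-9, 8e-7],
    "maf": [0.12, 0.31, 0.07],
})


def top_association_per_phenotype(df):
    """Return the row with the smallest p-value for each phenotype."""
    # idxmin gives, per phenotype, the index label of the minimum p-value.
    best_rows = df.groupby("phenotype_id")["pval"].idxmin()
    return df.loc[best_rows].reset_index(drop=True)


top_df = top_association_per_phenotype(trans_df)
print(top_df)
```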
Running tensorQTL as a Python module
tensorQTL can also be run as a module, which makes it more efficient to run multiple analyses:
import pandas as pd
import tensorqtl
from tensorqtl import genotypeio, cis, trans
Loading input files
Load phenotypes and covariates:
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(phenotype_bed_file)
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T # samples x covariates
Phenotypes must be provided in BED format (for compatibility with FastQTL), with a single header line and the first four columns containing: chr, start, end, phenotype_id. end is assumed to correspond to the TSS (or center of the cis-window). The remaining columns correspond to samples.
covariates_file is assumed to be tab-delimited and in the format covariates x samples.
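To make the two layouts concrete, here is a minimal sketch that parses tiny made-up examples of both files with pandas and checks that they refer to the same samples. It does not use tensorQTL's own reader; the file contents, the "ID" header of the covariates table, and the helper samples_match are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up file contents; BED header names follow the description above.
phenotype_bed_text = (
    "chr\tstart\tend\tphenotype_id\tsample1\tsample2\tsample3\n"
    "chr1\t999\t1000\tgeneA\t0.1\t-0.4\t1.2\n"
    "chr1\t4999\t5000\tgeneB\t0.7\t0.3\t-0.9\n"
)
covariates_text = (
    "ID\tsample1\tsample2\tsample3\n"
    "PC1\t0.01\t-0.02\t0.03\n"
    "sex\t1\t2\t1\n"
)

bed_df = pd.read_csv(io.StringIO(phenotype_bed_text), sep="\t")
# The first four columns hold coordinates and the ID; the rest are samples.
sample_ids = list(bed_df.columns[4:])

# Stored as covariates x samples, transposed here to samples x covariates.
covariates_df = pd.read_csv(
    io.StringIO(covariates_text), sep="\t", index_col=0
).T


def samples_match(sample_ids, covariates_df):
    """Check that both inputs refer to the same set of sample IDs."""
    return set(covariates_df.index) == set(sample_ids)


print(samples_match(sample_ids, covariates_df))
```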
Genotypes can be loaded as follows, where plink_prefix_path is the path prefix of the genotype files in PLINK format:
pr = genotypeio.PlinkReader(plink_prefix_path)
# load genotypes and variants into data frames
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos']]
To save memory when using genotypes for a subset of samples, you can specify the samples as follows (this is not strictly necessary, since tensorQTL will select the relevant samples from genotype_df otherwise):
pr = genotypeio.PlinkReader(plink_prefix_path, select_samples=phenotype_df.columns)
cis-QTL mapping: permutations
cis_df = cis.map_cis(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df)
tensorqtl.calculate_qvalues(cis_df, qvalue_lambda=0.85)
cis-QTL mapping: summary statistics for all variant-phenotype pairs
cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df,
                covariates_df, prefix, output_dir='.')
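The per-chromosome results can then be collected into a single table. The sketch below assumes the files follow the ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet naming pattern given in this README; the helper names find_pair_files and load_all_pairs are hypothetical and not part of tensorQTL.

```python
from pathlib import Path

import pandas as pd


def find_pair_files(output_dir, prefix):
    """List per-chromosome parquet files matching the naming pattern.

    Sorting is lexicographic, so e.g. chr10 is listed before chr2.
    """
    pattern = f"{prefix}.cis_qtl_pairs.*.parquet"
    return sorted(Path(output_dir).glob(pattern))


def load_all_pairs(output_dir, prefix):
    """Concatenate all per-chromosome tables into one DataFrame."""
    paths = find_pair_files(output_dir, prefix)
    if not paths:
        raise FileNotFoundError(
            f"no cis_qtl_pairs files for prefix {prefix!r} in {output_dir}"
        )
    return pd.concat([pd.read_parquet(p) for p in paths], ignore_index=True)
```

For example, load_all_pairs('.', prefix) would return all variant-phenotype pairs written to the current directory, provided a parquet engine such as pyarrow is installed.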
cis-QTL mapping: conditionally independent QTLs
This requires the output from the permutations step (map_cis) above.
indep_df = cis.map_independent(genotype_df, variant_df, cis_df,
                               phenotype_df, phenotype_pos_df, covariates_df)
cis-QTL mapping: interactions
Instead of fitting the standard linear model (p ~ g), this mode fits a model with an interaction term (p ~ g + i + gi) and returns full summary statistics for this model. The interaction term is a pd.Series mapping sample ID to interaction value. With the run_eigenmt=True option, eigenMT-adjusted p-values are computed.
cis.map_nominal(genotype_df, variant_df, phenotype_df, phenotype_pos_df, covariates_df, prefix,
                interaction_s=interaction_s, maf_threshold_interaction=0.05,
                group_s=None, run_eigenmt=True, output_dir='.')
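To illustrate what the interaction model p ~ g + i + gi estimates, the following sketch fits it by ordinary least squares with NumPy for a single variant-phenotype pair. This is not tensorQTL's implementation: it ignores covariates, runs on the CPU, and uses noise-free made-up data so that the known coefficients are recovered. The function name fit_interaction_model is hypothetical.

```python
import numpy as np


def fit_interaction_model(p, g, i):
    """Least-squares fit of p ~ 1 + g + i + g*i.

    Returns the coefficients in the order (intercept, b_g, b_i, b_gi).
    """
    p, g, i = (np.asarray(x, dtype=float) for x in (p, g, i))
    # Design matrix with columns: intercept, genotype, interaction, product.
    design = np.column_stack([np.ones_like(g), g, i, g * i])
    coef, *_ = np.linalg.lstsq(design, p, rcond=None)
    return coef


# Noise-free toy data generated from known coefficients.
g = np.array([0, 1, 2, 0, 1, 2, 0, 1], dtype=float)      # genotype dosages
i = np.array([0.1, 0.5, 0.9, 1.3, 0.2, 0.7, 1.1, 0.4])   # interaction values
p = 1.0 + 0.5 * g + 0.2 * i + 0.3 * g * i

intercept, b_g, b_i, b_gi = fit_interaction_model(p, g, i)
print(b_g, b_i, b_gi)
```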
Full summary statistics are saved as parquet files for each chromosome, in ${output_dir}/${prefix}.cis_qtl_pairs.${chr}.parquet, and the top association for each phenotype is saved to ${output_dir}/${prefix}.cis_qtl_top_assoc.txt.gz. In these files, the columns b_g, b_g_se, pval_g are the effect size, standard error, and p-value of g in the model, with matching columns for i and gi. In the *.cis_qtl_top_assoc.txt.gz file, tests_emt is the effective number of independent variants in the cis-window estimated with eigenMT, i.e., based on the eigenvalue decomposition of the regularized genotype correlation matrix (Davis et al., AJHG, 2016). pval_emt is computed as pval_gi * tests_emt, and pval_adj_bh contains the Benjamini-Hochberg adjusted p-values corresponding to pval_emt.
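The relationship between pval_gi, tests_emt, pval_emt, and pval_adj_bh can be sketched in plain Python as follows. The input values are made up, the function names are hypothetical, and capping pval_emt at 1 is an assumption rather than something stated above; the Benjamini-Hochberg step is the standard procedure, not code taken from tensorQTL.

```python
def eigenmt_adjust(pval_gi, tests_emt):
    """pval_emt = pval_gi * tests_emt, capped at 1 (the cap is an assumption)."""
    return min(pval_gi * tests_emt, 1.0)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda k: pvals[k])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up top interaction p-values and effective test counts per phenotype.
pval_gi = [0.0005, 0.002, 0.01, 0.2]
tests_emt = [10, 5, 3, 8]

pval_emt = [eigenmt_adjust(p, m) for p, m in zip(pval_gi, tests_emt)]
pval_adj_bh = benjamini_hochberg(pval_emt)
print(pval_emt)
print(pval_adj_bh)
```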
trans-QTL mapping
trans_df = trans.map_trans(genotype_df, phenotype_df, covariates_df, return_sparse=True)