Command line tool for finding trans-eQTLs using reverse regression
Tejaas: Discover trans-eQTLs
Description
Tejaas is a command line tool to find trans-eQTLs from eQTL data. It is released under the GNU General Public License version 3.
Tejaas is based on the hypothesis that a trans-eQTL should regulate the expression levels of multiple genes. In brief, it implements two statistical methods to find trans-eQTLs:
- RR-score (Reverse Regression): It performs a multiple linear regression with L2-regularization using expression levels of all genes to explain the genotype of a candidate SNP. In contrast to conventional methods, the direction of the regression is reversed, with the gene expressions as explanatory variables. RR-score is a statistic which estimates whether more genes are required to explain the allele counts of a SNP than expected by chance.
- JPA-score (Joint P-value Analysis): It evaluates the distribution of p-values of the pairwise linear association of a candidate SNP with all available gene expression levels. Any null SNP (no trans-effect) will have a uniform distribution of p-values, while a trans-eQTL will be associated with more genes than expected by chance, leading to overdispersion near zero. The JPA-score is a statistic which estimates whether the distribution of p-values is significantly overdispersed near zero.
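The reversed direction of the regression can be illustrated with a minimal NumPy sketch. It is not the Tejaas implementation and does not compute the RR-score itself; it only shows a ridge (L2-regularized) fit in which the genotype of one SNP is the response and all gene expression levels are the predictors. The function name `reverse_ridge`, the simulated data, and the use of the genotype variance as the noise variance are assumptions made for illustration.

```python
import numpy as np

def reverse_ridge(x, Y, sigma_beta=0.1):
    """Ridge estimate of beta in x ~ Y @ beta (hypothetical helper).

    x: genotype (allele counts) of one SNP, shape (N,)
    Y: gene expression, shape (N samples, G genes)
    sigma_beta: standard deviation of a zero-mean normal prior on each beta
    """
    x = x - x.mean()                                  # center the genotype
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)          # standardize each gene
    # Assumption: use the genotype variance as the noise variance, so the
    # ridge penalty is lambda = sigma_x^2 / sigma_beta^2.
    lam = x.var() / sigma_beta ** 2
    A = Y.T @ Y + lam * np.eye(Y.shape[1])
    return np.linalg.solve(A, Y.T @ x)

# Simulated null data: the genotype is independent of expression.
rng = np.random.default_rng(0)
N, G = 200, 50
Y = rng.normal(size=(N, G))
x = rng.binomial(2, 0.3, size=N).astype(float)

beta = reverse_ridge(x, Y, sigma_beta=0.1)
print(beta.shape)
```

A smaller `sigma_beta` corresponds to a stronger penalty and shrinks the coefficients further towards zero.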
Additionally, it implements a non-linear, unsupervised confounder correction based on k-nearest neighbors, called KNN correction.
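As a rough illustration of what a nearest-neighbor confounder correction can look like, the sketch below finds each sample's nearest neighbors in a PCA-reduced expression space and subtracts the neighbors' mean expression from that sample. This is one plausible reading of the idea, not the Tejaas code: the function name `knn_correct`, the number of principal components, and the simulated batch effect are all assumptions.

```python
import numpy as np

def knn_correct(expr, k=20, n_pcs=10):
    """Subtract from each sample the mean expression of its k nearest neighbors.

    expr: array of shape (N samples, G genes). Hypothetical helper.
    """
    centered = expr - expr.mean(axis=0)
    # PCA via SVD; keep the leading principal component scores.
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, S.size)
    pcs = U[:, :n_pcs] * S[:n_pcs]
    # Pairwise squared Euclidean distances between samples in PC space.
    sq = (pcs ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (pcs @ pcs.T)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # expr[neighbors] has shape (N, k, G); average over the k neighbors.
    return expr - expr[neighbors].mean(axis=1)

# Simulated data with a strong batch effect on the first 50 samples.
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 30))
expr[:50] += 5.0

corrected = knn_correct(expr, k=10, n_pcs=5)
raw_gap = abs(expr[:50].mean() - expr[50:].mean())
corrected_gap = abs(corrected[:50].mean() - corrected[50:].mean())
print(raw_gap > corrected_gap)
```

Because samples from the same batch are close to each other in PC space, subtracting the neighbor mean removes most of the shared offset without needing to know the batch labels.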
Dependencies
- Python version 3.4 or higher
- Intel MKL library
- C compiler
- Python libraries:
  - NumPy / array operations
  - SciPy / optimization and other special functions
  - statsmodels / used for ECDF calculation in JPA-score
  - Pygtrie / used for reading MAF file in RR-score / maf null
  - mpi4py / linked to MPI and MKL for python parallelization
  - scikit-learn / used for PCA decomposition in KNN correction
Optional:
- any flavor of MPI linked to the Intel MKL library (e.g. OpenMPI)
You can find examples of getting started here:
Installation
- Clone this repository.
- Compile the C libraries provided in the `lib` subdirectory. Some example makefiles are provided within the `lib` subdirectory.
cd lib
make all -f Makefile
- Run Tejaas!
bin/tejaas [OPTIONS]
See below for valid options or try bin/tejaas --help.
Run an example to check installation
An example script test/run_test.sh is provided to check the installation.
Open the script in your favorite editor and modify the variables NCORE and DATA_DIR.
The script will download some example input files in the DATA_DIR directory and run Tejaas on NCORE cores.
The output will be created in DATA_DIR.
Check that the output matches the results provided in the test/gold subdirectory.
cd test
./run_test.sh
Input Files
- Gene expression file
- Genotype file
  - VCF
  - Oxford
  - Dosage
- GENCODE file
- Population minor allele frequency
Tejaas [OPTIONS]
| Option | Argument | Description | Priority | Default value |
|---|---|---|---|---|
| `--vcf` | FILEPATH | Input genotype file in vcf.gz format | Required (vcf or oxf) | -- |
| `--oxf` | FILEPATH | Input genotype file in Oxford format | Required (vcf or oxf) | -- |
| `--dosage` | | Flag for reading dosage files. The file is specified with the `--oxf` option, e.g. `--oxf FILEPATH --dosage` | Optional | False |
| `--fam` | FILEPATH | Input fam file for sample names of Oxford genotype | Optional | -- |
| `--chrom` | INT | Chromosome number of the genotype file | Required | -- |
| `--include-SNPs` | START:END | Colon-separated range of SNP indices to be included | Optional | -- |
| `--gx` | FILEPATH | Input gene expression file for trans-eQTL discovery | Required | -- |
| `--gxcorr` | FILEPATH | Input gene expression file for target gene discovery | Optional | `--gx` file |
| `--gxfmt` | OPTION | Input gene expression file format (see format details below). Supported options: `gtex`, `cardiogenics`, `geuvadis` | Optional | `gtex` |
| `--gtf` | FILEPATH | Input GTF file from GENCODE to read gene Ensembl IDs. Used for selecting biotypes and getting genomic locations. | Required | -- |
| `--trim` | | Flag to trim version number from GENCODE Ensembl IDs | Optional | False |
| `--biotype` | OPTION | Which biotypes to select from the GTF file. Supported options: `protein_coding`, `lncRNA` | Optional | `protein_coding lncRNA` |
| `--outprefix` | STRING | Full path to output file names. The extensions are generated by Tejaas. | Optional | `out` |
| `--method` | OPTION | Name of method to run. Supported options: `jpa` or `rr` | Optional | `rr` |
| `--null` | OPTION | Null model to use for RR-score. Supported options: `perm` or `maf` | Optional | `perm` |
| `--cismask` | | Flag to mask cis genes within a window for each candidate SNP. Gene positions are obtained from the GENCODE annotation file. | Optional | False |
| `--window` | FLOAT | Window (number of base pairs) used for masking cis genes | Optional | 1e6 |
| `--prior-sigma` | FLOAT | Standard deviation of the normal prior for reverse multiple linear regression | Optional | 0.1 |
| `--knn` | INT | Number of neighbours for KNN (use 0 if you do not want to use KNN) | Optional | 0 |
| `--psnpthres` | FLOAT | Target genes will be reported only for trans-eQTLs below this threshold p-value for RR/JPA-score | Optional | 0.0001 |
| `--pgenethres` | FLOAT | Target genes will be reported only if the p-value of their association with the trans-eQTL is below this threshold | Optional | 0.05 |
| `--jpanull` | FILEPATH | File containing list of null model JPA-scores | Optional | -- |
| `--maf-file` | FILEPATH | Read minor allele frequency (MAF) of SNPs from this file, e.g. to read population MAF for maf null (see documentation for file format) | Optional | -- |
| `--shuffle` | | Flag to randomly shuffle the genotypes to obtain a null distribution | Optional | False |
| `--shuffle-with` | FILEPATH | Shuffle the genotypes in the same order of donor IDs specified in FILEPATH | Optional | -- |
| `--test` | | Flag to do test run | Optional | -- |
Usage Examples
- For quick start or installation check, run Tejaas with all default options:
bin/tejaas --vcf ${VCFFILE} --chrom ${CHRM} --gx ${GXFILE} --gtf ${GTFFILE} --cismask --outprefix ${OUTPREFIX}
This computes RR-scores at γ = 0.1 (the default `--prior-sigma`) and masks all genes within 1 Mb of each SNP (the default `--window`). The p-values are computed from the permuted null model. The default format for the gene expression is the same as the GTEx format, and the default GTF file is the GENCODE v26 release. For target gene discovery, Tejaas uses the same expression file as for trans-eQTL discovery.
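The cis-masking step described above can be pictured with a short sketch: for a given SNP, every gene on the same chromosome that overlaps the window around the SNP position is excluded. The function name `cis_masked_genes`, the tuple layout, and the gene names are hypothetical, and measuring the window against the gene start and end coordinates is an assumption rather than a description of how Tejaas defines it.

```python
def cis_masked_genes(snp_chrom, snp_pos, genes, window=1e6):
    """Return IDs of genes within `window` bp of the SNP on the same chromosome.

    genes: iterable of (gene_id, chrom, start, end) tuples (hypothetical layout).
    """
    masked = []
    for gene_id, chrom, start, end in genes:
        if chrom != snp_chrom:
            continue
        # Assumption: a gene is cis if any part of it overlaps
        # the interval [snp_pos - window, snp_pos + window].
        if start <= snp_pos + window and end >= snp_pos - window:
            masked.append(gene_id)
    return masked

# Made-up gene annotations for illustration only.
genes = [
    ("geneA", 1, 500000, 520000),
    ("geneB", 1, 2900000, 2950000),
    ("geneC", 2, 900000, 950000),
]

print(cis_masked_genes(1, 1000000, genes))              # ['geneA']
print(cis_masked_genes(1, 1000000, genes, window=2e6))  # ['geneA', 'geneB']
```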
- Example of running Tejaas RR-score.
We recommend using the `perm` null model for calculating p-values from the RR-score and a separate confounder-corrected gene expression file for target gene discovery. In this example, the RR-score is calculated for the first 1000 SNPs, excluding the first 20 (`--include-SNPs 21:1000`). KNN correction is performed with 20 nearest neighbors (`--knn 20`). All cis genes within 2 Mb on either side of each SNP are masked during analysis (`--cismask --window 2e6`). The RR-score calculation uses a normal prior with a standard deviation of 0.05 (`--prior-sigma 0.05`). The output reports target genes only for SNPs with p-value < 1e-6 (`--psnpthres 0.000001`). Here, `GXFILE` is the raw gene expression file, `GXCORRFILE` is the confounder-corrected gene expression file, `VCFFILE` is the genotype file in `.vcf.gz` format, and `GTFFILE` is the GENCODE annotation file.
mpirun -n 8 bin/tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 21:1000 --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
--gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
--cismask --window 2e6 --psnpthres 0.000001 \
--knn 20 --method rr --null perm --prior-sigma 0.05
- Example of running JPA-score with no KNN correction.
Empirical p-values are calculated from the null scores loaded from `NULLFILE`, specified by the `--jpanull` option. If `NULLFILE` does not exist, Tejaas creates 100000 null scores and writes them to `NULLFILE` before calculating JPA-scores. If the `--jpanull` option is not used, p-values for the JPA-scores are calculated from an analytical construction of the null model.
mpirun -n 8 bin/tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs 1:100 \
--gx ${GXFILE} --gxfmt gtex --gtf ${GTFFILE} --outprefix ${OUTPREFIX} \
--knn 0 --method jpa --jpanull ${NULLFILE}
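For readers unfamiliar with empirical p-values, the sketch below shows one common way to turn an observed score and a list of null scores into a p-value. It assumes that larger scores are more extreme and uses the add-one convention so that the p-value is never exactly zero; Tejaas may use a different convention, and the function name `empirical_pvalue` and the toy null scores are hypothetical.

```python
def empirical_pvalue(score, null_scores):
    """Fraction of null scores at least as large as `score`, with add-one correction."""
    n_extreme = sum(1 for s in null_scores if s >= score)
    return (n_extreme + 1) / (len(null_scores) + 1)

# Toy null distribution; a real run would use many more null scores.
null_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]

print(empirical_pvalue(0.75, null_scores))  # 0.3
print(empirical_pvalue(2.0, null_scores))   # 0.1
```

With this convention the smallest attainable p-value is 1 / (number of null scores + 1), which is why a large number of null scores is needed to resolve small p-values.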
- Example of parallelizing job submission.
NMAX=20000 # number of SNPs per job
for CHRM in $( seq 1 22 ); do
    VCFFILE="file_path_here_${CHRM}.vcf.gz"
    # Placeholder: replace with a command that counts the SNPs in ${VCFFILE}
    NTOT=$( calculate_no_of_SNPs_in_this_chromosome )
    # Number of jobs, rounded up
    NJOB=$(( (NTOT + NMAX - 1) / NMAX ))
    for (( i=0; i < NJOB; i++ )); do
        STARTSNP=$(( NMAX * i + 1 ))
        ENDSNP=$(( NMAX * (i + 1) ))
        # The last job may contain fewer than NMAX SNPs
        if [ ${ENDSNP} -gt ${NTOT} ]; then
            ENDSNP=${NTOT}
        fi
        mpirun -n 8 bin/tejaas --vcf ${VCFFILE} --chrom ${CHRM} --include-SNPs ${STARTSNP}:${ENDSNP} --gx ${GXFILE} --gxcorr ${GXCORRFILE} \
            --gxfmt gtex --gtf ${GTFFILE} --trim --outprefix ${OUTPREFIX} \
            --cismask --psnpthres 0.000001 --knn 20 --method rr --null perm --prior-sigma 0.05
    done
done
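The chunking arithmetic in the loop above can also be written as a small Python generator, which may be convenient when job scripts are generated programmatically. The name `snp_chunks` is hypothetical; the sketch assumes, as the examples above suggest, that `--include-SNPs` takes 1-based, inclusive START:END indices.

```python
def snp_chunks(ntot, nmax):
    """Yield 1-based, inclusive (start, end) SNP index ranges of at most nmax SNPs."""
    for start in range(1, ntot + 1, nmax):
        yield start, min(start + nmax - 1, ntot)

# Example: 45000 SNPs split into jobs of at most 20000 SNPs each.
for start, end in snp_chunks(45000, 20000):
    print("--include-SNPs {}:{}".format(start, end))
# --include-SNPs 1:20000
# --include-SNPs 20001:40000
# --include-SNPs 40001:45000
```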
Contributors