This is our little project for GWAS
SYNOPSIS
PACtool is a program that performs GWAS analysis on control/case data of SNP variants and was implemented using Python 3.
PACtool Features:
- Allele frequency of each variant for all selected SNPs
- HWE statistic and respective p-value calculation for the unified controls and cases dataset
- Linkage Disequilibrium evaluation between two selected SNPs (D' and r-squared are calculated)
- Association Test on the provided dataset, with optional generation of manhattan plot and qq-plot
- Information retrieval for selected SNPs from the Ensembl Variant Effect Predictor (VEP) database.
PACtool Limitations:
- The current version of PACtool does not support analysis of variants located on different chromosomes.
Please make sure that all SNPs in your dataset are located on the same chromosome.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
INPUT FILES FORMAT
The input files must follow the format used by the HAPGEN2 program, which can be applied to various genomic datasets.
The input file consists of rows and columns, where each row represents a SNP, and the columns contain information about the following:
Column_1: snp_id
- e.g. snp_0
- a unique identifier for each SNP
Column_2: rs_id
- e.g. rs6054257
- or alternatively the genomic coordinates, with a prefix indicating the chromosome e.g. 20-9150
Column_3: snp_coordinates based on NCBI build 36
- e.g. 9150
Column_4: reference allele
- denoted with the nucleotide base
- e.g. C
Column_5: alternative allele
- denoted with the nucleotide base
- e.g. G
Column_6 to Last_Column:
- These columns contain the genotype for each sample.
Each sample genotype is represented by 3 columns consisting of '0' and '1' digits, with the position of the '1' indicating the genotype.
More specifically,
- 1 0 0 ---> ref-ref , homozygous for the reference allele
- 0 1 0 ---> ref-alt , heterozygous
- 0 0 1 ---> alt-alt , homozygous for the alternative allele
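As an illustration of the layout described above, the following minimal sketch splits one input line into its five leading fields and decodes each genotype triplet. The function name parse_gen_line and the returned dictionary keys are hypothetical and are not part of PACtool; the sketch also assumes that a triplet matching none of the three patterns should be treated as missing.

```python
def parse_gen_line(line):
    """Parse one space-separated .gen line into SNP fields and genotype calls.

    Genotypes are returned as 0 (ref-ref), 1 (ref-alt) or 2 (alt-alt).
    A triplet that matches none of the three patterns becomes None
    (an assumption of this sketch, not documented PACtool behaviour).
    """
    fields = line.split()
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 leading columns plus 3 columns per sample")
    snp_id, rs_id, position, ref, alt = fields[:5]
    calls = {("1", "0", "0"): 0, ("0", "1", "0"): 1, ("0", "0", "1"): 2}
    genotypes = [
        calls.get(tuple(fields[i:i + 3]))
        for i in range(5, len(fields), 3)
    ]
    return {
        "snp_id": snp_id,
        "rs_id": rs_id,
        "position": int(position),
        "ref": ref,
        "alt": alt,
        "genotypes": genotypes,
    }


# One SNP with three samples: ref-ref, ref-alt, alt-alt
example = "snp_0 rs6054257 9150 C G 1 0 0 0 1 0 0 0 1"
print(parse_gen_line(example)["genotypes"])  # prints [0, 1, 2]
```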
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
OUTPUT FILES FORMAT
The output files follow the same space-separated file format as the input files described above.
The first column for each SNP in all output files is the snp_id, while the rest of the columns hold the values from the respective statistical tests and analyses.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
INSTALLATION-UTILIZATION
PACtool can be downloaded from the following PyPI page:
https://pypi.python.org/pypi/pactool
After unpacking the downloaded .tgz archive, you can display usage information by running the following from within the pactool directory:
python3 pactool.py -h
You can execute the file pactool.py using Python 3, and include all preferred arguments.
Three of the arguments are required and must be provided every time, or else you will receive an error message.
Please be sure to include the following arguments:
-controls_file Indicates the input file containing control samples
-cases_file Indicates the input file containing case samples
-output Specifies the prefix of each output file
The following optional arguments can also be included, each followed by a file of snp_codes (one per line), to perform the corresponding action:
-keep_snps Keeps for analysis only the SNPs specified in the provided file.
-remove_snps Removes from the subsequent analysis the SNPs specified in the provided file.
The same actions can be applied to samples as well (each line of the given file should name one sample, e.g. control_5 or case_10):
-keep_samples Keeps for analysis only the control/case samples specified in the provided file.
-remove_samples Removes from further analysis the control/case samples specified in the provided file.
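The keep/remove behaviour for SNPs can be pictured with the short sketch below. It is not PACtool's own code: the helper names read_codes and filter_snp_rows are hypothetical, the example rows are made up, and only SNP rows are handled (filtering samples would instead select genotype column triplets).

```python
def read_codes(lines):
    """Collect one code per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def filter_snp_rows(rows, codes, keep=True):
    """Keep (keep=True) or drop (keep=False) the rows whose snp_id is in codes."""
    return [
        row for row in rows
        if row.strip() and (row.split()[0] in codes) == keep
    ]


# Illustrative rows in the input format; the values are made up.
rows = [
    "snp_0 rs6054257 9150 C G 1 0 0",
    "snp_1 20-9200 9200 A T 0 1 0",
]
codes = read_codes(["snp_1\n", "\n"])

kept = filter_snp_rows(rows, codes, keep=True)      # analogous to -keep_snps
removed = filter_snp_rows(rows, codes, keep=False)  # analogous to -remove_snps
print(kept)
print(removed)
```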
After performing the keep/remove actions, the following analysis options can be selected:
-allele_frequency Calculates the frequencies of the reference and alternative variants in control samples, case samples as well as their total frequencies.
Outputs the file 'output'.frequency with 7 columns: snp_code ref_freq_control alt_freq_control ref_freq_cases alt_freq_cases ref_freq_total alt_freq_total
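For reference, allele frequencies can be derived from genotype counts as in the following sketch. The function name allele_frequencies and the example counts are hypothetical; the sketch assumes diploid genotypes, so each sample contributes two alleles.

```python
def allele_frequencies(n_rr, n_ra, n_aa):
    """Return (ref_freq, alt_freq) from ref-ref, ref-alt and alt-alt counts."""
    total_alleles = 2 * (n_rr + n_ra + n_aa)
    if total_alleles == 0:
        raise ValueError("no genotyped samples")
    ref_freq = (2 * n_rr + n_ra) / total_alleles
    alt_freq = (2 * n_aa + n_ra) / total_alleles
    return ref_freq, alt_freq


# Made-up genotype counts (ref-ref, ref-alt, alt-alt)
controls = (50, 40, 10)
cases = (30, 50, 20)
# The total frequencies pool controls and cases genotype by genotype.
total = tuple(c + k for c, k in zip(controls, cases))

for label, counts in (("control", controls), ("cases", cases), ("total", total)):
    ref_freq, alt_freq = allele_frequencies(*counts)
    print(label, round(ref_freq, 4), round(alt_freq, 4))
```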
-hwe, -HWE Calculates the Hardy-Weinberg Equilibrium (HWE) test statistic and the corresponding p-value.
Outputs the file 'output'.hwe with 3 columns: snp_code hwe_statistic p-value
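One common way to obtain such a statistic is a chi-square goodness-of-fit test with 1 degree of freedom, sketched below. This is an assumption for illustration: the text above does not state which HWE test PACtool implements, and the function name hwe_chi_square is hypothetical.

```python
import math


def hwe_chi_square(n_rr, n_ra, n_aa):
    """Chi-square test of Hardy-Weinberg proportions; returns (statistic, p_value)."""
    n = n_rr + n_ra + n_aa
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_rr + n_ra) / (2 * n)  # reference allele frequency
    q = 1.0 - p                      # alternative allele frequency
    observed = (n_rr, n_ra, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Genotype classes with zero expectation (monomorphic SNPs) are skipped.
    statistic = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


print(hwe_chi_square(25, 50, 25))  # genotype counts exactly at HWE proportions
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```

Counts that match the expected proportions give a statistic of 0 and a p-value of 1, while a complete lack of heterozygotes gives a large statistic and a very small p-value.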
-ld SNP1 SNP2 Estimates whether the two given SNPs are in Linkage Disequilibrium by calculating the D' and r-squared statistics. SNP1 and SNP2 are required snp_codes.
Outputs the file 'output'.ld with 4 columns: snp1_code snp2_code D' r-squared
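The two statistics follow from standard definitions once haplotype frequencies are known, as the sketch below shows. It assumes the frequency of the haplotype carrying the reference allele at both SNPs (p_ab) is already available; genotype data are unphased, so this value would have to be estimated first, and how PACtool does that is not stated here. The function name ld_statistics is hypothetical.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D', r-squared) for two biallelic SNPs.

    p_ab: frequency of the haplotype with the reference allele at both SNPs
    p_a:  reference allele frequency at SNP1
    p_b:  reference allele frequency at SNP2
    """
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("LD is undefined when either SNP is monomorphic")
    d = p_ab - p_a * p_b
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = 0.0 if d == 0 else abs(d) / d_max  # reported as an absolute value
    r_squared = d * d / denominator
    return d_prime, r_squared


print(ld_statistics(0.5, 0.5, 0.5))   # complete LD: prints (1.0, 1.0)
print(ld_statistics(0.25, 0.5, 0.5))  # independent SNPs: prints (0.0, 0.0)
```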
-association_test Performs genotypic association test for each SNP and calculates the odds-ratios (r=reference, a=alternative, OR_control_case)
Outputs the file 'output'.association with 8 columns: snp_code locus ref alt p-value OR_rr_ra OR_rr_aa OR_ra_aa
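A genotypic association test is commonly carried out as a chi-square test on the 2x3 table of genotype counts (controls and cases by rr, ra, aa), which has 2 degrees of freedom. The sketch below follows that approach as an assumption, since the exact test used by PACtool is not described above. The function names are hypothetical, and the orientation of the odds ratio (odds of being a case for the first genotype relative to the second) is likewise an assumption.

```python
import math


def genotypic_chi_square(controls, cases):
    """Chi-square test on a 2x3 genotype table; returns (statistic, p_value).

    controls, cases: (n_rr, n_ra, n_aa) genotype counts.
    Assumes all three genotypes are observed, so the test has 2 degrees of freedom.
    """
    total = sum(controls) + sum(cases)
    if total == 0:
        raise ValueError("no genotyped samples")
    statistic = 0.0
    for row in (controls, cases):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (controls[j] + cases[j]) / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    return statistic, math.exp(-statistic / 2)


def odds_ratio(controls, cases, g1, g2):
    """Odds of being a case for genotype index g1 relative to g2 (0=rr, 1=ra, 2=aa)."""
    denominator = cases[g2] * controls[g1]
    if denominator == 0:
        return None  # undefined when a required cell is empty
    return cases[g1] * controls[g2] / denominator


# Made-up genotype counts (rr, ra, aa)
controls = (60, 30, 10)
cases = (30, 40, 30)
statistic, p_value = genotypic_chi_square(controls, cases)
print(round(statistic, 3), p_value)
print(odds_ratio(controls, cases, 0, 1))
```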
-manhattan Draws a manhattan plot for the p-values of the association test.
Can only be used if -association_test argument is given.
-qqplot Draws a qq-plot for the p-values of the association test.
Can only be used if -association_test argument is given.
-get_info SNP Retrieves information about the variant with snp_code SNP. Prints JSON-formatted output with all information obtained from the Ensembl VEP database.
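A lookup of this kind can be made against the public Ensembl REST service, whose VEP endpoint accepts a variant identifier such as an rs_id. The sketch below is an assumption about how such a request might look and is not taken from PACtool's source; the function names vep_url and fetch_vep are hypothetical, and the sketch assumes the rs_id of the variant is known.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def vep_url(rs_id, species="human"):
    """Build the Ensembl REST VEP URL for a variant identifier."""
    return "{}/vep/{}/id/{}".format(
        ENSEMBL_REST, species, urllib.parse.quote(rs_id, safe="")
    )


def fetch_vep(rs_id, species="human", timeout=30):
    """Request VEP annotations as JSON (needs network access)."""
    request = urllib.request.Request(
        vep_url(rs_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(vep_url("rs6054257"))
```

Calling fetch_vep("rs6054257") and passing the result to json.dumps with indent=2 would then print the annotations in a readable JSON layout.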
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
TESTING
PACtool was built around and tested using the following files:
gwas.controls.gen
gwas.cases.gen
which can be downloaded from the following link:
https://s3.eu-central-1.amazonaws.com/pythonprojectgwas/gwas.tar.gz
Credit for the creation of this artificial dataset goes to the team running the 'BIO-102 Introduction to Programming' course of the MSc in Bioinformatics, UoCrete, Heraklion.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CONTRIBUTORS
Authors of PACtool are:
Christina Chatzipantsiou (chatzipantsiou@gmail.com)
Panayiotis Linardos (mondestrasz@gmail.com)
Paschalis Natsidis (pnatsidis@hotmail.com)
For any bug report, contribution or general comment, please contact any of the authors using the provided e-mail addresses.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
LICENSE
PACtool is released under the MIT license, described at the following link:
https://opensource.org/license/MIT