package to run a genotype quality control pipeline
Project description
Genotype Quality Control Pipeline
This is a Python package designed to perform a genotype quality control pipeline. It encapsulates several years of research at CGE Tübingen.
Basic requirements
The quality control pipeline is build on PLINK 1.9
as main tool. The cge-comrare-pipeline
works as a wrapper for the different pipeline steps. Then, to run the pipeline, PLINK 1.9
must be installed in the system.
The pipeline is designed to run seamlessly with a "minimum" input and get cleaned binary files as result. In order to accomplish this, it is expected the following folder structure:
projectFolder
|
|---inputData
|
|---outputData
|
|---dependables
|
|---configFiles
-
The folder
inputData
should contain the binary files with the genotype data to analyze inPLINK
format (.bed
,.bim
,.fam
files). -
The folder
outputData
will contain the resultant files of the quality control pipeline. Bellow it will be treated in detail the pipeline output. -
The folder
dependables
is designed to contain necessary files for the pipeline. -
The folder
configFiles
is essential for the pipeline correct functioning. It should contain two configuration files:parameters.JSON
,paths.JSON
andsteps.JSON
.
Configuration files
These two files contain all the information necessary to run the pipeline.
Quality control pipeline parameters
The file parameters.JSON
contains values for PLINK
commands that will be used in the pipeline. If this file is not provided, the default values of the pipeline will be taken into account. These are
{
"maf" : 0.05,
"geno": 0.1,
"mind": 0.1,
"hwe" : 0.00000005,
"sex_check": [0.2, 0.8],
"indep-pairwise": [50, 5, 0.2],
"chr": 24,
"outlier_threshold": 6,
"pca": 10
}
If one wants to change at least one of the default values, please provide the full information in the configuration file. In the repository can be found the .JSON
file corresponding to the cge-comrare-pipeline
default parameters.
Paths to project folders
The file paths.JSON
contain the addresses to the project folder as well as the prefix of the input and output data. The file must contain the following fields:
{
"input_directory" : "<path to folder with project input data>",
"input_prefix" : "<prefix of the input data>",
"output_directory" : "<path to folder where the output data will go>",
"output_prefix" : "<prefix for the output data>",
"dependables_directory": "<path to folder with dependables files>"
}
Pipeline steps
The file steps.JSON
has the following structure:
{
"pca" : true,
"sample" : true,
"variant": true
}
With the above configuration all three steps will run seamlessly, which is the recommended initial configuration. If some step want to be skipped the value should be change to false
. For example,
{
"pca" : false,
"sample" : true,
"variant": true
}
allows to run only the sample and variant quality control. Notice that the an exception will be raised if the PCA steps has not be run, because the necessary files to run the sample steps would no be available.
Dependable files
In this folder should be allocated additional files to run the quality control pipeline. The structure inside the directory should be as follows:
dependables
|
|---all_phase3.bed
|
|---all_phase3.bim
|
|---all_phase3.fam
|
|---all_phase3.psam
|
|---high-LD-regions.txt
Notice that the files all_phase3.bed
, all_phase3.bim
, all_phase3.fam
and all_phase3.psam
correspond to the 1000 Genomes phase 3. In addition, the file high-LD-regions.txt
corresponds to the built 38, in order to be consistent with 1000 Genomes phase 3 built.
Output data
Usage
The pipeline is easy to use. Once installed in the system or in a virtual enviroment one needs to run the following command:
python3 cge_comrare_pipeline --path_params <path to parameters.JSON>
--file_folders <path to paths.JSON>
--steps <path to steps.JSON>
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file cge_comrare_pipeline-0.1.0.tar.gz
.
File metadata
- Download URL: cge_comrare_pipeline-0.1.0.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.2 Linux/6.8.4-200.fc39.x86_64
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6fcb9955379647a5f6bd169381576e9c256e792ceff23f5374b7cb96c83a65a2 |
|
MD5 | 5ac641acce9218f3b0ab6c1893bc3e46 |
|
BLAKE2b-256 | 64c10479eb0247320180abc1fd6bd0e647d90ce2750682c7dab6d03f28c82561 |
File details
Details for the file cge_comrare_pipeline-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: cge_comrare_pipeline-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.2 Linux/6.8.4-200.fc39.x86_64
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 64e305fd599ef4d8e3fef5c9f6f5c7fd0321ec8d004fa8498b232366638d3d2a |
|
MD5 | aa48d6263c2e62c5d3460d442cf29150 |
|
BLAKE2b-256 | 54fce0da9ae6567fe3b0811f9fa57ceeba76d92be95573fca4c883e60f2e0fca |