package to run a genotype quality control pipeline

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

Genotype Quality Control Pipeline

This Python package is designed to execute a genotype quality control pipeline, encapsulating several years of research at CGE Tübingen.

Basic requirements

The quality control pipeline is built on PLINK as its main tool, and the ideal_genom_qc package serves as a wrapper for the various QC pipeline steps. To run the pipeline, both PLINK 1.9 and PLINK 2 must be installed on the system.

The pipeline is designed to run with minimal input and to produce cleaned binary files, as well as several plots along the way. To accomplish this, the following folder structure is expected:

projectFolder
    |
    |---inputData
    |
    |---outputData
    |
    |---configFiles
    |
    |---dependables

The inputData folder should contain the binary files with the genotype data to be analyzed in PLINK format (.bed, .bim, .fam files).
The outputData folder will contain the files produced by the quality control pipeline. The pipeline output is detailed in the Output Data section.
The dependables folder is designed to contain complementary files for the quality control pipeline. This folder is optional.
The configFiles folder is essential for the correct functioning of the pipeline. It should contain three configuration files: parameters.JSON, paths.JSON and steps.JSON.
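
The layout above can be prepared with a few lines of Python. The following is a minimal sketch, not part of the library: the function names create_project_layout and missing_input_files are hypothetical, and only the folder names and file extensions come from the description above.

```python
from pathlib import Path

# Folder names taken from the expected project structure above.
PROJECT_SUBFOLDERS = ("inputData", "outputData", "configFiles", "dependables")


def create_project_layout(project_dir):
    """Create the expected subfolders under project_dir and return their paths."""
    root = Path(project_dir)
    created = {}
    for name in PROJECT_SUBFOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created[name] = folder
    return created


def missing_input_files(input_dir, prefix):
    """Return the PLINK binary files (.bed, .bim, .fam) missing for a prefix."""
    input_dir = Path(input_dir)
    expected = [input_dir / f"{prefix}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path.name for path in expected if not path.is_file()]
```

Calling missing_input_files on the inputData folder before launching the pipeline gives an early hint when one of the three PLINK files is absent.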

Configuration Files

These three files contain all the information necessary to run the pipeline.

Quality Control Pipeline Parameters

The parameters.JSON file contains values for PLINK commands that will be used in the pipeline as well as other parameters to tailor other steps. The parameters for the CLI (command line interface) must be provided in a .JSON file with the following structure:

{
    "sample_qc": {
        "rename_snp"   : true,
        "hh_to_missing": true,
        "use_kingship" : true,
        "ind_pair"     : [50, 5, 0.2],
        "mind"         : 0.2,
        "sex_check"    : [0.2, 0.8],
        "maf"          : 0.01,
        "het_deviation": 3,
        "kingship"     : 0.354,
        "ibd_threshold": 0.185
    },
    "ancestry_qc": {
        "ind_pair"     : [50, 5, 0.2],
        "pca"          : 10,
        "maf"          : 0.01,
        "ref_threshold": 4,
        "stu_threshold": 4,
        "reference_pop": "SAS",
        "num_pcs"      : 10
    },
    "variant_qc": {
        "chr_y": 24,
        "miss_data_rate": 0.2,
        "diff_genotype_rate": 1e-5,
        "geno": 0.1,
        "maf": 5e-8,
        "hwe": 5e-8
    },
    "umap_plot": {
        "umap_maf": 0.01,
        "umap_mind": 0.2,
        "umap_geno": 0.1,
        "umap_hwe": 5e-8,
        "umap_ind_pair": [50, 5, 0.2],
        "umap_pca": 10,
        "n_neighbors": [5, 10, 15],
        "metric": ["euclidean", "chebyshev"],
        "min_dist": [0.01, 0.1, 0.2],
        "random_state": 42,
        "case_control_marker": true,
        "color_hue_file": "path/to/color_hue_file.txt",
        "umap_kwargs": {}
    }
}

The values shown for each parameter are the defaults used in our research group. To change any of them, provide the complete set of parameters in the configuration file, not only the modified values.
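
Since the file must be valid JSON and contain every section, it can help to check it before a long run. The sketch below is an illustration under assumptions, not the library's own loader: load_parameters is a hypothetical helper, and the section names are copied from the example above.

```python
import json
from pathlib import Path

# Top-level sections shown in the example parameters.JSON above.
EXPECTED_SECTIONS = ("sample_qc", "ancestry_qc", "variant_qc", "umap_plot")


def load_parameters(path):
    """Parse a parameters file and report any missing top-level section."""
    with Path(path).open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError on malformed input,
        # for example a trailing comma after the last key of an object.
        params = json.load(handle)
    missing = [name for name in EXPECTED_SECTIONS if name not in params]
    if missing:
        raise KeyError(f"missing sections in {path}: {', '.join(missing)}")
    return params
```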

Paths to Project Folders

The paths.JSON file contains the paths to the project folders as well as the prefixes of the input and output data. The file must contain the following fields:

{
    "input_directory"      : "<path to folder with project input data>",
    "input_prefix"         : "<prefix of the input data>",
    "output_directory"     : "<path to folder where the output data will go>",
    "output_prefix"        : "<prefix for the output data>",
    "high_ld_file"         : "<path to file with high LD regions>"
}

If the CLI is run locally, the user should provide the full paths to files and directories. If no high LD file is provided, or if the path is wrong, the library falls back to its built-in default file.
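
One way to make sure the file holds full paths is to generate it from Python. This is a minimal sketch: build_paths_config and write_paths_config are hypothetical helpers, and leaving high_ld_file as an empty string to mean "not provided" is an assumption about how the library treats that field.

```python
import json
from pathlib import Path


def build_paths_config(input_dir, input_prefix, output_dir, output_prefix, high_ld_file=""):
    """Return the paths.JSON fields with directories resolved to absolute paths."""
    return {
        "input_directory": str(Path(input_dir).resolve()),
        "input_prefix": input_prefix,
        "output_directory": str(Path(output_dir).resolve()),
        "output_prefix": output_prefix,
        # Assumption: an empty string stands for "no high LD file provided".
        "high_ld_file": str(Path(high_ld_file).resolve()) if high_ld_file else "",
    }


def write_paths_config(config, destination):
    """Write the configuration dictionary as indented JSON."""
    with Path(destination).open("w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
```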

Pipeline Steps

The steps.JSON file has the following structure:

{
    "ancestry": true,
    "sample"  : true,
    "variant" : true,
    "umap"    : true
}

With the above configuration, all steps will run, which is the recommended initial configuration. To skip a step, change its value to false. For example,

{
    "sample"   : false,
    "ancestry" : false,
    "variant"  : true,
    "umap"     : true
}

allows you to run only the variant QC and generate the UMAP plot(s). Note that an exception will be raised if the ancestry check step has not been run beforehand, as the files required by the variant step would not be available.
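
A steps file can be sanity-checked before it is passed to the pipeline. The sketch below is only an illustration: validate_steps is a hypothetical helper, the step names are copied from the examples above, and treating a missing step as disabled is an assumption.

```python
import json

# Step names taken from the example steps.JSON above.
KNOWN_STEPS = ("ancestry", "sample", "variant", "umap")


def validate_steps(steps):
    """Return the enabled step names after basic sanity checks."""
    unknown = sorted(set(steps) - set(KNOWN_STEPS))
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")
    for name, value in steps.items():
        # JSON true/false become Python bool; reject strings such as "true".
        if not isinstance(value, bool):
            raise TypeError(f"step '{name}' must be true or false, got {value!r}")
    # Assumption: a step that is absent from the file is treated as disabled.
    return [name for name in KNOWN_STEPS if steps.get(name, False)]


steps = json.loads('{"sample": false, "ancestry": false, "variant": true, "umap": true}')
print(validate_steps(steps))  # ['variant', 'umap']
```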

Dependable Files

This folder should contain additional files needed to run the quality control pipeline. For example, the user might use this directory to store a high LD regions file in case they want to use a different one from the library's default. Moreover, if the user wants to explore the population structure with respect to some category, the corresponding file should be located in this folder.

dependables
    |
    |---high-LD_regions.txt
    |
    |---population_categories.txt

The population categories file (population_categories.txt above) is expected to have three columns: the first two hold the IID and FID from the PLINK .fam file, and the third holds the category to be explored.
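
The three-column layout can be verified before running the UMAP step. This is a minimal sketch under stated assumptions: read_category_file is a hypothetical helper, and the file is assumed to be whitespace-delimited, without a header line, and with category labels that contain no spaces. None of these details are confirmed by the description above.

```python
from pathlib import Path


def read_category_file(path):
    """Read a three-column, whitespace-delimited file into a list of tuples."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            fields = line.split()
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 columns, found {len(fields)}"
                )
            rows.append(tuple(fields))
    return rows
```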

The other external files needed to perform the QC pipeline are the reference genome files. The library can fetch and process the reference genome automatically.

Output Data

This folder has the following structure:

outputData
    |
    |---ancestry_results
    |
    |---umap_plots
    |
    |---sample_qc_results
    |
    |---variant_qc_results

Results of ancestry outliers analysis

This folder contains the results from the ancestry analysis. Once the process is finished, the folder will contain three folders and several files (we intend to reduce the number of files at a later stage). The files are those resulting from the several steps of the ancestry check. The three folders are

fail_samples: it contains a .txt file with the samples that failed the ancestry check;
clean_files: it contains the cleaned files after the ancestry check in PLINK format;
ancestryQC_plots: it contains two plots showing the PCA decomposition of the study population against the reference panel.

Recall that the cleaned binary files will feed the next steps.

UMAP Plots

In this folder one can find the plot(s) generated after the UMAP dimensionality reduction, in order to explore the structure of the study population.

Results of Sample Quality Control

This folder contains the results from the Sample Quality Control. Once the process is done, the folder will contain three folders and multiple files. The files are those resulting from the several steps of the sample QC. The three folders are

fail_samples: it contains .txt files with the samples that failed the different stages of the sample QC;
clean_files: it contains the cleaned files after the sample quality control in PLINK format;
sampleQC_plots: it contains different plots that serve as a report of the different stages and might suggest a different selection of parameters.

Recall that the cleaned binary files will feed the next steps.

Results of Variant Quality Control

This folder contains the results from the Variant Quality Control. Once the process is done, the folder will contain three folders and several files. The files are those resulting from the steps of the variant QC. The three folders are

fail_samples: it contains .txt files with the samples that failed the different stages of the variant QC;
clean_files: it contains the cleaned files after the variant quality control in PLINK format;
variantQC_plots: it contains different plots that serve as a report of the different stages and might suggest a different selection of parameters.

These cleaned binary files are ready for the next steps of the GWAS analysis.
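
To confirm that a clean_files folder holds a complete PLINK fileset, the three extensions can be checked together. The following is a minimal sketch; find_plink_prefixes is a hypothetical helper and is not provided by the library.

```python
from pathlib import Path

PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def find_plink_prefixes(folder):
    """Return prefixes in folder for which .bed, .bim and .fam all exist."""
    folder = Path(folder)
    prefixes = []
    for bed_file in sorted(folder.glob("*.bed")):
        # Strip only the final ".bed" so prefixes containing dots are preserved.
        prefix = bed_file.with_suffix("")
        if all(prefix.with_name(prefix.name + ext).is_file() for ext in PLINK_EXTENSIONS):
            prefixes.append(prefix.name)
    return prefixes
```

An empty result for a given clean_files folder suggests that the corresponding step did not finish or wrote its output elsewhere.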

Installation and usage

The library can be installed by cloning the GitHub repository:

git clone https://github.com/cge-tubingens/IDEAL-GENOM-QC.git

or directly from PyPI:

pip install ideal_genom_qc

Note that the version on PyPI is the stable release, while the one on GitHub is under development.

Setting up the environment

The virtual environment can be created using either Poetry or pip. Since this is a Poetry-based project, we recommend using Poetry. Once Poetry is installed on your system (refer to Poetry documentation for installation details), navigate to the cloned repository folder and run the following command:

poetry install

Note that the project currently uses Poetry 2.0.

Pipeline usage options

1. Inside a virtual environment

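After running poetry install, activate the virtual environment. Note that since Poetry 2.0 the shell command is provided by the separate poetry-plugin-shell plugin, which may need to be installed first: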

poetry shell

Once the environment is active, you can execute the pipeline with the following command:

python3 ideal_genom_qc --path_params <path to parameters.JSON> \
                             --file_folders <path to paths.JSON> \
                             --steps <path to steps.JSON> \
                             --recompute-merge true \
                             --built 38

The first three parameters are the paths to the three configuration files. The remaining parameters control the pipeline behavior.

2. Using Poetry directly

One of the benefits of using Poetry is that it eliminates the need to activate a virtual environment. Run the pipeline directly with:

poetry run python3 ideal_genom_qc --path_params <path to parameters.JSON> \
                             --file_folders <path to paths.JSON> \
                             --steps <path to steps.JSON> \
                             --recompute-merge true \
                             --built 38
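
When the pipeline is launched from a script, the two commands above can be assembled as an argument list. This is a minimal sketch: build_qc_command is a hypothetical helper, the flags are copied from the commands above, and passing "false" to --recompute-merge is an assumption about the values the CLI accepts.

```python
import shlex


def build_qc_command(path_params, file_folders, steps,
                     recompute_merge=True, built="38", use_poetry=False):
    """Assemble the command line shown above as a list of arguments."""
    command = ["poetry", "run"] if use_poetry else []
    command += [
        "python3", "ideal_genom_qc",
        "--path_params", str(path_params),
        "--file_folders", str(file_folders),
        "--steps", str(steps),
        # Assumption: the flag takes the literal strings "true" or "false".
        "--recompute-merge", "true" if recompute_merge else "false",
        "--built", str(built),
    ]
    return command


cmd = build_qc_command(
    "configFiles/parameters.JSON",
    "configFiles/paths.JSON",
    "configFiles/steps.JSON",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems when paths contain spaces.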

3. Jupyter Notebooks

The package includes Jupyter notebooks located in the notebooks folder. Each notebook corresponds to a specific step of the pipeline. Simply provide the required parameters to execute the steps interactively.

Using the notebooks is a great way to gain a deeper understanding of how the pipeline operates.

4. Docker Container

A Dockerfile is provided to build a container for the pipeline. Since the container interacts with physical files, it is recommended to use the following command:

docker run -v <path to project folder>:/data <docker_image_name>:<tag> --path_params <relative path to parameters.JSON> --file_folders <relative path to paths.JSON> --steps <relative path to steps.JSON> --recompute-merge true --built 38

Note that the paths in paths.JSON must be given relative to their location inside the data folder of the Docker container.
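
The mapping between host paths and container paths can be worked out mechanically. The sketch below assumes POSIX-style host paths and the /data mount point from the command above; relative_to_project and container_path are hypothetical helpers, and the example paths are placeholders.

```python
from pathlib import PurePosixPath


def relative_to_project(host_path, project_dir):
    """Return host_path relative to the mounted project folder."""
    # relative_to raises ValueError if host_path lies outside project_dir,
    # in which case the file would not be visible inside the container.
    return PurePosixPath(host_path).relative_to(PurePosixPath(project_dir))


def container_path(host_path, project_dir, mount_point="/data"):
    """Return the location of host_path as seen from inside the container."""
    return PurePosixPath(mount_point) / relative_to_project(host_path, project_dir)


print(relative_to_project("/home/user/projectFolder/inputData", "/home/user/projectFolder"))
# inputData
print(container_path("/home/user/projectFolder/inputData", "/home/user/projectFolder"))
# /data/inputData
```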

Release history

0.2.0 (this version): Jun 2, 2025
0.1.2: Apr 22, 2025
0.1.0: Apr 7, 2025

Download files

Source Distribution

ideal_genom_qc-0.2.0.tar.gz (56.4 kB)

Uploaded Jun 2, 2025 Source

Built Distribution

ideal_genom_qc-0.2.0-py3-none-any.whl (59.3 kB)

Uploaded Jun 2, 2025 Python 3

File details

Details for the file ideal_genom_qc-0.2.0.tar.gz.

File metadata

Download URL: ideal_genom_qc-0.2.0.tar.gz
Upload date: Jun 2, 2025
Size: 56.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.12.3 Linux/6.11.0-26-generic

File hashes

Hashes for ideal_genom_qc-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9d67a180a14cb30639a31fbc8c8d081e38ff40ceece1026d472c10920664fa92`
MD5	`565b4460d8f38a4a53f682dbd6e24435`
BLAKE2b-256	`99cc6cabc444c90a7b56339a47be183d9c9668022fd1e752919d0fce8a1a5bae`
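
A downloaded file can be checked against the published SHA256 digest with the standard hashlib module. This is a minimal sketch; sha256_of_file and matches_digest are hypothetical helper names, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file digest with an expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value is the SHA256 entry listed in the table above.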

File details

Details for the file ideal_genom_qc-0.2.0-py3-none-any.whl.

File metadata

Download URL: ideal_genom_qc-0.2.0-py3-none-any.whl
Upload date: Jun 2, 2025
Size: 59.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.12.3 Linux/6.11.0-26-generic

File hashes

Hashes for ideal_genom_qc-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0d0ea12390e111c5ecc905496ac780a7e6abe6fb579cd76c2833a28433bf16e3`
MD5	`a1e9a96b44c0f72c37c82c46f018b882`
BLAKE2b-256	`cf9111b31f4f006d83c412ddd79881539ea630482ae2b45ddb6282ad13de786d`
