Phenotypic Data Quality Control Toolkit for Genomic Data Infrastructure (GDI)

PhenoQC

PhenoQC is a lightweight, efficient, and user-friendly toolkit designed to perform comprehensive quality control (QC) on phenotypic datasets within the Genomic Data Infrastructure (GDI) framework. It ensures that phenotypic data adheres to standardized formats, maintains consistency, and is harmonized with recognized ontologies, thereby facilitating seamless integration with genomic data for advanced research.

Features

  • Comprehensive Data Validation: Checks format compliance, schema adherence, and data consistency.
  • Ontology Mapping: Maps phenotypic terms to multiple standardized ontologies (HPO, DO, MPO) with synonym resolution and custom mapping support.
  • Missing Data Handling: Detects missing data and either imputes it using configurable strategies or flags it for manual review.
  • Batch Processing: Supports processing multiple files simultaneously with parallel execution.
  • User-Friendly Interfaces: CLI for power users and an optional Streamlit-based GUI for interactive use.
  • Reporting and Visualization: Generates detailed QC reports and visual summaries of data quality metrics.
  • Extensibility: Modular design allows for easy addition of new validation rules or mapping functionalities.
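To make the ontology-mapping feature concrete, here is a minimal sketch of synonym-aware term resolution with custom-mapping overrides. This is an illustration of the general technique, not PhenoQC's actual implementation; the toy HPO entries and function names are invented for the example.

```python
# Toy ontology fragment: canonical label and synonyms per HPO ID.
HPO_TERMS = {
    "HP:0001250": {"label": "Seizure", "synonyms": ["seizures", "epileptic seizure"]},
    "HP:0001382": {"label": "Joint hypermobility", "synonyms": ["hypermobile joints"]},
}

def build_lookup(terms):
    """Index every label and synonym (lowercased) to its ontology ID."""
    lookup = {}
    for term_id, entry in terms.items():
        lookup[entry["label"].lower()] = term_id
        for syn in entry["synonyms"]:
            lookup[syn.lower()] = term_id
    return lookup

def map_term(term, lookup, custom_mappings=None):
    """Resolve a raw phenotype string to an ontology ID, or None if unmapped.

    Custom mappings take precedence, mirroring the --custom_mappings option.
    """
    key = term.strip().lower()
    if custom_mappings and key in custom_mappings:
        return custom_mappings[key]
    return lookup.get(key)

lookup = build_lookup(HPO_TERMS)
print(map_term("Seizures", lookup))                      # HP:0001250
print(map_term("fits", lookup, {"fits": "HP:0001250"}))  # HP:0001250 (custom mapping)
```

Terms that resolve to `None` would be the ones a tool like PhenoQC flags for manual review.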

Installation

Ensure you have Python 3.6 or higher installed.

pip install phenoqc

Alternatively, clone the repository and install manually:

git clone https://github.com/jorgeMFS/PhenoQC.git
cd PhenoQC
pip install -e .

Usage

Command-Line Interface (CLI)

Process a single file:

phenoqc --input examples/samples/sample_data.json \
--output ./reports/ \
--schema examples/schemas/pheno_schema.json \
--config config.yaml \
--custom_mappings examples/mapping/custom_mappings.json \
--impute mice \
--unique_identifiers SampleID \
--ontologies HPO DO MPO

Batch process multiple files:

phenoqc --input examples/samples/sample_data.csv examples/samples/sample_data.json examples/samples/sample_data.tsv \
--output ./reports/ \
--schema examples/schemas/pheno_schema.json \
--config config.yaml \
--custom_mappings examples/mapping/custom_mappings.json \
--impute none \
--unique_identifiers SampleID \
--ontologies HPO DO MPO

Parameters:

  • --input: One or more input data files or directories (supported formats: csv, tsv, json).
  • --output: Directory to save reports and processed data. Defaults to ./reports/.
  • --schema: Path to the JSON schema file for data validation.
  • --config: Path to the YAML configuration file defining ontology mappings and other settings. Defaults to config.yaml.
  • --custom_mappings: (Optional) Path to a custom mapping JSON file for ontology term resolutions.
  • --impute: Strategy for imputing missing data. Choices:
    • mean: Impute missing numeric data with the column mean.
    • median: Impute missing numeric data with the column median.
    • mode: Impute missing categorical data with the column mode.
    • knn: Impute missing numeric data using k-Nearest Neighbors.
    • mice: Impute missing numeric data using Multiple Imputation by Chained Equations.
    • svd: Impute missing numeric data using Iterative Singular Value Decomposition.
    • none: Do not perform imputation; simply flag missing data.
  • --unique_identifiers: List of column names that uniquely identify a record (e.g., SampleID).
  • --ontologies: (Optional) List of ontologies to map to (e.g., HPO DO MPO).
  • --recursive: (Optional) Enable recursive directory scanning when input paths include directories.
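The simple strategies (mean, median, mode) are standard column-summary imputation; knn, mice, and svd are model-based methods typically provided by libraries such as scikit-learn. As a rough illustration of what the simple strategies do (generic Python, not PhenoQC's own code, with invented sample values):

```python
from statistics import mean, median, mode

# Numeric and categorical columns with missing entries (None).
heights = [170.0, None, 180.0, 175.0]
statuses = ["affected", "unaffected", None, "affected"]

def impute(values, strategy):
    """Replace None entries using the chosen summary of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [v if v is not None else fill for v in values]

print(impute(heights, "mean"))    # [170.0, 175.0, 180.0, 175.0]
print(impute(heights, "median"))  # [170.0, 175.0, 180.0, 175.0]
print(impute(statuses, "mode"))   # ['affected', 'unaffected', 'affected', 'affected']
```

With --impute none, such gaps are left in place and only reported.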

Graphical User Interface (GUI)

Launch the GUI using Streamlit:

streamlit run src/gui.py

Note: Ensure you have the GUI dependencies installed.

Steps:

  1. Configuration:

    • Upload JSON Schema: Upload your JSON schema file for data validation.
    • Upload Configuration (config.yaml): Upload the configuration file that defines the ontologies and their respective JSON files.
    • Upload Custom Mapping (Optional): Upload a JSON file containing custom term mappings.
    • Select Imputation Strategy: Choose between 'mean' and 'median' for imputing missing data.
  2. Data Ingestion:

    • Select Data Source: Choose between uploading individual phenotype data files or uploading a ZIP archive containing multiple files.
    • Upload Files or ZIP: Depending on the selected option, upload the necessary files.
    • Enable Recursive Directory Scanning: (Optional) Enable if you want the tool to scan directories recursively within the uploaded ZIP archive.
  3. Unique Identifiers & Ontologies:

    • Specify Unique Identifier Columns: Enter column names that uniquely identify each record, separated by commas (e.g., SampleID,PatientID).
    • Specify Ontologies to Map: Enter ontology IDs separated by spaces (e.g., HPO DO MPO). Leave blank to use the default ontology specified in config.yaml.
  4. Run Quality Control:

    • Click the "Run Quality Control" button to start processing.
    • View processing results and download generated reports.

Configuration

PhenoQC uses a YAML configuration file (config.yaml) to specify ontology mappings and other settings. Ensure this file is properly set up in your project directory.

Example config.yaml:

ontologies:
  HPO:
    name: Human Phenotype Ontology
    file: ontologies/HPO.json
  DO:
    name: Disease Ontology
    file: ontologies/DO.json
  MPO:
    name: Mammalian Phenotype Ontology
    file: ontologies/MPO.json
default_ontology: HPO

Ensure that the ontology JSON files (HPO.json, DO.json, MPO.json) are correctly placed in the ontologies/ directory and properly formatted.
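A quick sanity check on the parsed configuration can catch mistakes before a QC run. The sketch below operates on the dict form of the example config.yaml (as produced by, e.g., yaml.safe_load); the check_config helper is illustrative, not part of PhenoQC's API.

```python
# Parsed form of the example config.yaml, written out as a plain dict
# so the checks below are self-contained.
config = {
    "ontologies": {
        "HPO": {"name": "Human Phenotype Ontology", "file": "ontologies/HPO.json"},
        "DO":  {"name": "Disease Ontology", "file": "ontologies/DO.json"},
        "MPO": {"name": "Mammalian Phenotype Ontology", "file": "ontologies/MPO.json"},
    },
    "default_ontology": "HPO",
}

def check_config(cfg):
    """Fail fast on a malformed config: the default ontology must be one of
    the configured ontologies, and each entry needs a name and a file path."""
    if cfg.get("default_ontology") not in cfg.get("ontologies", {}):
        raise ValueError("default_ontology must name a configured ontology")
    for key, entry in cfg["ontologies"].items():
        missing = {"name", "file"} - entry.keys()
        if missing:
            raise ValueError(f"ontology {key} is missing fields: {missing}")
    return True

print(check_config(config))  # True
```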

Documentation

Comprehensive documentation is available on the GitHub Wiki.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
