PhenGO: A tool to build WEKA-ready ARFF files from model organism phenotype and Gene Ontology (GO) annotations.

PhenGO

Overview

This project provides a Python tool that generates WEKA-ready ARFF files for machine learning applications involving gene essentiality prediction. It integrates phenotype data and Gene Ontology (GO) annotations for genes from selected model organisms, streamlining data preparation.

Purpose

The main goal of this project is to simplify and standardise the creation of ARFF files that combine phenotype information with GO-mapped gene data. This enables researchers to efficiently apply machine learning techniques (using WEKA or similar platforms) to analyse gene essentiality and related biological questions across various model organisms.

Features

  • Unified Workflow: Handles data collection, integration, and formatting in a single pipeline.
  • Model Organism Support: Designed for commonly studied organisms (e.g., Saccharomyces cerevisiae, Mus musculus).
  • GO Annotation Integration: Maps genes to their GO terms for comprehensive feature representation, and traverses the GO OBO file to add each term's parent terms.
  • Phenotype Data Inclusion: Incorporates phenotype labels for supervised learning tasks.
  • WEKA ARFF Output: Produces files in the ARFF format, ready for immediate use in WEKA.
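
The parent-term step in the feature list can be pictured with a small sketch. It is an illustration only, not PhenGO's own code: the function names `parse_is_a` and `ancestors` are hypothetical, the OBO sample is a toy, and it assumes only `is_a` links are followed (PhenGO may also use other relationships such as `part_of`).

```python
from collections import defaultdict

def parse_is_a(obo_text):
    """Return {term_id: set(parent_ids)} from the [Term] stanzas of OBO text."""
    parents = defaultdict(set)
    current = None
    in_term = False
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:]
        elif in_term and current and line.startswith("is_a: "):
            # "is_a: GO:0008150 ! biological_process" -> keep the ID only
            parents[current].add(line[6:].split("!")[0].strip())
    return parents

def ancestors(term, parents):
    """All terms reachable from `term` by following is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy OBO fragment (assumption: not taken from a real go.obo release).
OBO_SAMPLE = """\
[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0007049
name: cell cycle
is_a: GO:0009987 ! cellular process
"""

parents = parse_is_a(OBO_SAMPLE)
print(sorted(ancestors("GO:0007049", parents)))
```

A gene annotated with the most specific term would then also receive every ancestor as a feature, which is one common way to propagate GO annotations.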

Installation

To install the PhenGO package, you can use pip:

pip install phengo

Usage

PhenGO Package:

PhenGO Example:

PhenGO -species fly \
  -phenotype_file data/fly/phenotype_data/2017/allele_phenotypic_data_fb_2017_05.tsv.gz \
  -gene_association_file data/fly/gene_association/2017/gene_association_2017_05.fb.gz \
  -go_obo_file data/go/2017/go_2017-05-01.obo.gz \
  -output_dir Documents/PhenGO/fly_2017

The output will be saved in the specified output directory, which will contain the ARFF file and other relevant data files.
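
For readers unfamiliar with ARFF, the sketch below shows roughly what a gene-by-GO-term file can look like. The layout is an assumption for illustration (binary `{0,1}` attributes per GO term and a `{lethal,viable}` class); the exact attributes PhenGO writes may differ, and `to_arff` is a hypothetical helper, not part of the package.

```python
def to_arff(relation, gene_terms, labels):
    """Render a gene-by-GO-term matrix as ARFF text.

    Hypothetical layout for illustration, not necessarily what PhenGO emits.
    """
    # One binary attribute per GO term seen in any gene.
    terms = sorted({t for ts in gene_terms.values() for t in ts})
    lines = [f"@relation {relation}", ""]
    for term in terms:
        # Quote the name so the colon in "GO:..." cannot confuse a parser.
        lines.append(f"@attribute '{term}' {{0,1}}")
    lines.append("@attribute class {lethal,viable}")
    lines += ["", "@data"]
    for gene in sorted(gene_terms):
        row = ["1" if t in gene_terms[gene] else "0" for t in terms]
        lines.append(",".join(row + [labels[gene]]))
    return "\n".join(lines) + "\n"

# Toy input (assumption): two genes with made-up annotations.
gene_terms = {
    "GeneA": {"GO:0003674"},
    "GeneB": {"GO:0003674", "GO:0008150"},
}
labels = {"GeneA": "lethal", "GeneB": "viable"}

arff_text = to_arff("toy", gene_terms, labels)
print(arff_text)
```

Each data row is one gene, with a 1 wherever the gene carries that GO term and the phenotype label in the last column.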

Menu:

usage: PhenGO.py [-h] -species SPECIES -phenotype_file PHENOTYPE_FILE
                 -gene_association_file GENE_ASSOCIATION_FILE -go_obo_file
                 GO_OBO_FILE -output_dir OUTPUT_DIR [-filter_unused_gos]
                 [-filter_mixed_terms] [-gene_go_pheno]
                 [-fly_assignments FLY_ASSIGNMENTS]
                 [-driver_lines DRIVER_LINES] [-filt_with]
                 [-worm_phenotypes WORM_PHENOTYPES]
                 [-mouse_phenotypes MOUSE_PHENOTYPES] [-v]

PhenGO v0.1.1 - Convert phenotype and GO data to ARFF format

Required Options:
  -species SPECIES      Species tag (e.g., fly, yeast)
  -phenotype_file PHENOTYPE_FILE
                        Path to the phenotype data file (.gz)
  -gene_association_file GENE_ASSOCIATION_FILE
                        Path to the gene association file (.gz)
  -go_obo_file GO_OBO_FILE
                        Path to the go.obo file
  -output_dir OUTPUT_DIR
                        Output directory

Optional parameters:
  -filter_unused_gos    Filter out unused GO terms from the FUNC and ARFF
                        output (default: True)
  -filter_mixed_terms   Filter out genes which have both lethal and viable
                        phenotypes - Terms not specifically lethal/viable are
                        not counted in this (default: False)
  -gene_go_pheno        Output "Gene-GO-Phenotype" (Rbbp5 GO:0003674 0) file
                        for overrepresentation analysis with tools such as
                        FUNC (default: False)

Fly specific parameters:
  -fly_assignments FLY_ASSIGNMENTS
                        Provide TSV file of fly assignments (file confirming
                        genes are assigned to Drosophila melanogaster)
                        (default: "data/fly/FlyBase_Fields_2017.txt.gz")
  -driver_lines DRIVER_LINES
                        Provide TSV file of fly driver lines (file containing
                        the names of driver lines (RNAi) to ignore when present
                        with the "with" tag) (default: "data/fly/FlyBase_Driver
                        Line_Fields_2025_08_05.txt.gz")
  -filt_with            Filter out phenotype with "with" tag (default: DO NOT
                        FILTER)

Worm specific parameters:
  -worm_phenotypes WORM_PHENOTYPES
                        Provide TSV file of worm phenotypes (default:
                        "data/worm/WS297_lethal_terms.tsv.gz")

Mouse specific parameters:
  -mouse_phenotypes MOUSE_PHENOTYPES
                        Provide TSV file of mouse phenotypes (default:
                        "data/mouse/mouse_lethal_terms.txt.gz")

Misc:
  -v, --version         show program's version number and exit
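
As a rough picture of what -filter_mixed_terms describes, the sketch below drops genes annotated as both lethal and viable while ignoring any other phenotype terms. It is an assumption about the behaviour based on the help text above, and `drop_mixed` and the input shape are hypothetical, not PhenGO's internals.

```python
def drop_mixed(gene_phenotypes):
    """Keep genes that are not annotated as both lethal and viable.

    Phenotype terms other than "lethal"/"viable" do not count towards
    the decision, mirroring the wording of the help text.
    """
    kept = {}
    for gene, phenos in gene_phenotypes.items():
        if {"lethal", "viable"} <= set(phenos):
            continue  # conflicting evidence: skip this gene
        kept[gene] = phenos
    return kept

# Toy input (assumption): gene names and phenotype strings are made up.
genes = {
    "g1": ["lethal", "viable"],
    "g2": ["lethal", "sterile"],
    "g3": ["viable"],
}
filtered = drop_mixed(genes)
print(sorted(filtered))
```

Here only `g1` is removed, because it is the only gene carrying both labels.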

Compare-ARFF:

usage: compare_arff_genes.py [-h] -arff_a ARFF_A -arff_b ARFF_B -o OUTPUT

PhenGO v0.1.1 - Compare-ARFF: Compare two ARFF files.

options:
  -h, --help      show this help message and exit
  -arff_a ARFF_A  Master ARFF file (reference)
  -arff_b ARFF_B  Comparison ARFF file
  -o OUTPUT       Output CSV file

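Output: Compare-ARFF writes a CSV file with one row per gene, giving the gene's label in each ARFF file, any GO terms that differ, and a status such as MISSING_IN_B, LABEL_MISMATCH, GO_TERM_MISMATCH, or EXACT_MATCH. For example: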

Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
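
A comparison CSV like the one above can be summarised with the standard library. The sketch below reads the sample rows from a string rather than a file and tallies the Status column; the variable names are illustrative, and the column names are taken from the sample header.

```python
import csv
import io
from collections import Counter

# The sample output shown above, embedded as a string for a self-contained demo.
SAMPLE = '''Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# How many genes fall into each status.
counts = Counter(row["Status"] for row in rows)

# Genes that differ in any way between the two ARFF files.
mismatched = [row["Gene"] for row in rows if row["Status"] != "EXACT_MATCH"]

# The differing GO terms are separated by semicolons in the sample.
differing = {
    row["Gene"]: row["GO Terms Differ"].split(";")
    for row in rows
    if row["GO Terms Differ"]
}

print(counts)
print(mismatched)
print(differing)
```

To run this on a real comparison, replace `io.StringIO(SAMPLE)` with an open file handle for the CSV passed to -o.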
