PhenGO: A tool to build WEKA-ready ARFF files from model organism phenotype and Gene Ontology (GO) annotations.
PhenGO
Overview
This project provides a unified Python-based tool that generates WEKA-ready ARFF files for machine learning applications involving gene essentiality prediction. It integrates phenotype data and Gene Ontology (GO) annotations for genes from selected model organisms, streamlining data preparation.
Purpose
The main goal of this project is to simplify and standardise the creation of ARFF files that combine phenotype information with GO-mapped gene data. This enables researchers to efficiently apply machine learning techniques (using WEKA or similar platforms) to analyse gene essentiality and related biological questions across various model organisms.
Features
- Unified Workflow: Handles data collection, integration, and formatting in a single pipeline.
- Model Organism Support: Designed for commonly studied organisms (e.g., Saccharomyces cerevisiae, Mus musculus).
- GO Annotation Integration: Maps genes to their GO terms for comprehensive feature representation and traverses the GO OBO file to add the parent terms of each annotated term.
- Phenotype Data Inclusion: Incorporates phenotype labels for supervised learning tasks.
- WEKA ARFF Output: Produces files in the ARFF format, ready for immediate use in WEKA.
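The parent-term tracing mentioned in the features above can be pictured with a small sketch. This is not PhenGO's own code: `parse_is_a` and `ancestors` are hypothetical helper names, the OBO text is a simplified toy fragment, and the sketch assumes that only `is_a` links are followed (whether PhenGO also follows other relationships such as `part_of` is not stated here).

```python
from collections import defaultdict

# Simplified toy fragment in OBO flat-file style (illustration only).
TOY_OBO = """format-version: 1.2

[Term]
id: GO:0008150
name: biological_process

[Term]
id: GO:0009987
name: cellular process
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0007049
name: cell cycle
is_a: GO:0009987 ! cellular process

[Typedef]
id: part_of
"""


def parse_is_a(obo_lines):
    """Return {term_id: set(parent_ids)} built from the [Term] stanzas."""
    parents = defaultdict(set)
    current = None
    in_term = False
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[4:]
            parents[current]  # register the term even if it has no parents
        elif in_term and current and line.startswith("is_a: "):
            # "is_a: GO:0008150 ! biological_process" -> keep only the ID
            parents[current].add(line[6:].split("!")[0].strip())
    return parents


def ancestors(term, parents):
    """Collect every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = parse_is_a(TOY_OBO.splitlines())
print(sorted(ancestors("GO:0007049", parents)))
# prints ['GO:0008150', 'GO:0009987']
```

For a compressed file such as the `.obo.gz` used in the usage example, the same function could be fed from `gzip.open(path, "rt")`.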
Installation
To install the PhenGO package, you can use pip:
pip install phengo
Usage
PhenGO Package:
PhenGO Example:
PhenGO -species fly \
  -phenotype_file data/fly/phenotype_data/2017/allele_phenotypic_data_fb_2017_05.tsv.gz \
  -gene_association_file data/fly/gene_association/2017/gene_association_2017_05.fb.gz \
  -go_obo_file data/go/2017/go_2017-05-01.obo.gz \
  -output_dir Documents/PhenGO/fly_2017
Output is written to the directory given by -output_dir, which will contain the ARFF file and other relevant data files.
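For readers unfamiliar with ARFF, the sketch below writes a tiny file of the general kind described here: one binary attribute per GO term plus a class attribute. The exact layout PhenGO produces is not documented in this README, so the attribute naming, the `{lethal,viable}` class values, and the helper name `build_arff` are assumptions made for illustration.

```python
def build_arff(relation, go_terms, genes):
    """Return ARFF text.

    genes maps gene name -> (set of GO IDs, class label).
    Layout is an assumption for illustration, not PhenGO's actual output.
    """
    lines = [f"@RELATION {relation}", ""]
    for go in go_terms:
        # Quote the name because GO IDs contain a colon.
        lines.append(f'@ATTRIBUTE "{go}" {{0,1}}')
    lines.append("@ATTRIBUTE class {lethal,viable}")
    lines += ["", "@DATA"]
    for gene, (gos, label) in sorted(genes.items()):
        # 1 if the gene is annotated with the term, else 0.
        row = ["1" if go in gos else "0" for go in go_terms]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"


go_terms = ["GO:0003674", "GO:0008150"]
genes = {
    "GeneA": ({"GO:0008150"}, "lethal"),
    "GeneD": ({"GO:0003674", "GO:0008150"}, "viable"),
}
arff_text = build_arff("toy_essentiality", go_terms, genes)
print(arff_text)
```

A file written this way can be opened directly in WEKA; the real PhenGO output should be inspected to see how it names attributes and classes.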
Menu:
usage: PhenGO.py [-h] -species SPECIES -phenotype_file PHENOTYPE_FILE
                 -gene_association_file GENE_ASSOCIATION_FILE -go_obo_file
                 GO_OBO_FILE -output_dir OUTPUT_DIR [-filter_unused_gos]
                 [-filter_mixed_terms] [-gene_go_pheno]
                 [-fly_assignments FLY_ASSIGNMENTS]
                 [-driver_lines DRIVER_LINES] [-filt_with]
                 [-worm_phenotypes WORM_PHENOTYPES]
                 [-mouse_phenotypes MOUSE_PHENOTYPES] [-v]
PhenGO v0.1.1 - Convert phenotype and GO data to ARFF format
Required Options:
-species SPECIES Species tag (e.g., fly, yeast)
-phenotype_file PHENOTYPE_FILE
Path to the phenotype data file (.gz)
-gene_association_file GENE_ASSOCIATION_FILE
Path to the gene association file (.gz)
-go_obo_file GO_OBO_FILE
Path to the go.obo file
-output_dir OUTPUT_DIR
Output directory
Optional parameters:
-filter_unused_gos Filter out unused GO terms from the FUNC and ARFF
output (default: True)
-filter_mixed_terms Filter out genes which have both lethal and viable
phenotypes - Terms not specifically lethal/viable are
not counted in this (default: False)
-gene_go_pheno Output "Gene-GO-Phenotype" (Rbbp5 GO:0003674 0) file
for overrepresentation analysis with tools such as
FUNC (default: False)
Fly specific parameters:
-fly_assignments FLY_ASSIGNMENTS
Provide TSV file of fly assignments (file confirming
genes are assignment to drosophila melanogaster
(default: "data/fly/FlyBase_Fields_2017.txt.gz")
-driver_lines DRIVER_LINES
Provide TSV file of fly driver lines (file containing
the name of driver lines (RNAi) to ignore when present
with the "with" tag (default: "data/fly/FlyBase_Driver
Line_Fields_2025_08_05.txt.gz")
-filt_with Filter out phenotype with "with" tag (default: DO NOT
FILTER)
Worm specific parameters:
-worm_phenotypes WORM_PHENOTYPES
Provide TSV file of worm phenotypes (default:
"data/worm/WS297_lethal_terms.tsv.gz")
Mouse specific parameters:
-mouse_phenotypes MOUSE_PHENOTYPES
Provide TSV file of mouse phenotypes (default:
"data/mouse/mouse_lethal_terms.txt.gz")
Misc:
-v, --version show program's version number and exit
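The behaviour described for -filter_mixed_terms can be illustrated with a short sketch. It is only a reading of the help text above, not the tool's implementation: the function name `filter_mixed` and the toy phenotype labels are hypothetical, and keeping genes that have neither a lethal nor a viable call is an assumption.

```python
def filter_mixed(gene_phenotypes):
    """Drop genes that carry both a lethal and a viable phenotype.

    gene_phenotypes maps gene name -> iterable of phenotype labels.
    Labels other than "lethal"/"viable" are ignored for this decision,
    mirroring the note in the help text.
    """
    kept = {}
    for gene, phenotypes in gene_phenotypes.items():
        calls = set(phenotypes) & {"lethal", "viable"}
        if len(calls) == 2:
            continue  # mixed evidence: skip this gene
        kept[gene] = calls  # assumption: genes with no call are kept
    return kept


toy = {
    "g1": ["lethal", "viable"],   # mixed, removed
    "g2": ["lethal", "sterile"],  # "sterile" is not counted
    "g3": ["viable"],
    "g4": ["sterile"],
}
print(filter_mixed(toy))
```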
Compare-ARFF:
usage: compare_arff_genes.py [-h] -arff_a ARFF_A -arff_b ARFF_B -o OUTPUT
PhenoGO v0.1.1 - Compare-ARFF: Compare two ARFF files.
options:
-h, --help show this help message and exit
-arff_a ARFF_A Master ARFF file (reference)
-arff_b ARFF_B Comparison ARFF file
-o OUTPUT Output CSV file
Output:
Compare-ARFF writes a CSV file that summarises the comparison between the two ARFF files. In the example below, each row gives a gene, its label in each file, any GO terms that differ, and a status.
Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
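A comparison CSV like the one above can be tallied with the standard library. The sketch below reuses the example rows verbatim; the helper name `summarise` is hypothetical, and it assumes that multiple GO terms in the "GO Terms Differ" column are separated by ";" as in the GeneC row.

```python
import csv
import io
from collections import Counter

EXAMPLE_CSV = '''Gene,Label A,Label B,GO Terms Differ,Status
GeneA,lethal,,,"MISSING_IN_B"
GeneB,lethal,viable,,"LABEL_MISMATCH"
GeneC,viable,viable,GO:0008150;GO:0003674,"GO_TERM_MISMATCH"
GeneD,viable,viable,,"EXACT_MATCH"
'''


def summarise(handle):
    """Count rows per status and collect the differing GO terms per gene."""
    status_counts = Counter()
    go_diffs = {}
    for row in csv.DictReader(handle):
        status_counts[row["Status"]] += 1
        if row["GO Terms Differ"]:
            # Assumption: ";" separates GO terms, as in the example row.
            go_diffs[row["Gene"]] = row["GO Terms Differ"].split(";")
    return status_counts, go_diffs


status_counts, go_diffs = summarise(io.StringIO(EXAMPLE_CSV))
print(dict(status_counts))
print(go_diffs)
```

For a real output file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.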