Gene Annotation Across Diverse Fungal Species Using Deep Learning

geneML

geneML is a deep learning–based tool for fungal gene prediction.

Installation

The only requirement is Python 3.9 or higher.

Using virtualenv

Start with a fresh python virtual environment:

python -m venv geneml
. geneml/bin/activate
# Now install the latest release from PyPI:
pip install geneml

Using conda

Or use a conda environment:

conda create -n geneml -c conda-forge python=3.13 pip
conda activate geneml
# Now install the latest release from PyPI:
pip install geneml

Directly from the repo

Or install directly from this repo (to get access to the latest changes):

git clone https://github.com/hexagonbio/geneML.git
# The ./ prefix makes pip install the local checkout rather than the PyPI release:
pip install ./geneML

Frequent options

Basic command:

geneml genome.fasta

To enable verbose mode:

geneml genome.fasta -v

To change the output path:

geneml genome.fasta -o genome_output.gff3

To run only on selected contigs:

geneml genome.fasta --contigs-filter NC_092406.1,NC_092407.1
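
The --contigs-filter value is a comma-separated string of contig names. As a minimal sketch, the Python snippet below collects the names from a FASTA file and joins them into that string. It assumes that geneML matches contigs by the first whitespace-delimited token of each FASTA header, which this README does not confirm; the helper name contig_names is hypothetical and not part of geneML.

```python
import io

def contig_names(fasta_handle):
    """Return the first token after '>' on each FASTA header line."""
    names = []
    for line in fasta_handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.append(header.split()[0])
    return names

# Toy FASTA for illustration; a real run would use open("genome.fasta").
example = io.StringIO(
    ">NC_092406.1 chromosome 1\nACGT\n"
    ">NC_092407.1 chromosome 2\nGGCC\n"
)
selected = contig_names(example)

# Assumption: geneML expects exactly these names, joined by commas.
contigs_filter = ",".join(selected)
print(contigs_filter)
```

The printed string can then be passed as the argument of --contigs-filter.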

To write nucleotide and protein sequences of the predicted genes (one sequence per transcript):

geneml genome.fasta -g genes.fna
geneml genome.fasta -p proteins.faa

By default, geneML outputs multiple transcripts per locus when there are several high-scoring options (up to 5 per gene).
You can change the maximum number of transcripts, for example to output only the best transcript:

geneml genome.fasta --max-transcripts 1

With enough input data (at least 100,000 bp in total), geneML dynamically determines the minimum score threshold for reporting genes and transcripts.
You can override this threshold manually, for example:

geneml genome.fasta --min-gene-score 0.5
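
When running geneML on many genomes, it can help to assemble the command line in a script. The following is a minimal Python sketch, not part of geneML: the function build_geneml_command and its parameter names are hypothetical, and it only uses flags listed in this README.

```python
import shlex

def build_geneml_command(sequence, output=None, contigs=None,
                         max_transcripts=None, min_gene_score=None):
    """Assemble a geneml argument list from optional settings."""
    cmd = ["geneml", sequence]
    if output is not None:
        cmd += ["-o", output]
    if contigs:
        cmd += ["--contigs-filter", ",".join(contigs)]
    if max_transcripts is not None:
        cmd += ["--max-transcripts", str(max_transcripts)]
    if min_gene_score is not None:
        # Either a float or the string "dynamic", as described in the help text.
        cmd += ["--min-gene-score", str(min_gene_score)]
    return cmd

cmd = build_geneml_command(
    "genome.fasta",
    output="genome_output.gff3",
    max_transcripts=1,
    min_gene_score=0.5,
)
print(shlex.join(cmd))
```

To actually run the command, pass the list to subprocess.run(cmd, check=True) in an environment where geneml is installed.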

Output

geneML writes gene annotations in GFF3 format.

Fields

For each predicted gene, transcript, exon and CDS feature, the GFF3 includes:

contig_name  source  feature_type  start  end  feature_score  strand  phase  identifiers

Note: As geneML does not include untranslated regions in its predictions, CDS features are identical to exon features, except that CDS features also carry a phase value.
For more information on the GFF3 format, see: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
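
The nine tab-separated columns can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the example line is made up for illustration (in particular the source value "geneML" and the coordinates are not taken from real geneML output), and the names Feature and parse_gff3_line are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote

class Feature(NamedTuple):
    contig_name: str
    source: str
    feature_type: str
    start: int
    end: int
    feature_score: Optional[float]
    strand: str
    phase: Optional[int]
    identifiers: dict

def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    identifiers = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attribute values.
            identifiers[unquote(key)] = unquote(value)
    return Feature(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        None if cols[5] == "." else float(cols[5]),  # "." means no score
        cols[6],
        None if cols[7] == "." else int(cols[7]),    # "." means no phase
        identifiers,
    )

# Illustrative line only; values are assumptions, not real geneML output.
line = ("NC_092406.1\tgeneML\tCDS\t1201\t1500\t0.93\t+\t0\t"
        "ID=GML000001_mRNA1_CDS1;Parent=GML000001_mRNA1")
feature = parse_gff3_line(line)
print(feature.feature_type, feature.start, feature.end, feature.identifiers["Parent"])
```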

Identifiers

Each feature has a unique ID and, for child features, a Parent attribute specifying the parent ID.

Feature     Example ID
Gene        GML000001
Transcript  GML000001_mRNA1
Exon        GML000001_mRNA1_exon1
CDS         GML000001_mRNA1_CDS1
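
The ID and Parent attributes are enough to rebuild the gene, transcript, exon/CDS hierarchy. The sketch below is illustrative only: it uses the example IDs from the table, pairs them with the Parent values one would expect from their names (an assumption), and the helper descendants is hypothetical.

```python
from collections import defaultdict

# (ID, Parent) pairs as they would appear in the GFF3 attributes column.
# The parent assignments are assumed from the ID naming pattern.
records = [
    ("GML000001", None),
    ("GML000001_mRNA1", "GML000001"),
    ("GML000001_mRNA1_exon1", "GML000001_mRNA1"),
    ("GML000001_mRNA1_CDS1", "GML000001_mRNA1"),
]

# Map each parent ID to the IDs of its direct children.
children = defaultdict(list)
for feature_id, parent_id in records:
    if parent_id is not None:
        children[parent_id].append(feature_id)

def descendants(feature_id):
    """Return every feature below feature_id, depth first."""
    found = []
    for child in children.get(feature_id, []):
        found.append(child)
        found.extend(descendants(child))
    return found

print(descendants("GML000001"))
```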

Scores

The feature score ranges from 0 to 1 and measures how well the prediction agrees with the raw probabilities output by the geneML CNN.
A higher score indicates a higher prediction confidence. The directive ##geneml-mean-gene-score (found at the top of the GFF file) stores the average score of predicted genes.
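
As a rough sketch of how the scores might be used downstream, the Python snippet below reads the mean gene score directive and keeps only gene features at or above a chosen score. The exact layout of the directive line (here assumed to be the directive name followed by a number) and the feature type string "gene" are assumptions, the toy input is made up, and summarize_scores is a hypothetical helper.

```python
import io
import re

DIRECTIVE = "##geneml-mean-gene-score"

def summarize_scores(gff_handle, min_score):
    """Return (mean gene score from the directive, gene lines with score >= min_score)."""
    mean_score = None
    kept = []
    for line in gff_handle:
        line = line.rstrip("\n")
        if line.startswith(DIRECTIVE):
            # Assumption: the value is the first number after the directive name.
            match = re.search(r"\d+(?:\.\d+)?", line[len(DIRECTIVE):])
            if match:
                mean_score = float(match.group())
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if (len(cols) == 9 and cols[2] == "gene"
                and cols[5] != "." and float(cols[5]) >= min_score):
            kept.append(line)
    return mean_score, kept

# Toy GFF3 content for illustration; not real geneML output.
example = io.StringIO(
    "##gff-version 3\n"
    "##geneml-mean-gene-score 0.80\n"
    "c1\tgeneML\tgene\t1\t900\t0.95\t+\t.\tID=GML000001\n"
    "c1\tgeneML\tgene\t2000\t2600\t0.65\t-\t.\tID=GML000002\n"
)
mean_score, kept = summarize_scores(example, min_score=0.8)
print(mean_score, len(kept))
```

Filtering after the fact like this leaves the original GFF3 untouched; to apply a threshold during prediction, use --min-gene-score instead.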

Full Usage

geneml --help
usage: geneml [-h] [--version] [-o OUTPUT] [-g GENES] [-p PROTEINS] [--gene-id-prefix GENE_ID_PREFIX] [-m MODEL] [-cl CONTEXT_LENGTH] [-c CORES] [-v] [-d]
              [--cpu-only] [--strand {forward,reverse,both}] [--contigs-filter CONTIGS_FILTER] [--write-raw-scores]
              [--max-transcripts MAX_TRANSCRIPTS] [--allow-opposite-strand-overlaps {true,false}] [--min-gene-score MIN_GENE_SCORE]
              [--min-exon-size MIN_EXON_SIZE] [--max-exon-size MAX_EXON_SIZE] [--min-intron-size MIN_INTRON_SIZE]
              [--max-intron-size MAX_INTRON_SIZE] [--cds-start-min-score CDS_START_MIN_SCORE]
              [--cds-end-min-score CDS_END_MIN_SCORE] [--exon-start-min-score EXON_START_MIN_SCORE]
              [--exon-end-min-score EXON_END_MIN_SCORE] [--gene-candidates GENE_CANDIDATES]
              sequence

geneML 1.0.0

positional arguments:
  sequence              Sequence file in FASTA/GenBank/EMBL format.

options:
  -h, --help            Show this help message and exit.
  --version             Show version number and exit.
  -o OUTPUT, --output OUTPUT
                        Gene annotations output path (default: based on input filename).
  -g GENES, --genes GENES
                        Gene sequences output path (default: None).
  -p PROTEINS, --proteins PROTEINS
                        Protein sequences output path (default: None).
  --gene-id-prefix GENE_ID_PREFIX
                        Prefix for gene IDs in output (default: None).
  -m MODEL, --model MODEL
                        Path to model file (default: models/geneML_default.keras).
  -cl CONTEXT_LENGTH, --context-length CONTEXT_LENGTH
                        Context length of the model.
  -c CORES, --cores CORES
                        Number of cores to use for processing (default: all available).

advanced options:
  -v, --verbose         Enable verbose mode.
  -d, --debug           Enable debug mode.
  --cpu-only            Use CPU only for inference, disable GPU usage.
  --strand {forward,reverse,both}
                        On which strand to predict genes (default: both).
  --contigs-filter CONTIGS_FILTER
                        Run only on selected contigs (comma separated string).
  --write-raw-scores    Instead of running gene calling, output the raw model scores as a .seg file.
  --max-transcripts MAX_TRANSCRIPTS
                        Maximum number of transcripts per gene (default: 5).
  --allow-opposite-strand-overlaps {true,false}
                        Predict overlapping genes on opposite strands (default: true).
  --min-gene-score MIN_GENE_SCORE
                        Minimum gene score for gene reporting. Can be a float value or 'dynamic' (default: dynamic). Dynamic mode
                        requires >=100,000 bp total input.
  --min-exon-size MIN_EXON_SIZE
                        Minimum exon size (default: 1).
  --max-exon-size MAX_EXON_SIZE
                        Maximum exon size (default: 30000).
  --min-intron-size MIN_INTRON_SIZE
                        Minimum intron size (default: 10).
  --max-intron-size MAX_INTRON_SIZE
                        Maximum intron size (default: 400).
  --cds-start-min-score CDS_START_MIN_SCORE
                        Minimum model score for considering a CDS start (default: 0.01).
  --cds-end-min-score CDS_END_MIN_SCORE
                        Minimum model score for considering a CDS end (default: 0.01).
  --exon-start-min-score EXON_START_MIN_SCORE
                        Minimum model score for considering an exon start (default: 0.01).
  --exon-end-min-score EXON_END_MIN_SCORE
                        Minimum model score for considering an exon end (default: 0.01).
  --gene-candidates GENE_CANDIDATES
                        Maximum number of gene candidates to consider per locus (default: 5000).
