
Gene Annotation Across Diverse Fungal Species Using Deep Learning


geneML

geneML is a deep learning–based tool for fungal gene prediction.

Installation

The only requirement is Python 3.9 or higher.

Using virtualenv

Start with a fresh python virtual environment:

python -m venv geneml
. geneml/bin/activate
# Now install the latest release from PyPI:
pip install geneml

Using conda

Or use a conda environment:

conda create -n geneml -c conda-forge python=3.13 pip
conda activate geneml
# Now install the latest release from PyPI:
pip install geneml

Directly from the repo

Or install directly from this repo (to get access to the latest changes):

git clone https://github.com/hexagonbio/geneML.git
# The ./ prefix makes pip install the cloned directory instead of looking up "geneML" on PyPI:
pip install ./geneML

Frequent options

Basic command:

geneml genome.fasta

To enable verbose mode:

geneml genome.fasta -v

To change the output path:

geneml genome.fasta -o genome_output.gff3

To run only on selected contigs:

geneml genome.fasta --contigs-filter NC_092406.1,NC_092407.1
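
As a rough illustration of how the comma-separated value for --contigs-filter could be assembled, the sketch below collects contig names from FASTA headers. It assumes that a contig name is the first whitespace-delimited token of the header line, which is a common convention but is not confirmed here for geneML. The function names and the header descriptions are hypothetical.

```python
def fasta_contig_names(lines):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def contigs_filter_value(names):
    """Join contig names into a comma-separated string."""
    return ",".join(names)


# Illustrative FASTA content; the descriptions after the IDs are made up.
example = [
    ">NC_092406.1 chromosome 1",
    "ACGTACGT",
    ">NC_092407.1 chromosome 2",
    "TTGACCAA",
]
names = list(fasta_contig_names(example))
print(contigs_filter_value(names))
```

The printed string can be passed as the argument to --contigs-filter.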

To write nucleotide and protein sequences of the predicted genes (one sequence per transcript):

geneml genome.fasta -g genes.fna
geneml genome.fasta -p proteins.faa

By default, geneML reports several transcripts per locus when there are multiple high-scoring options (up to 5 per gene).
You can change this maximum, for example to report only the best transcript:

geneml genome.fasta --max-transcripts 1

When the input contains at least 100,000 bp in total, geneML dynamically determines the minimum score threshold for reporting genes and transcripts.
You can override this threshold with a fixed value, for example:

geneml genome.fasta --min-gene-score 0.5
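
When geneML is called from a script, the options above can be combined into one command line. The following is a minimal sketch, not part of geneML itself: build_geneml_command is a hypothetical helper, and it uses only flags that appear in this README.

```python
import shlex


def build_geneml_command(sequence, output=None, contigs=None,
                         max_transcripts=None, min_gene_score=None,
                         verbose=False):
    """Assemble an argument list for the geneml command-line tool."""
    cmd = ["geneml", sequence]
    if output is not None:
        cmd += ["-o", output]
    if contigs:
        # --contigs-filter takes a single comma-separated string.
        cmd += ["--contigs-filter", ",".join(contigs)]
    if max_transcripts is not None:
        cmd += ["--max-transcripts", str(max_transcripts)]
    if min_gene_score is not None:
        # Either a float or the string "dynamic".
        cmd += ["--min-gene-score", str(min_gene_score)]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = build_geneml_command(
    "genome.fasta",
    output="genome_output.gff3",
    max_transcripts=1,
    min_gene_score=0.5,
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) in an environment where geneml is installed.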

Output

geneML writes gene annotations in GFF3 format.

Fields

For each predicted gene, transcript, exon and CDS feature, the GFF3 file contains one line with the following nine tab-separated columns:

contig_name  source  feature_type  start  end  feature_score  strand  phase  identifiers

Note: Because geneML does not include untranslated regions in its predictions, CDS features are identical to exon features, except that CDS features also carry a phase value.
For more information on the GFF3 format, see: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
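
To make the column layout concrete, here is a minimal parsing sketch for a single feature line. It relies only on the standard GFF3 rules (nine tab-separated columns, "." for an undefined value). The example line is invented for illustration; in particular, the contents of the source column and the coordinates are assumptions, not actual geneML output.

```python
from typing import NamedTuple, Optional


class Gff3Feature(NamedTuple):
    contig_name: str
    source: str
    feature_type: str
    start: int                      # 1-based, inclusive
    end: int                        # 1-based, inclusive
    feature_score: Optional[float]
    strand: str
    phase: Optional[int]            # "." in the file becomes None
    identifiers: str                # raw attributes column


def parse_gff3_line(line):
    """Return a Gff3Feature, or None for blank lines and '#' lines."""
    text = line.rstrip("\n")
    if not text.strip() or text.startswith("#"):
        return None
    cols = text.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    return Gff3Feature(
        contig_name=cols[0],
        source=cols[1],
        feature_type=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        feature_score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        identifiers=cols[8],
    )


# Hypothetical CDS line; the values are illustrative only.
example_line = (
    "NC_092406.1\tgeneML\tCDS\t1201\t1500\t0.93\t+\t0\t"
    "ID=GML000001_mRNA1_CDS1;Parent=GML000001_mRNA1"
)
feature = parse_gff3_line(example_line)
print(feature.feature_type, feature.end - feature.start + 1)
```

Because GFF3 coordinates are 1-based and inclusive, the feature length is end - start + 1.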

Identifiers

Each feature has a unique ID and, for child features, a Parent attribute specifying the parent ID.

Feature      Example ID
Gene         GML000001
Transcript   GML000001_mRNA1
Exon         GML000001_mRNA1_exon1
CDS          GML000001_mRNA1_CDS1
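
The ID and Parent attributes make it possible to rebuild the gene, transcript, exon/CDS hierarchy. The sketch below assumes the attributes column uses the standard GFF3 "key=value;key=value" syntax and reuses the example IDs from the table above. Both function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column):
    """Split a GFF3 attributes column into a dict of key/value pairs."""
    attrs = {}
    for item in column.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def children_by_parent(attribute_columns):
    """Map each parent ID to the IDs of its direct child features."""
    tree = defaultdict(list)
    for column in attribute_columns:
        attrs = parse_attributes(column)
        # Top-level features (genes) have no Parent and are skipped here.
        if "ID" in attrs and "Parent" in attrs:
            tree[attrs["Parent"]].append(attrs["ID"])
    return dict(tree)


columns = [
    "ID=GML000001",
    "ID=GML000001_mRNA1;Parent=GML000001",
    "ID=GML000001_mRNA1_exon1;Parent=GML000001_mRNA1",
    "ID=GML000001_mRNA1_CDS1;Parent=GML000001_mRNA1",
]
tree = children_by_parent(columns)
for parent, children in tree.items():
    print(parent, "->", ", ".join(children))
```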

Scores

The feature score ranges from 0 to 1 and measures how well the prediction agrees with the raw probabilities output by the geneML CNN.
A higher score indicates higher prediction confidence. The directive ##geneml-mean-gene-score (found at the top of the GFF3 file) stores the average score of the predicted genes.
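
One possible use of these scores is to keep only genes that score at least as high as the file-wide mean. The sketch below rests on two assumptions that this README does not confirm: that the directive value follows the directive name on the same line, separated by whitespace, and that gene features use the type "gene" in the third column. The example lines are invented.

```python
MEAN_SCORE_DIRECTIVE = "##geneml-mean-gene-score"


def read_mean_gene_score(lines):
    """Return the mean gene score from the header directive, or None."""
    for line in lines:
        if line.startswith(MEAN_SCORE_DIRECTIVE):
            # Assumed layout: "<directive> <value>".
            value = line[len(MEAN_SCORE_DIRECTIVE):].strip(" \t:=\n")
            return float(value)
        if line.strip() and not line.startswith("#"):
            break  # first feature line reached; directives are at the top
    return None


def genes_at_or_above(lines, threshold):
    """Return the attributes column of every gene scoring >= threshold."""
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene" or cols[5] == ".":
            continue
        if float(cols[5]) >= threshold:
            kept.append(cols[8])
    return kept


# Hypothetical file content for illustration only.
gff_lines = [
    "##gff-version 3",
    "##geneml-mean-gene-score 0.80",
    "ctg1\tgeneML\tgene\t100\t900\t0.95\t+\t.\tID=GML000001",
    "ctg1\tgeneML\tgene\t1500\t2100\t0.40\t-\t.\tID=GML000002",
]
mean_score = read_mean_gene_score(gff_lines)
print(mean_score, genes_at_or_above(gff_lines, mean_score))
```

If the directive is formatted differently in real output, only the value extraction in read_mean_gene_score needs to change.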

Full Usage

geneml --help
usage: geneml [-h] [--version] [-o OUTPUT] [-g GENES] [-p PROTEINS] [--gene-id-prefix GENE_ID_PREFIX] [-m MODEL] [-cl CONTEXT_LENGTH] [-c CORES] [-v] [-d]
              [--cpu-only] [--strand {forward,reverse,both}] [--contigs-filter CONTIGS_FILTER] [--write-raw-scores]
              [--max-transcripts MAX_TRANSCRIPTS] [--allow-opposite-strand-overlaps {true,false}] [--min-gene-score MIN_GENE_SCORE]
              [--min-exon-size MIN_EXON_SIZE] [--max-exon-size MAX_EXON_SIZE] [--min-intron-size MIN_INTRON_SIZE]
              [--max-intron-size MAX_INTRON_SIZE] [--cds-start-min-score CDS_START_MIN_SCORE]
              [--cds-end-min-score CDS_END_MIN_SCORE] [--exon-start-min-score EXON_START_MIN_SCORE]
              [--exon-end-min-score EXON_END_MIN_SCORE] [--gene-candidates GENE_CANDIDATES]
              sequence

geneML 1.0.0

positional arguments:
  sequence              Sequence file in FASTA/GenBank/EMBL format.

options:
  -h, --help            Show this help message and exit.
  --version             Show version number and exit.
  -o OUTPUT, --output OUTPUT
                        Gene annotations output path (default: based on input filename).
  -g GENES, --genes GENES
                        Gene sequences output path (default: None).
  -p PROTEINS, --proteins PROTEINS
                        Protein sequences output path (default: None).
  --gene-id-prefix GENE_ID_PREFIX
                        Prefix for gene IDs in output (default: None).
  -m MODEL, --model MODEL
                        Path to model file (default: models/geneML_default.keras).
  -cl CONTEXT_LENGTH, --context-length CONTEXT_LENGTH
                        Context length of the model.
  -c CORES, --cores CORES
                        Number of cores to use for processing (default: all available).

advanced options:
  -v, --verbose         Enable verbose mode.
  -d, --debug           Enable debug mode.
  --cpu-only            Use CPU only for inference, disable GPU usage.
  --strand {forward,reverse,both}
                        On which strand to predict genes (default: both).
  --contigs-filter CONTIGS_FILTER
                        Run only on selected contigs (comma separated string).
  --write-raw-scores    Instead of running gene calling, output the raw model scores as a .seg file.
  --max-transcripts MAX_TRANSCRIPTS
                        Maximum number of transcripts per gene (default: 5).
  --allow-opposite-strand-overlaps {true,false}
                        Predict overlapping genes on opposite strands (default: true).
  --min-gene-score MIN_GENE_SCORE
                        Minimum gene score for gene reporting. Can be a float value or 'dynamic' (default: dynamic). Dynamic mode
                        requires >=100,000 bp total input.
  --min-exon-size MIN_EXON_SIZE
                        Minimum exon size (default: 1).
  --max-exon-size MAX_EXON_SIZE
                        Maximum exon size (default: 30000).
  --min-intron-size MIN_INTRON_SIZE
                        Minimum intron size (default: 10).
  --max-intron-size MAX_INTRON_SIZE
                        Maximum intron size (default: 400).
  --cds-start-min-score CDS_START_MIN_SCORE
                        Minimum model score for considering a CDS start (default: 0.01).
  --cds-end-min-score CDS_END_MIN_SCORE
                        Minimum model score for considering a CDS end (default: 0.01).
  --exon-start-min-score EXON_START_MIN_SCORE
                        Minimum model score for considering an exon start (default: 0.01).
  --exon-end-min-score EXON_END_MIN_SCORE
                        Minimum model score for considering an exon end (default: 0.01).
  --gene-candidates GENE_CANDIDATES
                        Maximum number of gene candidates to consider per locus (default: 5000).
