Skip to main content

Evolutionary Genome Topology — analysis toolkit for chromosome evolution across metazoan genomes using reciprocal-best-hits data.

Project description

egt — Evolutionary Genome Topology

egt is a Python / Snakemake analysis toolkit for characterizing chromosome evolution across metazoan genomes. It builds on reciprocal-best-hits data from odp and provides tools for:

  • ALG (ancestral linkage group) fusion, dispersal, and rate analyses
  • PhyloTreeUMAP: manifold projection of per-species ALG state (MGT, MLT, and one-dot-one-genome variants)
  • perspective-chromosome reconstruction with Monte Carlo support
  • branch-wise rate analyses against a calibrated tree
  • Fourier-period analysis of rate time series
  • phylogenetic subsampling, tree prep, taxonomy utilities

Getting Started

git clone https://github.com/conchoecia/egt.git
cd egt
python -m venv .venv && source .venv/bin/activate
pip install -e .

egt --help
bash tests/smoke/test_cli.sh

Primary input is a directory of per-species RBH files produced by odp against the BCnS ALG database. From there, most analyses are a single egt <subcommand> call or a Snakefile under workflows/.

Quick Start

PhyloTreeUMAP — manifold projection of per-species ALG state

# 1. build per-sample distance matrices + sampledf
egt phylotreeumap build-distances \
    --rbh-dir /path/to/rbh_files \
    --alg-name BCnSSimakov2022 \
    --sampledf-out GTUMAP/sampledf.tsv \
    --distance-dir GTUMAP/distance_matrices/

# 2. index ALG locus pairs
egt phylotreeumap algcomboix \
    --alg-rbh /path/to/LG_db/BCnSSimakov2022/BCnSSimakov2022.rbh \
    --output GTUMAP/alg_combo_to_ix.tsv

# 3. run the UMAP + HTML plot (MGT / MLT / ODOG variants)
egt phylotreeumap mgt-mlt-umap --help

ALG fusion analysis on a calibrated tree

egt alg-fusions --help

Perspective-chromosome tree mapping + Monte Carlo rates

egt perspchrom-df-to-tree --help

Rate analyses, Fourier periodicity, branch stats

egt branch-stats-vs-time    --help
egt fourier-of-rates        --help
egt fourier-support-vs-time --help
egt collapsed-tree          --help
egt tree-changes            --help
egt decay-pairwise          --help
egt decay-many-species      --help

Phylogeny preparation

egt taxids-to-newick           --help
egt newick-to-common-ancestors --help

Users' Guide

egt is a collection of analysis scripts rather than a monolithic pipeline. Each script is also registered as a subcommand of the egt console script:

egt alg-fusions --help
# equivalent to
python -m egt.plot_alg_fusions --help

Installation

git clone https://github.com/conchoecia/egt.git
cd egt
python -m venv .venv && source .venv/bin/activate
pip install -e .

Python requirements

Python 3.10 or newer. pip install -e . pulls the deps from pyproject.toml:

  • numpy, pandas, scipy, scikit-learn, matplotlib, networkx, Pillow
  • umap-learn[plot] — UMAP + the plotting extras needed by PhyloTreeUMAP
  • bokeh — interactive HTML plots
  • ete4 — taxonomy trees and NCBI taxid handling
  • snakemake (>=7, <9)
  • pyyaml

Conda equivalent:

mamba install -c conda-forge -c bioconda \
      python=3.11 numpy pandas scipy scikit-learn matplotlib networkx pillow \
      "umap-learn" bokeh ete4 "snakemake<9" pyyaml
pip install --no-deps -e .

Upstream tools

egt consumes outputs of several companion tools:

  • odp — per-species RBH files, ALG databases (BCnSSimakov2022 etc.)
  • chrombase — chromosome-scale NCBI genome database builder
  • genbargo — embargo-aware assembly curation
  • chromsim — chromosome-evolution simulations

CLI overview

phylotreeumap             — UMAP-over-ALG-topology (MGT, MLT, ODOG subcommands)
phylotreeumap-subsample   — subsample species phylogenetically with per-clade caps
alg-fusions               — plot fusion events on a phylogeny (canonical v3)
alg-dispersion            — plot ALG dispersion across species
perspchrom-df-to-tree     — map perspective-chromosome changes onto a tree (Monte Carlo)
decay-pairwise            — pairwise ALG-decay analysis
decay-many-species        — cross-species ALG conservation / decay
chrom-number-vs-changes   — chromosome count vs rearrangement-rate scatter
branch-stats-vs-time      — branch statistics against geologic time
branch-stats-tree         — branch statistics laid out on a tree
branch-stats-tree-pair    — paired branch-stats tree plots
collapsed-tree            — collapsed-tree visualization
tree-changes              — per-branch changes on a tree
fourier-of-rates          — Fourier analysis of chromosomal change rates
fourier-support-vs-time   — Fourier-support-vs-time plots
count-unique-changes      — count unique changes per branch
defining-features         — identify clade-defining features
defining-features-plot    — plot defining features
defining-features-plotRBH — plot defining features on RBH dataframes
taxids-to-newick          — build a Newick tree from NCBI taxids
newick-to-common-ancestors — divergence-time annotation from a timetree
algs-split-across-scaffolds — find ALGs split across scaffolds
get-assembly-sizes        — summarize assembly sizes
pull-entries-from-yaml    — select rows from a YAML sample list
aggregate-filechecker     — aggregate filechecker benchmarks
aggregate-filesizes       — aggregate file-size summaries
join-supplementary-tables — join table fragments
phylotreeumap-plotdfs     — PhyloTreeUMAP plotting dataframe helper

Snakemake workflows

Multi-stage Snakemake definitions live under workflows/:

workflows/
├── phylotree_umap.smk
├── phylotree_umap_subsampling.smk
├── perspchrom_df_stats_and_mc.smk
├── annotate_sample_df.smk
├── sample_to_num_chromosomes.smk
├── odol_annotate_blast.smk
└── pipeline/
    ├── README.md
    ├── config.template.yaml
    └── run.sh

Each workflow is standalone and parameterized via a YAML config.

Input file formats

  • RBH files (.rbh) — tab-separated reciprocal-best-hits output of odp. Filenames must embed the NCBI taxid as the second hyphen-separated field, e.g. speciesname-7777-something.rbh.
  • Sample dataframe (sampledf.tsv) — output of egt phylotreeumap build-distances; consumed by most downstream commands.
  • ALG database RBH — e.g. BCnSSimakov2022.rbh, from odp's LG_db.
  • Newick trees — ete4-readable. egt taxids-to-newick emits these.
  • Divergence-time tables — TSV, as accepted by egt newick-to-common-ancestors.

Layout

egt/
├── src/egt/                    — Python package
│   ├── cli.py                  — argparse dispatcher
│   ├── _vendor/                — vendored, frozen plotting utilities
│   ├── legacy/                 — prior versions of plot_ALG_fusions kept for parity
│   └── *.py                    — one module per subcommand
├── workflows/                  — Snakemake workflows
├── configs/                    — example configs
├── data/                       — small bundled data
├── tests/
│   ├── testdb/                 — mini_hydra + mini_urchin fixtures
│   └── smoke/test_cli.sh       — CLI smoke test
└── docs/

Related tools

Citing egt

If you use this toolkit, please cite:

Schultz, D.T., Blümel, A., Destanović, D., Sarigol, F., Simakov, O. (2024). Topological mixing and irreversibility in animal chromosome evolution. bioRxiv. doi:10.1101/2024.07.29.605683

For background on the topological framework for comparative genomics, see:

Schultz, D.T., Simakov, O. (2026). Topological Approaches in Animal Comparative Genomics. Annual Review of Animal Biosciences 14(1), 17–48. doi:10.1146/annurev-animal-030424-084541

See also CITATION.cff.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

egt-0.1.0.tar.gz (386.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

egt-0.1.0-py3-none-any.whl (405.5 kB view details)

Uploaded Python 3

File details

Details for the file egt-0.1.0.tar.gz.

File metadata

  • Download URL: egt-0.1.0.tar.gz
  • Upload date:
  • Size: 386.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for egt-0.1.0.tar.gz
Algorithm Hash digest
SHA256 65c4c9daa8cef1191c2cf88fc368ff645fc4f48f9afe0c53ca7ebe243a7ab838
MD5 d473841258b574dc49c37bd6e0d723e3
BLAKE2b-256 724d3512fdd7a9fb4b9c0da0d222db83c7f33e61d115728060548b4a2e779dfc

See more details on using hashes here.

File details

Details for the file egt-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: egt-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 405.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for egt-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5791d37d80f25dad65f3194b1ab483d4a46f28a68b824da59f90188495328326
MD5 64bbf8a83a993b7eb2f37f8bc7b00194
BLAKE2b-256 60cf5074e33b42edef6f70fd3b8fbfb92bf7ffe430fb0174d54912ec0938e60d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page