Evolutionary Genome Topology — analysis toolkit for chromosome evolution across metazoan genomes using reciprocal-best-hits data.
Project description
egt — Evolutionary Genome Topology
egt is a Python / Snakemake analysis toolkit for characterizing
chromosome evolution across metazoan genomes. It builds on reciprocal-best-hits
data from odp and provides tools for:
- ALG (ancestral linkage group) fusion, dispersal, and rate analyses
- PhyloTreeUMAP: manifold projection of per-species ALG state (MGT, MLT, and one-dot-one-genome variants)
- perspective-chromosome reconstruction with Monte Carlo support
- branch-wise rate analyses against a calibrated tree
- Fourier-period analysis of rate time series
- phylogenetic subsampling, tree prep, taxonomy utilities
Getting Started
git clone https://github.com/conchoecia/egt.git
cd egt
python -m venv .venv && source .venv/bin/activate
pip install -e .
egt --help
bash tests/smoke/test_cli.sh
Primary input is a directory of per-species RBH files produced by odp
against the BCnS ALG database. From there, most analyses are a single
egt <subcommand> call or a Snakefile under workflows/.
Quick Start
PhyloTreeUMAP — manifold projection of per-species ALG state
# 1. build per-sample distance matrices + sampledf
egt phylotreeumap build-distances \
--rbh-dir /path/to/rbh_files \
--alg-name BCnSSimakov2022 \
--sampledf-out GTUMAP/sampledf.tsv \
--distance-dir GTUMAP/distance_matrices/
# 2. index ALG locus pairs
egt phylotreeumap algcomboix \
--alg-rbh /path/to/LG_db/BCnSSimakov2022/BCnSSimakov2022.rbh \
--output GTUMAP/alg_combo_to_ix.tsv
# 3. run the UMAP + HTML plot (MGT / MLT / ODOG variants)
egt phylotreeumap mgt-mlt-umap --help
ALG fusion analysis on a calibrated tree
egt alg-fusions --help
Perspective-chromosome tree mapping + Monte Carlo rates
egt perspchrom-df-to-tree --help
Rate analyses, Fourier periodicity, branch stats
egt branch-stats-vs-time --help
egt fourier-of-rates --help
egt fourier-support-vs-time --help
egt collapsed-tree --help
egt tree-changes --help
egt decay-pairwise --help
egt decay-many-species --help
Phylogeny preparation
egt taxids-to-newick --help
egt newick-to-common-ancestors --help
Users' Guide
egt is a collection of analysis scripts rather than a monolithic pipeline.
Each script is also registered as a subcommand of the egt console script:
egt alg-fusions --help
# equivalent to
python -m egt.plot_alg_fusions --help
Installation
git clone https://github.com/conchoecia/egt.git
cd egt
python -m venv .venv && source .venv/bin/activate
pip install -e .
Python requirements
Python 3.10 or newer. pip install -e . pulls the deps from pyproject.toml:
- numpy, pandas, scipy, scikit-learn, matplotlib, networkx, Pillow
- umap-learn[plot] — UMAP + the plotting extras needed by PhyloTreeUMAP
- bokeh — interactive HTML plots
- ete4 — taxonomy trees and NCBI taxid handling
- snakemake (>=7, <9)
- pyyaml
Conda equivalent:
mamba install -c conda-forge -c bioconda \
python=3.11 numpy pandas scipy scikit-learn matplotlib networkx pillow \
"umap-learn" bokeh ete4 "snakemake<9" pyyaml
pip install --no-deps -e .
Upstream tools
egt consumes outputs of several companion tools:
odp— per-species RBH files, ALG databases (BCnSSimakov2022 etc.)chrombase— chromosome-scale NCBI genome database buildergenbargo— embargo-aware assembly curationchromsim— chromosome-evolution simulations
CLI overview
phylotreeumap — UMAP-over-ALG-topology (MGT, MLT, ODOG subcommands)
phylotreeumap-subsample — subsample species phylogenetically with per-clade caps
alg-fusions — plot fusion events on a phylogeny (canonical v3)
alg-dispersion — plot ALG dispersion across species
perspchrom-df-to-tree — map perspective-chromosome changes onto a tree (Monte Carlo)
decay-pairwise — pairwise ALG-decay analysis
decay-many-species — cross-species ALG conservation / decay
chrom-number-vs-changes — chromosome count vs rearrangement-rate scatter
branch-stats-vs-time — branch statistics against geologic time
branch-stats-tree — branch statistics laid out on a tree
branch-stats-tree-pair — paired branch-stats tree plots
collapsed-tree — collapsed-tree visualization
tree-changes — per-branch changes on a tree
fourier-of-rates — Fourier analysis of chromosomal change rates
fourier-support-vs-time — Fourier-support-vs-time plots
count-unique-changes — count unique changes per branch
defining-features — identify clade-defining features
defining-features-plot — plot defining features
defining-features-plotRBH — plot defining features on RBH dataframes
taxids-to-newick — build a Newick tree from NCBI taxids
newick-to-common-ancestors — divergence-time annotation from a timetree
algs-split-across-scaffolds — find ALGs split across scaffolds
get-assembly-sizes — summarize assembly sizes
pull-entries-from-yaml — select rows from a YAML sample list
aggregate-filechecker — aggregate filechecker benchmarks
aggregate-filesizes — aggregate file-size summaries
join-supplementary-tables — join table fragments
phylotreeumap-plotdfs — PhyloTreeUMAP plotting dataframe helper
Snakemake workflows
Multi-stage Snakemake definitions live under workflows/:
workflows/
├── phylotree_umap.smk
├── phylotree_umap_subsampling.smk
├── perspchrom_df_stats_and_mc.smk
├── annotate_sample_df.smk
├── sample_to_num_chromosomes.smk
├── odol_annotate_blast.smk
└── pipeline/
├── README.md
├── config.template.yaml
└── run.sh
Each workflow is standalone and parameterized via a YAML config.
Input file formats
- RBH files (
.rbh) — tab-separated reciprocal-best-hits output ofodp. Filenames must embed the NCBI taxid as the second hyphen-separated field, e.g.speciesname-7777-something.rbh. - Sample dataframe (
sampledf.tsv) — output ofegt phylotreeumap build-distances; consumed by most downstream commands. - ALG database RBH — e.g.
BCnSSimakov2022.rbh, fromodp's LG_db. - Newick trees — ete4-readable.
egt taxids-to-newickemits these. - Divergence-time tables — TSV, as accepted by
egt newick-to-common-ancestors.
Layout
egt/
├── src/egt/ — Python package
│ ├── cli.py — argparse dispatcher
│ ├── _vendor/ — vendored, frozen plotting utilities
│ ├── legacy/ — prior versions of plot_ALG_fusions kept for parity
│ └── *.py — one module per subcommand
├── workflows/ — Snakemake workflows
├── configs/ — example configs
├── data/ — small bundled data
├── tests/
│ ├── testdb/ — mini_hydra + mini_urchin fixtures
│ └── smoke/test_cli.sh — CLI smoke test
└── docs/
Related tools
Citing egt
If you use this toolkit, please cite:
Schultz, D.T., Blümel, A., Destanović, D., Sarigol, F., Simakov, O. (2024). Topological mixing and irreversibility in animal chromosome evolution. bioRxiv. doi:10.1101/2024.07.29.605683
For background on the topological framework for comparative genomics, see:
Schultz, D.T., Simakov, O. (2026). Topological Approaches in Animal Comparative Genomics. Annual Review of Animal Biosciences 14(1), 17–48. doi:10.1146/annurev-animal-030424-084541
See also CITATION.cff.
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file egt-0.2.0.tar.gz.
File metadata
- Download URL: egt-0.2.0.tar.gz
- Upload date:
- Size: 413.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fd66154e588913e7a356b3a457f0950866de8fe94506024b70921c6ccc7d3522
|
|
| MD5 |
871c01ec8ae780501a257f009a994a24
|
|
| BLAKE2b-256 |
a672202749ec06707ef4255110386059bb7f4a724f5d66ce5578fc941bc725cf
|
File details
Details for the file egt-0.2.0-py3-none-any.whl.
File metadata
- Download URL: egt-0.2.0-py3-none-any.whl
- Upload date:
- Size: 435.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.15
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
094a0808d5831fe34789964dc9331eb56e3fd52dc066785a4b252059d131cc52
|
|
| MD5 |
f04ee5592d57ef63338fd812b21109b4
|
|
| BLAKE2b-256 |
ec5e4e27ac82daad098629e70e4e2ed78e01ed41328803dbc10e77cbe2d8fb2b
|