Parse HyPhy and PAML codon-selection analysis outputs into tidy CSVs for downstream, user-defined analyses.
evolharvester
evolharvester is a command-line utility for parsing HyPhy JSONs and PAML report files for further analysis/custom visualization of selection modeling results.
It reformats JSONs and report text files (whether from Datamonkey, local pipelines, archived results, etc.) into standardized CSV files for downstream analysis in the user's preferred development environment.
Original data are exported as long-format CSVs. Broadly, rows are keyed by sequence or branch, and codon-/site-level data are captured in vectorized, JSON-like columns. This preserves multi-resolution results in a single, unified output CSV.
Installation
From PyPI
pip install evolharvester
From source (development)
git clone https://github.com/apulvino/evolharvester.git
cd evolharvester
pip install -e .
Requirements
evolharvester requires Python ≥ 3.10. Other dependencies (numpy ≥ 1.24, pandas ≥ 2.0, biopython ≥ 1.85) are installed automatically during setup from PyPI or source.
Supported analyses
evolharvester parses outputs from HyPhy and PAML substitution-based selection analyses, with individual harvesters for each method/model variant, invokable from an integrated command-line interface.
NOTE: Although evolharvester extracts results at multiple resolutions wherever possible, HyPhy does not claim statistical confidence uniformly across all resolutions. Always cross-reference the HyPhy and PAML documentation to understand the statistical resolution of your models' results in the context of your evolharvester CSVs.
NOTE (II): In addition to simplifying downstream parsing, statistics, and visualization, evolharvester is built to handle outputs from multiple runs. Data are therefore intentionally pre-filtered for ω>1, pval<0.05, and padj<0.10 wherever possible, so that the output CSV does not become unmanageably large for downstream work in the user's preferred development environment.
Supported HyPhy methods
- GARD: Genetic Algorithm for Recombination Detection — identifies recombination breakpoints in coding sequence alignments
- BUSTED: Branch-Site Unrestricted Statistical Test for Episodic Diversification — tests for gene-/tree-wide positive selection
- FUBAR: Fast Unconstrained Bayesian AppRoximation — site-level posterior probabilities of pervasive selection across tree
- FEL: Fixed Effects Likelihood — site-level tests for pervasive selection across tree
- MEME: Mixed Effects Model of Evolution — site-level tests for episodic positive selection across tree
- aBSREL: adaptive Branch-Site Random Effects Likelihood — branch-level tests for episodic selection
PAML programs
- baseml: nucleotide-substitution likelihood models for tree-wide rate inference
- codeml: codon-substitution likelihood models. Site-model variants (M0, M1a, M2a, M7, M8) for inferring positively selected sites. Branch-model variants (OneRatio, FreeRatio) for inferring branch-specific selection patterns.
- yn00: pairwise dN/dS estimation via Yang & Nielsen (2000)
codeml model coverage
evolharvester provides individual harvesters for each codeml model variant:
- M0 (one-ratio across the tree)
- M1a (NearlyNeutral)
- M2a (PositiveSelection)
- M7 (beta)
- M8 (beta & ω > 1)
- OneRatio (alternative single-ω fit)
- FreeRatio (per-branch ω)
Quickstart
evolharvester is called with eh. Each parser takes one input file (or a directory containing matching outputs) and writes an output CSV.
Parse a HyPhy FEL output
eh fel --input my_FEL_results.json --output ./FEL_HARVEST/
This reads my_FEL_results.json (a HyPhy FEL JSON) and writes ./FEL_HARVEST/fel_filtered_stats.csv with one row per branch, vectorized site-level information, and gene-level summary fields repeated on every branch row of the corresponding tree. Of critical note: although evolharvester extracts results at multiple resolutions wherever possible, HyPhy does not claim statistical confidence uniformly across all resolutions. Always cross-reference the HyPhy documentation to understand which statistical resolution your model results can support.
Parse a PAML codeml M2a output
eh codemlM2a --input my_M2a_run/M2a.out --output ./M2A_HARVEST/
This reads M2a.out (the codeml report file) and writes ./M2A_HARVEST/codeml_M2a_filtered_stats.csv with one row per branch, gene-level summary fields broadcast across rows, and site-level NEB/BEB posterior probabilities as vectorized columns.
List all supported parsers
eh --list-tools
Prints the names of all available HyPhy methods, PAML programs, and codeml model variants.
Usage
General syntax
eh <selection_analysis_name> --input <input_path> --output <output_path> [--verbose]
- <selection_analysis_name>: any of the harvesters listed via eh --list-tools. Names are case-insensitive (e.g. codemlM2a and codemlm2a are equivalent).
- <input_path>: a single file, a directory, or a glob pattern. See Input handling modes below.
- <output_path>: either a target CSV file or a directory. If a directory, the parser writes its own default filename inside it (e.g. fel_filtered_stats.csv, codeml_M2a_filtered_stats.csv).
- --verbose: print per-file progress information to stderr.
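For scripted workflows, the general syntax above can be assembled programmatically. The sketch below is a minimal illustration and not part of evolharvester: build_eh_command is a hypothetical helper, and it relies only on the flags documented above (--input, --output, --verbose).

```python
import shlex


def build_eh_command(tool, input_path, output_path, verbose=False):
    """Assemble the argument list for one `eh` call (hypothetical helper)."""
    cmd = ["eh", tool, "--input", str(input_path), "--output", str(output_path)]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_eh_command("fel", "results/MyGene_FEL.json", "parsed_results/", verbose=True)
# Show the shell-quoted command line that would be executed.
print(shlex.join(cmd))
```

To execute the command, the list can be passed to subprocess.run(cmd, check=True), which assumes evolharvester is installed and eh is on the PATH.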
Worked examples
HyPhy site-level method (FEL)
eh fel --input results/MyGene_FEL.json --output parsed_results/
Produces parsed_results/fel_filtered_stats.csv with one row per branch and vectorized/JSON-style codon site columns.
HyPhy branch-level method (aBSREL)
eh absrel --input results/MyGene_aBSREL.json --output parsed_results/
Produces parsed_results/absrel_filtered_stats.csv with one row per branch.
PAML codeml site model (M2a)
eh codemlM2a --input results/MyGene/codeml/M2a.out --output parsed_results/
Produces parsed_results/codeml_M2a_filtered_stats.csv with one row per branch and site-level NEB/BEB posterior probabilities as vectorized columns.
PAML codeml branch model (FreeRatio)
eh codemlFreeRatio --input results/MyGene/codeml/FreeRatio.out --output parsed_results/
Produces parsed_results/codeml_FreeRatio_filtered_stats.csv with one row per branch and per-branch dN/dS estimates.
Input handling modes
Every evolharvester parser accepts input in three forms.
Single file:
eh fel --input MyGene_FEL.json --output parsed/
The file is parsed directly without invoking any discovery logic.
Directory input (cascading discovery):
When a directory is supplied, evolharvester applies cascading discovery logic. Patterns are tried in order, and the first pattern that yields a match is used:
- Gene-centric layout first. Looks for the expected output filename inside a parser-specific subdirectory one level beneath each gene directory. For codeml parsers, the pattern is <input>/<gene>/codeml/<filename>.out. For baseml, it is <input>/<gene>/baseml/baseml.out. HyPhy parsers similarly look for matching JSON files in standard subdirectory layouts.
- Flat layout second. If the gene-centric pattern finds no matches, the parser looks for the expected filename directly inside the input directory: <input>/<filename>.out.
- Recursive fallback last. If neither the gene-centric nor the flat pattern matches, the parser performs a recursive search for matching files anywhere within the input directory tree.
Examples:
## gene-centric layout: my_results/APOC1/codeml/M2a.out exists
eh codemlM2a --input my_results/ --output parsed/
## matches my_results/<gene>/codeml/M2a.out for each gene
## flat layout: my_dir/M2a.out exists
eh codemlM2a --input my_dir/ --output parsed/
## matches my_dir/M2a.out
## non-canonical layout: M2a.out files scattered at unknown depths
eh codemlM2a --input my_messy_dir/ --output parsed/
## initiates recursive search, matches any M2a.out within the directory tree
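The cascade described above can be sketched in a few lines of Python. This is an illustration of the documented order (gene-centric, then flat, then recursive), not evolharvester's actual implementation; discover_outputs is a hypothetical function, and APOE is a made-up second gene used only to populate the temporary directory.

```python
import tempfile
from pathlib import Path


def discover_outputs(input_dir, subdir, filename):
    """Return (layout, files) from the first layout that yields a match."""
    root = Path(input_dir)
    # 1. Gene-centric layout: <input>/<gene>/<subdir>/<filename>
    hits = sorted(root.glob(f"*/{subdir}/{filename}"))
    if hits:
        return "gene-centric", hits
    # 2. Flat layout: <input>/<filename>
    flat = root / filename
    if flat.is_file():
        return "flat", [flat]
    # 3. Recursive fallback: the filename at any depth
    return "recursive", sorted(root.rglob(filename))


with tempfile.TemporaryDirectory() as tmp:
    for gene in ("APOC1", "APOE"):
        target = Path(tmp) / gene / "codeml"
        target.mkdir(parents=True)
        (target / "M2a.out").write_text("placeholder")
    layout, files = discover_outputs(tmp, "codeml", "M2a.out")
    # In the gene-centric layout the gene is the grandparent directory.
    genes = [f.parent.parent.name for f in files]

print(layout, genes)
```

Because the function returns at the first layout that matches, a directory containing both gene-centric and stray files would only yield the gene-centric ones in this sketch.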
Glob pattern:
eh fel --input "results/*_FEL.json" --output parsed/
## quotes are required to prevent shell expansion before evolharvester sees the pattern
The path is treated as a shell-style glob, and every matching file is parsed.
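To preview which files a quoted pattern would select before running a parser, Python's glob module applies the same shell-style wildcard rules. The sketch below assumes evolharvester's matching behaves like standard shell-style globbing; the file names are invented for the demonstration.

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty placeholder files; only two end in _FEL.json.
    for name in ("GeneA_FEL.json", "GeneB_FEL.json", "GeneA_MEME.json"):
        (Path(tmp) / name).write_text("{}")
    # Escape the directory part so only the filename part acts as a pattern.
    pattern = os.path.join(glob.escape(tmp), "*_FEL.json")
    matches = sorted(Path(p).name for p in glob.glob(pattern))

print(matches)
```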
When a directory is passed and matches are found via any of the three discovery patterns, files from multiple genes are combined into a single output CSV. Gene names are extracted from the directory structure. For files matched via the gene-centric layout (inside a parser-specific subdirectory such as codeml/ or baseml/), the gene name is taken from the grandparent directory's name. For files matched via the flat layout or recursive fallback, the gene name defaults to the file stem (e.g. M2a from M2a.out), which may not match the user's intended gene identifier. Users whose results are not nested under a gene-named directory will need to relabel these values afterwards, for example by replacing the placeholder 'M2a' with the correct gene name.
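The gene-naming rule above can be expressed as a small function. This is a sketch of the described behavior rather than evolharvester's code: infer_gene_name and PARSER_SUBDIRS are hypothetical names, and only the two parser subdirectories mentioned in the text are listed.

```python
from pathlib import Path

# Parser-specific subdirectories named in the text (assumed, not exhaustive).
PARSER_SUBDIRS = {"codeml", "baseml"}


def infer_gene_name(path):
    """Grandparent directory for gene-centric layouts, file stem otherwise."""
    p = Path(path)
    if p.parent.name in PARSER_SUBDIRS and p.parent.parent.name:
        return p.parent.parent.name
    return p.stem


gene_centric = infer_gene_name("my_results/APOC1/codeml/M2a.out")
flat = infer_gene_name("my_dir/M2a.out")
print(gene_centric, flat)
```

The second call shows why flat or recursively discovered files end up with a placeholder such as M2a in the gene column.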
Output format
evolharvester produces tidy, long-format CSVs designed for downstream analysis in pandas/Jupyter notebooks, RStudio, or other development environments. Each parser produces its own CSV, and although column schemas vary, a few design principles are consistent across evolharvester outputs.
Design principles
Branch-keyed long format. Almost uniformly, evolharvester parsers produce one row per branch or pairwise-branch (branch_id column), with all method-specific statistics for each observation as additional columns. This applies to all PAML codeml parsers (M0, M1a, M2a, M7, M8, OneRatio, FreeRatio), PAML baseml, and HyPhy aBSREL, FEL, MEME, and FUBAR.
Method-level summary fields broadcast across branch rows. For multi-gene runs, gene-level metadata (sequence count, alignment length, log-likelihood, model identity, derived AIC/BIC values, etc.) is written into every branch row of the corresponding evolharvester result CSV. This simplifies grouping, filtering, and joining in downstream analysis.
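Because gene-level fields repeat on every branch row, they can be collapsed back to one record per gene by keeping the first row seen for each gene. The sketch below uses only the standard library and a hypothetical three-row excerpt; the column names (seq, branch_id, branch_omega, lnL, np) follow the codeml schema described in this document, while the values and the GENE2 label are invented. The AIC line uses the standard definition AIC = 2k - 2 lnL, which may or may not match how evolharvester computes its own AIC column.

```python
import csv
import io

# Hypothetical excerpt: lnL and np repeat on every branch row of a gene.
csv_text = """seq,branch_id,branch_omega,lnL,np
APOC1,1..2,0.21,-1234.5,11
APOC1,1..3,0.35,-1234.5,11
GENE2,1..2,0.10,-987.6,11
"""

per_gene = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # Broadcast fields are identical within a gene, so the first row suffices.
    per_gene.setdefault(row["seq"], {"lnL": float(row["lnL"]), "np": int(row["np"])})

for gene, stats in per_gene.items():
    # Standard AIC: twice the parameter count minus twice the log-likelihood.
    stats["AIC"] = 2 * stats["np"] - 2 * stats["lnL"]
    print(gene, stats)
```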
Vectorized site- or codon-level columns. Where a parser captures granular site- or codon-level data (e.g. site-level posterior probabilities in M2a/M8, per-site p-values in FEL/MEME, codon usage tables in codeml), these are encoded as bracketed, JSON-like list strings in dedicated columns. Each branch row carries the full vector of site positions along with consistently ordered per-site statistics, allowing site-level analysis to be reconstructed.
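Reconstructing site-level records from the vectorized columns amounts to decoding each list string and zipping the vectors together. The sketch below assumes the cells are valid JSON arrays of equal length; the row is a hypothetical example using the MEME column names listed in this document (gene, branch, site_positions, site_pval), and the values are invented.

```python
import csv
import io
import json

# Hypothetical single-row excerpt with two vectorized columns.
csv_text = '''gene,branch,site_positions,site_pval
MyGene,Node1,"[12, 45, 88]","[0.003, 0.04, 0.01]"
'''

site_rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    positions = json.loads(row["site_positions"])
    pvals = json.loads(row["site_pval"])
    # The vectors are consistently ordered, so element i of each refers to the same site.
    for pos, p in zip(positions, pvals):
        site_rows.append({"gene": row["gene"], "branch": row["branch"], "site": pos, "pval": p})

for record in site_rows:
    print(record)
```

If a column turns out to hold Python-style rather than strict JSON lists, ast.literal_eval from the standard library is a possible substitute for json.loads.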
Exceptions where branch-keying does not apply. Two parsers depart from the branch-keyed pattern because their underlying methods don't produce branch-level results:
- HyPhy BUSTED (gene-level test) produces one row per gene
- HyPhy GARD (recombination breakpoint detection) produces one row per partition
HyPhy parsers
| Parser | Row identity | Key fields |
|---|---|---|
| aBSREL | One row per branch | branch_id, species, aBSREL_LRT, aBSREL_pvalue, aBSREL_pvalue_corrected, aBSREL_branch_dN/dS/omega, aBSREL_omega_classes, aBSREL_omega_weights, aBSREL_max_omega |
| BUSTED | One row per gene | gene, branches, pval, lrt, omega_purifying/neutral/positive, proportion_purifying/neutral/positive, partition info |
| FEL | One row per gene with site data vectorized | gene_id, FEL_n_sites, FEL_n_significant, FEL_n_positive, FEL_n_negative, FEL_sig_sites (vector), FEL_site_pvalues (vector), FEL_site_omegas (vector), FEL_site_LRTs (vector) |
| FUBAR | One row per branch with site data vectorized | gene, branch, sites, alpha, beta, Prob[alpha>beta], Prob[alpha<beta], BayesFactor[alpha<beta] |
| GARD | One row per partition | gene, number_of_sequences, number_of_sites, breakpoint detection columns (potentialBreakpoints, partition_bps, improvements_breakpoints, improvements_deltaAICc, site_positions vector, site_supports vector) |
| MEME | One row per branch with site data vectorized | gene, branch, site_positions (vector), site_pval (vector), site_LRT (vector), site_alpha, site_beta_plus, site_MEMElogl, site_substitution |
PAML parsers
Common columns: PAML codeml parsers
The seven PAML codeml parsers (M0, M1a, M2a, M7, M8, OneRatio, FreeRatio) share a core schema with some model-specific extensions:
| Column | Meaning |
|---|---|
| seq | gene name identifier (taken from input file or directory name) |
| branch_id | PAML branch identifier (e.g. 1..2, 5) |
| from_node, to_node | name of branch (or split internal-branch identifier) |
| from_name, to_name | resolved sequence/node names |
| t | branch length (substitutions per codon) |
| N, S | nonsynonymous and synonymous site counts |
| branch_omega | dN/dS for this branch |
| dN, dS | dN and dS values for this branch |
| N_dN, S_dS | counts of nonsynonymous and synonymous changes |
| ns, ls | number of sequences and alignment length |
| lnL | log-likelihood of the model fit |
| kappa | transition/transversion rate ratio |
| tree_length | total tree length under the model |
| model | model name as reported by codeml |
| codon_freq_model | codon frequency model used (e.g. F3X4) |
| codon_pos_base_seq | per-sequence codon-position × base composition table (vectorized) |
| codon_usage_counts | per-sequence codon usage counts (vectorized) |
| np | number of free parameters |
| AIC, BIC | information criteria/post-hoc calculation |
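One common downstream use of the lnL and np columns is a likelihood-ratio test between nested codeml site models such as M1a and M2a. The sketch below assumes the two fits differ by exactly two free parameters and that a chi-square approximation with 2 degrees of freedom is appropriate; lrt_two_param is a hypothetical function, and the numeric values are invented rather than taken from evolharvester output.

```python
import math


def lrt_two_param(lnL_null, np_null, lnL_alt, np_alt):
    """Likelihood-ratio test for nested fits that differ by two parameters."""
    if np_alt - np_null != 2:
        raise ValueError("this closed form only applies when np differs by 2")
    # Test statistic: twice the log-likelihood gain, floored at 0 for numerical noise.
    stat = max(0.0, 2.0 * (lnL_alt - lnL_null))
    # The chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical lnL/np values for an M1a (null) and M2a (alternative) fit of one gene.
stat, p_value = lrt_two_param(lnL_null=-1240.0, np_null=10, lnL_alt=-1234.5, np_alt=12)
print(f"LRT = {stat:.2f}, p = {p_value:.4f}")
```

Checking the np difference before applying the closed form guards against comparing fits that are not separated by two parameters; other model pairs would need the general chi-square distribution.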
Model-specific extensions
- codemlM0 / codemlM1a / codemlM2a / codemlOneRatio: Add omega_global (the single tree-wide ω-value reported for single-rate fits)
- codemlM0 / codemlOneRatio / codemlFreeRatio: Add tree_length_dN and tree_length_dS (separate dN and dS tree lengths reported by codeml for one-ratio and branch-model fits)
- codemlM1a / codemlM2a / codemlM8: Add p_siteclasses, w_siteclasses (site-class proportions and ω values from site-model fits)
- codemlM2a / codemlM8: Add Bayes Empirical Bayes and Naive Empirical Bayes site-level posteriors and associated statistics (BEB_Pr_w_gt1, BEB_post_mean, BEB_post_SE, BEB_signif, NEB_* equivalents), per-site coordinates (site_coords), BEB reference sequence (BEB_ref_seq), grid posteriors, and diagnostic counts (num_BEB_sites, num_BEB_ge95, num_BEB_ge99, num_NEB_sites, num_NEB_ge95, num_NEB_ge99)
- codemlM7 / codemlM8: Add beta distribution parameters (beta_p, beta_q), site-class MLE vectors (MLE_p, MLE_w), Newick tree with branch lengths (tree_newick_with_lengths), additional log-likelihood diagnostics (lnL_ntime, lnL_np), and Bayesian inference status (bayes_flags)
- codemlM8: Adds two additional beta-distribution parameters specific to M8's selection-class extension (beta_p0, beta_w)
- codemlFreeRatio: Adds free_w_branch_values (per-branch ω vector), dS_tree_newick, dN_tree_newick, w_node_label_newick (Newick tree with branch labels), and w_node_label_map (parsed mapping of branch labels to ω values)
- codemlM2a / codemlM7 / codemlM8 / codemlFreeRatio: Add conv_msg (tracks any notes in the report on convergence failure)
- codemlM2a / codemlM7 / codemlM8: Add notes (empty issue tracker column)
baseml
The PAML baseml parser produces one row per branch under nucleotide-substitution models. Output includes branch-level fields (Branch_ID, Branch_Length_t, parent, is_internal, n_descendants, root_to_tip, branch_support), per-gene model parameters (GTR rate matrix elements GTR_a through GTR_e, nucleotide stationary frequencies piA/piC/piG/piT, transition/transversion ratio kappa with source attribution), pairwise sequence diagnostics (pairwise_kappa_*, pairwise_distance_* mean/min/max), compositional homogeneity tests (hom_X2, hom_G), and post-hoc information criteria (AIC, AICc, BIC).
yn00
The PAML yn00 parser produces one row per sequence pair from pairwise dN/dS estimation. Each row includes the primary (Yang-Nielsen 2000) estimates (S, N, t, kappa, omega, dN, dN_SE, dS, dS_SE), plus alternative method estimates: (Li-Wu-Luo 1985) (LWL85_*), (Li-Wu-Luo 1985) modified (LWL85m_*), and (Li-Pamilo-Bianchi 1993) (LPB93_*). Pairwise diagnostics (pair_n_identical_codons, pair_pct_identity, etc.), per-position GC content (GC_pos1 through GC_total), data-quality flags (dS_is_zero, any_nan_inf, omega_placeholder_flag), and codon-level vector columns (codon_usage_counts, codon_pos_base_seq, codon_pos_base_avg) are also captured.
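The data-quality flags make it straightforward to drop pairs whose ω estimate is unreliable before further analysis. The sketch below is illustrative only: it assumes the flags are written as true/false-like text, which should be checked against an actual yn00 harvest, and both is_flagged and the three example rows are hypothetical.

```python
import csv
import io

# Text values treated as a raised flag (an assumption about the CSV encoding).
TRUE_STRINGS = {"true", "1", "yes"}


def is_flagged(value):
    """Interpret a CSV cell as a boolean flag (hypothetical helper)."""
    return str(value).strip().lower() in TRUE_STRINGS


# Hypothetical rows: one clean pair, one with dS = 0, one with a NaN estimate.
csv_text = """gene_id,omega,dS_is_zero,any_nan_inf
MyGene,0.42,False,False
MyGene,99.0,True,False
MyGene,nan,False,True
"""

usable = [
    row
    for row in csv.DictReader(io.StringIO(csv_text))
    if not is_flagged(row["dS_is_zero"]) and not is_flagged(row["any_nan_inf"])
]
print(len(usable), "usable pair(s)")
```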
Notes on schema consistency
evolharvester is in active development, and column naming conventions vary across parsers reflecting differences in formatting convention.
The gene identifier column appears as seq in codeml parsers, Gene_ID in baseml, gene_id in yn00 and some HyPhy parsers (FEL, aBSREL), and gene in others (BUSTED, FUBAR, MEME, GARD).
Branch identifiers similarly differ (branch_id in codeml, Branch_ID in baseml, branch in MEME and FUBAR).
Future releases will work toward a unified schema across parsers.
Users joining outputs from multiple parsers should be aware of and account for these differences.
Generated CSV headers are the authoritative schema reference for v0.1.0.
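Until the schemas are unified, one way to join outputs from different parsers is to map the identifier columns listed above onto common names first. The sketch below is a minimal illustration: the alias sets come from the column names given in this section, while harmonize and the target names gene and branch are arbitrary choices, not an evolharvester convention.

```python
# Identifier column names reported in this section, grouped by meaning.
GENE_ALIASES = {"seq", "Gene_ID", "gene_id", "gene"}
BRANCH_ALIASES = {"branch_id", "Branch_ID", "branch"}


def harmonize(row):
    """Return a copy of a CSV row dict with unified identifier keys."""
    out = {}
    for key, value in row.items():
        if key in GENE_ALIASES:
            out["gene"] = value
        elif key in BRANCH_ALIASES:
            out["branch"] = value
        else:
            out[key] = value
    return out


# Hypothetical rows as they might come from csv.DictReader for two parsers.
codeml_row = {"seq": "APOC1", "branch_id": "1..2", "branch_omega": "0.21"}
baseml_row = {"Gene_ID": "APOC1", "Branch_ID": "1..2", "Branch_Length_t": "0.05"}

print(harmonize(codeml_row))
print(harmonize(baseml_row))
```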
Citation
If you use evolharvester in published work, please cite via:
Pulvino, A.T. (2026). evolharvester (v0.1.0). GitHub repository. https://github.com/apulvino/evolharvester
A peer-reviewed Application Note describing evolharvester is in preparation. Citation details will be updated upon acceptance.
A Zenodo archive with a citable DOI is forthcoming and will be linked in the badges above.
License
evolharvester is released under the MIT License. See LICENSE for details.
Author
Anthony T. Pulvino — Northwestern University, Interdisciplinary Biological Sciences (IBiS) Graduate Program
evolharvester was developed to support comparative genomics research for HyPhy and PAML users. I hope that easier access to their outputs helps grow the user base of both tools, each of which is a highly valuable contribution to the research community.
Contact
For bug reports, feature requests, or questions about evolharvester, please open an issue on the GitHub issue tracker.