Parse HyPhy and PAML codon-selection analysis outputs into tidy CSVs for downstream, user-defined analyses.

evolharvester

evolharvester is a command-line utility for parsing HyPhy JSONs and PAML report files for further analysis/custom visualization of selection modeling results.

It reformats JSONs and report text files (whether from Datamonkey, local pipelines, archived results, etc.) into standardized CSV files for downstream analysis in the user's preferred development environment.

Results are exported as long-format CSVs. Broadly, rows are keyed by sequence or branch, while codon- and site-level data are captured in vectorized, JSON-like columns. This preserves multi-resolution results in a single, unified output CSV per parser.

Installation

From PyPI

pip install evolharvester

From source (development)

git clone https://github.com/apulvino/evolharvester.git
cd evolharvester
pip install -e .

Requirements

evolharvester requires Python ≥ 3.10. Its other dependencies (numpy ≥ 1.24, pandas ≥ 2.0, biopython ≥ 1.85) are installed automatically when installing from PyPI or from source.
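If an install misbehaves, the minimum versions listed above can be checked from Python. The following is a minimal sketch, not part of evolharvester: the helper names `version_tuple` and `check_environment` are hypothetical, and only the first two version components are compared.

```python
import sys
from importlib import metadata

# Minimum versions quoted in the Requirements paragraph.
MIN_PYTHON = (3, 10)
MIN_PACKAGES = {"numpy": (1, 24), "pandas": (2, 0), "biopython": (1, 85)}


def version_tuple(text, width=2):
    """Turn '2.1.4' or '1.85rc1' into a tuple of leading integers, e.g. (2, 1)."""
    parts = []
    for piece in text.split(".")[:width]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < width:  # pad '2' to (2, 0) so comparisons stay fair
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {"python": (sys.version.split()[0], sys.version_info[:2] >= MIN_PYTHON)}
    for name, minimum in MIN_PACKAGES.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, version_tuple(installed) >= minimum)
    return report


report = check_environment()
for name, (installed, ok) in report.items():
    print(f"{name}: {installed or 'not installed'} -> {'ok' if ok else 'too old or missing'}")
```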

Supported analyses

evolharvester parses outputs from HyPhy and PAML substitution-based selection analyses, with individual harvesters for each method/model variant, invokable from an integrated command-line interface.

NOTE: Although evolharvester extracts results at multiple resolutions wherever possible, HyPhy does not claim statistical confidence uniformly across all resolutions. Always cross-reference the HyPhy and PAML documentation to understand the statistical resolution of your models' results in the context of your evolharvester CSVs.

NOTE (II): Beyond simplifying downstream parsing, statistics, and visualization, evolharvester is built to handle outputs from multiple runs. Data are therefore intentionally pre-filtered for ω > 1, pval < 0.05, and padj < 0.10 wherever possible, so that CSV output does not become unmanageably large for downstream work in the user's preferred development environment.

Supported HyPhy methods

GARD: Genetic Algorithm for Recombination Detection — identifies recombination breakpoints in coding sequence alignments
BUSTED: Branch-Site Unrestricted Statistical Test for Episodic Diversification — tests for gene-/tree-wide positive selection
FUBAR: Fast Unconstrained Bayesian AppRoximation — site-level posterior probabilities of pervasive selection across tree
FEL: Fixed Effects Likelihood — site-level tests for pervasive selection across tree
MEME: Mixed Effects Model of Evolution — site-level tests for episodic positive selection across tree
aBSREL: adaptive Branch-Site Random Effects Likelihood — branch-level tests for episodic selection

PAML programs

baseml: nucleotide-substitution likelihood models for tree-wide rate inference
codeml: codon-substitution likelihood models. Site-model variants (M0, M1a, M2a, M7, M8) for inferring positively selected sites. Branch-model variants (OneRatio, FreeRatio) for inferring branch-specific selection patterns.
yn00: pairwise dN/dS estimation via Yang & Nielsen (2000)

codeml model coverage

evolharvester provides individual harvesters for each codeml model variant:

M0 (one-ratio across the tree)
M1a (NearlyNeutral)
M2a (PositiveSelection)
M7 (beta)
M8 (beta & ω > 1)
OneRatio (alternative single-ω fit)
FreeRatio (per-branch ω)

Quickstart

evolharvester is called with eh. Each parser takes one input file (or a directory containing matching outputs) and writes an output CSV.

Parse a HyPhy FEL output

eh fel --input my_FEL_results.json --output ./FEL_HARVEST/

This reads my_FEL_results.json (a HyPhy FEL JSON) and writes ./FEL_HARVEST/fel_filtered_stats.csv with one row per branch, vectorized site-level information, and gene-level summary fields repeated on every branch row of the corresponding tree. Note that although evolharvester extracts results at multiple resolutions wherever possible, HyPhy does not claim statistical confidence uniformly across all resolutions. Always cross-reference the HyPhy documentation to understand the statistical resolution your model results support.

Parse a PAML codeml M2a output

eh codemlM2a --input my_M2a_run/M2a.out --output ./M2A_HARVEST/

This reads my_M2a_run/M2a.out (the codeml report file) and writes ./M2A_HARVEST/codeml_M2a_filtered_stats.csv with one row per branch, gene-level summary fields broadcast across rows, and site-level NEB/BEB posterior probabilities as vectorized columns.

List all supported parsers

eh --list-tools

Prints the names of all available HyPhy methods, PAML programs, and codeml model variants.

Usage

General syntax

eh <selection_analysis_name> --input <input_path> --output <output_path> [--verbose]

<selection_analysis_name> — any of the harvesters listed via eh --list-tools. Names are case-insensitive (e.g. codemlM2a and codemlm2a are equivalent).
<input_path> — a single file, a directory, or a glob pattern. See Input handling below.
<output_path> — either a target CSV file or a directory. If a directory is given, the parser writes its default filename inside it (e.g. fel_filtered_stats.csv, codeml_M2a_filtered_stats.csv).
--verbose — print per-file progress information to stderr.
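When several harvesters need to run from a script or notebook, the general syntax above can be wrapped in a small helper. This is a sketch under stated assumptions: `build_eh_command` and `run_eh` are hypothetical names, only the documented `--input`, `--output`, and `--verbose` flags are used, and it is assumed that `eh` reports failure through a non-zero exit status.

```python
import shlex
import subprocess


def build_eh_command(tool, input_path, output_path, verbose=False):
    """Assemble the argument list for one `eh` call using only the documented flags."""
    command = ["eh", tool, "--input", str(input_path), "--output", str(output_path)]
    if verbose:
        command.append("--verbose")
    return command


def run_eh(tool, input_path, output_path, verbose=False):
    """Run `eh`; raises CalledProcessError if it exits with a non-zero status."""
    command = build_eh_command(tool, input_path, output_path, verbose)
    return subprocess.run(command, check=True)


# Hypothetical job list; the paths are placeholders, not files shipped with the package.
jobs = [
    ("fel", "results/MyGene_FEL.json", "parsed_results/"),
    ("codemlM2a", "results/MyGene/codeml/M2a.out", "parsed_results/"),
]
commands = [build_eh_command(*job, verbose=True) for job in jobs]
for command in commands:
    print(shlex.join(command))  # inspect the commands before calling run_eh(...)
```

Because the command is passed as a list without a shell, a glob pattern given as `input_path` reaches evolharvester unexpanded, which mirrors quoting the pattern on the command line.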

Worked examples

HyPhy site-level method (FEL)

eh fel --input results/MyGene_FEL.json --output parsed_results/

Produces parsed_results/fel_filtered_stats.csv with one row per branch and vectorized/JSON-style codon site columns.

HyPhy branch-level method (aBSREL)

eh absrel --input results/MyGene_aBSREL.json --output parsed_results/

Produces parsed_results/absrel_filtered_stats.csv with one row per branch.

PAML codeml site model (M2a)

eh codemlM2a --input results/MyGene/codeml/M2a.out --output parsed_results/

Produces parsed_results/codeml_M2a_filtered_stats.csv with one row per branch and site-level NEB/BEB posterior probabilities as vectorized columns.

PAML codeml branch model (FreeRatio)

eh codemlFreeRatio --input results/MyGene/codeml/FreeRatio.out --output parsed_results/

Produces parsed_results/codeml_FreeRatio_filtered_stats.csv with one row per branch and per-branch dN/dS estimates.

Input handling modes

Every evolharvester parser accepts input in three forms: a single file, a directory, or a glob pattern.

Single file:

eh fel --input MyGene_FEL.json --output parsed/

The file is parsed directly without invoking any discovery logic.

Directory input (cascading discovery):

When a directory is supplied, evolharvester applies a cascading discovery logic. Patterns are tried in order, and discovery stops at the first pattern that yields a match:

Gene-centric layout first. Looks for the expected output filename inside one subdirectory level beneath a parser-specific intermediate directory. For codeml parsers, the pattern is <input>/<gene>/codeml/<filename>.out. For baseml, it is <input>/<gene>/baseml/baseml.out. HyPhy parsers similarly look for matching JSON files in standard subdirectory layouts.
Flat layout second. If the gene-centric pattern finds no matches, the parser looks for the expected filename inside the input directory: <input>/<filename>.out.
Recursive fallback last. If neither gene-centric nor flat patterns match, the parser performs a recursive search for any matching files anywhere within the input directory tree.

Examples:

## gene-centric layout: my_results/APOC1/codeml/M2a.out exists
eh codemlM2a --input my_results/ --output parsed/
## matches my_results/<gene>/codeml/M2a.out for each gene

## flat layout: my_dir/M2a.out exists
eh codemlM2a --input my_dir/ --output parsed/
## matches my_dir/M2a.out

## non-canonical layout: M2a.out files scattered at unknown depths
eh codemlM2a --input my_messy_dir/ --output parsed/
## initiates a recursive search and matches any M2a.out within the directory tree
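The cascading discovery order described above can be sketched with pathlib. This is an illustration of the documented behaviour, not evolharvester's actual code: the function name `discover` and its arguments are hypothetical, and the demo builds throwaway directory trees rather than reading real results.

```python
import tempfile
from pathlib import Path


def discover(input_dir, subdir, filename):
    """Return (layout_name, sorted matches) for the first layout that matches."""
    root = Path(input_dir)
    # 1. Gene-centric: <input>/<gene>/<subdir>/<filename>
    matches = sorted(root.glob(f"*/{subdir}/{filename}"))
    if matches:
        return "gene-centric", matches
    # 2. Flat: <input>/<filename>
    flat = root / filename
    if flat.is_file():
        return "flat", [flat]
    # 3. Recursive fallback: any depth below <input> (may be empty)
    return "recursive", sorted(root.rglob(filename))


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Throwaway trees that mimic the three layouts from the examples.
    for rel in ("genes/APOC1/codeml/M2a.out", "genes/APOE/codeml/M2a.out",
                "flat/M2a.out", "messy/a/b/c/M2a.out"):
        target = tmp / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")

    layouts = {name: discover(tmp / name, "codeml", "M2a.out")[0]
               for name in ("genes", "flat", "messy")}
    gene_hits = len(discover(tmp / "genes", "codeml", "M2a.out")[1])

print(layouts)  # {'genes': 'gene-centric', 'flat': 'flat', 'messy': 'recursive'}
```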

Glob pattern:

eh fel --input "results/*_FEL.json" --output parsed/
## quotes are required to prevent shell expansion before evolharvester sees the pattern

The path is treated as a shell-style glob, and every matching file is parsed.

When a directory is passed and matches are found via any of the three discovery patterns, files from multiple genes are combined into a single output CSV. Gene names are extracted from the directory structure. For files matched via the gene-centric layout (inside a parser-specific subdirectory like codeml/ or baseml/), the gene name is taken from the grandparent directory's name. For files matched via the flat layout or recursive fallback, the gene name defaults to the file stem (e.g. M2a from M2a.out), which may not match the user's intended gene identifier. Users whose results are not nested under a gene-named directory will need to correct this afterwards, for example by replacing the placeholder 'M2a' with the correct gene name.
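The gene-naming rule above, and the clean-up it sometimes calls for, can be sketched as follows. Both helper names (`infer_gene_name`, `relabel`) are hypothetical, and the sketch infers the layout purely from the shape of the path, which is a simplification of whatever evolharvester does internally.

```python
from pathlib import Path


def infer_gene_name(path, parser_subdir):
    """Grandparent directory name for gene-centric hits, otherwise the file stem."""
    path = Path(path)
    if path.parent.name == parser_subdir and path.parent.parent.name:
        return path.parent.parent.name
    return path.stem


def relabel(rows, column, placeholder, gene_name):
    """Return copies of the rows with a placeholder gene label replaced."""
    return [
        {**row, column: gene_name} if row.get(column) == placeholder else dict(row)
        for row in rows
    ]


names = [infer_gene_name(p, "codeml") for p in (
    "my_results/APOC1/codeml/M2a.out",  # gene-centric layout
    "my_dir/M2a.out",                   # flat layout
    "my_messy_dir/x/y/M2a.out",         # recursive fallback
)]
print(names)  # ['APOC1', 'M2a', 'M2a']

# Swap the placeholder 'M2a' for the intended gene name in already-parsed rows.
fixed = relabel([{"seq": "M2a", "branch_id": "1..2"}], "seq", "M2a", "APOC1")
```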

Output format

evolharvester produces tidy, long-format CSVs designed for downstream analysis in pandas/Jupyter notebooks, RStudio, or other development environments. Each parser produces its own CSV, and although column schemas vary, a few design principles are consistent across evolharvester outputs.

Design principles

Branch-keyed long format. Almost uniformly, evolharvester parsers produce one row per branch or pairwise-branch (branch_id column), with all method-specific statistics for each observation as additional columns. This applies to all PAML codeml parsers (M0, M1a, M2a, M7, M8, OneRatio, FreeRatio), PAML baseml, and HyPhy aBSREL, FEL, MEME, and FUBAR.

Method-level summary fields broadcast across branch rows. For multi-gene runs, gene-level metadata (sequence count, alignment length, log-likelihood, model identity, post-hoc AIC/BIC calculations, etc.) is written into every branch row of the corresponding evolharvester result CSV. This simplifies grouping, filtering, and joining in downstream analysis.
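Because gene-level fields repeat on every branch row, a per-gene table can be recovered by keeping one copy per gene. A minimal sketch using only the standard library follows; the sample rows are invented, the column names (`seq`, `branch_id`, `branch_omega`, `lnL`, `np`) are taken from the codeml schema, and `gene_summary` is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt of a codeml output CSV; values are invented for illustration.
SAMPLE = """seq,branch_id,branch_omega,lnL,np
GENE_A,5..1,0.21,-1234.5,11
GENE_A,5..2,0.35,-1234.5,11
GENE_B,4..1,1.40,-987.6,9
"""

GENE_LEVEL = ["lnL", "np"]  # fields repeated on every branch row of a gene


def gene_summary(rows, gene_col, gene_level_cols):
    """Keep the first occurrence of the broadcast fields for each gene."""
    summary = {}
    for row in rows:
        summary.setdefault(row[gene_col], {c: row[c] for c in gene_level_cols})
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
per_gene = gene_summary(rows, "seq", GENE_LEVEL)

# Count branch rows per gene to confirm the broadcast structure.
branches_per_gene = {}
for row in rows:
    branches_per_gene[row["seq"]] = branches_per_gene.get(row["seq"], 0) + 1
```

Values stay as strings here, as read by `csv.DictReader`; convert them to numbers where arithmetic is needed.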

Vectorized site- or codon-level columns. Where a parser captures granular site- or codon-level data (e.g. site-level posterior probabilities in M2a/M8, per-site p-values in FEL/MEME, codon usage tables in codeml), these are encoded as bracketed, JSON-like list strings in dedicated columns. Each branch row carries the full vector of site positions along with consistently ordered per-site statistics, so site-level analyses can be reconstructed.
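To reconstruct a site-level table, the vectorized columns can be decoded and expanded into one record per branch and site. The sketch below assumes the list strings are valid JSON (with a fallback for Python-style literals) and that vectors in the same row share the same order and length. The sample data are invented, and `parse_vector` and `explode_sites` are hypothetical helpers. Column names follow the MEME parser.

```python
import ast
import csv
import io
import json


def parse_vector(cell):
    """Decode a bracketed list string such as "[1, 2, 3]" into a Python list."""
    if cell is None or cell.strip() == "":
        return []
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        # Assumption: non-JSON cells are Python-style literals (e.g. single quotes).
        return ast.literal_eval(cell)


def explode_sites(rows, key_cols, vector_cols):
    """Yield one record per (row, site) by zipping the vector columns together."""
    for row in rows:
        vectors = [parse_vector(row[c]) for c in vector_cols]
        for values in zip(*vectors):
            record = {k: row[k] for k in key_cols}
            record.update(dict(zip(vector_cols, values)))
            yield record


# Hypothetical MEME-style rows; values are invented for illustration.
SAMPLE = '''gene,branch,site_positions,site_pval
GENE_A,Node1,"[12, 45]","[0.01, 0.03]"
GENE_A,taxonX,"[12, 45]","[0.01, 0.03]"
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
per_site = list(explode_sites(rows, ["gene", "branch"], ["site_positions", "site_pval"]))
print(per_site[0])
```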

Exceptions where branch-keying does not apply. Two parsers depart from the branch-keyed pattern because their underlying methods don't produce branch-level results:

HyPhy BUSTED (gene-level test) produces one row per gene
HyPhy GARD (recombination breakpoint detection) produces one row per partition

HyPhy parsers

Parser	Row identity	Key fields
aBSREL	One row per branch	`branch_id`, `species`, `aBSREL_LRT`, `aBSREL_pvalue`, `aBSREL_pvalue_corrected`, `aBSREL_branch_dN/dS/omega`, `aBSREL_omega_classes`, `aBSREL_omega_weights`, `aBSREL_max_omega`
BUSTED	One row per gene	`gene`, `branches`, `pval`, `lrt`, `omega_purifying/neutral/positive`, `proportion_purifying/neutral/positive`, partition info
FEL	One row per gene with site data vectorized	`gene_id`, `FEL_n_sites`, `FEL_n_significant`, `FEL_n_positive`, `FEL_n_negative`, `FEL_sig_sites` (vector), `FEL_site_pvalues` (vector), `FEL_site_omegas` (vector), `FEL_site_LRTs` (vector)
FUBAR	One row per branch with site data vectorized	`gene`, `branch`, `sites`, `alpha`, `beta`, `Prob[alpha>beta]`, `Prob[alpha<beta]`, `BayesFactor[alpha<beta]`
GARD	One row per partition	`gene`, `number_of_sequences`, `number_of_sites`, breakpoint detection columns (`potentialBreakpoints`, `partition_bps`, `improvements_breakpoints`, `improvements_deltaAICc`, `site_positions` vector, `site_supports` vector)
MEME	One row per branch with site data vectorized	`gene`, `branch`, `site_positions` (vector), `site_pval` (vector), `site_LRT` (vector), `site_alpha`, `site_beta_plus`, `site_MEMElogl`, `site_substitution`
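As an example of working with the key fields above, branch rows from an aBSREL CSV can be filtered further on the corrected p-value. This is a sketch only: the rows are invented, the 0.05 threshold is an arbitrary downstream choice rather than an evolharvester default, and the helpers `to_float` and `significant_branches` are hypothetical.

```python
import csv
import io

# Hypothetical rows; column names follow the aBSREL entry in the table above.
SAMPLE = """branch_id,species,aBSREL_pvalue,aBSREL_pvalue_corrected
Node3,,0.0004,0.004
taxonA,taxonA,0.03,0.09
taxonB,taxonB,,
"""


def to_float(cell):
    """Convert a CSV cell to float, returning None for blanks or non-numeric text."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def significant_branches(rows, column, alpha):
    """Yield rows whose value in `column` is a number strictly below `alpha`."""
    for row in rows:
        value = to_float(row.get(column))
        if value is not None and value < alpha:
            yield row


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
hits = [r["branch_id"] for r in significant_branches(rows, "aBSREL_pvalue_corrected", 0.05)]
print(hits)  # ['Node3']
```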

PAML parsers

Common columns: PAML codeml parsers

The seven PAML codeml parsers (M0, M1a, M2a, M7, M8, OneRatio, FreeRatio) share a core schema with some model-specific extensions:

Column	Meaning
`seq`	gene name identifier (taken from input file or directory name)
`branch_id`	PAML branch identifier (e.g. `1..2`, `5`)
`from_node`, `to_node`	name of branch (or split internal-branch identifier)
`from_name`, `to_name`	resolved sequence/node names
`t`	branch length (substitutions per codon)
`N`, `S`	nonsynonymous and synonymous site counts
`branch_omega`	dN/dS for this branch
`dN`, `dS`	dN and dS values for this branch
`N_dN`, `S_dS`	counts of nonsynonymous and synonymous changes
`ns`, `ls`	number of sequences and alignment length
`lnL`	log-likelihood of the model fit
`kappa`	transition/transversion rate ratio
`tree_length`	total tree length under the model
`model`	model name as reported by codeml
`codon_freq_model`	codon frequency model used (e.g. `F3X4`)
`codon_pos_base_seq`	per-sequence codon-position × base composition table (vectorized)
`codon_usage_counts`	per-sequence codon usage counts (vectorized)
`np`	number of free parameters
`AIC`, `BIC`	information criteria/post-hoc calculation
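The information criteria can also be recomputed from the `lnL` and `np` columns, for instance to compare two site models fitted to the same gene. The sketch below uses the textbook definitions. Which sample size evolharvester uses for its own `BIC` column is not stated here, so passing the alignment length `ls` as `n_obs` is an assumption, and the fit values are invented.

```python
import math


def aic(lnL, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * lnL


def bic(lnL, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 lnL."""
    return n_params * math.log(n_obs) - 2 * lnL


# Hypothetical fits for one gene; lnL, np and ls values are invented.
fits = {
    "M1a": {"lnL": -2000.0, "np": 10, "ls": 300},
    "M2a": {"lnL": -1990.0, "np": 12, "ls": 300},
}

aic_scores = {model: aic(f["lnL"], f["np"]) for model, f in fits.items()}
bic_scores = {model: bic(f["lnL"], f["np"], f["ls"]) for model, f in fits.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)  # lower AIC is preferred
print(aic_scores, best_by_aic)
```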

Model-specific extensions

codemlM0 / codemlM1a / codemlM2a / codemlOneRatio: Add omega_global (the single tree-wide ω-value reported for single-rate fits)
codemlM0 / codemlOneRatio / codemlFreeRatio: Add tree_length_dN and tree_length_dS (separate dN and dS tree lengths reported by codeml for one-ratio and branch-model fits)
codemlM1a / codemlM2a / codemlM8: Add p_siteclasses, w_siteclasses (site-class proportions and ω values from site-model fits)
codemlM2a / codemlM8: Add Bayes Empirical Bayes (BEB) and Naive Empirical Bayes (NEB) site-level posteriors and associated statistics (BEB_Pr_w_gt1, BEB_post_mean, BEB_post_SE, BEB_signif, NEB_* equivalents), per-site coordinates (site_coords), BEB reference sequence (BEB_ref_seq), grid posteriors, and diagnostic counts (num_BEB_sites, num_BEB_ge95, num_BEB_ge99, num_NEB_sites, num_NEB_ge95, num_NEB_ge99)
codemlM7 / codemlM8: Add beta distribution parameters (beta_p, beta_q), site-class MLE vectors (MLE_p, MLE_w), Newick tree with branch lengths (tree_newick_with_lengths), additional log-likelihood diagnostics (lnL_ntime, lnL_np), and Bayesian inference status (bayes_flags)
codemlM8: Adds two additional beta-distribution parameters specific to M8's selection-class extension (beta_p0, beta_w)
codemlFreeRatio: Adds free_w_branch_values (per-branch ω vector), dS_tree_newick, dN_tree_newick, w_node_label_newick (Newick tree with branch labels), and w_node_label_map (parsed mapping of branch labels to ω values)
codemlM2a / codemlM7 / codemlM8 / codemlFreeRatio: Add conv_msg (track any notes in report on convergence failure)
codemlM2a / codemlM7 / codemlM8: Add notes (empty issue tracker column)

baseml

The PAML baseml parser produces one row per branch under nucleotide-substitution models. Output includes branch-level fields (Branch_ID, Branch_Length_t, parent, is_internal, n_descendants, root_to_tip, branch_support), per-gene model parameters (GTR rate matrix elements GTR_a through GTR_e, nucleotide stationary frequencies piA/piC/piG/piT, transition/transversion ratio kappa with source attribution), pairwise sequence diagnostics (pairwise_kappa_*, pairwise_distance_* mean/min/max), compositional homogeneity tests (hom_X2, hom_G), and post-hoc information criteria (AIC, AICc, BIC).

yn00

The PAML yn00 parser produces one row per sequence pair from pairwise dN/dS estimation. Each row includes the primary Yang-Nielsen (2000) estimates (S, N, t, kappa, omega, dN, dN_SE, dS, dS_SE), plus estimates from alternative methods: Li-Wu-Luo 1985 (LWL85_*), modified Li-Wu-Luo 1985 (LWL85m_*), and Li-Pamilo-Bianchi 1993 (LPB93_*). Pairwise diagnostics (pair_n_identical_codons, pair_pct_identity, etc.), per-position GC content (GC_pos1 through GC_total), data-quality flags (dS_is_zero, any_nan_inf, omega_placeholder_flag), and codon-level vector columns (codon_usage_counts, codon_pos_base_seq, codon_pos_base_avg) are also captured.
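The data-quality flags make it straightforward to drop unreliable pairs before summarizing dN/dS. The sketch below assumes the flags are written as text such as "True"/"False"; check the actual CSV to confirm the encoding. The sample rows are invented and `is_flagged` is a hypothetical helper.

```python
import csv
import io

# Hypothetical yn00 rows. The flag encoding ("True"/"False") is an assumption.
SAMPLE = """gene_id,omega,dS,dS_is_zero,any_nan_inf,omega_placeholder_flag
GENE_A,0.25,0.40,False,False,False
GENE_A,99.0,0.00,True,False,True
GENE_B,nan,0.10,False,True,False
"""

FLAGS = ["dS_is_zero", "any_nan_inf", "omega_placeholder_flag"]
TRUTHY = {"true", "1", "yes"}


def is_flagged(row, flags=FLAGS):
    """True if any data-quality flag in the row is set."""
    return any(str(row.get(flag, "")).strip().lower() in TRUTHY for flag in flags)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
clean = [row for row in rows if not is_flagged(row)]
print(len(rows), "pairs read;", len(clean), "kept")
```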

Notes on schema consistency

evolharvester is in active development, and column naming conventions vary across parsers, reflecting differences in formatting convention. The gene identifier column appears as seq in codeml parsers, Gene_ID in baseml, gene_id in yn00 and some HyPhy parsers (FEL, aBSREL), and gene in others (BUSTED, FUBAR, MEME, GARD). Branch identifiers similarly differ (branch_id in codeml, Branch_ID in baseml, branch in MEME and FUBAR). Future releases will work toward a unified schema across parsers. Until then, users joining outputs from multiple parsers should account for these differences. Generated CSV headers are the authoritative schema reference for v0.1.0.
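Until the schemas converge, identifier columns can be harmonized before joining outputs from different parsers. The following sketch maps the aliases listed above onto common `gene` and `branch` keys; the `harmonize` helper, the parser-name prefix, and the sample rows are all hypothetical.

```python
# Aliases taken from the paragraph above; extend after inspecting the real CSV headers.
ALIASES = {
    "seq": "gene", "Gene_ID": "gene", "gene_id": "gene", "gene": "gene",
    "branch_id": "branch", "Branch_ID": "branch", "branch": "branch",
}


def harmonize(row, source):
    """Rename gene/branch identifier columns and prefix the rest with the parser name."""
    out = {}
    for key, value in row.items():
        if key in ALIASES:
            out[ALIASES[key]] = value
        else:
            out[f"{source}.{key}"] = value
    return out


# Hypothetical rows from two different parsers.
codeml_row = {"seq": "GENE_A", "branch_id": "5..1", "branch_omega": "0.21"}
meme_row = {"gene": "GENE_A", "branch": "Node1", "site_pval": "[0.01]"}
merged = [harmonize(codeml_row, "codemlM2a"), harmonize(meme_row, "meme")]
print(merged)
```

Harmonized column names do not guarantee joinable values: PAML and HyPhy label branches differently, so joins across the two are generally safer on the gene key alone.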

Citation

If you use evolharvester in published work, please cite via:

Pulvino, A.T. (2026). evolharvester (v0.1.0). GitHub repository. https://github.com/apulvino/evolharvester

A peer-reviewed Application Note describing evolharvester is in preparation. Citation details will be updated upon acceptance.

A Zenodo archive with a citable DOI is forthcoming and will be linked in the badges above.

License

evolharvester is released under the MIT License. See LICENSE for details.

Author

Anthony T. Pulvino — Northwestern University, Interdisciplinary Biological Sciences (IBiS) Graduate Program

evolharvester was developed to support comparative genomics research for HyPhy and PAML users. I hope that easier access to their outputs helps grow the user bases of both tools, each of which is a highly valuable contribution to the research community.

Contact

For bug reports, feature requests, or questions about evolharvester, please open an issue on the GitHub issue tracker.

Project details

Maintainers

apulvino

Release history

0.5.0 (this version): May 8, 2026

0.1.0.dev1 (pre-release, yanked): May 8, 2026

Reason this release was yanked: wrong release/gh mixup

Download files

Source Distribution

evolharvester-0.5.0.tar.gz (43.1 kB), uploaded May 8, 2026, Source

Built Distribution

evolharvester-0.5.0-py3-none-any.whl (42.4 kB), uploaded May 8, 2026, Python 3

File details

Details for the file evolharvester-0.5.0.tar.gz.

File metadata

Download URL: evolharvester-0.5.0.tar.gz
Upload date: May 8, 2026
Size: 43.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evolharvester-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`93af85dbc8c914c7be8fde5f243421396d18bab2a6192f3e0e977d083f569b66`
MD5	`40f08053509d3477bab3cb2f01b6bc78`
BLAKE2b-256	`c32baacce64de9091bb1e5dffb671b9e42de0be00b2adc0985408a4e8ff299b9`

Provenance

The following attestation bundles were made for evolharvester-0.5.0.tar.gz:

Publisher: workflow.yml on apulvino/evolharvester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evolharvester-0.5.0.tar.gz
- Subject digest: 93af85dbc8c914c7be8fde5f243421396d18bab2a6192f3e0e977d083f569b66
- Sigstore transparency entry: 1479433912
- Sigstore integration time: May 8, 2026
Source repository:
- Permalink: apulvino/evolharvester@db7fe7fc09de94547f866e126c2bebbc51adc14e
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/apulvino
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@db7fe7fc09de94547f866e126c2bebbc51adc14e
- Trigger Event: release

File details

Details for the file evolharvester-0.5.0-py3-none-any.whl.

File metadata

Download URL: evolharvester-0.5.0-py3-none-any.whl
Upload date: May 8, 2026
Size: 42.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evolharvester-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`848f617f82a3c667c619cf49b9d7bed14c0328d74cd96b3b6cbd5ff328f7aff0`
MD5	`00c05662fb8a1d6e033f05b538741041`
BLAKE2b-256	`0e7cf2da82342609ca0dcd07045e807bed552188ab1579765b816e2ba14a08ee`

Provenance

The following attestation bundles were made for evolharvester-0.5.0-py3-none-any.whl:

Publisher: workflow.yml on apulvino/evolharvester

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evolharvester-0.5.0-py3-none-any.whl
- Subject digest: 848f617f82a3c667c619cf49b9d7bed14c0328d74cd96b3b6cbd5ff328f7aff0
- Sigstore transparency entry: 1479434084
- Sigstore integration time: May 8, 2026
Source repository:
- Permalink: apulvino/evolharvester@db7fe7fc09de94547f866e126c2bebbc51adc14e
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/apulvino
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@db7fe7fc09de94547f866e126c2bebbc51adc14e
- Trigger Event: release
