Dataframe-based tools for working with phylogenetic trees.
Project description
phyloframe
Dataframe-based tools for working with phylogenetic trees.
- Free software: MIT license
- Documentation: https://phyloframe.readthedocs.io
- Repository: https://github.com/mmore500/phyloframe
Why a DataFrame-based Tree Representation?
The R ecosystem's success with the ape data structure demonstrates the utility of edge matrix tree representations --- phyloframe pushes this idea further with a fully tabular format hosted within DataFrame objects (e.g., pd.DataFrame, pl.LazyFrame, pl.DataFrame, etc.).
DataFrames are scripting-friendly and end-user extensible, enabling a composable, interoperable, high-performance ecosystem for phylogenetic analysis --- in applications to our work, scalable to billion-tip phylogenies.
Fast and highly portable load/save.
Use pandas.read_csv, polars.read_parquet, etc. --- libraries transparently fetch from URLs, cloud providers (S3, Google Cloud, etc.).
Fast, full-featured Newick format I/O support.
Benefit from modern tabular data formats. Granular deserialization of selected columns (e.g., Parquet), transparent compression configuration (e.g., Parquet), columnar compression for efficient storage, categorical strings, and explicit column typing.
Benefit from modern high-performance dataframe tooling. Lazy query optimization (e.g. Polars), larger-than-memory streaming operations (e.g., Polars), distributed computing operations (e.g., Dask), multithreaded operations (e.g., Polars), vectorized operations (e.g., NumPy), and just-in-time compilation (e.g., Numba).
Benefit from rich, expressive dataframe functionality. Leverage powerful querying and transformation APIs (e.g., Polars expressions, Pandas indexing), enabling flexible filtering, bulk column calculations, grouped aggregations, join/merge operations, and chained transformations directly over tree data without manual loops.
Cache-friendly, memory-efficient, flexible data structure. Data occupies contiguous arrays, expediting tree creation and topological order traversals (e.g., parents before children or vice versa). Base memory footprint is lightweight (e.g., as little as 32 bits per node), but can be dynamically augmented to expedite traversals and calculations (e.g., linked list over children via DataFrame columns for first child/next sibling indices).
Rich interoperative ecosystem.
Multi-language interoperation (e.g., possible future support for zero-copy interop between R and Python via reticulate and Arrow; possible future support for zero-copy Polars DataFrames shared between Rust and Python).
Multi-library interoperation (e.g., highly-optimized or zero-copy interoperation between Polars and Pandas; Python dataframe protocol).
Compatibility with existing alife data standards ecosystem.
Performance
Computational throughput (left) and memory efficiency (right) across tree sizes.
Higher is better.
At large tree sizes, phyloframe improves speed and memory-efficiency for most operations.
Notably, newick reads and topological-order tree traversal (i.e., parents before children) are up to 10× faster than existing tools --- including implementations backed by native code. Newick writes are up to 2× faster.
Benchmarked operations include tree traversals, newick read/write, and pairwise operations.
Install
python3 -m pip install "phyloframe[jit]==0.8.0"
The [jit] extra installs Numba for just-in-time compilation, providing native-level performance for many operations.
Jit dependency is strongly recommended.
A containerized release of phyloframe is available via ghcr.io
singularity exec docker://ghcr.io/mmore500/phyloframe:v0.8.0 python3 -m phyloframe --help
Quickstart
Phyloframe represents phylogenies as DataFrames in the alife standard format.
from phyloframe import legacy as pfl
# Parse a Newick tree (already in working format)
df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")
# Mark properties and transform using df.pipe() (pandas syntactic sugar)
df = (
df.pipe(pfl.alifestd_mark_leaves)
.pipe(pfl.alifestd_mark_node_depth_asexual)
.pipe(pfl.alifestd_collapse_unifurcations)
)
print("leaf count:", pfl.alifestd_count_leaf_nodes(df))
print(df[["id", "ancestor_id", "is_leaf", "node_depth"]].head())
The legacy module (from phyloframe import legacy) provides all current phyloframe operations.
The legacy API is stable and will continue to be maintained for backward compatibility.
A redesigned API will accompany phyloframe v1.0.0.
For a deeper introduction covering tree representation semantics, tree creation, tree computations, tree transforms, Polars, CLI, JIT compilation, and more, see the full quickstart guide.
Citing
If phyloframe contributes to a scholarly work, please cite it as
Matthew Andres Moreno. (2026). mmore500/phyloframe. Zenodo. https://doi.org/10.5281/zenodo.18842674
@software{moreno2026phyloframe,
author = {Matthew Andres Moreno},
title = {mmore500/phyloframe},
month = mar,
year = 2026,
publisher = {Zenodo},
doi = {10.5281/zenodo.18842674},
url = {https://doi.org/10.5281/zenodo.18842674}
}
And don't forget to leave a star on GitHub!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file phyloframe-0.8.0.tar.gz.
File metadata
- Download URL: phyloframe-0.8.0.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
77b7c0c7973cba0269b9ec223592ce574a1be81028b427beeb8b60f9527cf2a3
|
|
| MD5 |
ed98ec79646b6ce3636d351df1e6aa90
|
|
| BLAKE2b-256 |
b45c5ce93127e9edd2ea461ebc42e5b57f2bb16b4452399d107e175547b2fcd6
|