
Dataframe-based tools for working with phylogenetic trees.


phyloframe



Why a DataFrame-based Tree Representation?

The R ecosystem's success with the ape data structure demonstrates the utility of edge-matrix tree representations; phyloframe pushes this idea further with a fully tabular format hosted within DataFrame objects (e.g., pd.DataFrame, pl.DataFrame, pl.LazyFrame).

DataFrames are scripting-friendly and end-user extensible, enabling a composable, interoperable, high-performance ecosystem for phylogenetic analysis that, in our own applications, scales to billion-tip phylogenies.

Fast and highly portable load/save. Use pandas.read_csv, polars.read_parquet, etc.; these libraries transparently fetch from URLs and cloud providers (e.g., S3, Google Cloud). Fast, full-featured Newick format I/O is also supported.

Benefit from modern tabular data formats. Granular deserialization of selected columns (e.g., Parquet), transparent compression configuration (e.g., Parquet), columnar compression for efficient storage, categorical strings, and explicit column typing.
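As a minimal sketch of the load/save points above, the snippet below round-trips a small tree table through CSV with pandas, reading back only selected columns with explicit types. The column names (`id`, `ancestor_id`, `origin_time`, `taxon_label`) and the convention that the root lists itself as its own ancestor are assumptions made for illustration, and an in-memory buffer stands in for a file path or URL.

```python
import io

import pandas as pd

# Hypothetical five-node tree table; column names are illustrative assumptions.
tree_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],  # assumed: root is its own ancestor
        "origin_time": [0.0, 3.0, 6.0, 4.0, 5.0],
        "taxon_label": ["root", "inner", "C", "A", "B"],
    }
)

# Save: any pandas writer works; an in-memory buffer stands in for a path.
buffer = io.StringIO()
tree_df.to_csv(buffer, index=False)
buffer.seek(0)

# Load: deserialize only the topology columns, with explicit column typing.
topology_df = pd.read_csv(
    buffer,
    usecols=["id", "ancestor_id"],
    dtype={"id": "uint32", "ancestor_id": "uint32"},
)
print(topology_df.dtypes)
```

With Parquet, the analogous selection is the `columns` argument of `pandas.read_parquet`, which requires a Parquet engine such as pyarrow to be installed.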

Benefit from modern high-performance dataframe tooling. Lazy query optimization (e.g., Polars), larger-than-memory streaming operations (e.g., Polars), distributed computing operations (e.g., Dask), multithreaded operations (e.g., Polars), vectorized operations (e.g., NumPy), and just-in-time compilation (e.g., Numba).

Benefit from rich, expressive dataframe functionality. Leverage powerful querying and transformation APIs (e.g., Polars expressions, Pandas indexing), enabling flexible filtering, bulk column calculations, grouped aggregations, join/merge operations, and chained transformations directly over tree data without manual loops.
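To make this concrete, here is a small sketch using plain pandas (not the phyloframe API) over a hand-built tree table for `((A:1,B:2):3,C:6);`. The column names and the root-is-its-own-ancestor convention are assumptions for illustration. It shows a filter, a grouped count, and a self-join, none of which needs an explicit loop over nodes.

```python
import pandas as pd

# Hypothetical tree table; column names are illustrative assumptions.
tree_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],  # assumed: root is its own ancestor
        "origin_time": [0.0, 3.0, 6.0, 4.0, 5.0],
    }
)

# Filtering: a node is a leaf if no other node names it as its ancestor.
is_root = tree_df["id"] == tree_df["ancestor_id"]
parent_ids = tree_df.loc[~is_root, "ancestor_id"]
tree_df["is_leaf"] = ~tree_df["id"].isin(parent_ids)

# Grouped count: children per parent, ignoring the root's self-reference.
child_counts = parent_ids.value_counts()
tree_df["num_children"] = (
    tree_df["id"].map(child_counts).fillna(0).astype(int)
)

# Join: look up each node's parent origin time to derive branch lengths.
parents = tree_df[["id", "origin_time"]].rename(
    columns={"id": "ancestor_id", "origin_time": "ancestor_origin_time"}
)
tree_df = tree_df.merge(parents, on="ancestor_id", how="left")
tree_df["branch_length"] = (
    tree_df["origin_time"] - tree_df["ancestor_origin_time"]
)

print(tree_df)
```

Because every step is a column-wise operation, the same logic could in principle be expressed with Polars expressions or run lazily; the pandas version is shown only because it is the most widely familiar.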

Cache-friendly, memory-efficient, flexible data structure. Data occupies contiguous arrays, expediting tree creation and topological order traversals (e.g., parents before children or vice versa). Base memory footprint is lightweight (e.g., as little as 32 bits per node), but can be dynamically augmented to expedite traversals and calculations (e.g., linked list over children via DataFrame columns for first child/next sibling indices).
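The layout described above can be sketched with plain contiguous arrays, without any dataframe library. This is an illustration under stated assumptions, not phyloframe's internal implementation: each node is a row index, the only required column is a parent index, rows are ordered so that every parent precedes its children, and the root references itself. All names, including the `NO_NODE` sentinel, are hypothetical.

```python
from array import array

# Base topology: one unsigned parent index per node (typecode "I" is
# 32 bits on common platforms). Rows are in topological order, so each
# parent precedes its children; the root is assumed to reference itself.
ancestor_id = array("I", [0, 0, 0, 1, 1])
num_nodes = len(ancestor_id)
NO_NODE = num_nodes  # hypothetical sentinel meaning "no such node"

# Forward pass (parents before children): depth of each node.
node_depth = array("I", [0]) * num_nodes
for node in range(num_nodes):
    parent = ancestor_id[node]
    if parent != node:
        node_depth[node] = node_depth[parent] + 1

# Reverse pass (children before parents): leaf counts, plus optional
# first-child / next-sibling columns forming a linked list over children.
first_child = array("I", [NO_NODE]) * num_nodes
next_sibling = array("I", [NO_NODE]) * num_nodes
num_leaves = array("I", [0]) * num_nodes
for node in reversed(range(num_nodes)):
    if first_child[node] == NO_NODE:  # no children were linked: a leaf
        num_leaves[node] = 1
    parent = ancestor_id[node]
    if parent != node:
        num_leaves[parent] += num_leaves[node]
        next_sibling[node] = first_child[parent]  # prepend to child list
        first_child[parent] = node


def children_of(node):
    """Yield the children of `node` by walking the sibling linked list."""
    child = first_child[node]
    while child != NO_NODE:
        yield child
        child = next_sibling[child]


print("depths:", list(node_depth))
print("leaf counts:", list(num_leaves))
print("children of root:", list(children_of(0)))
```

Each pass is a single linear scan over contiguous memory, which is the property the topological row ordering is meant to provide. The extra `first_child` and `next_sibling` arrays are only needed when downward navigation is required.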

Rich interoperable ecosystem. Multi-language interoperation (e.g., possible future support for zero-copy interop between R and Python via reticulate and Arrow; possible future support for zero-copy Polars DataFrames shared between Rust and Python). Multi-library interoperation (e.g., highly optimized or zero-copy interoperation between Polars and Pandas; Python dataframe protocol). Compatibility with the existing alife data standards ecosystem.

Performance

[Figure: benchmark results] Computational throughput (left) and memory efficiency (right) across tree sizes. Higher is better.

At large tree sizes, phyloframe improves speed and memory efficiency for most operations.

Notably, Newick reads and topological-order tree traversal (i.e., parents before children) are up to 10× faster than existing tools, including implementations backed by native code. Newick writes are up to 2× faster.

Benchmarked operations include tree traversals, Newick read/write, and pairwise operations.

Install

python3 -m pip install "phyloframe[jit]==0.8.0"

The [jit] extra installs Numba for just-in-time compilation, providing native-level performance for many operations. Installing the JIT dependency is strongly recommended.

A containerized release of phyloframe is available via ghcr.io:

singularity exec docker://ghcr.io/mmore500/phyloframe:v0.8.0 python3 -m phyloframe --help

Quickstart

Phyloframe represents phylogenies as DataFrames in the alife standard format.

from phyloframe import legacy as pfl

# Parse a Newick tree (already in working format)
df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

# Mark properties and transform using df.pipe() (pandas syntactic sugar)
df = (
    df.pipe(pfl.alifestd_mark_leaves)
    .pipe(pfl.alifestd_mark_node_depth_asexual)
    .pipe(pfl.alifestd_collapse_unifurcations)
)

print("leaf count:", pfl.alifestd_count_leaf_nodes(df))
print(df[["id", "ancestor_id", "is_leaf", "node_depth"]].head())

The legacy module (from phyloframe import legacy) provides all current phyloframe operations. The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

For a deeper introduction covering tree representation semantics, tree creation, tree computations, tree transforms, Polars, CLI, JIT compilation, and more, see the full quickstart guide.

Citing

If phyloframe contributes to a scholarly work, please cite it as:

Matthew Andres Moreno. (2026). mmore500/phyloframe. Zenodo. https://doi.org/10.5281/zenodo.18842674

@software{moreno2026phyloframe,
  author = {Matthew Andres Moreno},
  title = {mmore500/phyloframe},
  month = mar,
  year = 2026,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18842674},
  url = {https://doi.org/10.5281/zenodo.18842674}
}

And don't forget to leave a star on GitHub!


