phyloframe

Dataframe-based tools for working with phylogenetic trees.

Why a DataFrame-based Tree Representation?

The R ecosystem's success with the ape data structure demonstrates the utility of edge matrix tree representations; phyloframe pushes this idea further with a fully tabular format hosted within DataFrame objects (e.g., pd.DataFrame, pl.LazyFrame, or pl.DataFrame).

DataFrames are scripting-friendly and end-user extensible, enabling a composable, interoperable, high-performance ecosystem for phylogenetic analysis. In our own work, this approach scales to billion-tip phylogenies.
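To make the tabular idea concrete, here is a minimal sketch in plain pandas of a tree stored one row per node. The `id` and `ancestor_id` column names match those shown in the quickstart; the `taxon_label` column, the five-node tree, and the convention that a root lists itself as its own ancestor are assumptions made for illustration, not a statement of phyloframe's internals.

```python
import pandas as pd

# Hypothetical five-node tree: root 0 has children 1 and 2; node 1 has
# children 3 and 4. Assumption: the root lists itself as its own ancestor.
tree = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],
        "taxon_label": ["root", "inner", "C", "A", "B"],  # illustrative column
    }
)

# A node is a leaf when no other node names it as an ancestor.
is_root = tree["id"] == tree["ancestor_id"]
parent_ids = tree.loc[~is_root, "ancestor_id"]
tree["is_leaf"] = ~tree["id"].isin(parent_ids)

leaf_labels = tree.loc[tree["is_leaf"], "taxon_label"].tolist()
print(leaf_labels)
```

Because the tree is an ordinary DataFrame, adding a per-node annotation is just adding a column.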

Fast and highly portable load/save. Use pandas.read_csv, polars.read_parquet, etc.; these libraries transparently fetch from URLs and cloud providers (S3, Google Cloud, etc.). Newick format I/O is also fast and full-featured.
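As a rough illustration of the load/save point, the sketch below round-trips a small tree table through CSV using only pandas. The in-memory buffer stands in for a file path or URL, and the `origin_time` column and its values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-node star tree; origin_time is an illustrative column.
tree = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "ancestor_id": [0, 0, 0],
        "origin_time": [0.0, 1.0, 2.0],
    }
)

# Save to an in-memory buffer; a file path would work the same way.
buffer = io.StringIO()
tree.to_csv(buffer, index=False)

# Load it back. pandas.read_csv also accepts a URL string in place of
# the buffer.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.equals(tree))
```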

Benefit from modern tabular data formats. Granular deserialization of selected columns and transparent compression configuration (e.g., Parquet), columnar compression for efficient storage, categorical strings, and explicit column typing.
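The sketch below shows column-selective, explicitly typed loading with pandas. CSV is used here only so the example runs without extra dependencies; with Parquet, the analogous selection is the `columns` argument of `pandas.read_parquet`. The file contents and the `taxon_label` and `origin_time` columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical on-disk tree table, inlined as text for the example.
csv_text = """id,ancestor_id,origin_time,taxon_label
0,0,0.0,root
1,0,1.0,A
2,0,2.0,B
"""

# Read only the topology columns, with explicit 32-bit integer typing.
topology = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "ancestor_id"],
    dtype={"id": "int32", "ancestor_id": "int32"},
)

# Read labels separately, stored as a categorical string column.
labels = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "taxon_label"],
    dtype={"id": "int32", "taxon_label": "category"},
)

print(topology.dtypes)
print(labels.dtypes)
```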

Benefit from modern high-performance dataframe tooling. Lazy query optimization (e.g., Polars), larger-than-memory streaming operations (e.g., Polars), distributed computing operations (e.g., Dask), multithreaded operations (e.g., Polars), vectorized operations (e.g., NumPy), and just-in-time compilation (e.g., Numba).

Benefit from rich, expressive dataframe functionality. Leverage powerful querying and transformation APIs (e.g., Polars expressions, Pandas indexing), enabling flexible filtering, bulk column calculations, grouped aggregations, join/merge operations, and chained transformations directly over tree data without manual loops.
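As one possible illustration of loop-free tree queries, the pandas sketch below counts children with a grouped aggregation and derives branch lengths with a self-join. The tree, the `origin_time` column, and the derived column names are assumptions for the example; the root is assumed to list itself as its own ancestor.

```python
import pandas as pd

# Hypothetical tree: root 0 -> {1, 2}; node 1 -> {3, 4}.
tree = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "ancestor_id": [0, 0, 0, 1, 1],
        "origin_time": [0.0, 3.0, 6.0, 4.0, 5.0],
    }
)

# Grouped aggregation: children per parent, skipping the root's
# self-reference so it is not counted as its own child.
non_root = tree[tree["id"] != tree["ancestor_id"]]
child_counts = non_root.groupby("ancestor_id").size()
tree["num_children"] = tree["id"].map(child_counts).fillna(0).astype(int)

# Self-join: attach each node's parent origin_time, then subtract to
# get the length of the branch leading to the node.
parents = tree[["id", "origin_time"]].rename(
    columns={"id": "ancestor_id", "origin_time": "ancestor_origin_time"}
)
tree = tree.merge(parents, on="ancestor_id", how="left")
tree["branch_length"] = tree["origin_time"] - tree["ancestor_origin_time"]

print(tree[["id", "num_children", "branch_length"]])
```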

Cache-friendly, memory-efficient, flexible data structure. Data occupies contiguous arrays, expediting tree creation and topological order traversals (e.g., parents before children or vice versa). Base memory footprint is lightweight (e.g., as little as 32 bits per node), but can be dynamically augmented to expedite traversals and calculations (e.g., linked list over children via DataFrame columns for first child/next sibling indices).
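The traversal claims can be sketched with a single NumPy array. This example assumes node ids equal row positions, the single root sits at position 0 and lists itself as its own ancestor, and rows are topologically sorted so every parent precedes its children. Under those assumptions one int32 ancestor index per node is enough to encode the topology. The variable names are illustrative and this is not necessarily how phyloframe implements these operations.

```python
import numpy as np

# Topology as one int32 per node: root 0 -> {1, 2}; node 1 -> {3, 4}.
ancestor_id = np.array([0, 0, 0, 1, 1], dtype=np.int32)
n = len(ancestor_id)

# Forward pass (parents before children): depth of each node.
depth = np.zeros(n, dtype=np.int32)
for i in range(1, n):
    depth[i] = depth[ancestor_id[i]] + 1

# Reverse pass (children before parents): leaves beneath each node.
has_child = np.zeros(n, dtype=bool)
has_child[ancestor_id[1:]] = True
num_leaves = (~has_child).astype(np.int32)
for i in range(n - 1, 0, -1):
    num_leaves[ancestor_id[i]] += num_leaves[i]

# Optional augmentation: first-child / next-sibling index columns that
# form a linked list over each node's children (-1 means "none").
first_child = np.full(n, -1, dtype=np.int32)
next_sibling = np.full(n, -1, dtype=np.int32)
for i in range(n - 1, 0, -1):
    parent = ancestor_id[i]
    next_sibling[i] = first_child[parent]
    first_child[parent] = i

print(depth.tolist(), num_leaves.tolist())
print(first_child.tolist(), next_sibling.tolist())
```

Each pass touches the arrays in index order, which is what makes the layout cache-friendly. Loops of this shape are also the kind that a JIT compiler such as Numba is designed to speed up.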

Rich interoperable ecosystem. Multi-language interoperation (e.g., possible future support for zero-copy interop between R and Python via reticulate and Arrow; possible future support for zero-copy Polars DataFrames shared between Rust and Python). Multi-library interoperation (e.g., highly optimized or zero-copy interoperation between Polars and Pandas; Python dataframe protocol). Compatibility with the existing alife data standards ecosystem.

Install

python3 -m pip install "phyloframe[jit]==0.7.0"

The [jit] extra installs Numba for just-in-time compilation, providing native-level performance for many operations. The JIT dependency is strongly recommended.

A containerized release of phyloframe is available via ghcr.io:

singularity exec docker://ghcr.io/mmore500/phyloframe:v0.7.0 python3 -m phyloframe --help

Quickstart

Phyloframe represents phylogenies as DataFrames in the alife standard format.

from phyloframe import legacy as pfl

# Parse a Newick tree (already in working format)
df = pfl.alifestd_from_newick("((A:1,B:2):3,(C:4,D:5):6);")

# Mark properties and transform using df.pipe() (pandas syntactic sugar)
df = (
    df.pipe(pfl.alifestd_mark_leaves)
    .pipe(pfl.alifestd_mark_node_depth_asexual)
    .pipe(pfl.alifestd_collapse_unifurcations)
)

print("leaf count:", pfl.alifestd_count_leaf_nodes(df))
print(df[["id", "ancestor_id", "is_leaf", "node_depth"]].head())

The legacy module (from phyloframe import legacy) provides all current phyloframe operations. The legacy API is stable and will continue to be maintained for backward compatibility. A redesigned API will accompany phyloframe v1.0.0.

For a deeper introduction covering tree representation semantics, tree creation, tree computations, tree transforms, Polars, CLI, JIT compilation, and more, see the full quickstart guide.

Citing

If phyloframe contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2026). mmore500/phyloframe. Zenodo. https://doi.org/10.5281/zenodo.18842674

@software{moreno2026phyloframe,
  author = {Matthew Andres Moreno},
  title = {mmore500/phyloframe},
  month = mar,
  year = 2026,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.18842674},
  url = {https://doi.org/10.5281/zenodo.18842674}
}

And don't forget to leave a star on GitHub!
