Skip to main content

hstrat enables phylogenetic inference on distributed digital evolution populations

Project description

hstrat wordmark

PyPi codecov Codacy Badge CI Read The Docs GitHub stars

hstrat enables phylogenetic inference on distributed digital evolution populations

Install

python3 -m pip install hstrat

Usage

from hstrat import hstrat

stratum_retention_policy = hstrat.geom_seq_nth_root_tapered_algo.Policy(
    parameterizer=hstrat.PropertyAtMostParameterizer(
        target_value=127,
        policy_evaluator \
            =hstrat.MrcaUncertaintyAbsExactEvaluator(
                at_num_strata_deposited=256,
                at_rank=0,
        ),
        param_lower_bound=1,
        param_upper_bound=1024,
    ),
)

individual1 = hstrat.HereditaryStratigraphicColumn(
    stratum_retention_policy=stratum_retention_policy,
)
individual2 = hstrat.HereditaryStratigraphicColumn(
  stratum_retention_policy=stratum_retention_policy,
)

individual1_child1 = individual1.CloneDescendant()

hstrat.does_have_any_common_ancestor(individual1, individual2) # -> False
hstrat.does_have_any_common_ancestor(individual1_child1, individual2) # -> False

individual1_grandchild1 = individual1_child1.CloneDescendant()
individual1_grandchild2 = individual1_child1.CloneDescendant()

result = hstrat.calc_rank_of_mrca_bounds_between(
  individual1_grandchild1,
  individual1_grandchild2,
)
print(result) # -> (1, 2)

As shown in the example above, all library components can be accessed directly from the convenience flat namespace hstrat.hstrat.

How it Works

The goal of this software is to enable approximate inference of the phylogenetic history of a distributed digital population solely through analysis of heritable genome annotations. Put another way, given a scenario where packets of data are being copied and moved within a distributed system, this software enables estimation of how closely any two data packets are related. More precisely, for any two extant data packets, estimation bounds can be produced for the number of copies elapsed from those packets' last shared source copy (i.e., most recent common ancestor a.k.a. MRCA) to yield each extant packet. This is done by means of annotations on the data being copied itself --- no centralized tracking system required.

This capability has direct applications in digital evolution research (e.g., artificial life, genetic programming, genetic algorithms), and also may prove useful for other distributed systems applications.

Cartoon showing relatedness estimation via a bitstring under neutral drift

A simple heritable annotation to enable relatedness estimation would be a bitstring under neutral drift. Under this model, the number of generations elapsed since two bitstrings' MRCA would be inferred from the number of mismatching sites --- more mismatching sites would imply less relatedness. An example scenario is shown above.

This model suffers from several serious drawbacks in application as a tool for relatedness estimation, including difficulty discerning uneven elapse of generations since MRCA and difficulty parameterizing the model for effective inference over greatly varying generational scales.

Cartoon showing relatedness estimation via a hereditary stratigraphic column

Instead, we use randomly-generated binary "fingerprints" as a record of shared ancestry at a particular generation.

An example scenario is shown above. Each generation, a new randomly-generated "fingerprint" (drawn as a solid-colored rectangle above) is appended to the genome annotation. Genomes' annotations will share identical fingerprints for the generations they experienced shared ancestry. Estimation of the MRCA generation is straightforward: the first mismatching fingerprints denote the end of common ancestry.

Internally, we refer to these "fingerprints" as differentia. The bit width of differentiae controls the probability of spurious collision events (where strata happen to share the same randomly-generated value by chance), and is tunable by the end user.

In analogy to geological layering, we refer to the data deposited each generation (differentia and optional user-defined arbitrary annotations) as a stratum. We call the annotation comprised of strata deposited at each generation a hereditary stratigraphic column. In accordance, we describe this general approach for relatedness estimation hereditary stratigraphy.

Comparison of hereditary stratigraphic column before and after pruning retained strata

As stated so far, hereditary stratigraphy suffers from a major pitfall: linear space complexity of the annotation with respect to the number of generations elapsed. This means that time required to copy and transmit annotations, as well as memory requirements to store those annotations, will grow rapidly and without bound.

Systematically pruning strata --- deleting data from certain generations from the annotation as generations elapse --- can reduce space complexity. However, pruning introduces uncertainty to MRCA generation estimates. The exact last generation of common ancestry is bounded between the last-matching and first-mismatching strata, but cannot be resolved within that window.

In this way, hereditary stratigraphy exposes a direct, well-delimited, and flexible trade-off between space complexity and estimation accuracy. A pruning strategy must specify the set of strata to be pruned at each generational step. We call the particular sequence of strata sets pruned at each generation a stratum retention policy.

Comparison of column size trajectories under stratum retention policies with different space complexities

The most obvious consideration with respect to stratum retention policy is the number of strata to prune. If, on average, one stratum is pruned for each added then a constant O(1) space complexity will be achieved. On the other hand, pruning no strata or pruning every other, every third, etc. strata will yield a linear O(n) space complexity. Intermediate space complexities, such as O(log(n)), can also be achieved.

Comparison of even pruning and recency-proportional pruning

Another, more subtle, stratum retention policy consideration is where to prune. An obvious choice might be to prune uniformly, yielding evenly-spaced retained strata. However, pruning may also be performed to distribute retained strata so that estimation windows provide fixed relative error rather than absolute error. That is, error in MRCA generation estimation would be proportional to the number of generations elapsed since the true MRCA generation. Such a strategy exchanges lower absolute error in estimating more recent MRCA events for higher absolute error in estimating more ancient MRCA events. The cartoon above contrasts retained strata under even and recency-proportional pruning.

Requirements on acceptable space-vs-resolution and distribution-of-resolution trade-offs will vary fundamentally between use cases. So, the hstrat software accommodates modular, interchangeable specification of stratum retention policy. We provide a library of predefined "stratum retention algorithms," summarized below. However, end users can also use custom stratum retention algorithms defined within their own codebases outside the library, if needed. (Any custom algorithms with general appeal are welcome to be contributed back to the hstrat library!)

More detail on the rationale, implementation details, and performance guarantees of hereditary stratigraphy can be found in (Moreno et al., 2022).

Available Stratum Retention Algorithms

The space-vs-resolution and distribution-of-resolution trade-offs of library-provided stratum retention algorithms are summarized below.

Stratum Retention Algorithm Space Complexity MRCA Gen Uncertainty
Fixed Resolution Algorithm n/k k
Recency Proportional Resolution Algorithm k * log(n) m/k
Depth Proportional Resolution Algorithm k n/k
Geometric Sequence Nth Root Algorithm k m * n^(1/k)

where n is generations elapsed, m is generations since MRCA, and k is an arbitrary user-determined constant.

Note that distribution-of-resolution trade-offs are described via the definition of uncertainty bounds in terms of generations since MRCA m versus overall generations elapsed n.

The hstrat library includes a suite of variants for several of these stratum retention algorithms. These variants differ in terms of secondary considerations, for example whether column size exactly traces the asymptotic guarantee or fluctuates around it. Computational intensity to calculate the set of strata to be dropped at each generation may also differ between variants.

The next sections tour available stratum retention algorithms in detail.

Retention Drip Plot Visualization

No History Retained History All History

Because stratum retention policies unfold across space and time (i.e., the location of pruned strata and the generations at which they are pruned), the retention drip plot visualizations we use to depict them require a bit of explanation to get up to speed with. This section builds up the visualizations step-by-step in the table above. (Note that the animations loop at 256 generations; if you are impatient you can refresh the page to restart the animations.)

The leftmost column animates the evolution of a single hereditary stratigraphic column over generations with no historical overlay. The column itself is shown "tipped over," appearing as a horizontal blue rectangle. Upside-down triangles are stratum within the column. The position of strata in the x dimension correspond to the generation of their deposition; new strata are appended on the right. Retention status is depicted with color. Black strata are retained and red strata have been pruned.

The center column uses the y dimension to introduce historical traces for retained strata. The line below each retained stratum spans from the generation it was deposited to the current generation. Note that the column itself is positioned at the current generation at the top of the diagram.

Finally, the rightmost column further layers on historical traces of pruned strata. Instead of having their trace deleted when pruned, their trace is turned red and frozen in place. So, red traces span from the generation a pruned stratum was deposited to the generation it was pruned.

Animated panels below situate this visualization alongside additional graphs depicting absolute and relative estimation error across possible MRCA generations as well as relative and absolute column size (i.e., number of strata retained). Two policy instantiations are visualized for each algorithm --- one yielded from a sparse parameterization and the other from a dense parameterization (i.e., one configured to retain fewer strata and one configured to retain more).

Depth Proportional Resolution Algorithm

The depth proportional resolution algorithm drops half of retained strata when a threshold size cap is met in order to maintain O(1) space complexity. Retained strata are evenly spaced.

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Tapered Depth Proportional Resolution Algorithm

The tapered depth proportional resolution algorithm drops strata gradually from the rear once the threshold size cap is met (instead of purging half of strata all at once).

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Fixed Resolution Algorithm

The fixed resolution algorithm permanently retains every nth stratum, giving it linear O(n) space complexity and uniform distribution of retained strata.

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Geometric Sequence Nth Root Algorithm

The geometric sequence algorithm provides constant O(1) space complexity with recency-proportional distribution of retained strata. To accomplish this, a fixed number k of fixed-size subcomponents track k waypoints exponentially spaced backward from the most-recent stratum (i.e., spaced corresponding to the roots 0/k, 1/k, 2/k, ..., k/k).

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Tapered Geometric Sequence Nth Root Algorithm

The tapered geometric sequence nth root algorithm maintains exactly-constant size at its vanilla counterpart's upper bound instead of fluctuating below that bound. It is more computationally expensive than the vanilla geometric sequence nth root algorithm.

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Recency Proportional Resolution Algorithm

The recency proportional resolution algorithm maintains relative MRCA estimation error below a constant bound. It exhibits logarithmic O(log(n)) space complexity.

Sparse Parameterization Dense Parameterization
Dense Parameterization Detail

Other Algorithms

These stratum retention algorithms are less likely of interest to end users, included in the library primarily for testing and design validation, but are available nonetheless.

  • nominal resolution policy
  • perfect resolution policy
  • pseudostochastic policy
  • stochastic policy

Custom Algorithm Design

Custom stratum retention algorithms can easily be substituted for supplied algorithms without performing any modifications to library code. To start, you'll most likely want to copy an existing algorithm from hstrat/stratum_retention_strategy/stratum_retention_algorithms to use as an API scaffold.

If writing a custom stratum retention algorithm, you will need to consider

  1. Prune sequencing.

    When you discard a stratum now, it won't be available later. If you need a stratum at a particular time point, you must be able to guarantee it hasn't already been discarded at any prior time point.

  2. Column composition across intermediate generations.

    For many use cases, resolution and column size guarantees will need to hold at all generations because the number of generations elapsed at the end of an experiment is not known a priori or the option of continuing the experiment with evolved genomes is desired.

  3. Tractability of computing the deposit generations of retained strata at an arbitrary generation.

    If this set of generations can be computed efficiently, then an efficient reverse mapping from column array index to deposition generation can be attained. Such a mapping enables deposition generation to be omitted from strata, potentially yielding a 50%+ space savings (depending on the differentia bit width used).

Algorithm Parameterizers

Stratum retention algorithms can be automatically parameterized for desired properties. For example, you may want to produce a policy from a particular algorithm that retains at most 100 strata at generation 1 million.

Two components are required to perform a parameterization.

The first, an "evaluator" controls which policy property is being parameterized for. Available options are

  • MrcaUncertaintyAbsExactEvaluator
  • MrcaUncertaintyAbsUpperBoundEvaluator
  • MrcaUncertaintyRelExactEvaluator
  • MrcaUncertaintyRelUpperBoundEvaluator
  • NumStrataRetainedExactEvaluator
  • NumStrataRetainedUpperBoundEvaluator

The second, a "parameterizer" controls whether the policy property should be paramaterized to be greater than, equal, or less than equal a target value. Available options are

  • PropertyAtMostParameterizer
  • PropertyAtLeastParameterizer
  • PropertyExactlyParameterizer

The evaluator should be provided as an argument to parameterizer, which should in turn be provided as an argument to the algorithm's Policy initializer.

stratum_retention_policy = hstrat.geom_seq_nth_root_tapered_algo.Policy(
    parameterizer=hstrat.PropertyAtMostParameterizer(
        target_value=127,
        policy_evaluator \
            =hstrat.MrcaUncertaintyAbsExactEvaluator(
                at_num_strata_deposited=256,
                at_rank=0,
        ),
    ),
)

References

Matthew Andres Moreno, Emily Dolson, Charles Ofria; July 18–22, 2022. "Hereditary Stratigraphy: Genome Annotations to Enable Phylogenetic Inference over Distributed Populations." Proceedings of the ALIFE 2022: The 2022 Conference on Artificial Life. ALIFE 2022: The 2022 Conference on Artificial Life. Online. (pp. 64). ASME.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

hcat

hcat

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hstrat-1.0.5.tar.gz (98.2 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hstrat-1.0.5-py2.py3-none-any.whl (293.5 kB view details)

Uploaded Python 2Python 3

File details

Details for the file hstrat-1.0.5.tar.gz.

File metadata

  • Download URL: hstrat-1.0.5.tar.gz
  • Upload date:
  • Size: 98.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for hstrat-1.0.5.tar.gz
Algorithm Hash digest
SHA256 23d0e72cc385aa690cca7e9d621e59323066fbaf03ebbef0a89890e67a96c2b6
MD5 7ec34462613010ff1b9d525338a6b5f6
BLAKE2b-256 8a0d55fee98912a65b8176d6610ddd9c232a5b53df5272bb378a3c3a6a0ccf4f

See more details on using hashes here.

File details

Details for the file hstrat-1.0.5-py2.py3-none-any.whl.

File metadata

  • Download URL: hstrat-1.0.5-py2.py3-none-any.whl
  • Upload date:
  • Size: 293.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for hstrat-1.0.5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 464f500ef370a28cfdb7637f933f7a84c508b6793dcb7d6ba67ed256571dffed
MD5 6c2b612a00450060602f71fa47d796fc
BLAKE2b-256 8db1e95e8d70bb4743c31f08cd967cf21605f5b53483f8e0ff9b84638e918e9b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page