Geographic Ancestry Inference Algorithm - Python implementation (continuous space parsimony methods)

gaiapy: Geographic Ancestry Inference Algorithm (Python)

gaiapy is a Python port of the GAIA R package for inferring the geographic locations of genetic ancestors using tree sequences. This package implements generalized parsimony methods for ancestral location reconstruction in continuous geographic space.

Note: This implementation is under active development. Use the R implementation for all scientific applications.

Note: This package is distributed on PyPI as geoancestry but the module name is gaiapy.

Current Implementation Status

Implemented and Ready to Use:

  • Quadratic parsimony - for ancestors in continuous space, minimizing sum of squared Euclidean distances
  • Linear parsimony - for ancestors in continuous space, minimizing sum of absolute (Manhattan) distances
  • Full metadata integration and tree sequence augmentation
  • Comprehensive validation and utility functions
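To make the two continuous-space criteria concrete, the sketch below considers a single ancestor with three sampled children. It ignores branch lengths and any weighting gaiapy may apply, so it illustrates the cost functions only and is not the package's algorithm. Under those assumptions the squared-distance cost is minimized at the coordinate-wise mean and the Manhattan cost at the coordinate-wise median. The helper names are illustrative and not part of gaiapy.

```python
import numpy as np

# Locations (x, y) of three sampled children of one hypothetical ancestor.
children = np.array([[1.5, 2.0], [4.2, 3.1], [6.7, 5.5]])

def quadratic_cost(point, pts):
    """Sum of squared Euclidean distances from point to each row of pts."""
    return float(np.sum((pts - point) ** 2))

def linear_cost(point, pts):
    """Sum of Manhattan (absolute) distances from point to each row of pts."""
    return float(np.sum(np.abs(pts - point)))

# Minimizers for this unweighted, single-ancestor case.
best_quadratic = children.mean(axis=0)
best_linear = np.median(children, axis=0)

print("quadratic optimum:", best_quadratic)
print("linear optimum:   ", best_linear)
```

The two optima generally differ: the mean is pulled toward outlying samples, while the median is not.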

🚧 Not Yet Implemented:

  • Discrete parsimony - for ancestors restricted to finite location sets (coming soon)
  • Ancestry coefficients - temporal analysis of ancestry proportions (coming soon)
  • Migration flux - migration flow analysis between regions (coming soon)

This package leverages the Python tskit API directly, avoiding the need for C wrappers and making the implementation more accessible for Python users and web applications.

Installation

Install from PyPI:

pip install geoancestry

Install from source for development:

git clone https://github.com/chris-a-talbot/gaiapy
cd gaiapy
pip install -e ".[dev]"
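Because the distribution name (geoancestry) differs from the import name (gaiapy), it can help to check both after installing. The following is a minimal sketch using only the Python standard library; the two helper functions are illustrative and not part of gaiapy.

```python
from importlib import metadata, util

def installed_version(dist_name="geoancestry"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def module_available(module_name="gaiapy"):
    """Return True if a top-level module can be found without importing it."""
    return util.find_spec(module_name) is not None

print("geoancestry version:", installed_version())
print("gaiapy importable:", module_available())
```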

Quick Start

Basic Continuous Space Reconstruction

import gaiapy as gp
import tskit
import numpy as np

# Load your tree sequence
ts = tskit.load("path/to/treesequence.trees")

# Define sample locations as [node_id, x_coord, y_coord]
# node_id: Tree sequence node IDs (0-based)
# x_coord, y_coord: Geographic coordinates (any coordinate system)
samples = np.array([
    [0, 1.5, 2.0],  # node 0 at coordinates (1.5, 2.0)
    [1, 4.2, 3.1],  # node 1 at coordinates (4.2, 3.1) 
    [2, 6.7, 5.5],  # node 2 at coordinates (6.7, 5.5)
    # ... more samples
])

# Quadratic reconstruction (minimizes sum of squared Euclidean distances)
mpr_quad = gp.quadratic_mpr(ts, samples)
locations_quad = gp.quadratic_mpr_minimize(mpr_quad)

# Linear reconstruction (minimizes sum of Manhattan distances)
mpr_lin = gp.linear_mpr(ts, samples)
locations_lin = gp.linear_mpr_minimize(mpr_lin)

print(f"Quadratic reconstruction shape: {locations_quad.shape}")
print(f"Linear reconstruction shape: {locations_lin.shape}")

Working with Tree Sequence Metadata

# If your tree sequence has location metadata, you can extract it automatically
sample_locs = gp.extract_sample_locations_from_metadata(ts)

# Or augment a tree sequence with location data
ts_with_locs = gp.augment_tree_sequence_with_locations(ts, samples)

# Use metadata-aware reconstruction
mpr_meta = gp.quadratic_mpr_with_metadata(ts_with_locs)
locations_meta = gp.quadratic_mpr_minimize(mpr_meta)

Advanced Options

# Discrete coordinate system reconstruction (useful for grid-based coordinates)
locations_discrete = gp.quadratic_mpr_minimize_discrete(mpr_quad)

# Alternative linear parsimony with discrete output
locations_lin_discrete = gp.linear_mpr_minimize_discrete(mpr_lin)

# Export results for further analysis
gp.export_locations_to_file(locations_quad, "ancestral_locations.tsv")

Key Functions (Currently Implemented)

Continuous Space Functions

  • quadratic_mpr() - Continuous space reconstruction using squared distances
  • linear_mpr() - Continuous space reconstruction using absolute distances
  • quadratic_mpr_minimize() - Find optimal continuous locations (quadratic)
  • linear_mpr_minimize() - Find optimal continuous locations (linear)
  • quadratic_mpr_minimize_discrete() - Discrete coordinate optimization (quadratic)
  • linear_mpr_minimize_discrete() - Discrete coordinate optimization (linear)

Metadata Integration

  • quadratic_mpr_with_metadata() - Metadata-aware quadratic reconstruction
  • linear_mpr_with_metadata() - Metadata-aware linear reconstruction
  • extract_sample_locations_from_metadata() - Extract locations from tree sequence metadata
  • augment_tree_sequence_with_locations() - Add location data to tree sequences
  • validate_location_metadata() - Validate location data format
  • export_locations_to_file() / import_locations_from_file() - I/O utilities

Input Data Format

Sample locations should be provided as a NumPy array with shape (n_samples, 3):

samples = np.array([
    [node_id, x_coordinate, y_coordinate],
    [node_id, x_coordinate, y_coordinate],
    # ...
])

  • node_id: Tree sequence node ID (0-based, integer)
  • x_coordinate, y_coordinate: Geographic coordinates (float, any coordinate system)
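A quick sanity check on the array before reconstruction might look like the sketch below. check_samples is a hypothetical helper, not part of gaiapy, and the rule that each node ID appears only once is an assumption rather than a documented requirement.

```python
import numpy as np

def check_samples(samples, num_nodes=None):
    """Validate a (n_samples, 3) array of [node_id, x, y] rows.

    Hypothetical helper for illustration; not part of gaiapy.
    """
    arr = np.asarray(samples, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != 3:
        raise ValueError(f"expected shape (n_samples, 3), got {arr.shape}")
    if not np.all(np.isfinite(arr)):
        raise ValueError("node IDs and coordinates must be finite")
    node_ids = arr[:, 0]
    if np.any(node_ids < 0) or np.any(node_ids != np.floor(node_ids)):
        raise ValueError("node IDs must be non-negative integers")
    # Assumption: a node has at most one observed location.
    if len(np.unique(node_ids)) != len(node_ids):
        raise ValueError("each node ID may appear only once")
    if num_nodes is not None and np.any(node_ids >= num_nodes):
        raise ValueError("node ID exceeds the number of nodes")
    return arr

samples = check_samples([[0, 1.5, 2.0], [1, 4.2, 3.1], [2, 6.7, 5.5]])
print(samples.shape)
```

When a tree sequence is at hand, passing its node count as num_nodes also catches IDs that fall outside the node table.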

Output Format

Reconstructed locations are returned as NumPy arrays with shape (n_nodes, 2) where:

  • Row index corresponds to tree sequence node ID
  • Column 0: x-coordinate of reconstructed location
  • Column 1: y-coordinate of reconstructed location
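Since the row index is the node ID, the output array can be indexed directly with node IDs. The sketch below uses a small stand-in array rather than a real reconstruction, and the (parent, child) pairs are made up for illustration; in a real tree sequence they would come from the edge table.

```python
import numpy as np

# Stand-in for a reconstruction result: row i holds (x, y) for node i.
locations = np.array([
    [0.0, 0.0],  # node 0
    [4.0, 0.0],  # node 1
    [2.0, 1.5],  # node 2, hypothetical parent of nodes 0 and 1
])

# Hypothetical (parent, child) node ID pairs.
edges = np.array([[2, 0], [2, 1]])

# Look up a single node's coordinates by its ID.
x, y = locations[2]

# Displacement and Euclidean distance from each parent to its child.
displacement = locations[edges[:, 1]] - locations[edges[:, 0]]
distance = np.linalg.norm(displacement, axis=1)

print(f"node 2 is at ({x}, {y})")
print("parent-to-child distances:", distance)
```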

Coming Soon

The following features from the original GAIA R package are planned for future releases:

  • Discrete parsimony - discrete_mpr(), discrete_mpr_minimize(), discrete_mpr_edge_history()
  • Ancestry analysis - discrete_mpr_ancestry(), discrete_mpr_ancestry_flux()

References

Grundler, M.C., Terhorst, J., and Bradburd, G.S. (2025) A geographic history of human genetic ancestry. Science 387(6741): 1391-1397. DOI: 10.1126/science.adp4642

License

MIT License (adapted from original CC-BY 4.0 International)

Download files

Source Distribution: geoancestry-0.1.2.tar.gz (37.0 kB)

Built Distribution: geoancestry-0.1.2-py3-none-any.whl (31.3 kB)
