Project description

These details have not been verified by PyPI

Project links

Project description

tspaint — Tree-Sequence Local Ancestry Inference

Soft, calibrated local ancestry along individual haplotypes, inferred from an inferred tree sequence (Relate → tskit, or tsinfer/tsdate). Ancestry is modelled as a discrete character evolving up the branches of each marginal tree under a continuous-time Markov chain (CTMC, "mugration"); reference haplotypes label tips through a soft noise model with learned per-tip credibility; and (Q, root frequencies, {w_i}) are fit by EM whose sufficient statistics are accumulated per tree-sequence edge, span-weighted — so a clade persisting across many trees is counted once. Output: for every haplotype, at every position, a calibrated posterior P(ancestry) — soft, not a hard call.

The same machinery, run on reference tips, is a leave-one-out introgression/mislabel detector. See CLAUDE.md for the authoritative spec, the math, and the design decisions.

Status

Research implementation. The inference engine and the soft-LAI deliverable are implemented and validated on simulated truth:

Component	Module	Validation
Ancestry CTMC, emission, credibility	`model.py`	unit
Per-branch dwell/jump stats (Van Loan)	`branch_stats.py`	analytic 2-state + quadrature
Felsenstein pruning (polytomy/forest/isolated, leave-one-out)	`pruning.py`	brute-force enumeration
Edge-blocked span-weighted sufficient stats	`accumulate.py`	no-double-count keystone
Blocked EM `(Q, π, {w_i})`	`em.py`	monotone log-lik; Q recovery; mislabel detection
Per-haplotype posteriors + missing-info tagging	`output.py`	coverage, prior-fallback
Accuracy / calibration / flicker metrics	`validate.py`, `experiments.py`	end-to-end on admixture sims
tsinfer inferred-ARG front end	`io_tsinfer.py`	inferred-ARG accuracy (bounded by ARG accuracy)
ARG-posterior ensemble merge	`ensemble.py`	average paintings across tree-sequence samples + uncertainty band
SINGER posterior front end	`io_singer.py`	run SINGER MCMC → posterior ARG samples (largely lifts the §9 ARG bound)
Head-to-head harness + RFMix comparator	`compare.py`, `io_rfmix.py`	score painters (tspaint, ARG-native baseline, RFMix) on one truth; tspaint matches RFMix accuracy, stays calibrated
Hard segmentation (deadband) + fragmentation metrics	`output.py`, `validate.py`	recovers the true tract-length distribution for dating (CLAUDE.md §9)
High-level `paint()` API	`api.py`	one-call fit + paint → `Painting`
Horizontal BP/EP smoother	`bp/`	wins on inferred ARGs (breakpoint F1 0.71→0.98), redundant on true ARG (CLAUDE.md §7)
Reference QC + actionable soft_refs/mask (Task 1)	`introgression.py`, `sim.py`	impure-ref discrimination + LOO introgression recall ~0.9 (CLAUDE.md §9)
Ghost / archaic search — depth-emission HMM (Task 2)	`archaic.py` (`detect_ghost`)	per-locus recall 0.99–1.00 at precision 1.0 reference-free; SINGER-ensemble + calibration-robust `depth="rank"` (CLAUDE.md §9)

On strong-structure msprime sims (the true ARG), painting accuracy is ~1.0 with good calibration, and breakpoint flicker is ~1000× below the true-tract discontinuity — so the blocked-EM approximation is sufficient on the true ARG, where the hard_segments deadband recovers the tract-length distribution for dating. On inferred ARGs a horizontal BP/EP smoother adds value (it suppresses tree-inference-induced spurious switches a per-position threshold cannot — breakpoint F1 0.71→0.98): use tspaint.paint(..., smooth=True) or tspaint.bp (CLAUDE.md §7). On a tsinfer-inferred ARG, painting is bounded by ARG accuracy: ~0.88 with dense variants down toward chance when variants are sparse — tree accuracy, not tract length, is the binding constraint (§9). With SINGER (Bayesian posterior ARG sampling) that bound largely disappears: single posterior samples paint ~0.99 even at sparse data where tsinfer gives chance, and merging the posterior ensemble adds a calibrated uncertainty band — the ARG inference method matters far more than the merge (§7.4).

Scaling. The E-step is O(#trees × #nodes) per EM iteration (after exact expm caching); at a fixed region length #trees is region-bounded, so runtime is roughly linear in sample size. Measured (true ARG, 0.1 Mb region): ~480 haplotypes fits in ~5 s (4 iterations), ~1.2 s/iteration; tsinfer inference adds ~1–3 s. ~500 haplotypes is comfortable for region/chromosome-scale analyses, with accuracy and flicker unaffected by sample size. The genome E-step is an exact map-reduce over independent trees, so it parallelises near-linearly across cores: paint(..., n_jobs=N) / tspaint.fit(..., n_jobs=N) (and the tspaint --cores CLI) split it bit-exactly over a process pool (tspaint.parallel; see paral-assess.md), and across the cluster each tspaint subcommand is a GWF job over SINGER windows × ensemble members.

Outstanding: the remaining external comparators (RFMix is wired via tspaint.compare.rfmix_paint; MOSAIC/FLARE and the ARG-native ARGMix / Pearson & Durbin need separate installs); the Relate --compress front end (io_relate.py); and the regime where merging SINGER posterior samples improves accuracy rather than only adding a calibrated band (CLAUDE.md §7.4).

Closed: the §6 order-only / ranked variant (measured — not beneficial; the short-region failure was π-identifiability, fixed by holding π uniform, estimate_pi=False); the RFMix head-to-head (tspaint matches RFMix's accuracy while staying calibrated); the segment-fragmentation question for dating (the hard_segments deadband recovers the true tract-length distribution, and on inferred ARGs the horizontal BP smoother — now in tspaint.bp / paint(smooth=True) — helps further, CLAUDE.md §7, §9).

Install

pixi install          # dev environment (tskit, msprime, numpy, scipy, ...)
pixi run test         # run the test suite

Quickstart

import tspaint

# 1. Get a tree sequence — here, simulate admixture with known truth
#    (or build one from genotypes: tspaint.io.tsinfer / relate / singer — ts | VCF Zarr | VCF).
ts = tspaint.simulate_admixture(n_admix=10, n_ref=10, sequence_length=1e6,
                              T_admix=100, T_split=5000, Ne=1000, random_seed=1)

# 2. Label the reference haplotypes (sample-node id -> ancestry state) and paint the rest.
pop = ts.tables.nodes.population
name = {p: ts.population(p).metadata["name"] for p in range(ts.num_populations)}
state = {p: i for i, p in enumerate(p for p, n in name.items() if n in ("A", "B"))}
labels = {int(s): state[pop[s]] for s in ts.samples() if pop[s] in state}

painting = tspaint.paint(ts, labels)          # EM-fit (Q[, π, w]) on references, paint the queries
painting.posteriors[painting.queries[0]]    # soft per-position posterior over ancestry (Segments)
painting.segments(deadband=0.4)             # hard ancestry tracts (for tract-length / dating)

Command-line interface (GWF / cluster)

The tspaint command wraps the library so a pipeline runs as file-in/file-out jobs (e.g. under GWF) with no glue Python. Computed results are .npz; hand-authored inputs are text (labels JSON {"<node>": <state>}; id-lists inline 3,4,5 or @file). --cores/-j defaults to $SLURM_JOB_CPUS_PER_NODE.

tspaint simulate --n-admix 8 --n-ref 8 -o sim.trees --labels-out labels.json   # or your own data
tspaint fit   sim.trees --labels labels.json -j 8 -o params.npz                 # one pooled fit
tspaint paint sim.trees --params params.npz -j 8 -o painting.npz                # paint (per member)
tspaint merge painting.npz -o merged.npz                                        # ensemble mean + band
tspaint date  sim.trees --labels labels.json -o rtt.npz                         # also qc/introgress/ghost/archaic

For a large chromosome, build the SINGER posterior ensemble window-by-window (tspaint trees singer-window per 5 Mb window, then tspaint trees merge-arg per member) and feed the members to fit / paint / merge. See examples/workflow.py for a runnable GWF Workflow and paral-assess.md for the parallelism design and the bit-exactness contract.

Public API

Core — tspaint.paint(ts, labels, queries=None, *, deadband=…) returns a Painting with .posteriors (soft Segment tracks), .segments(deadband=…) (hard tracts), and the fitted .Q / .pi / .w. Building blocks: tspaint.fit, posterior_table, hard_segments, Segment, make_generator_2state. Simulation: simulate_admixture, local_ancestry_truth.

Dating (optional) — tspaint.fit_rate_through_time(ts, labels) estimates the directional cross-ancestry rate through time (RateThroughTime with .q_AB / .q_BA / .plot()) — when the labelled ancestries diverged / exchanged genes. Painting.rate_through_time() reuses a painting's fit; it returns a new profile and does not change the painting (Q(t) gives no painting-accuracy gain — the paths stay side by side).

Namespaces

Namespace	What
`tspaint.metrics`	`balanced_accuracy`, `reliability_curve`, `breakpoint_precision_recall`, `switch_density`, `tract_boundary_error`, …
`tspaint.compare`	painters `tspaint_paint`, `nearest_reference_paint`, `rfmix_paint` + `head_to_head`
`tspaint.io`	unified front ends `tsinfer` / `relate` / `singer` (accept ts \| VCF Zarr \| VCF) + `add_mutations`
`tspaint.bp`	horizontal BP/EP smoother (`bp_paint`, `bp_smooth`) — also `paint(smooth=True)`; helps on inferred ARGs (§7)
`tspaint.dating`	admixture rate through time (time-inhomogeneous directional mugration EM): `fit_rate_through_time`, `RateThroughTime`, `paint_qt`
`tspaint.experiments`	end-to-end drivers: `admixture_experiment`, `age_sweep`, `fragmentation_experiment`, `singer_ensemble_experiment`, …
`tspaint.introgression`	Task 1 — reference QC: `reference_qc` (with actionable `.soft_refs()` / `.mask()`), anonymous `foreign_tracts` (`min_depth=` → the fast deep-ghost flag) — built on the `loo_posterior_table` lens
`tspaint.archaic`	Task 2 — ghost search: `detect_ghost` — reference-free depth-emission HMM for calibrated per-locus `P(ghost)`; accepts a SINGER ensemble and `depth="rank"` (calibration-robust). (`detect_archaic` is a deprecated alias)

Lower-level machinery is in the named submodules (tspaint.model, tspaint.pruning, tspaint.accumulate, tspaint.em, tspaint.output, tspaint.ensemble, tspaint.ranked); tspaint.parallel (bit-exact multi-core E-step), tspaint.serialize (.npz save/load), and tspaint.cli (the tspaint command).

See notebooks/ for the persistence go/no-go (00), accuracy / calibration / accuracy-vs-age (01), the flicker / bp/ decision (02), and haplotype paintings + runtime/correctness/flicker scaling across regimes (03).

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.18

Jul 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tspaint-0.1.18.tar.gz (2.4 MB view details)

Uploaded Jul 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

tspaint-0.1.18-py3-none-any.whl (200.5 kB view details)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file tspaint-0.1.18.tar.gz.

File metadata

Download URL: tspaint-0.1.18.tar.gz
Upload date: Jul 1, 2026
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tspaint-0.1.18.tar.gz
Algorithm	Hash digest
SHA256	`fd472f4983883378f8571bc0929899b5d943bd985ed79ea0811ab739be6199fa`
MD5	`460c533b4f063b1e372de4bb37e29ccc`
BLAKE2b-256	`10b45ad5dd39ac3888b15c8a35c0ce6ea3cdcf88228c330fad46835d0667315c`

See more details on using hashes here.

File details

Details for the file tspaint-0.1.18-py3-none-any.whl.

File metadata

Download URL: tspaint-0.1.18-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 200.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tspaint-0.1.18-py3-none-any.whl
Algorithm	Hash digest
SHA256	`48ef9f5c0a165a6602162176ef615c8c4123ff06d49803d7ebb3b659afbba711`
MD5	`294f69a125078bc94b04e889a1e06163`
BLAKE2b-256	`bb39f1467cbc218b2b2a3a6ec3054c7b985a5171b2dee0aeccbe049031c1fcb9`

See more details on using hashes here.

tspaint 0.1.18

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

tspaint — Tree-Sequence Local Ancestry Inference

Status

Install

Quickstart

Command-line interface (GWF / cluster)

Public API

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes