Project description
Project description
tspaint — Tree-Sequence Local Ancestry Inference
Soft, calibrated local ancestry along individual haplotypes, inferred from an
inferred tree sequence (Relate → tskit, or tsinfer/tsdate). Ancestry is modelled
as a discrete character evolving up the branches of each marginal tree under a
continuous-time Markov chain (CTMC, "mugration"); reference haplotypes label tips
through a soft noise model with learned per-tip credibility; and
(Q, root frequencies, {w_i}) are fit by EM whose sufficient statistics are
accumulated per tree-sequence edge, span-weighted — so a clade persisting across
many trees is counted once. Output: for every haplotype, at every position, a
calibrated posterior P(ancestry) — soft, not a hard call.
The same machinery, run on reference tips, is a leave-one-out
introgression/mislabel detector. See CLAUDE.md for the authoritative
spec, the math, and the design decisions.
Status
Research implementation. The inference engine and the soft-LAI deliverable are implemented and validated on simulated truth:
| Component | Module | Validation |
|---|---|---|
| Ancestry CTMC, emission, credibility | model.py |
unit |
| Per-branch dwell/jump stats (Van Loan) | branch_stats.py |
analytic 2-state + quadrature |
| Felsenstein pruning (polytomy/forest/isolated, leave-one-out) | pruning.py |
brute-force enumeration |
| Edge-blocked span-weighted sufficient stats | accumulate.py |
no-double-count keystone |
Blocked EM (Q, π, {w_i}) |
em.py |
monotone log-lik; Q recovery; mislabel detection |
| Per-haplotype posteriors + missing-info tagging | output.py |
coverage, prior-fallback |
| Accuracy / calibration / flicker metrics | validate.py, experiments.py |
end-to-end on admixture sims |
| tsinfer inferred-ARG front end | io_tsinfer.py |
inferred-ARG accuracy (bounded by ARG accuracy) |
| ARG-posterior ensemble merge | ensemble.py |
average paintings across tree-sequence samples + uncertainty band |
| SINGER posterior front end | io_singer.py |
run SINGER MCMC → posterior ARG samples (largely lifts the §9 ARG bound) |
| Head-to-head harness + RFMix comparator | compare.py, io_rfmix.py |
score painters (tspaint, ARG-native baseline, RFMix) on one truth; tspaint matches RFMix accuracy, stays calibrated |
| Hard segmentation (deadband) + fragmentation metrics | output.py, validate.py |
recovers the true tract-length distribution for dating (CLAUDE.md §9) |
High-level paint() API |
api.py |
one-call fit + paint → Painting |
| Horizontal BP/EP smoother | bp/ |
wins on inferred ARGs (breakpoint F1 0.71→0.98), redundant on true ARG (CLAUDE.md §7) |
| Reference QC + actionable soft_refs/mask (Task 1) | introgression.py, sim.py |
impure-ref discrimination + LOO introgression recall ~0.9 (CLAUDE.md §9) |
| Ghost / archaic search — depth-emission HMM (Task 2) | archaic.py (detect_ghost) |
per-locus recall 0.99–1.00 at precision 1.0 reference-free; SINGER-ensemble + calibration-robust depth="rank" (CLAUDE.md §9) |
On strong-structure msprime sims (the true ARG), painting accuracy is ~1.0 with good
calibration, and breakpoint flicker is ~1000× below the true-tract discontinuity — so the
blocked-EM approximation is sufficient on the true ARG, where the hard_segments deadband
recovers the tract-length distribution for dating. On inferred ARGs a horizontal BP/EP
smoother adds value (it suppresses tree-inference-induced spurious switches a per-position
threshold cannot — breakpoint F1 0.71→0.98): use tspaint.paint(..., smooth=True) or tspaint.bp
(CLAUDE.md §7). On a tsinfer-inferred ARG, painting is
bounded by ARG accuracy: ~0.88 with dense variants down toward chance when variants
are sparse — tree accuracy, not tract length, is the binding constraint (§9). With
SINGER (Bayesian posterior ARG sampling) that bound largely disappears: single
posterior samples paint ~0.99 even at sparse data where tsinfer gives chance, and merging
the posterior ensemble adds a calibrated uncertainty band — the ARG inference method
matters far more than the merge (§7.4).
Scaling. The E-step is O(#trees × #nodes) per EM iteration (after exact expm
caching); at a fixed region length #trees is region-bounded, so runtime is roughly
linear in sample size. Measured (true ARG, 0.1 Mb region): ~480 haplotypes fits in
~5 s (4 iterations), ~1.2 s/iteration; tsinfer inference adds ~1–3 s. ~500
haplotypes is comfortable for region/chromosome-scale analyses, with accuracy and
flicker unaffected by sample size. The genome E-step is an exact map-reduce over
independent trees, so it parallelises near-linearly across cores: paint(..., n_jobs=N)
/ tspaint.fit(..., n_jobs=N) (and the tspaint --cores CLI) split it bit-exactly over
a process pool (tspaint.parallel; see paral-assess.md), and across
the cluster each tspaint subcommand is a GWF job over SINGER windows × ensemble members.
Outstanding: the remaining external comparators (RFMix is wired via
tspaint.compare.rfmix_paint; MOSAIC/FLARE and the ARG-native ARGMix / Pearson & Durbin need
separate installs); the Relate --compress front end (io_relate.py); and the regime where
merging SINGER posterior samples improves accuracy rather than only adding a calibrated band
(CLAUDE.md §7.4).
Closed: the §6 order-only / ranked variant (measured — not beneficial; the short-region
failure was π-identifiability, fixed by holding π uniform, estimate_pi=False); the RFMix
head-to-head (tspaint matches RFMix's accuracy while staying calibrated); the segment-fragmentation
question for dating (the hard_segments deadband recovers the true tract-length distribution, and
on inferred ARGs the horizontal BP smoother — now in tspaint.bp / paint(smooth=True) — helps
further, CLAUDE.md §7, §9).
Install
pixi install # dev environment (tskit, msprime, numpy, scipy, ...)
pixi run test # run the test suite
Quickstart
import tspaint
# 1. Get a tree sequence — here, simulate admixture with known truth
# (or build one from genotypes: tspaint.io.tsinfer / relate / singer — ts | VCF Zarr | VCF).
ts = tspaint.simulate_admixture(n_admix=10, n_ref=10, sequence_length=1e6,
T_admix=100, T_split=5000, Ne=1000, random_seed=1)
# 2. Label the reference haplotypes (sample-node id -> ancestry state) and paint the rest.
pop = ts.tables.nodes.population
name = {p: ts.population(p).metadata["name"] for p in range(ts.num_populations)}
state = {p: i for i, p in enumerate(p for p, n in name.items() if n in ("A", "B"))}
labels = {int(s): state[pop[s]] for s in ts.samples() if pop[s] in state}
painting = tspaint.paint(ts, labels) # EM-fit (Q[, π, w]) on references, paint the queries
painting.posteriors[painting.queries[0]] # soft per-position posterior over ancestry (Segments)
painting.segments(deadband=0.4) # hard ancestry tracts (for tract-length / dating)
Command-line interface (GWF / cluster)
The tspaint command wraps the library so a pipeline runs as file-in/file-out jobs (e.g. under
GWF) with no glue Python. Computed results are .npz; hand-authored inputs are
text (labels JSON {"<node>": <state>}; id-lists inline 3,4,5 or @file). --cores/-j defaults
to $SLURM_JOB_CPUS_PER_NODE.
tspaint simulate --n-admix 8 --n-ref 8 -o sim.trees --labels-out labels.json # or your own data
tspaint fit sim.trees --labels labels.json -j 8 -o params.npz # one pooled fit
tspaint paint sim.trees --params params.npz -j 8 -o painting.npz # paint (per member)
tspaint merge painting.npz -o merged.npz # ensemble mean + band
tspaint date sim.trees --labels labels.json -o rtt.npz # also qc/introgress/ghost/archaic
For a large chromosome, build the SINGER posterior ensemble window-by-window
(tspaint trees singer-window per 5 Mb window, then tspaint trees merge-arg per member) and
feed the members to fit / paint / merge. See examples/workflow.py
for a runnable GWF Workflow and paral-assess.md for the parallelism design
and the bit-exactness contract.
Public API
Core — tspaint.paint(ts, labels, queries=None, *, deadband=…) returns a Painting with
.posteriors (soft Segment tracks), .segments(deadband=…) (hard tracts), and the fitted
.Q / .pi / .w. Building blocks: tspaint.fit, posterior_table, hard_segments, Segment,
make_generator_2state. Simulation: simulate_admixture, local_ancestry_truth.
Dating (optional) — tspaint.fit_rate_through_time(ts, labels) estimates the directional
cross-ancestry rate through time (RateThroughTime with .q_AB / .q_BA / .plot()) — when the
labelled ancestries diverged / exchanged genes. Painting.rate_through_time() reuses a painting's
fit; it returns a new profile and does not change the painting (Q(t) gives no painting-accuracy
gain — the paths stay side by side).
Namespaces
| Namespace | What |
|---|---|
tspaint.metrics |
balanced_accuracy, reliability_curve, breakpoint_precision_recall, switch_density, tract_boundary_error, … |
tspaint.compare |
painters tspaint_paint, nearest_reference_paint, rfmix_paint + head_to_head |
tspaint.io |
unified front ends tsinfer / relate / singer (accept ts | VCF Zarr | VCF) + add_mutations |
tspaint.bp |
horizontal BP/EP smoother (bp_paint, bp_smooth) — also paint(smooth=True); helps on inferred ARGs (§7) |
tspaint.dating |
admixture rate through time (time-inhomogeneous directional mugration EM): fit_rate_through_time, RateThroughTime, paint_qt |
tspaint.experiments |
end-to-end drivers: admixture_experiment, age_sweep, fragmentation_experiment, singer_ensemble_experiment, … |
tspaint.introgression |
Task 1 — reference QC: reference_qc (with actionable .soft_refs() / .mask()), anonymous foreign_tracts (min_depth= → the fast deep-ghost flag) — built on the loo_posterior_table lens |
tspaint.archaic |
Task 2 — ghost search: detect_ghost — reference-free depth-emission HMM for calibrated per-locus P(ghost); accepts a SINGER ensemble and depth="rank" (calibration-robust). (detect_archaic is a deprecated alias) |
Lower-level machinery is in the named submodules (tspaint.model, tspaint.pruning,
tspaint.accumulate, tspaint.em, tspaint.output, tspaint.ensemble, tspaint.ranked);
tspaint.parallel (bit-exact multi-core E-step), tspaint.serialize (.npz save/load), and
tspaint.cli (the tspaint command).
See notebooks/ for the persistence go/no-go (00), accuracy / calibration /
accuracy-vs-age (01), the flicker / bp/ decision (02), and haplotype paintings +
runtime/correctness/flicker scaling across regimes (03).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tspaint-0.1.18.tar.gz.
File metadata
- Download URL: tspaint-0.1.18.tar.gz
- Upload date:
- Size: 2.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fd472f4983883378f8571bc0929899b5d943bd985ed79ea0811ab739be6199fa
|
|
| MD5 |
460c533b4f063b1e372de4bb37e29ccc
|
|
| BLAKE2b-256 |
10b45ad5dd39ac3888b15c8a35c0ce6ea3cdcf88228c330fad46835d0667315c
|
File details
Details for the file tspaint-0.1.18-py3-none-any.whl.
File metadata
- Download URL: tspaint-0.1.18-py3-none-any.whl
- Upload date:
- Size: 200.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
48ef9f5c0a165a6602162176ef615c8c4123ff06d49803d7ebb3b659afbba711
|
|
| MD5 |
294f69a125078bc94b04e889a1e06163
|
|
| BLAKE2b-256 |
bb39f1467cbc218b2b2a3a6ec3054c7b985a5171b2dee0aeccbe049031c1fcb9
|