aadr-resolve
AADR cross-version GeneticID / MasterID join utility for ancient-DNA and population-genetics workflows.
aadr-resolve reads AADR (Allen Ancient DNA Resource) .anno files
across one or more releases and resolves the cross-version sample-ID
join through the Master ID column — the part every ancient-DNA pipeline
currently re-implements with custom awk. It handles AADR's progressive
de-anonymization (I0001 in v44.3 → Loschbour.AG in v66) and the
periodic Master-ID renames (9-18 per consecutive version pair; ~62
cumulative v44.3 → v66.0) automatically.
The high-level design (HLD) pins behavior and the low-level design (LLD) pins implementation; both live in the companion wiki, alongside the bench-verify report:
- HLD: cs-wiki/projects/aadr-resolve.md
- LLD: cs-wiki/projects/aadr-resolve-lld.md
- Bench-verify report: cs-wiki/projects/aadr-resolve-bench-verify.md
Install
pip install aadr-resolve
Requires Python 3.11+. Dependencies: pandas 2.x, click 8.x, PyYAML 6.x.
Quickstart
Resolve a single sample across two AADR releases.
aadr-resolve lookup I0001 \
--anno-files v44.3_1240K_public.anno \
--anno-files v66.0_1240K_public.anno
Output (stdout):
query: I0001
canonical individual_id: Loschbour (matched via individual_id)
v44.3 rows: 1
I0001 Luxembourg_Loschbour 537,182 SNPs
v66.0 rows: 2
Loschbour.AG Luxembourg_Mesolithic.AG 155,036 SNPs pgid=33
Loschbour.DG Luxembourg_Mesolithic.DG 620,881 SNPs pgid=39136
master_id_bridge: v44.3 I0001 → v66.0 Loschbour (via shared GID Loschbour.DG)
status: present_in_2_of_2_versions; multi_row; individual_id_renamed
Recreate a cohort against a newer release.
aadr-resolve cohort patterson_2022_whga.txt \
--anno-files v44.3_1240K_public.anno \
--anno-files v66.0_1240K_public.anno \
--cohort-version v44.3 \
-o whga_v66_manifest.tsv
The manifest is a TSV with one row per (individual × library), with
per-version genetic_id / group_id / snps_hit_1240k columns plus
per-adjacent-pair group_id_change_class_v{old}_to_v{new} columns,
ready to feed into downstream relabeling tools like pgen-samplebind.
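As a rough illustration of consuming the manifest downstream, the sketch below scans a manifest-like TSV for rows whose change-class column reads substantive_regroup. The exact header spelling used here (`group_id_change_class_v44.3_to_v66.0`), the `individual_id` column name, and the inline sample rows are assumptions for illustration only; the HLD remains the reference for the real layout.

```python
import csv
import io

# Hypothetical manifest excerpt with placeholder IDs; real files have more columns.
SAMPLE = (
    "individual_id\tgroup_id_change_class_v44.3_to_v66.0\n"
    "IND_A\tconvention_restructure_suffix\n"
    "IND_B\tsubstantive_regroup\n"
    "IND_C\t\n"
)

# Column prefix taken from the manifest description; the version part varies.
PREFIX = "group_id_change_class_"


def regrouped_individuals(handle, label="substantive_regroup"):
    """Return individual_ids that carry `label` in any change-class column."""
    reader = csv.DictReader(handle, delimiter="\t")
    class_cols = [c for c in (reader.fieldnames or []) if c.startswith(PREFIX)]
    hits = []
    for row in reader:
        if any(row[c] == label for c in class_cols):
            hits.append(row["individual_id"])
    return hits


flagged = regrouped_individuals(io.StringIO(SAMPLE))
print(flagged)
```

With a real manifest, the same function would take `open("whga_v66_manifest.tsv", newline="")` in place of the in-memory sample.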
Output (stdout summary block):
Loaded 2 .anno file(s):
[v44.3] v44.3_1240K_public.anno: 9,275 rows × 43 cols, class A
[v66.0] v66.0_1240K_public.anno: 23,250 rows × 49 cols, class E
Cross-version bridge:
GID-stable MID-rename detection: 9 events
Manual --mid-bridge entries: 0
Cross-lab MID collision check: no collisions detected
Cohort input: patterson_2022_whga.txt (40 individuals)
Resolved in latest version: 37
Added after earliest: 1
Removed before latest: 2
Group ID changes (v44.3 → v66.0):
convention_restructure_suffix 18
partial 1
substantive_regroup 2
Wrote whga_v66_manifest.tsv (40 rows × 15 cols)
Sample turnover within cohort: 5.0% — PASS
Done in 1.4s.
Add --quiet to suppress the block. Add --report-json summary.json
to also emit a run-level JSON sidecar that loads cheaply via
json.load — see docs/REPORT_JSON_SCHEMA.md.
Structured diff between two releases.
aadr-resolve diff v62.0.anno v66.0.anno --tsv > v62_to_v66_changes.tsv
Emits one row per change event: added, removed, genetic_id_renamed,
master_id_renamed, group_changed (with a per-class label —
convention_restructure_suffix etc.).
For large diffs at AADR scale, stream the per-event TSV alongside a small summary JSON instead of buffering the full event list:
aadr-resolve diff v62.0.anno v66.0.anno \
-o changes_summary.json \
--report changes_events.tsv \
--report-json summary.json
--report PATH streams one row per event (constant memory) and
--report-json PATH writes the run-level summary (~few KB, loads
cheaply via json.load). The diff stdout summary block routes to
stderr when stdout is carrying the JSON payload, so pipes stay clean.
Subcommands
| Command | Purpose |
|---|---|
| lookup | Resolve a single sample across N versions |
| cohort | Emit a cross-version manifest for a user-supplied cohort |
| diff | Structured diff between two versions |
| join | Wide-format pairwise table over the full v_old ∪ v_new individual set |
| schema | Diagnostic: report the detected schema class |
aadr-resolve lookup
aadr-resolve lookup INDIVIDUAL_OR_GENETIC_ID \
--anno-files PATH [--anno-files PATH ...]
[--json]
The argument is treated as an individual_id (IID) by default and falls
back to genetic_id if no IID matches. The MID-rename bridge is built
automatically from the supplied versions and reported under
master_id_bridge in the output.
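The idea behind a GID-stable rename bridge can be sketched in a few lines: if a genetic ID appears in both releases but its Master ID differs, the old Master ID is recorded as renamed to the new one. This is an illustrative simplification under that assumption, using placeholder IDs and a hypothetical helper name; the tool's actual detection (see the LLD) may apply additional checks such as collision handling.

```python
def detect_mid_renames(old, new):
    """Map old Master ID -> new Master ID via genetic IDs shared by both releases.

    `old` and `new` are dicts of genetic_id -> master_id, one per release.
    """
    bridge = {}
    for gid, old_mid in old.items():
        new_mid = new.get(gid)
        # A rename event: same genetic ID, different Master ID.
        if new_mid is not None and new_mid != old_mid:
            bridge[old_mid] = new_mid
    return bridge


# Placeholder data, not taken from any AADR release.
old_release = {"S1.DG": "X0001", "S2.SG": "S2"}
new_release = {"S1.DG": "Sample1", "S1.AG": "Sample1", "S2.SG": "S2"}

bridge = detect_mid_renames(old_release, new_release)
print(bridge)
```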
aadr-resolve cohort
aadr-resolve cohort COHORT_FILE \
--anno-files PATH [--anno-files PATH ...]
[--cohort-version LABEL]
-o OUT.tsv [--json]
[--no-propagate]
[--collapse-to-individual]
[--gid-preference AG,DG,SG,HO,TW,BY,AA,EC,WGC,bare]
[--turnover-warn 0.05] [--turnover-fail 0.30]
[--cohort-coverage-warn 0.50] [--cohort-coverage-fail 0.25]
[--report-json PATH]
COHORT_FILE is a TSV: one column for individual_id, optional second
column for cohort_label. --cohort-version is auto-detected from the
supplied annos when omitted. Default output is row-per-(individual ×
library); --collapse-to-individual reduces to one row per individual
via the --gid-preference suffix priority. --report-json PATH writes
a run-level summary sidecar (~few KB) for CI dashboards.
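To make the suffix-priority idea concrete, here is a small sketch of how one row per individual could be chosen from several genetic IDs. It assumes the suffix is whatever follows the last dot and that an ID without a recognized suffix counts as `bare`; the helper names are hypothetical and the real parsing rules may be more involved.

```python
# Default order copied from the --gid-preference option.
DEFAULT_PREFERENCE = ("AG", "DG", "SG", "HO", "TW", "BY", "AA", "EC", "WGC", "bare")


def suffix_of(genetic_id, known):
    """Return the trailing dot-suffix if it is a known tag, else 'bare'."""
    _, dot, tail = genetic_id.rpartition(".")
    return tail if dot and tail in known else "bare"


def pick_preferred(genetic_ids, preference=DEFAULT_PREFERENCE):
    """Pick the genetic ID whose suffix ranks earliest in `preference`."""
    rank = {tag: i for i, tag in enumerate(preference)}
    # Suffixes missing from the preference list sort last.
    return min(genetic_ids, key=lambda gid: rank.get(suffix_of(gid, rank), len(rank)))


chosen = pick_preferred(["Loschbour.DG", "Loschbour.AG"])
print(chosen)  # Loschbour.AG
```

Reordering the tuple, for example putting `DG` first, changes which row survives the collapse.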
aadr-resolve diff
aadr-resolve diff V_OLD.anno V_NEW.anno
[--json | --tsv]
[-o OUT]
[--include-class CLASS [--include-class CLASS ...]]
[--all-events]
[--turnover-warn 0.05] [--turnover-fail 0.30]
[--substantive-regroup-fail INT]
[--report PATH] [--report-json PATH]
JSON output is summary-first: per-class counts always included;
per-event arrays only for substantive_regroup (always) and any class
named via --include-class, or all classes when --all-events is set.
--tsv switches to streamed one-row-per-event format.
For large diffs, prefer the streamed sidecars: --report PATH writes
per-event TSV with constant memory; --report-json PATH writes the
run-level summary. The summary block routes to stderr when stdout is
the JSON payload, so aadr-resolve diff a.anno b.anno | jq ... works
without breaking the pipe.
aadr-resolve join
aadr-resolve join V_OLD.anno V_NEW.anno
-o OUT.tsv [--json]
[--collapse-to-individual]
[--gid-preference AG,DG,SG,HO,TW,BY,AA,EC,WGC,bare]
Wide-format pairwise table over the full v_old ∪ v_new canonical
individual_id set. Same output schema as cohort; useful when you
don't have a pre-existing cohort list.
aadr-resolve schema
aadr-resolve schema PATH [--json]
Diagnostic: detects which schema class (A–E) the .anno belongs to and
reports the column layout. Useful for debugging "why does this .anno
not load."
Shared options
These apply to all subcommands:
| Option | Default | Notes |
|---|---|---|
| --schema-override CLASS | auto | Force schema class A/B/C/D/E (e.g., renamed .anno) |
| --version-label LABEL | auto | Force version label (when filename pattern doesn't match) |
| --mid-bridge FILE | none | Manual master_id-rename TSV layered on auto-detected bridge |
| --on-mid-collision {error,warn} | error | Cross-lab MID collision policy |
| --quiet | false | Suppress the "Wrote N rows" progress line |
Library API
The same functionality is available in-process:
from aadr_resolve import (
AnnoFrame,
resolve_master_ids,
resolve_genetic_ids,
)
# Resolve v44.3 Master IDs to v66.0 GeneticIDs
result = resolve_master_ids(
["I0001", "Bichon", "Mota"],
src_version="v44.3",
dst_version="v66.0",
anno_paths={
"v44.3": "v44.3_1240K_public.anno",
"v66.0": "v66.0_1240K_public.anno",
},
)
# result = {"I0001": "Loschbour.AG", "Bichon": "Bichon.SG", "Mota": None}
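resolve_genetic_ids is the GID → GID counterpart; each input maps to a list because one individual can have several rows in the destination version: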
result = resolve_genetic_ids(
["I0001"],
src_version="v44.3",
dst_version="v66.0",
anno_paths={...},
)
# result = {"I0001": ["Loschbour.AG", "Loschbour.DG"]} # multi-row IID
Direct AnnoFrame access for lower-level work:
from aadr_resolve import AnnoFrame
af = AnnoFrame.from_path("v66.0_1240K_public.anno", version_label="v66.0")
af.schema_class # SchemaClass.E
af.individual_id # pd.Series of canonical IIDs
af.genetic_id # pd.Series
af.persistent_genetic_id # pd.Series of Int64 nullable (E only; all-NaN elsewhere)
af.date_calbp # pd.Series of Int64 nullable
af.coverage # pd.Series of Float64 nullable
af.path # original Path, useful for re-creating anno_paths dicts
Exception hierarchy
All errors derive from aadr_resolve.AadrResolveError. Sibling tools
that need to handle a specific failure can catch the individual
classes with except aadr_resolve.<Class>:
| Class | Maps to exit | Trigger |
|---|---|---|
| ValidationError | 1 | Turnover gate, coverage gate, substantive-regroup gate |
| IOFailure | 2 | File not found, lock held, malformed TSV |
| InvariantViolation | 3 | Schema YAML malformed (rare) |
| SchemaDetectionError | 3 | Header signature unknown |
| MissingNativeFieldError | 3 | Canonical field requested for a class that lacks it |
| CollisionDetected | 3 | Cross-lab MID collision under error policy |
| UsageError | 4 | Bad CLI args; cohort file has no matching version |
Exit codes
Stable across versions, so CI workflows can rely on them:
- 0: success
- 1: soft-validation failure (any of the gates)
- 2: I/O failure
- 3: invariant violation (schema, MID collision)
- 4: usage error (bad CLI args)
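For scripts that drive the CLI from Python rather than a shell, the exit codes can be turned into readable messages with a small lookup. This is a sketch only: `describe_exit` and `run_diff` are hypothetical helper names, and `run_diff` assumes `aadr-resolve` is installed and on the PATH.

```python
import subprocess

# Meanings copied from the exit-code list in this README.
EXIT_MEANINGS = {
    0: "success",
    1: "soft-validation failure (a gate tripped)",
    2: "I/O failure",
    3: "invariant violation (schema, MID collision)",
    4: "usage error (bad CLI args)",
}


def describe_exit(code):
    """Translate an aadr-resolve exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unrecognized exit code {code}")


def run_diff(old_anno, new_anno):
    """Run `aadr-resolve diff --json` and return (exit_code, meaning, stdout)."""
    proc = subprocess.run(
        ["aadr-resolve", "diff", old_anno, new_anno, "--json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stdout


print(describe_exit(1))
```

A CI step might treat code 1 as a reviewable warning and codes 2 to 4 as hard failures, since only code 1 comes from the configurable gates.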
Troubleshooting
"unknown .anno schema signature" — your .anno header doesn't
match any of the 5 known classes. Either the file is from a newer AADR
release (file an issue with the bench-verify diff), or the file has
been edited. Workarounds:
- --schema-override A|B|C|D|E forces a class without signature check.
- --version-label vN.N forces a version label when the filename doesn't match a known pattern.
"cross-lab MID collision" — the GID-stability check found a Master ID that maps to two different individuals in different versions. This indicates either a real data error in AADR or a cross-lab naming collision (rare). Workarounds:
- --on-mid-collision warn continues with a stderr warning and marks affected rows with library_chain_ambiguous status.
- --mid-bridge FILE lets you specify the correct mapping manually.
"sample turnover gate (fail)" — removal rate exceeded the
--turnover-fail threshold (default 30%). Indicates either a major
AADR cleanup (the v62→v66 bump removed ~17%) or that the wrong files
are being compared. Override with --turnover-fail 1.0 to disable.
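The gate logic can be pictured as a two-threshold comparison. The sketch below assumes turnover is removed samples divided by the total considered, and that a rate must strictly exceed a threshold to trip it (consistent with the quickstart, where 5.0% passes at the default warn level of 0.05). The function name and the WARN and FAIL labels are illustrative, not the tool's own output.

```python
def turnover_status(removed, total, warn=0.05, fail=0.30):
    """Classify a removal rate against warn/fail thresholds (strictly greater)."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = removed / total
    if rate > fail:
        return "FAIL"
    if rate > warn:
        return "WARN"
    return "PASS"


# 2 of 40 removed is exactly 5.0%, which does not exceed the 0.05 warn level.
print(turnover_status(2, 40))
```

Under these assumptions, passing `fail=1.0` means the fail branch can never trigger, which matches the documented way to disable the gate.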
"cohort coverage gate (fail)" — fewer than 25% of cohort entries
resolved in the supplied versions. Usually means the cohort file uses
IDs from a version not in the supplied set. Check --cohort-version.
Pandas ParserError on a v52 / v54 .anno — these versions contain
embedded quote characters in some full_date cells. aadr-resolve reads
with csv.QUOTE_NONE to side-step pandas's default quote-handling;
upgrade if you're on an older version.
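The effect of csv.QUOTE_NONE is easy to see with the standard library alone: quote characters are kept as ordinary data instead of opening a quoted field. The sample text below is a made-up placeholder, not an excerpt from a real .anno file, and the helper name is hypothetical.

```python
import csv
import io

# Placeholder rows; the first data row has a stray double quote in full_date.
RAW = (
    "genetic_id\tfull_date\n"
    'S1\t"3000-2900 calBCE\n'
    "S2\t2500 BCE\n"
)


def read_tsv_literal_quotes(text):
    """Parse tab-separated text, treating quote characters as plain data."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv_literal_quotes(RAW)
print(len(rows), rows[1])
```

The pandas equivalent is to pass `quoting=csv.QUOTE_NONE` to `pd.read_csv` along with `sep="\t"`.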
Composition with the broader ecosystem
aadr-resolve cohort patterson_2022.txt \
--anno-files v44.3.anno --anno-files v66.0.anno \
-o cohort_manifest.tsv
pgen-samplebind merge \
--relabel-from cohort_manifest.tsv \
--output merged_v66.pgen \
v44.3.pgen v66.0.pgen
The manifest's column layout is documented in HLD §Output: cohort.
Development
git clone https://github.com/carstenerickson/aadr-resolve
cd aadr-resolve
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
# Default suite (fast; ~10s)
pytest -ra
# Slow tests (synth perf benchmark)
pytest -m slow -ra
# External tests (real AADR files; requires AADR_CACHE env var)
AADR_CACHE=/path/to/cache pytest -m external -ra
# Standalone perf benchmark with per-phase timings
AADR_CACHE=/path/to/cache python -m benchmarks.perf_bench
# Lint + format + types
ruff check src/ tests/
ruff format --check src/ tests/
mypy src/
CI runs the default suite across Python 3.11/3.12/3.13 × Ubuntu+macOS;
see .github/workflows/ci.yml.
License
MIT.