aadr-resolve

AADR cross-version GeneticID / MasterID join utility for ancient-DNA and population-genetics workflows.

aadr-resolve reads AADR (Allen Ancient DNA Resource) .anno files across one or more releases and resolves the cross-version sample-ID join through the Master ID column — the part every ancient-DNA pipeline currently re-implements with custom awk. It handles AADR's progressive de-anonymization (I0001 in v44.3 → Loschbour.AG in v66) and the periodic Master-ID renames (9-18 per consecutive version pair; ~62 cumulative v44.3 → v66.0) automatically.
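
The join through the Master ID column can be pictured with a small stdlib-only sketch. This is not the package's implementation: the row tuples, the `BRIDGE` dict, and the `join_on_master_id` helper are hypothetical names used only for illustration, and real .anno files carry many more columns.

```python
# Toy rows standing in for two .anno releases: (genetic_id, master_id).
# The values are illustrative, not read from real AADR files.
V44 = [("I0001", "I0001"), ("Bichon", "Bichon")]
V66 = [
    ("Loschbour.AG", "Loschbour"),
    ("Loschbour.DG", "Loschbour"),
    ("Bichon.SG", "Bichon"),
]

# Hypothetical rename bridge: old Master ID -> new Master ID.
BRIDGE = {"I0001": "Loschbour"}


def join_on_master_id(old_rows, new_rows, bridge):
    """Map each old genetic_id to the new genetic_ids sharing its Master ID."""
    by_master = {}
    for gid, mid in new_rows:
        by_master.setdefault(mid, []).append(gid)

    joined = {}
    for gid, mid in old_rows:
        canonical = bridge.get(mid, mid)  # follow a rename if one is known
        joined[gid] = by_master.get(canonical, [])
    return joined


result = join_on_master_id(V44, V66, BRIDGE)
print(result)
```

A sample with no counterpart in the newer release simply maps to an empty list in this sketch.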

The high-level design (HLD) pins behavior and the low-level design (LLD) pins implementation; both live in the companion wiki:

  • HLD: cs-wiki/projects/aadr-resolve.md
  • LLD: cs-wiki/projects/aadr-resolve-lld.md
  • Bench-verify report: cs-wiki/projects/aadr-resolve-bench-verify.md

Install

pip install aadr-resolve

Requires Python 3.11+. Dependencies: pandas 2.x, click 8.x, PyYAML 6.x.

Quickstart

Resolve a single sample across two AADR releases.

aadr-resolve lookup I0001 \
    --anno-files v44.3_1240K_public.anno \
    --anno-files v66.0_1240K_public.anno

Output (stdout):

query: I0001
canonical individual_id: Loschbour    (matched via individual_id)
v44.3 rows: 1
  I0001  Luxembourg_Loschbour  537,182 SNPs
v66.0 rows: 2
  Loschbour.AG  Luxembourg_Mesolithic.AG  155,036 SNPs  pgid=33
  Loschbour.DG  Luxembourg_Mesolithic.DG  620,881 SNPs  pgid=39136
master_id_bridge: v44.3 I0001 → v66.0 Loschbour (via shared GID Loschbour.DG)
status: present_in_2_of_2_versions; multi_row; individual_id_renamed

Recreate a cohort against a newer release.

aadr-resolve cohort patterson_2022_whga.txt \
    --anno-files v44.3_1240K_public.anno \
    --anno-files v66.0_1240K_public.anno \
    --cohort-version v44.3 \
    -o whga_v66_manifest.tsv

The manifest is a TSV with one row per (individual × library), with per-version genetic_id / group_id / snps_hit_1240k columns, ready to feed into downstream relabeling tools like pgen-samplebind.

Structured diff between two releases.

aadr-resolve diff v62.0.anno v66.0.anno --tsv > v62_to_v66_changes.tsv

Emits one row per change event: added, removed, genetic_id_renamed, master_id_renamed, group_changed (with a per-class label — convention_restructure_suffix etc.).
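
As a rough illustration of how such change events can be derived, the sketch below compares two toy snapshots keyed by a stable identifier. It is a simplified assumption, not the tool's algorithm: `diff_events` and the sample data are hypothetical, and `master_id_renamed` is left out because detecting it needs the rename bridge.

```python
def diff_events(old, new):
    """Compare two snapshots mapping a stable key to (genetic_id, group_id).

    Returns a list of (key, event) tuples in a deterministic order.
    """
    events = []
    for key in sorted(old.keys() - new.keys()):
        events.append((key, "removed"))
    for key in sorted(new.keys() - old.keys()):
        events.append((key, "added"))
    for key in sorted(old.keys() & new.keys()):
        old_gid, old_group = old[key]
        new_gid, new_group = new[key]
        if old_gid != new_gid:
            events.append((key, "genetic_id_renamed"))
        if old_group != new_group:
            events.append((key, "group_changed"))
    return events


# Illustrative snapshots; the IDs and group labels are made up.
old = {"A": ("A.SG", "G1"), "B": ("B", "G2"), "C": ("C.AG", "G3")}
new = {"A": ("A.DG", "G1"), "C": ("C.AG", "G3_new"), "D": ("D.AG", "G4")}

events = diff_events(old, new)
for key, event in events:
    print(f"{key}\t{event}")
```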

Subcommands

| Command | Purpose |
|---------|---------|
| lookup | Resolve a single sample across N versions |
| cohort | Emit a cross-version manifest for a user-supplied cohort |
| diff | Structured diff between two versions |
| join | Wide-format pairwise table over the full intersection |
| schema | Diagnostic: report the detected schema class |

aadr-resolve lookup

aadr-resolve lookup INDIVIDUAL_OR_GENETIC_ID \
    --anno-files PATH [--anno-files PATH ...]
    [--json]

The argument is treated as an individual_id by default; if no individual_id (IID) matches, lookup falls back to genetic_id. The Master ID (MID) rename bridge is built automatically from the supplied versions and reported under master_id_bridge in the output.
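
The fallback order can be sketched as follows. This is a minimal illustration under stated assumptions: the `lookup` function and the row dicts are hypothetical stand-ins, not the package's internals.

```python
def lookup(query, rows):
    """Try individual_id first, then fall back to genetic_id.

    rows is a list of dicts with 'individual_id' and 'genetic_id' keys.
    Returns (matched_field, matching_rows); matched_field is None on a miss.
    """
    by_iid = [r for r in rows if r["individual_id"] == query]
    if by_iid:
        return "individual_id", by_iid
    by_gid = [r for r in rows if r["genetic_id"] == query]
    if by_gid:
        return "genetic_id", by_gid
    return None, []


# Illustrative rows modeled on the quickstart output.
rows = [
    {"individual_id": "Loschbour", "genetic_id": "Loschbour.AG"},
    {"individual_id": "Loschbour", "genetic_id": "Loschbour.DG"},
]

print(lookup("Loschbour", rows)[0])
print(lookup("Loschbour.DG", rows)[0])
```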

aadr-resolve cohort

aadr-resolve cohort COHORT_FILE \
    --anno-files PATH [--anno-files PATH ...]
    [--cohort-version LABEL]
    -o OUT.tsv [--json]
    [--no-propagate]
    [--collapse-to-individual]
    [--gid-preference AG,DG,SG,HO,TW,BY,AA,EC,WGC,bare]
    [--turnover-warn 0.05] [--turnover-fail 0.30]
    [--cohort-coverage-warn 0.50] [--cohort-coverage-fail 0.25]

COHORT_FILE is a TSV: one column for individual_id, optional second column for cohort_label. --cohort-version is auto-detected from the supplied annos when omitted. Default output is row-per-(individual × library); --collapse-to-individual reduces to one row per individual via the --gid-preference suffix priority.
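
One way to picture the suffix-priority collapse is the sketch below. It assumes the suffix is the text after the last dot of the genetic_id and that an ID without a recognized suffix counts as `bare`; both are assumptions for illustration, and `suffix_of` and `collapse` are hypothetical helpers rather than the tool's own functions.

```python
# Default priority copied from the --gid-preference synopsis above.
PREFERENCE = ["AG", "DG", "SG", "HO", "TW", "BY", "AA", "EC", "WGC", "bare"]


def suffix_of(genetic_id):
    """Return the library suffix, or 'bare' when none is recognized (assumed rule)."""
    _, dot, tail = genetic_id.rpartition(".")
    return tail if dot and tail in PREFERENCE else "bare"


def collapse(genetic_ids, preference=PREFERENCE):
    """Pick the single genetic_id whose suffix ranks first in the preference list."""
    rank = {suffix: i for i, suffix in enumerate(preference)}
    # Suffixes missing from a custom preference list sort after every listed one.
    return min(genetic_ids, key=lambda g: rank.get(suffix_of(g), len(rank)))


print(collapse(["Loschbour.DG", "Loschbour.AG"]))
print(collapse(["Loschbour.DG", "Loschbour.AG"], preference=["DG", "AG"]))
```

Because `min` keeps the first of equally ranked candidates, ties resolve to input order in this sketch.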

aadr-resolve diff

aadr-resolve diff V_OLD.anno V_NEW.anno
    [--json | --tsv]
    [-o OUT]
    [--include-class CLASS [--include-class CLASS ...]]
    [--all-events]
    [--turnover-warn 0.05] [--turnover-fail 0.30]
    [--substantive-regroup-fail INT]

JSON output is summary-first: per-class counts always included; per-event arrays only for substantive_regroup (always) and any class named via --include-class, or all classes when --all-events is set. --tsv switches to streamed one-row-per-event format.

aadr-resolve join

aadr-resolve join V_OLD.anno V_NEW.anno
    -o OUT.tsv [--json]
    [--collapse-to-individual]
    [--gid-preference AG,DG,SG,HO,TW,BY,AA,EC,WGC,bare]

Wide-format pairwise table over the full v_old ∪ v_new canonical individual_id set. Same output schema as cohort; useful when you don't have a pre-existing cohort list.

aadr-resolve schema

aadr-resolve schema PATH [--json]

Diagnostic: detects which schema class (A–E) the .anno belongs to and reports the column layout. Useful for debugging why an .anno file does not load.

Shared options

These apply to all subcommands:

| Option | Default | Notes |
|--------|---------|-------|
| --schema-override CLASS | auto | Force schema class A/B/C/D/E (e.g., renamed .anno) |
| --version-label LABEL | auto | Force version label (when filename pattern doesn't match) |
| --mid-bridge FILE | none | Manual master_id-rename TSV layered on auto-detected bridge |
| --on-mid-collision {error,warn} | error | Cross-lab MID collision policy |
| --quiet | false | Suppress the "Wrote N rows" progress line |

Library API

The same functionality is available in-process:

from aadr_resolve import (
    AnnoFrame,
    resolve_master_ids,
    resolve_genetic_ids,
)

# Resolve v44.3 Master IDs to v66.0 GeneticIDs
result = resolve_master_ids(
    ["I0001", "Bichon", "Mota"],
    src_version="v44.3",
    dst_version="v66.0",
    anno_paths={
        "v44.3": "v44.3_1240K_public.anno",
        "v66.0": "v66.0_1240K_public.anno",
    },
)
# result = {"I0001": "Loschbour.AG", "Bichon": "Bichon.SG", "Mota": None}

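resolve_genetic_ids does the GID → GID counterpart and returns a list per input, because one individual can have several rows in the destination version: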

result = resolve_genetic_ids(
    ["I0001"],
    src_version="v44.3",
    dst_version="v66.0",
    anno_paths={...},
)
# result = {"I0001": ["Loschbour.AG", "Loschbour.DG"]}  # multi-row IID

Direct AnnoFrame access for lower-level work:

from aadr_resolve import AnnoFrame

af = AnnoFrame.from_path("v66.0_1240K_public.anno", version_label="v66.0")
af.schema_class       # SchemaClass.E
af.individual_id      # pd.Series of canonical IIDs
af.genetic_id         # pd.Series
af.persistent_genetic_id  # pd.Series of Int64 nullable (E only; all-NaN elsewhere)
af.date_calbp         # pd.Series of Int64 nullable
af.coverage           # pd.Series of Float64 nullable
af.path               # original Path, useful for re-creating anno_paths dicts

Exception hierarchy

All errors derive from aadr_resolve.AadrResolveError. Sibling tools can catch a specific error with except aadr_resolve.<Class>:

| Class | Maps to exit | Trigger |
|-------|--------------|---------|
| ValidationError | 1 | Turnover gate, coverage gate, substantive-regroup gate |
| IOFailure | 2 | File not found, lock held, malformed TSV |
| InvariantViolation | 3 | Schema YAML malformed (rare) |
| SchemaDetectionError | 3 | Header signature unknown |
| MissingNativeFieldError | 3 | Canonical field requested for a class that lacks it |
| CollisionDetected | 3 | Cross-lab MID collision under error policy |
| UsageError | 4 | Bad CLI args; cohort file has no matching version |

Exit codes

Exit codes are stable across versions, so CI workflows can branch on them:

  • 0 — success
  • 1 — soft-validation failure (any of the gates)
  • 2 — I/O failure
  • 3 — invariant violation (schema, MID collision)
  • 4 — usage error (bad CLI args)
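
A CI wrapper might branch on these codes roughly as sketched below. The code meanings come from the list above; the `should_fail_build` policy, its `tolerate_soft_validation` switch, and the `run_diff` helper are hypothetical choices for illustration only.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "soft-validation failure",
    2: "I/O failure",
    3: "invariant violation",
    4: "usage error",
}


def describe_exit(code):
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, "unknown exit code")


def should_fail_build(code, tolerate_soft_validation=False):
    """Hypothetical CI policy: optionally let gate failures (exit 1) pass."""
    if code == 0:
        return False
    if code == 1 and tolerate_soft_validation:
        return False
    return True


def run_diff(old_anno, new_anno):
    """Run `aadr-resolve diff` and return (exit_code, meaning)."""
    proc = subprocess.run(["aadr-resolve", "diff", old_anno, new_anno])
    return proc.returncode, describe_exit(proc.returncode)


print(describe_exit(3))
```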

Troubleshooting

"unknown .anno schema signature" — your .anno header doesn't match any of the 5 known classes. Either the file is from a newer AADR release (file an issue with the bench-verify diff), or the file has been edited. Workarounds:

  • --schema-override A|B|C|D|E forces a class without signature check.
  • --version-label vN.N forces a version label when the filename doesn't match a known pattern.

"cross-lab MID collision" — the GID-stability check found a Master ID that maps to two different individuals in different versions. This indicates either a real data error in AADR or a cross-lab naming collision (rare). Workarounds:

  • --on-mid-collision warn continues with a stderr warning and marks affected rows with library_chain_ambiguous status.
  • --mid-bridge FILE lets you specify the correct mapping manually.

"sample turnover gate (fail)" — removal rate exceeded the --turnover-fail threshold (default 30%). Indicates either a major AADR cleanup (the v62→v66 bump removed ~17%) or that the wrong files are being compared. Override with --turnover-fail 1.0 to disable.
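
The gate logic can be approximated with the sketch below. It assumes the removal rate is the share of old-release samples absent from the new release and that a threshold trips only when strictly exceeded; the tool's exact definition may differ, and `turnover_gate` is a hypothetical helper.

```python
def turnover_gate(old_ids, new_ids, warn=0.05, fail=0.30):
    """Return (removal_rate, status) where status is 'ok', 'warn' or 'fail'."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    if not old_ids:
        return 0.0, "ok"
    rate = len(old_ids - new_ids) / len(old_ids)
    if rate > fail:
        return rate, "fail"
    if rate > warn:
        return rate, "warn"
    return rate, "ok"


# 100 synthetic samples, 17 of which disappear in the newer release.
old = [f"s{i}" for i in range(100)]
new = [f"s{i}" for i in range(17, 100)] + ["extra1", "extra2"]

rate, status = turnover_gate(old, new)
print(f"{rate:.2f} {status}")
```

Under these assumptions a rate can never exceed 1.0, which is why setting the fail threshold to 1.0 effectively disables the gate.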

"cohort coverage gate (fail)" — fewer than 25% of cohort entries resolved in the supplied versions. Usually means the cohort file uses IDs from a version not in the supplied set. Check --cohort-version.

Pandas ParserError on a v52 / v54 .anno — these versions contain embedded quote characters in some full_date cells. aadr-resolve reads with csv.QUOTE_NONE to side-step pandas's default quote handling; if you still see this error, upgrade aadr-resolve to a release that includes this behavior.
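
The effect of disabling quote handling can be seen with the standard-library csv module, used here instead of pandas to keep the sketch dependency-free. The sample text is made up, and it assumes a stray quote at the start of a cell, which may not match where the quotes sit in the real files.

```python
import csv
import io

# Made-up two-sample table with an unbalanced quote in a full_date cell.
raw = 'genetic_id\tfull_date\nS1\t"approx 3000 BCE\nS2\t2500 BCE\n'

# Default quoting treats the stray quote as the start of a quoted field,
# so the parser keeps reading past the end of the line.
default_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE treats the quote as an ordinary character.
literal_rows = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), len(literal_rows))
print(literal_rows[1])
```

With `csv.QUOTE_NONE` each line stays one record and the quote survives as literal text in the cell.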

Composition with the broader ecosystem

aadr-resolve cohort patterson_2022.txt \
    --anno-files v44.3.anno --anno-files v66.0.anno \
    -o cohort_manifest.tsv
pgen-samplebind merge \
    --relabel-from cohort_manifest.tsv \
    --output merged_v66.pgen \
    v44.3.pgen v66.0.pgen

The manifest's column layout is documented in HLD §Output: cohort.

Development

git clone https://github.com/carstenerickson/aadr-resolve
cd aadr-resolve
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Default suite (fast; ~10s)
pytest -ra

# Slow tests (synth perf benchmark)
pytest -m slow -ra

# External tests (real AADR files; requires AADR_CACHE env var)
AADR_CACHE=/path/to/cache pytest -m external -ra

# Standalone perf benchmark with per-phase timings
AADR_CACHE=/path/to/cache python -m benchmarks.perf_bench

# Lint + format + types
ruff check src/ tests/
ruff format --check src/ tests/
mypy src/

CI runs the default suite across Python 3.11/3.12/3.13 × Ubuntu+macOS; see .github/workflows/ci.yml.

License

MIT.
