Skip to main content

Bind PFILE/BFILE/EIGENSTRAT genotype datasets sharing a variant set; the missing plink2 --pmerge case for ancient-DNA / population-genetics workflows.

Project description

pgen-samplebind

A focused CLI for the one thing plink2 --pmerge doesn't yet do: bind two or more PFILE/BFILE/EIGENSTRAT genotype datasets that share a variant set but contain different samples. Targeted at the ancient-DNA / population-genetics community working with the 1240k SNP capture panel and downstream AdmixTools 2 / qpAdm pipelines.

pgen-samplebind merge panel_a panel_b -o merged

That's it for the happy path. Variant alignment, strand canonicalization, ambiguous-strand policy, missing-call fill, IID collision handling, pseudohaploid detection, population-label preservation, and report emission all run with one command.

Why this exists

plink2 --pmerge errors with "under development" for the non-concatenating case (same variants, different samples). Today's workarounds — plink1.9 --bmerge, EIGENSOFT's mergeit, or VCF round-trips through bcftools merge — each carry friction (deprecated tool, format-conversion overhead, single-thread C engine, multi-format hops). The AdmixTools / ancient-DNA community spends nontrivial time on this every panel build.

pgen-samplebind is a stopgap until plink2 lands the feature natively; it's deliberately scoped to the bind operation only — not a plink2 replacement.

Install

pip install pgen-samplebind

Requires Python 3.11 or 3.12. For EIGENSTRAT or BFILE inputs, also install plink2 v2.0.0-a.7.1 or newer and put it on PATH:

# Linux / macOS — direct binary from the upstream release
curl -fsSL https://github.com/chrchang/plink-ng/releases/download/v2.0.0-a.7.1/plink2_linux_x86_64.zip -o /tmp/plink2.zip
unzip /tmp/plink2.zip -d ~/bin
plink2 --version  # confirm v2.0.0-a.7.1 (or newer) on PATH

The v2.0.0-a.7.x line introduced the --eigfile/--export eig support pgen-samplebind shells out to; older versions including v2.00a5.x (still shipped by Homebrew at the time of this writing) will silently lack EIGENSTRAT support. PFILE-only workflows have no plink2 dependency.

Canonical use cases

Four workflows the tool is shaped around. Each is one command.

1. Panel extension

Existing AADR-derived panel (~700 samples × ~1.15M 1240k variants), add new populations from a newer AADR release. Variants identical, alleles canonical (same source).

pgen-samplebind merge \
    /data/aadr_v66_phase6 \
    /data/aadr_v66_new_pops \
    -o /data/aadr_v66_phase7 \
    --report phase7_bind_report.tsv

2. Single-sample target append

A user's WGS pseudohaploid genotypes (from pileupCaller --randomDiploid) merged into an AADR-derived panel for downstream qpAdm. The user is missing some panel sites (low-coverage regions) — the tool emits missing-call codes for them, doesn't drop the variants. --target activates the asymmetric strand-check + per-sample call-rate gate (default 0.40).

pgen-samplebind merge \
    --target /data/carsten_pileupcaller \
    /data/aadr_v66_subset \
    -o /data/aadr_with_carsten

3. Cross-source merge (different formats, different strand conventions)

Two cohort releases from different labs — one in EIGENSTRAT, one in PFILE — both built on 1240k. Strand canonicalization may differ; ambiguous A/T and C/G SNPs may need explicit policy.

pgen-samplebind merge \
    /data/cohort_a.geno_prefix \
    /data/cohort_b \
    -o /data/cohort_ab \
    --on-strand flip \
    --validate-strand-fail-pct 10

4. AADR cross-version cohort assembly

Recreating a published cohort using AADR v66 genotypes (more SNPs) requires mapping v44.3 sample IDs to v66 sample IDs through the Master-ID join — AADR de-anonymized many samples between releases (I0001 became Loschbour.AG, etc.). Sample-bind plus identity remapping, one step.

pgen-samplebind merge \
    /data/aadr_v44_3_panel \
    /data/aadr_v66_extras \
    --id-column 'Master ID' \
    --relabel-from /data/aadr_v66.anno \
    --relabel-input-col 'Master ID' \
    --relabel-output-col 'Group ID' \
    -o /data/cross_version_panel

Subcommands

pgen-samplebind merge     INPUT [INPUT ...] -o OUTPUT [options]
pgen-samplebind validate  INPUT [INPUT ...]                [options]
pgen-samplebind hash      INPUT
pgen-samplebind inspect   INPUT

Inputs are PFILE/BFILE/EIGENSTRAT prefixes; format auto-detected from companion files (.pgen+.pvar+.psam / .bed+.bim+.fam / .geno+.snp+.ind).

merge — bind inputs into one output PFILE

Option Default Purpose
-o, --out PREFIX required Output PFILE prefix
--target PATH none Single-sample / small-cohort mode; activates asymmetric strand-check + call-rate gate. Appended as last input; canonical remains input[0]
--variant-key {chr_pos|id} chr_pos Match key
--on-mismatch {drop|error} drop Allele mismatch beyond resolution
--on-missing {fill_missing|drop_variant|error} fill_missing Variant in input[0] absent in input N
--on-extra {warn|drop|error} warn Variant in input N absent from input[0]
--on-strand {drop|flip|error} flip Strand mismatch on unambiguous SNPs
--trust-strand off Pass A/T and C/G ambiguous SNPs without flagging (footgun for cross-source panels)
--on-collision {error|first|suffix} error IID collision policy. suffix uses _<input_idx> (general) / _target (target mode) with idempotent numeric retry
--id-column NAME IID .psam column for identity ops (e.g., 'Master ID' for AADR anno)
--population-column NAME auto Holds population labels (default: POP / PHENO / PHENO1 fallback)
--target-min-call-rate FLOAT 0.40 Target-mode per-sample minimum call rate before exit-1
--validate-strand-fail-pct N 10.0 Exit 1 if ambiguous-strand drops exceed N% of intersection
--relabel-from FILE none TSV-driven POP relabel. 2-col header-less form maps POP→POP; N-col form (with --relabel-input-col / --relabel-output-col) joins per-sample on --id-column
--relabel-input-col NAME none For N-col TSVs: which column matches --id-column
--relabel-output-col NAME none For N-col TSVs: which column becomes the new POP value
--report PATH none Per-variant action TSV (streamed; constant memory)
--report-json PATH none Run-level summary JSON (~few KB; rows excluded by default)
--report-json-include-rows off Include per-variant rows in JSON (buffered; warns at >100 MB predicted size)
--quiet off Suppress progress to stdout
--threads N 1 Parallel input readers
--block-size N 2048 Variants per pgenlib read block

validate — check alignment without writing

Same alignment / strand options as merge. No output written; reports go to stdout plus optional --report / --report-json. Exits 0 if alignment OK, 1 if any of the gates below fires.

hash — emit canonical variant-set hash

pgen-samplebind hash /data/aadr_v66_subset
# 7c4f8e...  PFILE
pgen-samplebind hash /data/aadr_v66_subset.geno_prefix
# 7c4f8e...  EIGENSTRAT  ← same hash → format-equivalent panels

--emit-canonical prints the canonicalized bytestream that's hashed (for diagnosis when two inputs should match but don't).

inspect — structured summary of one input

Format, sample count, variant count, populations, pseudohaploid mix, sex distribution, missingness histogram. TSV by default; --json for machine-readable.

Validation gates (exit 1)

Both merge and validate apply the same soft-validation gates and exit 1 if any fires:

  • (a) Extras above warn threshold. Variants in input[N] absent from input[0] exceed the --on-extra warn threshold (10% of input[0] by default). Catches the input-order-reversed failure mode (smaller panel placed first; larger panel's distinct variants silently dropped).
  • (b) Ambiguous-strand drops above intersection threshold. Drops due to A/T or C/G allele ambiguity exceed --validate-strand-fail-pct of the alignment intersection (10% by default). Intersection denominator is deliberate: catches the wrong-panel failure mode (tiny intersection × 30% drop rate) that a canonical denominator would silently hide.
  • (c) Target call rate below threshold. In --target mode, target's non-missing call count over canonical variants is below --target-min-call-rate.
  • (d) --on-* error policy would have triggered (validate mode only). In merge mode those policies still exit 3 — explicit invariant enforcement is honored.

For merge, gates (a)-(c) run between pass 1 (alignment) and pass 2 (genotype streaming) — failing fast saves the pass-2 wallclock for an alignment that wouldn't have validated.

Exit codes

Stable across versions; safe to script against.

Code Meaning
0 Success
1 Validation failure — gate (a)/(b)/(c)/(d) fired
2 I/O failure — read/write failure, plink2 subprocess failed, advisory output-prefix lock held
3 Invariant violation — multi-allelic input, duplicate canonical variant keys, --on-* error policy triggered, --on-collision error on duplicate IIDs
4 Usage error — bad CLI argument combination, unknown input prefix

Concurrency

  • Same input prefixes: safe, read-only.
  • Same output prefix: tool advisory-locks {out}.lock via fcntl.flock; fails clearly with exit 2 if held; cleans up on success or signal.
  • Cache directory: not managed by this tool — caller's problem.

The lock prevents two concurrent invocations from corrupting each other's output. It does not synchronize across machines (file lock is local-fs only). On NFS/SMB/CIFS the tool emits a stderr warning at acquire time since fcntl.flock semantics over network filesystems are implementation-defined and effectively no-op.

Performance

Expected throughput ~25-30 M genotypes/sec end-to-end on a typical Linux x86_64 machine (one core, default --block-size 2048). For a 700-sample × 1.15M-variant 1240k panel, plan on roughly 30-60 seconds wallclock including pass-1 alignment, pass-2 genotype rewrite, and .psam finalization.

--threads N parallelizes per-input readers; useful when binding 4+ inputs. Beyond N=4 the gains taper since pass 2 is single-writer-bound.

Troubleshooting

"plink2 not found, required for EIGENSTRAT input"

Install plink2 v2.0.0-a.7.1+ (see Install). PFILE-only workflows have no plink2 dependency.

"FID column header on line 1 is not at the beginning"

Symptom of an older plink2 reading our .psam output. We emit FID-first column order per the plink2 spec; if you see this from a downstream tool, verify the consumer is plink2 v2.0.0-a.7.x or newer. Earlier plink2 lines require --no-fid or an .fam-style fallback.

--target mode call-rate gate fires unexpectedly

Default threshold is 0.40 (40% of canonical variants called for the target sample). For very-low-coverage targets this may be too strict. Pass --target-min-call-rate 0.20 (or lower) if appropriate; with 0.0 the gate is fully disabled.

"ASCII per-line EIGENSTRAT" inputs

Both flavors of EIGENSTRAT input are supported: PACKEDANCESTRYMAP (binary, GENO /TGENO header — converted via plink2 --eigfile) and the older ASCII per-line variant (one digit per sample-variant cell — parsed natively, no plink2 dependency for the ASCII path). Format is auto-detected from the .geno header bytes.

Cross-source merges drop more variants than expected

By default A/T and C/G ambiguous SNPs are dropped wherever strand cannot be verified — including the case where both inputs have the same allele pair (e.g., both have A/T at the same position). The strand cannot be proven the same because complementing A/T gives T/A which is the same pair. This matches mergeit's strandcheck: YES convention and is the safe default for cross-source merges (different cohorts, different processing pipelines).

For single-source merges where REF/ALT calls are guaranteed consistent across inputs (same AADR release, same conversion pipeline), pass --trust-strand to passthrough matching ambiguous variants. The 1240k panel has ~1.3% A/T+C/G matching ambiguous SNPs so this typically recovers 10-20K additional variants on a full 1.15M-variant panel.

"ambiguous-strand drops exceed 10% of intersection"

Gate (b) fired. Three common causes:

  1. Wrong panel pairing: tiny intersection × normal drop rate computes as a high fraction of intersection. Run pgen-samplebind hash on each input — if the hashes differ unexpectedly, the panels aren't what you thought they were.
  2. Different strand conventions: cross-source merges where one source already strand-flipped at A/T+C/G sites.
  3. Legitimate but high A/T+C/G fraction: 1240k naturally has ~5-8% ambiguous; a panel restricted to ambiguous-only sites would push above 10%. Raise the threshold with --validate-strand-fail-pct 20 if that's your case, or use --trust-strand for explicit pass-through (footgun warning: silently passes potentially-flipped genotypes).

Half-built output files after a failure

Don't happen. The merge orchestrator wraps pass 2 + .psam finalization in a try/except that unlinks the .pgen/.pvar/.psam triplet on any tool-internal exception, so downstream pipelines never silently consume a partial output. (Atomic-rename across the triplet isn't actually atomic on NFS — three separate file ops — so unlink-on-failure is the safer default.)

Output prefix is locked by another pgen-samplebind process

Another invocation holds the lock at {prefix}.lock. If that process has died (kill -9, OOM, lost terminal), the lock file persists and you must remove it manually:

ls -la /data/merged.lock     # check lock-file age vs. expected runtime
rm /data/merged.lock         # only after confirming no live process holds it

Status

v0.1.0 — initial release. End-to-end byte-equal qpAdm parity proven against the established mergeit + plink2 + awk pipeline on the Track E Phase 7 panel build (Reich-Lab-style ancient-DNA workflow with EIGENSTRAT panel + brit_subset + single-sample target append; md5-identical proximal qpAdm shootout output).

See CHANGELOG.md for the full feature list and known limitations.

Contributing

Issues and pull requests welcome at https://github.com/carstenerickson/pgen-samplebind/issues. The project is small and opinionated; substantive scope changes (e.g., multi-allelic merges, dosage data, BFILE output) should start with a design discussion before implementation.

License

MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pgen_samplebind-0.1.0.tar.gz (56.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pgen_samplebind-0.1.0-py3-none-any.whl (59.0 kB view details)

Uploaded Python 3

File details

Details for the file pgen_samplebind-0.1.0.tar.gz.

File metadata

  • Download URL: pgen_samplebind-0.1.0.tar.gz
  • Upload date:
  • Size: 56.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pgen_samplebind-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e30088e52c73b19a800255e6ffbd646c3105ddf114adacb68f81e101b49b77d2
MD5 86c10dddd473a363f2d7c22771640771
BLAKE2b-256 ad0d8843283d5af12975d0f08a7b4f68fcf58256de812a290be44c8ce9042548

See more details on using hashes here.

Provenance

The following attestation bundles were made for pgen_samplebind-0.1.0.tar.gz:

Publisher: release.yml on carstenerickson/pgen-samplebind

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pgen_samplebind-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pgen_samplebind-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 59.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pgen_samplebind-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ac73e4a84ded5670156500f97b8d45d76f2693fb6a8457ac2f82d59116e7fa91
MD5 db7db374e135a6774196efd31704dc1a
BLAKE2b-256 118e3ef1788b2ec752a4e5d90fea0a799af9c03bd85717d3f0b4d31882b6161d

See more details on using hashes here.

Provenance

The following attestation bundles were made for pgen_samplebind-0.1.0-py3-none-any.whl:

Publisher: release.yml on carstenerickson/pgen-samplebind

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page