Bind PFILE/BFILE/EIGENSTRAT genotype datasets sharing a variant set; the missing plink2 --pmerge case for ancient-DNA / population-genetics workflows.

Maintainers

carstenerickson

pgen-samplebind

A focused CLI for the one thing plink2 --pmerge doesn't yet do: bind two or more PFILE/BFILE/EIGENSTRAT genotype datasets that share a variant set but contain different samples. Targeted at the ancient-DNA / population-genetics community working with the 1240k SNP capture panel and downstream AdmixTools 2 / qpAdm pipelines.

pgen-samplebind merge panel_a panel_b -o merged

That's it for the happy path. Variant alignment, strand canonicalization, ambiguous-strand policy, missing-call fill, IID collision handling, pseudohaploid detection, population-label preservation, and report emission all run with one command.

Why this exists

plink2 --pmerge errors with "under development" for the non-concatenating case (same variants, different samples). Today's workarounds — plink1.9 --bmerge, EIGENSOFT's mergeit, or VCF round-trips through bcftools merge — each carry friction (deprecated tool, format-conversion overhead, single-thread C engine, multi-format hops). The AdmixTools / ancient-DNA community spends nontrivial time on this every panel build.

pgen-samplebind is a stopgap until plink2 lands the feature natively; it's deliberately scoped to the bind operation only — not a plink2 replacement.

Install

pip install pgen-samplebind

Requires Python 3.11 through 3.14. CI tests all four versions on both Ubuntu and macOS. For EIGENSTRAT or BFILE inputs, also install plink2 v2.0.0-a.7.1 or newer and put it on PATH. Pick the right asset for your platform:

# Linux x86_64
ASSET=plink2_linux_x86_64.zip
# macOS Apple Silicon
# ASSET=plink2_mac_arm64.zip
# macOS Intel
# ASSET=plink2_mac.zip

curl -fsSL "https://github.com/chrchang/plink-ng/releases/download/v2.0.0-a.7.1/${ASSET}" -o /tmp/plink2.zip
unzip /tmp/plink2.zip -d ~/bin
export PATH="$HOME/bin:$PATH"  # skip if ~/bin is already on PATH
plink2 --version  # confirm v2.0.0-a.7.1 (or newer) on PATH

The v2.0.0-a.7.x line introduced the --eigfile --make-pgen path that pgen-samplebind shells out to for EIGENSTRAT inputs; older versions, including v2.00a5.x (still shipped by Homebrew at the time of writing), lack EIGENSTRAT support and give no obvious indication of that. PFILE-only workflows have no plink2 dependency.

Canonical use cases

Four workflows the tool is shaped around. Each is one command.

1. Panel extension

Start from an existing AADR-derived panel (~700 samples × ~1.15M 1240k variants) and add new populations from a newer AADR release. The variant sets are identical and the alleles are already canonical because both inputs come from the same source.

pgen-samplebind merge \
    /data/aadr_v66_phase6 \
    /data/aadr_v66_new_pops \
    -o /data/aadr_v66_phase7 \
    --report phase7_bind_report.tsv

2. Single-sample target append

A user's WGS pseudohaploid genotypes (from pileupCaller --randomHaploid) merged into an AADR-derived panel for downstream qpAdm. The target lacks calls at some panel sites (low-coverage regions); the tool emits missing-call codes for those sites rather than dropping the variants. --target activates the asymmetric strand check and the per-sample call-rate gate (default 0.40).

pgen-samplebind merge \
    --target /data/carsten_pileupcaller \
    /data/aadr_v66_subset \
    -o /data/aadr_with_carsten

3. Cross-source merge (different formats, different strand conventions)

Two cohort releases from different labs — one in EIGENSTRAT, one in PFILE — both built on 1240k. Strand canonicalization may differ; ambiguous A/T and C/G SNPs may need explicit policy.

# cohort_a is EIGENSTRAT (cohort_a.geno + .snp + .ind); cohort_b is PFILE
# (cohort_b.pgen + .pvar + .psam). Format is auto-detected from the prefix.
pgen-samplebind merge \
    /data/cohort_a \
    /data/cohort_b \
    -o /data/cohort_ab \
    --on-strand flip \
    --validate-strand-fail-pct 10

4. AADR cross-version cohort assembly

Recreating a published cohort using AADR v66 genotypes (more SNPs) requires mapping v44.3 sample IDs to v66 sample IDs through the Master-ID join — AADR de-anonymized many samples between releases (I0001 became Loschbour.AG, etc.). Sample-bind plus identity remapping, one step.

pgen-samplebind merge \
    /data/aadr_v44_3_panel \
    /data/aadr_v66_extras \
    --id-column 'Master ID' \
    --relabel-from /data/aadr_v66.anno \
    --relabel-input-col 'Master ID' \
    --relabel-output-col 'Group ID' \
    -o /data/cross_version_panel

Subcommands

pgen-samplebind merge    INPUT [INPUT ...] -o OUTPUT  [options]
pgen-samplebind validate INPUT [INPUT ...]            [options]
pgen-samplebind afs      INPUT             -o OUTDIR  [options]
pgen-samplebind hash     INPUT                        [options]
pgen-samplebind inspect  INPUT                        [options]

Inputs are PFILE/BFILE/EIGENSTRAT prefixes; format auto-detected from companion files (.pgen+.pvar+.psam / .bed+.bim+.fam / .geno+.snp+.ind). The variant companion may be either .pvar or .pvar.zst (plink2 v2.0.0-a.6+ default; HuggingFace / Dataverse panels typically arrive zstd-compressed).

`merge` — bind inputs into one output PFILE

Option	Default	Purpose
`-o, --out PREFIX`	required	Output PFILE prefix
`--target PATH`	none	Single-sample / small-cohort mode; activates asymmetric strand-check + per-target call-rate gate. Targets are appended after the positional inputs; canonical remains input[0]. Repeatable — pass `--target` multiple times to append several targets in one merge
`--variant-key {chr_pos|id}`	`chr_pos`	Match key
`--on-mismatch {drop|error}`	`drop`	Allele mismatch beyond resolution
`--on-missing {fill_missing|drop_variant|error}`	`fill_missing`	Variant in input[0] absent in input N
`--on-extra {warn|drop|error}`	`warn`	Variant in input N absent from input[0]
`--on-strand {drop|flip|error}`	`flip`	Strand mismatch on unambiguous SNPs
`--trust-strand`	off	Pass A/T and C/G ambiguous SNPs without flagging (footgun for cross-source panels)
`--on-collision {error|first|suffix}`	`error`	IID collision policy. `suffix` uses `_<input_idx>` (general); `_target` (single-target mode); `_target_<input_idx>` (multi-target mode, ≥ 2 targets, so renames disambiguate). Idempotent numeric retry on further collisions
`--id-column NAME`	`IID`	`.psam` column for identity ops (e.g., `'Master ID'` for AADR anno)
`--population-column NAME`	auto	Holds population labels (default: POP / PHENO / PHENO1 fallback)
`--target-min-call-rate FLOAT`	`0.40`	Target-mode per-sample minimum call rate before exit-1
`--validate-strand-fail-pct N`	`10.0`	Exit 1 if ambiguous-strand drops exceed N% of intersection
`--relabel-from FILE`	none	TSV-driven POP relabel. 2-col header-less form maps POP→POP; N-col form (with `--relabel-input-col` / `--relabel-output-col`) joins per-sample on `--id-column`
`--relabel-input-col NAME`	none	For N-col TSVs: which column matches `--id-column`
`--relabel-output-col NAME`	none	For N-col TSVs: which column becomes the new POP value
`--report PATH`	none	Per-variant action TSV (streamed; constant memory)
`--report-json PATH`	none	Run-level summary JSON (~few KB; rows excluded by default)
`--report-json-include-rows`	off	Include per-variant rows in JSON (buffered; warns at >100 MB predicted size)
`--preflight-policy {warn|strict|off}`	`warn`	Pre-pass-1 input-compatibility gate. `warn` emits a stderr WARNING and continues; `strict` raises ValidationError (exit 1) before pass 2 runs; `off` writes the JSON but never warns or fails. See Preflight gate
`--quiet`	off	Suppress the stdout summary block and the stderr progress bar
`--block-size N`	`2048`	Variants per pgenlib read block

merge always writes a <prefix>.preflight.json (schema v1) describing per-input compatibility against the canonical: chr:pos intersection, per-chrom breakdown, per-chrom position-shift signature, alternate-key (id vs chr_pos) intersection, and a classification label per non-canonical input (compatible / build_mismatch / key_space_mismatch / disjoint_panels / empty_input). Workflow consumers may assert against the file directly — e.g. jq '.comparisons[0].intersection_fraction_of_min' or jq '.gate.would_trigger == false'. The gate block carries triggered (whether the policy acted), would_trigger (the classification-level signal independent of policy), action, policy, threshold, and failing_inputs[].
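
For consumers that prefer Python to jq, the same assertions can be sketched as below. This is a minimal sketch, not part of the tool: the field names (`comparisons[].intersection_fraction_of_min`, `comparisons[].classification`, `gate.would_trigger`, `gate.failing_inputs`) come from the description above, while the helper name `preflight_ok`, the 0.5 floor, and the hand-written sample document are hypothetical.

```python
import json

def preflight_ok(doc: dict, min_fraction: float = 0.5) -> bool:
    """True when the gate would not trigger and every comparison clears min_fraction.

    `min_fraction` is an illustrative floor chosen by the caller, not a tool default.
    """
    if doc["gate"]["would_trigger"]:
        return False
    return all(
        c["intersection_fraction_of_min"] >= min_fraction
        for c in doc["comparisons"]
    )

# Hypothetical stand-in for the contents of <prefix>.preflight.json.
sample = json.loads("""
{
  "comparisons": [
    {"classification": "compatible", "intersection_fraction_of_min": 0.98}
  ],
  "gate": {"triggered": false, "would_trigger": false, "failing_inputs": []}
}
""")

print(preflight_ok(sample))
```

In a real pipeline the dictionary would come from `json.load()` on the emitted file instead of the inline sample.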

`validate` — check alignment without writing

Same alignment / strand options as merge. No output written; reports go to stdout plus optional --report / --report-json. Exits 0 if alignment OK, 1 if any of the gates below fires.

--no-population-column: skip the population-column requirement on input psams. Use when a user PFILE has only [IID, SEX] (e.g., a single-sample VCF intersected with a reference panel before fraposa OADP projection — populations are downstream classification output, not user input). Variant-alignment, strand-orientation, and IID-collision checks still run; population-aware report fields come out empty for inputs that lack the column. Mutually exclusive with --population-column.

--preflight-policy {warn|strict|off}: same semantics as on merge. Validate has no output prefix to derive a default JSON path from, so emission is opt-in via --preflight-json PATH. The gate evaluator and stderr/exit behavior run regardless. Use pgen-samplebind validate ... --preflight-policy strict as a cheap CI dry-run before a long merge — same exit codes, same JSON schema (with "command": "validate").

`hash` — emit canonical variant-set hash

# Hashing the same variant set via two different formats yields the same
# digest: PFILE on /data/aadr_v66_subset.{pgen,pvar,psam} and EIGENSTRAT
# on /data/aadr_v66_subset_eig.{geno,snp,ind}. Format is auto-detected
# from the companion files present at the prefix.
pgen-samplebind hash /data/aadr_v66_subset
# 7c4f8e...  PFILE
pgen-samplebind hash /data/aadr_v66_subset_eig
# 7c4f8e...  EIGENSTRAT  ← same hash → format-equivalent panels

--emit-canonical prints the canonicalized bytestream that's hashed (for diagnosis when two inputs should match but don't).
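
The idea of a format-independent digest can be illustrated with a short sketch. The actual canonical bytestream is whatever `--emit-canonical` prints and is not specified here; the serialization below (sorted chrom/pos/ref/alt tuples, tab-separated, SHA-256) and the function name `variant_set_digest` are assumptions made only to show why two formats carrying the same variants can yield one hash.

```python
import hashlib

def variant_set_digest(variants):
    """Hash a variant set so that input order and file format do not matter.

    `variants` is an iterable of (chrom, pos, ref, alt) tuples.
    The serialization is an assumed one, not the tool's real canonical form.
    """
    canonical = sorted(
        (str(chrom), int(pos), ref.upper(), alt.upper())
        for chrom, pos, ref, alt in variants
    )
    h = hashlib.sha256()
    for chrom, pos, ref, alt in canonical:
        h.update(f"{chrom}\t{pos}\t{ref}\t{alt}\n".encode("ascii"))
    return h.hexdigest()

# The same two variants, as they might be read from two differently ordered sources.
from_pfile = [("1", 752566, "G", "A"), ("1", 842013, "T", "G")]
from_eigenstrat = [("1", 842013, "t", "g"), ("1", 752566, "g", "a")]
print(variant_set_digest(from_pfile) == variant_set_digest(from_eigenstrat))
```

Because only variant-level fields enter the digest, sample content never affects it, which is what makes it usable as a panel-identity check.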

`afs` — per-population allele-frequency-spectrum TSVs

pgen-samplebind afs panel -o panel_afs/

Streams genotypes from a PFILE/BFILE/EIGENSTRAT input and emits three TSVs + a manifest JSON matching the shape AdmixTools 2's *_to_afs() family returns:

panel_afs/
├── afs_snp.tsv       (variant_id, chrom, pos, ref, alt, cm)
├── afs_freq.tsv      (variant_id × population, ALT-allele frequency)
├── afs_counts.tsv    (variant_id × population, called-allele counts)
└── afs_manifest.json (tool version, sample counts per pop, parameters)

Useful for the PFILE-native pipeline: skip the plink2 --pfile … --make-bed last-mile step before AT2's non-qpfstats f2 / qpAdm path. Bridge until pfile_to_afs() lands in admixtools upstream.

Feeding AFS into AT2 — use the end-to-end bridge script:

Rscript scripts/pgensb_afs_to_at2_f2_cache.R panel_afs/ panel_at2_f2_cache/

This loads the AFS bundle, applies the discard_from_aftable(maxmiss=0, …) filter that AT2's extract_f2 silently applies before writing its cache, then calls afs_to_f2() twice (once with poly_only=TRUE for type='f2', once with poly_only=FALSE for type='ap') to produce an AT2-ready f2 cache. From R:

library(admixtools)
f2 <- f2_from_precomp("panel_at2_f2_cache/", pops = my_pops, afprod = TRUE)
qpadm(f2, left = ..., right = ..., target = ...)

The sibling scripts/load_pgensb_afs.R is a raw loader that returns the AFS as in-memory data frames without filtering — use it for inspection / debugging, not as the AT2 entry point (raw AFS fed into afs_to_f2() produces divergent f2 because extract_f2's maxmiss=0 filter isn't applied).

Limitation: AT2's extract_f2(qpfstats=TRUE) path reads genotypes directly and bypasses the AFS layer entirely. For workflows that need qpfstats (e.g., ancient-DNA with high missingness), the PFILE→BED last-mile remains. The byte-equal-to-mergeit-qpfstats proof comes from the PFILE→BED→qpfstats path, not this AFS bridge.

Key flags:

--populations POP (repeatable) — restrict to a subset of populations
--no-pseudohaploid-adjust — skip pseudohaploid called-allele adjustment (treats all samples as diploid). Default applies the adjustment when the input has a PSEUDOHAPLOID column — pseudohaploid samples then contribute 1 called allele (not 2) for correct effective sample sizes in downstream variance estimates. Allele frequencies are unchanged.
--include-sex-chrom — extend to chr 23/24/25/26 (default: autosomes only)
--population-column NAME — aggregate by a .psam column other than POP
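
The pseudohaploid adjustment described above can be sketched for a single variant in a single population. This is an illustration under stated assumptions, not the tool's code: the function name `pop_afs`, the dosage encoding (0/1/2 ALT copies, `None` for missing), and the choice to compute frequency as mean dosage divided by 2 are all hypothetical.

```python
def pop_afs(dosages, pseudohaploid, adjust=True):
    """Return (alt_freq, called_allele_count) for one variant in one population.

    dosages:       ALT dosage per sample (0, 1, 2) or None when the call is missing.
    pseudohaploid: parallel list of bools, True for pseudohaploid samples.
    adjust:        when True, pseudohaploid samples contribute 1 called allele, not 2.
    """
    called = [(d, ph) for d, ph in zip(dosages, pseudohaploid) if d is not None]
    if not called:
        return None, 0
    # Frequency is computed the same way with or without the adjustment.
    alt_freq = sum(d for d, _ in called) / (2 * len(called))
    called_alleles = sum(1 if (ph and adjust) else 2 for _, ph in called)
    return alt_freq, called_alleles

dosages = [0, 2, None, 1]
is_ph = [True, True, False, False]
print(pop_afs(dosages, is_ph))                # (0.5, 4)
print(pop_afs(dosages, is_ph, adjust=False))  # (0.5, 6)
```

The two calls return the same frequency and differ only in the called-allele count, which is the quantity that feeds effective sample size downstream.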

`inspect` — structured summary of one input

Format, sample count, variant count, populations, pseudohaploid mix, sex distribution, missingness histogram. TSV by default; --json for machine-readable.

Validation gates (exit 1)

merge applies soft-validation gates (a)-(c); validate applies all four (a)-(d). Any firing gate exits 1.

(a) Extras above warn threshold. Variants in input[N] absent from input[0] exceed the --on-extra warn threshold (10% of input[0] by default). Catches the input-order-reversed failure mode (smaller panel placed first; larger panel's distinct variants silently dropped).
(b) Ambiguous-strand drops above intersection threshold. Drops due to A/T or C/G allele ambiguity exceed --validate-strand-fail-pct of the alignment intersection (10% by default). Intersection denominator is deliberate: catches the wrong-panel failure mode (tiny intersection × 30% drop rate) that a canonical denominator would silently hide.
(c) Target call rate below threshold. In --target mode, target's non-missing call count over canonical variants is below --target-min-call-rate.
(d) --on-* error policy would have triggered. Validate-mode only — validate softens the error policies into a count + gate-(d) trigger so it can report the full picture rather than aborting at the first hit. In merge mode those policies aren't softened: they raise an InvariantViolation and exit 3 (explicit invariant enforcement is honored).

For merge, gates (a)-(c) run between pass 1 (alignment) and pass 2 (genotype streaming) — failing fast saves the pass-2 wallclock for an alignment that wouldn't have validated.
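
Gates (b) and (c) reduce to simple ratios, sketched below with hypothetical function names. The thresholds are the documented defaults; the example numbers (a 20,000-variant intersection with 6,000 ambiguous drops against a 1.15M-variant canonical) are invented to show why the intersection is used as the denominator.

```python
def strand_gate_fires(ambiguous_drops, denominator, fail_pct=10.0):
    """Gate (b): ambiguous-strand drops exceed fail_pct percent of the denominator."""
    return 100.0 * ambiguous_drops / denominator > fail_pct

def target_gate_fires(called, canonical_variants, min_call_rate=0.40):
    """Gate (c): target call rate over canonical variants is below min_call_rate."""
    return called / canonical_variants < min_call_rate

# Wrong-panel scenario: tiny intersection, 30% of it dropped as ambiguous.
drops, intersection, canonical = 6_000, 20_000, 1_150_000
print(strand_gate_fires(drops, intersection))  # True: 30% of the intersection
print(strand_gate_fires(drops, canonical))     # False: about 0.5% of the canonical
```

With the canonical size as denominator the same run would pass unnoticed, which is the failure mode the intersection denominator is meant to catch. A `min_call_rate` of 0.0 can never be undercut, matching the documented way to disable gate (c).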

Preflight gate

Catches the silent-near-empty-merge failure mode (closes #12): if the canonical and another input share almost no variants under the active --variant-key, the merge completes "successfully" with a tiny output and the failure only surfaces in a downstream consumer that expected substantial overlap. The preflight pass runs before pass 1, computes per-pair key-space intersection + per-chrom breakdown + per-chrom position-shift signature, and emits <prefix>.preflight.json (schema v1) on every merge run. validate runs the same check; JSON emission is opt-in via --preflight-json PATH.

The classifier assigns one of five labels to each non-canonical input:

Label	Trigger	Likely cause / remediation
`compatible`	active-key intersection ≥ 50% of min(canonical, other)	Normal — no gate action.
`build_mismatch`	symmetric chrom presence, zero coord overlap on shared chroms, and a consistent per-chrom position-shift signature (relative MAD < 0.1, magnitude ≥ 1 kbp, on ≥ 50% of evaluated chroms)	Coordinate-build mismatch (hg19 vs hg38 etc.) — positions on each chrom differ by a near-uniform delta. Liftover one side with CrossMap or Picard LiftoverVcf and re-run.
`key_space_mismatch`	active-key fraction low, alternate-key fraction substantially higher (lift ≥ 0.4)	The non-active variant key matches well. Try the other `--variant-key` value (e.g., `--variant-key id` for an rsID-keyed panel against a chr:pos target).
`disjoint_panels`	active-key fraction low; either asymmetric chrom presence OR symmetric chroms without a consistent shift	Verify you're merging the panels you intended. Run `pgen-samplebind hash` on each input to check panel identity.
`empty_input`	one side has zero post-filter variants	Upstream filter is too aggressive, or the input is genuinely empty.
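
The label table can be read as a decision procedure. The sketch below is a simplification under several assumptions: the order in which the checks run, the reading of "low" as below 0.5, the reading of "lift" as alternate-key fraction minus active-key fraction, and the folding of the zero-overlap and shift-signature tests into one precomputed boolean are all guesses, and `classify` is a hypothetical name rather than the tool's API.

```python
def classify(active_frac, alt_frac, n_canonical, n_other,
             symmetric_chroms, shift_consistent):
    """Assign one of the five preflight labels to a non-canonical input.

    active_frac / alt_frac: intersection as a fraction of min(canonical, other)
                            under the active and the alternate variant key.
    symmetric_chroms:       both inputs cover the same chromosomes.
    shift_consistent:       a near-uniform per-chrom position shift was found
                            (assumed to be computed elsewhere).
    """
    if n_canonical == 0 or n_other == 0:
        return "empty_input"
    if active_frac >= 0.5:
        return "compatible"
    if alt_frac - active_frac >= 0.4:
        return "key_space_mismatch"
    if symmetric_chroms and shift_consistent:
        return "build_mismatch"
    return "disjoint_panels"

print(classify(0.02, 0.95, 1_150_000, 1_150_000, True, False))
```

The example prints `key_space_mismatch`: almost nothing matches on the active key, while the alternate key matches nearly everything.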

--preflight-policy:

warn (default) — emits a stderr WARNING with each offending input's path, classification, and intersection fraction; the merge continues.
strict — raises ValidationError (exit 1, same category as gates (a)-(d)) before pass 2 runs. Under merge, also unlinks any stale .pgen / .pvar / .psam from a prior successful run at the same prefix, so workflow managers don't consume outputs that no longer match the inputs.
off — writes the JSON but never warns or fails. gate.would_trigger still carries the classification-level signal so CI pipelines that jq '.gate.would_trigger'-gate the file still see suppressed failures.

The preflight does not flag allele-swap or strand-flip mismatches — those are handled by the existing alignment code (REF_ALT_SWAP / STRAND_FLIP actions) and produce a normal merge with the recoded genotypes, not a near-empty one.

Exit codes

Stable across versions; safe to script against.

Code	Meaning
`0`	Success
`1`	Validation failure — gate (a)/(b)/(c)/(d) fired
`2`	I/O failure — read/write failure, plink2 subprocess failed, advisory output-prefix lock held
`3`	Invariant violation — multi-allelic input, duplicate canonical variant keys, `--on-* error` policy triggered, `--on-collision error` on duplicate IIDs
`4`	Usage error — bad CLI argument combination, unknown input prefix
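
Scripting against these codes might look like the following sketch. The code-to-meaning mapping is taken from the table above and the command line uses only flags documented in this README; the helper names `describe_exit` and `run_validate` are hypothetical.

```python
import subprocess

EXIT_MEANINGS = {
    0: "success",
    1: "validation failure",
    2: "I/O failure",
    3: "invariant violation",
    4: "usage error",
}

def describe_exit(code: int) -> str:
    """Translate a pgen-samplebind exit code into the category from the table."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_validate(inputs):
    """Run a strict-preflight dry run and return (exit_code, meaning).

    Requires pgen-samplebind on PATH; raises FileNotFoundError otherwise.
    """
    proc = subprocess.run(
        ["pgen-samplebind", "validate", *inputs, "--preflight-policy", "strict"],
        check=False,  # inspect the code ourselves instead of raising on non-zero
    )
    return proc.returncode, describe_exit(proc.returncode)

print(describe_exit(3))
```

A workflow would typically treat 1 as "fix the inputs", 2 as "retry or check the filesystem", and 3 or 4 as "fix the invocation".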

Concurrency

Same input prefixes: safe, read-only.
Same output prefix: tool advisory-locks {out}.lock via fcntl.flock; fails clearly with exit 2 if held; cleans up on success or signal.
Cache directory: not managed by this tool — caller's problem.

The lock prevents two concurrent invocations from corrupting each other's output. It does not synchronize across machines (file lock is local-fs only). On Linux, the tool detects NFS/SMB/CIFS at lock-acquire time (via /proc/self/mountinfo) and emits a stderr warning since fcntl.flock semantics over network filesystems are implementation-defined and effectively no-op; the lock is still attempted. On macOS the detection is suppressed (no /proc-equivalent fs-type API) to avoid false positives — the network-fs caveat applies the same way, you just won't get the diagnostic warning.
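
The locking pattern itself can be sketched with the standard library. This shows the general non-blocking `fcntl.flock` technique on a POSIX system, not the tool's actual implementation; `try_lock` and `unlock` are hypothetical helper names, and the temporary directory stands in for a real output prefix.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Return an fd holding an exclusive advisory lock, or None if it is held elsewhere."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

def unlock(fd):
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "merged.lock")
    first = try_lock(lock_path)    # acquired
    second = try_lock(lock_path)   # refused while the first holder is alive
    unlock(first)
    third = try_lock(lock_path)    # acquired again after release
    unlock(third)

print(first is not None, second is None, third is not None)
```

A caller that gets `None` back would report the held lock and exit with code 2 rather than wait.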

Performance

Expected throughput 50-70 M genotypes/sec end-to-end on a typical Linux x86_64 machine (one core, default --block-size 2048), post-v0.3 vectorization. The current CI baseline measured 70.84 M g/s on a 100M-genotype synthetic fixture (ubuntu-latest); the perf benchmark gates against regression below 80% of that recorded baseline. For a real-world bind of two 700-sample × 1.15M-variant 1240k panels (~1.6B total genotypes), plan on roughly 25-35 seconds wallclock including pass-1 alignment, pass-2 genotype rewrite, and .psam finalization — roughly 20-40% faster than v0.1.
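
The wallclock figure follows from the genotype count and the throughput range, as this back-of-the-envelope sketch shows. The function name `estimate_seconds` is hypothetical, and the result covers raw genotype throughput only; the quoted 25-35 seconds also includes alignment and .psam finalization.

```python
def estimate_seconds(n_inputs, samples_per_input, variants, genotypes_per_sec):
    """Return (total_genotypes, seconds) for a bind at a given throughput."""
    total = n_inputs * samples_per_input * variants
    return total, total / genotypes_per_sec

total, slow = estimate_seconds(2, 700, 1_150_000, 50e6)
_, fast = estimate_seconds(2, 700, 1_150_000, 70e6)
print(f"{total / 1e9:.2f}B genotypes, {fast:.0f}-{slow:.0f} s at 50-70 M g/s")
# → 1.61B genotypes, 23-32 s at 50-70 M g/s

# Regression floor used by the perf benchmark: 80% of the recorded baseline.
floor = 0.8 * 70.84
print(f"{floor:.1f} M g/s")
# → 56.7 M g/s
```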

Troubleshooting

"plink2 not found on PATH; required for eigenstrat input"

Install plink2 v2.0.0-a.7.1+ (see Install). PFILE-only workflows have no plink2 dependency. The same message appears for BFILE inputs (with "bfile" instead of "eigenstrat").

"FID column header on line 1 is not at the beginning"

Symptom of an older plink2 reading the .psam output. pgen-samplebind emits FID-first column order per the plink2 spec; if a downstream tool reports this error, verify that the consumer is plink2 v2.0.0-a.7.x or newer. Earlier plink2 lines require --no-fid or a .fam-style fallback.

`--target` mode call-rate gate fires unexpectedly

Default threshold is 0.40 (40% of canonical variants called for the target sample). For very-low-coverage targets this may be too strict. Pass --target-min-call-rate 0.20 (or lower) if appropriate; with 0.0 the gate is fully disabled.

"ASCII per-line EIGENSTRAT" inputs

Both flavors of EIGENSTRAT input are supported: PACKEDANCESTRYMAP (binary, GENO/TGENO header; converted via plink2 --eigfile) and the older ASCII per-line variant (one digit per sample-variant cell; parsed natively, with no plink2 dependency for the ASCII path). Format is auto-detected from the .geno header bytes.

Cross-source merges drop more variants than expected

By default A/T and C/G ambiguous SNPs are dropped wherever strand cannot be verified — including the case where both inputs have the same allele pair (e.g., both have A/T at the same position). The strand cannot be proven the same because complementing A/T gives T/A which is the same pair. This matches mergeit's strandcheck: YES convention and is the safe default for cross-source merges (different cohorts, different processing pipelines).
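
The ambiguity argument can be checked mechanically: a SNP is strand-ambiguous exactly when complementing both alleles returns the same allele pair. The sketch below is a generic illustration with a hypothetical function name, and it assumes uppercase single-base alleles.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_strand_ambiguous(ref, alt):
    """True when the reverse-strand alleles form the same unordered pair."""
    return {COMPLEMENT[ref], COMPLEMENT[alt]} == {ref, alt}

for ref, alt in [("A", "T"), ("C", "G"), ("A", "G"), ("C", "T")]:
    print(ref, alt, is_strand_ambiguous(ref, alt))
```

Only A/T and C/G come out ambiguous. For A/G the complement is T/C, a different pair, so a strand flip between inputs is detectable and can be corrected.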

For single-source merges where REF/ALT calls are guaranteed consistent across inputs (same AADR release, same conversion pipeline), pass --trust-strand to pass through matching ambiguous variants. The 1240k panel has ~1.3% A/T+C/G matching ambiguous SNPs, so this typically recovers 10-20K additional variants on a full 1.15M-variant panel.

"ambiguous-strand drops exceed 10% of intersection"

Gate (b) fired. Three common causes:

Wrong panel pairing: tiny intersection × normal drop rate computes as a high fraction of intersection. Run pgen-samplebind hash on each input — if the hashes differ unexpectedly, the panels aren't what you thought they were.
Different strand conventions: cross-source merges where one source already strand-flipped at A/T+C/G sites.
Legitimate but high A/T+C/G fraction: 1240k naturally has ~5-8% ambiguous; a panel restricted to ambiguous-only sites would push above 10%. Raise the threshold with --validate-strand-fail-pct 20 if that's your case, or use --trust-strand for explicit pass-through (footgun warning: silently passes potentially-flipped genotypes).

Preflight gate triggered

merge emits a stderr line like WARNING: Preflight gate triggered against canonical <path> (variant_key=chr_pos): (or raises ValidationError under --preflight-policy strict). The classification on each offending input tells you the failure shape — see Preflight gate for the full label table. The fastest triage path:

Read <prefix>.preflight.json — comparisons[].classification gives the label per non-canonical input, and comparisons[].classification_evidence carries the numbers the classifier keyed on (active-key fraction, alternate-key fraction, shift-consistency verdict, per-chrom shape).
If build_mismatch: positions on each chrom differ by a near-uniform delta. Inspect comparisons[].build_shift_signature.median_shift_per_chrom to see the per-chrom shifts. Liftover one side with CrossMap or Picard LiftoverVcf and re-run. pgen-samplebind does not liftover for you — detection is local and cheap; liftover is its own project and gets owned by the dedicated tools.
If key_space_mismatch: re-run with the other --variant-key value. Check comparisons[].alternate_key_fraction_of_min to confirm the other key would have matched.
If disjoint_panels: pgen-samplebind hash each input to verify panel identity. The classifier covers both asymmetric-chrom and same-chrom-no-shift sub-shapes — comparisons[].per_chrom shows which.
If you genuinely want to merge disjoint inputs (e.g., panels by chromosome that you intend to concatenate later), pass --preflight-policy off to suppress the gate. The JSON still records gate.would_trigger=true for the audit trail.

Half-built output files after a failure

Don't happen. The merge orchestrator wraps pass 2 + .psam finalization in a try/except that unlinks the .pgen/.pvar/.psam triplet on any tool-internal exception, so downstream pipelines never silently consume a partial output. (Atomic-rename across the triplet isn't actually atomic on NFS — three separate file ops — so unlink-on-failure is the safer default.)
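
The unlink-on-failure pattern can be sketched as follows. This is an illustration of the technique described above rather than the orchestrator's code; `write_triplet` and `failing_writer` are hypothetical names, and the failure is simulated.

```python
import tempfile
from pathlib import Path

def write_triplet(prefix, writer):
    """Run writer(prefix); on any exception remove .pgen/.pvar/.psam and re-raise."""
    outputs = [Path(f"{prefix}{ext}") for ext in (".pgen", ".pvar", ".psam")]
    try:
        writer(prefix)
    except Exception:
        for path in outputs:
            path.unlink(missing_ok=True)  # some files may never have been created
        raise

def failing_writer(prefix):
    """Simulate a crash after one of the three files was partially written."""
    Path(f"{prefix}.pgen").write_bytes(b"partial")
    raise RuntimeError("simulated pass-2 failure")

with tempfile.TemporaryDirectory() as tmp:
    try:
        write_triplet(f"{tmp}/merged", failing_writer)
    except RuntimeError:
        pass
    leftovers = [p.name for p in Path(tmp).iterdir()]

print(leftovers)  # → []
```

Re-raising after cleanup keeps the original error visible to the caller, so the exit code still reflects the real failure.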

Output prefix is locked by another pgen-samplebind process

Another invocation holds the lock at {prefix}.lock. If that process has died (kill -9, OOM, lost terminal), the lock file persists and you must remove it manually:

ls -la /data/merged.lock     # check lock-file age vs. expected runtime
rm /data/merged.lock         # only after confirming no live process holds it

Verification

End-to-end byte-equal qpAdm parity against the established mergeit + plink2 + AdmixTools 2 pipeline is verified on every commit via a vendored AADR-derivative regression test (Patterson 7-source + 4 English target pops + 1 individual target, drawn from AADR v66 under fair-use). The result is the project's trust artifact — anyone can clone, run, and confirm pgen-samplebind reproduces the established pipeline on a published-research-shape workload without trusting the maintainer's claims. See CONTRIBUTING.md §Dogfood for run instructions and the three-tier breakdown.

Status

v0.4.0 — silent-corruption fix release. Closes #10: merge could silently mis-read panel-sample genotype bytes when the canonical .pvar contained biallelic non-SNP rows (e.g., biallelic indels), producing ~17% pure 0↔2 dosage inversions at affected sites with no error, no warning, and no --report-json flag. Present in every release since v0.1.0. Fixed by stamping the original .pgen row position (_pgen_row, uint32) through read_pvar's biallelic-SNP filter so downstream pgenlib.PgenReader.read_list calls land on the correct rows. Hardened with a new check_pvar_pgen_row_count_consistent startup guardrail wired into merge / validate / inspect / afs that catches mis-paired triplets (a separate failure shape with the same silent-corruption signature). Any consumer running pgen-samplebind merge or afs against a full-scale (>10M variant) panel should re-run their pipeline after upgrading.

See CHANGELOG.md for the full feature list and known limitations.

Contributing

Issues and pull requests welcome at https://github.com/carstenerickson/pgen-samplebind/issues. Dev setup, test-runner conventions, commit + release process, and design philosophy are in CONTRIBUTING.md; the architecture tour for new contributors is in DEVELOPMENT.md. The project is small and opinionated — substantive scope changes (e.g., multi-allelic merges, dosage data, BFILE-only output) should start with a design discussion before implementation.

License

MIT — see LICENSE.

Release history

Version	Released
0.5.1 (this version)	Jun 18, 2026
0.5.0	May 27, 2026
0.4.0	May 25, 2026
0.3.2	May 18, 2026
0.3.1	May 17, 2026
0.3.0	May 12, 2026
0.1.0	May 11, 2026

Download files

Source Distribution

pgen_samplebind-0.5.1.tar.gz (97.4 kB)

Uploaded Jun 18, 2026 Source

Built Distribution

pgen_samplebind-0.5.1-py3-none-any.whl (98.4 kB)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file pgen_samplebind-0.5.1.tar.gz.

File metadata

Download URL: pgen_samplebind-0.5.1.tar.gz
Upload date: Jun 18, 2026
Size: 97.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pgen_samplebind-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`6287f37f82fa4aa6bd06d84aa8ca441d0b55ce9d69f152da32f6b774979475c2`
MD5	`b561a3db905d6bfe87cc0eceeebccdac`
BLAKE2b-256	`8306f238dd028bf2a28bdc3ab32bacdf7bb80276acd572c958fc524fe873c71a`

Provenance

The following attestation bundles were made for pgen_samplebind-0.5.1.tar.gz:

Publisher: release.yml on carstenerickson/pgen-samplebind

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pgen_samplebind-0.5.1.tar.gz
- Subject digest: 6287f37f82fa4aa6bd06d84aa8ca441d0b55ce9d69f152da32f6b774979475c2
- Sigstore transparency entry: 1861892398
- Sigstore integration time: Jun 18, 2026
Source repository:
- Permalink: carstenerickson/pgen-samplebind@f05e22b4991084291a166d5167b47946a1be10e2
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/carstenerickson
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f05e22b4991084291a166d5167b47946a1be10e2
- Trigger Event: push

File details

Details for the file pgen_samplebind-0.5.1-py3-none-any.whl.

File metadata

Download URL: pgen_samplebind-0.5.1-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 98.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pgen_samplebind-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e28bd56403fe278f6b2bf01bec0b95ea8740129b7595c542de57493cd0bb6204`
MD5	`abeaa7c50e3477c902d3743ebe447db0`
BLAKE2b-256	`791d023ee115e05c421103692018cd32c3f8792f2493178c2e95a9974777840b`

Provenance

The following attestation bundles were made for pgen_samplebind-0.5.1-py3-none-any.whl:

Publisher: release.yml on carstenerickson/pgen-samplebind

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pgen_samplebind-0.5.1-py3-none-any.whl
- Subject digest: e28bd56403fe278f6b2bf01bec0b95ea8740129b7595c542de57493cd0bb6204
- Sigstore transparency entry: 1861892485
- Sigstore integration time: Jun 18, 2026
Source repository:
- Permalink: carstenerickson/pgen-samplebind@f05e22b4991084291a166d5167b47946a1be10e2
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/carstenerickson
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f05e22b4991084291a166d5167b47946a1be10e2
- Trigger Event: push
