k-mer-based founder-mixture frequency estimation for pool-seq
kMate
Per-sample, per-record allele-frequency estimation from pooled sequencing against a
multi-founder reference panel. kMate runs a weighted k-mer Poisson EM on the founder
simplex to estimate founder frequencies (h), then projects through a per-record
presence/absence matrix (var_pa, the founder × variant alt-allele matrix $V_\mathrm{pa}$)
to allele frequencies for SNPs, indels, and SVs in a single pass, with no per-variant genotyping.
Overview
The picture above (our PEQG 2026 poster) walks through the whole idea: the
GrENE-Net experiment evolved an equal mixture of 231 Arabidopsis founders at 43 climate
sites over 3 years, pool-sequencing the surviving populations each generation. kMate takes those
pooled k-mer counts, solves a Poisson EM for the founder mixture against the panel's
kmer_pa/var_pa matrices, and reads out per-record allele frequencies for SNPs and SVs at
once. Benchmarked against simulated pools at 10× coverage, estimates track the truth closely, letting
us follow structural-variant frequency trajectories across climates (e.g. a 181-bp insertion in
the cold-regulated COR413-PM2 gene, rising in cold gardens and falling in warm ones).
Install
Create the kmate environment (mamba or conda) with its dependencies, then install the package:
git clone https://github.com/Tatianabellagio/kMate.git
cd kMate
mamba create -n kmate -c conda-forge -c bioconda python numpy scipy pysam jellyfish samtools
mamba activate kmate
pip install -e . # installs the `kmate` command (no compilation step)
Python deps: numpy, scipy, pysam. kMate also calls jellyfish (k-mer counting) and samtools (read handling), both installed by the mamba create above. This gives you a kmate command with subcommands (kmate --help).
Verify the install
kmate selftest
This runs the bundled tiny fixture (a real Chr1 panel slice + a simulated 5-founder pool) end-to-end — exercising the full k-mer-count → EM → AF-projection path through jellyfish/samtools — and checks that the planted founder mixture is recovered. It takes a few seconds, needs no network, and prints PASS on a correct install. Run this before pointing kMate at your own data.
Usage
kMate processes one pooled sample at a time, per chromosome.
You need
- Pooled reads: paired FASTQ (R1.fq R2.fq) of one pool/sample.
- A reference panel encoded as per-chromosome matrices: kmer_pa (k-mer × founder presence/absence), var_pa (founder × variant alt-allele), and record meta. Built once from your founders' phased VCF (see Building a panel). The 231-founder Arabidopsis thaliana panel used by GrENE-Net is available on request; the matrix files are large and are not stored in the Git repo.
Run
kmate run \
--kmer-pa-prefix data/kmer_pa_231_arch3_filt2inv/kmer_pa \
--var-pa panel/arch3/chr1/var_pa_231_arch3_chr1.var_pa.npz \
--var-called panel/arch3/chr1/var_pa_231_arch3_chr1.var_called.npz \
--var-meta panel/arch3/chr1/var_pa_231_arch3_chr1.meta.npz \
--reads R1.fq R2.fq --sample MYSAMPLE --out MYSAMPLE.tsv \
--threads 8 --chroms Chr1 --kmer-weight inv_mb --block-mode global
(kmate run --help lists every flag. Existing scripts that call python src/per_sample_per_chrom.py ... still work via thin shims that forward to the package.)
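Because kMate works per sample and per chromosome, a small driver that assembles one command per chromosome can be convenient. The sketch below only builds the argument lists from the flags shown above and does not run anything. The per-chromosome file naming (lowercase `chr1`, `chr2`, ... in the panel paths), the per-chromosome output file name, and the helper `kmate_run_argv` are assumptions made for illustration; check them against your own panel layout.

```python
import shlex

def kmate_run_argv(sample, chrom, reads, threads=8):
    """Build the argv for one `kmate run` call on one chromosome (hypothetical helper)."""
    tag = chrom.lower()  # assumption: panel files use lowercase chr1, chr2, ...
    stem = f"panel/arch3/{tag}/var_pa_231_arch3_{tag}"  # assumed naming pattern
    return [
        "kmate", "run",
        "--kmer-pa-prefix", "data/kmer_pa_231_arch3_filt2inv/kmer_pa",
        "--var-pa", f"{stem}.var_pa.npz",
        "--var-called", f"{stem}.var_called.npz",
        "--var-meta", f"{stem}.meta.npz",
        "--reads", *reads,
        "--sample", sample,
        "--out", f"{sample}.{chrom}.tsv",  # assumption: one output file per chromosome
        "--threads", str(threads),
        "--chroms", chrom,
        "--kmer-weight", "inv_mb",
        "--block-mode", "global",
    ]

# Arabidopsis thaliana has five nuclear chromosomes.
chroms = [f"Chr{i}" for i in range(1, 6)]
commands = [kmate_run_argv("MYSAMPLE", c, ["R1.fq", "R2.fq"]) for c in chroms]

for argv in commands:
    # Print the shell-quoted command; to execute it, pass argv to
    # subprocess.run(argv, check=True) instead.
    print(shlex.join(argv))
```

Building argument lists rather than shell strings avoids quoting problems if sample names or paths contain spaces.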
Estimator mode (--block-mode)
- global: one founder mixture per chromosome. Use for selfing / inbred / founder (F0) pools.
- window: per-window mixture with HMM smoothing, for recombinant pools. --block-mode window alone reproduces the production "star2" recipe (10 kb windows, 5 smoothing passes).
Output: a per-record TSV, one row per panel variant (SNP / indel / SV):
| chrom | pos | ref_len | alt_len | alt_freq | info | n_called | se |
|---|---|---|---|---|---|---|---|
alt_freq is the estimated alternate-allele frequency in the pool; n_called and se carry support/uncertainty. (--var-called adds a per-record called-mask; --kmer-db lets you count k-mers once and query per-chrom instead of re-scanning reads; --hash-size tunes the Jellyfish hash, e.g. lower it to 100M on memory-capped jobs.)
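As a rough illustration of consuming this output, the sketch below parses a TSV with the columns listed above using only the standard library. It assumes the file starts with a header row carrying those column names and that `pos`, `ref_len`, `alt_len`, and `n_called` are integers; neither is confirmed here. The two rows are invented for the example, and the 50-bp SV cutoff in `variant_class` is a common convention, not something kMate defines.

```python
import csv
import io

COLUMNS = ["chrom", "pos", "ref_len", "alt_len", "alt_freq", "info", "n_called", "se"]

def read_kmate_tsv(handle):
    """Yield one dict per record with numeric columns converted (assumes a header row)."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "chrom": row["chrom"],
            "pos": int(row["pos"]),
            "ref_len": int(row["ref_len"]),
            "alt_len": int(row["alt_len"]),
            "alt_freq": float(row["alt_freq"]),
            "info": row["info"],
            "n_called": int(row["n_called"]),
            "se": float(row["se"]),
        }

def variant_class(rec, sv_min_len=50):
    """Rough size-based label; the 50-bp SV threshold is an assumption."""
    if rec["ref_len"] == 1 and rec["alt_len"] == 1:
        return "SNP"
    if abs(rec["alt_len"] - rec["ref_len"]) >= sv_min_len:
        return "SV"
    return "indel"  # everything else: short indels and other small events

# Invented demo rows, not real kMate output.
lines = [
    "\t".join(COLUMNS),
    "Chr1\t1000\t1\t1\t0.25\t.\t231\t0.01",
    "Chr1\t2000\t1\t182\t0.40\t.\t229\t0.02",
]
records = list(read_kmate_tsv(io.StringIO("\n".join(lines) + "\n")))
classes = [variant_class(r) for r in records]
print(classes)
```

For a real file, replace the `io.StringIO` object with `open("MYSAMPLE.tsv", newline="")`.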
How it works
From the read k-mer spectrum, kMate solves a weighted Poisson EM on the 231-founder
simplex for the founder mixture h, then projects h through the panel's
presence/absence matrix var_pa to a per-record allele frequency for SNPs, indels and SVs
together, in one pass. Processing one chromosome at a time keeps peak memory ~5× below a
genome-wide solve, and the per-chromosome h agrees with the genome-wide estimate to ~0.1%.
For the full math and code wiring, see ALGORITHM.md.
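To make the two steps concrete, here is a minimal NumPy sketch of a generic weighted Poisson EM followed by the projection to allele frequencies. It is not the kMate implementation (see ALGORITHM.md for that). It assumes each k-mer count is Poisson with mean `sum_f kmer_pa[i, f] * theta[f]`, where `theta = depth * h`, and applies the standard multiplicative EM update for that model. The function names, the toy weights (which do not reproduce `--kmer-weight inv_mb`), and the way the called-mask renormalizes the projection are all assumptions.

```python
import numpy as np

def poisson_em(counts, kmer_pa, weights, n_iter=300):
    """Weighted Poisson EM; returns (h on the simplex, estimated depth)."""
    n_founders = kmer_pa.shape[1]
    # Flat, strictly positive start for theta = depth * h.
    theta = np.full(n_founders, counts.sum() / max(kmer_pa.sum(), 1.0))
    col_weight = weights @ kmer_pa  # sum_i w_i * A[i, f], constant across iterations
    for _ in range(n_iter):
        mu = kmer_pa @ theta  # expected count per k-mer
        ratio = np.divide(counts, mu, out=np.zeros_like(mu), where=mu > 0)
        theta = theta * ((weights * ratio) @ kmer_pa) / col_weight
    return theta / theta.sum(), theta.sum()

def project(h, var_pa, var_called=None):
    """Allele frequency per record: h projected through the founder x variant matrix."""
    if var_called is None:
        return h @ var_pa
    # Assumption: uncalled founders are dropped and the rest renormalized.
    num = h @ (var_pa * var_called)
    den = h @ var_called
    return np.divide(num, den, out=np.full_like(den, np.nan), where=den > 0)

# Simulated toy data: 5 founders, 5000 k-mers, depth 20.
rng = np.random.default_rng(0)
h_true = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
kmer_pa = (rng.random((5000, 5)) < 0.5).astype(float)
counts = rng.poisson(20.0 * (kmer_pa @ h_true))
# Toy stand-in weighting: down-weight k-mers shared by many founders.
weights = 1.0 / np.maximum(kmer_pa.sum(axis=1), 1.0)

h_hat, depth_hat = poisson_em(counts, kmer_pa, weights)

# Founder x variant alt-allele matrix for three toy records.
var_pa = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
alt_freq = project(h_hat, var_pa)
print(np.round(h_hat, 3), np.round(alt_freq, 3))
```

Because the projection is a single matrix product, SNPs, indels, and SVs are handled identically once they are columns of `var_pa`, which is why no per-variant genotyping step is needed.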
Building a panel
To run kMate on your own founder set you build the panel matrices once from a multi-founder
phased VCF: var_pa from the founder genotypes and kmer_pa from a k-mer index of the
founders. The builders live in panel/ and data/. The
231-founder Arabidopsis panel (used by GrENE-Net) and its exact construction are documented
in docs/PIPELINE_STATE.md §0.
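To show the shape of the two matrices, the toy sketch below builds them from in-memory founder sequences and genotype calls. It is not the builder code in panel/. It assumes haploid-style 0/1 calls (as for inbred founders), uses `None` for a missing call, and ignores reverse-complement (canonical) k-mers, which a real k-mer index would normally handle. The function names and inputs are hypothetical.

```python
import numpy as np

def build_kmer_pa(founder_seqs, k):
    """Return (sorted k-mer list, k-mer x founder 0/1 presence/absence matrix)."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in founder_seqs]
    kmers = sorted(set().union(*kmer_sets))
    pa = np.array([[km in ks for ks in kmer_sets] for km in kmers], dtype=np.uint8)
    return kmers, pa

def build_var_pa(genotypes):
    """Return (founder x variant alt-allele matrix, called-mask); None = missing call."""
    gt = np.array([[-1 if g is None else g for g in row] for row in genotypes])
    return (gt == 1).astype(np.uint8), (gt >= 0).astype(np.uint8)

# Three toy founders; founders 0 and 2 share a haplotype.
founders = ["ACGTAC", "ACGAAC", "ACGTAC"]
kmers, kmer_pa = build_kmer_pa(founders, k=3)

# k-mers carried by every founder say nothing about the mixture.
informative = kmer_pa.sum(axis=1) < len(founders)

# Rows are founders, columns are variants; founder 1 has a missing call at variant 1.
var_pa, var_called = build_var_pa([[1, 0], [0, None], [1, 1]])

print(kmers)
print(kmer_pa)
print(var_pa)
print(var_called)
```

Note the orientation: `kmer_pa` has k-mers as rows and founders as columns, while `var_pa` has founders as rows and variants as columns, matching the descriptions above.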
Repository layout
src/kmate/ the kMate package (em_solver, kmer_count, block_em, per_sample_per_chrom, cli, selftest)
pyproject.toml, conda/ packaging: pip-installable `kmate` CLI + conda recipe
panel/ founder-panel construction (var_pa builders, k-mer index)
data/ prebuilt panel matrices (kmer_pa_*, var_pa_*) + sample lists
grenenet/ GrENE-Net application: production scale-out over the evolved cohort
benchmarks/ end-to-end accuracy benchmarks (p80 control, p231 headline)
sims/ pool-seq simulation framework (AF truth); see sims/README.md
docs/ methods + analysis writeups
Documentation
| Doc | What it is |
|---|---|
| docs/PIPELINE_STATE.md | Production inputs, run recipe, and environment; the project source of truth. |
| ALGORITHM.md | The kMate algorithm, math, and code wiring. |
| BACKGROUND.md | Project framing, known biases, and design decisions. |
| SAVIO_HPC.md | Cluster ops (partitions, sbatch recipes). |
Using kMate
kMate is not yet published as a standalone method. If you are interested in using kMate for your project, or in collaborating, please get in touch:
Tatiana Bellagio (tatianabellagio@gmail.com)
kMate was developed for, and underlies the allele-frequency analyses of, the GrENE-Net outdoor evolution experiment in Arabidopsis thaliana.
License
Released under the MIT License.