kmate

k-mer-based founder-mixture frequency estimation for pool-seq

kMate

Per-sample, per-record allele-frequency estimation from pooled sequencing against a multi-founder reference panel. kMate runs a weighted k-mer Poisson EM on the founder simplex to estimate founder frequencies (h), then projects through a per-record presence/absence matrix (var_pa, the founder × variant alt-allele matrix $V_\mathrm{pa}$) to allele frequencies for SNPs, indels, and SVs in a single pass, with no per-variant genotyping.
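One way to write these two steps down, as a sketch rather than kMate's exact formulation (ALGORITHM.md is the reference). The symbols $c_j$ (observed count of k-mer $j$), $w_j$ (its weight), $K$ (the kmer_pa matrix, k-mer × founder), $\lambda$ (a depth scale) and $p_v$ (alt-allele frequency of record $v$) are notation assumed here for illustration:

$$
\ell(h, \lambda) = \sum_j w_j \left[ c_j \log\big(\lambda\,(K h)_j\big) - \lambda\,(K h)_j \right],
\qquad h_f \ge 0,\quad \sum_f h_f = 1,
$$

$$
\hat{p}_v = \sum_f \hat{h}_f \,(V_\mathrm{pa})_{f v}.
$$

For example, with three founders at $\hat{h} = (0.5, 0.3, 0.2)$ and a variant carried by founders 1 and 3, the projected frequency is $\hat{p}_v = 0.5 + 0.2 = 0.7$.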

Overview

Our PEQG 2026 poster walks through the whole idea. The GrENE-Net experiment evolved an equal mixture of 231 Arabidopsis founders at 43 climate sites over 3 years, pool-sequencing the surviving populations each generation. kMate takes the k-mer counts from those pools, solves a Poisson EM for the founder mixture against the panel's kmer_pa matrix, and projects that mixture through var_pa to read out per-record allele frequencies for SNPs and SVs at once. Benchmarked against simulated pools at 10× coverage, estimates track the truth closely, which lets us follow structural-variant frequency trajectories across climates (e.g. a 181-bp insertion in the cold-regulated COR413-PM2 gene, rising in cold gardens and falling in warm ones).

Install

Create the kmate environment (mamba or conda) with its dependencies, then install the package:

git clone https://github.com/Tatianabellagio/kMate.git
cd kMate
mamba create -n kmate -c conda-forge -c bioconda python numpy scipy pysam jellyfish samtools
mamba activate kmate
pip install -e .          # installs the `kmate` command (no compilation step)

Python deps: numpy, scipy, pysam. kMate also calls jellyfish (k-mer counting) and samtools (read handling), both installed by the mamba create above. This gives you a kmate command with subcommands (kmate --help).
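Because kMate shells out to jellyfish and samtools, a quick pre-flight check that both are on PATH can save a failed run. A minimal sketch using only the standard library; `find_tools` is a hypothetical helper, not part of kMate:

```python
import shutil

def find_tools(names=("jellyfish", "samtools")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND (is the kmate environment active?)'}")
```

If either tool is reported missing, activating the kmate environment created above is the first thing to try.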

Verify the install

kmate selftest

This runs the bundled tiny fixture (a real Chr1 panel slice + a simulated 5-founder pool) end-to-end — exercising the full k-mer-count → EM → AF-projection path through jellyfish/samtools — and checks that the planted founder mixture is recovered. It takes a few seconds, needs no network, and prints PASS on a correct install. Run this before pointing kMate at your own data.

Usage

kMate processes one pooled sample at a time, per chromosome.

You need

Pooled reads: paired FASTQ (R1.fq R2.fq) of one pool/sample.
A reference panel encoded as per-chromosome matrices: kmer_pa (k-mer × founder presence/absence), var_pa (founder × variant alt-allele), and record meta. Built once from your founders' phased VCF (see Building a panel). The 231-founder Arabidopsis thaliana panel used by GrENE-Net is available on request; the matrix files are large and are not stored in the Git repo.

Run

kmate run \
    --kmer-pa-prefix data/kmer_pa_231_arch3_filt2inv/kmer_pa \
    --var-pa     panel/arch3/chr1/var_pa_231_arch3_chr1.var_pa.npz \
    --var-called panel/arch3/chr1/var_pa_231_arch3_chr1.var_called.npz \
    --var-meta   panel/arch3/chr1/var_pa_231_arch3_chr1.meta.npz \
    --reads R1.fq R2.fq --sample MYSAMPLE --out MYSAMPLE.tsv \
    --threads 8 --chroms Chr1 --kmer-weight inv_mb --block-mode global

(kmate run --help lists every flag. Existing scripts that call python src/per_sample_per_chrom.py ... still work via thin shims that forward to the package.)
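Since runs are per sample and per chromosome, it can help to generate the command line programmatically. The sketch below only assembles the flags shown in the example above; `kmate_run_argv` is a hypothetical helper, and two things in it are assumptions: that every chromosome's panel files follow the same naming pattern as the Chr1 paths, and the per-chromosome output file name.

```python
import subprocess

def kmate_run_argv(sample, reads, chrom, panel_dir="panel/arch3",
                   kmer_pa_prefix="data/kmer_pa_231_arch3_filt2inv/kmer_pa",
                   threads=8):
    """Build the `kmate run` argument list for one sample and one chromosome.

    ASSUMPTION: panel files for every chromosome are named like the Chr1
    example (panel/arch3/chr1/var_pa_231_arch3_chr1.*.npz).
    """
    c = chrom.lower()  # "Chr1" -> "chr1", as in the example paths
    stem = f"{panel_dir}/{c}/var_pa_231_arch3_{c}"
    return [
        "kmate", "run",
        "--kmer-pa-prefix", kmer_pa_prefix,
        "--var-pa", f"{stem}.var_pa.npz",
        "--var-called", f"{stem}.var_called.npz",
        "--var-meta", f"{stem}.meta.npz",
        "--reads", *reads,
        "--sample", sample,
        "--out", f"{sample}.{chrom}.tsv",  # hypothetical per-chromosome output name
        "--threads", str(threads),
        "--chroms", chrom,
        "--kmer-weight", "inv_mb",
        "--block-mode", "global",
    ]

if __name__ == "__main__":
    for chrom in ["Chr1", "Chr2"]:
        argv = kmate_run_argv("MYSAMPLE", ["R1.fq", "R2.fq"], chrom)
        print(" ".join(argv))
        # subprocess.run(argv, check=True)  # uncomment to actually launch kMate
```

Passing the arguments as a list avoids shell quoting problems with unusual sample names or paths.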

Estimator mode (--block-mode)

global: one founder mixture per chromosome. Use for selfing / inbred / founder (F0) pools.
window: per-window mixture with HMM smoothing, for recombinant pools. --block-mode window alone reproduces the production "star2" recipe (10 kb windows, 5 smoothing passes).
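To make the difference between the two modes concrete, the sketch below projects one mixture per window, instead of one per chromosome, onto the records that fall in each window. It does not show how kMate estimates or HMM-smooths the per-window mixtures; `project_by_window` is a hypothetical helper, and assigning records to windows by integer division of the position is an assumption.

```python
import numpy as np

def project_by_window(pos, var_pa, h_windows, window_bp=10_000):
    """Project per-window founder mixtures to per-record allele frequencies.

    pos       : (V,) record positions on one chromosome
    var_pa    : (F, V) founder x variant alt-allele matrix (0/1)
    h_windows : (W, F) one founder mixture per window, rows summing to 1
    """
    h_windows = np.asarray(h_windows, dtype=float)
    win = np.asarray(pos) // window_bp           # window index of each record
    win = np.clip(win, 0, len(h_windows) - 1)    # guard records past the last window
    # For record v: dot the mixture of its window with column v of var_pa.
    return np.einsum("vf,fv->v", h_windows[win], np.asarray(var_pa, dtype=float))

# Toy example: 2 founders, 3 records, 2 windows of 10 kb.
var_pa = np.array([[1, 0, 1],
                   [0, 1, 1]])
h_windows = np.array([[0.75, 0.25],
                      [0.50, 0.50]])
print(project_by_window([500, 9_999, 10_000], var_pa, h_windows))
```

With a single window covering the whole chromosome this reduces to the global projection of h through var_pa.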

Output: a per-record TSV, one row per panel variant (SNP / indel / SV):

chrom	pos	ref_len	alt_len	alt_freq	info	n_called	se

alt_freq is the estimated alternate-allele frequency in the pool; n_called and se carry support and uncertainty. Optional flags: --var-called adds a per-record called-mask; --kmer-db lets you count k-mers once and query per chromosome instead of re-scanning the reads; --hash-size tunes the Jellyfish hash (e.g. lower it to 100M on memory-capped jobs).
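The output can be read with the standard library alone. A minimal sketch, assuming the file starts with the header row shown above; `read_kmate_tsv` and `variant_class` are hypothetical helpers, the 50-bp SV cutoff is a common convention rather than kMate's definition, and the toy rows are made up purely to exercise the parser:

```python
import csv
import io

def read_kmate_tsv(handle):
    """Yield one dict per record with the numeric columns converted.

    ASSUMPTION: the file begins with the header row shown above.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("pos", "ref_len", "alt_len"):
            row[key] = int(row[key])
        row["alt_freq"] = float(row["alt_freq"])
        yield row

def variant_class(row, sv_min_bp=50):
    """SNP / indel / SV from allele lengths.

    sv_min_bp is a convention, not kMate's rule; equal-length multi-base
    records fall into "indel" in this sketch.
    """
    if row["ref_len"] == 1 and row["alt_len"] == 1:
        return "SNP"
    return "SV" if abs(row["alt_len"] - row["ref_len"]) >= sv_min_bp else "indel"

# Made-up rows, only to exercise the parser.
toy = io.StringIO(
    "chrom\tpos\tref_len\talt_len\talt_freq\tinfo\tn_called\tse\n"
    "Chr1\t1000\t1\t1\t0.25\t.\t231\t0.01\n"
    "Chr1\t2000\t1\t182\t0.10\t.\t230\t0.02\n"
)
for rec in read_kmate_tsv(toy):
    print(rec["chrom"], rec["pos"], variant_class(rec), rec["alt_freq"])
```

For a real file, replace the `io.StringIO` object with `open("MYSAMPLE.tsv", newline="")`.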

How it works

From the read k-mer spectrum, kMate solves a weighted Poisson EM on the 231-founder simplex for the founder mixture h, then projects h through the panel's presence/absence matrix var_pa to a per-record allele frequency for SNPs, indels and SVs together, in one pass. Processing one chromosome at a time keeps peak memory ~5× below a genome-wide solve (per-chrom h agrees to ~0.1%). Full math + code wiring: ALGORITHM.md.
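The EM step can be illustrated with a small NumPy sketch. This is not kMate's solver (see ALGORITHM.md for that); it is the textbook multiplicative EM update for a Poisson mixture, under the assumed model that the count of k-mer j is Poisson with mean lam * (kmer_pa @ h)[j]. The function name `poisson_em`, the generic `weights` argument, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def poisson_em(counts, kmer_pa, weights=None, n_iter=300, eps=1e-12):
    """Weighted Poisson EM for a founder mixture (illustrative sketch).

    Assumed model: counts[j] ~ Poisson(lam * (kmer_pa @ h)[j]), h on the simplex.
    counts  : (K,) observed k-mer counts in the pool
    kmer_pa : (K, F) k-mer x founder presence/absence (0/1)
    weights : (K,) non-negative per-k-mer weights (default: all ones)
    Returns (h, lam): founder frequencies and the depth scale.
    """
    counts = np.asarray(counts, dtype=float)
    A = np.asarray(kmer_pa, dtype=float)
    w = np.ones_like(counts) if weights is None else np.asarray(weights, dtype=float)
    n_founders = A.shape[1]
    # theta = lam * h; start from a uniform mixture at the mean observed count.
    theta = np.full(n_founders, max(counts.mean(), eps) / n_founders)
    denom = np.maximum(A.T @ w, eps)                 # sum_j w_j * A[j, f]
    for _ in range(n_iter):
        mu = np.maximum(A @ theta, eps)              # expected count per k-mer
        theta *= (A.T @ (w * counts / mu)) / denom   # multiplicative EM update
    lam = max(theta.sum(), eps)
    return theta / lam, lam

# Simulated pool: 5 founders, 2000 k-mers, depth scale 10 (all made up).
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(2000, 5))
h_true = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
counts = rng.poisson(10.0 * (A @ h_true))
h_hat, lam_hat = poisson_em(counts, A)
print(np.round(h_hat, 3), round(float(lam_hat), 2))
```

Working with theta = lam * h keeps the update unconstrained and non-negative; normalizing at the end puts h back on the simplex, and an estimate like h_hat is what then gets projected through var_pa.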

Building a panel

To run kMate on your own founder set, you build the panel matrices once from a multi-founder phased VCF: var_pa from the founder genotypes and kmer_pa from a k-mer index of the founders. The builders live in panel/ and data/. The 231-founder Arabidopsis panel used by GrENE-Net and its exact construction are documented in docs/PIPELINE_STATE.md §0.
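To show the shape of the two matrices, here is a toy construction in plain Python and NumPy. It is not the builder code in panel/: both function names are hypothetical, the k-mer step ignores strand and canonical k-mers, and the genotype step assumes inbred founders and refuses heterozygous or missing calls rather than guess how kMate encodes them.

```python
import numpy as np

def kmer_pa_from_sequences(founder_seqs, k):
    """k-mer x founder 0/1 matrix from toy founder sequences.

    Returns (kmers, kmer_pa) with kmers sorted; strand handling is omitted.
    """
    per_founder = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in founder_seqs]
    kmers = sorted(set().union(*per_founder))
    pa = np.array([[km in fs for fs in per_founder] for km in kmers], dtype=np.uint8)
    return kmers, pa

def var_pa_from_gt(gt_rows):
    """founder x variant 0/1 matrix from phased GT strings.

    gt_rows[v][f] is the GT of founder f at variant v, e.g. "0|0" or "1|1".
    ASSUMPTION: founders are inbred, so only homozygous calls are accepted.
    """
    n_var, n_founders = len(gt_rows), len(gt_rows[0])
    var_pa = np.zeros((n_founders, n_var), dtype=np.uint8)
    for v, row in enumerate(gt_rows):
        for f, gt in enumerate(row):
            alleles = set(gt.split("|"))
            if alleles not in ({"0"}, {"1"}):
                raise ValueError(f"unsupported genotype {gt!r} at variant {v}, founder {f}")
            var_pa[f, v] = int(alleles == {"1"})
    return var_pa

kmers, kmer_pa = kmer_pa_from_sequences(["ACGT", "ACGA"], k=3)
print(kmers)      # k-mers seen in at least one founder
print(kmer_pa)    # rows: k-mers, columns: founders
print(var_pa_from_gt([["0|0", "1|1"], ["1|1", "1|1"]]))  # rows: founders, columns: variants
```

Note the orientation: kmer_pa is k-mer × founder while var_pa is founder × variant, matching the descriptions under "You need".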

Repository layout

src/kmate/   the kMate package (em_solver, kmer_count, block_em, per_sample_per_chrom, cli, selftest)
pyproject.toml, conda/   packaging: pip-installable `kmate` CLI + conda recipe
panel/       founder-panel construction (var_pa builders, k-mer index)
data/        prebuilt panel matrices (kmer_pa_*, var_pa_*) + sample lists
grenenet/    GrENE-Net application: production scale-out over the evolved cohort
benchmarks/  end-to-end accuracy benchmarks (p80 control, p231 headline)
sims/        pool-seq simulation framework (AF truth); see sims/README.md
docs/        methods + analysis writeups

Documentation

| Doc | What it is |
| --- | --- |
| `docs/PIPELINE_STATE.md` | Production inputs, run recipe, and environment; the project source of truth. |
| `ALGORITHM.md` | The kMate algorithm, math, and code wiring. |
| `BACKGROUND.md` | Project framing, known biases, and design decisions. |
| `SAVIO_HPC.md` | Cluster ops (partitions, sbatch recipes). |

Using kMate

kMate is not yet published as a standalone method. If you are interested in using kMate for your project, or in collaborating, please get in touch:

Tatiana Bellagio (tatianabellagio@gmail.com)

kMate was developed for, and underlies the allele-frequency analyses of, the GrENE-Net outdoor evolution experiment in Arabidopsis thaliana.

License

Released under the MIT License.

This version: 0.1.0 (Jun 18, 2026)