🐟 bunbun

Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.

bunbun reads genotype data from any major format and gives you back the same three things every time:

attribute	type	contents
`.samples`	`polars.DataFrame`	sample metadata (ID, sex, family, phenotype, …)
`.variants`	`polars.DataFrame`	variant metadata (chrom, pos, ref, alt, …)
`.dosages`	`numpy.ndarray`	genotype matrix — `(n_samples × n_variants)`, float32

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun
uv venv
# activate the environment on Linux/macOS
source .venv/bin/activate
# or, on Windows (Git Bash)
source .venv/Scripts/activate
uv pip install maturin numpy pytest polars

# install from source
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: e.g. shape (500, 10000), dtype float32

data.shape      # (n_samples, n_variants), e.g. (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")
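
To illustrate what auto-detection from a file name could look like, here is a minimal sketch that maps an extension to one of the four formats. The function name `guess_format`, the extension table, and the list of compression suffixes are hypothetical and written for this example; bunbun's actual detection logic is not documented here and may differ.

```python
from pathlib import Path

# Hypothetical extension table; bunbun's real detection logic may differ.
_EXTENSIONS = {
    ".bed": "plink", ".bim": "plink", ".fam": "plink",
    ".pgen": "plink2", ".pvar": "plink2", ".psam": "plink2",
    ".bgen": "bgen",
    ".vcf": "vcf",
}

# Assumed set of compression suffixes that may follow the format suffix.
_COMPRESSION_SUFFIXES = {".gz", ".bgz", ".zst", ".bz2"}


def guess_format(path: str) -> str:
    """Guess the genotype format from a file name (illustrative only)."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop a trailing compression suffix so "calls.vcf.gz" maps to VCF.
    if suffixes and suffixes[-1] in _COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in _EXTENSIONS:
        raise ValueError(f"unrecognized genotype file: {path}")
    return _EXTENSIONS[suffixes[-1]]
```

A dispatcher built on this would call the matching `read_*` function for the returned format name.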

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
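
For readers who want to see what the first three operations compute, the sketch below reproduces them with plain numpy on a toy matrix. It assumes diploid calls, so the alternate allele frequency is the mean non-missing dosage divided by 2; that definition is an assumption, not a statement of bunbun's implementation. Unlike `mean_impute()`, the sketch returns a new array instead of modifying the input.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [[0.0, 1.0, np.nan],
     [1.0, 2.0, 2.0],
     [np.nan, 1.0, 2.0],
     [2.0, 0.0, 2.0]],
    dtype=np.float32,
)

missing = np.isnan(dosages)
n_called = (~missing).sum(axis=0)  # non-missing samples per variant

# Missing rate per variant: fraction of samples with no call.
miss = missing.mean(axis=0)

# Alt allele frequency per variant (assumed diploid): mean dosage / 2
# over the non-missing samples.
freqs = np.nansum(dosages, axis=0) / (2 * n_called)

# Mean imputation: replace each NaN with that variant's mean dosage.
col_means = np.nansum(dosages, axis=0) / n_called
imputed = np.where(missing, col_means, dosages)
```

A variant with no called samples would divide by zero here, so real code needs a guard for that case.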

Works with your existing stack

import bunbun
import polars as pl
import numpy as np
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars (here: chromosome 1 only)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn (nan_to_num replaces missing calls with 0.0)
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))
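
Filtering `data.variants` only selects metadata rows. To pick the matching genotype columns, a boolean mask can index the dosage matrix. The sketch below uses small numpy stand-ins and assumes that row i of the variant table describes column i of the dosage matrix, which the `(n_samples, n_variants)` shape suggests but which is not stated explicitly.

```python
import numpy as np

# Stand-ins for data.variants["chrom"] and data.dosages (3 samples x 4 variants).
chrom = np.array(["1", "1", "2", "X"])
dosages = np.arange(12, dtype=np.float32).reshape(3, 4)

# Assumption: variant row i describes dosage column i.
mask = chrom == "1"
chr1_dosages = dosages[:, mask]  # keeps all samples, only chromosome 1 variants
```

With a real dataset the mask could come from polars, for example `(data.variants["chrom"] == "1").to_numpy()`.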

Supported formats

Format	Extensions	Compression	Dosage
PLINK 1.x	`.bed` `.bim` `.fam`	—	hard calls
VCF	`.vcf` `.vcf.gz`	gzip, bgzf	GT or DS field
PLINK 2	`.pgen` `.pvar` `.psam`	—	hard calls
BGEN	`.bgen` `.sample`	zlib, zstd	probabilistic
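
As background for the PLINK 1.x row, the hard calls in a `.bed` file are stored as 2-bit codes, four samples per byte, after a 3-byte header. The sketch below decodes them in pure Python following the PLINK 1 file format. It counts copies of the first `.bim` allele (A1); which allele bunbun reports as `alt` is not stated here, so treat that mapping as an assumption. The function names are hypothetical.

```python
import math

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # PLINK 1 .bed header, variant-major mode

# 2-bit code -> copies of the first .bim allele (A1); code 1 means missing.
_CODE_TO_DOSAGE = {0: 2.0, 1: math.nan, 2: 1.0, 3: 0.0}


def decode_variant(block: bytes, n_samples: int) -> list[float]:
    """Decode one variant's packed genotypes (4 samples per byte, low bits first)."""
    out = []
    for i in range(n_samples):
        code = (block[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(_CODE_TO_DOSAGE[code])
    return out


def decode_bed(raw: bytes, n_samples: int, n_variants: int) -> list[list[float]]:
    """Return one list of dosages per variant (variant-major, like the file)."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    stride = (n_samples + 3) // 4  # bytes per variant, padded to a whole byte
    return [
        decode_variant(raw[3 + v * stride : 3 + (v + 1) * stride], n_samples)
        for v in range(n_variants)
    ]
```

The result is variant-major, so it would need a transpose to match the `(n_samples, n_variants)` layout of `.dosages`.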

Output schema

`data.samples` — polars DataFrame

column	dtype	notes
`sample_id`	Utf8	IID (PLINK) or sample column (VCF)
`family_id`	Utf8	FID or `"."`
`paternal_id`	Utf8	or `"0"`
`maternal_id`	Utf8	or `"0"`
`sex`	Utf8	`"male"` / `"female"` / `"unknown"`
`phenotype`	Float64	or `NaN`
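
As an illustration of how a PLINK `.fam` row could map onto these columns, here is a minimal sketch. It relies on the PLINK convention that sex code 1 is male and 2 is female, and it treats `-9` and non-numeric values as a missing phenotype. The function name `parse_fam_line` is hypothetical, and the missing-value handling is an assumption, not bunbun's documented behavior.

```python
import math

_SEX = {"1": "male", "2": "female"}  # PLINK .fam sex codes; anything else is unknown


def parse_fam_line(line: str) -> dict:
    """Map one whitespace-delimited .fam row onto the sample columns above."""
    fid, iid, father, mother, sex, pheno = line.split()[:6]
    try:
        phenotype = float(pheno)
    except ValueError:
        phenotype = math.nan  # non-numeric values such as "NA"
    # Assumption: -9 is the only numeric missing-phenotype sentinel handled here.
    if phenotype == -9:
        phenotype = math.nan
    return {
        "sample_id": iid,
        "family_id": fid,
        "paternal_id": father,
        "maternal_id": mother,
        "sex": _SEX.get(sex, "unknown"),
        "phenotype": phenotype,
    }
```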

`data.variants` — polars DataFrame

column	dtype	notes
`chrom`	Utf8	string for X/Y/MT compatibility
`pos`	UInt32	1-based bp position
`id`	Utf8	rsID or `chrom:pos:ref:alt`
`ref`	Utf8	reference allele
`alt`	Utf8	alternate allele(s), comma-separated
`cm`	Float64	genetic distance or `NaN`
`qual`	Float64	quality score or `NaN`
`filter`	Utf8	`"PASS"` / filter string
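
The `id` fallback can be sketched as a small helper. This assumes that `"."` (the VCF convention) or an empty string marks a missing ID; the helper name `variant_id` is hypothetical and the exact rule bunbun applies is not specified here.

```python
def variant_id(chrom: str, pos: int, ref: str, alt: str, raw_id: str = ".") -> str:
    """Return raw_id if present, else a chrom:pos:ref:alt key."""
    # Assumption: "." or an empty string means the file carried no ID.
    if raw_id and raw_id != ".":
        return raw_id
    return f"{chrom}:{pos}:{ref}:{alt}"
```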

`data.dosages` — numpy ndarray

Shape: (n_samples, n_variants)
Dtype: float32
Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
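
The encoding above can be made concrete with two small helpers: one counts alternate alleles in a VCF `GT` string, the other takes the expected alternate-allele count from three diploid genotype probabilities, which is how a continuous dosage in [0, 2] typically arises. Both function names are hypothetical, and counting every non-reference allele as one alt copy is an assumption for multi-allelic sites.

```python
import math


def dosage_from_gt(gt: str) -> float:
    """Count alt alleles in a VCF GT string such as "0/1" or "1|1"."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return math.nan  # any missing allele makes the call missing
    # Assumption: any non-reference allele index counts as one alt copy.
    return float(sum(a != "0" for a in alleles))


def dosage_from_probs(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected alt-allele count: 0 * P(hom-ref) + 1 * P(het) + 2 * P(hom-alt)."""
    return p_het + 2.0 * p_hom_alt
```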

Rust API

use bunbun::{GenotypeReader, PlinkReader};

fn main() {
    // Auto-detect the format from the path
    let data = bunbun::read("cohort.bed").unwrap();

    // Or use an explicit reader
    let reader = PlinkReader::new("cohort.bed");
    let data = reader.read_all().unwrap();

    println!("{} × {}", data.n_samples(), data.n_variants());
    println!("{}", data.samples.df);   // polars DataFrame
    let freqs = data.dosages.allele_frequencies();
}

One output type. GenotypeData is the same regardless of input format.
Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
Parallel by default. Genotype decoding uses rayon across variants.
Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
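
Detecting compression from magic bytes can be sketched in a few lines. The signatures below are the standard ones for gzip, zstd, and bzip2; BGZF is recognized as a gzip header with the extra-field flag set and a `BC` subfield at bytes 12 and 13. The function name `detect_compression` is hypothetical, and this is an illustration in Python, not bunbun's Rust implementation.

```python
def detect_compression(head: bytes) -> str:
    """Classify a file from its first bytes (pass at least 14 bytes)."""
    if head[:2] == b"\x1f\x8b":
        # BGZF is gzip with the FEXTRA flag set and a "BC" extra subfield.
        is_bgzf = len(head) >= 14 and bool(head[3] & 0x04) and head[12:14] == b"BC"
        return "bgzf" if is_bgzf else "gzip"
    if head[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if head[:3] == b"BZh":
        return "bzip2"
    return "none"
```

In practice the caller would read the first bytes of the file in binary mode and pass them in, for example `detect_compression(open(path, "rb").read(14))`.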

License

MIT
