🐰 bunbun

Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.

bunbun reads genotype data from any major format and gives you back the same three things every time:

| output | type | what |
|---|---|---|
| .samples | polars.DataFrame | sample metadata (ID, sex, family, phenotype, …) |
| .variants | polars.DataFrame | variant metadata (chrom, pos, ref, alt, …) |
| .dosages | numpy.ndarray | genotype matrix, (n_samples × n_variants), float32 |

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun
uv venv
# Activate the virtual environment (Linux/macOS)
source .venv/bin/activate
# On Windows (Git Bash), use this instead:
# source .venv/Scripts/activate
uv pip install maturin numpy pytest polars

# Install in editable mode from the local checkout
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
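For intuition, the first three operations can be reproduced on a plain NumPy array. This is a minimal sketch of the conventional definitions (NaN-aware column mean divided by 2, fraction of NaN per column, column-mean fill). It assumes the documented 0/1/2/NaN encoding and is not bunbun's actual implementation; the toy matrix and variable names are made up for illustration.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, np.nan],
        [1.0, 2.0, 2.0],
        [2.0, np.nan, 2.0],
        [1.0, 1.0, np.nan],
    ],
    dtype=np.float32,
)

missing = np.isnan(dosages)

# Missing rate per variant: fraction of samples without a call.
miss_rate = missing.mean(axis=0)

# Alternate-allele frequency per variant: mean dosage over observed
# samples, divided by 2 because each diploid sample carries two alleles.
col_mean = np.nanmean(dosages, axis=0)
alt_freq = col_mean / 2.0

# Mean imputation: replace each missing entry with its variant's mean.
# This version returns a new array instead of modifying in place.
imputed = np.where(missing, col_mean, dosages)
```

A variant with no observed calls at all would give a NaN mean here, so such columns are usually dropped before imputing.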

Works with your existing stack

import bunbun
import numpy as np
import polars as pl
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars (here: chromosome 1 only)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn (missing values are replaced with 0.0)
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))

Supported formats

| Format | Extensions | Compression | Dosage |
|---|---|---|---|
| PLINK 1.x | .bed .bim .fam | | hard calls |
| VCF | .vcf .vcf.gz | gzip, bgzf | GT or DS field |
| PLINK 2 | .pgen .pvar .psam | | hard calls |
| BGEN | .bgen .sample | zlib, zstd | probabilistic |
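The extensions above are enough to sketch how auto-detection could dispatch on a file name. The helper below is hypothetical (`guess_format` and `FORMAT_BY_SUFFIX` are illustrative names, not part of bunbun), and the real `bunbun.read` may use different or additional rules.

```python
from pathlib import Path

# Hypothetical lookup table built from the "Supported formats" table.
FORMAT_BY_SUFFIX = {
    ".bed": "plink1", ".bim": "plink1", ".fam": "plink1",
    ".vcf": "vcf",
    ".pgen": "plink2", ".pvar": "plink2", ".psam": "plink2",
    ".bgen": "bgen", ".sample": "bgen",
}


def guess_format(path: str) -> str:
    """Guess the genotype format from the file extension alone."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Look through a trailing ".gz" so "calls.vcf.gz" resolves to VCF.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in FORMAT_BY_SUFFIX:
        raise ValueError(f"unrecognized genotype file extension: {path}")
    return FORMAT_BY_SUFFIX[suffixes[-1]]


print(guess_format("cohort.bed"))
print(guess_format("calls.vcf.gz"))
```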

Output schema

data.samples — polars DataFrame

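| column | dtype | notes |
|---|---|---|
| sample_id | Utf8 | IID (PLINK) or sample column (VCF) |
| family_id | Utf8 | FID or "." |
| paternal_id | Utf8 | paternal ID or "0" |
| maternal_id | Utf8 | maternal ID or "0" |
| sex | Utf8 | "male" / "female" / "unknown" |
| phenotype | Float64 | phenotype value or NaN |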

data.variants — polars DataFrame

| column | dtype | notes |
|---|---|---|
| chrom | Utf8 | string for X/Y/MT compatibility |
| pos | UInt32 | 1-based bp position |
| id | Utf8 | rsID or chrom:pos:ref:alt |
| ref | Utf8 | reference allele |
| alt | Utf8 | alternate allele(s), comma-separated |
| cm | Float64 | genetic distance or NaN |
| qual | Float64 | quality score or NaN |
| filter | Utf8 | "PASS" / filter string |

data.dosages — numpy ndarray

  • Shape: (n_samples, n_variants)
  • Dtype: float32
  • Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
  • For imputed data (BGEN, VCF with DS): continuous values in [0, 2]

Rust API

use bunbun::{read, PlinkReader, GenotypeReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();

  • One output type. GenotypeData is the same regardless of input format.
  • Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
  • Parallel by default. Genotype decoding uses rayon across variants.
  • Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
  • Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
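Detecting compression from magic bytes, as the last bullet describes, can be sketched with the standard library. The function name is hypothetical and this is not bunbun's Rust code. The BGZF check assumes the `BC` subfield is the first gzip extra subfield, which is how BGZF writers normally lay out the header.

```python
import bz2
import gzip


def detect_compression(header: bytes) -> str:
    """Classify a stream by its leading bytes (pass at least 14 bytes)."""
    if header[:2] == b"\x1f\x8b":
        # BGZF is gzip with the FEXTRA flag set and a 'BC' extra subfield.
        has_extra = len(header) >= 14 and bool(header[3] & 0x04)
        if has_extra and header[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if header[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if header[:3] == b"BZh":
        return "bzip2"
    return "none"


# The standard 28-byte BGZF end-of-file block, used here as a test input.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)

print(detect_compression(gzip.compress(b"x")))
print(detect_compression(bz2.compress(b"x")))
print(detect_compression(BGZF_EOF))
```

Checking BGZF before plain gzip matters because every BGZF file is also a valid gzip file.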

License

MIT
