🐟 bunbun

Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.

bunbun reads genotype data from any major format and gives you back the same three things every time:

| output | type | what |
| --- | --- | --- |
| .samples | polars.DataFrame | sample metadata (ID, sex, family, phenotype, …) |
| .variants | polars.DataFrame | variant metadata (chrom, pos, ref, alt, …) |
| .dosages | numpy.ndarray | genotype matrix, shape (n_samples × n_variants), float32 |

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun
uv venv
# activate the environment (Linux/macOS)
source .venv/bin/activate
# or, on Windows (e.g. Git Bash)
source .venv/Scripts/activate
uv pip install maturin numpy pytest polars

# install from source
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")
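
Auto-detection from "any file from the fileset" could plausibly be driven by the file extension. The sketch below illustrates that idea only; it is not bunbun's actual logic. `guess_format`, `FORMAT_BY_EXTENSION`, and `COMPRESSION_SUFFIXES` are hypothetical names, and the extension list is taken from the formats this README mentions.

```python
from pathlib import Path

# Hypothetical mapping from file extension to format family,
# built from the extensions listed in this README.
FORMAT_BY_EXTENSION = {
    ".bed": "plink", ".bim": "plink", ".fam": "plink",
    ".vcf": "vcf",
    ".pgen": "plink2", ".pvar": "plink2", ".psam": "plink2",
    ".bgen": "bgen", ".sample": "bgen",
}

# Assumed set of compression suffixes to strip before matching.
COMPRESSION_SUFFIXES = {".gz", ".bgz", ".zst", ".bz2"}


def guess_format(path):
    """Return the format family implied by a file name, or raise ValueError."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop trailing compression suffixes such as ".gz" in "calls.vcf.gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes or suffixes[-1] not in FORMAT_BY_EXTENSION:
        raise ValueError(f"cannot infer genotype format from {path!r}")
    return FORMAT_BY_EXTENSION[suffixes[-1]]


print(guess_format("cohort.fam"))    # plink
print(guess_format("calls.vcf.gz"))  # vcf
```

When the extension is ambiguous or missing, calling the format-specific reader directly avoids any guessing.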

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
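
For readers who want to see what these summaries mean numerically, here is a plain NumPy sketch on a toy matrix. It assumes the alt-allele frequency is the mean non-missing dosage divided by 2 (diploid, biallelic) and that mean imputation uses the per-variant mean; bunbun's own implementation may differ in details.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, np.nan],
        [1.0, 2.0, 2.0],
        [2.0, np.nan, 2.0],
        [1.0, 1.0, np.nan],
    ],
    dtype=np.float32,
)

# Assumed definition: alt-allele frequency = mean non-missing dosage / 2.
freqs = np.nanmean(dosages, axis=0) / 2.0

# Fraction of samples with a missing call, per variant.
miss = np.isnan(dosages).mean(axis=0)

# Mean imputation: replace each NaN with its variant's mean dosage.
col_means = np.nanmean(dosages, axis=0)
imputed = np.where(np.isnan(dosages), col_means, dosages)

print(freqs)  # approximately [0.5, 0.667, 1.0]
print(miss)   # [0.0, 0.25, 0.5]
```

A variant with no observed calls has no defined mean, so `np.nanmean` warns and returns NaN for that column; such variants are usually filtered out first.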

Works with your existing stack

import bunbun
import numpy as np
import polars as pl
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars (here: chromosome 1 only)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn.
# np.nan_to_num replaces missing calls (NaN) with 0.0.
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))

Supported formats

| Format | Extensions | Compression | Dosage |
| --- | --- | --- | --- |
| PLINK 1.x | .bed .bim .fam | | hard calls |
| VCF | .vcf .vcf.gz | gzip, bgzf | GT or DS field |
| PLINK 2 | .pgen .pvar .psam | | hard calls |
| BGEN | .bgen .sample | zlib, zstd | probabilistic |
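
To make "hard calls" concrete: a PLINK 1 .bed file packs four genotypes per byte as 2-bit codes, lowest bits first. The sketch below decodes one variant's bytes into dosages. It is an illustration of the file format, not bunbun's decoder; `decode_variant` is a hypothetical name, and counting the A1 allele from the .bim file as the dosage allele is an assumption.

```python
import math

# PLINK 1 .bed 2-bit genotype codes:
#   00 = homozygous A1, 01 = missing, 10 = heterozygous, 11 = homozygous A2.
# Assumption: the dosage counts copies of A1.
CODE_TO_DOSAGE = {0b00: 2.0, 0b01: math.nan, 0b10: 1.0, 0b11: 0.0}


def decode_variant(block, n_samples):
    """Decode one variant's packed bytes into a list of per-sample dosages."""
    dosages = []
    for i in range(n_samples):
        byte = block[i // 4]                      # four samples per byte
        code = (byte >> (2 * (i % 4))) & 0b11     # low-order bits come first
        dosages.append(CODE_TO_DOSAGE[code])
    return dosages


# One byte holding codes 00, 01, 10, 11 for samples 0..3.
print(decode_variant(bytes([0b11100100]), 4))  # [2.0, nan, 1.0, 0.0]
```

In a real file this block follows the three magic bytes `0x6c 0x1b 0x01`, and each variant occupies `ceil(n_samples / 4)` bytes.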

Output schema

data.samples — polars DataFrame

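| column | dtype | notes |
| --- | --- | --- |
| sample_id | Utf8 | IID (PLINK) or sample column (VCF) |
| family_id | Utf8 | FID or "." |
| paternal_id | Utf8 | paternal ID or "0" |
| maternal_id | Utf8 | maternal ID or "0" |
| sex | Utf8 | "male" / "female" / "unknown" |
| phenotype | Float64 | phenotype value or NaN |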

data.variants — polars DataFrame

| column | dtype | notes |
| --- | --- | --- |
| chrom | Utf8 | string for X/Y/MT compatibility |
| pos | UInt32 | 1-based bp position |
| id | Utf8 | rsID or chrom:pos:ref:alt |
| ref | Utf8 | reference allele |
| alt | Utf8 | alternate allele(s), comma-separated |
| cm | Float64 | genetic distance or NaN |
| qual | Float64 | quality score or NaN |
| filter | Utf8 | "PASS" / filter string |

data.dosages — numpy ndarray

  • Shape: (n_samples, n_variants)
  • Dtype: float32
  • Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
  • For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
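
For probabilistic genotypes, a continuous dosage in [0, 2] is conventionally the expected alt-allele count under the three genotype probabilities. The sketch below shows that arithmetic for a diploid, biallelic call. `expected_dosage` is a hypothetical helper, and treating an all-zero probability triple as missing is an assumption, not documented bunbun behavior.

```python
import math


def expected_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected alt-allele count for one diploid, biallelic genotype."""
    total = p_hom_ref + p_het + p_hom_alt
    if total == 0.0:
        # Assumption: no probability mass means the call is missing.
        return math.nan
    # E[alt count] = 0 * P(hom-ref) + 1 * P(het) + 2 * P(hom-alt),
    # renormalised in case the probabilities do not sum exactly to 1.
    return (p_het + 2.0 * p_hom_alt) / total


print(expected_dosage(1.0, 0.0, 0.0))  # 0.0 (certain hom-ref)
print(expected_dosage(0.0, 1.0, 0.0))  # 1.0 (certain het)
print(expected_dosage(0.0, 0.0, 1.0))  # 2.0 (certain hom-alt)
```

Certain calls reproduce the hard-call encoding 0.0 / 1.0 / 2.0, which is why both kinds of data can share one float32 matrix.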

Rust API

use bunbun::{read, PlinkReader, GenotypeReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();

  • One output type. GenotypeData is the same regardless of input format.
  • Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
  • Parallel by default. Genotype decoding uses rayon across variants.
  • Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
  • Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
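
Detecting compression from magic bytes means inspecting the first few bytes of a file rather than trusting its extension. The Python sketch below shows the idea using the published signatures for gzip, BGZF, zstd, and bzip2. `sniff_compression` is a hypothetical name and this is not bunbun's Rust implementation.

```python
import bz2
import gzip


def sniff_compression(header):
    """Classify a file by its leading bytes (pass at least the first 14)."""
    if header[:3] == b"\x1f\x8b\x08":
        # BGZF is gzip with the FEXTRA flag set and a "BC" extra subfield.
        # This check assumes "BC" is the first subfield, as in standard BGZF blocks.
        if len(header) >= 14 and header[3] & 0x04 and header[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if header[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if header[:3] == b"BZh":
        return "bzip2"
    return "none"


print(sniff_compression(gzip.compress(b"ACGT")))  # gzip
print(sniff_compression(bz2.compress(b"ACGT")))   # bzip2
print(sniff_compression(b"##fileformat=VCFv4.2"))  # none
```

Checking BGZF before plain gzip matters because every BGZF block is also a valid gzip member.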

License

MIT
