Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats

🐰 bunbun

bunbun reads genotype data from any major format and gives you back the same three things every time:

output     type              what
.samples   polars.DataFrame  sample metadata (ID, sex, family, phenotype, …)
.variants  polars.DataFrame  variant metadata (chrom, pos, ref, alt, …)
.dosages   numpy.ndarray     genotype matrix, shape (n_samples, n_variants), float32

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun

uv venv
source .venv/bin/activate        # Linux/macOS
source .venv/Scripts/activate    # Windows (Git Bash)

uv pip install maturin numpy pytest polars
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
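The three numeric operations above can be reproduced on a plain NumPy array, which helps when checking results. This is a minimal sketch, not bunbun's implementation: it assumes diploid dosages in [0, 2] that count the alternate allele, so the allele frequency is the mean non-missing dosage divided by 2. The toy matrix is made up for illustration.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, np.nan],
        [1.0, 2.0, 2.0],
        [2.0, np.nan, 2.0],
        [1.0, 1.0, np.nan],
    ],
    dtype=np.float32,
)

# Missing rate per variant: fraction of NaN entries in each column.
missing = np.isnan(dosages).mean(axis=0)

# Alt-allele frequency per variant (assumed definition):
# mean of the non-missing dosages divided by 2.
freqs = np.nanmean(dosages, axis=0) / 2.0

# Mean imputation: replace each NaN with its column (variant) mean.
col_means = np.nanmean(dosages, axis=0)
imputed = np.where(np.isnan(dosages), col_means, dosages)
```

Unlike `data.mean_impute()`, which the README describes as in place, this sketch returns a new array and leaves `dosages` untouched.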

Works with your existing stack

import bunbun
import polars as pl
import numpy as np
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars (here: chromosome 1)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))
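The polars filter above returns variant metadata only. To pull the matching genotype columns, one option is a boolean mask. The sketch below assumes that row i of `data.variants` describes column i of `data.dosages`; the (n_samples × n_variants) shape implies this, but the README does not state it outright. The arrays are toy stand-ins, not bunbun output.

```python
import numpy as np

# Hypothetical stand-ins for data.variants["chrom"] and data.dosages.
chrom = np.array(["1", "1", "2", "X"])
dosages = np.array(
    [
        [0.0, 1.0, 2.0, np.nan],
        [1.0, np.nan, 0.0, 2.0],
    ],
    dtype=np.float32,
)

# A boolean mask over variants selects the dosage columns directly,
# provided variant rows and dosage columns share the same order.
mask = chrom == "1"
chr1_dosages = dosages[:, mask]
```

With a real dataset, the mask could be taken from the polars column (for example via `.to_numpy()`) before indexing the dosage matrix.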

Supported formats

Format     Extensions         Compression  Dosage
PLINK 1.x  .bed .bim .fam     -            hard calls
VCF        .vcf .vcf.gz       gzip, bgzf   GT or DS field
PLINK 2    .pgen .pvar .psam  -            hard calls
BGEN       .bgen .sample      zlib, zstd   probabilistic

Output schema

data.samples — polars DataFrame

column       dtype    notes
sample_id    Utf8     IID (PLINK) or sample column (VCF)
family_id    Utf8     FID or "."
paternal_id  Utf8     or "0"
maternal_id  Utf8     or "0"
sex          Utf8     "male" / "female" / "unknown"
phenotype    Float64  or NaN

data.variants — polars DataFrame

column  dtype    notes
chrom   Utf8     string for X/Y/MT compatibility
pos     UInt32   1-based bp position
id      Utf8     rsID or chrom:pos:ref:alt
ref     Utf8     reference allele
alt     Utf8     alternate allele(s), comma-separated
cm      Float64  genetic distance or NaN
qual    Float64  quality score or NaN
filter  Utf8     "PASS" / filter string

data.dosages — numpy ndarray

  • Shape: (n_samples, n_variants)
  • Dtype: float32
  • Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
  • For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
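To make the hard-call encoding concrete, here is a rough sketch of how one record of a PLINK 1 .bed file could be mapped onto these values. The 2-bit codes follow the published PLINK 1 format. The choice to count the A1 allele, and the idea that A1 corresponds to `alt` in bunbun's output, are assumptions; `decode_bed_variant` is a hypothetical helper, not part of bunbun.

```python
import math

# PLINK 1 .bed 2-bit genotype codes (low-order bits first within each byte).
# Assumption: the dosage counts the A1 allele; bunbun may count the other one.
CODE_TO_DOSAGE = {
    0b00: 2.0,       # homozygous A1
    0b01: math.nan,  # missing genotype
    0b10: 1.0,       # heterozygous
    0b11: 0.0,       # homozygous A2
}


def decode_bed_variant(block: bytes, n_samples: int) -> list[float]:
    """Decode one variant-major .bed record into per-sample dosages."""
    out = []
    for i in range(n_samples):
        byte = block[i // 4]                    # four samples per byte
        code = (byte >> (2 * (i % 4))) & 0b11   # sample i's two bits
        out.append(CODE_TO_DOSAGE[code])
    return out


# First byte 0b11100100 holds samples 0..3 with codes 00, 01, 10, 11.
# Second byte 0b00000011 holds sample 4 (code 11); the rest is padding.
row = decode_bed_variant(bytes([0b11100100, 0b00000011]), n_samples=5)
```

A real reader would also skip the three-byte file header and decode many variants in parallel; this sketch covers only the per-record bit unpacking.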

Rust API

use bunbun::{read, PlinkReader, GenotypeReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();

  • One output type. GenotypeData is the same regardless of input format.
  • Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
  • Parallel by default. Genotype decoding uses rayon across variants.
  • Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
  • Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
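Detecting compression from magic bytes, as the last point describes, can be sketched in a few lines of Python. This is an illustration of the general technique and not bunbun's Rust code; `sniff_compression` is a hypothetical name, and the BGZF check assumes the "BC" subfield is the first entry in the gzip extra field, as in standard BGZF blocks.

```python
import bz2
import gzip


def sniff_compression(header: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if header[:2] == b"\x1f\x8b":
        # BGZF is gzip with the FEXTRA flag set and an extra subfield "BC".
        has_extra = len(header) > 3 and header[3] & 0x04
        if has_extra and header[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if header[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if header[:3] == b"BZh":
        return "bzip2"
    return "none"


# Usage with headers produced by the standard library.
gzip_kind = sniff_compression(gzip.compress(b"ACGT")[:18])
bzip2_kind = sniff_compression(bz2.compress(b"ACGT")[:18])
plain_kind = sniff_compression(b"##fileformat=VCFv4.2")
```

Reading about 18 bytes is enough for all four checks, so the file can be sniffed before choosing a decoder.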

License

MIT
