🐰 bunbun

Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.

bunbun reads genotype data from any major format and gives you back the same three things every time:

output      type               what
.samples    polars.DataFrame   sample metadata (ID, sex, family, phenotype, …)
.variants   polars.DataFrame   variant metadata (chrom, pos, ref, alt, …)
.dosages    numpy.ndarray      genotype matrix of shape (n_samples, n_variants), float32

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun

uv venv
source .venv/bin/activate        # Linux/macOS, or
source .venv/Scripts/activate    # Windows (Git Bash)

uv pip install maturin numpy pytest polars
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
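
How allele_frequencies() and missing_rates() are computed internally is not spelled out here. Under the dosage encoding described in the output schema (0/1/2 with NaN for missing), a plausible NumPy equivalent is sketched below on a toy matrix. It assumes diploid samples and that the frequency reported is that of the counted (alt) allele.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, np.nan],
        [1.0, 2.0, 2.0],
        [2.0, np.nan, 2.0],
        [1.0, 1.0, np.nan],
    ],
    dtype=np.float32,
)

# Allele frequency per variant: mean dosage over non-missing samples,
# divided by 2 because each diploid sample carries two alleles (assumption).
freqs = np.nanmean(dosages, axis=0) / 2.0

# Missing rate per variant: fraction of samples whose call is NaN.
miss = np.isnan(dosages).mean(axis=0)

print(freqs)
print(miss)
```

Both results have length n_variants, matching the shape the comments above describe for the built-in methods.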

Works with your existing stack

import bunbun
import polars as pl
import numpy as np
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))
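
Note that np.nan_to_num fills missing calls with 0.0, which under this encoding is the same value as hom-ref. If that matters for the analysis, one alternative is to fill each missing entry with its variant's mean dosage before the PCA. The following is a NumPy-only sketch of that idea on a toy matrix. `mean_impute_copy` is a hypothetical helper, not part of bunbun, and unlike data.mean_impute() it returns a copy.

```python
import numpy as np


def mean_impute_copy(dosages: np.ndarray) -> np.ndarray:
    """Return a copy with each NaN replaced by its variant's (column) mean."""
    filled = dosages.copy()
    # Column means computed over non-missing samples only.
    col_means = np.nanmean(filled, axis=0)
    # Locate the missing entries and fill each with its column's mean.
    rows, cols = np.nonzero(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


# Toy matrix: 3 samples x 2 variants.
toy = np.array([[0.0, np.nan], [2.0, 1.0], [np.nan, 2.0]], dtype=np.float32)
imputed = mean_impute_copy(toy)
print(imputed)
```

A variant with no observed calls at all has no mean to fall back on and would stay NaN in this sketch, so such columns need to be dropped or handled separately.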

Supported formats

Format      Extensions          Compression   Dosage
PLINK 1.x   .bed .bim .fam      -             hard calls
VCF         .vcf .vcf.gz        gzip, bgzf    GT or DS field
PLINK 2     .pgen .pvar .psam   -             hard calls
BGEN        .bgen .sample       zlib, zstd    probabilistic
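
To illustrate what "GT field" means for the dosage column, here is a small sketch that turns a VCF GT string into a dosage by counting non-reference alleles. This is an assumption about the general approach, not bunbun's actual code; in particular, treating a half-missing call as fully missing and counting every non-zero allele index as "alt" are choices made for this example only. `gt_to_dosage` is a hypothetical name.

```python
import math
import re


def gt_to_dosage(gt: str) -> float:
    """Count non-reference alleles in a VCF GT string such as '0/1' or '1|1'."""
    # GT alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return math.nan  # assumption: any missing allele makes the call missing
    # Allele index 0 is the reference; anything else counts as one alt copy.
    return float(sum(1 for a in alleles if a != "0"))


for gt in ["0/0", "0|1", "1/1", "./."]:
    print(gt, gt_to_dosage(gt))
```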

Output schema

data.samples — polars DataFrame

column        dtype     notes
sample_id     Utf8      IID (PLINK) or sample column (VCF)
family_id     Utf8      FID or "."
paternal_id   Utf8      paternal ID or "0"
maternal_id   Utf8      maternal ID or "0"
sex           Utf8      "male" / "female" / "unknown"
phenotype     Float64   phenotype value or NaN

data.variants — polars DataFrame

column   dtype     notes
chrom    Utf8      string for X/Y/MT compatibility
pos      UInt32    1-based bp position
id       Utf8      rsID or chrom:pos:ref:alt
ref      Utf8      reference allele
alt      Utf8      alternate allele(s), comma-separated
cm       Float64   genetic distance or NaN
qual     Float64   quality score or NaN
filter   Utf8      "PASS" / filter string
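
The id column falls back to a chrom:pos:ref:alt string when no rsID is available. A minimal sketch of that fallback, assuming a missing ID arrives as None, an empty string, or "." (the VCF placeholder); `variant_id` is a hypothetical helper, not a bunbun function:

```python
from typing import Optional


def variant_id(chrom: str, pos: int, ref: str, alt: str,
               rsid: Optional[str] = None) -> str:
    """Return the given ID, or build chrom:pos:ref:alt when it is missing."""
    if rsid and rsid != ".":
        return rsid
    return f"{chrom}:{pos}:{ref}:{alt}"


print(variant_id("1", 12345, "A", "G", "rs123"))
print(variant_id("X", 500, "C", "T", "."))
```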

data.dosages — numpy ndarray

  • Shape: (n_samples, n_variants)
  • Dtype: float32
  • Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
  • For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
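
For probabilistic formats, a continuous dosage in [0, 2] is conventionally the expected number of alt alleles: P(het) + 2 × P(hom-alt). The sketch below shows that arithmetic for one biallelic variant. Whether bunbun computes it exactly this way is an assumption, as is the choice to treat an all-zero probability row as missing; `probs_to_dosage` is a hypothetical name.

```python
import numpy as np


def probs_to_dosage(probs: np.ndarray) -> np.ndarray:
    """Convert (n_samples, 3) genotype probabilities to expected alt dosage."""
    # Columns are P(hom-ref), P(het), P(hom-alt).
    dosage = (probs[:, 1] + 2.0 * probs[:, 2]).astype(np.float32)
    # Assumption for this sketch: a row of all zeros means "no call".
    dosage[probs.sum(axis=1) == 0] = np.nan
    return dosage


probs = np.array(
    [
        [1.0, 0.0, 0.0],    # confident hom-ref
        [0.1, 0.8, 0.1],    # mostly het
        [0.0, 0.25, 0.75],  # leaning hom-alt
        [0.0, 0.0, 0.0],    # missing
    ],
    dtype=np.float32,
)
dosage = probs_to_dosage(probs)
print(dosage)
```

Hard calls are the special case where one probability is 1.0, which reproduces the 0.0 / 1.0 / 2.0 encoding above.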

Rust API

use bunbun::{read, PlinkReader, GenotypeReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();

  • One output type. GenotypeData is the same regardless of input format.
  • Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
  • Parallel by default. Genotype decoding uses rayon across variants.
  • Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
  • Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
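
As an illustration of the magic-byte detection mentioned in the last point, here is a Python sketch (the library itself is written in Rust, so this is not its implementation). The signatures are the published ones for each format: gzip starts with 1f 8b, BGZF is a gzip member with the FEXTRA flag set and a "BC" extra subfield, zstd frames start with 28 b5 2f fd, and bzip2 starts with "BZh". `detect_compression` is a hypothetical name, and the BGZF check assumes the "BC" subfield comes first in the extra field, as it does in files written by standard tools.

```python
import bz2
import gzip


def detect_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if head[:2] == b"\x1f\x8b":
        # BGZF: gzip header with FEXTRA (flag bit 2) and a 'BC' subfield at offset 12.
        if len(head) >= 14 and head[3] & 0x04 and head[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if head[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if head[:3] == b"BZh":
        return "bzip2"
    return "none"


print(detect_compression(gzip.compress(b"##fileformat=VCFv4.2\n")))
print(detect_compression(bz2.compress(b"##fileformat=VCFv4.2\n")))
print(detect_compression(b"##fileformat=VCFv4.2\n"))
```

Checking BGZF before plain gzip matters because every BGZF file is also valid gzip; testing in the other order would never report "bgzf".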

License

MIT
