🐰 bunbun

Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats.

bunbun reads genotype data from any major format and gives you back the same three things every time:

| output | type | what |
|---|---|---|
| .samples | polars.DataFrame | sample metadata (ID, sex, family, phenotype, …) |
| .variants | polars.DataFrame | variant metadata (chrom, pos, ref, alt, …) |
| .dosages | numpy.ndarray | genotype matrix of shape (n_samples × n_variants), float32 |

Install

uv pip install bunbun

To build from source:

git clone https://github.com/nahid18/bunbun.git && cd bunbun

uv venv
# Activate the environment with ONE of the following:
source .venv/bin/activate       # Linux / macOS
source .venv/Scripts/activate   # Windows (Git Bash)

uv pip install maturin numpy pytest polars
uv pip install -e .

Quick start

import bunbun

# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")

data.samples    # polars DataFrame: sample_id, family_id, sex, …
data.variants   # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages    # numpy array: shape (500, 10000), dtype float32

data.shape      # (500, 10000)

Format-specific readers

data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")

Operations

# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies()   # numpy array, length n_variants

# Missing rate per variant
miss = data.missing_rates()

# Mean-impute missing values (in place)
data.mean_impute()

# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)

# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
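
To make the operations above concrete, here is a minimal NumPy sketch of what they plausibly compute on a dosage matrix. It does not call bunbun. It assumes that the allele frequency is the alt-allele frequency (mean dosage over called samples, divided by 2), that missing calls are NaN, and that the region bounds are inclusive; bunbun's actual conventions may differ. The toy arrays `dosages`, `chrom`, and `pos` are made up for illustration.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, 2.0],
        [1.0, np.nan, 2.0],
        [2.0, 1.0, np.nan],
        [1.0, 0.0, 2.0],
    ],
    dtype=np.float32,
)

# Alt-allele frequency per variant: mean dosage over called samples / 2.
freqs = np.nanmean(dosages, axis=0) / 2.0

# Missing rate per variant: fraction of samples without a call.
miss = np.isnan(dosages).mean(axis=0)

# Mean imputation: replace each NaN with that variant's mean dosage.
col_means = np.nanmean(dosages, axis=0)
imputed = np.where(np.isnan(dosages), col_means, dosages)

# Region subset: boolean mask over hypothetical variant metadata
# (assumed inclusive bounds on chromosome "1").
chrom = np.array(["1", "1", "2"])
pos = np.array([150_000, 600_000, 200_000])
in_region = (chrom == "1") & (pos >= 100_000) & (pos <= 500_000)
region_dosages = dosages[:, in_region]
```

All four results are computed column-wise, so they scale with the number of variants and parallelize naturally across them.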

Works with your existing stack

import bunbun
import numpy as np
import polars as pl
from sklearn.decomposition import PCA

data = bunbun.read("cohort.bed")

# Filter variant metadata with polars (here: chromosome 1 only)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")

# Feed dosages straight into sklearn (NaN is replaced with 0.0 first)
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))

Supported formats

| Format | Extensions | Compression | Dosage |
|---|---|---|---|
| PLINK 1.x | .bed .bim .fam | | hard calls |
| VCF | .vcf .vcf.gz | gzip, bgzf | GT or DS field |
| PLINK 2 | .pgen .pvar .psam | | hard calls |
| BGEN | .bgen .sample | zlib, zstd | probabilistic |
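
As an illustration of what "hard calls" means for PLINK 1.x, the sketch below decodes a variant-major `.bed` payload in pure Python. The 2-bit codes and magic bytes follow the public PLINK 1 format; the choice to count the `.bim` A1 allele is an assumption, and bunbun's own decoder (memory-mapped and parallel) may map alleles differently. `decode_bed` is a hypothetical helper, not part of bunbun.

```python
import math

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # variant-major PLINK 1 .bed header

# 2-bit genotype code -> count of the .bim A1 allele (assumed convention).
CODE_TO_DOSAGE = {0b00: 2.0, 0b01: math.nan, 0b10: 1.0, 0b11: 0.0}


def decode_bed(raw: bytes, n_samples: int) -> list[list[float]]:
    """Decode a variant-major .bed payload into one dosage list per variant."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # 4 samples per byte, padded
    body = raw[3:]
    variants = []
    for start in range(0, len(body), bytes_per_variant):
        block = body[start:start + bytes_per_variant]
        row = []
        for i in range(n_samples):
            # Samples are packed low-order bits first within each byte.
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            row.append(CODE_TO_DOSAGE[code])
        variants.append(row)
    return variants


# One variant, five samples: hom-A1, het, hom-A2, missing, hom-A2.
raw = BED_MAGIC + bytes([0b01111000, 0b00000011])
decoded = decode_bed(raw, n_samples=5)
```

Note that this returns variant-major rows, whereas `data.dosages` is documented as (n_samples, n_variants), so a real reader would transpose.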

Output schema

data.samples — polars DataFrame

| column | dtype | notes |
|---|---|---|
| sample_id | Utf8 | IID (PLINK) or sample column (VCF) |
| family_id | Utf8 | FID or "." |
| paternal_id | Utf8 | or "0" |
| maternal_id | Utf8 | or "0" |
| sex | Utf8 | "male" / "female" / "unknown" |
| phenotype | Float64 | or NaN |
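
For PLINK input, this schema maps closely onto the six columns of a `.fam` file. The sketch below shows one plausible mapping; it is not bunbun's parser. `parse_fam_line` is a hypothetical helper, and treating `-9` and `NA` as missing phenotypes is an assumption about how NaN values arise.

```python
import math

# PLINK .fam sex codes: 1 = male, 2 = female, anything else = unknown.
SEX_CODES = {"1": "male", "2": "female"}


def parse_fam_line(line: str) -> dict:
    """Map one whitespace-delimited .fam row onto the samples schema."""
    fid, iid, pat, mat, sex, pheno = line.split()
    return {
        "sample_id": iid,
        "family_id": fid,
        "paternal_id": pat,
        "maternal_id": mat,
        "sex": SEX_CODES.get(sex, "unknown"),
        # Assumed missing-value markers; other phenotypes parse as floats.
        "phenotype": math.nan if pheno in ("-9", "NA") else float(pheno),
    }


row = parse_fam_line("FAM1 SAMPLE_001 0 0 2 -9")
```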

data.variants — polars DataFrame

| column | dtype | notes |
|---|---|---|
| chrom | Utf8 | string for X/Y/MT compatibility |
| pos | UInt32 | 1-based bp position |
| id | Utf8 | rsID or chrom:pos:ref:alt |
| ref | Utf8 | reference allele |
| alt | Utf8 | alternate allele(s), comma-separated |
| cm | Float64 | genetic distance or NaN |
| qual | Float64 | quality score or NaN |
| filter | Utf8 | "PASS" / filter string |
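
The `id` column falls back to a `chrom:pos:ref:alt` key when no rsID is present. A minimal sketch of that rule follows, assuming a missing ID is an empty string or `"."` (the VCF convention); `variant_id` is a hypothetical helper, not a bunbun function.

```python
def variant_id(chrom: str, pos: int, ref: str, alt: str, raw_id: str = ".") -> str:
    """Return the given ID, or a chrom:pos:ref:alt key when it is missing."""
    if raw_id and raw_id != ".":
        return raw_id
    return f"{chrom}:{pos}:{ref}:{alt}"


with_rsid = variant_id("1", 123456, "A", "G", raw_id="rs12345")
without_rsid = variant_id("X", 123456, "A", "G,T")
```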

data.dosages — numpy ndarray

  • Shape: (n_samples, n_variants)
  • Dtype: float32
  • Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
  • For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
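
The encoding above can be sketched as two small conversions: counting alt alleles in a VCF `GT` string, and taking the expected alt-allele count from biallelic genotype probabilities (as stored in BGEN). Both helpers are hypothetical illustrations, not bunbun's code. Counting every non-reference allele as "alt" and treating a partially missing call as missing are assumptions.

```python
import math


def gt_to_dosage(gt: str) -> float:
    """Count alt alleles in a VCF GT string such as "0/1" or "1|1"."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return math.nan  # any missing allele makes the whole call missing
    return float(sum(1 for a in alleles if a != "0"))


def probs_to_dosage(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected alt-allele count; p_hom_ref contributes zero alt alleles."""
    return p_het + 2.0 * p_hom_alt


hard_calls = [gt_to_dosage(gt) for gt in ["0/0", "0|1", "1/1", "./."]]
imputed_dosage = probs_to_dosage(0.1, 0.3, 0.6)
```

The hard calls land on 0.0, 1.0, 2.0, or NaN, while probabilities yield a continuous value in [0, 2].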

Rust API

use bunbun::{read, PlinkReader, GenotypeReader};

// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();

// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();

println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df);   // polars DataFrame
let freqs = data.dosages.allele_frequencies();

  • One output type. GenotypeData is the same regardless of input format.
  • Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
  • Parallel by default. Genotype decoding uses rayon across variants.
  • Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
  • Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
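
To illustrate the last point, the sketch below detects compression from leading magic bytes in Python. It is not bunbun's implementation (which is in Rust); `detect_compression` is a hypothetical helper. The signatures themselves are the standard ones: gzip starts with `1f 8b`, BGZF is a gzip member carrying a `BC` extra subfield, zstd frames start with `28 b5 2f fd`, and bzip2 starts with `BZh`.

```python
import bz2
import gzip


def detect_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if head[:2] == b"\x1f\x8b":
        # BGZF sets the FEXTRA flag and tags its extra subfield "BC"
        # at byte offset 12 of the gzip header.
        has_extra = len(head) > 3 and bool(head[3] & 0x04)
        if has_extra and head[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    if head[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if head[:3] == b"BZh":
        return "bzip2"
    return "none"


# Hand-built BGZF header: magic, CM=8, FLG=4, MTIME, XFL, OS, XLEN=6, "BC".
bgzf_head = (
    b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
    + b"\x06\x00" + b"BC" + b"\x02\x00" + b"\x1b\x00"
)

kinds = [
    detect_compression(gzip.compress(b"ACGT")),
    detect_compression(bgzf_head),
    detect_compression(b"\x28\xb5\x2f\xfd" + b"\x00" * 4),
    detect_compression(bz2.compress(b"ACGT")),
    detect_compression(b"##fileformat=VCFv4.2"),
]
```

Checking BGZF before plain gzip matters, because every BGZF block is also a valid gzip member.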

License

MIT
