Fast, unified genotype parser for PLINK, VCF, PLINK2, and BGEN formats
🐰 bunbun
bunbun reads genotype data from any major format and gives you back the same three things every time:
| output | type | what |
|---|---|---|
| .samples | polars.DataFrame | sample metadata (ID, sex, family, phenotype, …) |
| .variants | polars.DataFrame | variant metadata (chrom, pos, ref, alt, …) |
| .dosages | numpy.ndarray | genotype matrix, (n_samples × n_variants), float32 |
Install
uv pip install bunbun
To build from source instead:
git clone https://github.com/nahid18/bunbun.git && cd bunbun
uv venv
source .venv/bin/activate      # Linux / macOS
source .venv/Scripts/activate  # OR, on Windows (Git Bash)
uv pip install maturin numpy pytest polars
uv pip install -e .
Quick start
import bunbun
# Auto-detect: pass any file from the fileset
data = bunbun.read("cohort.bed")
data.samples # polars DataFrame: sample_id, family_id, sex, …
data.variants # polars DataFrame: chrom, pos, id, ref, alt, …
data.dosages # numpy array: shape (500, 10000), dtype float32
data.shape # (500, 10000)
Format-specific readers
data = bunbun.read_plink("cohort.bed")
data = bunbun.read_vcf("calls.vcf.gz")
data = bunbun.read_plink2("imputed.pgen")
data = bunbun.read_bgen("ukb.bgen", sample_path="ukb.sample")
Operations
# Allele frequencies (parallel, NaN-aware)
freqs = data.allele_frequencies() # numpy array, length n_variants
# Missing rate per variant
miss = data.missing_rates()
# Mean-impute missing values (in place)
data.mean_impute()
# Subset to a genomic region
chr1_data = data.region("1", 100_000, 500_000)
# Subset to specific samples
sub = data.subset_samples(["SAMPLE_001", "SAMPLE_042"])
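For intuition, the per-variant statistics above can be reproduced on a plain NumPy array. This is a minimal sketch, not bunbun's implementation: it assumes the allele frequency is the alternate-allele frequency (mean non-missing dosage divided by 2), and the toy matrix is made up for illustration.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 variants, NaN marks a missing call.
dosages = np.array(
    [
        [0.0, 1.0, np.nan],
        [1.0, 2.0, 2.0],
        [2.0, np.nan, 2.0],
        [1.0, 1.0, np.nan],
    ],
    dtype=np.float32,
)

# Assumed definition: alt-allele frequency = mean non-missing dosage / 2.
freqs = np.nanmean(dosages, axis=0) / 2.0

# Fraction of samples with a missing call, per variant.
miss = np.isnan(dosages).mean(axis=0)

print(freqs)  # one value per variant
print(miss)   # one value per variant
```

Both results have length n_variants, matching the shape described for data.allele_frequencies() and data.missing_rates().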
Works with your existing stack
import bunbun
import numpy as np
import polars as pl
from sklearn.decomposition import PCA
data = bunbun.read("cohort.bed")
# Filter variant metadata with polars (here: chromosome 1 only)
chr1_variants = data.variants.filter(pl.col("chrom") == "1")
# Feed dosages straight into sklearn
pca = PCA(n_components=10)
components = pca.fit_transform(np.nan_to_num(data.dosages))
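Note that np.nan_to_num replaces missing calls with 0.0, which the dosage encoding reads as hom-ref. If that bias matters, one alternative is to fill each missing entry with its variant's mean dosage before PCA. The following is a NumPy-only sketch of per-variant mean imputation on a made-up matrix; whether data.mean_impute() applies exactly this rule is an assumption.

```python
import numpy as np

# Toy dosage matrix: 3 samples x 2 variants, NaN marks a missing call.
dosages = np.array(
    [[0.0, np.nan], [2.0, 1.0], [np.nan, 2.0]],
    dtype=np.float32,
)

# Per-variant (column) mean over the non-missing samples.
col_means = np.nanmean(dosages, axis=0)

# np.where returns a new array, so `dosages` itself is left unchanged.
# col_means broadcasts across rows, filling each NaN with its column mean.
imputed = np.where(np.isnan(dosages), col_means, dosages)

print(imputed)
```

A variant that is missing in every sample has no mean to fall back on and stays NaN under this scheme, so such columns would need to be dropped first.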
Supported formats
| Format | Extensions | Compression | Dosage |
|---|---|---|---|
| PLINK 1.x | .bed .bim .fam | none | hard calls |
| VCF | .vcf .vcf.gz | gzip, bgzf | GT or DS field |
| PLINK 2 | .pgen .pvar .psam | none | hard calls |
| BGEN | .bgen .sample | zlib, zstd | probabilistic |
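To illustrate how auto-detection could map a path onto one of these formats, here is a small extension-based sketch. The guess_format function and its lookup table are hypothetical, built only from the table above; bunbun's actual detection logic may work differently.

```python
from pathlib import Path

# Hypothetical extension table derived from the format list above.
_EXT_TO_FORMAT = {
    ".bed": "plink", ".bim": "plink", ".fam": "plink",
    ".vcf": "vcf",
    ".pgen": "plink2", ".pvar": "plink2", ".psam": "plink2",
    ".bgen": "bgen", ".sample": "bgen",
}


def guess_format(path: str) -> str:
    """Return a format name based on the file extension alone."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop a trailing compression suffix so "calls.vcf.gz" resolves via ".vcf".
    if suffixes and suffixes[-1] in {".gz", ".bgz"}:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in _EXT_TO_FORMAT:
        raise ValueError(f"unrecognized genotype file: {path}")
    return _EXT_TO_FORMAT[suffixes[-1]]


print(guess_format("cohort.bed"))
print(guess_format("calls.vcf.gz"))
```

Extensions alone can be ambiguous (.bed is also used for UCSC interval files), so a robust detector would likely confirm the guess against the file's leading bytes.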
Output schema
data.samples — polars DataFrame
| column | dtype | notes |
|---|---|---|
| sample_id | Utf8 | IID (PLINK) or sample column (VCF) |
| family_id | Utf8 | FID or "." |
| paternal_id | Utf8 | or "0" |
| maternal_id | Utf8 | or "0" |
| sex | Utf8 | "male" / "female" / "unknown" |
| phenotype | Float64 | or NaN |
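As an illustration of how a PLINK .fam row could map onto this schema, here is a minimal sketch. The parse_fam_line helper is hypothetical. The sex codes (1 = male, 2 = female, anything else unknown) follow the PLINK .fam convention; treating -9 and non-numeric phenotypes as NaN is an assumption about how missing values are handled.

```python
import math

# PLINK .fam sex codes; any other value is reported as "unknown".
_SEX = {"1": "male", "2": "female"}


def parse_fam_line(line: str) -> dict:
    """Map one whitespace-separated .fam row onto the samples schema."""
    fid, iid, pat, mat, sex, pheno = line.split()
    try:
        value = float(pheno)
    except ValueError:
        value = math.nan  # non-numeric phenotype, e.g. "NA"
    # Assumption: the conventional -9 sentinel means "missing".
    if value == -9.0:
        value = math.nan
    return {
        "sample_id": iid,
        "family_id": fid,
        "paternal_id": pat,
        "maternal_id": mat,
        "sex": _SEX.get(sex, "unknown"),
        "phenotype": value,
    }


print(parse_fam_line("FAM1 S1 0 0 2 -9"))
```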
data.variants — polars DataFrame
| column | dtype | notes |
|---|---|---|
| chrom | Utf8 | string for X/Y/MT compatibility |
| pos | UInt32 | 1-based bp position |
| id | Utf8 | rsID or chrom:pos:ref:alt |
| ref | Utf8 | reference allele |
| alt | Utf8 | alternate allele(s), comma-separated |
| cm | Float64 | genetic distance or NaN |
| qual | Float64 | quality score or NaN |
| filter | Utf8 | "PASS" / filter string |
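The id column's fallback can be sketched in a few lines. The variant_id helper below is hypothetical, and it assumes the fallback is triggered by an empty ID or the VCF missing-value marker "."; the source only states that the column holds an rsID or chrom:pos:ref:alt.

```python
def variant_id(chrom: str, pos: int, ref: str, alt: str, raw_id: str = ".") -> str:
    """Return the given ID, or build chrom:pos:ref:alt when none is present."""
    # Assumption: "." (the VCF missing marker) or an empty string means "no ID".
    if raw_id and raw_id != ".":
        return raw_id
    return f"{chrom}:{pos}:{ref}:{alt}"


print(variant_id("1", 12345, "A", "G"))
print(variant_id("X", 5, "C", "T", raw_id="rs123"))
```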
data.dosages — numpy ndarray
- Shape: (n_samples, n_variants)
- Dtype: float32
- Encoding: 0.0 = hom-ref, 1.0 = het, 2.0 = hom-alt, NaN = missing
- For imputed data (BGEN, VCF with DS): continuous values in [0, 2]
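The continuous values for imputed data can be understood as an expected alternate-allele count. The sketch below shows the standard calculation from three biallelic genotype probabilities; the expected_dosage function is illustrative, and whether bunbun derives BGEN dosages in exactly this way is an assumption.

```python
def expected_dosage(p_hom_ref: float, p_het: float, p_hom_alt: float) -> float:
    """Expected number of alt alleles: 0*P(hom-ref) + 1*P(het) + 2*P(hom-alt)."""
    # p_hom_ref contributes zero alt alleles, so it drops out of the sum.
    return p_het + 2.0 * p_hom_alt


# Certain genotypes reproduce the hard-call encoding.
print(expected_dosage(1.0, 0.0, 0.0))
print(expected_dosage(0.0, 1.0, 0.0))
print(expected_dosage(0.0, 0.0, 1.0))

# An uncertain call lands between the hard-call values.
print(expected_dosage(0.1, 0.3, 0.6))
```

When the three probabilities sum to 1, the result always lies in [0, 2], which is consistent with the range stated above.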
Rust API
use bunbun::{read, PlinkReader, GenotypeReader};
// Auto-detect
let data = bunbun::read("cohort.bed").unwrap();
// Or explicit
let reader = PlinkReader::new("cohort.bed");
let data = reader.read_all().unwrap();
println!("{} × {}", data.n_samples(), data.n_variants());
println!("{}", data.samples.df); // polars DataFrame
let freqs = data.dosages.allele_frequencies();
- One output type. GenotypeData is the same regardless of input format.
- Zero-copy where possible. PLINK .bed is memory-mapped; numpy gets the ndarray directly.
- Parallel by default. Genotype decoding uses rayon across variants.
- Small-string optimization. The Allele type stores ≤7-byte alleles inline (covers >99% of SNPs).
- Transparent compression. gzip, bgzf, zstd, bzip2 detected from magic bytes.
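The magic-byte detection mentioned in the last point can be sketched as follows. This is an illustrative Python version, not the library's Rust code: sniff_compression is a hypothetical name, and the BGZF check assumes the "BC" extra subfield comes first in the gzip header, as it does in files written by the usual tools.

```python
import bz2
import gzip


def sniff_compression(head: bytes) -> str:
    """Classify a file by its leading bytes."""
    if head[:4] == b"\x28\xb5\x2f\xfd":
        return "zstd"
    if head[:3] == b"BZh":
        return "bzip2"
    if head[:2] == b"\x1f\x8b":
        # BGZF is gzip with the FEXTRA flag (0x04) set and a "BC" subfield.
        if len(head) >= 14 and head[3] & 0x04 and head[12:14] == b"BC":
            return "bgzf"
        return "gzip"
    return "none"


# The 28-byte empty BGZF block that terminates every BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)

print(sniff_compression(gzip.compress(b"data")))
print(sniff_compression(bz2.compress(b"data")))
print(sniff_compression(BGZF_EOF))
print(sniff_compression(b"##fileformat=VCFv4.2"))
```

BGZF must be tested inside the gzip branch because every BGZF block is also a valid gzip member; checking the plain gzip magic alone would misclassify it.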
License
MIT