Pure-Rust BAM/CRAM depth, pileup and stats — fast, multi-threaded, Windows-native. Originally forked from rustbam.
Project description
🦀 rubam — fast, native, Windows-ready BAM/VCF toolkit
rubam is a pure-Rust BAM/VCF analysis library with first-class Python bindings. It provides per-base depth, pileup, flag statistics, read counting, VCF/BCF read+write and indexed query — multi-threaded, with bit-exact parity against samtools and pysam, and native binaries for Linux, macOS and Windows (no WSL, no MSYS2, no htslib system install). The core AlignmentFile surface (fetch / count / count_coverage / pileup / header access) is a drop-in for pysam on BAM, validated base-for-base against pysam on real hg38 data. CRAM is experimental: AlignmentFile(path, reference_filename=...) opens any CRAM and reads its header; record decode is panic-guarded and raises a Python error on codecs noodles-cram does not yet support, so it never crashes across the FFI boundary.
Originally forked from
rustbam(Choi et al.).rubamis now an independent project: pure-Rust backend (noodles), expanded API, full cross-platform CI, and a peer-reviewed validation campaign.
Why rubam
| Capability | pysam |
samtools CLI |
mosdepth |
rubam |
|---|---|---|---|---|
| Native Windows wheel | ❌ | ❌ | ❌ | ✅ |
| Multi-threaded depth | ❌ (GIL) | ⚠ partial | ❌ | ✅ |
| Python API | ✅ | ❌ | ❌ | ✅ |
| CRAM support | ✅ | ✅ | ✅ | ⚠ (skeleton v0.3.1; full decode v0.4) |
| Pure-Rust (no C dep) | ❌ | n/a | ❌ | ✅ |
pip install "just works" on Windows |
❌ | n/a | ❌ | ✅ |
Speed
Synthetic 10 Mb chr20, 30× coverage, 3 reps best-of
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 4.14 s | 1.51 s | 6.0× |
| samtools depth | 8.34 s | 5.79 s | 1.6× |
| pysam | 8.95 s | 9.11 s (GIL) | 1.0× |
| mosdepth | 15.32 s | 13.88 s | 0.66× |
Real WGS — HG002 GIAB 2x250bp full chr20 (64.4 Mb, ~30× coverage)
Scaling sweep at threads {1, 2, 4, 8, 16}, 3 reps best-of (lower = better):
| Tool | 1t | 2t | 4t | 8t | 16t |
|---|---|---|---|---|---|
| rubam | 60.4 s | 35.3 s | 21.8 s | 17.1 s | 17.1 s |
| samtools depth | 89.7 s | 43.5 s | 44.1 s | 43.8 s | 45.2 s |
| pysam | 109.7 s | 110.7 s | 111.5 s | 109.4 s | 111.1 s (GIL) |
| mosdepth | 36.7 s | 36.8 s | 36.7 s | 36.3 s | 37.1 s |
rubam scales 3.5× from 1 → 8 threads, then saturates at 8t (I/O-bound). samtools scales only 1 → 2 threads. pysam and mosdepth are flat. At 8 threads, rubam beats every competitor: 6.4× pysam, 2.6× samtools, 2.1× mosdepth.
PacBio HiFi long reads — HG002 chr20 1-10 Mb
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 5.3 s | 1.9 s | 5.6× |
| samtools depth | 9.7 s | 5.8 s | 1.8× |
| pysam | 10.6 s | 10.5 s | 1.0× |
| mosdepth | 19.0 s | 19.0 s | 0.55× |
→ rubam handles long-read CIGAR (rich D/I/=) without slowdown.
RNA-seq spliced reads — synthetic chr20 1-10 Mb, 5% reads with aM bN cM CIGAR (intron skip)
| Tool | 1 thread | 8 threads | vs pysam @ 8t |
|---|---|---|---|
| rubam | 4.6 s | 2.3 s | 4.8× |
| samtools depth | 8.3 s | 5.8 s | 1.9× |
| pysam | 11.0 s | 10.8 s | 1.0× |
→ rubam correctly skips reference-skip ops (N) without crashing; throughput is unchanged vs unspliced data. mosdepth not run on spliced data.
All numbers are best-of-3 wall-clock on the datasets named in each table heading.
Rust API (for downstream crates)
rubam is also a publishable Cargo crate. Add it to your Cargo.toml:
[dependencies]
rubam = "0.3.12"
…and use the pure-Rust types directly (no Python, no pyo3):
use rubam::api::{AlignmentFile, Aux};
fn count_reverse_reads(bam_path: &str) -> rubam::api::Result<usize> {
let mut bam = AlignmentFile::open(bam_path)?;
let mut n = 0;
for r in bam.records() {
if r?.is_reverse() {
n += 1;
}
}
Ok(n)
}
fn extract_split_reads(bam_path: &str) -> rubam::api::Result<Vec<String>> {
let mut bam = AlignmentFile::open(bam_path)?;
let mut sa_tags = Vec::new();
for r in bam.records() {
let r = r?;
if let Ok(Aux::String(s)) = r.aux(b"SA") {
sa_tags.push(s.to_owned());
}
}
Ok(sa_tags)
}
API surface (v0.2.1, stable):
| Type | Methods |
|---|---|
AlignmentFile |
open(path), header(), records() |
Header |
target_count, tid2name(tid), target_len(tid), target_names() |
AlignedSegment |
qname, tid, pos, mapq, seq, qual (raw phred), seq_len, 12 flag accessors, cigar(), aux(tag) |
Cigar |
enum with Match/Ins/Del/RefSkip/Equal/Diff/SoftClip/HardClip/Pad, each (u32) |
Aux<'a> |
enum with 18 variants (Char, I8/U8/.../U32, Float/Double, String, HexByteArray, 8 Array*) |
Drop-in replacement for rust_htslib::bam::Reader::from_path for codebases that iterate linearly. Indexed query (fetch) lands in v0.3.x. The pyo3 wrapper classes (rubam.AlignmentFile etc.) coexist with api::* and share the same noodles backend; v0.2.2 will refactor them to delegate to api::* directly.
Correctness
rubam is bit-exact against samtools depth -a over 5 × 10⁶ positions across five datasets, including whole-chromosome chr1:
| Dataset | Positions | rubam vs samtools |
|---|---|---|
| Synthetic chr20 30× WGS | 1 000 000 | 0 mismatches ✅ |
Synthetic chr20 spliced (5 % CIGAR N) |
1 000 000 | 0 mismatches ✅ |
| HG002 GIAB 2×250bp chr20 | 1 000 000 | 0 mismatches ✅ |
| HG002 PacBio HiFi chr20 | 1 000 000 | 0 mismatches ✅ |
| HG002 GIAB 2×250bp whole chr1 (249 Mb) | 1 000 000 | 0 mismatches ✅ |
| Total | 5 000 000 | 0 / 5 M ✅ |
VCF-side correctness vs pysam.VariantFile: 319 349 / 319 349 = 100.00 % on the GIAB HG002 truth chr1 (319 k records, 13 MB BGZF).
Cross-tool correctness vs system bcftools: 100 % on view, query, sort.
Install
pip install rubam
Pre-built wheels are published for Linux, macOS and Windows; a single
abi3 wheel per OS covers CPython 3.8 → 3.13. No htslib, no compiler,
no WSL required — pip install rubam just works on Windows.
The NumPy return path (get_depths_numpy) needs NumPy at runtime:
pip install rubam[numpy]
Quick start
import rubam
positions, depths = rubam.get_depths(
"sample.bam", "chr1", 1_000_000, 1_001_000,
step=1, min_mapq=20, min_bq=20,
max_depth=8000, num_threads=12,
)
CLI:
rubam depth sample.bam chr1 1000000 1001000 -n 12 -Q 20 -q 20 > depth.tsv
Features
Stable (v0.1)
get_depths(bam, chr, start, end, ...)— per-base coverage over a 1-based, inclusive region.- CLI
rubam depth ….
Shipped since v0.1.x
count_reads(bam, chr, start, end, ...)—pysam.AlignmentFile.countreplacement.flag_stats(bam)—samtools flagstatreplacement, returning a Python dict.pileup_bases(bam, chr, start, end, ...)— A/C/G/T counts per position.get_depths_regions(bam, regions)— batch BED-style regions with shared thread pool.get_depths_numpy(...)— zero-copynp.uint64/np.uint32return path (~4.5× lower peak RSS than the list path; needspip install rubam[numpy]).
Roadmap
- ⚠ CRAM full record decode (v0.4):
rubam.AlignmentFile("sample.cram", reference_filename="ref.fa")already opens and reads the header; record decode is panic-guarded and raises a Python error on codecsnoodles-cramdoes not yet support (e.g. Huffman byte-series on NYGC-style CRAMs). Tracking the upstream codec landing. to_pandas()zero-copy helper; Parquet output.rubam.compat.pysamdrop-in shim (v0.5).
What's new in 0.2
rubam.AlignmentFileandrubam.AlignedSegment— drop-in pysam-style read iteration and per-read property access (flags, cigar, sequence, qualities, tags, reference helpers).AlignmentFile.fetch(chr, start, end)— indexed region iterator.AlignmentFile.pileup(chr, start, end)— buffered per-position iterator yieldingPileupColumnobjects with(reference_pos, depth, A/C/G/T/N).rubam.tools.{sort, index, view, merge, flagstat, idxstats, calmd, faidx}— pure-Rust ports of the eight most-used samtools subcommands.rubam-samtoolsshadow CLI binary —alias samtools='rubam samtools'and your shell pipelines keep working, on Windows included.
What's new in 0.2.1
rubam::api::{AlignmentFile, AlignedSegment, Header, Cigar, Aux, Error}— pure-Rust public crate API. External Rust crates drop inrubam = "0.3.12"and import these types directly without pulling in pyo3 — a drop-in forrust-htslib::bam::Readerfor codebases that iterate linearly. The public surface is pinned bytests/api_smoke.rsandtests/integration_test.rs.
What's new in 0.3
rubam.VariantFileandrubam.VariantRecord— pysam-style VCF / BCF / Tabix support. Read, write (modes"w"/"wz"/"wb"for plain / BGZF / BCF), iterate, indexedfetch(contig, start, end), multi-sample genotype access viarecord.samples["NA12878"]["GT"].rubam.VariantHeader— read-only metadata: samples, contigs (with lengths), INFO / FORMAT meta lines (id / number / type / description), FILTER ids, file format version.rubam.VariantRecord(header=, …)constructor — build records from scratch. Plusset_position,set_quality,set_filter,add_filter,clear_filters,set_infomutation APIs.rubam.tools.bcftools.{view, norm, concat, query, index, sort, stats}— pure-Rust ports of seven most-used bcftools subcommands.rubam-bcftoolsshadow CLI —alias bcftools='rubam bcftools'works on Windows. Same shape asrubam-samtools.- Cross-tool correctness:
(chrom, pos, ref, alt, ids, qual, filters)extracted via bothrubam.VariantFileandpysam.VariantFileagree on 0 / 100 records mismatch on a 3-sample synthetic VCF.
What's new in 0.3.12
pysam parity on real-world hg38 BAMs, verified base-for-base against
pysam 0.24.0 (tests/test_pysam_parity_findings.py):
- Tolerant header parsing — opens real hg38 / GATK / Picard BAMs that
a strict SAM-header parser rejects (
@HDwith noVN, multi-part versions likeVN:1.6.0, duplicate@PG/@RG/@SQIDs from re-run pipelines). Valid headers still take the strict fast path unchanged. countmatches pysam defaults —read_callback='nofilter'by default (counts every read in the region, including secondary / supplementary / duplicate / QC-fail);read_callback='all'applies the0x704mask.count_coveragematches pysam defaults —quality_threshold=15(base counted iffqual >= threshold), no depth cap, and aread_callbackargument.
The compatibility layer rubam.compat.pysam (drop-in from rubam.compat import pysam)
lands in v0.5; v0.2 + v0.3 are the foundation it sits on top of.
Validation & benchmarks
rubam is validated against pysam, samtools depth, samtools mpileup, mosdepth, bedtools genomecov and the original rustbam on real WGS, RNA-seq, exome and PacBio HiFi datasets (HG002, NA12878, public ENA RNA-seq), with multi-threaded scaling and cross-platform parity. The numbers in the tables above are drawn from that campaign.
License
MIT — see LICENSE.
Citation
If you use rubam in academic work, please cite the bioRxiv preprint (link will be added once posted).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rubam-0.3.12.tar.gz.
File metadata
- Download URL: rubam-0.3.12.tar.gz
- Upload date:
- Size: 244.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
529adbbfc24a7e0c90b572fe2417960195ff034b36e4f12fd3334eb1d6c5204d
|
|
| MD5 |
38b2206f58e0602af62e4390296d6ab8
|
|
| BLAKE2b-256 |
c2c2300468b933c39c3ac497936a6e00b31b412d548085d47a7461f308579b13
|
Provenance
The following attestation bundles were made for rubam-0.3.12.tar.gz:
Publisher:
release.yml on victormar1/rubam
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubam-0.3.12.tar.gz -
Subject digest:
529adbbfc24a7e0c90b572fe2417960195ff034b36e4f12fd3334eb1d6c5204d - Sigstore transparency entry: 1731441206
- Sigstore integration time:
-
Permalink:
victormar1/rubam@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/victormar1
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file rubam-0.3.12-cp38-abi3-win_amd64.whl.
File metadata
- Download URL: rubam-0.3.12-cp38-abi3-win_amd64.whl
- Upload date:
- Size: 1.4 MB
- Tags: CPython 3.8+, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8db027f4b088ebc2669938817699438d50a218fde7cd1d2cdfcc34c660385ab4
|
|
| MD5 |
8f6103fbba45cb0274cf75c91906d295
|
|
| BLAKE2b-256 |
51c7eafe43e2206cbb7f77797efa19032bc2e9d726d28501b0a81f48e0fbc251
|
Provenance
The following attestation bundles were made for rubam-0.3.12-cp38-abi3-win_amd64.whl:
Publisher:
release.yml on victormar1/rubam
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubam-0.3.12-cp38-abi3-win_amd64.whl -
Subject digest:
8db027f4b088ebc2669938817699438d50a218fde7cd1d2cdfcc34c660385ab4 - Sigstore transparency entry: 1731441278
- Sigstore integration time:
-
Permalink:
victormar1/rubam@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/victormar1
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file rubam-0.3.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: rubam-0.3.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.8+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d95c62f6ed9d4a030e71c06af80c9c7787b0a21f1d5b2b3957a6f975ee4ba2e5
|
|
| MD5 |
e07470191bb51b38ba2c068dab3a0c43
|
|
| BLAKE2b-256 |
d52e81e74482124775fdc1f17c7c23a8814fb780cce94d8e6d8d5d2ecc1e7dcf
|
Provenance
The following attestation bundles were made for rubam-0.3.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:
Publisher:
release.yml on victormar1/rubam
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubam-0.3.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -
Subject digest:
d95c62f6ed9d4a030e71c06af80c9c7787b0a21f1d5b2b3957a6f975ee4ba2e5 - Sigstore transparency entry: 1731441256
- Sigstore integration time:
-
Permalink:
victormar1/rubam@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/victormar1
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file rubam-0.3.12-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.
File metadata
- Download URL: rubam-0.3.12-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
- Upload date:
- Size: 3.0 MB
- Tags: CPython 3.8+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
231135305f866027f258117a6bf3f6a1634dc38fe83d697d74ab9922e33783ac
|
|
| MD5 |
caaa4f61bb0228ac63cb51980118e563
|
|
| BLAKE2b-256 |
4e3584835dac60029e6d37bce170580cdc94a3dcefd6dd97295b7d530d5fb811
|
Provenance
The following attestation bundles were made for rubam-0.3.12-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl:
Publisher:
release.yml on victormar1/rubam
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
rubam-0.3.12-cp38-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl -
Subject digest:
231135305f866027f258117a6bf3f6a1634dc38fe83d697d74ab9922e33783ac - Sigstore transparency entry: 1731441228
- Sigstore integration time:
-
Permalink:
victormar1/rubam@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/victormar1
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@c91e6549b1e91cc235a976a0f3843585699a3c9c -
Trigger Event:
workflow_dispatch
-
Statement type: