Pure-Rust BAM/CRAM depth, pileup and stats — fast, multi-threaded, Windows-native. Originally forked from rustbam.

🦀 rubam — fast, native, Windows-ready BAM/VCF toolkit

rubam is a pure-Rust BAM/VCF analysis library with first-class Python bindings. It provides per-base depth, pileup, flag statistics, read counting, VCF/BCF read+write and indexed query — multi-threaded, with bit-exact parity against samtools and pysam, and native binaries for Linux, macOS and Windows (no WSL, no MSYS2, no htslib system install). The core AlignmentFile surface (fetch / count / count_coverage / pileup / header access) is a drop-in for pysam on BAM, validated base-for-base against pysam on real hg38 data. CRAM is experimental: AlignmentFile(path, reference_filename=...) opens any CRAM and reads its header; record decode is panic-guarded and raises a Python error on codecs noodles-cram does not yet support, so it never crashes across the FFI boundary.

Originally forked from rustbam (Choi et al.). rubam is now an independent project: pure-Rust backend (noodles), expanded API, full cross-platform CI, and a peer-reviewed validation campaign.

Why rubam

Capability pysam samtools CLI mosdepth rubam
Native Windows wheel
Multi-threaded depth ❌ (GIL) ⚠ partial
Python API
CRAM support ⚠ (skeleton v0.3.1; full decode v0.4)
Pure-Rust (no C dep) n/a
pip install "just works" on Windows n/a

Speed

Synthetic 10 Mb chr20, 30× coverage, 3 reps best-of

| Tool | 1 thread | 8 threads | vs pysam @ 8t |
| --- | --- | --- | --- |
| rubam | 4.14 s | 1.51 s | 6.0× |
| samtools depth | 8.34 s | 5.79 s | 1.6× |
| pysam | 8.95 s | 9.11 s (GIL) | 1.0× |
| mosdepth | 15.32 s | 13.88 s | 0.66× |

Real WGS — HG002 GIAB 2x250bp full chr20 (64.4 Mb, ~30× coverage)

Scaling sweep at threads {1, 2, 4, 8, 16}, 3 reps best-of (lower = better):

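| Tool | 1t | 2t | 4t | 8t | 16t |
| --- | --- | --- | --- | --- | --- |
| rubam | 60.4 s | 35.3 s | 21.8 s | 17.1 s | 17.1 s |
| samtools depth | 89.7 s | 43.5 s | 44.1 s | 43.8 s | 45.2 s |
| pysam (GIL) | 109.7 s | 110.7 s | 111.5 s | 109.4 s | 111.1 s |
| mosdepth | 36.7 s | 36.8 s | 36.7 s | 36.3 s | 37.1 s |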

rubam scales 3.5× from 1 → 8 threads, then saturates at 8t (I/O-bound). samtools scales only 1 → 2 threads. pysam and mosdepth are flat. At 8 threads, rubam beats every competitor: 6.4× pysam, 2.6× samtools, 2.1× mosdepth.
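The ratios quoted above can be re-derived from the scaling table. The snippet below is only a bookkeeping sketch: the timings are copied from the table, and the helper names `scaling` and `advantage` are made up for this example.

```python
# Best-of-3 wall-clock seconds copied from the HG002 chr20 scaling table.
timings = {
    "rubam": {1: 60.4, 2: 35.3, 4: 21.8, 8: 17.1, 16: 17.1},
    "samtools depth": {1: 89.7, 2: 43.5, 4: 44.1, 8: 43.8, 16: 45.2},
    "pysam": {1: 109.7, 2: 110.7, 4: 111.5, 8: 109.4, 16: 111.1},
    "mosdepth": {1: 36.7, 2: 36.8, 4: 36.7, 8: 36.3, 16: 37.1},
}


def scaling(tool, threads):
    """Speedup of `tool` at `threads` relative to its own 1-thread run."""
    return timings[tool][1] / timings[tool][threads]


def advantage(tool, threads, baseline="rubam"):
    """How many times slower `tool` is than `baseline` at the same thread count."""
    return timings[tool][threads] / timings[baseline][threads]


print(f"rubam 1t -> 8t: {scaling('rubam', 8):.1f}x")
for other in ("pysam", "samtools depth", "mosdepth"):
    print(f"rubam vs {other} @ 8t: {advantage(other, 8):.1f}x")
```

The same two helpers show the saturation point: `scaling('rubam', 16)` equals `scaling('rubam', 8)` because the 8-thread and 16-thread timings are identical.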

PacBio HiFi long reads — HG002 chr20 1-10 Mb

| Tool | 1 thread | 8 threads | vs pysam @ 8t |
| --- | --- | --- | --- |
| rubam | 5.3 s | 1.9 s | 5.6× |
| samtools depth | 9.7 s | 5.8 s | 1.8× |
| pysam | 10.6 s | 10.5 s | 1.0× |
| mosdepth | 19.0 s | 19.0 s | 0.55× |

→ rubam handles long-read CIGAR (rich D/I/=) without slowdown.

RNA-seq spliced reads — synthetic chr20 1-10 Mb, 5% reads with aM bN cM CIGAR (intron skip)

| Tool | 1 thread | 8 threads | vs pysam @ 8t |
| --- | --- | --- | --- |
| rubam | 4.6 s | 2.3 s | 4.8× |
| samtools depth | 8.3 s | 5.8 s | 1.9× |
| pysam | 11.0 s | 10.8 s | 1.0× |

→ rubam correctly skips reference-skip ops (N) without crashing; throughput is unchanged vs unspliced data. mosdepth not run on spliced data.
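To make "skipping reference-skip ops" concrete, here is a minimal pure-Python sketch of per-base depth over CIGAR strings. It is not rubam's implementation. The function names are hypothetical, and treating deletions (D) as uncovered is an assumption that mirrors the default behaviour of samtools depth (without -J).

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def covered_positions(pos, cigar):
    """Yield the 1-based reference positions a read contributes depth to.

    M, = and X consume the reference and count toward depth.
    N (intron skip) and D consume the reference but are not counted.
    I, S, H and P do not advance the reference coordinate.
    """
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":
            yield from range(ref, ref + length)
            ref += length
        elif op in "DN":
            ref += length


def depth(reads, start, end):
    """Per-base depth over the 1-based inclusive region [start, end]."""
    counts = [0] * (end - start + 1)
    for pos, cigar in reads:
        for p in covered_positions(pos, cigar):
            if start <= p <= end:
                counts[p - start] += 1
    return counts


# One spliced read (3M 4N 3M) and one unspliced read (5M).
reads = [(1, "3M4N3M"), (2, "5M")]
print(depth(reads, 1, 10))
```

The spliced read adds depth on both exons but nothing across the 4 bp intron, so position 7 stays at zero.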

All numbers are best-of-3 wall-clock on the datasets named in each table heading.

Rust API (for downstream crates)

rubam is also a publishable Cargo crate. Add it to your Cargo.toml:

[dependencies]
rubam = "0.3.12"

…and use the pure-Rust types directly (no Python, no pyo3):

use rubam::api::{AlignmentFile, Aux};

fn count_reverse_reads(bam_path: &str) -> rubam::api::Result<usize> {
    let mut bam = AlignmentFile::open(bam_path)?;
    let mut n = 0;
    for r in bam.records() {
        if r?.is_reverse() {
            n += 1;
        }
    }
    Ok(n)
}

fn extract_split_reads(bam_path: &str) -> rubam::api::Result<Vec<String>> {
    let mut bam = AlignmentFile::open(bam_path)?;
    let mut sa_tags = Vec::new();
    for r in bam.records() {
        let r = r?;
        if let Ok(Aux::String(s)) = r.aux(b"SA") {
            sa_tags.push(s.to_owned());
        }
    }
    Ok(sa_tags)
}

API surface (v0.2.1, stable):

| Type | Methods |
| --- | --- |
| AlignmentFile | open(path), header(), records() |
| Header | target_count, tid2name(tid), target_len(tid), target_names() |
| AlignedSegment | qname, tid, pos, mapq, seq, qual (raw phred), seq_len, 12 flag accessors, cigar(), aux(tag) |
| Cigar | enum with Match/Ins/Del/RefSkip/Equal/Diff/SoftClip/HardClip/Pad, each (u32) |
| Aux<'a> | enum with 18 variants (Char, I8/U8/.../U32, Float/Double, String, HexByteArray, 8 Array*) |

Drop-in replacement for rust_htslib::bam::Reader::from_path for codebases that iterate linearly. Indexed query (fetch) lands in v0.3.x. The pyo3 wrapper classes (rubam.AlignmentFile etc.) coexist with api::* and share the same noodles backend; v0.2.2 will refactor them to delegate to api::* directly.

Correctness

rubam is bit-exact against samtools depth -a over 5 × 10⁶ positions across five datasets, including whole-chromosome chr1:

| Dataset | Positions | rubam vs samtools |
| --- | --- | --- |
| Synthetic chr20 30× WGS | 1 000 000 | 0 mismatches |
| Synthetic chr20 spliced (5 % CIGAR N) | 1 000 000 | 0 mismatches |
| HG002 GIAB 2×250bp chr20 | 1 000 000 | 0 mismatches |
| HG002 PacBio HiFi chr20 | 1 000 000 | 0 mismatches |
| HG002 GIAB 2×250bp whole chr1 (249 Mb) | 1 000 000 | 0 mismatches |
| Total | 5 000 000 | 0 / 5 M ✅ |
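A position-by-position comparison of this kind can be sketched in a few lines. The example assumes both tools emit the three-column `chrom<TAB>pos<TAB>depth` layout that samtools depth prints for a single BAM; the layout of rubam depth output is not specified in this document, and the helper names are hypothetical.

```python
def parse_depth_tsv(text):
    """Parse `chrom<TAB>pos<TAB>depth` lines into {(chrom, pos): depth}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        chrom, pos, depth = line.split("\t")[:3]
        table[(chrom, int(pos))] = int(depth)
    return table


def count_mismatches(a_text, b_text):
    """Return (mismatches, positions compared) for two depth tables.

    A position present in only one table counts as a mismatch.
    """
    a, b = parse_depth_tsv(a_text), parse_depth_tsv(b_text)
    keys = a.keys() | b.keys()
    return sum(1 for k in keys if a.get(k) != b.get(k)), len(keys)


# Tiny stand-in tables; real inputs would be the two tools' output files.
tool_a = "chr20\t1\t30\nchr20\t2\t31\nchr20\t3\t29\n"
tool_b = "chr20\t1\t30\nchr20\t2\t31\nchr20\t3\t29\n"
print(count_mismatches(tool_a, tool_b))
```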

VCF-side correctness vs pysam.VariantFile: 319 349 / 319 349 = 100.00 % on the GIAB HG002 truth chr1 (319 k records, 13 MB BGZF).

Cross-tool correctness vs system bcftools: 100 % on view, query, sort.

Install

pip install rubam

Pre-built wheels are published for Linux, macOS and Windows; a single abi3 wheel per OS covers CPython 3.8 → 3.13. No htslib, no compiler, no WSL required — pip install rubam just works on Windows.

The NumPy return path (get_depths_numpy) needs NumPy at runtime:

pip install rubam[numpy]

Quick start

import rubam

positions, depths = rubam.get_depths(
    "sample.bam", "chr1", 1_000_000, 1_001_000,
    step=1, min_mapq=20, min_bq=20,
    max_depth=8000, num_threads=12,
)

CLI:

rubam depth sample.bam chr1 1000000 1001000 -n 12 -Q 20 -q 20 > depth.tsv
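Once `positions` and `depths` are in hand, a short summary is often the next step. The sketch below assumes the two values returned in the Python quick start are parallel sequences of integers. The helper `summarize_depths` and its `min_depth` parameter are hypothetical and not part of rubam, and stand-in data is used so the snippet runs without a BAM file.

```python
from statistics import mean


def summarize_depths(positions, depths, min_depth=10):
    """Summarize parallel position/depth sequences.

    Returns the number of positions, mean and max depth, and the fraction
    of positions whose depth is at least `min_depth`.
    """
    if len(positions) != len(depths):
        raise ValueError("positions and depths must have the same length")
    if not depths:
        return {"n": 0, "mean": 0.0, "max": 0, "frac_covered": 0.0}
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "n": len(depths),
        "mean": mean(depths),
        "max": max(depths),
        "frac_covered": covered / len(depths),
    }


# Stand-in data; in practice these would come from rubam.get_depths(...).
positions = [1_000_000, 1_000_001, 1_000_002, 1_000_003]
depths = [0, 10, 20, 30]
print(summarize_depths(positions, depths))
```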

Features

Stable (v0.1)

  • get_depths(bam, chr, start, end, ...) — per-base coverage over a 1-based, inclusive region.
  • CLI rubam depth ….

Shipped since v0.1.x

  • count_reads(bam, chr, start, end, ...), a pysam.AlignmentFile.count replacement.
  • flag_stats(bam), a samtools flagstat replacement that returns a Python dict.
  • pileup_bases(bam, chr, start, end, ...) — A/C/G/T counts per position.
  • get_depths_regions(bam, regions) — batch BED-style regions with shared thread pool.
  • get_depths_numpy(...) — zero-copy np.uint64 / np.uint32 return path (~4.5× lower peak RSS than the list path; needs pip install rubam[numpy]).
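Because get_depths takes a 1-based, inclusive region while BED files use 0-based, half-open intervals, a conversion step is easy to get wrong. The helper below is a hypothetical sketch of that conversion only. Which coordinate convention get_depths_regions itself expects is not stated here, so check before passing converted values to it.

```python
def bed_to_one_based(line):
    """Convert one BED line (0-based, half-open) to (chrom, start, end)
    in 1-based, inclusive coordinates."""
    chrom, start, end = line.split()[:3]
    start0, end0 = int(start), int(end)
    if end0 <= start0:
        raise ValueError(f"empty or inverted BED interval: {line!r}")
    # Only the start shifts: the half-open end equals the inclusive 1-based end.
    return chrom, start0 + 1, end0


bed_lines = [
    "chr1\t999999\t1001000",
    "chr20\t0\t100\tpromoter",  # extra BED columns are ignored
]
regions = [bed_to_one_based(line) for line in bed_lines]
print(regions)
```

The first interval becomes chr1:1000000-1001000, the same region used in the quick start.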

Roadmap

  • ⚠ CRAM full record decode (v0.4): rubam.AlignmentFile("sample.cram", reference_filename="ref.fa") already opens and reads the header; record decode is panic-guarded and raises a Python error on codecs noodles-cram does not yet support (e.g. Huffman byte-series on NYGC-style CRAMs). Tracking the upstream codec landing.
  • to_pandas() zero-copy helper; Parquet output.
  • rubam.compat.pysam drop-in shim (v0.5).

What's new in 0.2

  • rubam.AlignmentFile and rubam.AlignedSegment — drop-in pysam-style read iteration and per-read property access (flags, cigar, sequence, qualities, tags, reference helpers).
  • AlignmentFile.fetch(chr, start, end) — indexed region iterator.
  • AlignmentFile.pileup(chr, start, end) — buffered per-position iterator yielding PileupColumn objects with (reference_pos, depth, A/C/G/T/N).
  • rubam.tools.{sort, index, view, merge, flagstat, idxstats, calmd, faidx} — pure-Rust ports of the eight most-used samtools subcommands.
  • rubam-samtools shadow CLI binary — alias samtools='rubam samtools' and your shell pipelines keep working, on Windows included.

What's new in 0.2.1

  • rubam::api::{AlignmentFile, AlignedSegment, Header, Cigar, Aux, Error}: the pure-Rust public crate API. External Rust crates add rubam = "0.3.12" to Cargo.toml and import these types directly without pulling in pyo3, which makes it a drop-in for rust_htslib::bam::Reader in codebases that iterate linearly. The public surface is pinned by tests/api_smoke.rs and tests/integration_test.rs.

What's new in 0.3

  • rubam.VariantFile and rubam.VariantRecord — pysam-style VCF / BCF / Tabix support. Read, write (modes "w" / "wz" / "wb" for plain / BGZF / BCF), iterate, indexed fetch(contig, start, end), multi-sample genotype access via record.samples["NA12878"]["GT"].
  • rubam.VariantHeader — read-only metadata: samples, contigs (with lengths), INFO / FORMAT meta lines (id / number / type / description), FILTER ids, file format version.
  • rubam.VariantRecord(header=, …) constructor — build records from scratch. Plus set_position, set_quality, set_filter, add_filter, clear_filters, set_info mutation APIs.
  • rubam.tools.bcftools.{view, norm, concat, query, index, sort, stats} — pure-Rust ports of seven most-used bcftools subcommands.
  • rubam-bcftools shadow CLI — alias bcftools='rubam bcftools' works on Windows. Same shape as rubam-samtools.
  • Cross-tool correctness: the fields (chrom, pos, ref, alt, ids, qual, filters) extracted via rubam.VariantFile and pysam.VariantFile agree on all 100 records of a 3-sample synthetic VCF (0 mismatches).
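The three write modes listed above map naturally onto output file extensions. The helper below is a hypothetical convenience, not part of rubam. It only picks a mode string; the extension-to-mode mapping is an assumption based on common naming conventions.

```python
def vcf_write_mode(path):
    """Pick a VariantFile write mode from the output file name.

    "w"  -> plain-text VCF
    "wz" -> BGZF-compressed VCF
    "wb" -> BCF
    """
    lower = path.lower()
    if lower.endswith(".bcf"):
        return "wb"
    if lower.endswith((".vcf.gz", ".vcf.bgz")):
        return "wz"
    if lower.endswith(".vcf"):
        return "w"
    raise ValueError(f"cannot infer VCF/BCF write mode from {path!r}")


for name in ("calls.vcf", "calls.vcf.gz", "calls.bcf"):
    print(name, "->", vcf_write_mode(name))
```

The returned string would then be passed as the mode when opening the output file for writing.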

What's new in 0.3.12

pysam parity on real-world hg38 BAMs, verified base-for-base against pysam 0.24.0 (tests/test_pysam_parity_findings.py):

  • Tolerant header parsing — opens real hg38 / GATK / Picard BAMs that a strict SAM-header parser rejects (@HD with no VN, multi-part versions like VN:1.6.0, duplicate @PG/@RG/@SQ IDs from re-run pipelines). Valid headers still take the strict fast path unchanged.
  • count matches pysam defaults: read_callback='nofilter' by default (counts every read in the region, including secondary / supplementary / duplicate / QC-fail); read_callback='all' applies the 0x704 mask.
  • count_coverage matches pysam defaults: quality_threshold=15 (base counted iff qual >= threshold), no depth cap, and a read_callback argument.
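The 0x704 mask is the union of four standard SAM flag bits. The sketch below spells that out, along with the quality rule quoted above. It illustrates the filtering logic only, and the function names are made up for this example.

```python
# Standard SAM flag bits.
FUNMAP = 0x4        # read unmapped
FSECONDARY = 0x100  # secondary alignment
FQCFAIL = 0x200     # failed quality checks
FDUP = 0x400        # PCR or optical duplicate
FSUPPLEMENTARY = 0x800

SKIP_ALL = FUNMAP | FSECONDARY | FQCFAIL | FDUP  # == 0x704


def read_passes(flag, read_callback="nofilter"):
    """Mimic the two named filters: 'nofilter' keeps every read,
    'all' drops reads with any bit of the 0x704 mask set."""
    if read_callback == "nofilter":
        return True
    if read_callback == "all":
        return (flag & SKIP_ALL) == 0
    raise ValueError(f"unknown read_callback: {read_callback!r}")


def base_counts(qual, quality_threshold=15):
    """A base is counted iff its quality is >= the threshold."""
    return qual >= quality_threshold


print(hex(SKIP_ALL))
print(read_passes(FDUP, "all"), read_passes(FDUP, "nofilter"))
```

Note that the supplementary bit (0x800) is not part of the mask, so `read_passes(FSUPPLEMENTARY, "all")` is true.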

The compatibility layer rubam.compat.pysam (drop-in from rubam.compat import pysam) lands in v0.5; v0.2 + v0.3 are the foundation it sits on top of.

Validation & benchmarks

rubam is validated against pysam, samtools depth, samtools mpileup, mosdepth, bedtools genomecov and the original rustbam on real WGS, RNA-seq, exome and PacBio HiFi datasets (HG002, NA12878, public ENA RNA-seq), with multi-threaded scaling and cross-platform parity. The numbers in the tables above are drawn from that campaign.

License

MIT — see LICENSE.

Citation

If you use rubam in academic work, please cite the bioRxiv preprint (link will be added once posted).
