biometal

ARM-native bioinformatics library with streaming architecture and evidence-based optimization


What Makes biometal Different?

Stream data directly from networks and analyze terabyte-scale datasets on consumer hardware without downloading.

  • Constant ~5 MB memory regardless of dataset size (99.5% reduction vs. traditional approaches at 1M sequences)
  • 16-25× speedup using ARM NEON SIMD on Apple Silicon
  • Network streaming from HTTP/HTTPS sources (no download needed)
  • Evidence-based design (1,357 experiments, 40,710 measurements)

Quick Start

Installation

Rust:

[dependencies]
biometal = "1.2"

Python:

pip install biometal-rs  # Install
python -c "import biometal; print(biometal.__version__)"  # Test

Note: Package is biometal-rs on PyPI, but imports as biometal in Python.
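
The split between the install name and the import name can be checked from Python itself. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of biometal.

```python
from importlib import metadata


def installed_version(distribution: str = "biometal-rs"):
    """Return the installed version of a PyPI distribution, or None if absent."""
    try:
        # importlib.metadata looks up the *distribution* name (what pip installs),
        # not the module name used in `import biometal`.
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_version() returns a version string once biometal-rs is installed, and None otherwise.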

Basic Usage

Rust:

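use biometal::FastqStream;

// `?` needs an enclosing function that returns a Result
// (assumes biometal's error type implements std::error::Error)
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Stream FASTQ with constant memory (~5 MB)
    let stream = FastqStream::from_path("dataset.fq.gz")?;

    for record in stream {
        let record = record?;
        // Process one record at a time
    }

    Ok(())
}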

Python:

import biometal

# Stream FASTQ with constant memory (~5 MB)
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # ARM NEON accelerated (16-25× speedup)
    gc = biometal.gc_content(record.sequence)
    counts = biometal.count_bases(record.sequence)
    mean_q = biometal.mean_quality(record.quality)
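
For readers who want to see what these three operations compute, here is a pure-Python reference sketch. It is not biometal's implementation (which is NEON-accelerated Rust). The function names with the _reference suffix are hypothetical, and the bytes input and Phred+33 quality encoding are assumptions.

```python
from collections import Counter


def gc_content_reference(sequence: bytes) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    # Iterating over bytes yields integers; `in b"GC"` tests membership by value.
    gc = sum(1 for base in sequence.upper() if base in b"GC")
    return gc / len(sequence)


def count_bases_reference(sequence: bytes) -> dict:
    """Count A, C, G, T and N, case-insensitively."""
    counts = Counter(sequence.upper())
    return {chr(base): counts.get(base, 0) for base in b"ACGTN"}


def mean_quality_reference(quality: bytes, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    if not quality:
        return 0.0
    return sum(q - offset for q in quality) / len(quality)
```

A scalar version like this is useful mainly as a correctness baseline when comparing against the accelerated functions.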

📚 Documentation


📓 Interactive Tutorials

Learn biometal through hands-on Jupyter notebooks (5 complete, ~2.5 hours):

| Notebook | Duration | Topics |
|---|---|---|
| 01. Getting Started | 15-20 min | Streaming, GC content, quality analysis |
| 02. Quality Control | 30-40 min | Trimming, filtering, masking (v1.2.0) |
| 03. K-mer Analysis | 30-40 min | ML preprocessing, DNABert (v1.1.0) |
| 04. Network Streaming | 30-40 min | HTTP streaming, public data (v1.0.0) |
| 05. BAM Alignment Analysis | 30-40 min | BAM parsing, 4× speedup, filtering (v1.2.0+) |

👉 Browse all tutorials →


🚀 Key Features

Streaming Architecture

  • Constant ~5 MB memory regardless of dataset size
  • Analyze 5TB datasets on laptops without downloading
  • 99.5% memory reduction vs. traditional approaches
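
The idea behind constant-memory streaming can be sketched with a plain Python generator that holds only one record at a time. This is an illustration of the pattern, not biometal's parser; the Record type and iter_fastq name are hypothetical, and the sketch assumes four-line FASTQ records.

```python
import io
from typing import IO, Iterator, NamedTuple


class Record(NamedTuple):
    name: str
    sequence: str
    quality: str


def iter_fastq(handle: IO[str]) -> Iterator[Record]:
    """Yield FASTQ records one at a time from a text stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield Record(header[1:].rstrip("\r\n"), sequence, quality)


# Tiny in-memory demo; a real file would be opened with gzip.open(path, "rt").
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
# Aggregating while iterating keeps memory flat: no list of records is built.
total_bases = sum(len(record.sequence) for record in iter_fastq(demo))
```

Because the generator never accumulates records, peak memory depends on the longest record and the I/O buffer, not on the file size.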

ARM-Native Performance

  • 16-25× speedup using ARM NEON SIMD
  • Optimized for Apple Silicon (M1/M2/M3/M4)
  • Automatic scalar fallback on x86_64
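
How biometal selects between its NEON and scalar code paths is not described here. As a rough illustration of the fallback idea only, the following sketch picks a backend label from the machine architecture; select_backend and the backend names are hypothetical.

```python
import platform

# Architecture strings reported for 64-bit ARM: "arm64" on macOS, "aarch64" on Linux.
ARM64_MACHINES = {"arm64", "aarch64"}


def select_backend(machine: str = "") -> str:
    """Return "neon" on 64-bit ARM and "scalar" everywhere else."""
    machine = (machine or platform.machine()).lower()
    return "neon" if machine in ARM64_MACHINES else "scalar"
```

On an Apple Silicon Mac or a Graviton instance this returns "neon"; on x86_64 it returns "scalar".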

Network Streaming

  • Stream directly from HTTP/HTTPS (no download)
  • Smart LRU caching + background prefetching
  • Access public data (ENA, S3, GCS, Azure)
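
The LRU caching mentioned above can be sketched as a small block cache placed in front of a range-fetch function. This is a minimal illustration under stated assumptions, not biometal's cache: LruBlockCache is a hypothetical class, the fetch callable stands in for an HTTP Range request, and background prefetching is not shown.

```python
from collections import OrderedDict
from typing import Callable


class LruBlockCache:
    """Cache fixed-size blocks of a remote file, evicting the least recently used."""

    def __init__(self, fetch: Callable[[int, int], bytes],
                 block_size: int = 65536, capacity: int = 8):
        self._fetch = fetch            # fetch(start, size) -> bytes (hypothetical)
        self._block_size = block_size
        self._capacity = capacity
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.misses = 0                # number of fetches actually performed

    def _block(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)   # mark as most recently used
            return self._blocks[index]
        self.misses += 1
        data = self._fetch(index * self._block_size, self._block_size)
        self._blocks[index] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return data

    def read(self, offset: int, length: int) -> bytes:
        """Read `length` bytes starting at `offset`, crossing blocks as needed."""
        out = bytearray()
        end = offset + length
        while offset < end:
            index, within = divmod(offset, self._block_size)
            chunk = self._block(index)[within:within + (end - offset)]
            if not chunk:
                break  # past the end of the data
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

With a bounded capacity, memory use stays fixed at roughly capacity × block_size no matter how large the remote file is.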

Operations Library

  • Core operations: GC content, base counting, quality scores
  • K-mer operations: Extraction, minimizers, spectrum (v1.1.0)
  • QC operations: Trimming, filtering, masking (v1.2.0)
  • BAM/SAM parser: Production-ready with 4× speedup via parallel BGZF (Nov 8, 2025)
    • 4.54 million records/sec throughput
    • 43.0 MiB/s compressed file processing
    • Constant ~5 MB memory (streams terabyte-scale alignments)
    • Python bindings (v1.3.0): CIGAR operations, SAM writing, alignment metrics
  • 40+ Python functions for bioinformatics workflows
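
Parallel BGZF decompression works because a BGZF file is a series of independently compressed gzip blocks, so blocks can be inflated concurrently and concatenated in order. The sketch below shows that idea with the standard library. It is not biometal's implementation, the function names are hypothetical, and the blocks here are plain gzip members already split into a list (real BGZF readers locate block boundaries via the BSIZE field in each block header).

```python
import gzip
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def compress_blocks(data: bytes, block_size: int = 65280) -> List[bytes]:
    """Split data into independently gzip-compressed blocks (BGZF-like)."""
    return [gzip.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def _inflate(block: bytes) -> bytes:
    # wbits=31 tells zlib to expect a gzip header and trailer.
    return zlib.decompress(block, wbits=31)


def decompress_parallel(blocks: Iterable[bytes], workers: int = 4) -> bytes:
    """Inflate blocks on a thread pool and join them in their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the output is reassembled correctly.
        return b"".join(pool.map(_inflate, blocks))
```

Threads are sufficient here because CPython's zlib releases the GIL while inflating. A streaming reader would process blocks in bounded batches rather than joining the whole file as this demo does.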

Performance Highlights

| Operation | Scalar | Optimized | Speedup |
|---|---|---|---|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | 16.7× (NEON) |
| GC content | 294 Kseq/s | 5,954 Kseq/s | 20.3× (NEON) |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | 25.1× (NEON) |
| BAM parsing | ~11 MiB/s | 43.0 MiB/s | 4.0× (Parallel BGZF) |

| Dataset Size | Traditional | biometal | Reduction |
|---|---|---|---|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| 5TB dataset | 5,000 GB | 5 MB | 99.9999% |
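
The NEON speedup figures are the optimized throughput divided by the scalar throughput. The short check below recomputes them from the Kseq/s values listed above; the variable names are illustrative only.

```python
# (scalar, optimized) throughput in Kseq/s, copied from the table above
throughput_kseq_per_s = {
    "Base counting": (315, 5254),
    "GC content": (294, 5954),
    "Quality filter": (245, 6143),
}

# Speedup = optimized / scalar, rounded to one decimal place
speedups = {
    operation: round(optimized / scalar, 1)
    for operation, (scalar, optimized) in throughput_kseq_per_s.items()
}

for operation, speedup in speedups.items():
    print(f"{operation}: {speedup}x")
```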

Platform Support

| Platform | Performance | Tests | Status |
|---|---|---|---|
| Mac ARM (M1-M4) | 16-25× speedup | ✅ 424/424 | Optimized |
| AWS Graviton | 6-10× speedup | ✅ 424/424 | Portable |
| Linux x86_64 | 1× (scalar) | ✅ 424/424 | Portable |

Test count includes 354 core library + 70 BAM/SAM parser tests


Evidence-Based Design

biometal's design is grounded in comprehensive experimental validation:


Roadmap

v1.0.0 (Released Nov 5, 2025) ✅ - Core library + network streaming

v1.1.0 (Released Nov 6, 2025) ✅ - K-mer operations

v1.2.0 (Released Nov 6, 2025) ✅ - Python bindings for Phase 4 QC

BAM/SAM (Integrated Nov 8, 2025) ✅ - Native streaming alignment parser with parallel BGZF (4× speedup)

v1.3.0 (In Development) - Python BAM bindings with CIGAR operations and SAM writing

Next (Planned):

  • Complete tag parsing (extended types from Phase 1)
  • BAI/CSI index support (random access)
  • Additional alignment statistics

Future (Community Driven):

  • Extended operations (alignment, assembly)
  • Additional formats (VCF, BCF, CRAM)
  • Metal GPU acceleration (Mac-specific)

See CHANGELOG.md for detailed release notes.


Mission: Democratizing Bioinformatics

biometal addresses barriers that lock researchers out of genomics:

  1. Economic: Consumer ARM laptops ($1,400) deliver production performance
  2. Environmental: ARM efficiency reduces carbon footprint
  3. Portability: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
  4. Data Access: Analyze 5TB datasets on 24GB laptops without downloading

Example Use Cases

Quality Control Pipeline

import biometal

stream = biometal.FastqStream.from_path("raw_reads.fq.gz")

for record in stream:
    # Trim low-quality ends
    trimmed = biometal.trim_quality_window(record, min_quality=20, window_size=4)

    # Length filter
    if biometal.meets_length_requirement(trimmed, min_len=50, max_len=150):
        # Mask remaining low-quality bases
        masked = biometal.mask_low_quality(trimmed, min_quality=20)

        # Check masking rate
        mask_rate = biometal.count_masked_bases(masked) / len(masked.sequence)
        if mask_rate < 0.1:
            # Pass QC - process further
            pass
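
The exact semantics of trim_quality_window are not spelled out here. One common interpretation of sliding-window trimming is sketched below in pure Python: scan from the 5' end and cut the read at the first window whose mean quality falls below the threshold. The function name, the string inputs, and the Phred+33 offset are assumptions, and biometal's behaviour may differ.

```python
def trim_quality_window_sketch(sequence: str, quality: str, min_quality: int = 20,
                               window_size: int = 4, offset: int = 33):
    """Cut the read at the first window whose mean Phred score is below min_quality."""
    scores = [ord(ch) - offset for ch in quality]
    cut = len(scores)  # default: keep the whole read
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < min_quality:
            cut = start
            break
    return sequence[:cut], quality[:cut]
```

Reads shorter than the window are returned unchanged, which is why a separate length filter, as in the pipeline above, is still needed.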

K-mer Extraction for ML

import biometal

# Extract k-mers for DNABert preprocessing
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # Extract overlapping k-mers (k=6 typical for DNABert)
    kmers = biometal.extract_kmers(record.sequence, k=6)

    # Format for transformer models
    kmer_string = " ".join(kmer.decode() for kmer in kmers)

    # Feed to DNABert - constant memory!
    # (`model` is a placeholder for your own inference wrapper; it is not part of biometal)
    model.process(kmer_string)
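
For reference, overlapping k-mer extraction and the space-separated formatting used above can be written in a few lines of pure Python. This is a sketch of the technique, not biometal's extract_kmers; the function names are hypothetical, and the bytes input mirrors the kmer.decode() call in the example.

```python
from collections import Counter
from typing import List


def extract_kmers_sketch(sequence: bytes, k: int) -> List[bytes]:
    """Return all overlapping k-mers; empty if the sequence is shorter than k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_sentence(sequence: bytes, k: int = 6) -> str:
    """Join k-mers with spaces, the token format expected by DNABert-style models."""
    return " ".join(kmer.decode("ascii") for kmer in extract_kmers_sketch(sequence, k))


def kmer_spectrum(sequence: bytes, k: int) -> Counter:
    """Count how often each k-mer occurs in the sequence."""
    return Counter(extract_kmers_sketch(sequence, k))
```

A sequence of length n yields n - k + 1 k-mers, so the token string is roughly k times longer than the read itself.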

Network Streaming

import biometal

# Stream from HTTP without downloading
# Works with ENA, S3, GCS, Azure public data
url = "https://example.com/dataset.fq.gz"
stream = biometal.FastqStream.from_path(url)

for record in stream:
    # Analyze directly - no download needed!
    # Memory: constant ~5 MB
    gc = biometal.gc_content(record.sequence)

BAM Alignment Analysis (v1.3.0)

import biometal

# Stream BAM file with constant memory
reader = biometal.BamReader.from_path("alignments.bam")

for record in reader:
    # Access alignment details
    print(f"{record.name}: MAPQ={record.mapq}, pos={record.position}")

    # Analyze CIGAR operations
    for op in record.cigar:
        if op.is_insertion() and op.length >= 5:
            print(f"  Found {op.length}bp insertion")

    # Calculate alignment metrics
    ref_len = record.reference_length()
    query_len = record.query_length()
    print(f"  Reference: {ref_len}bp, Query: {query_len}bp")

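# Convert BAM to SAM with filtering
# Re-open the file: the loop above has already consumed the stream
reader = biometal.BamReader.from_path("alignments.bam")
writer = biometal.SamWriter.create("output.sam")
writer.write_header(reader.header)

for record in reader:
    if record.is_primary and record.mapq >= 30:
        writer.write_record(record)

writer.close()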
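
The reference and query lengths reported above follow from the CIGAR string. In the SAM specification, M, I, S, = and X consume query bases, while M, D, N, = and X consume reference bases. The sketch below applies that rule to a CIGAR string in pure Python. The function names are hypothetical, and whether biometal's query_length() counts soft clips the same way is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")       # per the SAM specification
CONSUMES_REFERENCE = set("MDN=X")   # per the SAM specification


def parse_cigar(cigar: str):
    """Parse a CIGAR string into (length, op) pairs; "*" means unavailable."""
    if cigar == "*":
        return []
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Round-trip check rejects strings with unknown operations or stray characters.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def query_length(cigar: str) -> int:
    """Number of read bases covered by the CIGAR (soft clips included)."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
```

For example, 5S90M2I3D5M covers 5 + 90 + 2 + 5 query bases and spans 90 + 3 + 5 reference bases.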

FAQ

Q: Why biometal-rs on PyPI but biometal everywhere else?
A: The biometal name was taken on PyPI, so we use biometal-rs for installation. You still import it as biometal (import biometal).

Q: What platforms are supported?
A: Mac ARM (optimized), Linux ARM/x86_64 (portable). Pre-built wheels are available for common platforms. See docs/CROSS_PLATFORM_TESTING.md.

Q: Why ARM-native?
A: To democratize bioinformatics by enabling world-class performance on consumer hardware ($1,400 MacBooks vs. $50,000 servers).

More questions? See FAQ.md


Contributing

We welcome contributions! See CLAUDE.md for development guidelines.

biometal is built on evidence-based optimization - new features should:

  1. Have clear use cases
  2. Be validated experimentally (when adding optimizations)
  3. Maintain platform portability
  4. Follow OPTIMIZATION_RULES.md

License

Licensed under either of:

at your option.


Citation

If you use biometal in your research:

@software{biometal2025,
  author = {Handley, Scott},
  title = {biometal: ARM-native bioinformatics with streaming architecture},
  year = {2025},
  url = {https://github.com/shandley/biometal}
}

For the experimental methodology:

@misc{asbb2025,
  author = {Handley, Scott},
  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization},
  year = {2025},
  url = {https://github.com/shandley/apple-silicon-bio-bench}
}

Status: v1.3.0 in development 🚧
Latest: Python BAM bindings with CIGAR operations and SAM writing
Tests: 424 passing (354 library + 70 BAM parser)
Performance: 4.54M records/sec, 43.0 MiB/s throughput
Python Functions: 50+ (including full BAM support)
Evidence Base: 1,357 experiments, 40,710 measurements
