biometal

ARM-native bioinformatics library with streaming architecture and evidence-based optimization



What Makes biometal Different?

Stream data directly from networks and analyze terabyte-scale datasets on consumer hardware without downloading.

  • Constant ~5 MB memory regardless of dataset size (99.5% reduction)
  • 16-25× speedup using ARM NEON SIMD on Apple Silicon
  • Network streaming from HTTP/HTTPS sources (no download needed)
  • Evidence-based design (1,357 experiments, 40,710 measurements)

Quick Start

Installation

Rust:

[dependencies]
biometal = "1.2"

Python:

pip install biometal-rs  # Install
python -c "import biometal; print(biometal.__version__)"  # Test

Note: The package is published as biometal-rs on PyPI but is imported as biometal in Python.

Basic Usage

Rust:

use biometal::FastqStream;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Stream FASTQ with constant memory (~5 MB)
    let stream = FastqStream::from_path("dataset.fq.gz")?;

    for record in stream {
        let record = record?;
        // Process one record at a time
    }

    Ok(())
}

Python:

import biometal

# Stream FASTQ with constant memory (~5 MB)
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # ARM NEON accelerated (16-25× speedup)
    gc = biometal.gc_content(record.sequence)
    counts = biometal.count_bases(record.sequence)
    mean_q = biometal.mean_quality(record.quality)
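
For intuition, the three operations used above can be written as plain scalar Python. This is only a reference sketch under stated assumptions: the `_ref` function names are hypothetical, sequences and qualities are assumed to be `bytes`, and qualities are assumed to use Phred+33 encoding. biometal's own implementations and return types may differ.

```python
from collections import Counter


def gc_content_ref(sequence: bytes) -> float:
    """Fraction of G/C bases (case-insensitive); 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in b"GC")
    return gc / len(sequence)


def count_bases_ref(sequence: bytes) -> dict:
    """Counts of A, C, G and T (case-insensitive); other symbols are ignored."""
    counts = Counter(chr(base) for base in sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def mean_quality_ref(quality: bytes, offset: int = 33) -> float:
    """Mean Phred score, assuming ASCII-encoded qualities with the given offset."""
    if not quality:
        return 0.0
    return sum(q - offset for q in quality) / len(quality)
```

A scalar loop like this is the kind of baseline that the NEON speedup figures are measured against.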

📚 Documentation


📓 Interactive Tutorials

Learn biometal through hands-on Jupyter notebooks (5 complete, ~2.5 hours):

| Notebook | Duration | Topics |
|---|---|---|
| 01. Getting Started | 15-20 min | Streaming, GC content, quality analysis |
| 02. Quality Control | 30-40 min | Trimming, filtering, masking (v1.2.0) |
| 03. K-mer Analysis | 30-40 min | ML preprocessing, DNABert (v1.1.0) |
| 04. Network Streaming | 30-40 min | HTTP streaming, public data (v1.0.0) |
| 05. BAM Alignment Analysis | 30-40 min | BAM parsing, 4× speedup, filtering (v1.2.0+) |

👉 Browse all tutorials →


🚀 Key Features

Streaming Architecture

  • Constant ~5 MB memory regardless of dataset size
  • Analyze 5TB datasets on laptops without downloading
  • 99.5% memory reduction vs. traditional approaches
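
The constant-memory claim follows from the streaming pattern itself: only one record and a few counters are alive at any time. The sketch below illustrates that pattern with the Python standard library only. It is not biometal's implementation; `stream_fastq`, `FastqRecord` and `mean_gc` are hypothetical names, and it assumes simple four-line FASTQ records.

```python
import gzip
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: bytes
    sequence: bytes
    quality: bytes


def stream_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield one record at a time from a FASTQ file (gzipped if it ends in .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        while True:
            header = handle.readline()
            if not header:
                return  # end of file
            sequence = handle.readline().rstrip(b"\r\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip(b"\r\n")
            # Drop the leading '@' from the header line.
            yield FastqRecord(header.rstrip(b"\r\n")[1:], sequence, quality)


def mean_gc(path: str) -> float:
    """Running aggregate: only two counters are kept, never the records."""
    gc = total = 0
    for record in stream_fastq(path):
        upper = record.sequence.upper()
        gc += upper.count(b"G") + upper.count(b"C")
        total += len(upper)
    return gc / total if total else 0.0
```

Because the generator never builds a list of records, peak memory depends on the longest record and the decompression buffer, not on the file size.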

ARM-Native Performance

  • 16-25× speedup using ARM NEON SIMD
  • Optimized for Apple Silicon (M1/M2/M3/M4)
  • Automatic scalar fallback on x86_64
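
To see which of these cases is likely to apply on a given machine, the CPU architecture can be inspected from Python. This is a small sketch only: the function name and the returned labels are hypothetical, and biometal selects its code path internally, so nothing here is required to use the library.

```python
import platform
from typing import Optional


def expected_code_path(machine: Optional[str] = None,
                       system: Optional[str] = None) -> str:
    """Guess the code path from the architecture and OS (labels are illustrative)."""
    machine = (machine or platform.machine()).lower()
    system = system or platform.system()
    if machine in ("arm64", "aarch64"):
        # 64-bit ARM: NEON is available; Apple Silicon reports system "Darwin".
        return "neon-apple-silicon" if system == "Darwin" else "neon-portable"
    # Everything else (for example x86_64) uses the scalar fallback.
    return "scalar"


print(expected_code_path())
```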

Network Streaming

  • Stream directly from HTTP/HTTPS (no download)
  • Smart LRU caching + background prefetching
  • Access public data (ENA, S3, GCS, Azure)
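
The LRU caching idea can be sketched as a small cache of fixed-size byte ranges. This is an illustration under assumptions, not biometal's code: `ChunkCache` and its parameters are hypothetical, the `fetch` callable stands in for an HTTP range request, and background prefetching is left out for brevity.

```python
from collections import OrderedDict
from typing import Callable


class ChunkCache:
    """LRU cache of fixed-size byte ranges fetched on demand."""

    def __init__(self, fetch: Callable[[int, int], bytes],
                 chunk_size: int = 65536, max_chunks: int = 8) -> None:
        self.fetch = fetch            # fetch(start, end_exclusive) -> bytes
        self.chunk_size = chunk_size
        self.max_chunks = max_chunks
        self._chunks: "OrderedDict[int, bytes]" = OrderedDict()

    def _chunk(self, index: int) -> bytes:
        if index in self._chunks:
            self._chunks.move_to_end(index)      # mark as most recently used
            return self._chunks[index]
        start = index * self.chunk_size
        data = self.fetch(start, start + self.chunk_size)
        self._chunks[index] = data
        if len(self._chunks) > self.max_chunks:
            self._chunks.popitem(last=False)     # evict least recently used
        return data

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes from `offset`, spanning chunks as needed."""
        out = bytearray()
        end = offset + length
        while offset < end:
            index, within = divmod(offset, self.chunk_size)
            piece = self._chunk(index)[within:within + (end - offset)]
            if not piece:
                break                            # past the end of the resource
            out += piece
            offset += len(piece)
        return bytes(out)
```

Cached data is bounded by `chunk_size * max_chunks` bytes regardless of the remote file size. In a real client, `fetch(start, end)` would send an HTTP request with the header `Range: bytes={start}-{end - 1}`.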

Operations Library

  • Core operations: GC content, base counting, quality scores
  • K-mer operations: Extraction, minimizers, spectrum (v1.1.0)
  • QC operations: Trimming, filtering, masking (v1.2.0)
  • BAM/SAM parser: Production-ready with 4× speedup via parallel BGZF (Nov 8, 2025)
    • 4.54 million records/sec throughput
    • 43.0 MiB/s compressed file processing
    • Constant ~5 MB memory (streams terabyte-scale alignments)
    • Python bindings (v1.3.0): CIGAR operations, SAM writing, alignment metrics
  • 40+ Python functions for bioinformatics workflows
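
For readers new to the k-mer operations listed above, here is a plain-Python sketch of extraction, minimizers and a k-mer spectrum. The `_ref` names are hypothetical and the definitions are textbook ones chosen for clarity: minimizers use lexicographic order over windows of `w` consecutive k-mers, whereas production implementations often use hash ordering or canonical k-mers. biometal's own functions may define these differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def extract_kmers_ref(sequence: bytes, k: int) -> List[bytes]:
    """All overlapping k-mers, left to right (empty if the sequence is shorter than k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def minimizers_ref(sequence: bytes, k: int, w: int) -> List[Tuple[int, bytes]]:
    """(position, k-mer) of the smallest k-mer in each window of w consecutive
    k-mers; a minimizer shared by adjacent windows is reported once."""
    kmers = extract_kmers_ref(sequence, k)
    picked: List[Tuple[int, bytes]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda i: window[i])  # leftmost on ties
        hit = (start + offset, window[offset])
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


def kmer_spectrum_ref(sequence: bytes, k: int) -> Dict[bytes, int]:
    """How many times each k-mer occurs in the sequence."""
    return dict(Counter(extract_kmers_ref(sequence, k)))
```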

Performance Highlights

| Operation | Scalar | Optimized | Speedup |
|---|---|---|---|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | 16.7× (NEON) |
| GC content | 294 Kseq/s | 5,954 Kseq/s | 20.3× (NEON) |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | 25.1× (NEON) |
| BAM parsing | ~11 MiB/s | 43.0 MiB/s | 4.0× (Parallel BGZF) |

| Dataset Size | Traditional | biometal | Reduction |
|---|---|---|---|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| 5TB dataset | 5,000 GB | 5 MB | 99.9999% |
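
The speedup and reduction columns are simple ratios of the other columns. The helper names below are hypothetical and the inputs are copied from the figures above. Because the streaming footprint is quoted as approximately 5 MB, recomputed reductions may differ from the published ones in the last digit.

```python
def speedup(scalar_rate: float, optimized_rate: float) -> float:
    """Throughput ratio; both rates must use the same unit (e.g. Kseq/s)."""
    return optimized_rate / scalar_rate


def memory_reduction_pct(traditional_mb: float, streaming_mb: float = 5.0) -> float:
    """Percentage of memory saved relative to loading the whole dataset."""
    return 100.0 * (1.0 - streaming_mb / traditional_mb)


# Values copied from the performance figures above.
throughput = {
    "Base counting": (315, 5254),
    "GC content": (294, 5954),
    "Quality filter": (245, 6143),
}
for name, (scalar, optimized) in throughput.items():
    print(f"{name}: {speedup(scalar, optimized):.1f}x")

for label, traditional_mb in [("100K sequences", 134), ("1M sequences", 1344)]:
    print(f"{label}: {memory_reduction_pct(traditional_mb):.1f}% less memory")
```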

Platform Support

| Platform | Performance | Tests | Status |
|---|---|---|---|
| Mac ARM (M1-M4) | 16-25× speedup | ✅ 424/424 | Optimized |
| AWS Graviton | 6-10× speedup | ✅ 424/424 | Portable |
| Linux x86_64 | 1× (scalar) | ✅ 424/424 | Portable |

Test count includes 354 core library + 70 BAM/SAM parser tests


Evidence-Based Design

biometal's design is grounded in comprehensive experimental validation:


Roadmap

v1.0.0 (Released Nov 5, 2025) ✅ - Core library + network streaming

v1.1.0 (Released Nov 6, 2025) ✅ - K-mer operations

v1.2.0 (Released Nov 6, 2025) ✅ - Python bindings for Phase 4 QC

BAM/SAM (Integrated Nov 8, 2025) ✅ - Native streaming alignment parser with parallel BGZF (4× speedup)

v1.3.0 (In Development) - Python BAM bindings with CIGAR operations and SAM writing

Next (Planned):

  • Complete tag parsing (extended types from Phase 1)
  • BAI/CSI index support (random access)
  • Additional alignment statistics

Future (Community Driven):

  • Extended operations (alignment, assembly)
  • Additional formats (VCF, BCF, CRAM)
  • Metal GPU acceleration (Mac-specific)

See CHANGELOG.md for detailed release notes.


Mission: Democratizing Bioinformatics

biometal addresses barriers that lock researchers out of genomics:

  1. Economic: Consumer ARM laptops ($1,400) deliver production performance
  2. Environmental: ARM efficiency reduces carbon footprint
  3. Portability: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
  4. Data Access: Analyze 5TB datasets on 24GB laptops without downloading

Example Use Cases

Quality Control Pipeline

import biometal

stream = biometal.FastqStream.from_path("raw_reads.fq.gz")

for record in stream:
    # Trim low-quality ends
    trimmed = biometal.trim_quality_window(record, min_quality=20, window_size=4)

    # Length filter
    if biometal.meets_length_requirement(trimmed, min_len=50, max_len=150):
        # Mask remaining low-quality bases
        masked = biometal.mask_low_quality(trimmed, min_quality=20)

        # Check masking rate
        mask_rate = biometal.count_masked_bases(masked) / len(masked.sequence)
        if mask_rate < 0.1:
            # Pass QC - process further
            pass
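
To make the masking step concrete, here is a plain-Python sketch of one plausible definition: bases whose Phred score is below the threshold become `N`, and the mask rate is the fraction of `N` bases. The `_ref` names are hypothetical and Phred+33 encoding is assumed. biometal's functions operate on records and may differ in detail.

```python
def mask_low_quality_ref(sequence: bytes, quality: bytes,
                         min_quality: int = 20, offset: int = 33) -> bytes:
    """Replace each base whose Phred score is below min_quality with 'N'."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return bytes(
        base if q - offset >= min_quality else ord("N")
        for base, q in zip(sequence, quality)
    )


def masked_fraction_ref(sequence: bytes) -> float:
    """Fraction of 'N' bases; note this also counts N bases present before masking."""
    return sequence.count(b"N") / len(sequence) if sequence else 0.0


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
masked = mask_low_quality_ref(b"ACGT", b"I#I#", min_quality=20)
print(masked, masked_fraction_ref(masked))
```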

K-mer Extraction for ML

import biometal

# Extract k-mers for DNABert preprocessing
stream = biometal.FastqStream.from_path("dataset.fq.gz")

for record in stream:
    # Extract overlapping k-mers (k=6 typical for DNABert)
    kmers = biometal.extract_kmers(record.sequence, k=6)

    # Format for transformer models
    kmer_string = " ".join(kmer.decode() for kmer in kmers)

    # Feed to DNABert - constant memory!
    # (`model` is a placeholder for your own DNABert wrapper; it is not part of biometal)
    model.process(kmer_string)

Network Streaming

import biometal

# Stream from HTTP without downloading
# Works with ENA, S3, GCS, Azure public data
url = "https://example.com/dataset.fq.gz"
stream = biometal.FastqStream.from_path(url)

for record in stream:
    # Analyze directly - no download needed!
    # Memory: constant ~5 MB
    gc = biometal.gc_content(record.sequence)
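
Streaming a gzipped file over HTTP without storing it relies on incremental decompression: each network chunk is decompressed as it arrives and only the unfinished line is buffered. The sketch below shows this with the standard library's `zlib`. The function name is hypothetical, and it handles a single gzip member only; multi-member files such as BGZF need extra handling of `unused_data`. It is not a description of biometal's internals.

```python
import zlib
from typing import Iterable, Iterator


def iter_lines_from_gzip_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress gzip data chunk by chunk and yield complete lines."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    pending = b""
    for chunk in chunks:
        pending += decoder.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *lines, pending = pending.split(b"\n")
        yield from lines
    pending += decoder.flush()
    if pending:
        tail = pending.split(b"\n")
        if tail[-1] == b"":
            tail.pop()  # the data ended with a newline
        yield from tail
```

With a real URL, the chunks could come from `urllib.request.urlopen(url)` by reading fixed-size blocks in a loop until an empty block is returned.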

BAM Alignment Analysis (v1.3.0)

import biometal

# Stream BAM file with constant memory
reader = biometal.BamReader.from_path("alignments.bam")

for record in reader:
    # Access alignment details
    print(f"{record.name}: MAPQ={record.mapq}, pos={record.position}")

    # Analyze CIGAR operations
    for op in record.cigar:
        if op.is_insertion() and op.length >= 5:
            print(f"  Found {op.length}bp insertion")

    # Calculate alignment metrics
    ref_len = record.reference_length()
    query_len = record.query_length()
    print(f"  Reference: {ref_len}bp, Query: {query_len}bp")

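# Convert BAM to SAM with filtering
# Re-open the input: the streaming reader above is exhausted after the first loop
reader = biometal.BamReader.from_path("alignments.bam")
writer = biometal.SamWriter.create("output.sam")
writer.write_header(reader.header)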

for record in reader:
    if record.is_primary and record.mapq >= 30:
        writer.write_record(record)

writer.close()
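
The reference and query lengths used above can be derived from the CIGAR string alone. In the SAM specification, the operations M, D, N, = and X consume the reference, and M, I, S, = and X consume the query. The sketch below applies those rules to a CIGAR string; the function names are hypothetical, and biometal's methods may treat soft clips differently.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
REF_CONSUMING = set("MDN=X")     # per the SAM specification
QUERY_CONSUMING = set("MIS=X")   # soft clips (S) count, hard clips (H) do not


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_length(ops: List[Tuple[int, str]]) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(length for length, op in ops if op in REF_CONSUMING)


def query_length(ops: List[Tuple[int, str]]) -> int:
    """Number of read bases described by the CIGAR, including soft clips."""
    return sum(length for length, op in ops if op in QUERY_CONSUMING)


ops = parse_cigar("5S90M2I3D10M")
print(reference_length(ops), query_length(ops))
```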

FAQ

Q: Why biometal-rs on PyPI but biometal everywhere else?
A: The biometal name was taken on PyPI, so we use biometal-rs for installation. In Python you still write import biometal.

Q: What platforms are supported?
A: Mac ARM (optimized) and Linux ARM/x86_64 (portable). Pre-built wheels are available for common platforms. See docs/CROSS_PLATFORM_TESTING.md.

Q: Why ARM-native?
A: To democratize bioinformatics by enabling world-class performance on consumer hardware ($1,400 MacBooks vs. $50,000 servers).

More questions? See FAQ.md


Contributing

We welcome contributions! See CLAUDE.md for development guidelines.

biometal is built on evidence-based optimization - new features should:

  1. Have clear use cases
  2. Be validated experimentally (when adding optimizations)
  3. Maintain platform portability
  4. Follow OPTIMIZATION_RULES.md

License

Licensed under either of:

at your option.


Citation

If you use biometal in your research:

@software{biometal2025,
  author = {Handley, Scott},
  title = {biometal: ARM-native bioinformatics with streaming architecture},
  year = {2025},
  url = {https://github.com/shandley/biometal}
}

For the experimental methodology:

@misc{asbb2025,
  author = {Handley, Scott},
  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization},
  year = {2025},
  url = {https://github.com/shandley/apple-silicon-bio-bench}
}

Status: v1.3.0 in development 🚧
Latest: Python BAM bindings with CIGAR operations and SAM writing
Tests: 424 passing (354 library + 70 BAM parser)
Performance: 4.54M records/sec, 43.0 MiB/s throughput
Python Functions: 50+ (including full BAM support)
Evidence Base: 1,357 experiments, 40,710 measurements
