biometal
ARM-native bioinformatics library with streaming architecture and evidence-based optimization
What Makes biometal Different?
Stream data directly from networks and analyze terabyte-scale datasets on consumer hardware without downloading.
- Constant ~5 MB memory regardless of dataset size (99.5% reduction)
- 16-25× speedup using ARM NEON SIMD on Apple Silicon
- Network streaming from HTTP/HTTPS sources (no download needed)
- Evidence-based design (1,357 experiments, 40,710 measurements)
Quick Start
Installation
Rust:
[dependencies]
biometal = "1.2"
Python:
pip install biometal-rs # Install
python -c "import biometal; print(biometal.__version__)" # Test
Note: The package is named biometal-rs on PyPI, but it is imported as biometal in Python.
Basic Usage
Rust:
use biometal::FastqStream;
// Stream FASTQ with constant memory (~5 MB)
let stream = FastqStream::from_path("dataset.fq.gz")?;
for record in stream {
    let record = record?;
    // Process one record at a time
}
Python:
import biometal
# Stream FASTQ with constant memory (~5 MB)
stream = biometal.FastqStream.from_path("dataset.fq.gz")
for record in stream:
    # ARM NEON accelerated (16-25× speedup)
    gc = biometal.gc_content(record.sequence)
    counts = biometal.count_bases(record.sequence)
    mean_q = biometal.mean_quality(record.quality)
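To make the three per-record operations above concrete, here is a pure-Python sketch of what GC content, base counting, and mean quality conventionally compute. It is not biometal's NEON implementation. The function names `gc_content_py`, `count_bases_py`, and `mean_quality_py` are hypothetical, and the bytes inputs, the Phred+33 quality encoding, and the choice to divide by the full sequence length are assumptions.

```python
from collections import Counter


def count_bases_py(sequence: bytes) -> dict:
    """Count A, C, G, T occurrences (other symbols such as N are ignored)."""
    counts = Counter(sequence.upper())  # iterating bytes yields integer codes
    return {base: counts.get(ord(base), 0) for base in "ACGT"}


def gc_content_py(sequence: bytes) -> float:
    """Fraction of G and C over the whole sequence; 0.0 for empty input."""
    if not sequence:
        return 0.0
    counts = count_bases_py(sequence)
    return (counts["G"] + counts["C"]) / len(sequence)


def mean_quality_py(quality: bytes, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(q - offset for q in quality) / len(quality)


print(count_bases_py(b"GATTACA"))
print(gc_content_py(b"GATTACA"))
print(mean_quality_py(b"IIIIIII"))
```

A scalar loop like this is the baseline that a SIMD implementation would be compared against.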
📚 Documentation
- 📖 Comprehensive Docs - DeepWiki AI-assisted documentation
- 📓 Interactive Tutorials - Jupyter notebooks with real workflows
- 🦀 API Reference - Full Rust documentation
- 🐍 Python Guide - Python-specific documentation
- 📐 Architecture - Technical design details
- 📊 Benchmarks - Performance analysis
- ❓ FAQ - Frequently asked questions
📓 Interactive Tutorials
Learn biometal through hands-on Jupyter notebooks (5 complete, ~2.5 hours):
| Notebook | Duration | Topics |
|---|---|---|
| 01. Getting Started | 15-20 min | Streaming, GC content, quality analysis |
| 02. Quality Control | 30-40 min | Trimming, filtering, masking (v1.2.0) |
| 03. K-mer Analysis | 30-40 min | ML preprocessing, DNABert (v1.1.0) |
| 04. Network Streaming | 30-40 min | HTTP streaming, public data (v1.0.0) |
| 05. BAM Alignment Analysis | 30-40 min | BAM parsing, 4× speedup, filtering (v1.2.0+) |
🚀 Key Features
Streaming Architecture
- Constant ~5 MB memory regardless of dataset size
- Analyze 5TB datasets on laptops without downloading
- 99.5% memory reduction vs. traditional approaches
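The constant-memory idea can be illustrated with a plain Python generator that holds only one record at a time. This is a minimal sketch, not biometal's implementation: `stream_fastq` and `FastqRecord` are hypothetical names, and the code assumes strict four-line FASTQ records.

```python
import gzip
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def stream_fastq(handle) -> Iterator[FastqRecord]:
    """Yield one four-line FASTQ record at a time from a text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Drop the leading '@' from the header line.
        yield FastqRecord(header.rstrip("\n")[1:], sequence, quality)


# Demo on an in-memory gzip payload so the sketch needs no files.
raw = b"@read1\nGATTACA\n+\nIIIIIII\n@read2\nCCGG\n+\nFFFF\n"
compressed = io.BytesIO(gzip.compress(raw))
with gzip.open(compressed, "rt") as handle:
    total_bases = sum(len(rec.sequence) for rec in stream_fastq(handle))
print(total_bases)
```

Because each record is discarded before the next is read, peak memory depends on the longest record and the decompression buffer, not on the file size.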
ARM-Native Performance
- 16-25× speedup using ARM NEON SIMD
- Optimized for Apple Silicon (M1/M2/M3/M4)
- Automatic scalar fallback on x86_64
Network Streaming
- Stream directly from HTTP/HTTPS (no download)
- Smart LRU caching + background prefetching
- Access public data (ENA, S3, GCS, Azure)
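As an illustration of the LRU caching mentioned above, the following sketch keeps a bounded number of fetched chunks and evicts the least recently used one. It is an assumption-laden toy, not biometal's cache: `ChunkCache` and its `fetch` callback are hypothetical, the callback stands in for an HTTP range request, and background prefetching is left out.

```python
from collections import OrderedDict
from typing import Callable


class ChunkCache:
    """Keep the most recently used chunks, evicting the least recently used."""

    def __init__(self, fetch: Callable[[int], bytes], capacity: int = 4):
        self._fetch = fetch  # hypothetical: would issue an HTTP range request
        self._capacity = capacity
        self._chunks = OrderedDict()
        self.misses = 0

    def get(self, index: int) -> bytes:
        if index in self._chunks:
            self._chunks.move_to_end(index)  # mark as most recently used
            return self._chunks[index]
        self.misses += 1
        data = self._fetch(index)
        self._chunks[index] = data
        if len(self._chunks) > self._capacity:
            self._chunks.popitem(last=False)  # drop the least recently used
        return data


# Stand-in for a remote file: chunk i is just a labelled byte string.
cache = ChunkCache(fetch=lambda i: f"chunk-{i}".encode(), capacity=2)
for i in (0, 1, 0, 2, 1):
    cache.get(i)
print(cache.misses)
```

With a capacity of two, the access pattern above misses four times: chunk 1 is evicted when chunk 2 arrives and has to be fetched again.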
Operations Library
- Core operations: GC content, base counting, quality scores
- K-mer operations: Extraction, minimizers, spectrum (v1.1.0)
- QC operations: Trimming, filtering, masking (v1.2.0)
- BAM/SAM parser: Production-ready with 4× speedup via parallel BGZF (Nov 8, 2025)
  - 4.54 million records/sec throughput
  - 43.0 MiB/s compressed file processing
  - Constant ~5 MB memory (streams terabyte-scale alignments)
- Python bindings (v1.3.0): CIGAR operations, SAM writing, alignment metrics
  - 40+ Python functions for bioinformatics workflows
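Of the k-mer operations listed above, minimizers are the least self-explanatory. The sketch below uses the common textbook definition: within every window of `w` consecutive k-mers, keep the lexicographically smallest one. Whether biometal orders k-mers lexicographically or by hash is not stated here, so treat the ordering, the function name `minimizers_py`, and its parameters as assumptions.

```python
def minimizers_py(sequence: str, k: int, w: int):
    """Return (position, k-mer) pairs: the smallest k-mer in each window of w k-mers."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the leftmost index when several k-mers tie.
        offset = min(range(w), key=lambda j: window[j])
        position = start + offset
        # Adjacent windows often share a minimizer; record it only once.
        if not picked or picked[-1][0] != position:
            picked.append((position, kmers[position]))
    return picked


print(minimizers_py("ACGTACGT", k=3, w=3))
```

Sequences with fewer than `w` k-mers produce an empty list in this sketch.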
Performance Highlights
| Operation | Scalar | Optimized | Speedup |
|---|---|---|---|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | 16.7× (NEON) |
| GC content | 294 Kseq/s | 5,954 Kseq/s | 20.3× (NEON) |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | 25.1× (NEON) |
| BAM parsing | ~11 MiB/s | 43.0 MiB/s | 4.0× (Parallel BGZF) |
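The speedup column for the NEON rows is the ratio of optimized to scalar throughput. A short check, using only the figures from the table above, reproduces it:

```python
# (scalar Kseq/s, optimized Kseq/s) taken from the table above
rows = {
    "Base counting": (315, 5254),
    "GC content": (294, 5954),
    "Quality filter": (245, 6143),
}

speedups = {
    name: round(optimized / scalar, 1)
    for name, (scalar, optimized) in rows.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```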
Memory footprint:
| Dataset Size | Traditional | biometal | Reduction |
|---|---|---|---|
| 100K sequences | 134 MB | 5 MB | 96.3% |
| 1M sequences | 1,344 MB | 5 MB | 99.5% |
| 5TB dataset | 5,000 GB | 5 MB | 99.9999% |
Platform Support
| Platform | Performance | Tests | Status |
|---|---|---|---|
| Mac ARM (M1-M4) | 16-25× speedup | ✅ 424/424 | Optimized |
| AWS Graviton | 6-10× speedup | ✅ 424/424 | Portable |
| Linux x86_64 | 1× (scalar) | ✅ 424/424 | Portable |
Test count includes 354 core library + 70 BAM/SAM parser tests
Evidence-Based Design
biometal's design is grounded in comprehensive experimental validation:
- 1,357 experiments (40,710 measurements, N=30)
- Statistical rigor (95% CI, Cohen's d effect sizes)
- Full methodology: apple-silicon-bio-bench
- 6 optimization rules documented in OPTIMIZATION_RULES.md
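For readers unfamiliar with the statistics named above, this sketch shows how Cohen's d and a 95% confidence interval are commonly computed. It is not taken from the apple-silicon-bio-bench methodology: the function names are hypothetical, the interval uses a normal approximation (a t-based interval would be slightly wider at N=30), and the sample values are made-up toy numbers, not measurements.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def mean_ci95(samples):
    """Normal-approximation 95% confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.975)
    half_width = z * stdev(samples) / sqrt(len(samples))
    centre = mean(samples)
    return centre - half_width, centre + half_width


# Toy inputs chosen for easy arithmetic, not real benchmark data.
optimized = [2, 4, 6]
baseline = [1, 3, 5]
print(cohens_d(optimized, baseline))
print(mean_ci95(optimized))
```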
Roadmap
- v1.0.0 (Released Nov 5, 2025) ✅ - Core library + network streaming
- v1.1.0 (Released Nov 6, 2025) ✅ - K-mer operations
- v1.2.0 (Released Nov 6, 2025) ✅ - Python bindings for Phase 4 QC
- BAM/SAM (Integrated Nov 8, 2025) ✅ - Native streaming alignment parser with parallel BGZF (4× speedup)
- v1.3.0 (In Development) - Python BAM bindings with CIGAR operations and SAM writing
Next (Planned):
- Complete tag parsing (extended types from Phase 1)
- BAI/CSI index support (random access)
- Additional alignment statistics
Future (Community Driven):
- Extended operations (alignment, assembly)
- Additional formats (VCF, BCF, CRAM)
- Metal GPU acceleration (Mac-specific)
See CHANGELOG.md for detailed release notes.
Mission: Democratizing Bioinformatics
biometal addresses barriers that lock researchers out of genomics:
- Economic: Consumer ARM laptops ($1,400) deliver production performance
- Environmental: ARM efficiency reduces carbon footprint
- Portability: Works across ARM ecosystem (Mac, Graviton, Ampere, RPi)
- Data Access: Analyze 5TB datasets on 24GB laptops without downloading
Example Use Cases
Quality Control Pipeline
import biometal
stream = biometal.FastqStream.from_path("raw_reads.fq.gz")
for record in stream:
    # Trim low-quality ends
    trimmed = biometal.trim_quality_window(record, min_quality=20, window_size=4)
    # Length filter
    if biometal.meets_length_requirement(trimmed, min_len=50, max_len=150):
        # Mask remaining low-quality bases
        masked = biometal.mask_low_quality(trimmed, min_quality=20)
        # Check masking rate
        mask_rate = biometal.count_masked_bases(masked) / len(masked.sequence)
        if mask_rate < 0.1:
            # Pass QC - process further
            pass
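The window-based trimming step can be pictured with a small pure-Python sketch. The exact rule biometal applies is not described here, so this assumes one common variant: scan from the 5' end and cut the read at the first window whose mean Phred score falls below the threshold. The name `trim_quality_window_py`, the string inputs, and the Phred+33 offset are all assumptions.

```python
def trim_quality_window_py(sequence: str, quality: str,
                           min_quality: int = 20, window_size: int = 4,
                           offset: int = 33):
    """Cut the read at the first window whose mean Phred score is below min_quality."""
    scores = [ord(ch) - offset for ch in quality]
    cut = len(scores)  # default: keep the whole read
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < min_quality:
            cut = start
            break
    return sequence[:cut], quality[:cut]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
print(trim_quality_window_py("ACGTACGTAC", "IIIIII####"))
```

In this example the window starting at index 5 (`I###`, mean 11.5) is the first to fall below 20, so the read is cut to its first five bases.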
K-mer Extraction for ML
import biometal
# Extract k-mers for DNABert preprocessing
stream = biometal.FastqStream.from_path("dataset.fq.gz")
for record in stream:
    # Extract overlapping k-mers (k=6 typical for DNABert)
    kmers = biometal.extract_kmers(record.sequence, k=6)
    # Format for transformer models
    kmer_string = " ".join(kmer.decode() for kmer in kmers)
    # Feed to DNABert - constant memory!
    # (`model` is a placeholder for your own model wrapper)
    model.process(kmer_string)
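For reference, overlapping k-mer extraction amounts to sliding a window of width k one base at a time, so a sequence of length n yields n - k + 1 k-mers. The following sketch shows this in plain Python; `extract_kmers_py` is a hypothetical stand-in and makes no claim about how biometal implements the operation.

```python
def extract_kmers_py(sequence: bytes, k: int):
    """Return all overlapping k-mers; empty if the sequence is shorter than k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = extract_kmers_py(b"GATTACA", 6)
# Space-separated k-mers, the token format used in the example above.
kmer_string = " ".join(kmer.decode() for kmer in kmers)
print(kmer_string)
```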
Network Streaming
import biometal
# Stream from HTTP without downloading
# Works with ENA, S3, GCS, Azure public data
url = "https://example.com/dataset.fq.gz"
stream = biometal.FastqStream.from_path(url)
for record in stream:
    # Analyze directly - no download needed!
    # Memory: constant ~5 MB
    gc = biometal.gc_content(record.sequence)
BAM Alignment Analysis (v1.3.0)
import biometal
# Stream BAM file with constant memory
reader = biometal.BamReader.from_path("alignments.bam")
for record in reader:
    # Access alignment details
    print(f"{record.name}: MAPQ={record.mapq}, pos={record.position}")
    # Analyze CIGAR operations
    for op in record.cigar:
        if op.is_insertion() and op.length >= 5:
            print(f" Found {op.length}bp insertion")
    # Calculate alignment metrics
    ref_len = record.reference_length()
    query_len = record.query_length()
    print(f" Reference: {ref_len}bp, Query: {query_len}bp")
# Convert BAM to SAM with filtering
# Reopen the file: the loop above has already consumed the streaming reader
reader = biometal.BamReader.from_path("alignments.bam")
writer = biometal.SamWriter.create("output.sam")
writer.write_header(reader.header)
for record in reader:
    if record.is_primary and record.mapq >= 30:
        writer.write_record(record)
writer.close()
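The reference and query lengths used above follow from the CIGAR string. Under the SAM specification, operations M, D, N, = and X consume reference bases, while M, I, S, = and X consume query bases. The sketch below applies that rule to a CIGAR string in text form; the function names are hypothetical and this is not biometal's parser.

```python
import re

CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string such as '5S90M2I' into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_length_py(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length_py(cigar: str) -> int:
    """Number of read bases described by the CIGAR (soft clips included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


cigar = "5S90M2I3D10M"
print(reference_length_py(cigar), query_length_py(cigar))
```

For `5S90M2I3D10M` the alignment spans 90 + 3 + 10 = 103 reference bases and describes 5 + 90 + 2 + 10 = 107 read bases.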
FAQ
Q: Why biometal-rs on PyPI but biometal everywhere else?
A: The biometal name was already taken on PyPI, so the package is installed as biometal-rs. You still import it with import biometal.
Q: What platforms are supported?
A: Mac ARM (optimized) and Linux ARM/x86_64 (portable). Pre-built wheels are available for common platforms. See docs/CROSS_PLATFORM_TESTING.md.
Q: Why ARM-native?
A: To democratize bioinformatics by enabling world-class performance on consumer hardware ($1,400 MacBooks vs. $50,000 servers).
More questions? See FAQ.md
Contributing
We welcome contributions! See CLAUDE.md for development guidelines.
biometal is built on evidence-based optimization - new features should:
- Have clear use cases
- Be validated experimentally (when adding optimizations)
- Maintain platform portability
- Follow OPTIMIZATION_RULES.md
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
Citation
If you use biometal in your research:
@software{biometal2025,
  author = {Handley, Scott},
  title = {biometal: ARM-native bioinformatics with streaming architecture},
  year = {2025},
  url = {https://github.com/shandley/biometal}
}
For the experimental methodology:
@misc{asbb2025,
  author = {Handley, Scott},
  title = {Apple Silicon Bio Bench: Systematic Hardware Characterization},
  year = {2025},
  url = {https://github.com/shandley/apple-silicon-bio-bench}
}
Status: v1.3.0 in development 🚧
Latest: Python BAM bindings with CIGAR operations and SAM writing
Tests: 424 passing (354 library + 70 BAM parser)
Performance: 4.54M records/sec, 43.0 MiB/s throughput
Python Functions: 50+ (including full BAM support)
Evidence Base: 1,357 experiments, 40,710 measurements