Python bindings for microBioRust: microbiology-friendly bioinformatics functions

microbiorust 🦀

Python bindings for microBioRust — a high-performance, modular bioinformatics toolkit written in Rust.

microbiorust provides fast and memory-efficient bioinformatics functionality to Python users by leveraging the power of Rust, exposed through PyO3. This package aims to offer an alternative to libraries like Biopython, with a focus on speed, correctness, and extensibility.


Installation

pip install microbiorust

To run the Python tests with pytest (from a checkout of the source repository):

python3 -m pytest -s tests/test_mbr.py

Wheels are available for Linux, macOS and Windows (Python 3.10+). No Rust toolchain is required.
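
After installing, one quick way to confirm that the package is visible to the current interpreter is to look up its distribution metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of microbiorust.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("microbiorust")
    if version is None:
        print("microbiorust is not installed in this environment")
    else:
        print(f"microbiorust {version} is installed")
```

Because the lookup reads metadata rather than importing the extension module, it also works in environments where the import itself would fail.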

Build from source

If you prefer to build from source using maturin:

pip install maturin
git clone https://github.com/microBioRust/microBioRust
cd microBioRust/microbiorust-py
maturin develop --features extension-module

To verify the Python module functions are correctly exposed from Rust:

cargo test

Features

  • Fast parsers for GenBank and EMBL formats
  • Fast parsers for BLAST XML and tabular formats
  • Fast parser for MSA alignments — subset, purge_gaps, get_consensus
  • Write directly to GFF3, FAA, FNA and FFN formats
  • Typed collections: each function's return type makes explicit what data it returns
  • Accurate feature extraction: gene, product, strand, start, stop, codon_start
  • Native JSON serialization to instantly export extracted data structures to standard JSON strings
  • Sequence metrics: hydrophobicity, amino acid counts and percentages
  • Python API for easy integration into existing pipelines
  • Built with Rust for memory safety and performance

Modules

microbiorust gbk — GenBank format

import json

import microbiorust as mb

# Write directly to file — most efficient for large files
# All functions are also available for EMBL format (parse_embl, embl_to_faa, embl_to_ffn, etc.)
collection = mb.parse_gbk("genome.gbk")
collection.write_faa("output.faa")
collection.write_ffn("output.ffn")
collection.write_fna("output.fna")

# Flat access across whole genome file — returns FaaCollection
faa = mb.gbk_to_faa("genome.gbk")
# print valid protein fasta
for info in faa.values():
    print(f">{info.locus_tag}\n{info.faa}")

# Per-contig record access
for record in collection.values():
    # prints the contig id and sequence
    print(record.id(), record.sequence())

    # protein sequences
    for info in record.faa().values():
        # prints the protein fasta for each predicted protein in the record
        print(f">{info.locus_tag}\n{info.faa}")

    # nucleotide sequences
    for info in record.ffn().values():
        # prints the nucleotide fasta sequence of each predicted gene
        print(f">{info.locus_tag}\n{info.ffn}")

    # features
    features = record.features()
    # looks up the feature for one predicted gene by its locus tag key
    if "b3304" in features:
        feat = features["b3304"]
        print(f"Gene: {feat.gene}, Product: {feat.product}")
        print(f"Location: {feat.start}..{feat.stop}, Strand: {feat.strand}")

# Convert collection to JSON string
json_str = collection.to_json()
print(json_str)
# Parse JSON string into Python dictionary
data = json.loads(json_str)

# Count proteins without loading sequences
count = mb.gbk_to_faa_count("genome.gbk")

# Convert annotations from gbk or embl to GFF3
mb.gbk_to_gff("genome.gbk", dna=True)
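
The loops above print each sequence on a single line. Some downstream tools expect wrapped FASTA, so the following sketch shows one way to wrap sequence lines in plain Python. The function `format_fasta` is a hypothetical helper, not part of microbiorust, and the `SimpleNamespace` objects are made-up stand-ins that only mimic the `locus_tag` and `faa` attributes used in the example above.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (identifier, sequence) pairs as FASTA with wrapped sequence lines."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Slice the sequence into chunks of at most `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up stand-ins for the objects yielded by faa.values() above.
proteins = [
    SimpleNamespace(locus_tag="tag_0001", faa="MKTLLLTLVVVTIVCLDLGAVGNGSSLSEDKDNVHK"),
    SimpleNamespace(locus_tag="tag_0002", faa="MSTNPKPQRKTKRNTNRRPQDVKFPGG"),
]

fasta_text = format_fasta(((p.locus_tag, p.faa) for p in proteins), width=20)
print(fasta_text, end="")
```

With real data, the generator expression would iterate over `faa.values()` instead of the stand-in list.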

microbiorust embl: EMBL format

This example calls the functions on the embl submodule; they can also be called directly from the top-level module, for example mb.embl_to_faa.

from microbiorust import embl

# Extract protein sequences to FASTA
embl.embl_to_faa("input.embl", "output.faa")

# Extract nucleotide sequences to FASTA
embl.embl_to_fna("input.embl", "output.fna")

# Convert annotations to GFF3
embl.embl_to_gff("input.embl", "output.gff")

microbiorust seqmetrics — Sequence metrics

from microbiorust import seqmetrics

sequence = "MKTLLLTLVVVTIVCLDLGAVGNGSSLSEDKDNVHK"

# Hydrophobicity score
window_size = 5
score = seqmetrics.hydrophobicity(sequence, window_size)

# Amino acid counts
counts = seqmetrics.amino_counts(sequence)

# Amino acid percentages
percentages = seqmetrics.amino_percentage(sequence)
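
To make concrete what these metrics represent, here is a plain-Python sketch of residue counts, residue percentages and a sliding-window hydropathy profile. It assumes the Kyte-Doolittle scale and a simple window mean, which is a common definition but not necessarily the one microbiorust uses; the function names are illustrative and deliberately differ from the library's.

```python
from collections import Counter
from typing import Dict, List

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def count_residues(sequence: str) -> Dict[str, int]:
    """Count how often each residue letter occurs."""
    return dict(Counter(sequence.upper()))


def residue_percentages(sequence: str) -> Dict[str, float]:
    """Express each residue count as a percentage of the sequence length."""
    counts = count_residues(sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def windowed_hydropathy(sequence: str, window_size: int) -> List[float]:
    """Mean Kyte-Doolittle score for every full window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    seq = sequence.upper()
    if window_size < 1 or window_size > len(seq):
        raise ValueError("window_size must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(scores[i:i + window_size]) / window_size
        for i in range(len(seq) - window_size + 1)
    ]


profile = windowed_hydropathy("MKTLLLTLVVVTIVCLDLGAVGNGSSLSEDKDNVHK", 5)
print(len(profile), "windows; first window mean:", round(profile[0], 2))
```

A sequence of length n with window size w yields n - w + 1 values, one per full window.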

microbiorust align — Multiple sequence alignment

from microbiorust import align

# Subset a FASTA-format MSA by row and column
align.subset_msa_alignment("input.fasta", "ids.txt", "output.fasta")
# When row and column ranges are given as tuples, the first tuple, e.g. (0, 10),
# is a row-wise subset and the second tuple, e.g. (0, 100), is a column-wise subset
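
The feature list names three alignment operations: subset, purge_gaps and get_consensus. The sketch below illustrates the ideas behind them on an in-memory alignment in plain Python. It is not the library's implementation: the function names are hypothetical, ranges are assumed to be half-open (start, stop) tuples, gap purging is assumed to mean dropping columns that contain only gaps, and the consensus is a simple majority vote per column.

```python
from collections import Counter
from typing import List, Tuple

Alignment = List[str]  # one aligned sequence per row, all the same length


def subset(alignment: Alignment, rows: Tuple[int, int], cols: Tuple[int, int]) -> Alignment:
    """Keep rows[0]:rows[1] of the sequences and cols[0]:cols[1] of the columns."""
    (row_start, row_stop), (col_start, col_stop) = rows, cols
    return [seq[col_start:col_stop] for seq in alignment[row_start:row_stop]]


def drop_gap_only_columns(alignment: Alignment, gap: str = "-") -> Alignment:
    """Remove every column in which all sequences have a gap character."""
    if not alignment:
        return []
    keep = [
        i for i, column in enumerate(zip(*alignment))
        if any(ch != gap for ch in column)
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def majority_consensus(alignment: Alignment) -> str:
    """Pick the most frequent character in each column."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*alignment)
    )


# Small made-up alignment for illustration.
msa = [
    "ATG-CA",
    "ATG-CT",
    "ACG-CA",
]

print(subset(msa, (0, 2), (0, 3)))
print(drop_gap_only_columns(msa))
print(majority_consensus(drop_gap_only_columns(msa)))
```

Transposing with `zip(*alignment)` turns rows into columns, which keeps both the gap check and the consensus vote to a single pass over the columns.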

microbiorust.blast — BLAST results

import microbiorust

results = microbiorust.parse_tabular("blast_results.tab")
for hit in results:
    print(hit["qseqid"], hit["pident"], hit["bitscore"])
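
The keys used above (qseqid, pident, bitscore) are among the twelve default columns of BLAST+ tabular output (-outfmt 6). As an illustration of what such records contain and how they might be post-processed, the following standard-library sketch reads tabular text into dictionaries and keeps the highest-scoring hit per query. It is not microbiorust's parser; the function names and the sample rows are made up.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, TextIO

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def read_tabular(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per hit, converting numeric columns to int or float."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) < len(OUTFMT6_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(row)}")
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit


def best_hits(hits: Iterable[dict]) -> Dict[str, dict]:
    """Keep the hit with the highest bitscore for each query."""
    best: Dict[str, dict] = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up sample rows for illustration.
SAMPLE = (
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370.0\n"
    "q1\ts2\t80.0\t180\t36\t0\t1\t180\t10\t189\t1e-40\t150.0\n"
    "q2\ts3\t91.2\t150\t13\t0\t1\t150\t1\t150\t2e-60\t250.0\n"
)

best = best_hits(read_tabular(io.StringIO(SAMPLE)))
for query, hit in best.items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

The same `best_hits` function should work on any iterable of dict-like hits that expose `qseqid` and `bitscore` keys with numeric scores.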

Choosing a usage pattern

| Goal | Use |
| --- | --- |
| Write everything to file | collection.write_faa() / write_ffn() / write_fna() |
| Get all proteins across a whole genome file | gbk_to_faa() |
| Work per genome contig record | parse_gbk() then record.faa() or record.ffn() |
| Features and sequences together | parse_gbk() then record.sequences() + record.features() |
| Count proteins without loading | gbk_to_faa_count() |
| Convert collection to JSON string | collection.to_json() |
| Parse JSON string into Python dictionary | json.loads() |
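
The last two entries above combine into a round trip from collection to Python dictionary. The layout of the JSON produced by to_json() is not described here, so the sketch below uses a made-up JSON string with hypothetical keys and shows a small helper for inspecting the structure of whatever json.loads() returns. The helper `outline` is illustrative and not part of microbiorust.

```python
import json
from typing import Any


def outline(value: Any, max_depth: int = 2) -> Any:
    """Summarise nested JSON data: keys are kept, leaves become type names."""
    if max_depth == 0:
        return type(value).__name__
    if isinstance(value, dict):
        return {key: outline(item, max_depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        # Describe only the first element, assuming a homogeneous list.
        return [outline(value[0], max_depth - 1)] if value else []
    return type(value).__name__


# Made-up JSON with a hypothetical layout; the real output may differ.
json_str = (
    '{"contig_1": {"b0001": '
    '{"gene": "thrL", "start": 190, "stop": 255, "strand": 1}}}'
)

data = json.loads(json_str)
print(json.dumps(outline(data, max_depth=3), indent=2))
```

Running `outline` on the parsed output of a real collection is a quick way to discover which keys are available before writing code that depends on them.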

Why Rust?

Rust gives microbiorust C-level performance with memory safety — no segfaults, no GIL limitations, and no need for NumPy or Pandas for core parsing operations. Large GenBank or EMBL files are parsed significantly faster than equivalent pure-Python implementations.


Documentation

Full documentation: https://microbiorust.github.io/docs/

Source: https://github.com/microBioRust/microBioRust


License

MIT
