Python wrapper around [noodles](https://github.com/zaeleus/noodles).

bionemo-noodles

bionemo-noodles is a Python wrapper around noodles, a Rust bioinformatics I/O library. It provides memory-mapped (memmap) file I/O for FASTA files.
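To illustrate what memmap-based, indexed FASTA access means in general, here is a minimal sketch using only the Python standard library. It is not the bionemo-noodles implementation and does not use its API; the names `build_index` and `fetch` are hypothetical, and the sketch assumes a well-formed FASTA file in which every line of a record except the last has the same width (the same assumption a faidx-style index makes).

```python
import mmap
import tempfile
from pathlib import Path


def build_index(fasta_path: Path) -> dict[str, tuple[int, int, int, int]]:
    """Scan the file once and record, per sequence ID:
    (length in bases, byte offset of first base, bases per line, bytes per line)."""
    index: dict[str, list[int]] = {}
    name = None
    offset = 0  # byte offset of the start of the current line
    with open(fasta_path, "rb") as fh:
        for raw in fh:
            if raw.startswith(b">"):
                # The sequence ID is the first whitespace-delimited word of the header.
                name = raw[1:].split()[0].decode()
                index[name] = [0, offset + len(raw), 0, 0]
            elif name is not None and raw.strip():
                entry = index[name]
                bases = len(raw.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # The first sequence line defines the line width.
                    entry[2], entry[3] = bases, len(raw)
                entry[0] += bases
            offset += len(raw)
    return {key: tuple(value) for key, value in index.items()}


def fetch(mm: mmap.mmap, entry: tuple[int, int, int, int], start: int, end: int) -> str:
    """Return bases [start, end) (0-based, end-exclusive) without reading the whole file."""
    length, seq_offset, line_bases, line_bytes = entry
    end = min(end, length)
    if start < 0 or start >= end:
        return ""
    # Convert base coordinates to byte coordinates, accounting for line terminators.
    first = seq_offset + (start // line_bases) * line_bytes + start % line_bases
    last = seq_offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases + 1
    return mm[first:last].replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.fa"
        path.write_bytes(b">chr1 demo\nACGTAC\nGTACGT\nAC\n>chr2\nTTTT\n")
        index = build_index(path)
        with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            print(fetch(mm, index["chr1"], 4, 9))
```

Because only the requested byte range is touched, the operating system pages in just the parts of the file that are read, which is what makes this pattern attractive for very large genomes.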

Installation

To install from PyPI, execute the following command:

pip install bionemo-noodles

Usage

An example torch.utils.data.Dataset using NvFaidx / bionemo-noodles:

import json
from pathlib import Path

import torch

from bionemo.noodles.nvfaidx import NvFaidx

class SimpleFastaDataset(torch.utils.data.Dataset):
    """Map-style dataset that yields one tokenized FASTA record per index."""

    def __init__(self, fasta_path: Path, tokenizer):
        """Initialize the dataset."""
        super().__init__()
        self.fasta = NvFaidx(fasta_path)
        self.seqids = sorted(self.fasta.keys())
        self.tokenizer = tokenizer

    def write_idx_map(self, output_dir: Path):
        """Write the index map to the output directory."""
        with open(output_dir / "seq_idx_map.json", "w") as f:
            json.dump({seqid: idx for idx, seqid in enumerate(self.seqids)}, f)

    def __len__(self):
        """Get the length of the dataset."""
        return len(self.seqids)

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        """Get an item from the dataset."""
        sequence = self.fasta[self.seqids[idx]].sequence().upper()
        # Tokenize once and reuse the tensor for both the tokens and the loss mask.
        tokens = torch.tensor(self.tokenizer.text_to_ids(sequence), dtype=torch.long)
        return {
            "tokens": tokens,
            "position_ids": torch.arange(len(tokens), dtype=torch.long),
            "seq_idx": torch.tensor(idx, dtype=torch.long),
            "loss_mask": torch.ones_like(tokens),
        }
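The dataset above only requires that the tokenizer expose a `text_to_ids` method, and `write_idx_map` stores a mapping from sequence ID to index. The following sketch shows two small helpers that fit around it. Both are hypothetical and not part of bionemo-noodles: `AsciiTokenizer` is a stand-in for a real tokenizer, and `load_idx_to_seqid` assumes the JSON layout written by `write_idx_map`.

```python
import json
from pathlib import Path


class AsciiTokenizer:
    """Hypothetical stand-in tokenizer: one token per character, using ASCII codes."""

    def text_to_ids(self, text: str) -> list[int]:
        # Raises UnicodeEncodeError for non-ASCII input, which FASTA sequences should not contain.
        return list(text.encode("ascii"))


def load_idx_to_seqid(output_dir: Path) -> dict[int, str]:
    """Invert seq_idx_map.json so that a seq_idx value can be mapped back to its sequence ID."""
    with open(output_dir / "seq_idx_map.json") as f:
        seqid_to_idx = json.load(f)
    return {idx: seqid for seqid, idx in seqid_to_idx.items()}
```

With these in place, a dataset could be constructed as `SimpleFastaDataset(fasta_path, AsciiTokenizer())`, and the `seq_idx` field of each item could be resolved to a sequence ID through the dictionary returned by `load_idx_to_seqid`.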

BioNeMo Framework Ecosystem Development

To install this sub-package locally (with --editable):

pip install -e .

To run unit tests, execute:

pytest -v .

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

bionemo_noodles-0.1.1-cp312-cp312-manylinux_2_28_x86_64.whl (275.4 kB)

CPython 3.12, manylinux: glibc 2.28+ x86-64

bionemo_noodles-0.1.1-cp311-cp311-manylinux_2_28_x86_64.whl (275.2 kB)

CPython 3.11, manylinux: glibc 2.28+ x86-64

bionemo_noodles-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl (277.4 kB)

CPython 3.10, manylinux: glibc 2.28+ x86-64

File details

Details for the file bionemo_noodles-0.1.1-cp312-cp312-manylinux_2_28_x86_64.whl.

File hashes

Hashes for bionemo_noodles-0.1.1-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b3779941414fdd1a50768ea4fc8e4bb81efc5440edcdb8055306b086aa0e873f
MD5 8dacfe5e6073da4f0af7543b49c94391
BLAKE2b-256 e7211fecdde78243d9626716c382c3e160b0de6ccc42989fa021958759c98b93

Provenance

The following attestation bundles were made for bionemo_noodles-0.1.1-cp312-cp312-manylinux_2_28_x86_64.whl:

Publisher: bionemo-subpackage-ci.yml on NVIDIA/bionemo-framework

File details

Details for the file bionemo_noodles-0.1.1-cp311-cp311-manylinux_2_28_x86_64.whl.

File hashes

Hashes for bionemo_noodles-0.1.1-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 34230662b1602d221a7075ec39dcbcbd862135b7d8d1a0d1fbfd2596299e5677
MD5 cff8ec38f98e56c6df071465b04e3eaf
BLAKE2b-256 b20dc32a1b916125229e52e407629452b5fd72b32610291a0de2a002e764552c

Provenance

The following attestation bundles were made for bionemo_noodles-0.1.1-cp311-cp311-manylinux_2_28_x86_64.whl:

Publisher: bionemo-subpackage-ci.yml on NVIDIA/bionemo-framework

File details

Details for the file bionemo_noodles-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl.

File hashes

Hashes for bionemo_noodles-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b5782e3b62638f7125a54e3d92b644a873656ee9d5b4eb5ffabeb2f64f530300
MD5 2e6f3af8629dd7e79325c8d8f5648fa3
BLAKE2b-256 4c53b6fa402be8562e80dd61ee55f0505e4f91e029d320167c4d96ed75ea6538

Provenance

The following attestation bundles were made for bionemo_noodles-0.1.1-cp310-cp310-manylinux_2_28_x86_64.whl:

Publisher: bionemo-subpackage-ci.yml on NVIDIA/bionemo-framework

