

nnfasta

Lightweight Neural Net efficient FASTA dataset suitable for training.

Should be memory efficient across process boundaries, making it useful as input to torch/tensorflow dataloaders with multiple workers (see this issue).
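
For example, here is a minimal sketch of feeding the dataset to a multi-worker DataLoader (assuming PyTorch is installed; the file name and collate function are illustrative, not part of nnfasta):

from torch.utils.data import DataLoader
from nnfasta import nnfastas

dataset = nnfastas(['athaliana.fasta'])  # hypothetical file

def collate(records):
    # Records are plain dataclasses, so any picklable collate function works.
    return [r.seq for r in records]

loader = DataLoader(dataset, batch_size=32, num_workers=4, collate_fn=collate)
for seqs in loader:
    ...  # feed the batch of sequences to your model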

Presents a list of FASTA files as a simple abc.Sequence, so you can query len(dataset) and retrieve Records randomly with dataset[i].

Uses Python's mmap.mmap under the hood.

The underlying FASTA files should be "well formed", since only minimal sanity checking is done.
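
For reference, a "well formed" FASTA file consists of records that each start with a '>' header line followed by one or more sequence lines, e.g.:

>sp|P12345|EX1 example protein 1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>sp|P67890|EX2 example protein 2
MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ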

Install

Install with:

pip install nnfasta

There are no dependencies; you just need a modern (>= 3.9) Python. The whole package is < 12K of code.

Usage

from nnfasta import nnfastas

dataset = nnfastas(['athaliana.fasta', 'triticum.fasta', 'zmays.fasta'])

# display the number of sequences
print(len(dataset))

# get a particular record
rec = dataset[20]
print('sequence', rec.id, rec.description, rec.seq)

Warning: No checks are made for the existence of the FASTA files. Also, zero-length files will be rejected by mmap.
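
If you are unsure of your inputs, a small defensive sketch (illustrative, not part of nnfasta) is to filter out missing or empty files first:

import os
from nnfasta import nnfastas

paths = ['athaliana.fasta', 'triticum.fasta', 'zmays.fasta']
ok = [p for p in paths if os.path.isfile(p) and os.path.getsize(p) > 0]
dataset = nnfastas(ok)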

A Record mimics Biopython's SeqRecord and is simply:

@dataclass
class Record:
    id: str
    """Sequence ID"""
    description: str
    """Line prefixed by '>'"""
    seq: str
    """Sequence stripped of whitespace and uppercased"""

    @property
    def name(self) -> str:
        return self.id

The major difference is that seq is a plain str, not a Biopython Seq object (we just don't want the Bio dependency -- nnfasta has no dependencies).
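
If you do want Biopython objects, the conversion is straightforward (a sketch, assuming biopython is installed; to_seqrecord is a hypothetical helper, not part of nnfasta):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def to_seqrecord(rec):
    # Wrap an nnfasta Record in a Biopython SeqRecord.
    return SeqRecord(Seq(rec.seq), id=rec.id, description=rec.description)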

Arguments

You can give nnfastas a filename, a Path, the actual bytes of the file, or an open file pointer (opened with mode="rb") -- or a list of any of these. For example:

from nnfasta import nnfastas
my = "my.fasta"
fa = nnfastas([my, open(my, mode="rb"),
               open(my, mode="rb").read()])

Encoding

The files are assumed to be encoded as ASCII. If this is not the case, nnfastas accepts an encoding argument. All the files presented to nnfastas are assumed to share that encoding. You can alter how decoding errors are handled with the errors keyword (default="strict").
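
For example (a sketch; the file name is illustrative):

from nnfasta import nnfastas

dataset = nnfastas(['latin1.fasta'], encoding='latin-1', errors='replace')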

Test and Train Split best practice

Use SubsetFasta

from nnfasta import nnfastas, SubsetFasta
from sklearn.model_selection import train_test_split

dataset = nnfastas(['athaliana.fasta', 'triticum.fasta', 'zmays.fasta'])
train_idx, test_idx = train_test_split(range(len(dataset)), test_size=0.1, shuffle=True)

# these are still Sequence[Record] objects.

train_data = SubsetFasta(dataset, train_idx)
test_data = SubsetFasta(dataset, test_idx)

# *OR* ... this is basically the same
import torch
train_data, test_data = torch.utils.data.random_split(dataset, [.9, .1])

See the pytorch Subset logic here

How it works

We memory-map the input files and use Python's re package to scan them for b"\r>|\n>|^>" bytes, from which we compute a (start, end) offset for each record and store these in an (in-memory) array.array.
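
In outline, the scan looks something like this (a simplified sketch, not the actual nnfasta source):

import mmap
import re
from array import array

with open('my.fasta', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)

# Byte offset of the '>' that opens each record.
starts = array('Q', (m.end() - 1 for m in re.finditer(rb'\r>|\n>|^>', mm)))
# Each record runs to the start of the next one (or to EOF for the last).
ends = array('Q', [*starts[1:], len(mm)])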

The operating system will ensure that mmapped pages are shared across different processes.

Enjoy, peeps!
