Neural Net efficient FASTA

nnfasta

Lightweight Neural Net efficient FASTA dataset suitable for training.

It is designed to be memory efficient across process boundaries, which makes it useful as input to torch/tensorflow dataloaders with multiple workers.
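Because records are plain Python objects, a torch DataLoader needs a collate function to turn a batch of them into something tensor-like. The following is a minimal sketch of that step, not part of nnfasta: the alphabet, the token numbering, the Rec stand-in class and the names encode, collate_records and make_loader are all assumptions made for illustration, and torch is imported lazily so the rest runs without it.

```python
from typing import NamedTuple, Sequence

# Hypothetical 20-letter amino-acid alphabet. Token 0 is reserved for padding
# and token 1 for any character outside the alphabet.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNKNOWN = 0, 1
TOKEN = {aa: i + 2 for i, aa in enumerate(ALPHABET)}


class Rec(NamedTuple):
    """Stand-in for a record; only the fields used below."""
    id: str
    seq: str


def encode(seq: str) -> list[int]:
    """Map each residue to an integer token."""
    return [TOKEN.get(ch, UNKNOWN) for ch in seq]


def collate_records(batch: Sequence) -> tuple[list[str], list[list[int]]]:
    """Turn a batch of records into ids plus right-padded token lists."""
    encoded = [encode(rec.seq) for rec in batch]
    width = max(len(tokens) for tokens in encoded)
    padded = [tokens + [PAD] * (width - len(tokens)) for tokens in encoded]
    return [rec.id for rec in batch], padded


def make_loader(dataset, batch_size: int = 32, num_workers: int = 4):
    """Wrap any Sequence of records in a multi-worker DataLoader."""
    from torch.utils.data import DataLoader  # needs PyTorch installed

    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, collate_fn=collate_records)
```

In a training script the padded lists would typically be converted with torch.tensor(padded) inside collate_records; that step is left out here to keep the sketch free of hard dependencies.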

Presents a list of FASTA files as a simple collections.abc.Sequence, so you can query len(dataset) and retrieve Records by random access with dataset[i].

Uses Python's mmap.mmap under the hood.
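To illustrate why a memory map suits this job, here is a rough sketch of indexing a FASTA file through mmap: only the byte offsets of the record starts are kept in Python memory, and the sequence bytes are read from the mapping on demand. This is an assumption about the general technique, not nnfasta's actual implementation; index_fasta, read_record and demo are hypothetical names.

```python
import mmap
import os
import tempfile


def index_fasta(mm) -> list[int]:
    """Return the byte offset of every '>' that starts a line."""
    offsets = [0] if mm[:1] == b">" else []
    pos = mm.find(b"\n>")
    while pos != -1:
        offsets.append(pos + 1)  # skip the newline, point at '>'
        pos = mm.find(b"\n>", pos + 1)
    return offsets


def read_record(mm, offsets: list[int], i: int) -> tuple[str, str]:
    """Slice record i out of the mapping and return (header, sequence)."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(mm)
    header, _, body = mm[start:end].partition(b"\n")
    seq = b"".join(body.split()).upper()  # drop whitespace, uppercase
    return header[1:].strip().decode("ascii"), seq.decode("ascii")


def demo() -> list[tuple[str, str]]:
    """Write a tiny FASTA file, map it read-only and read every record."""
    with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
        tmp.write(b">seq1 first\nacgt\nAC\n>seq2 second\nGGTT\n")
    try:
        with open(tmp.name, "rb") as fp:
            with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                offsets = index_fasta(mm)
                return [read_record(mm, offsets, i)
                        for i in range(len(offsets))]
    finally:
        os.unlink(tmp.name)
```

The offset list is small compared with the file itself, and the operating system shares the mapped pages between processes, which is what makes this pattern attractive for multi-worker loading.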

The underlying FASTA files should be "well formed", since only minimal sanity checking is done.

Install

Install with:

pip install nnfasta

There are no dependencies; you just need a modern Python (>= 3.9).

Usage

from nnfasta import nnfastas

dataset = nnfastas(['athaliana.fasta','triticum.fasta','zmays.fasta'])

# display the number of sequences
print(len(dataset))

# get a particular record
rec = dataset[20]
print('sequence', rec.id, rec.description, rec.seq)

Warning: No checks are made for the existence of the FASTA files. Also, files of zero length will be rejected by mmap.
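Given that warning, it may be worth filtering the file list before handing it over. The helper below is a small sketch using only pathlib; split_usable is a hypothetical name and is not provided by nnfasta.

```python
from pathlib import Path


def split_usable(paths):
    """Split paths into (usable, rejected).

    A path is usable if it is an existing regular file with at least one
    byte, since an empty file cannot be memory mapped.
    """
    usable, rejected = [], []
    for path in map(Path, paths):
        if path.is_file() and path.stat().st_size > 0:
            usable.append(path)
        else:
            rejected.append(path)
    return usable, rejected
```

The usable list can then be passed on as the dataset's input, while the rejected list can be logged or raised as an error, depending on how strict the pipeline should be.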

A Record mimics biopython's SeqRecord and is simply:

from dataclasses import dataclass

@dataclass
class Record:
    id: str
    """Sequence ID"""
    description: str
    """Line prefixed by '>'"""
    seq: str
    """Sequence stripped of whitespace and uppercased"""

    @property
    def name(self) -> str:
        return self.id
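As an illustration of how the three fields relate to the raw text, the sketch below builds such a record from a header line and its sequence lines. It assumes biopython's convention that the ID is the first whitespace-delimited word of the header and that the leading '>' is dropped from the description; nnfasta may differ in these details. The Record class is re-declared locally so the example is self-contained, and make_record is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str
    description: str
    seq: str

    @property
    def name(self) -> str:
        return self.id


def make_record(header: str, lines: list[str]) -> Record:
    """Build a Record from a '>' header line and the lines that follow it."""
    description = header.strip().removeprefix(">").strip()
    # Assumption: the ID is the first word of the header, as in biopython.
    rec_id = description.split(maxsplit=1)[0] if description else ""
    # Strip all whitespace from the sequence and uppercase it.
    seq = "".join("".join(lines).split()).upper()
    return Record(rec_id, description, seq)
```

For example, make_record(">sp|P12345 test protein", ["mkt ay\n", "iak\n"]) gives a record whose id and name are both "sp|P12345" and whose seq is "MKTAYIAK".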

Arguments

You can give nnfastas a filename, a Path, the actual bytes of a file, or an open file pointer (opened with mode="rb"). You can also give it a list of any of these.
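The input forms listed above could be exercised as in the sketch below. It relies only on what this section states about nnfastas; count_each_way is a hypothetical helper, and whether a file pointer may be closed while the dataset is still in use is not documented here, so the sketch keeps it open for as long as the dataset is read.

```python
from pathlib import Path


def count_each_way(fasta_path: str) -> dict[str, int]:
    """Open one FASTA file through each input form and count its records."""
    from nnfasta import nnfastas  # needs `pip install nnfasta`

    path = Path(fasta_path)
    counts = {
        "filename": len(nnfastas(fasta_path)),
        "Path": len(nnfastas(path)),
        "bytes": len(nnfastas(path.read_bytes())),
        "list": len(nnfastas([fasta_path, path])),
    }
    # A file pointer must be opened in binary mode; keep it open while in use.
    with open(fasta_path, "rb") as fp:
        counts["file pointer"] = len(nnfastas(fp))
    return counts
```

With a real file, the first three counts and the file pointer count would be expected to agree, and the list count would be their sum over the listed inputs.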

Encoding

The files are assumed to be encoded as "ASCII". If this is not the case, nnfastas accepts an encoding argument. All the files presented to nnfastas are assumed to share the same encoding.
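To find out whether a file actually needs a non-default encoding, one option is to scan it for non-ASCII bytes first. The helper below is a standalone sketch using only the standard library; has_non_ascii is a hypothetical name and is not part of nnfasta.

```python
def has_non_ascii(path, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file contains any byte outside the ASCII range."""
    with open(path, "rb") as fp:
        # Read in chunks so large FASTA files are not loaded all at once.
        while chunk := fp.read(chunk_size):
            if not chunk.isascii():
                return True
    return False
```

If this returns True for any input, the encoding argument mentioned above is probably needed. The scan cannot tell which encoding the file uses, so that still has to come from knowledge of where the file originated.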

Test and Train Split best practice

Use SubsetFasta

from nnfasta import nnfastas, SubsetFasta
from sklearn.model_selection import train_test_split

dataset = nnfastas(['athaliana.fasta','triticum.fasta','zmays.fasta'])
train_idx, test_idx = train_test_split(range(len(dataset)), test_size=0.1, shuffle=True)

# these are still Sequence[Record] objects.

train_data = SubsetFasta(dataset, train_idx)
test_data = SubsetFasta(dataset, test_idx)

# *OR* ... this is basically the same
import torch
train_data, test_data = torch.utils.data.random_split(dataset, [.9, .1])

See the logic of PyTorch's torch.utils.data.Subset for reference.
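The idea behind such a subset is small enough to sketch with the standard library alone: a view that stores only a list of indices and forwards lookups to the parent sequence, so no records are copied. This is an illustration of the pattern, not the SubsetFasta source; IndexView and split_indices are hypothetical names.

```python
import random
from collections.abc import Sequence


class IndexView(Sequence):
    """Read-only view of `data` restricted to the given indices."""

    def __init__(self, data, indices):
        self._data = data
        self._indices = list(indices)

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i):
        if isinstance(i, slice):
            return IndexView(self._data, self._indices[i])
        # Raises IndexError past the end, which also terminates iteration.
        return self._data[self._indices[i]]


def split_indices(n: int, test_fraction: float = 0.1, seed: int = 0):
    """Shuffle range(n) reproducibly and return (train_idx, test_idx)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_test = round(n * test_fraction)
    return order[n_test:], order[:n_test]
```

Since the view holds only integers plus a reference to the parent, creating train and test views over a large dataset costs almost no extra memory.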

Enjoy peps!

Download files

Source Distribution

nnfasta-0.1.18.tar.gz (5.5 kB), uploaded May 10, 2024

Built Distribution

nnfasta-0.1.18-py3-none-any.whl (6.4 kB), uploaded May 10, 2024, Python 3

Hashes for nnfasta-0.1.18.tar.gz

Algorithm	Hash digest
SHA256	`ed243187b3eb321208d2cdbac245a453b61f9dcb341c2136ada8f124c5b4bb9b`
MD5	`00ff9c284d534dbec16e1ae9c81d08b6`
BLAKE2b-256	`e5268deaba1c9aad5d7a75db5a6f1f61acc5fad4041a297475b24c40027cc991`

Hashes for nnfasta-0.1.18-py3-none-any.whl

Algorithm	Hash digest
SHA256	`d6cfea71503c9917571fc6c8e84d17abf48d6d2f42d4dd762632ab7f7f4abd01`
MD5	`95ffe5df216e07bb6ed9494a28da9457`
BLAKE2b-256	`7c99f40d7d8620ee62425e8734400ff86d7c481f4a657804c36f834251ca7473`