# nnfasta
Lightweight Neural Net efficient FASTA dataset suitable for training.
It is memory efficient across process boundaries, which makes it useful as input to torch/tensorflow dataloaders with multiple workers (see this issue).
`nnfasta` presents a list of FASTA files as a simple `abc.Sequence`, so you can ask for `len(dataset)` and retrieve `Record`s randomly with `dataset[i]`. It uses Python's `mmap.mmap` under the hood.
The underlying FASTA files should be "well formed", since only minimal sanity checking is done.
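For instance, here is a minimal sketch of feeding a dataset to a PyTorch `DataLoader` with multiple workers (assuming PyTorch is installed; `my.fasta` is a placeholder path):

```python
# Minimal sketch: nnfasta with a multi-worker PyTorch DataLoader.
# Assumes PyTorch is installed; "my.fasta" is a placeholder path.
from torch.utils.data import DataLoader

from nnfasta import nnfastas

dataset = nnfastas(["my.fasta"])

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,                   # workers share the mmapped pages
    collate_fn=lambda batch: batch,  # Records are dataclasses; just gather them
)

for batch in loader:
    seqs = [rec.seq for rec in batch]
```

Because `Record` is a plain dataclass, a pass-through `collate_fn` is the simplest way to batch records.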
## Install
Install with:

```bash
pip install nnfasta
```
There are no dependencies; you just need a modern (>= 3.9) Python (the package is < 12K of code).
## Usage
```python
from nnfasta import nnfastas

dataset = nnfastas(['athaliana.fasta', 'triticum.fasta', 'zmays.fasta'])

# display the number of sequences
print(len(dataset))

# get a particular record
rec = dataset[20]
print('sequence', rec.id, rec.description, rec.seq)
```
**Warning**: No checks are made for the existence of the FASTA files. Also, files of zero length will be rejected by `mmap`.
A `Record` mimics biopython's `SeqRecord` and is simply:

```python
from dataclasses import dataclass

@dataclass
class Record:
    id: str
    """Sequence ID"""
    description: str
    """Line prefixed by '>'"""
    seq: str
    """Sequence stripped of whitespace and uppercased"""

    @property
    def name(self) -> str:
        return self.id
```
The major difference is that `seq` is just a simple `str`, not a biopython `Seq` object (we just don't want the `Bio` dependency; `nnfasta` has no dependencies).
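If you do need a real `SeqRecord` (say, to write records back out with biopython), a tiny hypothetical converter could look like this (requires biopython, which `nnfasta` itself does not):

```python
# Hypothetical helper, not part of nnfasta: convert a Record to a
# biopython SeqRecord (requires the Bio package).
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def to_seqrecord(rec) -> SeqRecord:
    return SeqRecord(Seq(rec.seq), id=rec.id, description=rec.description)
```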
## Arguments
You can give `nnfastas` a filename, a `Path`, the actual bytes in the file, or an open file pointer (opened with `mode="rb"`), *or* a list of any of these, e.g.:
```python
from nnfasta import nnfastas

my = "my.fasta"
fa = nnfastas([my, open(my, mode="rb"),
               open(my, mode="rb").read()])
```
## Encoding
The files are assumed to be encoded as "ASCII". If this is not the case, `nnfastas` accepts an `encoding` argument. All the files presented to `nnfastas` are assumed to be similarly encoded. You can alter the decoding error handling with the `errors` keyword (default=`"strict"`).
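For example, a sketch of reading Latin-1 encoded files (the filename is a placeholder):

```python
# Sketch: non-ASCII FASTA files. "legacy.fasta" is a placeholder;
# encoding/errors are passed through to the decoder (errors defaults to "strict").
from nnfasta import nnfastas

dataset = nnfastas(["legacy.fasta"], encoding="latin-1", errors="replace")
```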
## Test and train split best practice
Use `SubsetFasta`:

```python
from nnfasta import nnfastas, SubsetFasta
from sklearn.model_selection import train_test_split

dataset = nnfastas(['athaliana.fasta', 'triticum.fasta', 'zmays.fasta'])
train_idx, test_idx = train_test_split(range(len(dataset)), test_size=0.1, shuffle=True)

# these are still Sequence[Record] objects
train_data = SubsetFasta(dataset, train_idx)
test_data = SubsetFasta(dataset, test_idx)

# *OR* ... this is basically the same
import torch

train_data, test_data = torch.utils.data.random_split(dataset, [0.9, 0.1])
```
See the pytorch `Subset` logic here.
## How it works
We memory map the input files and use Python's `re` package to scan them for `b"\r>|\n>|^>"` bytes, from which we compute a (start, end) for each record and store these in an (in-memory) `array.array`. The operating system will ensure that the mmapped pages are shared across processes.
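An illustrative sketch of that indexing idea (not nnfasta's actual code; `my.fasta` is a placeholder):

```python
# Illustrative sketch of mmap + re indexing, not nnfasta's actual code.
import mmap
import re
from array import array

with open("my.fasta", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)

# byte offset of each '>' that starts a record
starts = array("q", (m.end() - 1 for m in re.finditer(rb"\r>|\n>|^>", mm)))

def raw_record(i: int) -> bytes:
    """Header line plus sequence lines for record i."""
    end = starts[i + 1] if i + 1 < len(starts) else len(mm)
    return mm[starts[i]:end]
```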
Enjoy, peeps!