encoding a biological sequence to a one-hot numpy array
Description
seq2onehot is a command-line tool that encodes DNA, RNA, or protein sequences as a one-hot NumPy array.
:warning: All sequences must be the same length.
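Since all sequences must share one length, it can help to check the input before encoding. The following is a minimal sketch, not part of seq2onehot: `read_fasta` and `sequence_lengths` are hypothetical helper names, and the parser assumes a plain FASTA layout (header lines starting with `>`, sequence possibly wrapped over several lines).

```python
from typing import Dict, Iterable, Set


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Wrapped sequence lines are concatenated under the current header.
            records[header] += line
    return records


def sequence_lengths(records: Dict[str, str]) -> Set[int]:
    """Return the set of distinct sequence lengths found in the records."""
    return {len(seq) for seq in records.values()}


fasta_text = """>read1
ACGTACGTACGTACGT
>read2
CCCCCCCCTTTTTTTT
"""
records = read_fasta(fasta_text.splitlines())
lengths = sequence_lengths(records)
all_same_length = len(lengths) == 1
print(all_same_length)  # True
```

To check a file instead of a string, pass an open file handle to `read_fasta`, for example `read_fasta(open("example/dna.fasta"))`.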
To decode a one-hot NumPy array back to sequences, use onehot2seq:
https://github.com/akikuno/onehot2seq
Installation
You can install seq2onehot using pip:
pip install seq2onehot
Usage
seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>
Options
-a/--ambiguous: include ambiguous characters
The ambiguous characters are:
- XBZJ for amino acids
- NVHDBMRWSYK for DNA and RNA
The ambiguous characters are described in detail here:
https://meme-suite.org/meme/doc/alphabets.html
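To decide whether `-a/--ambiguous` is needed, one option is to scan the sequences for the ambiguous characters listed above. This is a small sketch under assumptions: `find_ambiguous` is a hypothetical helper, not a seq2onehot function, and it uppercases the input, which may not match how the tool treats lowercase letters.

```python
# Ambiguous characters per sequence type, as listed in the options above.
AMBIGUOUS = {
    "dna": set("NVHDBMRWSYK"),
    "rna": set("NVHDBMRWSYK"),
    "protein": set("XBZJ"),
}


def find_ambiguous(seq: str, seq_type: str) -> list:
    """Return the sorted ambiguous characters present in seq (hypothetical helper)."""
    # Uppercasing is an assumption made for this sketch.
    return sorted(set(seq.upper()) & AMBIGUOUS[seq_type])


print(find_ambiguous("ACGTNNRY", "dna"))  # ['N', 'R', 'Y']
print(find_ambiguous("ACGT", "dna"))      # []
```

An empty list means the sequence uses only the standard letters for that type.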
Examples
# DNA sequences
seq2onehot -t dna -i example/dna.fasta -o dna.npy
# RNA sequences
seq2onehot -t rna -i example/rna.fasta -o rna.npy
# Protein sequences
seq2onehot -t protein -i example/protein.fasta -o protein.npy
One-hot array
The output file contains a 3D one-hot array of shape RxNxL
(Read x Nucleotide/Amino acid x Letter)
- The order of nucleotides is ACGT (+NVHDBMRWSYK) for DNA and ACGU (+NVHDBMRWSYK) for RNA
- The order of amino acids is ACDEFGHIKLMNPQRSTVWY (+XBZJ)
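The layout described above can be illustrated with a short NumPy sketch. It is not the seq2onehot implementation: `one_hot_encode` is a hypothetical function, the `float64` dtype is an assumption, and raising on unknown characters is a choice made for this example only. The letter orders are the ones given in the list above.

```python
import numpy as np

# Letter orders as documented above; ambiguous letters are appended at the end.
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
}
AMBIGUOUS = {
    "dna": "NVHDBMRWSYK",
    "rna": "NVHDBMRWSYK",
    "protein": "XBZJ",
}


def one_hot_encode(seqs, seq_type, ambiguous=False):
    """Encode equal-length sequences as a (reads, positions, letters) array."""
    letters = ALPHABETS[seq_type] + (AMBIGUOUS[seq_type] if ambiguous else "")
    index = {char: i for i, char in enumerate(letters)}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all sequences must be the same length")
    encoded = np.zeros((len(seqs), length, len(letters)), dtype=np.float64)
    for read, seq in enumerate(seqs):
        for pos, char in enumerate(seq):
            # A character outside the alphabet raises KeyError in this sketch.
            encoded[read, pos, index[char]] = 1.0
    return encoded


encoded = one_hot_encode(["ACGTACGTACGTACGT", "CCCCCCCCTTTTTTTT"], "dna")
print(encoded.shape)  # (2, 16, 4)
```

With `ambiguous=True`, the last axis grows by the number of ambiguous letters, for example from 4 to 15 for DNA.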
import numpy as np

# Original sequences:
## ACGTACGTACGTACGT
## CCCCCCCCTTTTTTTT
onehot = np.load("dna.npy")
onehot.shape
# (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)
onehot
# array([[[1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.]],
# [[0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.]]])
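A quick way to sanity-check an array like the one above is to confirm that every position has exactly one hot letter and to map it back to sequences with `argmax`. This is only a sketch: `one_hot_decode` is a hypothetical helper and not the onehot2seq implementation, and the array is rebuilt in memory here so the example runs without `dna.npy`.

```python
import numpy as np


def one_hot_decode(onehot, letters="ACGT"):
    """Map a (reads, positions, letters) one-hot array back to strings."""
    indices = onehot.argmax(axis=-1)  # index of the hot letter at each position
    return ["".join(letters[i] for i in row) for row in indices]


# Rebuild the example array shown above instead of loading dna.npy.
seqs = ["ACGTACGTACGTACGT", "CCCCCCCCTTTTTTTT"]
onehot = np.zeros((len(seqs), len(seqs[0]), 4))
for read, seq in enumerate(seqs):
    for pos, char in enumerate(seq):
        onehot[read, pos, "ACGT".index(char)] = 1.0

# Every position should contain exactly one 1.
is_valid = bool((onehot.sum(axis=-1) == 1).all())
decoded = one_hot_decode(onehot)
print(is_valid)  # True
print(decoded)   # ['ACGTACGTACGTACGT', 'CCCCCCCCTTTTTTTT']
```

For RNA or protein arrays, pass the matching letter order as `letters`.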