Encoding biological sequences to a one-hot NumPy array
Description
seq2onehot is a command-line tool that encodes DNA/RNA/protein sequences into a one-hot NumPy array.
:warning: All sequences must be the same length.
To decode a one-hot NumPy array back into sequences, use onehot2seq:
https://github.com/akikuno/onehot2seq
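Because every sequence must have the same length, it can help to check a FASTA file before encoding it. The following is a minimal sketch in plain Python, not part of seq2onehot; the helper names `read_fasta` and `sequence_lengths` are hypothetical, and the parser assumes a simple FASTA layout (header lines starting with `>`, sequences possibly wrapped over several lines).

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # wrapped sequence lines are joined later
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_lengths(path):
    """Return the set of distinct sequence lengths found in a FASTA file."""
    return {len(seq) for _, seq in read_fasta(path)}
```

If `sequence_lengths("example/dna.fasta")` returns a set with more than one element, the sequences differ in length and would need trimming or padding before encoding.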
Installation
You can install seq2onehot using pip:
pip install seq2onehot
Usage
seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>
Options
-a/--ambiguous: include ambiguous characters
The ambiguous characters are:
- XBZJ for amino acids
- NVHDBMRWSYK for DNA and RNA
The ambiguous characters are described in detail here:
https://meme-suite.org/meme/doc/alphabets.html
Examples
# DNA sequences
seq2onehot -t dna -i example/dna.fasta -o dna.npy
# RNA sequences
seq2onehot -t rna -i example/rna.fasta -o rna.npy
# Protein sequences
seq2onehot -t protein -i example/protein.fasta -o protein.npy
One-hot array
The output file contains a 3D one-hot array of shape R x N x L (Read x Nucleotide/Amino acid x Letter).
- The order of nucleotides is ACGT (+NVHDBMRWSYK) for DNA and ACGU (+NVHDBMRWSYK) for RNA.
- The order of amino acids is ACDEFGHIKLMNPQRSTVWY (+XBZJ).
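The array layout described above can be illustrated with a short NumPy sketch. This is not seq2onehot's actual implementation; the function `encode_onehot` and the `ALPHABETS` table are hypothetical names, and details such as upper-casing the input, the float dtype, and raising an error on unknown letters are assumptions made for the illustration.

```python
import numpy as np

# Column orders as listed above: (standard letters, ambiguous letters).
ALPHABETS = {
    "dna": ("ACGT", "NVHDBMRWSYK"),
    "rna": ("ACGU", "NVHDBMRWSYK"),
    "protein": ("ACDEFGHIKLMNPQRSTVWY", "XBZJ"),
}


def encode_onehot(seqs, seq_type="dna", ambiguous=False):
    """Encode equal-length sequences as a (reads, positions, letters) array."""
    core, extra = ALPHABETS[seq_type]
    letters = core + extra if ambiguous else core
    index = {letter: i for i, letter in enumerate(letters)}

    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")

    onehot = np.zeros((len(seqs), lengths.pop(), len(letters)))
    for r, seq in enumerate(seqs):
        for p, letter in enumerate(seq.upper()):
            # A letter outside the chosen alphabet raises KeyError here.
            onehot[r, p, index[letter]] = 1.0
    return onehot
```

Under these assumptions, the last axis has 4 columns for DNA or RNA and 20 for protein, growing to 15 and 24 respectively when the ambiguous letters are appended.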
# Original sequences:
## ACGTACGTACGTACGT
## CCCCCCCCTTTTTTTT
import numpy as np

onehot = np.load("dna.npy")
onehot.shape
# (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)
onehot
# array([[[1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.],
# [1., 0., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 1., 0.],
# [0., 0., 0., 1.]],
# [[0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 1., 0., 0.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.],
# [0., 0., 0., 1.]]])
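To check an array like the one above by eye, each row can be mapped back to its letter by taking the position of the 1 along the last axis. The sketch below is only an illustration (onehot2seq is the dedicated tool for decoding); `decode_onehot` is a hypothetical helper, and the `letters` argument must match the column order used when encoding.

```python
import numpy as np


def decode_onehot(onehot, letters="ACGT"):
    """Turn a (reads, positions, letters) one-hot array back into strings."""
    onehot = np.asarray(onehot)
    if onehot.ndim != 3 or onehot.shape[2] != len(letters):
        raise ValueError("expected shape (reads, positions, len(letters))")
    # argmax over the last axis gives the column index of the 1 at each position.
    indices = onehot.argmax(axis=2)
    return ["".join(letters[i] for i in row) for row in indices]
```

Applied to the array shown above, `decode_onehot(np.load("dna.npy"))` would recover the two original sequences, ACGTACGTACGTACGT and CCCCCCCCTTTTTTTT.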