Encoding biological sequences to a one-hot NumPy array
Description
seq2onehot is a command-line tool that encodes DNA, RNA, or protein sequences as a one-hot NumPy array.
:warning: All sequences must be the same length.
To decode a one-hot numpy array to sequences, use onehot2seq.
https://github.com/akikuno/onehot2seq
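Because every sequence must have the same length, it can help to check a FASTA file before encoding it. The following is a minimal sketch of such a check, not part of seq2onehot itself; the helper names `read_fasta_sequences` and `sequences_have_equal_length` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, List


def read_fasta_sequences(lines: Iterable[str]) -> List[str]:
    """Collect sequences from FASTA-formatted lines (hypothetical helper).

    Header lines start with ">"; sequence lines that follow a header are
    concatenated until the next header.
    """
    sequences: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            sequences.append("")  # start a new record
        elif sequences:
            sequences[-1] += line  # extend the current record
    return sequences


def sequences_have_equal_length(sequences: List[str]) -> bool:
    """Return True when all sequences share one length (or the list is empty)."""
    return len({len(seq) for seq in sequences}) <= 1


if __name__ == "__main__":
    example = [">read1", "ACGTACGT", "ACGTACGT", ">read2", "CCCCCCCCTTTTTTTT"]
    print(sequences_have_equal_length(read_fasta_sequences(example)))
```

To check a real file, pass an open file handle (for example `open("in.fasta")`) instead of the in-memory list.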
Installation
You can install seq2onehot using pip:
pip install seq2onehot
Usage
seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>
Options
-a/--ambiguous: include ambiguous characters
The ambiguous characters are:
- XBZJ for amino acids
- NVHDBMRWSYK for DNA and RNA
The ambiguous characters are described in detail here:
https://meme-suite.org/meme/doc/alphabets.html
Examples
# DNA sequences
seq2onehot -t dna -i example/dna.fasta -o dna.npy
# RNA sequences
seq2onehot -t rna -i example/rna.fasta -o rna.npy
# Protein sequences
seq2onehot -t protein -i example/protein.fasta -o protein.npy
One-hot array
The output file contains a 3D one-hot array of shape R x N x L (Read x Nucleotide/Amino acid x Letter).
- The order of nucleotides is ACGT(+NVHDBMRWSYK) for DNA and ACGU(+NVHDBMRWSYK) for RNA
- The order of amino acids is ACDEFGHIKLMNPQRSTVWY(+XBZJ)
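To make the Read x Nucleotide x Letter layout concrete, here is a minimal sketch of how such an array could be built with NumPy. It is an illustration only and not seq2onehot's actual implementation; the function name `one_hot_encode`, the float dtype, and the error handling are assumptions. Only the `ACGT` letter order comes from the list above.

```python
import numpy as np

DNA_LETTERS = "ACGT"  # letter order as documented for DNA without ambiguous characters


def one_hot_encode(sequences, letters=DNA_LETTERS):
    """Encode equal-length sequences as an array of shape (reads, positions, letters).

    Hypothetical helper for illustration; not the tool's own code.
    """
    index = {letter: i for i, letter in enumerate(letters)}
    length = len(sequences[0])
    onehot = np.zeros((len(sequences), length, len(letters)))
    for r, seq in enumerate(sequences):
        if len(seq) != length:
            raise ValueError("all sequences must be the same length")
        for n, letter in enumerate(seq):
            # Set the single "hot" entry for this read and position.
            onehot[r, n, index[letter]] = 1.0
    return onehot


if __name__ == "__main__":
    encoded = one_hot_encode(["ACGTACGTACGTACGT", "CCCCCCCCTTTTTTTT"])
    print(encoded.shape)
```

In this sketch, a letter that is not in `letters` raises a `KeyError`.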
import numpy as np
# Original sequences:
## ACGTACGTACGTACGT
## CCCCCCCCTTTTTTTT
onehot = np.load("dna.npy")
onehot.shape
# (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)
onehot
# array([[[1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.]],
#        [[0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.]]])
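An array like the one shown above can be read back into letters by taking the position of the 1 along the last axis. The sketch below illustrates that idea with `np.argmax`; it is not the onehot2seq implementation, and the function name `one_hot_decode` is hypothetical. It assumes the `ACGT` letter order used for DNA without ambiguous characters.

```python
import numpy as np

DNA_LETTERS = "ACGT"  # assumed letter order for DNA without ambiguous characters


def one_hot_decode(onehot, letters=DNA_LETTERS):
    """Turn an array of shape (reads, positions, letters) back into strings.

    Hypothetical helper for illustration; not the onehot2seq code.
    """
    # argmax over the last axis gives the letter index at each position.
    indices = np.argmax(onehot, axis=-1)
    return ["".join(letters[i] for i in row) for row in indices]


if __name__ == "__main__":
    # Build a small (2, 4, 4) one-hot array by indexing an identity matrix.
    demo = np.eye(4)[[[0, 1, 2, 3], [1, 1, 3, 3]]]
    print(one_hot_decode(demo))
```

Note that `np.argmax` returns 0 for an all-zero row, so this sketch would report such a position as `A`; real data should be validated first.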