dna_parser
dna_parser is a small Python library, written in Rust, that performs encoding and feature extraction on DNA and RNA sequences for machine learning.
Install
For now, you need to have the Rust programming language installed on your computer to install the library.
Run the following command on a Unix-like OS to install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Alternatively, see more options at https://www.rust-lang.org/tools/install.
Then, to install the test version, run:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple dna-parser
Usage
import dna_parser
Loading Fasta Files
# Load both metadata and sequences as tuples (metadata, sequences)
metadata_and_sequences = dna_parser.load_fasta("path/to/fasta/file")
# Load sequences only
sequences = dna_parser.seq_from_fasta("path/to/fasta/file")
# Load metadata only
metadata = dna_parser.metadata_from_fasta("path/to/fasta/file")
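To make the FASTA layout concrete, here is a minimal pure-Python sketch of splitting a FASTA stream into metadata and sequences. It is not the library's implementation: `parse_fasta` is a hypothetical helper, and returning a list of (metadata, sequence) pairs is an assumption about how the tuples are organized.

```python
from io import StringIO


def parse_fasta(handle):
    """Return a list of (metadata, sequence) tuples from a FASTA text stream.

    Hypothetical helper for illustration; dna_parser may structure its
    return value differently.
    """
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = StringIO(">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n")
records = parse_fasta(example)
metadata = [m for m, _ in records]
sequences = [s for _, s in records]
```

The `metadata` and `sequences` lists built at the end mirror what the metadata-only and sequence-only loaders are described as returning.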
Encodings
The library currently supports only ordinal encoding, one-hot encoding and Term Frequency-Inverse Document Frequency (TF-IDF) encoding.
Ordinal Encoding
Nucleotides are currently encoded as follows:
- A= 0.25
- C= 0.50
- G= 0.75
- T/U= 1.00
- Other characters or gaps = 0
# Returns a list of 1D numpy arrays representing the encoding
encoding = dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences: List of strings (representing your sequences).
- pad_type: whether to pad (or trim) "before" or "after" each sequence.
- pad_length: -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest, 0 for no padding, or any positive number for a fixed length.
- n_jobs: number of threads to use to encode the sequences. 0 to use all available CPUs.
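The pad_type and pad_length rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `resolve_length` and `pad_or_trim` are hypothetical helpers, the fill value is assumed to be 0 (the value listed for gaps), and trimming "before" is assumed to drop values from the start of the sequence.

```python
def resolve_length(lengths, pad_length):
    """Translate a pad_length setting into a target length (None = leave as is)."""
    if pad_length == -2:
        return max(lengths)  # pad to the longest sequence
    if pad_length == -1:
        return min(lengths)  # trim to the shortest sequence
    if pad_length == 0:
        return None          # no padding or trimming
    return pad_length        # fixed length


def pad_or_trim(values, target, pad_type, fill=0.0):
    """Pad with `fill` or trim so that `values` has exactly `target` items."""
    values = list(values)
    if target is None:
        return values
    if len(values) >= target:
        # Assumption: "before" removes items from the start, "after" from the end.
        if pad_type == "before":
            return values[len(values) - target:]
        return values[:target]
    padding = [fill] * (target - len(values))
    return padding + values if pad_type == "before" else values + padding


# Two already-encoded sequences ("ACG" and "A") of unequal length.
encoded = [[0.25, 0.50, 0.75], [0.25]]
target = resolve_length([len(e) for e in encoded], -2)
padded = [pad_or_trim(e, target, "after") for e in encoded]
```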
OneHot Encoding
Nucleotides are currently encoded as follows:
- A= [1,0,0,0]
- C= [0,1,0,0]
- G= [0,0,1,0]
- T/U= [0,0,0,1]
- Other characters or gaps = [0,0,0,0]
# Returns a list of 2D numpy arrays representing the encoding
encoding = dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences: List of strings (representing your sequences).
- pad_type: whether to pad (or trim) "before" or "after" each sequence.
- pad_length: -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest, 0 for no padding, or any positive number for a fixed length.
- n_jobs: number of threads to use to encode the sequences. 0 to use all available CPUs.
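The one-hot mapping listed above can be reproduced with a small lookup table. The sketch below uses plain Python lists rather than numpy arrays; `onehot_sketch` is a hypothetical name, it only recognizes uppercase letters, and the one-row-per-nucleotide orientation is an assumption about the shape of the library's 2D arrays.

```python
# Lookup table taken from the mapping listed above; T and U share a vector.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}
UNKNOWN = [0, 0, 0, 0]  # other characters and gaps


def onehot_sketch(sequence):
    """Encode one sequence as a list of 4-element rows, one row per character."""
    # list(...) copies each row so callers cannot mutate the shared table.
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in sequence]


rows = onehot_sketch("ACGU-N")
```

Here the gap character and the ambiguous base N both fall through to the all-zero row.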
TF-IDF Encoding
Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by whitespace. To do so, you can use the make_kmers function (see the Other Functions section).
encoding = dna_parser.tfidf_encoding(corpus)
Function Arguments:
- corpus: List of strings (representing your sequences).
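For readers unfamiliar with TF-IDF, here is a minimal sketch of the textbook formulation on a whitespace-separated k-mer corpus. It assumes tf = count / document length and idf = log(N / document frequency); the library's exact weighting, smoothing, normalization and output layout are not documented here and may differ. `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_sketch(corpus):
    """Return (vocabulary, rows) with one TF-IDF row per document.

    Textbook variant only: tf = count / doc length, idf = log(N / df).
    """
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        rows.append([
            (counts[word] / total) * math.log(n_docs / df[word]) if total else 0.0
            for word in vocab
        ])
    return vocab, rows


corpus = ["ACG TTA ACG", "ACG GGC"]
vocab, rows = tfidf_sketch(corpus)
```

In this variant a k-mer that occurs in every document (ACG above) gets a weight of zero, since its idf is log(1).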
Other Functions
Generating Random sequences
This function generates random DNA, RNA or amino acid sequences and returns them in a list.
sequences = dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)
Function Arguments:
- length: integer representing the length of the sequences.
- nb_of_seq: integer representing the number of sequences to generate.
- seq_type: string representing the type of sequence: "dna", "rna" or "aa" (for amino acids).
- n_jobs: number of threads to use to generate the sequences. 0 to use all available CPUs.
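As a rough picture of what such a generator does, the sketch below draws characters uniformly from a per-type alphabet using only the standard library. The alphabets, the uniform sampling and the `seed` parameter are assumptions made for illustration; `random_seq_sketch` is a hypothetical name and does not use threads.

```python
import random

# Assumed alphabets: 4 nucleotides for DNA/RNA, the 20 standard amino acids.
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",
}


def random_seq_sketch(length, nb_of_seq, seq_type, seed=None):
    """Return nb_of_seq random strings of the given length and type."""
    rng = random.Random(seed)  # seed is a hypothetical extra, for reproducibility
    alphabet = ALPHABETS[seq_type]
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(nb_of_seq)]


generated = random_seq_sketch(10, 3, "rna", seed=0)
```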
Making K-mers in Sequences
This function takes a string and returns a new one with whitespace inserted to form words of length k.
seq_k_mers = dna_parser.make_kmers(seq, k)
Function Arguments:
- seq: string representing a sequence
- k: integer representing the length of words to form
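One reading of the description above is that the sequence is cut into consecutive, non-overlapping words of length k. The sketch below implements that reading; whether the library uses overlapping k-mers instead, and how it treats a shorter trailing word, are assumptions. `make_kmers_sketch` is a hypothetical name.

```python
def make_kmers_sketch(seq, k):
    """Insert a space every k characters (non-overlapping words)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Assumption: a trailing word shorter than k is kept as is.
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))


words = make_kmers_sketch("ACGTACGT", 3)
```

The resulting string has the whitespace-separated form that the TF-IDF encoding expects for each entry of its corpus.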