dna_parser
dna-parser is a Python library written in Rust to encode (or perform feature extraction on) DNA/RNA sequences for machine learning.
Install
To install dna-parser simply run:
pip install dna-parser
If no Python wheel is available for your OS, you can install Rust and then re-install dna-parser, which should then compile on your machine. On a Unix-like OS, run the following command to install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
or see more options at https://www.rust-lang.org/tools/install.
Usage
import dna_parser
Loading Fasta Files
#load both metadata and sequence in tuples (metadata,sequences)
metadata_and_sequences= dna_parser.load_fasta("path/to/fasta/file")
#load sequence only
sequences= dna_parser.seq_from_fasta("path/to/fasta/file")
#load metadata only
metadata= dna_parser.metadata_from_fasta("path/to/fasta/file")
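For readers who want to see what loading a FASTA file involves, here is a minimal pure-Python sketch. It is not the library's implementation: read_fasta is a hypothetical helper, and it assumes that header lines start with ">" and that a sequence may span several lines. Whether load_fasta returns a list of tuples or a tuple of lists is not stated above; the sketch uses a list of (metadata, sequence) tuples.

```python
import io

def read_fasta(handle):
    """Return a list of (metadata, sequence) tuples from a FASTA text stream."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# In-memory example so the sketch runs without a file on disk.
example = io.StringIO(">seq1 sample\nACGT\nTTGA\n>seq2\nGGCC\n")
records = read_fasta(example)
metadata = [m for m, _ in records]
sequences = [s for _, s in records]
```

Splitting the records into metadata and sequences afterwards mirrors the three loading functions above.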
Encodings
Currently, only ordinal encoding, one-hot encoding, cross encoding and Term Frequency-Inverse Document Frequency (TF-IDF) are supported.
Ordinal Encoding
Nucleotides are currently encoded as follows:
- A= 0.25
- C= 0.50
- G= 0.75
- T/U= 1.0
- Other characters or gaps = 0
#returns a list of 1D numpy arrays representing the encoding
encoding= dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
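The mapping and the pad_length rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: ordinal_sketch and resolve_length are hypothetical names, padding is assumed to use 0 (the value listed for gaps), input is upper-cased before lookup, and trimming with pad_type "before" is assumed to drop values from the start of the sequence.

```python
# Ordinal values taken from the table above; anything else maps to 0.
ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.0, "U": 1.0}

def resolve_length(sequences, pad_length):
    """Translate pad_length into a target length (None means no padding)."""
    if pad_length == -2:
        return max(len(s) for s in sequences)  # pad to the longest
    if pad_length == -1:
        return min(len(s) for s in sequences)  # trim to the shortest
    if pad_length == 0:
        return None
    return pad_length  # fixed length

def ordinal_sketch(sequences, pad_type="after", pad_length=0):
    target = resolve_length(sequences, pad_length)
    encoded = []
    for seq in sequences:
        # Assumption: lower-case input is treated like upper-case.
        values = [ORDINAL.get(base, 0.0) for base in seq.upper()]
        if target is not None:
            missing = target - len(values)  # negative when trimming
            if pad_type == "after":
                values = values[:target] + [0.0] * missing
            else:
                # Assumption: "before" trims from the start of the sequence.
                values = [0.0] * missing + values[max(-missing, 0):]
        encoded.append(values)
    return encoded

padded = ordinal_sketch(["ACGT", "AC"], pad_length=-2)
trimmed = ordinal_sketch(["ACGT", "AC"], pad_length=-1)
```

Here padded holds two lists of length 4 and trimmed holds two lists of length 2.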
OneHot Encoding
Nucleotides are currently encoded as follows:
- A= [1,0,0,0]
- C= [0,1,0,0]
- G= [0,0,1,0]
- T/U= [0,0,0,1]
- Other characters or gaps = [0,0,0,0]
#returns a list of 2D numpy arrays representing the encoding
encoding= dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
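As an illustration of the one-hot table above, the following pure-Python sketch encodes a single sequence without padding. onehot_sketch is a hypothetical helper, not the library's function, and the layout (one row of four values per nucleotide) is an assumption, since the orientation of the returned 2D arrays is not stated here.

```python
# One-hot vectors taken from the table above.
ONEHOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}

def onehot_sketch(seq):
    """Encode one sequence as a list of 4-element rows."""
    # Unknown characters and gaps fall back to the all-zero vector.
    return [list(ONEHOT.get(base, [0, 0, 0, 0])) for base in seq]

matrix = onehot_sketch("ACGUN")
shape = (len(matrix), len(matrix[0]))  # (sequence length, 4)
```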
Cross Encoding
Nucleotides are currently encoded as follows:
- A= [0,-1]
- C= [-1,0]
- G= [1,0]
- T/U= [0,1]
- Other characters or gaps = [0,0]
#returns a list of 2D numpy arrays representing the encoding
encoding= dna_parser.cross_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
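One property that follows from the cross-encoding table above is that complementary bases (A/T and C/G) receive opposite vectors, so a base and its complement sum to [0,0]. The sketch below checks this in plain Python; cross_sketch and COMPLEMENT are hypothetical names introduced for illustration, and padding is left out.

```python
# Cross-encoding vectors taken from the table above.
CROSS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1), "U": (0, 1)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def cross_sketch(seq):
    """Encode one sequence as a list of 2-element vectors."""
    # Unknown characters and gaps fall back to (0, 0).
    return [CROSS.get(base, (0, 0)) for base in seq]

seq = "ACGT"
encoded = cross_sketch(seq)
complement = cross_sketch("".join(COMPLEMENT[b] for b in seq))
# Position by position, a base plus its complement cancels out.
sums = [(a[0] + b[0], a[1] + b[1]) for a, b in zip(encoded, complement)]
```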
TF-IDF Encoding
Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by a whitespace. To do so, you can use the make_kmers function (see the Other Functions section).
encoding= dna_parser.tfidf_encoding(corpus)
Function Arguments:
- corpus (list of str): genomic sequences, each already split into whitespace-separated words (k-mers).
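To make the expected input concrete, here is a small textbook TF-IDF sketch over a whitespace-separated corpus. The exact weighting used by tfidf_encoding (smoothing, normalisation, output type) is not described here, so the formula below, term frequency times log(N / document frequency), is an assumption, and tfidf_sketch is a hypothetical helper rather than the library's function.

```python
import math
from collections import Counter

def tfidf_sketch(corpus):
    """Return (vocabulary, rows) with one TF-IDF row per document."""
    docs = [doc.split() for doc in corpus]  # words are whitespace-separated
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on empty documents
        rows.append([counts[word] / total * idf[word] for word in vocab])
    return vocab, rows

vocab, rows = tfidf_sketch(["ACG CGT", "ACG ACG"])
```

With this formula, a word that occurs in every document (ACG in the example) gets a weight of 0, while CGT, which occurs in only one document, gets a positive weight.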
Other Functions
Generating Random sequences
This function generates random DNA, RNA or amino acid sequences and returns them in a list.
sequences= dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)
Function Arguments:
- length (int): length of the sequences.
- nb_of_seq (int): number of sequences to generate.
- seq_type (str; default= "dna"): type of sequences. "dna", "rna" or "aa" (for amino acid).
- n_jobs (int; default= 1): number of threads to use to generate the sequences. 0 to use all CPUs available.
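For comparison, a single-threaded equivalent can be sketched with the standard library. This is not the library's implementation: random_seq_sketch and its seed parameter are hypothetical, and the alphabets (in particular the 20 standard amino acids for "aa") are assumptions about which characters are drawn.

```python
import random

# Assumed alphabets; the library's exact character sets are not stated here.
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",
}

def random_seq_sketch(length, nb_of_seq, seq_type="dna", seed=None):
    """Return nb_of_seq random sequences of the given length."""
    rng = random.Random(seed)  # seed is a hypothetical extra for reproducibility
    alphabet = ALPHABETS[seq_type]
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(nb_of_seq)]

seqs = random_seq_sketch(10, 3, "rna", seed=0)
```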
Making K-mers in Sequences
This function takes a string and returns a new one with whitespaces inserted to form words of length k.
seq_k_mers= dna_parser.make_kmers(seq, k)
Function Arguments:
- seq (str): the genomic sequence.
- k (int): length of words to create in the sequence.
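The description above does not say whether the words overlap, so the sketch below covers both readings through a hypothetical step parameter: step equal to k (the default here) gives consecutive, non-overlapping words, and step=1 gives a sliding window. make_kmers_sketch is an illustrative helper, not the library's function, and it drops a trailing fragment shorter than k, which is also an assumption.

```python
def make_kmers_sketch(seq, k, step=None):
    """Return seq as whitespace-separated words of length k."""
    step = k if step is None else step  # default: non-overlapping words
    # Only start positions that leave room for a full word are used.
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]
    return " ".join(words)

chunks = make_kmers_sketch("ACGTACGT", 4)
sliding = make_kmers_sketch("ACGTAC", 3, step=1)
```

Either form produces the whitespace-separated input that tfidf_encoding expects.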