dna_parser
dna-parser is a Python library written in Rust to encode (or perform feature extraction on) DNA/RNA sequences for machine learning.
Install
To install dna-parser simply run:
pip install dna-parser
If no Python wheel is available for your OS, you can install Rust and then re-install dna-parser, which should now compile on your machine. Run the following command on a Unix-like OS to install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
or see more options at https://www.rust-lang.org/tools/install.
Usage
import dna_parser
Loading Fasta Files
# load both metadata and sequences in tuples (metadata, sequences)
metadata_and_sequences = dna_parser.load_fasta("path/to/fasta/file")
# load sequences only
sequences = dna_parser.seq_from_fasta("path/to/fasta/file")
# load metadata only
metadata = dna_parser.metadata_from_fasta("path/to/fasta/file")
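To make the metadata/sequence split concrete, here is a minimal pure-Python sketch of reading FASTA records. It is not dna_parser's implementation: the `parse_fasta` helper is hypothetical, it assumes the metadata is the header line without the leading `>`, and it returns a list of (metadata, sequence) pairs, which may differ from the layout the library returns.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Hypothetical reader: collect (metadata, sequence) pairs from FASTA lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example_lines = [">seq1 sample", "ACGT", "TTGA", ">seq2", "GGCC"]
records = parse_fasta(example_lines)
metadata_only = [meta for meta, _ in records]
sequences_only = [seq for _, seq in records]
```

With a real file, the same helper could be fed an open file handle, since iterating over a file yields its lines.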
Encodings
Currently, only ordinal encoding, one-hot encoding, cross encoding and Term Frequency-Inverse Document Frequency (TF-IDF) are supported.
Ordinal Encoding
Nucleotides are currently encoded as follows:
- A= 0.25
- C= 0.50
- G= 0.75
- T/U= 1.0
- Other characters or gaps = 0
#returns a list of 1D numpy arrays representing the encoding
encoding= dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
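The mapping and the pad_length rules above can be illustrated with a small pure-Python sketch. This is not the library's Rust code: `target_length`, `pad_or_trim` and `ordinal_sketch` are hypothetical names, plain lists stand in for numpy arrays, and several details are assumptions (padding uses 0, trimming "before" drops values from the start, only uppercase letters are matched).

```python
from typing import List, Optional

# Mapping taken from the list above; anything else encodes to 0.
ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.0, "U": 1.0}


def target_length(seqs: List[str], pad_length: int) -> Optional[int]:
    """Turn pad_length into a concrete length; None means 'leave as is'."""
    if pad_length == 0:
        return None
    if pad_length == -2:
        return max(len(s) for s in seqs)  # pad to the longest sequence
    if pad_length == -1:
        return min(len(s) for s in seqs)  # trim to the shortest sequence
    return pad_length  # fixed positive length


def pad_or_trim(values: List[float], length: Optional[int], pad_type: str) -> List[float]:
    if length is None:
        return values
    missing = length - len(values)
    if missing > 0:
        filler = [0.0] * missing  # assumption: padding value is 0
        return filler + values if pad_type == "before" else values + filler
    # Assumption: "before" trims from the start, "after" trims from the end.
    if pad_type == "before":
        return values[len(values) - length:]
    return values[:length]


def ordinal_sketch(seqs: List[str], pad_type: str = "after", pad_length: int = 0) -> List[List[float]]:
    length = target_length(seqs, pad_length)
    return [
        pad_or_trim([ORDINAL.get(base, 0.0) for base in seq], length, pad_type)
        for seq in seqs
    ]


padded = ordinal_sketch(["ACGT", "AC"], "after", -2)
trimmed = ordinal_sketch(["ACGT", "AC"], "after", -1)
```

Here `padded` brings both sequences to length 4 and `trimmed` cuts both to length 2, which mirrors the -2 and -1 settings described above.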
One-Hot Encoding
Nucleotides are currently encoded as follows:
- A= [1,0,0,0]
- C= [0,1,0,0]
- G= [0,0,1,0]
- T/U= [0,0,0,1]
- Other characters or gaps = [0,0,0,0]
#returns a list of 2D numpy arrays representing the encoding
encoding= dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
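As a rough illustration of the one-hot mapping above, the following pure-Python sketch encodes a single sequence into one row of four values per nucleotide. `onehot_sketch` is a hypothetical helper, not the library function, and the row-per-nucleotide orientation is an assumption: the source only states that the result is a 2D array.

```python
from typing import List

# Mapping taken from the list above; unknown characters and gaps are all zeros.
ONEHOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}


def onehot_sketch(seq: str) -> List[List[int]]:
    """Hypothetical helper: one row of 4 values per character (assumed layout)."""
    return [list(ONEHOT.get(base, [0, 0, 0, 0])) for base in seq]


onehot_rows = onehot_sketch("ACGTN")
```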
Cross Encoding
Nucleotides are currently encoded as follows:
- A= [0,-1]
- C= [-1,0]
- G= [1,0]
- T/U= [0,1]
- Other characters or gaps = [0,0]
#returns a list of 2D numpy arrays representing the encoding
encoding= dna_parser.cross_encoding(sequences, pad_type, pad_length, n_jobs)
Function Arguments:
- sequences (list of str): list of genomic sequences.
- pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
- pad_length (int; default= 0): -2 to pad according to the longest sequence, -1 to trim to the shortest sequence, 0 for no padding, any positive number for a fixed length.
- n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all CPUs available.
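One property that follows from the cross-encoding table above is that complementary bases (A/T and C/G) map to opposite vectors. The sketch below reproduces the mapping in pure Python and checks that property. `cross_sketch` is a hypothetical helper, and the row-per-nucleotide layout is an assumption rather than something the source states.

```python
from typing import List

# Mapping taken from the list above; unknown characters and gaps are (0, 0).
CROSS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1), "U": (0, 1)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def cross_sketch(seq: str) -> List[List[int]]:
    """Hypothetical helper: one row of 2 values per character (assumed layout)."""
    return [list(CROSS.get(base, (0, 0))) for base in seq]


cross_rows = cross_sketch("ACGTN")

# Each base's vector is the negation of its complement's vector.
complements_are_opposite = all(
    CROSS[partner] == (-CROSS[base][0], -CROSS[base][1])
    for base, partner in COMPLEMENT.items()
)
```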
TF-IDF Encoding
Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by whitespace. To do so, you can use the make_kmers function (see the Other Functions section).
encoding= dna_parser.tfidf_encoding(corpus)
Function Arguments:
- corpus (list of str): genomic sequences.
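For readers unfamiliar with TF-IDF, here is a small pure-Python sketch of one common variant applied to a whitespace-separated k-mer corpus. The source does not say which weighting, smoothing or normalization dna_parser applies, so this is an assumption throughout: term frequency is the count divided by the number of words in the sequence, and inverse document frequency is ln(N / df) with no smoothing. `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_sketch(corpus: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Hypothetical helper: unsmoothed TF-IDF over whitespace-separated words."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many sequences each word appears at least once.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty sequences
        rows.append([counts[word] / total * idf[word] for word in vocab])
    return vocab, rows


vocab, rows = tfidf_sketch(["ACG TTA ACG", "ACG GGC"])
```

In this toy corpus "ACG" occurs in both sequences, so its weight is zero under this variant, while "TTA" and "GGC" each get a positive weight in the one sequence that contains them.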
Other Functions
Generating Random Sequences
This function generates random DNA, RNA or amino acid sequences and returns them in a list.
sequences= dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)
Function Arguments:
- length (int): length of the sequences.
- nb_of_seq (int): number of sequences to generate.
- seq_type (str; default= "dna"): type of sequences. "dna", "rna" or "aa" (for amino acid).
- n_jobs (int; default= 1): number of threads to use to generate the sequences. 0 to use all CPUs available.
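A single-threaded pure-Python sketch of the same idea is shown below, using the standard `random` module. It is only an illustration: `random_seq_sketch` and its `seed` parameter are hypothetical, and the alphabets are assumptions (the four standard bases for DNA and RNA, the 20 standard amino acids for "aa"); the library's actual alphabets and sampling are not described in the source.

```python
import random
from typing import List, Optional

# Assumed alphabets; the library may use different symbol sets.
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",
}


def random_seq_sketch(
    length: int,
    nb_of_seq: int,
    seq_type: str = "dna",
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: draw each character uniformly from the alphabet."""
    rng = random.Random(seed)  # seed is an addition for reproducibility
    alphabet = ALPHABETS[seq_type]
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(nb_of_seq)]


rna_examples = random_seq_sketch(10, 3, "rna", seed=0)
```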
Making K-mers in Sequences
This function takes a string and returns a new one with whitespace inserted to form words of length k.
seq_k_mers= dna_parser.make_kmers(seq, k)
Function Arguments:
- seq (str): the genomic sequence.
- k (int): length of words to create in the sequence.
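The description above can be read in more than one way, so the following sketch fixes one interpretation: the sequence is cut into consecutive, non-overlapping words of length k, and a shorter final word is kept. Both points are assumptions, as is the hypothetical name `make_kmers_sketch`; the library might instead produce overlapping (sliding-window) k-mers or handle the remainder differently.

```python
def make_kmers_sketch(seq: str, k: int) -> str:
    """Hypothetical helper: insert a space after every k characters."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Assumption: non-overlapping chunks; the last chunk may be shorter than k.
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))


kmer_string = make_kmers_sketch("ACGTACGT", 3)
```

A string produced this way has the whitespace-separated form that the TF-IDF encoding section asks for.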