dna_parser

dna-parser is a Python library written in Rust that encodes DNA/RNA sequences (or extracts features from them) for machine learning.

Table of contents

  1. Install
  2. Usage
    1. Loading Fasta Files
    2. Encodings
    3. Other Functions

Install

To install dna-parser simply run:

pip install dna-parser

If no Python wheel is available for your OS, install Rust and then re-install dna-parser, which should now compile on your machine. On a Unix-like OS, run the following command to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

or see more options at https://www.rust-lang.org/tools/install.

Usage

import dna_parser

Loading Fasta Files

# load both metadata and sequences in tuples (metadata, sequences)
metadata_and_sequences = dna_parser.load_fasta("path/to/fasta/file")

# load sequences only
sequences = dna_parser.seq_from_fasta("path/to/fasta/file")

# load metadata only
metadata = dna_parser.metadata_from_fasta("path/to/fasta/file")
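
To make the split between metadata and sequence concrete, here is a minimal pure-Python sketch of a FASTA reader. It is not the library's implementation: the name read_fasta is hypothetical, and the sketch assumes that "metadata" means the header line without its leading ">" and that one (metadata, sequence) tuple is produced per record.

```python
import io

def read_fasta(handle):
    """Return a list of (metadata, sequence) tuples from an open FASTA handle."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = io.StringIO(">seq1 sample\nACGT\nTTGA\n>seq2\nGGCC\n")
records = read_fasta(example)
metadata = [m for m, _ in records]    # ["seq1 sample", "seq2"]
sequences = [s for _, s in records]   # ["ACGTTTGA", "GGCC"]
```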

Encodings

The library currently supports only ordinal encoding, one-hot encoding, cross encoding and Term Frequency Inverse Document Frequency (TF-IDF).

Ordinal Encoding

Nucleotides are currently encoded as follows:

  • A= 0.25
  • C= 0.50
  • G= 0.75
  • T/U= 1.0
  • Other characters or gaps = 0

# returns a list of 1D NumPy arrays representing the encoding
encoding = dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
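
The mapping and the pad_length / pad_type options can be illustrated with a small pure-Python sketch. This is an illustration under assumptions, not the library's code: ordinal_sketch and fit_length are hypothetical names, padding is assumed to use 0 (the value given for gaps), trimming "before" is assumed to drop characters from the start, and only uppercase letters are matched because case handling is not stated above.

```python
ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.0, "U": 1.0}

def fit_length(values, target, pad_type="after", fill=0.0):
    """Pad with `fill` or trim `values` to `target` items on the chosen side."""
    if len(values) >= target:
        # Assumed: "after" drops the tail, "before" drops the head.
        return values[:target] if pad_type == "after" else values[len(values) - target:]
    padding = [fill] * (target - len(values))
    return values + padding if pad_type == "after" else padding + values

def ordinal_sketch(sequences, pad_type="after", pad_length=0):
    """Encode each sequence as a list of floats; unknown characters become 0."""
    encoded = [[ORDINAL.get(base, 0.0) for base in seq] for seq in sequences]
    if pad_length == 0:
        return encoded                               # no padding or trimming
    if pad_length == -2:
        target = max(len(e) for e in encoded)        # longest sequence
    elif pad_length == -1:
        target = min(len(e) for e in encoded)        # shortest sequence
    else:
        target = pad_length                          # fixed length
    return [fit_length(e, target, pad_type) for e in encoded]

padded = ordinal_sketch(["ACGT", "AC"], pad_length=-2)
# padded == [[0.25, 0.5, 0.75, 1.0], [0.25, 0.5, 0.0, 0.0]]
```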

OneHot Encoding

Nucleotides are currently encoded as follows:

  • A= [1,0,0,0]
  • C= [0,1,0,0]
  • G= [0,0,1,0]
  • T/U= [0,0,0,1]
  • Other characters or gaps = [0,0,0,0]

# returns a list of 2D NumPy arrays representing the encoding
encoding = dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
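
As a rough picture of what one encoded sequence looks like, the table above can be applied character by character. The sketch below uses plain lists instead of NumPy arrays; onehot_sketch is a hypothetical name, and the orientation (one row of four values per character) is an assumption, since the layout of the 2D arrays is not stated above.

```python
ONEHOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}

def onehot_sketch(seq):
    """Return one row of four values per character of `seq`."""
    # Characters missing from the table (gaps, ambiguity codes) become all zeros.
    return [list(ONEHOT.get(base, [0, 0, 0, 0])) for base in seq]

rows = onehot_sketch("ACNU")
# rows == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
```

Known nucleotides give rows that sum to 1, and any other character gives a row of zeros, so gaps and padding cannot be told apart from each other in the encoded output.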

Cross Encoding

Nucleotides are currently encoded as follows:

  • A= [0,-1]
  • C= [-1,0]
  • G= [1,0]
  • T/U= [0,1]
  • Other characters or gaps = [0,0]

# returns a list of 2D NumPy arrays representing the encoding
encoding = dna_parser.cross_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
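
One property that follows from the table above is that complementary bases receive opposite vectors (A and T/U, C and G), so the encoding of a complemented sequence is the negation of the original encoding. The sketch below checks this with plain tuples; cross_sketch, negate and the COMPLEMENT table are hypothetical helpers written for this illustration, not part of dna_parser.

```python
CROSS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1), "U": (0, 1)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def cross_sketch(seq):
    """Return one (x, y) pair per character; unknown characters map to (0, 0)."""
    return [CROSS.get(base, (0, 0)) for base in seq]

def negate(rows):
    """Flip the sign of every component."""
    return [(-x, -y) for x, y in rows]

seq = "ACGT"
complement = "".join(COMPLEMENT[base] for base in seq)   # "TGCA"
same = cross_sketch(complement) == negate(cross_sketch(seq))
# same is True
```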

TF-IDF Encoding

Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by whitespace. To do so, you can use the make_kmers function (see the Other Functions section).

encoding = dna_parser.tfidf_encoding(corpus)

Function Arguments:

  • corpus (list of str): list of genomic sequences, each already split into whitespace-separated words (k-mers).
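
For readers unfamiliar with TF-IDF, the sketch below computes a textbook variant on a whitespace-separated corpus: term frequency is the word count divided by the document length, and inverse document frequency is log(N / df). The weighting, smoothing and normalisation that dna_parser.tfidf_encoding applies are not described here, so its values may differ; tfidf_sketch is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_sketch(corpus):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    docs = [doc.split() for doc in corpus]
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / df[word])
            row.append(tf * idf)
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = tfidf_sketch(["ACG TTA ACG", "ACG GGC"])
# vocabulary == ["ACG", "GGC", "TTA"]
# "ACG" occurs in every document, so its weight is 0 in both rows.
```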

Other Functions

Generating Random Sequences

This function generates random DNA, RNA or amino acid sequences and returns them in a list.

sequences = dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)

Function Arguments:

  • length (int): length of the sequences.
  • nb_of_seq (int): number of sequences to generate.
  • seq_type (str; default= "dna"): type of sequences. "dna", "rna" or "aa" (for amino acid).
  • n_jobs (int; default= 1): number of threads to use to generate the sequences. 0 to use all available CPUs.
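
A single-threaded sketch of the same idea using only the standard library is shown below. It is an approximation under assumptions: random_seq_sketch is a hypothetical name, characters are drawn uniformly, the "aa" alphabet is assumed to be the 20 standard amino acids, and the seed parameter is added here only to make the example reproducible.

```python
import random

ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",  # assumed: the 20 standard amino acids
}

def random_seq_sketch(length, nb_of_seq, seq_type="dna", seed=None):
    """Return `nb_of_seq` random strings of `length` characters each."""
    rng = random.Random(seed)          # local generator, leaves global state alone
    alphabet = ALPHABETS[seq_type]     # raises KeyError for an unknown type
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(nb_of_seq)]

sequences = random_seq_sketch(10, 3, "rna", seed=0)
```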

Making K-mers in Sequences

This function takes a string and returns a new one with whitespace inserted to form words of length k.

seq_k_mers = dna_parser.make_kmers(seq, k)

Function Arguments:

  • seq (str): the genomic sequence.
  • k (int): length of words to create in the sequence.
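
The description above can be read as cutting the sequence into consecutive, non-overlapping words. The sketch below follows that reading; it is an assumption, since the text does not say whether words overlap or how a shorter final word is handled, and make_kmers_sketch is a hypothetical name.

```python
def make_kmers_sketch(seq, k):
    """Insert a space after every k characters (non-overlapping words)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # The last word is shorter than k when len(seq) is not a multiple of k.
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))

words = make_kmers_sketch("ACGTACGTAC", 3)
# words == "ACG TAC GTA C"
```

Strings in this form are what tfidf_encoding expects, so a corpus can be built by applying make_kmers to every sequence before encoding.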
