dna_parser

dna-parser is a Python library written in Rust that encodes DNA/RNA sequences (or performs feature extraction on them) for machine learning.

Table of contents

  1. Install
  2. Usage
    1. Loading Fasta Files
    2. Encodings
    3. Other Functions

Install

To install dna-parser simply run:

pip install dna-parser

If no Python wheel is available for your OS, install Rust and then re-install dna-parser, which should now compile on your machine. On a Unix-like OS, run the following command to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

or see more options at https://www.rust-lang.org/tools/install.

Usage

import dna_parser

Loading Fasta Files

# load both metadata and sequences in tuples (metadata, sequences)
metadata_and_sequences = dna_parser.load_fasta("path/to/fasta/file")

# load sequences only
sequences = dna_parser.seq_from_fasta("path/to/fasta/file")

# load metadata only
metadata = dna_parser.metadata_from_fasta("path/to/fasta/file")
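
To make the (metadata, sequence) pairing concrete, here is a minimal pure-Python sketch of how FASTA lines can be grouped into such tuples. It does not call dna_parser; `parse_fasta_lines` is a hypothetical helper, and treating everything after `>` as metadata and joining wrapped sequence lines are assumptions about the format, not a description of the library's internals.

```python
from typing import Iterable, List, Tuple


def parse_fasta_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: group FASTA lines into (metadata, sequence) tuples."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them for joining.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">seq1 sample A", "ACGT", "TTGA", ">seq2", "GGCC"]
records = parse_fasta_lines(example)
metadata = [m for m, _ in records]
sequences = [s for _, s in records]
```

Splitting the records into `metadata` and `sequences` lists at the end mirrors the three loading modes shown above.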

Encodings

The library currently supports only ordinal encoding, one-hot encoding, cross encoding and Term Frequency-Inverse Document Frequency (TF-IDF).

Ordinal Encoding

Nucleotides are currently encoded as follows:

  • A= 0.25
  • C= 0.50
  • G= 0.75
  • T/U= 1.0
  • Other characters or gaps = 0

# returns a list of 1D numpy arrays representing the encoding
encoding = dna_parser.ordinal_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
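
The mapping and the pad_length rules above can be sketched in plain Python. This is an illustration only and does not call dna_parser: `ordinal_encode`, `resolve_length` and `pad_or_trim` are hypothetical helpers, the sketch handles uppercase input only, and two behaviours are assumptions rather than documented facts: padding uses 0 (the gap value), and "before" trims from the start of the sequence.

```python
from typing import List, Optional

ORDINAL = {"A": 0.25, "C": 0.50, "G": 0.75, "T": 1.0, "U": 1.0}


def ordinal_encode(seq: str) -> List[float]:
    """Map each nucleotide to its ordinal value; anything else becomes 0."""
    return [ORDINAL.get(base, 0.0) for base in seq]


def resolve_length(lengths: List[int], pad_length: int) -> Optional[int]:
    """Translate pad_length into a target length (None means leave as-is)."""
    if pad_length == -2:
        return max(lengths)   # pad to the longest sequence
    if pad_length == -1:
        return min(lengths)   # trim to the shortest sequence
    if pad_length == 0:
        return None           # no padding
    return pad_length         # fixed length


def pad_or_trim(values: List[float], target: Optional[int],
                pad_type: str = "after", fill: float = 0.0) -> List[float]:
    """Pad with `fill` or trim so that len(result) == target."""
    if target is None:
        return list(values)
    n = len(values)
    if n >= target:
        # Assumption: "after" drops the tail, "before" drops the head.
        return list(values[:target]) if pad_type == "after" else list(values[n - target:])
    padding = [fill] * (target - n)
    return list(values) + padding if pad_type == "after" else padding + list(values)


seqs = ["ACGT", "AC", "GATTACA"]
encoded = [ordinal_encode(s) for s in seqs]
target = resolve_length([len(e) for e in encoded], -2)
padded = [pad_or_trim(e, target, "after") for e in encoded]
```

With pad_length set to -2, every row in `padded` has the length of the longest input, which is the shape needed to stack the encodings into a single array.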

OneHot Encoding

Nucleotides are currently encoded as follows:

  • A= [1,0,0,0]
  • C= [0,1,0,0]
  • G= [0,0,1,0]
  • T/U= [0,0,0,1]
  • Other characters or gaps = [0,0,0,0]

# returns a list of 2D numpy arrays representing the encoding
encoding = dna_parser.onehot_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
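
As a quick illustration of the one-hot table above, the following plain-Python sketch builds the encoding for a single sequence without calling dna_parser. `onehot_encode` is a hypothetical helper, it handles uppercase input only, and the layout (one row per nucleotide, four columns) is an assumption about the shape of the returned 2D arrays.

```python
from typing import List

ONEHOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "U": [0, 0, 0, 1],
}


def onehot_encode(seq: str) -> List[List[int]]:
    """One row per nucleotide; unknown characters and gaps give an all-zero row."""
    return [list(ONEHOT.get(base, [0, 0, 0, 0])) for base in seq]


matrix = onehot_encode("ACGN")
```

Because unknown characters map to an all-zero row, padding with zeros is indistinguishable from a run of gaps in this representation.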

Cross Encoding

Nucleotides are currently encoded as follows:

  • A= [0,-1]
  • C= [-1,0]
  • G= [1,0]
  • T/U= [0,1]
  • Other characters or gaps = [0,0]

# returns a list of 2D numpy arrays representing the encoding
encoding = dna_parser.cross_encoding(sequences, pad_type, pad_length, n_jobs)

Function Arguments:

  • sequences (list of str): list of genomic sequences.
  • pad_type (str; default= "after"): pad (or trim) "before" or "after" the sequences.
  • pad_length (int; default= 0): -2 to pad to the length of the longest sequence, -1 to trim to the length of the shortest sequence, 0 for no padding, any positive number for a fixed length.
  • n_jobs (int; default= 1): number of threads to use to encode the sequences. 0 to use all available CPUs.
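
One property that follows directly from the table above is that complementary bases (A/T and C/G) get opposite vectors, so encoding the complement of a sequence equals negating its encoding. The sketch below checks this in plain Python without calling dna_parser; `cross_encode` and `complement` are hypothetical helpers and handle uppercase DNA only.

```python
from typing import List, Tuple

CROSS = {"A": (0, -1), "C": (-1, 0), "G": (1, 0), "T": (0, 1), "U": (0, 1)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def cross_encode(seq: str) -> List[Tuple[int, int]]:
    """One 2D point per nucleotide; unknown characters and gaps map to (0, 0)."""
    return [CROSS.get(base, (0, 0)) for base in seq]


def complement(seq: str) -> str:
    """Base-wise DNA complement; characters without a complement are kept."""
    return "".join(COMPLEMENT.get(base, base) for base in seq)


seq = "ACGT"
encoded = cross_encode(seq)
encoded_complement = cross_encode(complement(seq))
negated = [(-x, -y) for x, y in encoded]
```

Here `encoded_complement` and `negated` hold the same values, which is what makes this two-column encoding more compact than one-hot while still separating the four bases.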

TF-IDF Encoding

Note that for this function, your sequences need to be split into words (or k-mers), with each word separated by a whitespace. To do so, you can use the make_kmers function (see the Other Functions section).

encoding= dna_parser.tfidf_encoding(corpus)

Function Arguments:

  • corpus (list of str): genomic sequences.
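
For readers unfamiliar with TF-IDF, here is a minimal textbook version applied to whitespace-separated k-mers. It does not call dna_parser and is not meant to reproduce its output: the library's exact weighting (smoothing, normalization, column order) is not stated here, so the formula tf * log(N / df), the alphabetical vocabulary order and the helper name `tfidf` are all assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf(corpus: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on empty documents
        rows.append([
            (counts[word] / total) * math.log(n_docs / df[word])
            for word in vocab
        ])
    return vocab, rows


corpus = ["ACG TTA ACG", "ACG GGC"]
vocab, rows = tfidf(corpus)
```

In this unsmoothed variant, a k-mer that occurs in every document (ACG here) gets a weight of 0, since it carries no information for telling the documents apart.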

Other Functions

Generating Random sequences

This function generates random DNA, RNA or amino acid sequences and returns them in a list.

sequences = dna_parser.random_seq(length, nb_of_seq, seq_type, n_jobs)

Function Arguments:

  • length (int): length of the sequences.
  • nb_of_seq (int): number of sequences to generate.
  • seq_type (str; default= "dna"): type of sequences. "dna", "rna" or "aa" (for amino acid).
  • n_jobs (int; default= 1): number of threads to use to generate the sequences. 0 to use all available CPUs.
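
A rough pure-Python equivalent, useful for seeing what kind of output to expect, is sketched below. It does not call dna_parser: `random_sequences` is a hypothetical helper, the `seed` parameter is added only for reproducibility, and both the uniform sampling and the 20-letter amino acid alphabet are assumptions, since the alphabets the library draws from are not listed here.

```python
import random
from typing import List, Optional

# Assumed alphabets; the library's own alphabets may differ.
ALPHABETS = {
    "dna": "ACGT",
    "rna": "ACGU",
    "aa": "ACDEFGHIKLMNPQRSTVWY",
}


def random_sequences(length: int, nb_of_seq: int, seq_type: str = "dna",
                     seed: Optional[int] = None) -> List[str]:
    """Draw nb_of_seq sequences of the given length, uniformly per position."""
    rng = random.Random(seed)
    alphabet = ALPHABETS[seq_type]
    return ["".join(rng.choices(alphabet, k=length)) for _ in range(nb_of_seq)]


seqs = random_sequences(10, 3, "rna", seed=0)
```

The result is a list of three strings of length 10 that can be passed straight to any of the encoding functions.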

Making K-mers in Sequences

This function takes a string and returns a new one with whitespaces inserted to form words of length k.

seq_k_mers= dna_parser.make_kmers(seq, k)

Function Arguments:

  • seq (str): the genomic sequence.
  • k (int): length of words to create in the sequence.
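
The description does not say whether the words overlap, so the sketch below shows both readings in plain Python without calling dna_parser. `split_kmers` (consecutive, non-overlapping words) and `sliding_kmers` (overlapping words, shifted by one position) are hypothetical helpers; keeping a shorter final word in the non-overlapping case is also an assumption.

```python
def split_kmers(seq: str, k: int) -> str:
    """Non-overlapping reading: cut the sequence every k characters."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return " ".join(seq[i:i + k] for i in range(0, len(seq), k))


def sliding_kmers(seq: str, k: int) -> str:
    """Overlapping reading: every window of length k, shifted by one."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


chunked = split_kmers("ACGTACGT", 3)
windows = sliding_kmers("ACGTAC", 3)
```

Either output has the whitespace-separated form that tfidf_encoding expects for each entry of its corpus.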
