
SeqPro (Sequence processing toolkit)

import seqpro as sp

SeqPro is a Python package for processing DNA/RNA sequences, with limited support for protein sequences. SeqPro is fully functional on its own but is heavily utilized by other packages including SeqData, MotifData, SeqExplainer, and EUGENe.

All functions in SeqPro take as input a string, a list of strings, a NumPy array of strings, a NumPy array of single-character bytes (S1), or a NumPy array of one-hot encoded strings. There is also emerging integration with Xarray through the seqpro.xr submodule.

Computational bottlenecks and code that cannot be vectorized with NumPy alone are accelerated with Numba, e.g. padding sequences, one-hot encoding, and converting from one-hot encoding back to nucleotides.
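Padding shows why some steps resist vectorization: sequences of different lengths form a ragged collection, so each one has to be handled individually before the batch fits in a rectangular array. The plain-Python sketch below illustrates the idea only. The helper name `pad_right` and its truncation behavior are assumptions made for this example, not SeqPro's Numba implementation.

```python
def pad_right(seqs, pad_value="N", max_length=None):
    """Right-pad every sequence to a common length (hypothetical helper)."""
    # Default target length: the longest sequence in the batch.
    target = max(len(s) for s in seqs) if max_length is None else max_length
    padded = []
    for s in seqs:
        # Assumption: sequences longer than the target are truncated.
        s = s[:target]
        padded.append(s + pad_value * (target - len(s)))
    return padded


print(pad_right(["ACGT", "AC", "A"]))  # ['ACGT', 'ACNN', 'ANNN']
```

Once every sequence has the same length, the batch can be stored as a 2D array and later steps can operate on whole arrays at once.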

Installation

pip install seqpro

Sequence cleaners (cleaners)

Remove sequences with ambiguous bases

# Padding
sp.pad_seqs(seqs, pad="right", pad_value="N", max_length=None)

# One-hot encoding
sp.ohe(seqs, alphabet=sp.alphabets.DNA)

# Decode one-hot encoding
sp.decode_ohe(ohe, ohe_axis=1, alphabet=sp.alphabets.DNA, unknown_char="N")

# Reverse complement
sp.reverse_complement(seqs, alphabet=sp.alphabets.DNA)

# k-let preserving shuffling
sp.k_shuffle(seqs, k=2, length_axis=1, seed=1234)

# Calculating GC content
sp.gc_content(seqs, normalize=True)

# Generating random sequences
sp.random_seqs(shape=(N, L), alphabet=sp.alphabets.DNA, seed=1234)

# Randomly jittering sequences
sp.jitter(seqs, max_jitter=128, length_axis=1, seed=1234)
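To make a few of the operations above concrete, here is a minimal pure-Python sketch of one-hot encoding, decoding, and reverse complementing a single sequence. The helper names, the `ACGT` column order, and the choice to encode unknown characters as all-zero rows are assumptions made for this illustration; SeqPro's own functions work on arrays and may differ in detail.

```python
DNA = "ACGT"  # assumed alphabet order for this sketch


def one_hot(seq, alphabet=DNA):
    """Encode a sequence as a list of 0/1 rows, one row per position."""
    # Characters outside the alphabet (e.g. "N") become all-zero rows.
    return [[int(base == letter) for letter in alphabet] for base in seq]


def decode_one_hot(rows, alphabet=DNA, unknown_char="N"):
    """Invert one_hot; all-zero rows decode to unknown_char."""
    return "".join(
        alphabet[row.index(1)] if 1 in row else unknown_char for row in rows
    )


def reverse_complement(seq):
    """Swap each base for its complement, then reverse the sequence."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(complement)[::-1]


encoded = one_hot("ACGN")
print(encoded)  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
print(decode_one_hot(encoded))  # ACGN
print(reverse_complement("AACGT"))  # ACGTT
```

Encoding followed by decoding returns the original sequence as long as every character is either in the alphabet or equal to the unknown character.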

Manipulating coverage

# Collapse coverage to a given bin width
sp.bin_coverage(coverage, bin_width=128, length_axis=1, normalize=False)

# Can jitter coverage and sequences so they stay aligned
sp.jitter((seqs, coverage), max_jitter=128, length_axis=1, seed=1234)
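The two coverage operations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not SeqPro's implementation: the helper names are hypothetical, `normalize=True` is assumed to mean the per-bin mean rather than the sum, and jittering is assumed to crop a window of length `L - 2 * max_jitter` at a random offset.

```python
import random


def bin_coverage(coverage, bin_width, normalize=False):
    """Sum coverage in consecutive, non-overlapping bins."""
    # Assumption: the track length must be a multiple of bin_width.
    if len(coverage) % bin_width != 0:
        raise ValueError("length must be divisible by bin_width")
    bins = [
        sum(coverage[i:i + bin_width])
        for i in range(0, len(coverage), bin_width)
    ]
    # Assumption: normalize=True reports the per-bin mean instead of the sum.
    return [b / bin_width for b in bins] if normalize else bins


def jitter_together(seq, coverage, max_jitter, seed=None):
    """Crop seq and coverage with one shared random offset."""
    if len(seq) != len(coverage) or len(seq) <= 2 * max_jitter:
        raise ValueError("inputs must match and be longer than 2 * max_jitter")
    rng = random.Random(seed)
    # A single offset for both tracks keeps their positions aligned.
    start = rng.randint(0, 2 * max_jitter)
    out_length = len(seq) - 2 * max_jitter
    return seq[start:start + out_length], coverage[start:start + out_length]


print(bin_coverage([1, 2, 3, 4, 5, 6], bin_width=2))  # [3, 7, 11]
seq, cov = jitter_together("ACGTACGT", list(range(8)), max_jitter=2, seed=0)
print(seq, cov)
```

Because the same offset is applied to both inputs, each coverage value in the output still refers to the base at the same position in the cropped sequence.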

## One-hot encoding

```python
sp.ohe(seqs)
```

Sequence analysis (analyzers)

Calculate sequence properties (e.g. GC content)

sp.gc_content(seqs)
sp.nucleotide_content(seqs)
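As a rough picture of what these two properties measure, the sketch below computes them for a single non-empty string using only the standard library. The function bodies, the `normalize` behavior, and the `alphabet` parameter are assumptions for illustration and are not taken from SeqPro's source.

```python
from collections import Counter


def gc_content(seq, normalize=True):
    """Count G and C bases; return a fraction when normalize is True."""
    gc = sum(base in "GC" for base in seq.upper())
    return gc / len(seq) if normalize else gc


def nucleotide_content(seq, alphabet="ACGT", normalize=True):
    """Count each letter of the alphabet, optionally as a fraction."""
    counts = Counter(seq.upper())
    total = len(seq)
    return {
        base: (counts[base] / total if normalize else counts[base])
        for base in alphabet
    }


print(gc_content("ACGTGG", normalize=False))  # 4
print(nucleotide_content("AACG", normalize=False))  # {'A': 2, 'C': 1, 'G': 1, 'T': 0}
```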

More to come!

All contributions, including bug reports, documentation improvements, and enhancement suggestions, are welcome. Everyone within the community is expected to abide by our code of conduct.

Preparing sequences for sequence-to-function models


