
Automated, cross-platform Representative Embedding generation.


SeqRep

This tool is still in active development. Install with pip install SeREGen.

SeREGen is a biological sequence representation tool packaged as a Python library. The methodology is computationally intensive and runs best on a modern machine with multiple CPU cores and a powerful GPU. The central idea is to train a machine learning model that converts individual DNA sequences into n-dimensional points, such that the distance between any two points is correlated with the true dissimilarity of their parent DNA sequences.
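The goal stated above can be checked numerically. The sketch below does not use SeREGen's API. It takes four hand-made toy sequences and four hand-picked 2-D points standing in for a trained model's output, then measures how well pairwise embedding distances track pairwise sequence dissimilarity. Hamming distance is used as the "true" dissimilarity purely for brevity, and all names (`hamming`, `pearson`, `points`) are illustrative assumptions.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data: the points are made up by hand, not produced by a real model.
seqs = ["ACGTACGT", "ACGTACGA", "TTGTACGA", "TTGGCCGA"]
points = [(0.0, 0.0), (0.1, 0.0), (0.4, 0.3), (0.9, 0.8)]

pairs = list(combinations(range(len(seqs)), 2))
true_d = [hamming(seqs[i], seqs[j]) for i, j in pairs]
emb_d = [math.dist(points[i], points[j]) for i, j in pairs]

# A well-trained embedding should push this value toward 1.
r = pearson(true_d, emb_d)
print(f"correlation between true and embedding distances: {r:.3f}")
```

A correlation near 1 means the layout of points preserves the ordering of sequence dissimilarities, which is the property the model is trained for.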

Currently, a preliminary library implementation of this methodology is offered, with the goals of being easy to use, efficient, and highly extensible. Some knowledge of machine learning and TensorFlow is helpful but not required. A copy of SILVA's 16S database is included, along with several Jupyter notebooks that show this tool's applications.

Tutorial

FAQ

Q: What is the difference between Distance and Embedding Distance?

A: Generally, Distance refers to the distance between model inputs (Euclidean between k-mer count arrays, Levenshtein between string sequences, etc.). Embedding distance is the distance metric used in the embedding space, which the model is trained to match to the true Distance. It is either 'euclidean' or 'hyperbolic'.
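To make the two kinds of distance concrete, here is a minimal standalone sketch that does not call SeREGen itself. The first two functions illustrate input-side distances (Euclidean between k-mer count arrays, Levenshtein between strings). The last one illustrates a hyperbolic embedding distance using the Poincaré ball formula, which is an assumption: the text above does not say which hyperbolic model the library uses. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k=2, alphabet="ACGT"):
    """Count every k-mer over the alphabet, in a fixed lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def poincare_distance(u, v):
    """Hyperbolic distance in the Poincaré ball (assumed model; needs |u|, |v| < 1)."""
    nu = sum(x * x for x in u)
    nv = sum(x * x for x in v)
    diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1 + 2 * diff / ((1 - nu) * (1 - nv)))


# Input-side distances between two sequences.
print(math.dist(kmer_counts("ACGT"), kmer_counts("ACGA")))
print(levenshtein("ACGT", "ACGA"))

# Embedding-side distances between two points inside the unit ball.
print(math.dist((0.0, 0.0), (0.5, 0.0)))
print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
```

The input-side functions play the role of the true Distance; the embedding-side functions play the role of the metric the trained points are compared under.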

Q: I'm getting this error when calling fit: ValueError: tried to create variables on non-first call.

A: If you previously ran a text_input model, you need to reinitialize the model before calling fit again.


Below is the legacy documentation. This is very out of date.

Inside the SeqRep library are the following modules:

  • dataset_builder.py
    • DatasetBuilder
      • Imports FASTA data into a custom Dataset object and parses out taxonomic information from FASTA headers automatically. Header parsing is designed to be extensible to additional formats.
    • Dataset
      • Builds on top of the pandas DataFrame to allow easy importing of FASTA data, parsing out of taxonomic information, and dataset filtering.
      • Taxonomic information can be added after object creation if the source is something other than the FASTA headers.
      • Integrates with visualization module to simplify plot generation.
  • comparative_encoder.py
    • ComparativeEncoder
      • Converts a TensorFlow encoder model into a comparative encoder model. Takes a Distance object as an argument.
      • Designed to be as generic and extensible as possible, and can function on any input shape and with any output size. Encoder model can either be built using included utilities or programmed from scratch and passed as an argument.
  • encoders.py
    • ModelBuilder
      • Helpful class that can reduce the difficulty of designing an encoder model.
  • distance.py
    • EuclideanDistance
      • Currently the only available distance metric. Implements a simple Euclidean distance between the inputs and normalizes that distance as a z-score.
      • Works best when the distribution of distances between randomly sampled points in the dataset is approximately normal (as in the SILVA dataset).
  • visualize.py
    • repr_scatterplot
      • The most basic plotting function: wraps matplotlib to draw a scatterplot of sequence representations.
    • reprs_by_taxa
      • Filters the input arrays with an optional boolean mask and plots the points belonging to each value of a given taxonomic level, colored by taxonomic classification. Arguments, in order:
        • sequence representations
        • Dataset object
        • taxonomic level to target (string)
        • plot title
        • alpha: optional alpha value for points
        • filter: optional minimum number of sequences a taxon must contain to be plotted
        • savepath: optional save path for the generated figure
        • mask: optional boolean mask to apply before plotting
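The filtering behind reprs_by_taxa, as described in the last entry of the list above, can be sketched without the library. This is not the library's implementation: `group_by_taxon` and its parameters are hypothetical names, and applying the mask before the minimum-count filter is an assumption about ordering that the description does not settle.

```python
from collections import Counter, defaultdict


def group_by_taxon(points, labels, min_count=1, mask=None):
    """Group points by taxon label, dropping masked-out rows and rare taxa."""
    if mask is None:
        mask = [True] * len(points)
    # Assumption: the boolean mask is applied first.
    kept = [(p, t) for p, t, m in zip(points, labels, mask) if m]
    # Then taxa with fewer than min_count remaining points are dropped.
    counts = Counter(t for _, t in kept)
    groups = defaultdict(list)
    for p, t in kept:
        if counts[t] >= min_count:
            groups[t].append(p)
    return dict(groups)


# Toy data: four 2-D representations labeled at some taxonomic level.
points = [(0, 0), (1, 1), (2, 2), (3, 3)]
labels = ["A", "A", "B", "A"]

print(group_by_taxon(points, labels, min_count=2))
print(group_by_taxon(points, labels, min_count=2, mask=[True, False, True, True]))
```

Each resulting group could then be drawn in its own color with one matplotlib scatter call per taxon.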

Many of these files have reasonable internal documentation, so it's worth looking at that for assistance as well.
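As a companion to the EuclideanDistance entry above, the following sketch shows one way a Euclidean distance can be normalized as a z-score: estimate the mean and standard deviation of distances between randomly sampled pairs, then standardize each new distance against them. The function names, the sampling scheme, and the toy vectors are all assumptions for illustration, not the library's code.

```python
import math
import random
import statistics


def fit_zscore(vectors, n_pairs=1000, seed=0):
    """Estimate mean and std of Euclidean distance over random pairs."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_pairs):
        a, b = rng.sample(vectors, 2)  # two distinct rows
        sample.append(math.dist(a, b))
    return statistics.fmean(sample), statistics.pstdev(sample)


def zscore_distance(a, b, mean, std):
    """Euclidean distance expressed in standard deviations from the mean."""
    return (math.dist(a, b) - mean) / std


# Toy stand-in for k-mer count arrays: 50 random 8-dimensional vectors.
data_rng = random.Random(1)
vectors = [[data_rng.random() for _ in range(8)] for _ in range(50)]

mean, std = fit_zscore(vectors)
print(zscore_distance(vectors[0], vectors[1], mean, std))
```

Standardizing this way is most meaningful when the sampled distances are roughly normal, which matches the note in the list above about the SILVA dataset.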
