GeneVecTools

Reading in a Variety of Genetic File Types

Vector Embedding Algorithms

Byte Array Encoders

Clustering and Preprocessing Steps for Compression

Similarity Search Tools for FASTA/FASTQ files

Installing

Tester files: https://tinyurl.com/cDNALibraryExampleFiles

.. code-block:: bash

    pip install GeneVecTools

Usage

.. code-block:: python

    >>> from GeneVecTools import simSearch
    >>> from GeneVecTools import reader
    >>> from GeneVecTools import mapper
    >>> from GeneVecTools import encoder

.. code-block:: python

    # file is the path to "small_cDNA_Sequences_pbmc_1k_v2_S1_L002_R2_001.fastq",
    # downloaded from https://tinyurl.com/cDNALibraryExampleFiles.
    # If the file is in the current directory, the file name alone is enough.
    >>> file = "small_cDNA_Sequences_pbmc_1k_v2_S1_L002_R2_001.fastq"

.. code-block:: python

    # f is the path to the input file
    # length is the number of sequences to load into scope
    # encoding is one of three choices: "one-hot-encoding", "standard", or "no-encoding"
    # bits is one of three choices: 2, 4, or 8
    >>> VECSS = simSearch.VecSS(f=file, length=10000, encoding="one-hot-encoding", bits=8)
    >>> sequences = VECSS.readq()
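
To make the "one-hot-encoding" option more concrete, here is a minimal sketch of what one-hot encoding a DNA sequence usually means. The helper names one_hot_encode and one_hot_decode are hypothetical and are not part of GeneVecTools; the A, C, G, T ordering and the rejection of other symbols (such as N) are assumptions made for illustration, and the library's own layout may differ.

```python
# Hypothetical helpers for illustration only; not the GeneVecTools API.
BASES = "ACGT"  # assumed base ordering


def one_hot_encode(seq):
    """Return a flat list of 0/1 values, four per base (A, C, G, T order)."""
    vector = []
    for base in seq.upper():
        if base not in BASES:
            # Assumption: ambiguous symbols such as "N" are not supported here.
            raise ValueError(f"unsupported base: {base!r}")
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


def one_hot_decode(vector):
    """Invert one_hot_encode by locating the 1 in each group of four."""
    if len(vector) % 4 != 0:
        raise ValueError("vector length must be a multiple of 4")
    return "".join(
        BASES[vector[i:i + 4].index(1)] for i in range(0, len(vector), 4)
    )


encoded = one_hot_encode("GATTACA")
# The round trip recovers the original sequence.
assert one_hot_decode(encoded) == "GATTACA"
```

Each base costs four values in this scheme, which is why a denser representation can matter for large read sets.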

.. code-block:: python

    # The function "embed" produces the vector embedding of the sequence
    >>> embedded = VECSS.embed(VECSS.s)
    >>> print(embedded)

.. code-block:: python

    # Similarity search
    # I holds the indices of the similar sequences
    # D holds how different each similar sequence is from the query sequence
    # time is how long the similarity search query took
    >>> D, I, time = VECSS.run_search()
    >>> print(D, I, time)
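
The shape of that result (distances, indices, and elapsed time) can be illustrated with a brute-force search in plain Python. This is only a sketch under stated assumptions: the function brute_force_search is hypothetical, Hamming distance stands in for whatever metric run_search actually uses, and the library may rely on an index structure rather than an exhaustive scan.

```python
import time


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def brute_force_search(query, sequences, k=3):
    """Return (distances, indices, seconds) for the k closest sequences.

    Hypothetical stand-in for a similarity search; not the GeneVecTools API.
    """
    start = time.perf_counter()
    # Sort by distance first, then by index to break ties deterministically.
    scored = sorted((hamming(query, s), i) for i, s in enumerate(sequences))
    elapsed = time.perf_counter() - start
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top], elapsed


library = ["ACGTAC", "ACGTAA", "TTTTTT", "ACGAAC"]
D, I, elapsed = brute_force_search("ACGTAC", library, k=2)
print(D, I)  # [0, 1] [0, 1]
```

An exhaustive scan like this grows linearly with the number of reads, which is the cost that embedding-based search tools aim to reduce.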

.. code-block:: python

    # Testing the embedding and unembedding process
    >>> assert VECSS.unembed(VECSS.embed(VECSS.s)) == VECSS.s

.. code-block:: python

    # Extracting sequences
    # file is the FASTQ path defined above (the original passed the builtin "dir")
    >>> R = reader.Reader()
    >>> mp, count, total_len, quality = R.read_fastq(file)
    >>> sequences_dict_items = mp.values()
    >>> sequences = list(sequences_dict_items)
    >>> print(sequences)
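
For readers unfamiliar with the FASTQ layout, the following sketch parses simple four-line records and returns a tuple shaped like the one above. The function read_fastq_records is hypothetical and not part of GeneVecTools; it assumes unwrapped records (header, sequence, "+" line, quality line) and keys each read by the first token of its header, which may not match what Reader.read_fastq does.

```python
import io


def read_fastq_records(handle):
    """Parse 4-line FASTQ records from an open text handle.

    Returns (sequences_by_id, count, total_len, qualities_by_id).
    Hypothetical helper for illustration; not the GeneVecTools API.
    """
    sequences, qualities = {}, {}
    total_len = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        seq = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"malformed record for {header!r}")
        parts = header[1:].split()
        read_id = parts[0] if parts else ""
        sequences[read_id] = seq
        qualities[read_id] = qual
        total_len += len(seq)
    return sequences, len(sequences), total_len, qualities


sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
mp, count, total_len, quality = read_fastq_records(io.StringIO(sample))
print(list(mp.values()), count, total_len)  # ['ACGT', 'GGCA'] 2 8
```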

.. code-block:: python

    # Clustering
    >>> mapObj = mapper.Mapper(sequences, 2, 3)
    >>> groups_of_similar_kmers = mapper.cluster(mapObj.hfs)
    >>> cluster_of_sequences = mapper.groupings(groups_of_similar_kmers, sequences)
    >>> print(cluster_of_sequences)
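
As a rough illustration of grouping sequences by shared k-mers, the sketch below clusters reads greedily using Jaccard similarity between k-mer sets. It is not the Mapper algorithm: the function names, the greedy strategy, and the k and threshold values are all assumptions chosen to keep the example small.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(sequences, k=3, threshold=0.5):
    """Put each sequence in the first cluster whose representative is similar.

    Hypothetical sketch; the first member of a cluster is its representative.
    """
    clusters = []  # list of (representative k-mer set, member sequences)
    for seq in sequences:
        profile = kmers(seq, k)
        for rep_profile, members in clusters:
            if jaccard(profile, rep_profile) >= threshold:
                members.append(seq)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append((profile, [seq]))
    return [members for _, members in clusters]


reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
groups = greedy_cluster(reads, k=3, threshold=0.5)
print(groups)  # [['ACGTACGT', 'ACGTACGA'], ['TTTTGGGG', 'TTTTGGGC']]
```

Placing similar reads next to each other is what lets a later compression step exploit their redundancy.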

.. code-block:: python

    # Encoding
    # Note: this rebinds the name "encoder" from the module to an Encoder instance
    >>> encoder = encoder.Encoder(4)
    >>> c = encoder.encode_sequences(sequences)
    >>> print(c)
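
One common way to encode DNA into a byte array is to pack each base into 2 bits, four bases per byte. The sketch below shows that idea with hypothetical helpers pack_2bit and unpack_2bit; it is not the Encoder implementation, and the 4-byte length prefix and the A=0, C=1, G=2, T=3 mapping are assumptions made for illustration.

```python
# Hypothetical helpers for illustration only; not the GeneVecTools API.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit mapping
BASE = "ACGT"


def pack_2bit(seq):
    """Pack a DNA string into bytes: 4-byte length prefix, then 4 bases/byte."""
    out = bytearray(len(seq).to_bytes(4, "big"))
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:  # a full byte has been accumulated
            out.append(byte)
            byte = 0
    remainder = len(seq) % 4
    if remainder:
        # Left-align the trailing bases inside the final byte.
        out.append(byte << (2 * (4 - remainder)))
    return bytes(out)


def unpack_2bit(data):
    """Invert pack_2bit using the stored length to drop padding."""
    n = int.from_bytes(data[:4], "big")
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 3])
    return "".join(bases[:n])


packed = pack_2bit("ACGTTG")
assert unpack_2bit(packed) == "ACGTTG"
print(len(packed))  # 6 (4-byte prefix + 2 payload bytes)
```

The length prefix is what allows the padding bits in the last byte to be discarded on decode.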

.. code-block:: python

    # Compress
    >>> encoded_clusters_compressed = encoder.encode_clusters(cluster_of_sequences)
    >>> print(encoded_clusters_compressed)

.. code-block:: python

    # Decompress
    >>> decoded_clusters_compressed = encoder.decode_clusters(encoded_clusters_compressed)
    >>> print(decoded_clusters_compressed)

.. code-block:: python

    # Testing the compressing and decompressing process
    >>> assert cluster_of_sequences == decoded_clusters_compressed
