Biotext bridges Bioinformatics and NLP by encoding data into a Biological Sequence-Like format. Combined with the SWeeP method, it supports vector operations for efficient analysis.

Biotext Python Package

License: Non-Commercial

The Biotext Python Package bridges Bioinformatics and Natural Language Processing (NLP) by adapting biological sequence analysis techniques to Text Mining. At its core, Biotext uses the SWeeP algorithm, originally designed for biomolecular data, to introduce SWeePtex (SWeeP for text), a method for large-scale text representation and analysis. The package provides two encoding schemes that convert text to a Biological Sequence-Like (BSL) format: AMINOcode, which models text in an amino acid-like (AAL) format, and DNAbits, which uses a nucleotide-like (NTL) representation. Converting text to BSL makes it compatible with SWeeP, which enables applications in text similarity, clustering, and machine learning while maintaining computational efficiency.

Features

  • aminocode: Implements AMINOcode. Encodes and decodes text using amino acid representations.
  • dnabits: Implements DNAbits. Encodes and decodes text using DNA binary representations.
  • sweeptex: Implements SWeePtex. Generates fixed-length vector representations of text using the SWeeP algorithm.
  • sweeptex_emb: Implements Biotext Embedding. Processes text data through a pipeline to generate word and document embeddings.

Installation

pip install biotext

Modules

aminocode

Encode and decode text using amino acid representations.

from biotext import aminocode

# Encode a string
encoded = aminocode.encode_string("Hello world!", 'dp')
print(encoded)  # Output: 'HYELLYQYSYWYQRLDYPW'

# Decode a string
decoded = aminocode.decode_string("HYELLYQYSYWYQRLDYPW", 'dp')
print(decoded)  # Output: 'hello world!' (the original letter case is not restored)
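
To make the idea of an amino acid-like substitution code concrete, here is a minimal, self-contained sketch. It does not use the package. The table is hypothetical and deliberately incomplete: it contains only the characters of "Hello world!", with codes read off the single example output above. The names `TOY_TABLE`, `toy_amino_encode`, and `toy_amino_decode` are illustrative, and the real AMINOcode table and its handling of the `'dp'` argument may differ.

```python
# Partial table inferred from the one documented example; hypothetical and
# incomplete, NOT the package's real AMINOcode table.
TOY_TABLE = {
    "h": "H", "e": "YE", "l": "L", "o": "YQ", " ": "YS",
    "w": "YW", "r": "R", "d": "D", "!": "YPW",
}
TOY_REVERSE = {code: char for char, code in TOY_TABLE.items()}


def toy_amino_encode(text: str) -> str:
    """Lowercase the text, then replace each character with its code."""
    # Characters outside the toy table raise KeyError.
    return "".join(TOY_TABLE[ch] for ch in text.lower())


def toy_amino_decode(seq: str) -> str:
    """Read codes left to right; no code in the toy table is a prefix of another."""
    out, i = [], 0
    while i < len(seq):
        for width in (1, 2, 3):
            code = seq[i:i + width]
            if code in TOY_REVERSE:
                out.append(TOY_REVERSE[code])
                i += width
                break
        else:
            raise ValueError(f"unknown code at position {i}")
    return "".join(out)


encoded = toy_amino_encode("Hello world!")
print(encoded)
print(toy_amino_decode(encoded))
```

Because the text is lowercased before substitution, decoding cannot recover the capital "H", which matches the lowercase result in the example above.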

dnabits

Encode and decode text using DNA binary representations.

from biotext import dnabits

# Encode a string
encoded = dnabits.encode_string("Hello world!")
print(encoded)  # Output: 'AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA'

# Decode a string
decoded = dnabits.decode_string("AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA")
print(decoded)  # Output: 'Hello world!'
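
The example output above is consistent with a simple 2-bit scheme: each byte becomes four nucleotides, lowest bit pair first, with 00, 01, 10, 11 mapped to A, C, G, T. The sketch below is an inference from that one example, not the package's implementation. The function names and the choice of UTF-8 are assumptions, and behavior for non-ASCII input may differ in the real module.

```python
# Assumed mapping inferred from the documented example:
# 0b00 -> A, 0b01 -> C, 0b10 -> G, 0b11 -> T.
NUCLEOTIDES = "ACGT"


def toy_dnabits_encode(text: str) -> str:
    """Encode each UTF-8 byte as four nucleotides, lowest bit pair first."""
    out = []
    for byte in text.encode("utf-8"):
        for shift in (0, 2, 4, 6):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)


def toy_dnabits_decode(seq: str) -> str:
    """Invert toy_dnabits_encode: rebuild one byte from every four nucleotides."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for j, nt in enumerate(seq[i:i + 4]):
            byte |= NUCLEOTIDES.index(nt) << (2 * j)
        data.append(byte)
    return data.decode("utf-8")


encoded = toy_dnabits_encode("Hello world!")
print(encoded)
print(toy_dnabits_decode(encoded))
```

For "Hello world!" this sketch produces the same 48-character sequence as the example above. Unlike the amino acid-like example, every bit of the input is kept, so the round trip preserves letter case.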

sweeptex

Generate fixed-length vector representations of text using SWeePtex.

from biotext import sweeptex

corpus = ["This is a sample text", "Another text example"]
embeddings = sweeptex(corpus, emb_size=1200)
print(embeddings.shape)  # Output: (2, 1200)
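
Fixed-length vectors can be compared directly, which is what makes text similarity possible. The following sketch computes cosine similarity in plain Python. It uses short stand-in vectors rather than real SWeePtex output, and `cosine_similarity` is an illustrative helper, not part of the package. It assumes each row of `embeddings` can be iterated as a sequence of numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Stand-in vectors; with the package these would be rows of `embeddings`.
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]  # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Under that assumption, a call such as `cosine_similarity(embeddings[0], embeddings[1])` would score the two documents in the corpus above, with values near 1 indicating similar vectors.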

sweeptex_emb

Process text data through a pipeline to generate word and document embeddings.

from biotext import sweeptex_emb

corpus = ["First document", "Second document text", "Third example"]
results = sweeptex_emb(corpus, return_doc_emb=True, return_word_emb=True)

print(results['doc_emb'].shape)  # Document embeddings
print(results['word_emb'].shape)  # Word embeddings

Authors

  • Diogo de Jesus Soares Machado
  • Roberto Tadeu Raittz

License

This project is licensed under a non-commercial license. See the LICENSE.txt file for details.
