Biotext bridges Bioinformatics and NLP by encoding data into a Biological Sequence-Like format. Combined with the SWeeP method, it supports vector operations for efficient analysis.
Biotext Python Package
The Biotext Python Package bridges Bioinformatics and Natural Language Processing (NLP) by adapting biological sequence analysis techniques for Text Mining. At its core, Biotext uses the SWeeP algorithm, originally designed for biomolecular data, to introduce SWeePtex (SWeeP for text), a method for large-scale text representation and analysis. The package provides two encoding schemes that convert text to a Biological Sequence-Like (BSL) format: AMINOcode, which models text in an amino acid-like (AAL) format, and DNAbits, which uses a nucleotide-like (NTL) representation. Because SWeeP operates on biological sequences, converting text to BSL makes it compatible with SWeeP and enables applications in text similarity, clustering, and machine learning while maintaining computational efficiency.
Features
- aminocode: Implements AMINOcode. Encodes and decodes text using amino acid representations.
- dnabits: Implements DNAbits. Encodes and decodes text using DNA binary representations.
- sweeptex: Implements SWeePtex. Generates fixed-length vector representations of text using the SWeeP algorithm.
- sweeptex_emb: Implements Biotext Embedding. Processes text data through a pipeline to generate word and document embeddings.
Installation
pip install biotext
Modules
aminocode
Encode and decode text using amino acid representations.
from biotext import aminocode
# Encode a string; 'dp' requests detailed encoding of digits ('d') and punctuation ('p')
encoded = aminocode.encode_string("Hello world!", 'dp')
print(encoded) # Output: 'HYELLYQYSYWYQRLDYPW'
# Decode a string
decoded = aminocode.decode_string("HYELLYQYSYWYQRLDYPW", 'dp')
print(decoded) # Output: 'hello world!' (letter case is not preserved)
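To illustrate what "amino acid-like" means in practice, the following minimal sketch checks whether a string uses only the 20 standard amino acid one-letter codes. It assumes that AMINOcode output is restricted to that alphabet, which is consistent with the example above but is not confirmed here for every input. The helper name `is_amino_acid_like` is hypothetical and is not part of the biotext API.

```python
# Assumption: AAL sequences use only the 20 standard amino acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_amino_acid_like(seq: str) -> bool:
    """Return True if seq is non-empty and contains only standard amino acid letters."""
    return bool(seq) and set(seq.upper()) <= AMINO_ACIDS


# The encoded string from the example above passes the check.
print(is_amino_acid_like("HYELLYQYSYWYQRLDYPW"))
# The raw text does not: 'o', the space, and '!' are outside the alphabet.
print(is_amino_acid_like("Hello world!"))
```

A check like this can serve as a quick sanity test before passing encoded text to tools that expect protein-style input.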
dnabits
Encode and decode text using DNA binary representations.
from biotext import dnabits
# Encode a string
encoded = dnabits.encode_string("Hello world!")
print(encoded) # Output: 'AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA'
# Decode a string
decoded = dnabits.decode_string("AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA")
print(decoded) # Output: 'Hello world!'
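The example output above is consistent with a simple 2-bit scheme: each byte becomes four nucleotides, taken two bits at a time starting from the least significant pair, with A=00, C=01, G=10, T=11. The sketch below re-implements that mapping in plain Python. The mapping is inferred from the single example shown here, so treat it as an illustration rather than the package's actual implementation; the function names are hypothetical, and the use of UTF-8 for non-ASCII text is an assumption.

```python
# Illustrative sketch inferred from the documented example; not biotext's own code.
NUCLEOTIDES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def encode_sketch(text: str) -> str:
    """Encode each byte as four nucleotides, least significant bit pair first."""
    out = []
    for byte in text.encode("utf-8"):  # UTF-8 is an assumption beyond ASCII
        for shift in (0, 2, 4, 6):
            out.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(out)


def decode_sketch(seq: str) -> str:
    """Invert encode_sketch: every four nucleotides rebuild one byte."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for pos, nt in enumerate(seq[i:i + 4]):
            byte |= NUCLEOTIDES.index(nt) << (2 * pos)
        data.append(byte)
    return data.decode("utf-8")


encoded = encode_sketch("Hello world!")
print(encoded)
print(decode_sketch(encoded))
```

Because every byte maps to exactly four nucleotides, the encoding is lossless and the encoded length is always four times the byte length of the input.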
sweeptex
Generate fixed-length vector representations of text using SWeePtex.
from biotext import sweeptex
corpus = ["This is a sample text", "Another text example"]
embeddings = sweeptex(corpus, emb_size=1200)
print(embeddings.shape) # Output: (2, 1200)
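Fixed-length vectors make text similarity a matter of comparing rows. The sketch below computes cosine similarity between two vectors in plain Python. It uses small stand-in vectors rather than real SWeePtex output, and it assumes only that each row of `embeddings` can be iterated as a sequence of numbers. The function name `cosine_similarity` is defined here and is not part of the biotext API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Stand-in vectors; in practice these would be rows of `embeddings`.
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]  # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

For large corpora, a vectorized implementation over the whole embedding matrix would be more efficient than this pairwise loop.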
sweeptex_emb
Process text data through a pipeline to generate word and document embeddings.
from biotext import sweeptex_emb
corpus = ["First document", "Second document text", "Third example"]
results = sweeptex_emb(corpus, return_doc_emb=True, return_word_emb=True)
print(results['doc_emb'].shape) # Document embeddings
print(results['word_emb'].shape) # Word embeddings
Authors
- Diogo de Jesus Soares Machado
- Roberto Tadeu Raittz
License
This project is licensed under a non-commercial license. See the LICENSE.txt file for details.
File details
Details for the file biotext-2025.6.27.tar.gz.
File metadata
- Download URL: biotext-2025.6.27.tar.gz
- Upload date:
- Size: 23.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
Algorithm | Hash digest
---|---
SHA256 | 522d59466497350f16b2a01732289e10da4c68efb9beb1df570a9174257cecde
MD5 | a522f09ccbf678cb610659f3153ecb55
BLAKE2b-256 | 7eee63a736523fb30d6d88526ab526c6e95cda6044b02e4b8aad31accaa3cd4c
File details
Details for the file biotext-2025.6.27-py3-none-any.whl.
File metadata
- Download URL: biotext-2025.6.27-py3-none-any.whl
- Upload date:
- Size: 47.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6c61d26e8fe95d4f92d5abf674c7e2ea6b41dde7eee25e651f7a2192ba05e853
MD5 | c5a63232846f9825712b899b67c3d99c
BLAKE2b-256 | 11e3b7dc1ecc3a2bfa0e1b17cf2c32ead7df136c928cb9136481dfaad44f92d7
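The SHA256 digests listed for each file can be checked against a local download. The following minimal sketch uses only the standard library `hashlib` module; the helper names are hypothetical, and the local file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling `matches_published_digest` with the path of a downloaded `biotext-2025.6.27.tar.gz` and the SHA256 value listed for that file should return `True` for an intact download.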