Biotext bridges Bioinformatics and NLP by encoding data into a Biological Sequence-Like format. Combined with the SWeeP method, it supports vector operations for efficient analysis.

Project description

Biotext Python Package

License: Non-Commercial

The Biotext Python Package bridges Bioinformatics and Natural Language Processing (NLP) by adapting biological sequence analysis techniques for Text Mining. At its core, Biotext uses the SWeeP algorithm, originally designed for biomolecular data, to introduce SWeePtex (SWeeP for text), a method for large-scale text representation and analysis. The package provides two text-to-BSL (Biological Sequence-Like) encoding schemes: AMINOcode, which models text in an amino acid-like (AAL) format, and DNAbits, which uses a nucleotide-like (NTL) representation. Because text converted to BSL is compatible with SWeeP, Biotext can be applied to text similarity, clustering, and machine learning while maintaining computational efficiency.

Features

aminocode: Implements AMINOcode. Encodes and decodes text using amino acid representations.
dnabits: Implements DNAbits. Encodes and decodes text using DNA binary representations.
sweeptex: Implements SWeePtex. Generates fixed-length vector representations of text using the SWeeP algorithm.
sweeptex_emb: Implements Biotext Embedding. Processes text data through a pipeline to generate word and document embeddings.

Installation

pip install biotext
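
To confirm that the install succeeded, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Biotext.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if biotext is installed, otherwise None.
print(installed_version("biotext"))
```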

Modules

aminocode

Encode and decode text using amino acid representations.

from biotext import aminocode

# Encode a string
encoded = aminocode.encode_string("Hello world!", 'dp')
print(encoded)  # Output: 'HYELLYQYSYWYQRLDYPW'

# Decode a string
decoded = aminocode.decode_string("HYELLYQYSYWYQRLDYPW", 'dp')
print(decoded)  # Output: 'hello world!'
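
The example output hints at how the encoding works: the input is lowercased, some characters map to a single amino acid letter (h to H), and others map to a multi-letter code starting with Y (e to YE). The sketch below illustrates that idea with a partial table inferred only from the "Hello world!" example above. It is not Biotext's implementation: the segmentation of the output into codes is an assumption, `PARTIAL_TABLE`, `toy_encode` and `toy_decode` are hypothetical names, and the real AMINOcode table covers many more characters.

```python
# Partial table inferred from the documented example; an assumption, not the
# full AMINOcode table shipped with Biotext.
PARTIAL_TABLE = {
    "h": "H", "e": "YE", "l": "L", "o": "YQ", " ": "YS",
    "w": "YW", "r": "R", "d": "D", "!": "YPW",
}
REVERSE_TABLE = {code: char for char, code in PARTIAL_TABLE.items()}


def toy_encode(text):
    """Lowercase the text and replace each character with its code."""
    return "".join(PARTIAL_TABLE[char] for char in text.lower())


def toy_decode(sequence):
    """Decode by matching the longest known code at each position."""
    decoded = []
    position = 0
    while position < len(sequence):
        for length in (3, 2, 1):
            chunk = sequence[position:position + length]
            if chunk in REVERSE_TABLE:
                decoded.append(REVERSE_TABLE[chunk])
                position += len(chunk)
                break
        else:
            raise ValueError(f"unknown code at position {position}")
    return "".join(decoded)


encoded = toy_encode("Hello world!")
print(encoded)              # HYELLYQYSYWYQRLDYPW
print(toy_decode(encoded))  # hello world!
```

Because the text is lowercased before encoding, the original capitalization cannot be recovered, which matches the lowercase result of the decoding example above.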

dnabits

Encode and decode text using DNA binary representations.

from biotext import dnabits

# Encode a string
encoded = dnabits.encode_string("Hello world!")
print(encoded)  # Output: 'AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA'

# Decode a string
decoded = dnabits.decode_string("AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA")
print(decoded)  # Output: 'Hello world!'
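
The output above is consistent with a simple bit-level scheme: each character's 8-bit code is split into four 2-bit pairs, emitted least-significant pair first, with 00, 01, 10 and 11 mapped to A, C, G and T. The sketch below reproduces the documented "Hello world!" example under that assumption. It is inferred from the example rather than taken from Biotext's source, it handles ASCII input only, and the names `toy_dnabits_encode` and `toy_dnabits_decode` are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def toy_dnabits_encode(text):
    """Encode ASCII text as four nucleotides per character."""
    letters = []
    for byte in text.encode("ascii"):
        for shift in (0, 2, 4, 6):  # least-significant bit pair first
            letters.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(letters)


def toy_dnabits_decode(sequence):
    """Invert toy_dnabits_encode by rebuilding each byte from four letters."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    values = bytearray()
    for start in range(0, len(sequence), 4):
        byte = 0
        for offset, letter in enumerate(sequence[start:start + 4]):
            byte |= NUCLEOTIDES.index(letter) << (2 * offset)
        values.append(byte)
    return values.decode("ascii")


encoded = toy_dnabits_encode("Hello world!")
print(encoded)  # AGACCCGCATGCATGCTTGCAAGATCTCTTGCGATCATGCACGCCAGA
print(toy_dnabits_decode(encoded))  # Hello world!
```

Under this scheme every character costs four letters, and the round trip preserves case because the full byte is kept.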

sweeptex

Generate fixed-length vector representations of text using SWeePtex.

from biotext import sweeptex

corpus = ["This is a sample text", "Another text example"]
embeddings = sweeptex(corpus, emb_size=1200)
print(embeddings.shape)  # Output: (2, 1200)
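
Since every document maps to a vector of the same length, rows of the output can be compared directly, for example with cosine similarity. The following is a minimal pure-Python sketch; the three short vectors are made-up placeholders standing in for real SWeePtex rows, and `cosine_similarity` is a hypothetical helper, not a Biotext function.

```python
import math


def cosine_similarity(first, second):
    """Cosine of the angle between two equal-length vectors."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(first, second))
    norm_first = math.sqrt(sum(a * a for a in first))
    norm_second = math.sqrt(sum(b * b for b in second))
    if norm_first == 0 or norm_second == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_first * norm_second)


# Placeholder vectors, not real embeddings.
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]  # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: no shared components
```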

sweeptex_emb

Process text data through a pipeline to generate word and document embeddings.

from biotext import sweeptex_emb

corpus = ["First document", "Second document text", "Third example"]
results = sweeptex_emb(corpus, return_doc_emb=True, return_word_emb=True)

print(results['doc_emb'].shape)  # Document embeddings
print(results['word_emb'].shape)  # Word embeddings

Authors

Diogo de Jesus Soares Machado
Roberto Tadeu Raittz

License

This project is licensed under a non-commercial license. See the LICENSE.txt file for details.

Project details

Release history

Version	Release date
2025.6.27 (this version)	Jun 27, 2025
2025.6.25	Jun 25, 2025
3.1.0.0	Jul 31, 2024
3.0.1.0	Sep 18, 2023
3.0.0.1	Sep 16, 2023
3.0.0.0	Jul 13, 2023
2.4.1.3	Nov 9, 2022
2.4.1.2	Nov 9, 2022
2.4.1.1	Nov 9, 2022
2.4.1.0	Nov 9, 2022
2.4.0.0	Mar 22, 2022
2.3.2.0	May 21, 2021
2.3.1.0	Sep 16, 2020
2.3.0.0	Aug 31, 2020
2.2.0.1	Aug 28, 2020
2.2.0.0	Aug 28, 2020
2.1.1.0	Aug 27, 2020

Download files

Source Distribution

biotext-2025.6.27.tar.gz (23.0 kB)

Uploaded Jun 27, 2025 Source

Built Distribution

biotext-2025.6.27-py3-none-any.whl (47.2 kB)

Uploaded Jun 27, 2025 Python 3

File details

Details for the file biotext-2025.6.27.tar.gz.

File metadata

Download URL: biotext-2025.6.27.tar.gz
Upload date: Jun 27, 2025
Size: 23.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for biotext-2025.6.27.tar.gz
Algorithm	Hash digest
SHA256	`522d59466497350f16b2a01732289e10da4c68efb9beb1df570a9174257cecde`
MD5	`a522f09ccbf678cb610659f3153ecb55`
BLAKE2b-256	`7eee63a736523fb30d6d88526ab526c6e95cda6044b02e4b8aad31accaa3cd4c`
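
A downloaded file can be checked against the SHA256 digest listed above before it is installed. This is a minimal sketch using the standard library's `hashlib`; the helper name `sha256_of_file` is hypothetical, and the local path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed for biotext-2025.6.27.tar.gz.
EXPECTED_SHA256 = "522d59466497350f16b2a01732289e10da4c68efb9beb1df570a9174257cecde"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("biotext-2025.6.27.tar.gz")  # assumed download location
if archive.exists():
    matches = sha256_of_file(archive) == EXPECTED_SHA256
    print("digest matches" if matches else "digest mismatch")
```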


File details

Details for the file biotext-2025.6.27-py3-none-any.whl.

File metadata

Download URL: biotext-2025.6.27-py3-none-any.whl
Upload date: Jun 27, 2025
Size: 47.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for biotext-2025.6.27-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6c61d26e8fe95d4f92d5abf674c7e2ea6b41dde7eee25e651f7a2192ba05e853`
MD5	`c5a63232846f9825712b899b67c3d99c`
BLAKE2b-256	`11e3b7dc1ecc3a2bfa0e1b17cf2c32ead7df136c928cb9136481dfaad44f92d7`
