The official SetBERT implementation using the deepbio-toolkit.

Project description

SetBERT

The official code repository for SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

Graphical Abstract

Quick Start

Installation from PyPI:

pip install dbtk-setbert

Download SetBERT pre-trained on 16S amplicon data from the Qiita platform (see Available Models for other options):

import torch

from setbert import SetBert

# Download the model
model = SetBert.from_pretrained("sirdavidludwig/setbert", revision="qiita-16s")

# Get the tokenizer
tokenizer = model.sequence_encoder.tokenizer
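The actual tokenizer is provided by the model's sequence encoder, as shown above. Purely as an illustration of what DNA tokenization looks like in general, a hypothetical k-mer tokenizer (not SetBERT's real scheme) might map overlapping 3-mers to integer ids:

```python
# Hypothetical k-mer tokenizer for illustration only -- the real
# tokenizer comes from model.sequence_encoder.tokenizer.
from itertools import product

def make_kmer_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id."""
    return {"".join(kmer): i for i, kmer in enumerate(product(alphabet, repeat=k))}

def tokenize(sequence, vocab, k=3):
    """Split a DNA sequence into overlapping k-mers and look up their ids."""
    return [vocab[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]

vocab = make_kmer_vocab()
tokens = tokenize("ACTGCAG", vocab)  # 5 overlapping 3-mers
```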

Example sample embedding

# Input sample
sequences = [
    "ACTGCAG",
    "TGACGTA",
    "ATGACGA"
]

# Tokenize sequences in the sample
sequence_tokens = torch.stack([tokenizer(s) for s in sequences])

# Compute embeddings
output = model(sequence_tokens)

# Sample level representation
sample_embedding = output["class"]

# Contextualized sequence representations
sequence_embeddings = output["sequences"]
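The sample-level embedding can be used directly for downstream comparisons between samples. As a minimal sketch (not part of the SetBERT API), cosine similarity between two sample embeddings, with plain Python lists standing in for the actual tensors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two sample-level embeddings
sample_a = [0.1, 0.8, 0.3]
sample_b = [0.2, 0.7, 0.4]

similarity = cosine_similarity(sample_a, sample_b)
```

With real SetBERT output, the same comparison would be done on `output["class"]` tensors (e.g. via `torch.nn.functional.cosine_similarity`).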

Available Models

Model Revision | Platform     | Pre-training Dataset Description
qiita-16s      | 16S Amplicon | ~280k 16S amplicon samples from the Qiita platform

Configuration

SetBERT embeds the DNA sequences in chunks using activation checkpointing to reduce peak memory usage. The chunk size is specified by the sequence_encoder_chunk_size parameter of the SetBert.Config class and can be adjusted freely at any point.

# Set chunk size
model.config.sequence_encoder_chunk_size = 256 # default

# Remove chunking and embed all sequences in parallel
model.config.sequence_encoder_chunk_size = None
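The chunking behavior can be illustrated with a minimal sketch (a simplified stand-in for the internal logic, not the actual implementation): a fixed chunk size bounds how many sequences are embedded in parallel at any one time, while None processes everything at once.

```python
def iter_chunks(items, chunk_size):
    """Yield successive chunks of at most chunk_size items.

    A chunk_size of None yields everything as a single chunk,
    mirroring the "no chunking" configuration above.
    """
    if chunk_size is None:
        yield items
        return
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# 600 sequences with a chunk size of 256 -> chunks of 256, 256, and 88
sequences = list(range(600))
chunks = list(iter_chunks(sequences, 256))
```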

Manual Installation

git clone https://github.com/DLii-Research/setbert
pip install -e ./setbert

Citation

@article{ludwig_setbert_2025,
	title = {{SetBERT}: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing},
	volume = {41},
	issn = {1367-4811},
	doi = {10.1093/bioinformatics/btaf370},
	number = {7},
	journal = {Bioinformatics},
	author = {Ludwig, II, David W and Guptil, Christopher and Alexander, Nicholas R and Zhalnina, Kateryna and Wipf, Edi M -L and Khasanova, Albina and Barber, Nicholas A and Swingley, Wesley and Walker, Donald M and Phillips, Joshua L},
	month = jul,
	year = {2025},
}

Original Experiment Source Code

The original source code used to produce the models and experiments for the manuscript is available in the bioinformatics branch of this repository.

