The official SetBERT implementation using the deepbio-toolkit.

Project description

SetBERT

The official code repository for SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

Quick Start

Installation:

pip install dbtk-setbert

Download SetBERT pre-trained on the Qiita 16S platform (see Available Models for other options):

from setbert import SetBert

# Download the model
model = SetBert.from_pretrained("sirdavidludwig/setbert", revision="qiita-16s")

# Get the tokenizer
tokenizer = model.sequence_encoder.tokenizer

Example sample embedding

import torch

# Input sample
sequences = [
    "ACTGCAG",
    "TGACGTA",
    "ATGACGA"
]

# Tokenize sequences in the sample
sequence_tokens = torch.stack([tokenizer(s) for s in sequences])

# Compute embeddings
output = model(sequence_tokens)

# Sample level representation
sample_embedding = output["class"]

# Contextualized sequence representations
sequence_embeddings = output["sequences"]
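One natural use of the sample-level representation is comparing samples to each other. A minimal sketch, assuming `output["class"]` is a fixed-size embedding vector; the random tensors and the 768-dimensional size below are placeholders standing in for real model outputs, not SetBERT's actual embedding size:

```python
import torch
import torch.nn.functional as F

# Placeholder sample embeddings; in practice these would come from
# model(sequence_tokens)["class"] for two different samples.
emb_a = torch.randn(768)
emb_b = torch.randn(768)

# Cosine similarity between the two sample embeddings (range [-1, 1])
similarity = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
print(f"similarity: {similarity:.3f}")
```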

Available Models:

Model Revision   Platform       Pre-training Dataset Description
qiita-16s        16S Amplicon   ~280k 16S amplicon samples from the Qiita platform

Configuration

SetBERT embeds the DNA sequences in chunks using activation checkpointing. The chunk size is specified by the sequence_encoder_chunk_size parameter in the SetBert.Config class and can be adjusted freely at any point.

# Set chunk size
model.config.sequence_encoder_chunk_size = 256 # default

# Remove chunking and embed all sequences in parallel
model.config.sequence_encoder_chunk_size = None
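Conceptually, chunked encoding processes the sequence set in fixed-size slices rather than all at once, trading parallelism for lower peak memory. A minimal sketch of that control flow, independent of SetBERT's internals; `encode` here is a hypothetical stand-in for the real sequence encoder:

```python
def encode(batch):
    """Placeholder encoder: returns one 'embedding' (here, a length) per sequence."""
    return [len(s) for s in batch]

def encode_in_chunks(sequences, chunk_size=None):
    """Encode a set of sequences, optionally in fixed-size chunks.

    chunk_size=None mirrors disabling chunking: everything in one pass.
    """
    if chunk_size is None:
        return encode(sequences)
    out = []
    for i in range(0, len(sequences), chunk_size):
        out.extend(encode(sequences[i:i + chunk_size]))
    return out

seqs = ["ACTGCAG", "TGACGTA", "ATGACGA", "GGC"]
# Chunked and unchunked encoding produce identical results
print(encode_in_chunks(seqs, chunk_size=2) == encode_in_chunks(seqs, chunk_size=None))
```

The same invariant holds in the real model: changing sequence_encoder_chunk_size affects memory use and speed, not the embeddings themselves.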

Manual Installation

git clone https://github.com/DLii-Research/setbert
pip install -e ./setbert

Citation

@article{ludwig_setbert_2025,
	title = {{SetBERT}: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing},
	volume = {41},
	issn = {1367-4811},
	doi = {10.1093/bioinformatics/btaf370},
	number = {7},
	journal = {Bioinformatics},
	author = {Ludwig, II, David W and Guptil, Christopher and Alexander, Nicholas R and Zhalnina, Kateryna and Wipf, Edi M -L and Khasanova, Albina and Barber, Nicholas A and Swingley, Wesley and Walker, Donald M and Phillips, Joshua L},
	month = jul,
	year = {2025},
}

Original Experiment Source Code

The original source code used to produce the models and experiments for the manuscript is available in the bioinformatics branch of this repository.

Download files

Download the file for your platform.

Source Distribution

dbtk_setbert-1.0.1.tar.gz (9.7 kB)

Uploaded Source

Built Distribution

dbtk_setbert-1.0.1-py3-none-any.whl (10.9 kB)

Uploaded Python 3

File details

Details for the file dbtk_setbert-1.0.1.tar.gz.

File metadata

  • Download URL: dbtk_setbert-1.0.1.tar.gz
  • Upload date:
  • Size: 9.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for dbtk_setbert-1.0.1.tar.gz
Algorithm Hash digest
SHA256 024fcf278e63c86456363e3d1c6719d62d8e6405760f061b2b8f7f1860a2a7c2
MD5 984ade9db0b302e36c12eb6e0c81c8d0
BLAKE2b-256 077bdfdeef82a8e727a176af34e5cbd7087d5cf3fd4207e826cfa061bbc5b773

File details

Details for the file dbtk_setbert-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: dbtk_setbert-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 10.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for dbtk_setbert-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0581e7894daae8d748c8e3ca0cd78206b1a37b55127a32942c46b0d31dc0c6b9
MD5 bd833e1379918094da28bc6d5e0b2bbd
BLAKE2b-256 a077f475dc3a0c57a96669b15a761168a78d521082543708d42d60b093d56ad4
