
BertNado: A framework for training and evaluating transformer-based models for chromatin binding prediction


BertNado



BertNado is a modular framework for fine-tuning Hugging Face DNA language models such as GROVER, NT2, and DNABERT variants on genomic prediction tasks. It supports both full fine-tuning and parameter-efficient fine-tuning (PEFT) strategies such as LoRA.


Features

  • Model Support: GROVER, NT2 (Nucleotide Transformer), DNABERT, and other Hugging Face-compatible DNA language models
  • Task Flexibility: Supports regression, binary, and multi-label classification, as well as masked DNA modeling
  • Chromosome-aware Splits: Train/val/test split by chromosome to prevent data leakage
  • Efficient Fine-tuning: Drop-in support for parameter-efficient tuning methods like LoRA
  • Hyperparameter Optimization: Integrated with Weights & Biases for Bayesian sweep-based tuning
  • Robust Evaluation: Automatically generates ROC, PR, and confusion matrix plots for binary classification
  • Model Interpretation: SHAP and Captum Layer Integrated Gradients (LIG) for biological insight
  • Trainer Integration: Built on Hugging Face Trainer with custom heads and metrics
  • W&B Logging: Full experiment tracking with Weights & Biases out of the box
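
The chromosome-aware split above can be sketched in a few lines of pandas; the `chrom` column name here is an assumption for illustration, not necessarily BertNado's actual schema. Holding out whole chromosomes keeps near-identical sequences from appearing in both train and test.

```python
# Sketch of a chromosome-aware train/val/test split.
# The "chrom" column name is illustrative, not BertNado's schema.
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr3", "chr4"],
    "bound": [1, 0, 1, 0, 1],
})
val_chroms, test_chroms = {"chr2"}, {"chr3"}

train = df[~df["chrom"].isin(val_chroms | test_chroms)]
val = df[df["chrom"].isin(val_chroms)]
test = df[df["chrom"].isin(test_chroms)]
print(len(train), len(val), len(test))  # → 3 1 1
```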

Installation

git clone https://github.com/CChahrour/BertNado.git
cd BertNado
pip install -e .

Project Structure

bertnado/
├── cli.py                      # Command-line interface
├── data/
│   └── prepare_dataset.py      # Dataset creation and tokenization
├── evaluation/
│   ├── predict.py              # Predict from trained models
│   └── feature_extraction.py   # SHAP / LIG-based interpretation
└── training/
    ├── finetune.py             # Fine-tuning using best config
    ├── full_train.py           # Full training loop
    ├── model.py                # PEFT/LoRA model architecture
    ├── sweep.py                # W&B sweep setup
    ├── trainers.py             # Trainer wrappers
    └── metrics.py              # Metric computation

Quickstart

Step 1: Prepare Dataset

bertnado-data \
  --file-path test/data/mock_data.parquet \
  --target-column bound \
  --fasta-file test/data/mock_genome.fasta \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5
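
For a continuous target column, `--threshold` controls how values are binarized into labels. A rough sketch of the idea (whether the comparison is strict or inclusive is an assumption here):

```python
# Sketch of binarizing a continuous target with a 0.5 threshold,
# as --threshold 0.5 does conceptually. Column names are illustrative.
import pandas as pd

df = pd.DataFrame({"bound": [0.1, 0.7, 0.45, 0.9]})
df["label"] = (df["bound"] > 0.5).astype(int)
print(df["label"].tolist())  # → [0, 1, 0, 1]
```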

Step 2: Run Hyperparameter Sweep

bertnado-sweep \
  --config-path test/data/mock_sweep_config.json \
  --output-dir output/sweep \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --sweep-count 2 \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize \
  --task-type binary_classification

--config-path points to a Weights & Biases sweep config. The sweep metric is also used to choose the best checkpoint inside each run.
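
A W&B Bayesian sweep config has three top-level keys: the search method, the metric to optimize, and the parameter space. The parameter names below are illustrative, not BertNado's required schema; consult the bundled test/data/mock_sweep_config.json for the real shape.

```python
# Hypothetical shape of a W&B Bayesian sweep config.
# Parameter names are illustrative; see mock_sweep_config.json for the real schema.
import json

sweep_config = {
    "method": "bayes",  # Bayesian search, as used by bertnado-sweep
    "metric": {"name": "eval/roc_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
    },
}
print(json.dumps(sweep_config, indent=2))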


Step 3: Train Best Model

bertnado-train \
  --output-dir output/train \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --best-config-path output/sweep/best_sweep_config.json \
  --task-type binary_classification \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize

The metric flags are optional when best_sweep_config.json was produced by bertnado-sweep, because the resolved metric is saved in that file.


Step 4: Predict on Test Set

bertnado-predict \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/predictions \
  --task-type binary_classification
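
The `eval/roc_auc` metric used throughout the pipeline can be reproduced from predicted probabilities with scikit-learn. The values below are toy numbers, not real model output:

```python
# Toy illustration of computing ROC AUC from predicted probabilities,
# the metric bertnado-sweep and bertnado-train optimize for.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]          # ground-truth binary labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1
print(roc_auc_score(y_true, y_score))  # → 0.75
```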

Step 5: Interpret Model with SHAP or LIG

bertnado-feature \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/feature_analysis \
  --task-type binary_classification \
  --method shap \
  --target-class 1

Run both SHAP and LIG:

--method both --target-class 1

Outputs

  • Figures saved to output/figures/
    • Binary classification: ROC and precision-recall curves, confusion matrix
  • SHAP scores saved to output/shap/
  • Trained models saved to output/models/

Interpretation Tools

  • SHAP: Global and local token importance
  • Captum LIG: Gradient-based token attribution at the embedding level
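
Embedding-level attributions such as those from Captum's LayerIntegratedGradients come back with one score per embedding dimension per token; a common way to get per-token importance is to sum over the embedding axis. A sketch with toy values (the aggregation choice is an assumption, not necessarily what BertNado does):

```python
# Sketch: collapsing embedding-level attributions (e.g. from Captum LIG)
# into one importance score per token by summing over the embedding dimension.
import numpy as np

attributions = np.array([          # shape (seq_len=3, embed_dim=4), toy values
    [0.10, -0.20, 0.05, 0.00],
    [0.40,  0.10, 0.20, 0.10],
    [-0.10, 0.00, -0.05, 0.00],
])
token_importance = attributions.sum(axis=1)
print(token_importance)
```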

Acknowledgements

  • Hugging Face Transformers
  • PoetschLab/GROVER
  • PEFT/LoRA
  • SHAP & Captum for interpretability
  • crested for efficient sequence extraction
