BertNado: A framework for training and evaluating transformer-based models for chromatin-binding prediction

Project description

BertNado

BertNado is a modular framework for fine-tuning Hugging Face DNA language models such as GROVER, NT2 (Nucleotide Transformer), and DNABERT variants on genomic prediction tasks. It supports both full fine-tuning and parameter-efficient fine-tuning (PEFT) strategies such as LoRA.
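
For context, this is roughly what LoRA wrapping looks like with the PEFT library. It is a minimal illustrative sketch, not BertNado's internal code (see training/model.py for that), and the target modules shown are an assumption about the underlying BERT-style architecture.

# Minimal sketch: wrapping a DNA language model with LoRA via PEFT.
# Illustrative only; BertNado configures this internally.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "PoetschLab/GROVER", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # low-rank update dimension
    lora_alpha=16,
    target_modules=["query", "value"],  # assumed BERT-style attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only the LoRA adapters remain trainable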


Features

  • Model Support: GROVER, NT2 (Nucleotide Transformer), DNABERT, and other Hugging Face-compatible DNA language models
  • Task Flexibility: Supports regression, binary, and multi-label classification, as well as masked DNA modeling
  • Chromosome-aware Splits: Train/val/test splits by chromosome to prevent data leakage (see the sketch after this list)
  • Efficient Fine-tuning: Drop-in support for parameter-efficient tuning methods like LoRA
  • Hyperparameter Optimization: Integrated with Weights & Biases for Bayesian sweep-based tuning
  • Robust Evaluation: Automatically generates ROC, PR, and confusion matrix plots for binary classification
  • Model Interpretation: SHAP and Captum Layer Integrated Gradients (LIG) for biological insight
  • Trainer Integration: Built on Hugging Face Trainer with custom heads and metrics
  • W&B Logging: Full experiment tracking with Weights & Biases out of the box
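
The chromosome-aware split, in spirit: a minimal sketch with assumed column names and held-out chromosomes; BertNado's own implementation in data/prepare_dataset.py may differ.

# Sketch of a chromosome-aware split. The "chrom" column name and the
# held-out chromosomes are assumptions for illustration.
import pandas as pd

VAL_CHROMS = {"chr8"}
TEST_CHROMS = {"chr9"}

def split_by_chromosome(df: pd.DataFrame) -> dict:
    """Assign whole chromosomes to train/val/test so that no
    chromosome leaks across splits."""
    val = df[df["chrom"].isin(VAL_CHROMS)]
    test = df[df["chrom"].isin(TEST_CHROMS)]
    train = df[~df["chrom"].isin(VAL_CHROMS | TEST_CHROMS)]
    return {"train": train, "validation": val, "test": test}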

Installation

git clone https://github.com/CChahrour/BertNado.git
cd BertNado
pip install -e .
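
Alternatively, install the latest release from PyPI:

pip install bertnado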

Project Structure

bertnado/
├── cli.py                      # Command-line interface
├── data/
│   └── prepare_dataset.py      # Dataset creation and tokenization
├── evaluation/
│   ├── predict.py              # Predict from trained models
│   └── feature_extraction.py   # SHAP / LIG-based interpretation
└── training/
    ├── finetune.py             # Fine-tuning using best config
    ├── full_train.py           # Full training loop
    ├── model.py                # PEFT/LoRA model architecture
    ├── sweep.py                # W&B sweep setup
    ├── trainers.py             # Trainer wrappers
    └── metrics.py              # Metric computation

Quickstart

Step 1: Prepare Dataset

bertnado-data \
  --file-path test/data/mock_data.parquet \
  --target-column bound \
  --fasta-file test/data/mock_genome.fasta \
  --tokenizer-name PoetschLab/GROVER \
  --output-dir output/dataset \
  --task-type binary_classification \
  --threshold 0.5
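
For reference, an input Parquet along these lines can be built with pandas. The chrom/start/end interval columns below are an assumed schema; the bound column matches the --target-column flag above.

# Hypothetical input table for bertnado-data. The "chrom"/"start"/"end"
# columns are assumed; "bound" matches --target-column above.
import pandas as pd

regions = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [1000, 5000, 2000],
    "end":   [1512, 5512, 2512],
    "bound": [1, 0, 1],
})
regions.to_parquet("test/data/mock_data.parquet")  # requires pyarrow or fastparquet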

Step 2: Run Hyperparameter Sweep

bertnado-sweep \
  --config-path test/data/mock_sweep_config.json \
  --output-dir output/sweep \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --sweep-count 2 \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize \
  --task-type binary_classification

--config-path points to a Weights & Biases sweep config. The sweep metric is also used to choose the best checkpoint inside each run.
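
A sweep config in standard W&B grammar might look like the following. The top-level method/metric/parameters structure is W&B's documented sweep schema, and the metric matches the --metric-name and --metric-goal flags above; the particular hyperparameter names are illustrative assumptions, not BertNado's documented keys.

# Sketch of a W&B sweep config serialized to JSON.
import json

sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/roc_auc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-4},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
        "num_train_epochs": {"values": [2, 3, 5]},
    },
}
with open("test/data/mock_sweep_config.json", "w") as fh:
    json.dump(sweep_config, fh, indent=2)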


Step 3: Train Best Model

bertnado-train \
  --output-dir output/train \
  --model-name PoetschLab/GROVER \
  --dataset output/dataset \
  --best-config-path output/sweep/best_sweep_config.json \
  --task-type binary_classification \
  --project-name project \
  --metric-name eval/roc_auc \
  --metric-goal maximize

The metric flags are optional when best_sweep_config.json was produced by bertnado-sweep, because the resolved metric is saved in that file.


Step 4: Predict on Test Set

bertnado-predict \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/predictions \
  --task-type binary_classification
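
If you want to score the saved predictions yourself, something like the following works, assuming the output is a table of true labels and predicted probabilities; the file name and column names here are assumptions, not BertNado's documented output format.

# Hypothetical post-hoc scoring of saved predictions.
import pandas as pd
from sklearn.metrics import average_precision_score, roc_auc_score

preds = pd.read_csv("output/predictions/predictions.csv")
print("ROC AUC:", roc_auc_score(preds["label"], preds["prob"]))
print("PR AUC: ", average_precision_score(preds["label"], preds["prob"]))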

Step 5: Interpret Model with SHAP or LIG

bertnado-feature \
  --tokenizer-name PoetschLab/GROVER \
  --model-dir output/train/model \
  --dataset-dir output/dataset \
  --output-dir output/feature_analysis \
  --task-type binary_classification \
  --method shap \
  --target-class 1

Run both SHAP and LIG:

--method both --target-class 1

Outputs

  • Figures saved to output/figures/
    • Binary classification: ROC curve, precision-recall curve, and confusion matrix
  • SHAP scores saved to output/shap/
  • Trained models saved to output/models/


Interpretation Tools

  • SHAP: Global and local token importance
  • Captum LIG: Gradient-based token attribution at the embedding level (sketched below)
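
In essence, LIG integrates gradients along a path from a baseline input (here, all padding tokens) to the real input and attributes the prediction to each token's embedding. A minimal standalone sketch with Captum follows; it is illustrative only, since bertnado-feature automates this end to end.

# Minimal Captum Layer Integrated Gradients sketch on a Hugging Face classifier.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForSequenceClassification.from_pretrained(
    "PoetschLab/GROVER", num_labels=2
)
model.eval()

def forward_fn(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

enc = tokenizer("ACGTACGTACGTACGT", return_tensors="pt")
lig = LayerIntegratedGradients(forward_fn, model.get_input_embeddings())
baselines = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baselines,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # attribute toward the positive ("bound") class
)
token_scores = attributions.sum(dim=-1)  # collapse embedding dim -> per-token score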

Acknowledgements

  • Hugging Face Transformers
  • PoetschLab/GROVER
  • PEFT/LoRA
  • SHAP & Captum for interpretability
  • crested for efficient sequence extraction

Download files

Download the file for your platform.

Source Distribution

bertnado-0.1.5.tar.gz (600.9 kB)

Built Distribution

bertnado-0.1.5-py3-none-any.whl (41.1 kB)

File details

Details for the file bertnado-0.1.5.tar.gz.

File metadata

  • Download URL: bertnado-0.1.5.tar.gz
  • Upload date:
  • Size: 600.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bertnado-0.1.5.tar.gz
Algorithm Hash digest
SHA256 d588c51de1c65bc42f6e8b7a80c1872ac80f8c037f03e5b5532ced78bf2895b6
MD5 56698ee381e55a51f19a58cb664d784f
BLAKE2b-256 4d984c8a20cd92b82a8caa5f178e3e60931d9ce24086d5a5dde7053f590b66e2

Provenance

The following attestation bundles were made for bertnado-0.1.5.tar.gz:

Publisher: pypi.yml on CChahrour/BertNado

File details

Details for the file bertnado-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: bertnado-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 41.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for bertnado-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 7380983bde4fe13e91df14663afc7ee54dad9c7ea60d9326da82cdfbcda3832c
MD5 bbf32a4b4b3b5abc1f836f0babea7bfe
BLAKE2b-256 c40c103e1f3ed6e1c7919983a46978412d908f885803438abcfcfe32b492da24

Provenance

The following attestation bundles were made for bertnado-0.1.5-py3-none-any.whl:

Publisher: pypi.yml on CChahrour/BertNado
