
DNABERT-based framework for predicting the functional impact of regulatory variants


DeepVRegulome



DeepVRegulome is a DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome. It provides 462 fine-tuned models (458 transcription factors + 4 histone modifications) trained on ENCODE ChIP-seq data for transcription-factor-binding-site disruption analysis. Splice-site disruption scoring is in development (see Roadmap).




Installation

Requirements

  • Python ≥ 3.11
  • GPU recommended — DNABERT inference runs on CPU but is significantly faster on CUDA-enabled GPUs

1. Install the Python package

pip install deepvregulome

This installs the core package with PyTorch, Transformers, and HuggingFace Hub.

Upgrading: We release updates frequently with new features and bug fixes. To get the latest version:

pip install deepvregulome --upgrade

Check your installed version: python -c "import deepvregulome; print(deepvregulome.__version__)"

2. Install optional dependencies

DeepVRegulome has optional extras depending on your use case:

# Quote the extras so that shells such as zsh do not expand the square brackets.

# For variant scoring from genomic coordinates (requires samtools/htslib)
pip install "deepvregulome[genome]"      # installs pysam

# For VCF file processing
pip install "deepvregulome[vcf]"         # installs cyvcf2

# For visualization (attention maps, motif logos)
pip install "deepvregulome[viz]"         # installs matplotlib, seaborn

# For motif interpretation (logo plots, statistical tests)
pip install "deepvregulome[interpret]"   # installs logomaker, scipy

# Install everything
pip install "deepvregulome[all]"
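To see which extras are already usable in the current environment, a small standard-library check along these lines may help. It is only a sketch: the extra-to-module mapping is copied from the comments above, and the helper name available_extras is hypothetical, not part of the deepvregulome API.

```python
from importlib.util import find_spec

# Extra name -> modules it is documented to install (taken from the comments above).
EXTRAS = {
    "genome": ["pysam"],
    "vcf": ["cyvcf2"],
    "viz": ["matplotlib", "seaborn"],
    "interpret": ["logomaker", "scipy"],
}


def available_extras(extras=EXTRAS):
    """Return {extra_name: bool}; True when every module of the extra is importable."""
    return {
        name: all(find_spec(module) is not None for module in modules)
        for name, modules in extras.items()
    }


if __name__ == "__main__":
    for name, ok in available_extras().items():
        print(f"{name:10s} {'installed' if ok else 'missing'}")
```

The check only looks up whether each module can be found; it does not import the modules or verify their versions.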

Recommended: Use a dedicated conda environment

If you're on a shared server (e.g., HPC, JupyterHub), we strongly recommend creating a dedicated conda environment to avoid dependency conflicts:

# Create and activate environment
conda create -n dvr python=3.11 -y
conda activate dvr

# Install deepvregulome with all optional dependencies
pip install "deepvregulome[all]"

# If using JupyterHub, register as a selectable kernel
pip install jupyterlab ipykernel
python -m ipykernel install --user --name dvr --display-name "DVR (Python 3.11)"

Then select "DVR (Python 3.11)" as your kernel in JupyterHub (Kernel → Change Kernel).


External Data Requirements

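DeepVRegulome uses two external data files that are not included in the package: a human reference genome, which is required for variant-level analyses, and the JASPAR motif database, which is optional and needed only for motif analysis. Download the files you need before running those analyses.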

Human Reference Genome (hg38)

Required for extracting flanking sequences around variant positions.

# Option 1: UCSC hg38
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
samtools faidx hg38.fa

# Option 2: GENCODE GRCh38 primary assembly
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
samtools faidx GRCh38.primary_assembly.genome.fa

Note: The .fai index file is required. Run samtools faidx on your FASTA file if the index does not exist.
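A quick pre-flight check can catch a missing index before any scoring starts. The following is a minimal sketch using only the standard library; check_fasta_index is a hypothetical helper, not a deepvregulome function, and it assumes the index sits next to the FASTA with a .fai suffix, which is where samtools faidx writes it by default.

```python
from pathlib import Path


def check_fasta_index(fasta_path):
    """Return the path of the .fai index, or raise with a hint on how to create it."""
    fasta = Path(fasta_path)
    if not fasta.is_file():
        raise FileNotFoundError(f"Reference FASTA not found: {fasta}")
    # samtools faidx writes <fasta>.fai alongside the FASTA file.
    index = Path(str(fasta) + ".fai")
    if not index.is_file():
        raise FileNotFoundError(
            f"Missing index {index}; run: samtools faidx {fasta}"
        )
    return index
```

Calling check_fasta_index("/path/to/hg38.fa") before constructing DVR gives a clearer error message than a failure deep inside sequence extraction.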

JASPAR Motif Database (optional, for motif analysis)

Required only if you want to run motif-level interpretation and TF binding site overlap analysis.

# Download JASPAR 2024 vertebrate motifs in MEME format
wget https://jaspar.elixir.no/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

Quick Start

Score a single variant

from deepvregulome import DVR

# Initialize with path to your reference genome FASTA
dvr = DVR(genome="/path/to/hg38.fa")

# Score a variant against specific TF models
result = dvr.score_variant(
    chrom="chr1",
    pos=3456782,
    ref="A",
    alt="C",
    models=["CTCFL", "SP1", "MYC"]
)
print(result)

Output is a pandas DataFrame with columns: chrom, pos, ref, alt, model, type, prob_ref, prob_alt, log_odds_change, disrupted.
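The README does not spell out how log_odds_change is derived from prob_ref and prob_alt. One conventional definition is the difference of the two logits, sketched below. Treat both the formula and the sign convention (ALT minus REF) as assumptions for illustration; the package may compute the column differently.

```python
import math


def log_odds_change(prob_ref, prob_alt, eps=1e-6):
    """Difference in log-odds between ALT and REF predictions (assumed convention)."""

    def logit(p):
        # Clamp to avoid division by zero or log(0) at p = 0 or p = 1.
        p = min(max(p, eps), 1.0 - eps)
        return math.log(p / (1.0 - p))

    return logit(prob_alt) - logit(prob_ref)
```

Under this convention a negative value means the ALT allele lowers the predicted probability relative to REF, and a value of 0 means no change.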

Score variants from a VCF file

results = dvr.score_vcf(
    "/path/to/patient.vcf",
    models=["CTCFL", "SP1", "GATA3"],
    batch_size=100,
    gpus=[0, 1]         # multi-GPU support
)
results.head()
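For readers less familiar with the VCF layout, the sketch below shows how the chrom, pos, ref, and alt fields can be read from VCF text with the standard library alone. It illustrates the file format and is not how score_vcf is implemented; in particular, splitting multi-allelic records into one variant per ALT allele and skipping symbolic alleles are assumptions made here.

```python
def iter_vcf_variants(lines):
    """Yield (chrom, pos, ref, alt) tuples from VCF text lines, one per ALT allele."""
    for line in lines:
        # Skip blank lines, meta-information (##...) and the column header (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _variant_id, ref, alts = fields[:5]
        for alt in alts.split(","):
            # Skip missing (.), spanning-deletion (*), and symbolic (<...>) alleles.
            if alt in (".", "*") or alt.startswith("<"):
                continue
            yield chrom, int(pos), ref, alt
```

Passing an open file handle works because the function only iterates over lines, for example list(iter_vcf_variants(open("/path/to/patient.vcf"))).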

Score variants from a DataFrame

score_variants auto-detects either of two column layouts: chrom/pos/ref/alt or CHROM/start/end/REF/ALT.

import pandas as pd

# Build a VCF-style DataFrame of variants to score
variant_df = pd.DataFrame({
    "chrom": ["chr21", "chr1", "chr17", "chr12", "chr7"],
    "pos":   [10448027, 3456782, 7674220, 25245350, 55181378],
    "ref":   ["C", "A", "C", "G", "T"],
    "alt":   ["T", "C", "T", "A", "C"],
})

# Alternatively, load the variants from a tab-separated file
# variant_df = pd.read_csv("test_vcf.tsv", sep="\t")

results = dvr.score_variants(
    variant_df,
    models=["CTCFL", "SP1", "ATF4"],
    batch_size=5,
)
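The column auto-detection mentioned above could work roughly as in the sketch below. This is a guess at the behavior, not the package's code: detect_variant_columns is a hypothetical helper, the case-insensitive matching is an assumption, and how the start/end layout is converted to a single position is deliberately left out because the README does not say.

```python
def detect_variant_columns(columns):
    """Map canonical field names to the actual column names found in `columns`."""
    # Compare case-insensitively so that CHROM/REF/ALT match chrom/ref/alt.
    lowered = {name.lower(): name for name in columns}
    layouts = [
        ("chrom", "pos", "ref", "alt"),            # VCF-style layout
        ("chrom", "start", "end", "ref", "alt"),   # interval-style layout
    ]
    for layout in layouts:
        if all(field in lowered for field in layout):
            return {field: lowered[field] for field in layout}
    raise ValueError(f"Unrecognized variant columns: {list(columns)}")
```

With a pandas DataFrame, detect_variant_columns(variant_df.columns) would return the mapping for whichever layout is present.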

Score pre-extracted sequences

result = dvr.score_sequence(
    ref_seq="ATCGATCG...",   # 301bp reference sequence
    alt_seq="ATCGTTCG...",   # 301bp alternate sequence
    models=["CTCFL"]
)
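If you only have the reference window, the alternate sequence can be built by substituting the allele in place. The sketch below assumes the variant sits at the center of the window (index 150 of a 301 bp sequence); that placement is an assumption, not something the README states, and make_alt_sequence is a hypothetical helper.

```python
def make_alt_sequence(ref_seq, ref, alt, center=None):
    """Return ref_seq with the REF allele at `center` replaced by ALT."""
    if center is None:
        center = len(ref_seq) // 2  # index 150 for a 301 bp window
    observed = ref_seq[center:center + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError(
            f"REF allele {ref!r} does not match sequence {observed!r} at index {center}"
        )
    return ref_seq[:center] + alt + ref_seq[center + len(ref):]
```

For single-nucleotide variants the result keeps the original length. Insertions and deletions change the length, so the window would need to be trimmed or extended back to 301 bp before scoring.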

Python API Reference

DVR class

| Method | Description |
| --- | --- |
| DVR(genome=...) | Initialize with reference genome FASTA path |
| dvr.score_variant(chrom, pos, ref, alt, models) | Score a single variant by genomic coordinates |
| dvr.score_variants(df, models) | Batch-score a DataFrame of variants (columns: chrom, pos, ref, alt) |
| dvr.score_vcf(vcf_path, models) | Score all variants in a VCF file |
| dvr.score_sequence(ref_seq, alt_seq, models) | Score pre-extracted 301bp REF/ALT sequences |
| dvr.list_models(model_type=None) | List available models; filter by "TF" or "HISTONE" |
| dvr.search_models(query) | Search models by name (e.g., "ZNF", "GATA") |

Scoring parameters

All scoring methods accept the following arguments. Only models is marked as required; the others are optional:

| Parameter | Default | Description |
| --- | --- | --- |
| models | required | List of model names (e.g., ["CTCFL", "SP1"]) |
| model_type | None | Score all models of a type: "TF" (458 models) or "HISTONE" (4 models) |
| batch_size | 32 | Batch size for GPU inference |
| gpus | [0] | List of GPU device IDs for parallel inference |
| return_attention | False | Extract DNABERT attention weights for interpretability |
| coordinate_system | "1-based" | VCF standard is 1-based; set to "0-based" if needed |
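The difference between the two coordinate systems is a single offset, which the sketch below makes explicit by converting a variant position into a 0-based, half-open window as used by Python slicing. The 150 bp flank is inferred from the 301 bp sequence length mentioned elsewhere in this README, and centering the variant in the window is an assumption; to_zero_based_window is a hypothetical helper.

```python
def to_zero_based_window(pos, flank=150, coordinate_system="1-based"):
    """Return a 0-based, half-open (start, end) window centered on the variant."""
    if coordinate_system == "1-based":
        center = pos - 1  # VCF positions count from 1
    elif coordinate_system == "0-based":
        center = pos
    else:
        raise ValueError(f"Unknown coordinate system: {coordinate_system!r}")
    start = center - flank
    if start < 0:
        raise ValueError("Window extends past the start of the chromosome")
    return start, center + flank + 1
```

For example, to_zero_based_window(3456782) gives a window of length 2 * 150 + 1 = 301 whose middle base is the variant position.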

CLI

# Score a single variant from the command line
deepvregulome score --chrom chr1 --pos 3456782 --ref A --alt C \
    --models CTCFL SP1 --genome /path/to/hg38.fa

# Score from VCF
deepvregulome score-vcf --vcf patient.vcf --models CTCFL SP1 MYC \
    --genome /path/to/hg38.fa --batch-size 100 --gpus 0 1
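When driving the CLI from a Python pipeline, it is safer to build the argument list programmatically than to format a shell string. The sketch below uses only the flags shown in the example above; build_score_vcf_command is a hypothetical helper, and no other CLI options are assumed.

```python
import shlex


def build_score_vcf_command(vcf, models, genome, batch_size=None, gpus=None):
    """Build the argument list for `deepvregulome score-vcf` using the documented flags."""
    cmd = [
        "deepvregulome", "score-vcf",
        "--vcf", str(vcf),
        "--models", *models,
        "--genome", str(genome),
    ]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if gpus:
        cmd += ["--gpus", *(str(gpu) for gpu in gpus)]
    return cmd


if __name__ == "__main__":
    command = build_score_vcf_command(
        "patient.vcf", ["CTCFL", "SP1", "MYC"], "/path/to/hg38.fa",
        batch_size=100, gpus=[0, 1],
    )
    print(shlex.join(command))  # shell-quoted form, handy for logging
```

The returned list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting problems with paths that contain spaces.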

Repository Structure

DeepVRegulome/
├── src/deepvregulome/              # Python package (pip install deepvregulome)
│   ├── __init__.py                 #   Public API: DVR class
│   ├── dvr.py                      #   Main scoring engine
│   ├── registry.py                 #   Model registry (462 models + metadata)
│   ├── utils.py                    #   k-mer tokenization, sequence extraction
│   └── cli.py                      #   Command-line interface
├── notebooks/                      # Tutorial notebooks (see below)
│   ├── 01_quickstart.ipynb         #   Getting started with DVR
│   ├── 02_vcf_scoring.ipynb        #   Batch VCF analysis pipeline
│   ├── 03_attention_motifs.ipynb   #   Attention visualization & motif analysis
│   └── 04_clinical_pipeline.ipynb  #   End-to-end clinical variant analysis
├── streamlit_app/                  # Interactive web dashboard
│   └── app_variant_clinical_dashboard.py
├── assets/                         # Figures for README
│   └── flowchart.png
├── pyproject.toml                  # Package metadata & dependencies
├── LICENSE
└── README.md
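The tree above lists k-mer tokenization under utils.py. DNABERT represents a DNA sequence as a series of overlapping k-mers, and a minimal sketch of that step looks like the following. The value k=6 is an assumption (the README does not state which k these models use), and seq_to_kmers is a hypothetical name rather than the function in utils.py.

```python
def seq_to_kmers(sequence, k=6):
    """Split a DNA sequence into space-separated overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    if len(sequence) < k:
        raise ValueError(f"Sequence of length {len(sequence)} is shorter than k={k}")
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

A sequence of length n yields n - k + 1 tokens, so a 301 bp input with k=6 produces 296 k-mers.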

Tutorial Notebooks

The notebooks/ folder contains Jupyter notebooks that walk through common use cases:

| Notebook | Description |
| --- | --- |
| 01_quickstart.ipynb | Install, load models, score your first variant |
| 02_vcf_scoring.ipynb | Parse a VCF, batch-score variants, filter candidates |
| 03_attention_motifs.ipynb | Extract DNABERT attention weights, plot motif disruption |
| 04_clinical_pipeline.ipynb | Full pipeline: VCF → TFBS intersection → scoring → candidate ranking |

To run the notebooks:

pip install "deepvregulome[all]" jupyterlab
jupyter lab notebooks/

Streamlit Dashboard

An interactive dashboard for exploring variant predictions and clinical stratification is available. Its source is in streamlit_app/app_variant_clinical_dashboard.py.


Model Checkpoints

All 462 fine-tuned DNABERT models (458 TFs + 4 histone marks) are hosted on HuggingFace:

https://huggingface.co/duttaprat/DeepVRegulome

Models are automatically downloaded when you use DVR(). No manual download is needed.

For direct access without the deepvregulome package:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/DeepVRegulome", subfolder="models/CTCFL"
)
model = AutoModelForSequenceClassification.from_pretrained(
    "duttaprat/DeepVRegulome", subfolder="models/CTCFL"
)
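Sequence-classification models in Transformers return raw logits rather than probabilities. When using the checkpoints directly, a softmax over the logits gives class probabilities, as sketched below in plain Python. Which output index corresponds to the bound class is not stated in this README, so check the model's config (for example its id2label mapping) before interpreting the values.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; it does not change the result.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]
```

With PyTorch tensors, the equivalent step is torch.softmax(outputs.logits, dim=-1).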

Roadmap

Current capabilities (v0.1.8):

  • Single-variant and batch VCF scoring with 462 ENCODE ChIP-seq models
  • Multi-GPU inference support
  • Attention-based interpretability
  • CLI and Python API

In development:

  • Splice-site disruption scoring (acceptor + donor models)
  • JASPAR motif enrichment integration
  • Expanded model zoo: additional cell types and epigenomic marks
  • Conda package

Planned:

  • REST API for web-based scoring
  • Integration with ClinVar and gnomAD annotation

Citation

If you use DeepVRegulome in your research, please cite:

@article{dutta2025deepvregulome,
  title={DeepVRegulome: DNABERT-based deep-learning framework for predicting
         the functional impact of short genomic variants on the human regulome},
  author={Dutta, Pratik and Obusan, Matthew and Sathian, Rekha and Chao, Max
          and Surana, Pallavi and Papineni, Nimisha and Ji, Yanrong
          and Zhou, Zhihan and Liu, Han and Yurovsky, Alisa
          and Davuluri, Ramana V},
  journal={arXiv preprint arXiv:2511.09026},
  year={2025},
  url={https://arxiv.org/abs/2511.09026}
}

License

CC-BY-NC-4.0. See LICENSE for details.


Davuluri Lab · Department of Biomedical Informatics · Stony Brook University
