DNABERT-based framework for predicting the functional impact of regulatory variants

DeepVRegulome

DeepVRegulome is a DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome. It provides 462 fine-tuned models (458 transcription factors + 4 histone modifications) trained on ENCODE ChIP-seq data, covering splice-site and transcription-factor-binding-site disruption analysis.

Installation
Quick Start
Python API Reference
External Data Requirements
Repository Structure
Tutorial Notebooks
Streamlit Dashboard
Model Checkpoints
Roadmap
Citation
License

Installation

Requirements

Python ≥ 3.11
GPU recommended — DNABERT inference runs on CPU but is significantly faster on CUDA-enabled GPUs

1. Install the Python package

pip install deepvregulome

This installs the core package together with PyTorch, Transformers, and Hugging Face Hub.

Upgrading: We release updates frequently with new features and bug fixes. To get the latest version:
pip install deepvregulome --upgrade
Check your installed version: python -c "import deepvregulome; print(deepvregulome.__version__)"

2. Install optional dependencies

DeepVRegulome has optional extras depending on your use case:

# Quote the extras so that shells such as zsh do not expand the brackets

# For variant scoring from genomic coordinates (requires samtools/htslib)
pip install "deepvregulome[genome]"      # installs pysam

# For VCF file processing
pip install "deepvregulome[vcf]"         # installs cyvcf2

# For visualization (attention maps, motif logos)
pip install "deepvregulome[viz]"         # installs matplotlib, seaborn

# For motif interpretation (logo plots, statistical tests)
pip install "deepvregulome[interpret]"   # installs logomaker, scipy

# Install everything
pip install "deepvregulome[all]"

Recommended: Use a dedicated conda environment

If you're on a shared server (e.g., HPC, JupyterHub), we strongly recommend creating a dedicated conda environment to avoid dependency conflicts:

# Create and activate environment
conda create -n dvr python=3.11 -y
conda activate dvr

# Install deepvregulome with all optional dependencies
pip install "deepvregulome[all]"

# If using JupyterHub, register as a selectable kernel
pip install jupyterlab ipykernel
python -m ipykernel install --user --name dvr --display-name "DVR (Python 3.11)"

Then select "DVR (Python 3.11)" as your kernel in JupyterHub (Kernel → Change Kernel).

External Data Requirements

DeepVRegulome uses two external data files that are not included in the package. The reference genome is required and must be downloaded before running variant-level analyses; the JASPAR motif database is needed only for motif analysis.

Human Reference Genome (hg38)

Required for extracting flanking sequences around variant positions.

# Option 1: UCSC hg38
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
samtools faidx hg38.fa

# Option 2: GENCODE GRCh38 primary assembly
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
samtools faidx GRCh38.primary_assembly.genome.fa

Note: The .fai index file is required. Run samtools faidx on your FASTA file if the index does not exist.
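
Before starting a long scoring run, it can help to confirm that the index is actually in place. The following is a minimal sketch, not part of the deepvregulome API: the helper names `read_fai` and `check_reference` are hypothetical, and the code relies only on the standard `.fai` layout, in which the first two tab-separated columns are the contig name and its length.

```python
from pathlib import Path


def read_fai(fai_path):
    """Parse a samtools .fai index into {contig_name: length}."""
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            lengths[fields[0]] = int(fields[1])
    return lengths


def check_reference(fasta_path):
    """Return contig lengths, or raise if the .fai index is missing."""
    fai_path = Path(str(fasta_path) + ".fai")
    if not fai_path.exists():
        raise FileNotFoundError(
            f"{fai_path} not found; run `samtools faidx {fasta_path}` first"
        )
    return read_fai(fai_path)
```

Calling `check_reference("/path/to/hg38.fa")` returns a dictionary of contig lengths, which is also a quick way to see whether the FASTA uses `chr1`-style or `1`-style contig names.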

JASPAR Motif Database (optional, for motif analysis)

Required only if you want to run motif-level interpretation and TF binding site overlap analysis.

# Download JASPAR 2024 vertebrate motifs in MEME format
wget https://jaspar.elixir.no/download/data/2024/CORE/JASPAR2024_CORE_vertebrates_non-redundant_pfms_meme.txt

Quick Start

Score a single variant

from deepvregulome import DVR

# Initialize with path to your reference genome FASTA
dvr = DVR(
    genome="/path/to/hg38.fa",
    model_dir="/path/to/preferred_models",
)

# Score a variant against specific TF models
result = dvr.score_variant(
    chrom="chr1",
    pos=3456782,
    ref="A",
    alt="C",
    models=["CTCFL", "SP1", "MYC"],
)
print(result)

Download models into a preferred directory

from deepvregulome import download_models

download_models(
    models=["CTCFL", "SP1"],
    model_dir="/path/to/preferred_models",
)

DeepVRegulome checks model_dir first for local checkpoints and falls back to the Hugging Face cache only when a requested model is not present there.

The scoring methods return a pandas DataFrame with columns: chrom, pos, ref, alt, model, type, prob_ref, prob_alt, log_odds_change, disrupted.
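
To build intuition for how prob_ref, prob_alt, and log_odds_change might relate, here is a small worked sketch. It assumes that the change is the difference of the two logits with a natural logarithm; this README does not state the exact formula or log base the package uses, nor the rule behind the disrupted flag, so treat the function below as an illustration rather than the package's definition.

```python
import math


def logit(p, eps=1e-9):
    """Log-odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def log_odds_change(prob_ref, prob_alt):
    """Assumed definition: logit(prob_alt) - logit(prob_ref)."""
    return logit(prob_alt) - logit(prob_ref)


# A variant that lowers the predicted binding probability from 0.9 to 0.1
# yields a negative change; equal probabilities yield zero.
change = log_odds_change(0.9, 0.1)
```

Under this assumed definition, a negative value means the alternate allele makes the model less confident that the site is bound, and a positive value means more confident.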

Score variants from a VCF file

results = dvr.score_vcf(
    "/path/to/patient.vcf",
    models=["CTCFL", "SP1", "GATA3"],
    batch_size=100,
    gpus=[0, 1]         # multi-GPU support
)
results.head()
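
For readers who want to inspect a VCF before passing it to score_vcf, the sketch below pulls the four fields that the scoring output reports (chrom, pos, ref, alt) from VCF text lines using only the standard library. It illustrates the file format and is not how the package parses VCFs (the vcf extra installs cyvcf2); the function name `iter_vcf_variants` is hypothetical.

```python
def iter_vcf_variants(lines):
    """Yield (chrom, pos, ref, alt) tuples from VCF text lines.

    Header lines (starting with '#') are skipped, and multi-allelic
    records are split into one tuple per ALT allele.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):
            if alt in (".", "*"):
                continue  # no alternate allele, or spanning deletion
            yield chrom, int(pos), ref, alt
```

Passing an open file handle works as well as a list of strings, for example `list(iter_vcf_variants(open("patient.vcf")))` for an uncompressed file.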

Batch Scoring from DataFrame (`score_variants`)

score_variants auto-detects the column naming scheme of the input DataFrame: either chrom/pos/ref/alt or CHROM/start/end/REF/ALT.

import pandas as pd

# Load the variants to score from a tab-separated file
variant_df = pd.read_csv("test_vcf.tsv", sep="\t")

# Alternatively, build a VCF-style DataFrame inline:
# variant_df = pd.DataFrame({
#     "chrom": ["chr21", "chr1", "chr17", "chr12", "chr7"],
#     "pos":   [10448027, 3456782, 7674220, 25245350, 55181378],
#     "ref":   ["C", "A", "C", "G", "T"],
#     "alt":   ["T", "C", "T", "A", "C"],
# })

results = dvr.score_variants(
    variant_df,
    models=["CTCFL", "SP1", "ATF4"],
    batch_size=5,
)
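
The column auto-detection described above can be pictured with a short sketch. This is an assumption about the general approach, not the package's implementation: the function name `detect_variant_schema` and the labels it returns are hypothetical, and only the two column sets named in this section are checked.

```python
VCF_STYLE = ("chrom", "pos", "ref", "alt")
INTERVAL_STYLE = ("CHROM", "start", "end", "REF", "ALT")


def detect_variant_schema(columns):
    """Return which of the two documented column schemes is present."""
    cols = set(columns)
    if cols.issuperset(VCF_STYLE):
        return "vcf-style"
    if cols.issuperset(INTERVAL_STYLE):
        return "interval-style"
    raise ValueError(
        "expected columns chrom/pos/ref/alt or CHROM/start/end/REF/ALT, "
        f"got {sorted(cols)}"
    )
```

With pandas, `detect_variant_schema(variant_df.columns)` is a quick pre-flight check that a table loaded from disk carries one of the supported layouts before a long batch run.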

Score pre-extracted sequences

result = dvr.score_sequence(
    ref_seq="ATCGATCG...",   # 301bp reference sequence
    alt_seq="ATCGTTCG...",   # 301bp alternate sequence
    models=["CTCFL"]
)
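
If you extract sequences yourself, both inputs must be 301 bp long. The sketch below shows one way to build such a pair for a single-nucleotide variant from a contig string. It assumes the variant sits at the center of the window with 150 bp of flank on each side (150 + 1 + 150 = 301); the README does not state how the package positions the variant, and the helper name `snv_windows` is hypothetical.

```python
def snv_windows(contig_seq, pos, ref, alt, flank=150):
    """Return (ref_seq, alt_seq) windows of length 2 * flank + 1 around a SNV.

    `pos` is 1-based, as in VCF. Raises ValueError if the window would run
    off the contig or the reference base does not match the contig.
    """
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch handles single-nucleotide variants only")
    idx = pos - 1  # convert to a 0-based string index
    if idx - flank < 0 or idx + flank >= len(contig_seq):
        raise ValueError("window runs off the end of the contig")
    if contig_seq[idx].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")
    ref_seq = contig_seq[idx - flank: idx + flank + 1].upper()
    alt_seq = ref_seq[:flank] + alt.upper() + ref_seq[flank + 1:]
    return ref_seq, alt_seq
```

The REF check catches the most common extraction mistake, an off-by-one between 0-based and 1-based coordinates, before any model is run.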

Python API Reference

`DVR` class

Method	Description
`DVR(genome=..., model_dir=...)`	Initialize with reference genome FASTA path and an optional preferred model directory
`dvr.score_variant(chrom, pos, ref, alt, models)`	Score a single variant by genomic coordinates
`dvr.score_variants(df, models)`	Batch-score a DataFrame of variants (columns: chrom, pos, ref, alt)
`dvr.score_vcf(vcf_path, models)`	Score all variants in a VCF file
`dvr.score_sequence(ref_seq, alt_seq, models)`	Score pre-extracted 301bp REF/ALT sequences
`dvr.list_models(model_type=None)`	List available models; filter by `"TF"` or `"HISTONE"`
`dvr.search_models(query)`	Search models by name (e.g., `"ZNF"`, `"GATA"`)

Scoring parameters

All scoring methods accept these arguments (optional unless marked required):

Parameter	Default	Description
`models`	required	List of model names (e.g., `["CTCFL", "SP1"]`)
`model_type`	`None`	Score all models of a type: `"TF"` (458 models) or `"HISTONE"` (4 models)
`batch_size`	`32`	Batch size for GPU inference
`gpus`	`[0]`	List of GPU device IDs for parallel inference
`return_attention`	`False`	Extract DNABERT attention weights for interpretability
`coordinate_system`	`"1-based"`	Coordinate convention of the input positions; VCF positions are 1-based, so set `"0-based"` only when your positions start counting at 0
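
The difference between the two coordinate_system values is a shift of one. A minimal sketch of the conversion follows; the helper `to_one_based` is hypothetical and only mirrors the two option strings from the table above.

```python
def to_one_based(pos, coordinate_system="1-based"):
    """Normalize a variant position to the 1-based convention used by VCF."""
    if coordinate_system == "1-based":
        return pos
    if coordinate_system == "0-based":
        return pos + 1  # the first base is 0 in 0-based input, 1 in VCF
    raise ValueError(f"unknown coordinate system: {coordinate_system!r}")
```

For example, a position stored as 3456781 in a 0-based table refers to the same base as 3456782 in a VCF.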

CLI

# Score a single variant from the command line
deepvregulome score --chrom chr1 --pos 3456782 --ref A --alt C \
    --models CTCFL SP1 --genome /path/to/hg38.fa

# Score from VCF
deepvregulome score-vcf --vcf patient.vcf --models CTCFL SP1 MYC \
    --genome /path/to/hg38.fa --batch-size 100 --gpus 0 1

Repository Structure

DeepVRegulome/
├── src/deepvregulome/              # Python package (pip install deepvregulome)
│   ├── __init__.py                 #   Public API: DVR class
│   ├── dvr.py                      #   Main scoring engine
│   ├── registry.py                 #   Model registry (462 models + metadata)
│   ├── utils.py                    #   k-mer tokenization, sequence extraction
│   └── cli.py                      #   Command-line interface
├── notebooks/                      # Tutorial notebooks (see below)
│   ├── 01_quickstart.ipynb         #   Getting started with DVR
│   ├── 02_vcf_scoring.ipynb        #   Batch VCF analysis pipeline
│   ├── 03_attention_motifs.ipynb   #   Attention visualization & motif analysis
│   └── 04_clinical_pipeline.ipynb  #   End-to-end clinical variant analysis
├── streamlit_app/                  # Interactive web dashboard
│   └── app_variant_clinical_dashboard.py
├── assets/                         # Figures for README
│   └── flowchart.png
├── pyproject.toml                  # Package metadata & dependencies
├── LICENSE
└── README.md
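
The k-mer tokenization listed for utils.py refers to the DNABERT convention of representing a sequence as overlapping k-mers with a stride of one. The sketch below illustrates that idea; the function name `seq_to_kmers` and the default k=6 are assumptions, since this README does not state which k the fine-tuned models use.

```python
def seq_to_kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Joined with spaces, the k-mers form the "sentence" a k-mer tokenizer expects.
sentence = " ".join(seq_to_kmers("ATCGATCG", k=6))
```

A sequence of length L yields L - k + 1 tokens, so a 301 bp input produces 296 tokens when k=6.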

Tutorial Notebooks

The notebooks/ folder contains Jupyter notebooks that walk through common use cases:

Notebook	Description
`01_quickstart.ipynb`	Install, load models, score your first variant
`02_vcf_scoring.ipynb`	Parse a VCF, batch-score variants, filter candidates
`03_attention_motifs.ipynb`	Extract DNABERT attention weights, plot motif disruption
`04_clinical_pipeline.ipynb`	Full pipeline: VCF → TFBS intersection → scoring → candidate ranking

To run the notebooks:

pip install "deepvregulome[all]" jupyterlab
jupyter lab notebooks/

Streamlit Dashboard

An interactive dashboard for exploring variant predictions and clinical stratification is available:

Live demo: https://deepvregulome.streamlit.app

Model Checkpoints

All 462 fine-tuned DNABERT models (458 TFs + 4 histone marks) are hosted on Hugging Face:

https://huggingface.co/duttaprat/DeepVRegulome

By default, models are fetched automatically through the Hugging Face cache. To pre-download them into your own directory instead, call download_models(..., model_dir=...) and pass the same location to DVR(..., model_dir=...) or to any scoring call.

For direct access without the deepvregulome package:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/DeepVRegulome", subfolder="models/CTCFL"
)
model = AutoModelForSequenceClassification.from_pretrained(
    "duttaprat/DeepVRegulome", subfolder="models/CTCFL"
)

Roadmap

Current capabilities (v0.1.8):

Single-variant and batch VCF scoring with 462 ENCODE ChIP-seq models
Multi-GPU inference support
Attention-based interpretability
CLI and Python API

In development:

Splice-site disruption scoring (acceptor + donor models)
JASPAR motif enrichment integration
Expanded model zoo: additional cell types and epigenomic marks
Conda package

Planned:

REST API for web-based scoring
Integration with ClinVar and gnomAD annotation

Citation

If you use DeepVRegulome in your research, please cite:

@article{dutta2025deepvregulome,
  title={DeepVRegulome: DNABERT-based deep-learning framework for predicting
         the functional impact of short genomic variants on the human regulome},
  author={Dutta, Pratik and Obusan, Matthew and Sathian, Rekha and Chao, Max
          and Surana, Pallavi and Papineni, Nimisha and Ji, Yanrong
          and Zhou, Zhihan and Liu, Han and Yurovsky, Alisa
          and Davuluri, Ramana V},
  journal={arXiv preprint arXiv:2511.09026},
  year={2025},
  url={https://arxiv.org/abs/2511.09026}
}

License

CC-BY-NC-4.0. See LICENSE for details.

Davuluri Lab · Department of Biomedical Informatics · Stony Brook University

Release history

Version	Release date
0.3.0 (this version)	Apr 14, 2026
0.1.9	Mar 18, 2026
0.1.8	Mar 17, 2026
0.1.7	Mar 17, 2026
0.1.6	Mar 17, 2026
0.1.5	Mar 16, 2026
0.1.4	Mar 16, 2026
0.1.3	Mar 16, 2026
0.1.2	Mar 14, 2026
0.1.1	Mar 14, 2026
0.1.0	Mar 14, 2026

Download files

Source Distribution

deepvregulome-0.3.0.tar.gz (45.5 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

deepvregulome-0.3.0-py3-none-any.whl (42.7 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file deepvregulome-0.3.0.tar.gz.

File metadata

Download URL: deepvregulome-0.3.0.tar.gz
Upload date: Apr 14, 2026
Size: 45.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for deepvregulome-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`147f244969724606dfe67b9f6eb9234e2b87cd430eb455db911e4d9eedc43331`
MD5	`52d26c5b14a6c2cc84739cb210958da6`
BLAKE2b-256	`242e381c75eff772920d3dc38c18b9b2f9ad9a14943f4e303d80c3667032c1f2`
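
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; the helper names `sha256_of_file` and `verify_sha256` are hypothetical, and the file is read in chunks so that large downloads do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, call `verify_sha256("deepvregulome-0.3.0.tar.gz", "<SHA256 from the table>")` and treat a False result as a corrupted or tampered download.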

File details

Details for the file deepvregulome-0.3.0-py3-none-any.whl.

File metadata

Download URL: deepvregulome-0.3.0-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 42.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for deepvregulome-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fc5516fef52acf26c757c3e363e2202f44af408569670e30043604362e51df51`
MD5	`5cc839ab289a211df1c6c97b3f14c8e3`
BLAKE2b-256	`2d2eecc091a16f741944414af19ec6c6549d0d4777ce7579abc917acd1179890`
