DNABERT-based framework for predicting the functional impact of regulatory variants

DeepVRegulome

DeepVRegulome is an end‑to‑end framework for predicting the functional impact of small somatic variants in non‑coding regulatory regions (splice sites and transcription‑factor‑binding sites) using fine‑tuned DNABERT models.


✨ Key Features

  • ✅ DNABERT-based classifiers for:
    • Splice sites (acceptor, donor)
    • ~700 TFBS models
  • ✅ Region-aware scoring of somatic variants using Δp and log₂ odds
  • ✅ Batch processing with multiprocessing and BED/VCF support
  • ✅ Interactive Streamlit dashboard with:
    • Variant tables, plots, and survival analysis
    • Attention score visualizations
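
The scoring feature above names Δp and log₂ odds without giving formulas. The sketch below shows one common way to compute both from a pair of model probabilities. It assumes Δp is the mutant probability minus the reference probability and that the log₂ odds score is the log₂ odds ratio of the two; the function names and the clipping constant `eps` are illustrative and are not taken from the DeepVRegulome code.

```python
import math


def delta_p(p_ref: float, p_mut: float) -> float:
    """Change in predicted probability (assumed definition: mutant minus reference)."""
    return p_mut - p_ref


def log2_odds_ratio(p_ref: float, p_mut: float, eps: float = 1e-6) -> float:
    """log2 of (mutant odds / reference odds).

    Probabilities are clipped to [eps, 1 - eps] so that predictions of
    exactly 0 or 1 do not cause a division by zero or log of zero.
    """
    p_ref = min(max(p_ref, eps), 1.0 - eps)
    p_mut = min(max(p_mut, eps), 1.0 - eps)
    odds_ref = p_ref / (1.0 - p_ref)
    odds_mut = p_mut / (1.0 - p_mut)
    return math.log2(odds_mut / odds_ref)


# Hypothetical example: a variant that lowers the predicted probability
# of a functional site from 0.9 to 0.2.
print(delta_p(0.9, 0.2))
print(log2_odds_ratio(0.9, 0.2))
```

A negative value in either score indicates that the variant reduces the predicted probability of a functional site; how the two scores are thresholded to call candidate variants is defined in the scoring notebook, not here.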

📁 Repository Structure

DeepVRegulome/
├── .devcontainer/
├── .streamlit/
├── data/
│   └── Brain/
├── figures/                         # Exported visualizations (e.g. attention maps)
│   └── attention/
│       ├── CTCFL/
│       └── ZNF384/
├── notebooks/                      # Jupyter notebooks for key pipeline steps
│   ├── 01_parse_and_merge_vcfs.ipynb            # Merge and parse VCFs
│   ├── 02_tfbs_intersection.ipynb               # Intersect VCF with TFBS BEDs
│   ├── 03_dnabert_input_generation.ipynb        # Generate sequences for DNABERT
│   ├── 04_scoring_candidate_variants.ipynb      # Compute Δp / logOR & rank variants
│   └── 05_tfbs_attention_motif_visualization.ipynb  # Plot attention scores & motifs
├── scripts/                       # Shell scripts for batch inference
│   ├── run_prediction_tfbs.sh                 # Predict with TFBS models
│   └── run_prediction_splice_acceptor.sh      # Predict with acceptor models
├── src/
│   └── deepvregulome/             # Core Python modules
│       ├── __init__.py
│       ├── dnabert_data_generation.py         # Wild/mutated seq generation
│       ├── intersect.py                       # BED/VCF overlap engine
│       ├── vcf_loader.py                      # VCF parsing utilities
│       └── config.yaml                        # Centralized path config
├── streamlit_app/
│   └── app_variant_clinical_dashboard.py      # Live clinical dashboard
├── LICENSE
├── README.md
├── requirements.txt
└── .gitignore
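
To illustrate what a BED/VCF overlap step such as `intersect.py` has to handle, here is a minimal sketch. It is not the module's actual implementation: the `Region` and `Variant` types and both functions are hypothetical. The one detail it relies on is the standard coordinate conventions, where BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)
    name: str


class Variant(NamedTuple):
    chrom: str
    pos: int    # 1-based (VCF convention)
    ref: str
    alt: str


def overlaps(variant: Variant, region: Region) -> bool:
    """True if the reference allele span of the variant touches the region."""
    v_start = variant.pos - 1            # convert VCF POS to 0-based
    v_end = v_start + len(variant.ref)   # half-open end of the REF allele
    return (
        variant.chrom == region.chrom
        and v_start < region.end
        and region.start < v_end
    )


def intersect(variants: List[Variant], regions: List[Region]) -> List[Tuple[Variant, Region]]:
    """Brute-force pairing of variants and regions; fine for small inputs only."""
    return [(v, r) for v in variants for r in regions if overlaps(v, r)]


# Hypothetical region and variants for illustration.
regions = [Region("chr1", 100, 200, "CTCFL_peak")]
variants = [Variant("chr1", 101, "A", "G"), Variant("chr1", 201, "C", "T")]
print(intersect(variants, regions))
```

The nested loop is quadratic, so a real pipeline over ~700 TFBS tracks would more likely sort intervals or use an interval tree; the sketch only pins down the coordinate handling.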

🧪 Installation

git clone https://github.com/DavuluriLab/DeepVRegulome.git
cd DeepVRegulome
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

⚙️ Typical Pipeline Flow

| Step | Description | Location |
|------|-------------|----------|
| 1️⃣ | Parse + merge somatic VCFs | 01_parse_and_merge_vcfs.ipynb |
| 2️⃣ | Intersect variants with TFBS BEDs | 02_tfbs_intersection.ipynb |
| 3️⃣ | Generate ref/mutated k-mers for DNABERT | 03_dnabert_input_generation.ipynb |
| 4️⃣ | Predict with DNABERT models | scripts/run_prediction_tfbs.sh |
| 5️⃣ | Compute Δp, find candidate variants | 04_scoring_candidate_variants.ipynb |
| 6️⃣ | Visualize attention scores and motifs | 05_tfbs_attention_motif_visualization.ipynb |
| 7️⃣ | Browse results interactively | streamlit_app/app_variant_clinical_dashboard.py |
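
Step 3 of the flow, generating reference and mutated k-mers, can be sketched as follows. DNABERT takes its input as a space-separated string of overlapping k-mers; the helper names, the choice of k = 6, and the toy sequence are assumptions for illustration and do not reflect the notebook's actual code or settings.

```python
def apply_variant(ref_seq: str, offset: int, ref: str, alt: str) -> str:
    """Return ref_seq with the REF allele at the 0-based offset replaced by ALT."""
    if ref_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return ref_seq[:offset] + alt + ref_seq[offset + len(ref):]


def to_kmers(seq: str, k: int = 6) -> str:
    """Convert a sequence into space-separated overlapping k-mers (stride 1)."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical 8-bp window with a G>T substitution at offset 2.
wild = "ACGTACGT"
mutated = apply_variant(wild, 2, "G", "T")
print(to_kmers(wild))
print(to_kmers(mutated))
```

Insertions and deletions change the length of the mutated sequence, so a real implementation would also need to re-centre or pad the window to the fixed input length the model expects.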

📊 Example Outputs

  • Candidate variant count by TFBS
  • DNABERT attention heatmaps
  • High-impact motif shifts due to mutations
  • Kaplan–Meier plots for clinical stratification

See figures/attention/ for example attention maps (e.g. CTCFL and ZNF384).

🌐 Live Demo

An interactive instance of the DeepVRegulome dashboard is hosted here:

➡️ https://davuluri-lab-brainved.streamlit.app/

The deployed app lets you browse model performance metrics and variant-effect predictions without installing any software locally.

🧬 Model Checkpoints

The full fine-tuned DNABERT weights (acceptor, donor, and 700 TFBS models) will be deposited on Zenodo and made publicly available upon journal acceptance. In the meantime, researchers may request access by emailing pratik.dutta@stonybrook.edu and ramana.davuluri@stonybrookmedicine.edu with a brief statement of intended use.

Citation

If you use DeepVRegulome in your research, please cite:

License

MIT. See LICENSE for details.


