dvc-databricks
A DVC remote storage plugin that enables data versioning on Databricks Unity Catalog Volumes.
Store large data files on Databricks Volumes (backed by S3 or ADLS), keep only lightweight .dvc pointer files in your git repository, and use standard DVC commands — no custom code required.
dvc push # uploads data to Databricks Volume via Databricks SDK
dvc pull # downloads data from Databricks Volume
Why this plugin?
Databricks Unity Catalog Volumes cannot be accessed like a plain S3 bucket — all I/O must go through the Databricks Files API. This plugin bridges DVC and the Databricks SDK so you can version and share datasets stored on Volumes without ever leaving the standard DVC workflow.
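For context, this is what "going through the Files API" looks like with the raw SDK — a minimal sketch of what the plugin automates, assuming the databricks-sdk package and a configured profile; the volume path is made up:

```python
# Minimal sketch of raw Files API I/O via the Databricks SDK.
# The volume path below is hypothetical.
import io
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # resolves credentials from ~/.databrickscfg

# Uploads go through files.upload, not a cloud-storage SDK.
w.files.upload(
    "/Volumes/ml_catalog/datasets/storage/hello.txt",
    io.BytesIO(b"hello volumes"),
    overwrite=True,
)

# Downloads come back as a stream from files.download.
resp = w.files.download("/Volumes/ml_catalog/datasets/storage/hello.txt")
print(resp.contents.read())
```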
Requirements
- Python >= 3.11
- DVC >= 3.0
- Databricks CLI configured with a profile in ~/.databrickscfg
- Access to a Databricks Unity Catalog Volume
Installation
pip install dvc-databricks
Once installed, the dbvol:// remote protocol is automatically available to DVC in every process — no imports or additional configuration needed.
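If you want to verify the registration, one quick sanity check — assuming the plugin registers the scheme with fsspec, which DVC filesystems build on:

```python
# In a fresh interpreter after `pip install dvc-databricks`, the dbvol
# scheme should resolve to the plugin's filesystem class.
from fsspec import get_filesystem_class

print(get_filesystem_class("dbvol"))
```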
Setup
1. Initialize DVC in your repository (if not already done)
dvc init
git add .dvc
git commit -m "initialize DVC"
2. Add the Databricks Volume as a DVC remote
dvc remote add -d myremote \
dbvol:///Volumes/<catalog>/<schema>/<volume>/<path>
Example:
dvc remote add -d myremote \
dbvol:///Volumes/ml_catalog/datasets/storage/dvc_cache
3. Set your Databricks profile
export DATABRICKS_CONFIG_PROFILE=<your-profile-name>
Note: DVC remotes do not support arbitrary config keys, so the Databricks profile must be provided via this environment variable — it cannot be stored in .dvc/config. Add the export to your ~/.zshrc or ~/.bashrc to make it permanent.
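The environment variable is enough because the Databricks SDK resolves the profile itself when the client is created; a sketch (the profile name is hypothetical):

```python
# The plugin never reads a profile from .dvc/config: the SDK picks up
# DATABRICKS_CONFIG_PROFILE on its own when WorkspaceClient() is built.
import os
from databricks.sdk import WorkspaceClient

os.environ["DATABRICKS_CONFIG_PROFILE"] = "my-profile"  # hypothetical name
w = WorkspaceClient()  # host/token come from ~/.databrickscfg
print(w.config.host)   # confirm which workspace was resolved
```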
Usage
Track a data file
dvc add data/dataset.csv
This creates data/dataset.csv.dvc — a small pointer file that goes into git.
DVC automatically adds the actual data file to .gitignore, so only the pointer enters git.
Push data to the Volume
dvc push
Uploads the file to your Databricks Volume via the Databricks SDK.
Commit the pointer to git
git add data/dataset.csv.dvc .gitignore
git commit -m "track dataset v1 with DVC"
git push
Pull data in another environment
git clone <your-repo>
pip install dvc-databricks
export DATABRICKS_CONFIG_PROFILE=<your-profile-name>
dvc pull
CLI — dvc-databricks add
The dvc-databricks add command recursively finds files under a directory and tracks each one with DVC, creating one .dvc pointer file per file. The full folder structure is preserved in git, which allows granular pulls by file or subfolder — unlike DVC's built-in dvc add <dir>, which creates a single .dvc file for the whole directory.
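Conceptually, the command behaves like the following sketch; the real CLI may use DVC's Python API rather than shelling out, and the helper name is illustrative:

```python
# Hypothetical sketch of per-file tracking: one `dvc add` per file yields
# one .dvc pointer per file, so subfolders can later be pulled individually.
import subprocess
from pathlib import Path

def add_tree(root: str) -> None:
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and p.suffix != ".dvc":
            subprocess.run(["dvc", "add", str(p)], check=True)

add_tree("/path/to/dataset")
```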
Syntax
dvc-databricks add <path> [--include EXT ...] [--exclude EXT ...]
Arguments
| Argument | Description |
|---|---|
| `path` | Root directory to scan recursively (required). |
| `--include EXT ...` | Whitelist — only track files with these extensions. Accepts multiple values. |
| `--exclude EXT ...` | Blacklist — always skip files with these extensions. Accepts multiple values. Takes precedence over `--include`. |
Extensions can be written with or without a leading dot (.csv and csv are equivalent) and are matched case-insensitively.
Filter logic
--includeis a whitelist: only files whose extension is in the list are tracked.--excludeis a blacklist: files whose extension is in the list are always skipped.- When both are provided,
--excludetakes precedence over--include. - When neither is provided, all files are tracked.
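The rules above are small enough to state in code; a minimal sketch (helper names are illustrative, not the plugin's internals):

```python
# Illustrative implementation of the include/exclude rules.
from pathlib import Path

def _norm(ext: str) -> str:
    # ".CSV", "csv" and ".csv" all normalize to "csv"
    return ext.lower().lstrip(".")

def _ext(path: str) -> str:
    name = Path(path).name
    return _norm(name.rsplit(".", 1)[-1]) if "." in name else ""

def should_track(path, include=None, exclude=None) -> bool:
    ext = _ext(path)
    if exclude and ext in {_norm(e) for e in exclude}:
        return False                 # blacklist always wins
    if include:
        return ext in {_norm(e) for e in include}
    return True                      # no filters: track everything

assert should_track("data/a.CSV", include=[".csv"])      # dot/case-insensitive
assert not should_track("data/.DS_Store",
                        include=[".csv"],
                        exclude=[".DS_Store"])           # exclude wins
```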
Examples
# Track only CSV and JSON files
dvc-databricks add /path/to/dataset --include .csv .json
# Track all files except macOS artifacts and temp files
dvc-databricks add /path/to/dataset --exclude .DS_Store .tmp .log
# Only CSVs, but skip .DS_Store even if --include .csv is set
dvc-databricks add /path/to/dataset --include .csv --exclude .DS_Store
# Track all files with no filters
dvc-databricks add /path/to/dataset
After running
git add .
git commit -m "track dataset file by file"
dvc push
- One `.dvc` pointer file is created next to each tracked data file.
- Each directory containing tracked files gets a `.gitignore` that excludes the raw data files from git.
- `dvc push` uploads all tracked files to the configured Databricks Volume.
How it works
Your git repo Databricks Volume (S3 / ADLS)
────────────────── ───────────────────────────────────
data/dataset.csv.dvc ──────► /Volumes/catalog/schema/vol/
.dvc/config └── files/md5/
├── ab/cdef1234... ← actual data
└── 9f/123abc... ← actual data
1. `dvc add` (or `dvc-databricks add`) hashes the file and stores it in the local DVC cache (`.dvc/cache`).
2. A `.dvc` pointer file containing the MD5 hash is created next to your data file.
3. `dvc push` uploads from the local cache to the Volume using the Databricks Files API (`WorkspaceClient.files.upload`). Files are stored content-addressed: `<volume_path>/files/md5/<hash[:2]>/<hash[2:]>`.
4. `dvc pull` downloads from the Volume into the local cache, then restores the file to its original path.
Only .dvc pointer files are ever committed to git — the data stays on the Volume.
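As a sketch of step 3 above, here is how a pushed file's Volume path can be derived and what the underlying SDK call looks like; the paths are hypothetical:

```python
# Sketch: reproduce DVC's content-addressed layout and the upload call.
import hashlib
from pathlib import Path
from databricks.sdk import WorkspaceClient

def remote_path(volume_root: str, local_file: str) -> str:
    # DVC stores objects as files/md5/<hash[:2]>/<hash[2:]>
    md5 = hashlib.md5(Path(local_file).read_bytes()).hexdigest()
    return f"{volume_root}/files/md5/{md5[:2]}/{md5[2:]}"

w = WorkspaceClient()
target = remote_path("/Volumes/ml_catalog/datasets/storage/dvc_cache",
                     "data/dataset.csv")
with open("data/dataset.csv", "rb") as f:
    w.files.upload(target, f, overwrite=True)
```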
Architecture
The plugin follows the same pattern as official DVC plugins:
| Class | Base | Role |
|---|---|---|
| `DatabricksVolumesFileSystem` | `dvc_objects.FileSystem` | DVC-facing layer: config, checksum strategy, dependency check |
| `_DatabricksVolumesFS` | `fsspec.AbstractFileSystem` | I/O layer: all Databricks SDK calls |
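To make the split concrete, here is a skeletal version of the I/O layer — a hypothetical sketch of the pattern, not the plugin's actual source:

```python
# Skeleton of an fsspec filesystem whose methods delegate to the SDK.
from fsspec import AbstractFileSystem
from databricks.sdk import WorkspaceClient

class _DatabricksVolumesFS(AbstractFileSystem):
    protocol = "dbvol"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._client = WorkspaceClient()  # honors DATABRICKS_CONFIG_PROFILE

    def ls(self, path, detail=False, **kwargs):
        # Directory listings go through the Files API, not S3/ADLS.
        entries = self._client.files.list_directory_contents(path)
        infos = [
            {
                "name": e.path,
                "size": e.file_size or 0,
                "type": "directory" if e.is_directory else "file",
            }
            for e in entries
        ]
        return infos if detail else [i["name"] for i in infos]
```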
A .pth file installed into site-packages ensures the plugin is loaded at
Python startup in every process (including DVC CLI subprocesses), without
requiring any manual imports.
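The .pth mechanism is a standard CPython feature: at interpreter startup, `site` executes lines in `site-packages/*.pth` files that begin with `import`. A hypothetical version of what the imported module might do — the registration call is an assumption, not confirmed from the source:

```python
# site-packages/dvc_databricks.pth would contain one line such as:
#   import dvc_databricks_register
# and the imported module registers the scheme, e.g. with fsspec:
from fsspec import register_implementation

register_implementation(
    "dbvol",                                # scheme used in dbvol:// URLs
    "dvc_databricks._DatabricksVolumesFS",  # assumed import path
    clobber=True,
)
```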
Environment variables
| Variable | Description |
|---|---|
| `DATABRICKS_CONFIG_PROFILE` | Databricks CLI profile name from `~/.databrickscfg`. Falls back to the default profile if not set. |
License
MIT © Óscar Reyes