Minimal data version control - a lightweight wrapper around DVC
DVX - Minimal Data Version Control
DVX is a lightweight wrapper around DVC that provides core data versioning with several enhancements:
- Parallel pipeline execution with per-file provenance tracking
- Decentralized workflow definitions - each .dvc file contains its computation, deps, and outputs
- Enhanced diff with preprocessing pipelines and directory support
- Cache introspection commands for examining cached data
- Performance optimizations for large repos (batched git lookups, mtime caching)
Why DVX?
Decentralized Pipelines
DVC defines pipeline stages centrally in dvc.yaml (typically one file per project or directory). DVX takes a different approach: each .dvc file contains its own computation metadata:
# output.parquet.dvc
outs:
- md5: abc123...
  size: 1048576
  hash: md5
  path: output.parquet
meta:
  computation:
    cmd: python process.py input.parquet output.parquet
    deps:
    - path: input.parquet
      md5: def456...
This means:
- Parallel execution: Independent artifacts run concurrently
- Self-contained provenance: Each output knows exactly how it was created
- Git-friendly: Changes to one artifact don't touch other files
- No lock contention: Multiple processes can add artifacts simultaneously
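The parallel-execution point above can be illustrated with a small scheduler. This is a minimal sketch using only the Python standard library, not DVX's actual implementation: the `deps` mapping, the `run_one` callback, and the function name `run_parallel` are all hypothetical names chosen for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_parallel(deps, run_one, max_workers=4):
    """Run each artifact after its deps; independent artifacts run concurrently.

    deps: {artifact: set of artifacts it depends on} (hypothetical structure)
    run_one: callable invoked once per artifact
    Returns the artifacts in completion order.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {}
        while sorter.is_active():
            # Submit every artifact whose dependencies have all completed.
            for node in sorter.get_ready():
                pending[pool.submit(run_one, node)] = node
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any failure from run_one
                node = pending.pop(future)
                finished.append(node)
                sorter.done(node)  # unblocks artifacts that depend on this one
    return finished
```

In this sketch, two outputs that both depend on the same input are submitted together as soon as that input finishes, which is the behavior the "independent artifacts run concurrently" bullet describes.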
Enhanced Diff
Diff DVC-tracked files between commits, optionally piping through preprocessing commands first. This is especially useful for binary formats like Parquet, gzipped files, or any format that benefits from transformation before diffing.
# Content diff of a file (HEAD vs worktree)
dvx diff data.csv
# Diff between commits
dvx diff -r HEAD^..HEAD data.csv
# Diff specific commit vs its parent
dvx diff -R abc123 data.csv
# Summary mode: show changed files with hashes
dvx diff -s
dvx diff -s -r HEAD~5..HEAD
Preprocessing Pipelines
The real power is piping files through commands before diffing:
# Compare line counts
dvx diff wc -l data.csv
# Compare Parquet schema (using parquet2json)
dvx diff parquet2json {} schema data.parquet
# Compare first row as pretty JSON
dvx diff 'parquet2json {} cat -l 1 | jq .' data.parquet
# Decompress and compare headers of gzipped CSVs
dvx diff 'gunzip -c {} | head -n1' data.csv.gz
# Compare sorted, deduplicated content
dvx diff 'sort {} | uniq' data.txt
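Conceptually, a preprocessing diff runs the same command over both versions of a file and diffs the two outputs. The sketch below shows that idea in Python under stated assumptions: it requires a POSIX shell, the function names are hypothetical, and it substitutes `{}` with the file path (appending the path when no placeholder is present), which is an assumption about the behavior rather than a description of DVX's internals.

```python
import difflib
import shlex
import subprocess


def run_preprocessor(cmd_template, path):
    """Run a shell command on `path` and return its stdout as a list of lines."""
    quoted = shlex.quote(path)
    # Substitute the `{}` placeholder, or append the path when there is none.
    if "{}" in cmd_template:
        cmd = cmd_template.replace("{}", quoted)
    else:
        cmd = f"{cmd_template} {quoted}"
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def preprocessed_diff(cmd_template, old_path, new_path):
    """Unified diff of the command's output on two versions of a file."""
    old_lines = run_preprocessor(cmd_template, old_path)
    new_lines = run_preprocessor(cmd_template, new_path)
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )
```

With `"sort {} | uniq"` as the template, two files that differ only in line order or duplicates produce an empty diff, which is why preprocessing helps when the raw bytes are noisy.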
Directory Diffs
When diffing DVC-tracked directories, DVX shows which files changed with their hashes:
$ dvx diff -R abc123 data/
test.parquet: c07bba3f... -> f46dd86f...
test.txt: e20b902b... -> 9306ec07...
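A directory diff of this kind reduces to comparing two manifests that map relative paths to hashes. The following is a rough sketch of that comparison; the manifest shape (`{path: md5}`), the function name, and the "added"/"removed" wording are assumptions for illustration and are not taken from DVX.

```python
def diff_manifests(old, new):
    """Compare two {relative_path: md5} mappings and describe what changed."""

    def short(md5):
        # Abbreviate hashes to 8 hex characters, as in the output shown above.
        return f"{md5[:8]}..."

    lines = []
    for path in sorted(old.keys() | new.keys()):
        before, after = old.get(path), new.get(path)
        if before == after:
            continue  # unchanged files are omitted
        if before is None:
            lines.append(f"{path}: added {short(after)}")
        elif after is None:
            lines.append(f"{path}: removed {short(before)}")
        else:
            lines.append(f"{path}: {short(before)} -> {short(after)}")
    return lines
```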
Cache Introspection
Examine cached data without checkout:
# Get cache path for a tracked file
dvx cache path data.parquet
# Get MD5 hash
dvx cache md5 data.parquet
# View cached file contents directly
dvx cat data.csv
# Works with files inside DVC-tracked directories
dvx cat data_dir/subset.parquet
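To make the relationship between a hash and a cache location concrete, here is a small sketch. It assumes the default DVC 3.x cache layout (`.dvc/cache/files/md5/<first two hex chars>/<rest>`); older DVC versions and custom cache configurations differ, and the helper names are hypothetical rather than part of DVX.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5, cache_dir=".dvc/cache"):
    """Map an MD5 to its location in the cache.

    Assumed DVC 3.x layout: <cache>/files/md5/<first 2 hex chars>/<remaining 30>.
    """
    return Path(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]
```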
Installation
pip install dvx
# With S3 support
pip install dvx[s3]
# With all remote backends
pip install dvx[all]
Usage
Running Pipelines
# Run all .dvc computations (parallel by default)
dvx run
# Run specific target
dvx run output.dvc
# Use 4 parallel workers
dvx run -j 4
# Dry-run to see execution plan
dvx run --dry-run
# Force re-run (ignore freshness)
dvx run --force
Tracking Data
# Initialize
dvx init
# Track files (parallel-safe, lock-free)
dvx add data/
dvx add model.pkl
# Track with dependencies (for provenance)
dvx add output.parquet --dep input.parquet --cmd "python process.py"
# Auto-add stale deps first (recursive)
dvx add -r output.parquet
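Lock-free tracking of this kind is commonly built on atomic file writes: write to a temporary file, then rename it over the target. The sketch below shows that general pattern with the standard library; it is an illustration of the technique, not DVX's code, and `atomic_write_text` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to `path` so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX; replaces any existing file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because each .dvc file is written independently, concurrent writers using this pattern only ever replace whole files and need no shared lock.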
Status and Diff
# Check freshness (data vs deps)
dvx status
dvx status -v # also show fresh files
dvx status --yaml # detailed YAML output with hashes
dvx status -j4 data/ # parallel checking
# Content diff
dvx diff data.parquet
dvx diff -r HEAD^..HEAD results/
dvx diff -s # summary mode (files + hashes)
Push/Pull
# Configure remote
dvx remote add -d myremote s3://mybucket/dvc
# Push to remote
dvx push
dvx push --dry-run # see what would be pushed
# Pull from remote
dvx pull
dvx pull --dry-run # see what would be pulled
# Ref-specific operations
dvx pull -r HEAD~3 # pull data as of 3 commits ago
Python API
from dvx import Repo
# Initialize
repo = Repo.init()
# Or open existing
with Repo() as repo:
    repo.add("data/")
    repo.push()
    status = repo.status()
    diff = repo.diff("HEAD~1")
Commands
| Command | Description |
|---|---|
| run | Execute computations from .dvc files (parallel) |
| add | Track file(s) with optional provenance |
| status | Show freshness of tracked files (data & deps) |
| diff | Content diff with preprocessing support |
| cache | Inspect cache (path, md5, dir) |
| cat | View cached file contents |
| push | Upload data to remote storage |
| pull | Download data from remote storage |
| fetch | Download to cache (no checkout) |
| checkout | Restore data files from cache |
| gc | Garbage collect unused cache |
| init | Initialize a DVX repository |
| remote | Manage remotes |
| config | Configure settings |
| remove | Stop tracking file(s) |
| move | Move tracked file(s) |
| import | Import from another DVC repo |
| import-url | Import from a URL |
| get | Download without tracking |
| get-url | Download URL without tracking |
| shell-integration | Output shell aliases |
What's Different from DVC
Added in DVX
- dvx run - Parallel pipeline execution with per-file provenance
- dvx cache path/md5 - Cache introspection
- dvx cat - View cached files directly
- dvx diff preprocessing - Pipe through commands before diffing
- dvx status --yaml - Detailed status with hashes
- Lock-free parallel add operations
- Git blob batching for faster status checks
Not Included
DVX intentionally excludes DVC's heavier features:
- Central pipelines (dvc.yaml, dvc repro, dvc dag)
- Experiments (dvc exp, experiment tracking)
- Metrics/Params/Plots (dvc metrics, dvc params, dvc plots)
If you need these features, use DVC directly.
Freshness Model
DVX tracks two types of freshness for each artifact:
- Data freshness: Does the actual data match the hash in the .dvc file?
- Dep freshness: Do recorded dependency hashes match the deps' .dvc files?
$ dvx status s3/output/
✗ s3/output/result.parquet.dvc (data changed (abc123... vs def456...))
✗ s3/output/summary.json.dvc (dep changed: s3/input/data.parquet)
✓ s3/output/metadata.json.dvc (up-to-date)
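The two checks can be expressed as a short function. This is a simplified sketch of the model described above, assuming an in-memory `Artifact` record (a hypothetical structure) in place of real .dvc files; the order of the two hashes in the "data changed" message is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    recorded_md5: str  # hash stored in the artifact's .dvc file
    actual_md5: str    # hash of the data currently on disk
    # {dep path: md5 recorded when this artifact was added}
    recorded_deps: dict = field(default_factory=dict)


def check_freshness(path, artifacts):
    """Return a status string for `path`, checking data first and then deps."""
    art = artifacts[path]
    # Data freshness: does the data on disk match the recorded hash?
    if art.actual_md5 != art.recorded_md5:
        return f"data changed ({art.recorded_md5[:6]}... vs {art.actual_md5[:6]}...)"
    # Dep freshness: does each recorded dep hash match the dep's own .dvc hash?
    for dep_path, dep_md5 in art.recorded_deps.items():
        if artifacts[dep_path].recorded_md5 != dep_md5:
            return f"dep changed: {dep_path}"
    return "up-to-date"
```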
Provenance Tracking
When adding outputs with dependencies:
- Deps must be fresh: dvx add errors if any dep's hash differs from its .dvc hash
- Recursive add: Use dvx add -r to auto-add stale deps first
- Accurate recording: Recorded dep hashes always match what was actually used
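The recursive rule ("add stale deps first") amounts to a post-order walk of the dependency graph. The sketch below is one way to compute that order; `deps_of`, `is_stale`, and `add_order` are hypothetical names, and this is not a description of how DVX implements `-r`.

```python
def add_order(target, deps_of, is_stale):
    """Return the artifacts to (re-)add, dependencies before dependents.

    deps_of: {artifact: list of its dependencies} (hypothetical structure)
    is_stale: callable returning True when an artifact's data differs from
              the hash in its .dvc file
    """
    order, seen = [], set()

    def visit(path):
        if path in seen:
            return
        seen.add(path)
        # Visit dependencies first so they are added before their dependents.
        for dep in deps_of.get(path, ()):
            visit(dep)
        # The requested target is always added; deps only when stale.
        if path == target or is_stale(path):
            order.append(path)

    visit(target)
    return order
```

Adding in this order means that by the time the target's dep hashes are recorded, every dep's .dvc file already reflects the data that was actually used.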
Performance
DVX is optimized for large repos:
- Mtime caching: SQLite-backed cache skips unchanged files
- Batched git lookups: Single git ls-tree call for all blob SHAs
- Lock-free adds: Parallel-safe via atomic file writes
- Parallel status: Check many files concurrently with -j/--jobs
- Parallel runs: Independent computations execute concurrently
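As an illustration of the mtime-caching idea, the sketch below stores each file's modification time, size, and MD5 in SQLite and rehashes only when the mtime or size has changed. The table schema, the class name `MtimeHashCache`, and the `hash_count` counter are assumptions made for this example; DVX's actual cache format is not shown in this document.

```python
import hashlib
import os
import sqlite3


class MtimeHashCache:
    """Cache file hashes keyed by path, invalidated by mtime or size changes."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, md5 TEXT)"
        )
        self.hash_count = 0  # number of times file contents were actually read

    def md5(self, path):
        path = os.path.abspath(path)
        st = os.stat(path)
        row = self.db.execute(
            "SELECT mtime_ns, size, md5 FROM hashes WHERE path = ?", (path,)
        ).fetchone()
        if row and row[0] == st.st_mtime_ns and row[1] == st.st_size:
            return row[2]  # unchanged since the last hash: skip reading the file
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        self.hash_count += 1
        self.db.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
            (path, st.st_mtime_ns, st.st_size, digest),
        )
        self.db.commit()
        return digest
```

The trade-off in this design is that a file rewritten with identical size and mtime would be missed; the saving is that status checks on large, unchanged files cost one `stat` call instead of a full read.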
Compatibility
- DVX uses .dvc files - fully compatible with DVC
- DVX repos are DVC repos - you can use dvc commands too
- DVC plugins (dvc-s3, dvc-gs, etc.) work with DVX
License
Apache 2.0