
Minimal data version control - a lightweight wrapper around DVC

DVX - Minimal Data Version Control

DVX is a lightweight wrapper around DVC that provides core data versioning with several enhancements:

  • Parallel pipeline execution with per-file provenance tracking
  • Decentralized workflow definitions - each .dvc file contains its computation, deps, and outputs
  • Enhanced diff with preprocessing pipelines and directory support
  • Cache introspection commands for examining cached data
  • Performance optimizations for large repos (batched git lookups, mtime caching)

Why DVX?

Decentralized Pipelines

DVC defines pipeline stages centrally in dvc.yaml, so one file describes many outputs. DVX takes a different approach: each .dvc file carries its own computation metadata:

# output.parquet.dvc
outs:
  - md5: abc123...
    size: 1048576
    hash: md5
    path: output.parquet
meta:
  computation:
    cmd: python process.py input.parquet output.parquet
    deps:
      - path: input.parquet
        md5: def456...

This means:

  • Parallel execution: Independent artifacts run concurrently
  • Self-contained provenance: Each output knows exactly how it was created
  • Git-friendly: Changes to one artifact don't touch other files
  • No lock contention: Multiple processes can add artifacts simultaneously
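
The parallel-execution point above can be sketched with the standard library. This is a minimal illustration, not DVX's scheduler: it assumes the deps recorded in each .dvc file have already been parsed into a mapping, and the `computations` dict and `execution_waves` helper are hypothetical names.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of parsed .dvc files: output path -> dep paths.
# In a real repo these would come from each file's meta.computation.deps.
computations = {
    "raw.parquet": [],                  # plain tracked data, no computation
    "clean.parquet": ["raw.parquet"],
    "stats.json": ["clean.parquet"],
    "plot.png": ["clean.parquet"],
}


def execution_waves(graph):
    """Group artifacts into waves; members of one wave have no deps on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the deps form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every dep of these is already done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(computations)
for number, wave in enumerate(waves):
    print(number, wave)
```

Here stats.json and plot.png land in the same wave, so a runner could hand them to separate workers.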

Enhanced Diff

Diff DVC-tracked files between commits, optionally piping through preprocessing commands first. This is especially useful for binary formats like Parquet, gzipped files, or any format that benefits from transformation before diffing.

# Content diff of a file (HEAD vs worktree)
dvx diff data.csv

# Diff between commits
dvx diff -r HEAD^..HEAD data.csv

# Diff specific commit vs its parent
dvx diff -R abc123 data.csv

# Summary mode: show changed files with hashes
dvx diff -s
dvx diff -s -r HEAD~5..HEAD

Preprocessing Pipelines

The real power is piping files through commands before diffing:

# Compare line counts
dvx diff wc -l data.csv

# Compare Parquet schema (using parquet2json)
dvx diff parquet2json {} schema data.parquet

# Compare first row as pretty JSON
dvx diff 'parquet2json {} cat -l 1 | jq .' data.parquet

# Decompress and compare headers of gzipped CSVs
dvx diff 'gunzip -c {} | head -n1' data.csv.gz

# Compare sorted, deduplicated content
dvx diff 'sort {} | uniq' data.txt
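
The idea behind these pipelines is to apply the same transformation to both versions of a file and diff the results. The sketch below shows that idea in Python only. It assumes the two versions are already in memory, and `preprocessed_diff` and `gz_header` are hypothetical helpers rather than DVX functions. `gz_header` stands in for the `gunzip -c {} | head -n1` example above.

```python
import difflib
import gzip


def preprocessed_diff(old: bytes, new: bytes, preprocess, label: str = "file") -> str:
    """Run the same preprocessing on both versions, then diff the resulting text."""
    old_lines = preprocess(old).splitlines(keepends=True)
    new_lines = preprocess(new).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=f"a/{label}", tofile=f"b/{label}"
        )
    )


def gz_header(data: bytes) -> str:
    """Python stand-in for: gunzip -c {} | head -n1"""
    text = gzip.decompress(data).decode()
    return text.split("\n", 1)[0] + "\n"


old = gzip.compress(b"id,name\n1,a\n")
new = gzip.compress(b"id,name,email\n1,a,a@example.com\n")
diff_text = preprocessed_diff(old, new, gz_header, label="data.csv.gz")
print(diff_text)
```

Because only the header line survives preprocessing, row-level changes do not show up in this diff, which is the point of choosing a preprocessing step.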

Directory Diffs

When diffing DVC-tracked directories, DVX shows which files changed with their hashes:

$ dvx diff -R abc123 data/
test.parquet: c07bba3f... -> f46dd86f...
test.txt: e20b902b... -> 9306ec07...
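
A directory diff of this kind can be computed from two manifests without touching file contents. The sketch below assumes each manifest is a list of `{"relpath", "md5"}` entries, modeled on DVC's `.dir` JSON format. The `dir_diff` helper, the `(absent)` label, and the hashes are made up for illustration.

```python
def short(md5):
    """Abbreviate a hash for display; None means the file is missing on that side."""
    return f"{md5[:8]}..." if md5 else "(absent)"


def dir_diff(old_entries, new_entries):
    """Compare two directory manifests and describe each changed path."""
    old = {entry["relpath"]: entry["md5"] for entry in old_entries}
    new = {entry["relpath"]: entry["md5"] for entry in new_entries}
    lines = []
    for path in sorted(old.keys() | new.keys()):
        before, after = old.get(path), new.get(path)
        if before != after:  # unchanged files are skipped
            lines.append(f"{path}: {short(before)} -> {short(after)}")
    return lines


# Made-up hashes, for illustration only.
old_manifest = [
    {"relpath": "test.txt", "md5": "a" * 32},
    {"relpath": "same.txt", "md5": "c" * 32},
]
new_manifest = [
    {"relpath": "test.txt", "md5": "b" * 32},
    {"relpath": "same.txt", "md5": "c" * 32},
    {"relpath": "new.txt", "md5": "d" * 32},
]
changes = dir_diff(old_manifest, new_manifest)
print("\n".join(changes))
```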

Cache Introspection

Examine cached data without checking it out:

# Get cache path for a tracked file
dvx cache path data.parquet

# Get MD5 hash
dvx cache md5 data.parquet

# View cached file contents directly
dvx cat data.csv

# Works with files inside DVC-tracked directories
dvx cat data_dir/subset.parquet
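
For orientation, a content-addressed cache maps a hash to a file path. The sketch below assumes the DVC 3.x layout, where objects live under `files/md5/` and are split after the first two hex characters. Whether `dvx cache path` prints exactly this form is an assumption, and `cache_path` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> PurePosixPath:
    """Map an MD5 hex digest to its object path (DVC 3.x layout assumed)."""
    if len(md5) < 3:
        raise ValueError("expected a full MD5 hex digest")
    return PurePosixPath(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]


digest = hashlib.md5(b"").hexdigest()  # MD5 of empty content
print(cache_path(digest))
```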

Installation

pip install dvx

# With S3 support
pip install dvx[s3]

# With all remote backends
pip install dvx[all]

Usage

Running Pipelines

# Run all .dvc computations (parallel by default)
dvx run

# Run specific target
dvx run output.dvc

# Use 4 parallel workers
dvx run -j 4

# Dry-run to see execution plan
dvx run --dry-run

# Force re-run (ignore freshness)
dvx run --force

Tracking Data

# Initialize
dvx init

# Track files (parallel-safe, lock-free)
dvx add data/
dvx add model.pkl

# Track with dependencies (for provenance)
dvx add output.parquet --dep input.parquet --cmd "python process.py"

# Auto-add stale deps first (recursive)
dvx add -r output.parquet
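
The Performance section attributes the lock-free behavior of `dvx add` to atomic file writes. A common way to get that property is to write to a temporary file in the target directory and rename it into place. The sketch below shows that general pattern and is not DVX's actual code; `atomic_write_text` is a hypothetical helper.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach disk first
        os.replace(tmp_path, path)  # atomic rename, overwrites any existing file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Two processes writing different .dvc files never touch the same path, and two writing the same file leave one complete version in place, so no lock file is needed.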

Status and Diff

# Check freshness (data vs deps)
dvx status
dvx status -v          # also show fresh files
dvx status --yaml      # detailed YAML output with hashes
dvx status -j4 data/   # parallel checking

# Content diff
dvx diff data.parquet
dvx diff -r HEAD^..HEAD results/
dvx diff -s            # summary mode (files + hashes)

Push/Pull

# Configure remote
dvx remote add -d myremote s3://mybucket/dvc

# Push to remote
dvx push
dvx push --dry-run     # see what would be pushed

# Pull from remote
dvx pull
dvx pull --dry-run     # see what would be pulled

# Ref-specific operations
dvx pull -r HEAD~3     # pull data as of 3 commits ago

Python API

from dvx import Repo

# Initialize
repo = Repo.init()

# Or open existing
with Repo() as repo:
    repo.add("data/")
    repo.push()

    status = repo.status()
    diff = repo.diff("HEAD~1")

Commands

Command            Description
run                Execute computations from .dvc files (parallel)
add                Track file(s) with optional provenance
status             Show freshness of tracked files (data & deps)
diff               Content diff with preprocessing support
cache              Inspect cache (path, md5, dir)
cat                View cached file contents
push               Upload data to remote storage
pull               Download data from remote storage
fetch              Download to cache (no checkout)
checkout           Restore data files from cache
gc                 Garbage collect unused cache
init               Initialize a DVX repository
remote             Manage remotes
config             Configure settings
remove             Stop tracking file(s)
move               Move tracked file(s)
import             Import from another DVC repo
import-url         Import from a URL
get                Download without tracking
get-url            Download URL without tracking
shell-integration  Output shell aliases

What's Different from DVC

Added in DVX

  • dvx run - Parallel pipeline execution with per-file provenance
  • dvx cache path/md5 - Cache introspection
  • dvx cat - View cached files directly
  • dvx diff preprocessing - Pipe through commands before diffing
  • dvx status --yaml - Detailed status with hashes
  • Lock-free parallel add operations
  • Git blob batching for faster status checks

Not Included

DVX intentionally excludes DVC's heavier features:

  • Central pipelines (dvc.yaml, dvc repro, dvc dag)
  • Experiments (dvc exp, experiment tracking)
  • Metrics/Params/Plots (dvc metrics, dvc params, dvc plots)

If you need these features, use DVC directly.

Freshness Model

DVX tracks two types of freshness for each artifact:

  1. Data freshness: Does the actual data match the hash in the .dvc file?
  2. Dep freshness: Do recorded dependency hashes match the deps' .dvc files?

$ dvx status s3/output/
✗ s3/output/result.parquet.dvc (data changed (abc123... vs def456...))
s3/output/summary.json.dvc (dep changed: s3/input/data.parquet)
s3/output/metadata.json.dvc (up-to-date)
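
The two checks can be expressed in a few lines. This is a minimal sketch under stated assumptions: `Artifact` is a hypothetical in-memory view of one .dvc file, `check_freshness` is not a DVX function, and the order of the two hashes in the "data changed" message (recorded first, actual second) is a guess.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical in-memory view of one .dvc file."""
    path: str
    md5: str                                   # hash recorded under outs
    deps: dict = field(default_factory=dict)   # dep path -> md5 recorded at compute time


def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def check_freshness(artifact: Artifact, data_on_disk: bytes, tracked: dict) -> str:
    """Check data freshness first, then dep freshness."""
    actual = md5_of(data_on_disk)
    if actual != artifact.md5:
        return f"data changed ({artifact.md5[:6]}... vs {actual[:6]}...)"
    for dep_path, recorded in artifact.deps.items():
        # Compare against the hash in the dep's own .dvc file, not the dep's bytes.
        if tracked[dep_path].md5 != recorded:
            return f"dep changed: {dep_path}"
    return "up-to-date"


# input.parquet was re-added after output.parquet was computed from an older version.
inp = Artifact("input.parquet", md5_of(b"v2"))
out = Artifact("output.parquet", md5_of(b"out"), {"input.parquet": md5_of(b"v1")})
tracked = {"input.parquet": inp}
print(check_freshness(out, b"out", tracked))
```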

Provenance Tracking

When adding outputs with dependencies:

  • Deps must be fresh: dvx add errors if any dep's hash differs from its .dvc hash
  • Recursive add: Use dvx add -r to auto-add stale deps first
  • Accurate recording: Recorded dep hashes always match what was actually used
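
The "stale deps first" ordering amounts to a post-order walk of the dependency graph. The README does not describe DVX's actual traversal, so the sketch below is only one plausible reading. `add_order`, `deps_of`, and `is_stale` are hypothetical names.

```python
def add_order(target, deps_of, is_stale):
    """Return paths to add, deps before dependents, skipping fresh deps."""
    order, seen = [], set()

    def visit(path):
        if path in seen:
            return
        seen.add(path)
        for dep in deps_of.get(path, []):
            visit(dep)  # handle upstream artifacts before this one
        if path == target or is_stale(path):
            order.append(path)

    visit(target)
    return order


deps_of = {
    "output.parquet": ["mid.parquet"],
    "mid.parquet": ["input.parquet"],
}
stale = {"mid.parquet"}  # input.parquet already matches its .dvc file
plan = add_order("output.parquet", deps_of, stale.__contains__)
print(plan)
```

Adding in this order means each recorded dep hash is refreshed before anything downstream records it.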

Performance

DVX is optimized for large repos:

  • Mtime caching: SQLite-backed cache skips unchanged files
  • Batched git lookups: Single git ls-tree call for all blob SHAs
  • Lock-free adds: Parallel-safe via atomic file writes
  • Parallel status: Check many files concurrently with -j/--jobs
  • Parallel runs: Independent computations execute concurrently
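
To make the mtime-caching item concrete, here is a minimal sketch of a SQLite-backed hash cache that re-hashes a file only when its modification time or size changes. The table layout and the `MtimeHashCache` class are assumptions for illustration, not DVX's schema.

```python
import hashlib
import os
import sqlite3


class MtimeHashCache:
    """Cache MD5s keyed by path, reusing them while mtime and size are unchanged."""

    def __init__(self, db_path: str = ":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS hashes ("
            "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, md5 TEXT)"
        )
        self.computed = 0  # number of times a file was actually hashed

    def md5(self, path: str) -> str:
        stat = os.stat(path)
        key = os.path.abspath(path)
        row = self.db.execute(
            "SELECT mtime_ns, size, md5 FROM hashes WHERE path = ?", (key,)
        ).fetchone()
        if row and row[0] == stat.st_mtime_ns and row[1] == stat.st_size:
            return row[2]  # cache hit: skip reading the file
        with open(path, "rb") as handle:
            digest = hashlib.md5(handle.read()).hexdigest()
        self.computed += 1
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
                (key, stat.st_mtime_ns, stat.st_size, digest),
            )
        return digest
```

A second call on an unchanged file costs one `stat` and one indexed lookup instead of a full read, which is where the savings on large repos would come from.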

Compatibility

  • DVX uses .dvc files - fully compatible with DVC
  • DVX repos are DVC repos - you can use dvc commands too
  • DVC plugins (dvc-s3, dvc-gs, etc.) work with DVX

License

Apache 2.0

