Minimal data version control - a lightweight wrapper around DVC

DVX - Minimal Data Version Control

DVX is a lightweight wrapper around DVC that provides only the core data versioning functionality, without pipelines, experiments, metrics, params, or plots.

Why DVX?

DVC is a powerful tool, but its feature set has grown significantly. If you only need to:

  • Track large files with .dvc files
  • Push/pull data to remote storage (S3, GCS, etc.)
  • Version data alongside your code

...then DVX gives you exactly that, with a simpler interface and smaller surface area.

Installation

pip install dvx

# With S3 support
pip install dvx[s3]

# With all remote backends
pip install dvx[all]

Usage

CLI

# Initialize
dvx init

# Track files (parallel-safe, lock-free)
dvx add data/
dvx add model.pkl
dvx add -r output.parquet  # auto-add stale deps first

# Configure remote
dvx remote add -d myremote s3://mybucket/dvc

# Push to remote
dvx push

# Pull from remote
dvx pull

# Check status (shows data vs dep freshness)
dvx status
dvx status -v          # also show fresh files
dvx status --json      # JSON output
dvx status -j4 data/   # parallel checking

Python API

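from dvx import Repo

# Initialize a new repository in the current directory
repo = Repo.init()

# Or open an existing one; the context manager closes it on exit
with Repo() as repo:
    repo.add("data/")   # track a directory
    repo.push()         # upload tracked data to the default remote

    status = repo.status()       # freshness of tracked files
    diff = repo.diff("HEAD~1")   # changes relative to the previous commit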

Commands

DVX exposes these DVC commands:

Command       Description
init          Initialize a DVX/DVC repository
add           Track file(s) with DVX
push          Upload data to remote storage
pull          Download data from remote storage
fetch         Download data to cache (no checkout)
checkout      Restore data files from cache
status        Show freshness of tracked files (data & deps)
diff          Show changes between revisions
gc            Garbage collect unused cache
remove        Stop tracking file(s)
move          Move tracked file(s)
import        Import from another DVC repo
import-url    Import from a URL
get           Download without tracking
get-url       Download URL without tracking
config        Configure settings (delegates to DVC)
remote        Manage remotes (delegates to DVC)
cache         Manage cache (delegates to DVC)

What's NOT included

DVX intentionally excludes:

  • Pipelines (dvc.yaml, dvc run, dvc repro, dvc dag)
  • Experiments (dvc exp, experiment tracking)
  • Metrics (dvc metrics)
  • Params (dvc params)
  • Plots (dvc plots)
  • Stages (dvc stage)

If you need these features, use DVC directly.

Freshness Model

DVX tracks two types of freshness for each artifact:

  1. Data freshness: Does the actual data match the hash in the .dvc file?
  2. Dep freshness: Do recorded dependency hashes match the deps' .dvc files?

This mirrors Git's model: each .dvc file declares only what it expects, and freshness is not transitive. If a dependency's data differs from that dependency's own .dvc file, that is a separate issue, reported for the dependency itself.

$ dvx status s3/output/
✗ s3/output/result.parquet.dvc (data changed (abc123... vs def456...))
s3/output/summary.json.dvc (dep changed: s3/input/data.parquet)
s3/output/metadata.json.dvc (up-to-date)
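
The two checks can be modeled in a few lines of Python. This is an illustrative sketch only, not DVX's implementation: the `records` dict is a hypothetical in-memory stand-in for .dvc files, and `file_md5`, `check_freshness`, and `demo` are names invented for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file's bytes (MD5 is used here only for illustration)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def check_freshness(name: str, records: dict, root: Path) -> dict:
    """Return the two independent freshness results for one artifact.

    `records` is a hypothetical stand-in for .dvc files:
    {artifact: {"hash": str, "deps": {dep_name: recorded_dep_hash}}}
    """
    rec = records[name]
    # Data freshness: does the file on disk match the artifact's own record?
    data_fresh = file_md5(root / name) == rec["hash"]
    # Dep freshness: do the recorded dep hashes match each dep's own record?
    # This compares record to record; the dep's file on disk is never read.
    stale_deps = [d for d, h in rec["deps"].items() if records[d]["hash"] != h]
    return {"data_fresh": data_fresh, "stale_deps": stale_deps}


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "input.parquet").write_bytes(b"v1")
        (root / "output.parquet").write_bytes(b"derived from v1")
        in_hash = file_md5(root / "input.parquet")
        records = {
            "input.parquet": {"hash": in_hash, "deps": {}},
            "output.parquet": {
                "hash": file_md5(root / "output.parquet"),
                "deps": {"input.parquet": in_hash},
            },
        }
        # Change input's bytes without updating its record.
        (root / "input.parquet").write_bytes(b"v2")
        result = {
            "input": check_freshness("input.parquet", records, root),
            "output": check_freshness("output.parquet", records, root),
        }
        # Re-record input: only now does output see a changed dep.
        records["input.parquet"]["hash"] = file_md5(root / "input.parquet")
        result["output_after_readd"] = check_freshness(
            "output.parquet", records, root
        )
        return result
```

In `demo()`, editing the input file makes only the input's data stale. The output reports a changed dep only once the input's record is updated, which is the non-transitive behavior described above.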

Provenance Tracking

When adding an output with deps, DVX ensures accurate provenance:

  • Deps must be fresh: dvx add errors if any dep's file hash differs from its .dvc hash
  • Recursive add: Use dvx add -r to auto-add stale deps first (depth-first)
  • Accurate recording: Recorded dep hashes always match what was actually used

$ dvx add output.parquet
Error: Cannot add output.parquet: 1 stale dep(s):
  input.parquet: .dvc=abc123... file=def456...
Run `dvx add` on deps first, or use --recursive

$ dvx add -r output.parquet  # adds input.parquet first, then output.parquet
Added input.parquet (def456...)
Added output.parquet (xyz789...)
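
The rules above can be sketched as a small in-memory model. This is a hedged illustration, not DVX's code: `add`, `StaleDepError`, the `deps` and `records` dicts, and the `current_hash` callback are all hypothetical names, and the dependency graph is assumed to be acyclic.

```python
from typing import Callable, Dict, List


class StaleDepError(Exception):
    """Raised when an output is added while one of its deps is stale."""


def add(
    name: str,
    deps: Dict[str, List[str]],       # artifact -> names of its deps
    records: Dict[str, dict],         # artifact -> {"hash", "deps"} record
    current_hash: Callable[[str], str],  # hash of the file as it is on disk
    recursive: bool = False,
) -> List[str]:
    """Record `name`, returning the artifacts added, in order."""
    added: List[str] = []
    _add(name, deps, records, current_hash, recursive, added)
    return added


def _add(name, deps, records, current_hash, recursive, added) -> None:
    my_deps = deps.get(name, [])
    # A dep is stale if it has no record or its file differs from its record.
    stale = [
        d for d in my_deps
        if d not in records or records[d]["hash"] != current_hash(d)
    ]
    if stale and not recursive:
        raise StaleDepError(
            f"Cannot add {name}: {len(stale)} stale dep(s): {', '.join(stale)}"
        )
    # Depth-first: stale deps are re-recorded before the output itself.
    for d in stale:
        _add(d, deps, records, current_hash, recursive, added)
    # Every dep is now fresh, so the recorded dep hashes match what was used.
    records[name] = {
        "hash": current_hash(name),
        "deps": {d: records[d]["hash"] for d in my_deps},
    }
    added.append(name)
```

With the hashes from the transcript above (input recorded as `abc123`, on disk as `def456`), a plain `add` of `output.parquet` raises `StaleDepError`, while `recursive=True` records `input.parquet` first and then `output.parquet`.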

Performance

DVX is optimized for large repos:

  • Mtime caching: Skips hash computation when file mtime unchanged (SQLite-backed)
  • Batched git lookups: Uses git ls-tree -r for all blob SHAs in one call
  • Lock-free adds: Parallel-safe cache operations via atomic file writes
  • Parallel status: Check many files concurrently with -j/--jobs
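
As a rough illustration of the mtime-caching idea, the sketch below stores each file's mtime and hash in SQLite and rehashes only when the stored metadata no longer matches. It is not DVX's implementation: the table layout and the names `open_cache` and `cached_md5` are assumptions made for this example, and the extra size comparison is an addition beyond the mtime check described above.

```python
import hashlib
import os
import sqlite3
from typing import Tuple


def open_cache(db_path: str) -> sqlite3.Connection:
    """Open (or create) the hypothetical hash-cache database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, md5 TEXT)"
    )
    return conn


def cached_md5(conn: sqlite3.Connection, path: str) -> Tuple[str, bool]:
    """Return (md5, cache_hit) for `path`, hashing only on a cache miss."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT mtime_ns, size, md5 FROM hashes WHERE path = ?", (path,)
    ).fetchone()
    # Cache hit: the stored mtime and size still match, so skip reading the file.
    if row is not None and row[0] == st.st_mtime_ns and row[1] == st.st_size:
        return row[2], True
    # Cache miss: hash the file and store the new metadata.
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
        (path, st.st_mtime_ns, st.st_size, digest),
    )
    conn.commit()
    return digest, False
```

Calling `cached_md5` twice on an unchanged file hashes it once; the second call is answered from the database. The trade-off is that a change preserving both mtime and size would go unnoticed.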

Compatibility

  • DVX uses .dvc files - fully compatible with DVC
  • DVX repos are DVC repos - you can use dvc commands too
  • DVC plugins (dvc-s3, dvc-gs, etc.) work with DVX

License

Apache 2.0
