dvc-to-lakefs

Zero-copy import of DVC-tracked data into lakeFS. Objects are linked by reference from the DVC remote.

Requirements:

  • Python 3.10+
  • a Git-backed DVC repo (dvc init --no-scm is not supported)
  • a configured DVC remote with data already pushed (dvc push)
  • a running lakeFS instance with a repository created on the same blockstore as the DVC remote (e.g. both on S3); if the blockstores differ, the import fails with an error
  • lakeFS credentials set up (reads from ~/.lakectl.yaml by default, or LAKECTL_CONFIG_FILE)
  • read access to the DVC remote storage for both DVC and the lakeFS server
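
The local requirements above can be checked before running an import. The following is only a sketch: preflight is a hypothetical helper, not part of dvc-to-lakefs, and it covers just the checks that need no network access (Python version, Git and DVC directories, lakectl config file). It does not verify that data has been pushed or that lakeFS is reachable.

```python
import sys
from pathlib import Path


def preflight(repo, env, home):
    """Return a list of problems; an empty list means the local checks pass.

    Hypothetical helper for illustration, not part of dvc-to-lakefs.
    """
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    # A Git-backed DVC repo has both a .git entry and a .dvc directory at its root.
    if not (repo / ".git").exists():
        problems.append("not a Git repository (dvc init --no-scm is not supported)")
    if not (repo / ".dvc").is_dir():
        problems.append("not a DVC repository")
    # Credentials: LAKECTL_CONFIG_FILE if set, otherwise ~/.lakectl.yaml.
    config = Path(env.get("LAKECTL_CONFIG_FILE") or home / ".lakectl.yaml")
    if not config.is_file():
        problems.append(f"lakectl config not found at {config}")
    return problems
```

A call such as preflight(Path.cwd(), dict(os.environ), Path.home()) (with os imported) would run the checks against the current directory.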

Installation

pip install dvc-to-lakefs

Usage

lakectl import-from-dvc dvc-repo lakefs://<repo-name>

Note: You can also invoke the tool directly:

lakectl-import-from-dvc dvc-repo lakefs://<repo>  # or,
python -m dvc_to_lakefs dvc-repo lakefs://<repo>

The tool reads the HEAD of the current Git branch and imports all tracked DVC outputs into a lakeFS branch of the same name.

  • The lakeFS branch is created from the default branch if it doesn't exist.
  • A new commit is created on the branch with the imported files. The commit message matches the Git commit message, and the commit includes a git_sha metadata field with the corresponding Git SHA.
  • Existing files at the same path are overwritten; all other files on the branch are left untouched.
  • Re-running the import never deletes files. Removing a .dvc file from the DVC repo and re-importing will not remove the corresponding files from lakeFS.
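
The overwrite and no-delete behavior described above can be modeled as a dictionary update. This is an illustrative sketch, not the tool's implementation: apply_import and commit_info are hypothetical names, and a branch is reduced to a mapping from path to an opaque object reference.

```python
def apply_import(branch_files, imported):
    """Model the branch contents (path -> object reference) after an import."""
    result = dict(branch_files)  # paths not in the import are left untouched
    result.update(imported)      # files at the same path are overwritten
    return result                # nothing is removed, even on a re-run


def commit_info(git_message, git_sha):
    """Model the lakeFS commit: same message as Git, plus a git_sha metadata field."""
    return {"message": git_message, "metadata": {"git_sha": git_sha}}


branch = {"data/a.csv": "ref-1", "notes.txt": "ref-2"}

# First import: a.csv is overwritten, b.csv is added, notes.txt is untouched.
branch = apply_import(branch, {"data/a.csv": "ref-3", "data/b.csv": "ref-4"})

# Second import after b.csv.dvc was removed from the DVC repo: b.csv stays in lakeFS.
branch = apply_import(branch, {"data/a.csv": "ref-3"})
```

Under this model, removing a file from lakeFS after it was dropped from DVC remains a separate, manual step.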

Use --dry-run to preview what would be imported:

lakectl import-from-dvc ./myrepo lakefs://myrepo --dry-run

Options

Flag                    Description
-r, --remote <name>     DVC remote to use (default: the repo's default remote)
--branch <branch>       Git branch to export; repeat to export multiple branches (default: current branch)
--dry-run               Preview the import plan without writing anything to lakeFS
--skip-broken-stages    Skip unreadable dvc.yaml/dvc.lock/.dvc files and export the rest of the repo
--skip-broken-revs      When exporting multiple branches, skip any branch that fails instead of aborting
--show-files            Expand directory outputs to list every file instead of a single summary line

Examples

# export two branches
lakectl import-from-dvc ./myrepo lakefs://myrepo --branch main --branch dev

# use a specific remote
lakectl import-from-dvc ./myrepo lakefs://myrepo --remote staging

# skip branches that fail
lakectl import-from-dvc ./myrepo lakefs://myrepo --branch main --branch dev --skip-broken-revs
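
When the import is driven from a script, the command line can be assembled programmatically. The sketch below uses only the flags documented above; build_command is a hypothetical helper, not an API provided by the package.

```python
import shlex


def build_command(repo, target, branches=(), remote=None, dry_run=False,
                  skip_broken_stages=False, skip_broken_revs=False,
                  show_files=False):
    """Build the argv list for a lakectl import-from-dvc invocation."""
    cmd = ["lakectl", "import-from-dvc", repo, target]
    if remote:
        cmd += ["--remote", remote]
    for branch in branches:  # --branch may be repeated, once per branch
        cmd += ["--branch", branch]
    if dry_run:
        cmd.append("--dry-run")
    if skip_broken_stages:
        cmd.append("--skip-broken-stages")
    if skip_broken_revs:
        cmd.append("--skip-broken-revs")
    if show_files:
        cmd.append("--show-files")
    return cmd


cmd = build_command("./myrepo", "lakefs://myrepo",
                    branches=["main", "dev"], skip_broken_revs=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True); running it with dry_run=True first previews the plan without writing to lakeFS.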

Unsupported outputs

The following outputs are not supported and will be skipped (reported under "Skipped" in the output):

  • no hash info (stage was never run)
  • directory output with missing or corrupted cache (run dvc push to fix)
  • cache: false or push: false
  • dvc import and dvc import-url stages
  • external outputs or paths outside the repository
  • per-output remote: override in dvc.yaml
  • cloud-versioned outputs (pushed to a version_aware = true remote)
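
The skip rules above can be read as a classifier over each output. The sketch below is a simplification under stated assumptions: the record fields (path, md5, cache, push, remote, imported, cloud_versioned) are hypothetical and only loosely mirror DVC output entries, the order of the checks is arbitrary, and the directory-cache check is omitted because it requires access to the remote. It is not the tool's actual logic.

```python
from pathlib import PurePosixPath


def skip_reason(out):
    """Return why an output would be skipped, or None if it looks importable.

    `out` is a hypothetical, simplified record; see the assumptions above.
    """
    path = out.get("path", "")
    if out.get("imported"):  # assumed flag for dvc import / dvc import-url stages
        return "dvc import or dvc import-url stage"
    if out.get("cache") is False or out.get("push") is False:
        return "cache: false or push: false"
    if out.get("remote"):
        return "per-output remote override"
    if out.get("cloud_versioned"):  # assumed flag for version_aware remotes
        return "cloud-versioned output"
    # External outputs (URLs) and paths that escape the repository root.
    parts = PurePosixPath(path)
    if "://" in path or parts.is_absolute() or ".." in parts.parts:
        return "external output or path outside the repository"
    if not out.get("md5"):
        return "no hash info (stage was never run)"
    return None
```

For example, skip_reason({"path": "data/a.csv", "md5": "abc"}) returns None, while the same record without the md5 field is reported as having no hash info.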

Supported remotes

S3, GCS, Azure Blob Storage, and local filesystem (Linux/macOS only).

Not supported: worktree remotes and version-aware remotes (version_aware = true). Repositories initialized with dvc init --no-scm are also not supported.
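
A remote URL can be mapped to one of the supported backends by its scheme. This sketch assumes the usual DVC URL schemes (s3://, gs://, azure://, and a plain path for local remotes); remote_backend is a hypothetical helper, and it cannot detect worktree or version-aware remotes, since those are configuration options rather than part of the URL.

```python
from urllib.parse import urlparse

# Assumed mapping from DVC remote URL scheme to the backends listed above.
SCHEME_TO_BACKEND = {
    "s3": "S3",
    "gs": "GCS",
    "azure": "Azure Blob Storage",
    "": "local filesystem",  # plain paths such as /mnt/dvc-store
}


def remote_backend(url):
    """Return the backend name for a DVC remote URL, or None if unsupported."""
    scheme = urlparse(url).scheme.lower()
    return SCHEME_TO_BACKEND.get(scheme)
```

A result of None (for example, for an ssh:// remote) indicates a backend outside the supported list.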

Limitations

  • Only HEAD of the Git branch is exported. Git history is not replayed.
  • Uncommitted and staged changes are ignored; only committed state is exported.

Contributing

Contributions are welcome! This project uses uv for environment and dependency management.

Setup

First, fork the repository on GitHub and clone your fork:

git clone https://github.com/<your-username>/dvc-to-lakefs
cd dvc-to-lakefs

# create the virtualenv and install all dev dependencies
uv sync --group dev

# install the pre-commit hooks
uv run prek install

Code is formatted and linted with ruff and type checked with mypy in strict mode. Both run automatically via pre-commit.

Tests

uv run pytest                       # unit + e2e tests (runs in parallel via pytest-xdist)
uv run pytest tests/unit            # unit tests only

The e2e tests spin up local backends and a lakeFS instance (the lakeFS binary is downloaded automatically on first run), so they may take longer than the unit tests. The S3 backend runs in-process via moto, but the Azure and GCS backends need Docker (they start Azurite and fake-gcs-server containers).

Opening a PR

  • Sign the lakeFS CLA (individual or corporate) when opening your first pull request.

  • Work on a feature branch created from main, not directly on your fork's main.

  • Keep each PR focused on a single change.

  • Run the same checks as CI before requesting review:

    uv run prek run --all-files   # ruff lint + format, mypy, and assorted checks
    uv run pytest                 # full test suite
    

License

Licensed under the Apache License 2.0.
