Zero-copy import of DVC-tracked data into lakeFS
dvc-to-lakefs
Zero-copy import of DVC-tracked data into lakeFS. Objects are linked by reference from the DVC remote.
Requirements:
- Python 3.10+
- a Git-backed DVC repo (dvc init --no-scm is not supported)
- a configured DVC remote with data already pushed (dvc push)
- a running lakeFS instance with a repository created on the same blockstore as the DVC remote (e.g. both on S3); a mismatch results in an error
- lakeFS credentials set up (read from ~/.lakectl.yaml by default, or from the file named by LAKECTL_CONFIG_FILE)
- read access to the DVC remote storage, configured both for DVC and for the lakeFS server
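The credentials lookup described in the requirements can be sketched as a small helper. This is only an illustration: lakectl_config_path is a hypothetical name, not part of dvc-to-lakefs, and the sketch assumes that LAKECTL_CONFIG_FILE, when set, takes precedence over the default location.

```python
import os
from pathlib import Path


def lakectl_config_path(env=None):
    """Return the lakectl config file this sketch would read credentials from."""
    # Use the real process environment unless a mapping is passed in (handy for tests).
    env = os.environ if env is None else env
    # Assumption: a non-empty LAKECTL_CONFIG_FILE overrides the default location.
    override = env.get("LAKECTL_CONFIG_FILE")
    if override:
        return Path(override)
    # Default location named in the requirements above.
    return Path.home() / ".lakectl.yaml"
```

Passing an explicit mapping, for example lakectl_config_path({"LAKECTL_CONFIG_FILE": "/tmp/custom.yaml"}), makes it easy to see which file would be used without touching the real environment.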
Installation
pip install dvc-to-lakefs
Usage
lakectl import-from-dvc dvc-repo lakefs://<repo-name>
Note: you can also invoke the tool directly:
lakectl-import-from-dvc dvc-repo lakefs://<repo>
# or
python -m dvc_to_lakefs dvc-repo lakefs://<repo>
Reads the HEAD of the current Git branch and imports all tracked DVC outputs into a lakeFS branch of the same name.
- The lakeFS branch is created from the default branch if it doesn't exist.
- A new commit is created on the branch with the imported files. The commit message matches the Git commit message, and the commit includes a git_sha metadata field with the corresponding Git SHA.
- Existing files at the same path are overwritten; all other files on the branch are left untouched.
- Re-running the import never deletes files. Removing a .dvc file from DVC and re-importing will not remove it from lakeFS.
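The overwrite-but-never-delete behavior can be modeled as a dictionary merge. The sketch below is a toy model, not the tool's implementation: apply_import is a hypothetical function, and the object addresses are made-up placeholders.

```python
def apply_import(branch_files, imported, git_sha, git_message):
    """Model one import run over {path: object address} mappings."""
    # Start from everything already on the branch: an import never deletes.
    merged = dict(branch_files)
    # Imported paths overwrite existing entries at the same path.
    merged.update(imported)
    # The lakeFS commit mirrors the Git commit message and records its SHA.
    commit = {"message": git_message, "metadata": {"git_sha": git_sha}}
    return merged, commit


# Hypothetical branch state and two consecutive imports.
branch = {"data/a.csv": "addr-old", "notes.txt": "addr-notes"}
after_first, commit = apply_import(
    branch, {"data/a.csv": "addr-new", "data/b.csv": "addr-b"}, "abc123", "add b.csv"
)
# Second run: data/b.csv is no longer tracked by DVC, yet it stays on the branch.
after_second, _ = apply_import(after_first, {"data/a.csv": "addr-new"}, "def456", "drop b.csv")
```

In this model, removing a file from lakeFS after it leaves DVC tracking would have to be a separate, explicit step.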
Use --dry-run to preview what would be imported:
lakectl import-from-dvc ./myrepo lakefs://myrepo --dry-run
Options
| Flag | Description |
|---|---|
| -r, --remote name | DVC remote to use (default: the repo's default remote) |
| --branch branch | Git branch to export; repeat to export multiple branches (default: current branch) |
| --dry-run | Preview the import plan without writing anything to lakeFS |
| --skip-broken-stages | Skip unreadable dvc.yaml/dvc.lock/.dvc files and export the rest of the repo |
| --skip-broken-revs | When exporting multiple branches, skip any branch that fails instead of aborting |
| --show-files | Expand directory outputs to list every file instead of a single summary line |
Examples
# export two branches
lakectl import-from-dvc ./myrepo lakefs://myrepo --branch main --branch dev
# use a specific remote
lakectl import-from-dvc ./myrepo lakefs://myrepo --remote staging
# skip branches that fail
lakectl import-from-dvc ./myrepo lakefs://myrepo --branch main --branch dev --skip-broken-revs
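For scripted use, the same invocations can be assembled from Python. The sketch below is only an illustration: build_import_command and run_import are hypothetical helper names, not part of dvc-to-lakefs, and the sketch uses only flags listed under Options.

```python
import subprocess


def build_import_command(dvc_repo, lakefs_uri, branches=(), remote=None,
                         dry_run=False, skip_broken_revs=False):
    """Build the argv list for one lakectl import-from-dvc invocation."""
    cmd = ["lakectl", "import-from-dvc", dvc_repo, lakefs_uri]
    if remote:
        cmd += ["--remote", remote]
    # --branch may be repeated to export several branches.
    for branch in branches:
        cmd += ["--branch", branch]
    if dry_run:
        cmd.append("--dry-run")
    if skip_broken_revs:
        cmd.append("--skip-broken-revs")
    return cmd


def run_import(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


# Preview an import of two branches without writing anything to lakeFS.
preview = build_import_command(
    "./myrepo", "lakefs://myrepo", branches=["main", "dev"], dry_run=True
)
```

Building the argument list separately from running it keeps the command easy to log or inspect before anything touches lakeFS.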
Unsupported outputs
The following outputs are not supported and will be skipped (reported under "Skipped" in the output):
- no hash info (stage was never run)
- directory output with missing or corrupted cache (run dvc push to fix)
- cache: false or push: false
- dvc import and dvc import-url stages
- external outputs or paths outside the repository
- per-output remote: override in dvc.yaml
- cloud-versioned outputs (pushed to a version_aware = true remote)
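Some of these skip rules can be pictured as a classifier over a single output entry. The sketch below is a simplified illustration, not the tool's logic: skip_reason is a hypothetical function, the dictionary keys (path, md5, cache, push, remote, version_id) are assumed stand-ins for whatever the tool actually inspects, and the import-stage and corrupted-cache cases are left out.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse


def skip_reason(out: dict) -> Optional[str]:
    """Return why an output would be skipped, or None if it looks importable."""
    path = out.get("path", "")
    posix = PurePosixPath(path)
    # URLs, absolute paths, and paths that climb out of the repo are external.
    if urlparse(path).scheme or posix.is_absolute() or ".." in posix.parts:
        return "external output or path outside the repository"
    if out.get("cache") is False or out.get("push") is False:
        return "cache: false or push: false"
    if out.get("remote"):
        return "per-output remote override"
    if out.get("version_id"):
        return "cloud-versioned output"
    if not out.get("md5"):
        return "no hash info"
    return None
```

For example, skip_reason({"path": "data/train.csv", "md5": "abc"}) returns None, while an entry with no md5 is reported as having no hash info.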
Supported remotes
S3, GCS, Azure Blob Storage, and local filesystem (Linux/macOS only).
Not supported: worktree remotes, version-aware remotes (version_aware = true), and repos created with dvc init --no-scm.
Limitations
- Only HEAD of the Git branch is exported. Git history is not replayed.
- Uncommitted and staged changes are ignored; only committed state is exported.
Contributing
Contributions are welcome! This project uses uv for environment and dependency management.
Setup
First, fork the repository on GitHub and clone your fork:
git clone https://github.com/<your-username>/dvc-to-lakefs
cd dvc-to-lakefs
# create the virtualenv and install all dev dependencies
uv sync --group dev
# install the pre-commit hooks
uv run prek install
Code is formatted and linted with ruff and type checked with mypy in strict mode. Both run automatically via pre-commit.
Tests
uv run pytest # unit + e2e tests (runs in parallel via pytest-xdist)
uv run pytest tests/unit # unit tests only
The e2e tests spin up local backends and a lakeFS instance (the lakeFS binary is downloaded automatically on first run), so they may take longer than the unit tests. The S3 backend runs in-process via moto, but the Azure and GCS backends need Docker (they start Azurite and fake-gcs-server containers).
Opening a PR
- Sign the lakeFS CLA (individual or corporate) when opening your first pull request.
- Work on a branch off main, not your fork's main.
- Keep each PR focused on a single change.
- Run the same checks as CI before requesting review:
uv run prek run --all-files # ruff lint + format, mypy, and assorted checks
uv run pytest # full test suite
License
Licensed under the Apache License 2.0.