API-first embedded versioned storage for local ML artifacts

hubvault

hubvault is an API-first, embedded, portable versioned repository for local ML artifacts such as model weights, datasets, evaluation outputs, and experiment bundles.

It gives you Hugging Face style file APIs and Git-like commit / branch / tag / merge semantics, while the repository itself remains a single movable local directory. There is no remote service requirement and no repo-external database to operate.

Quick Start

Install from PyPI:

pip install hubvault

Create a local repository, commit files, read them back, and materialize detached download views:

from pathlib import Path

from hubvault import CommitOperationAdd, HubVaultApi

repo_dir = Path("demo-repo")
api = HubVaultApi(repo_dir)

info = api.create_repo()
print(info.default_branch)  # main

api.upload_file(
    path_or_fileobj=b"weights-v1",
    path_in_repo="artifacts/model.safetensors",
    commit_message="add model weights",
)

api.create_commit(
    operations=[CommitOperationAdd("README.md", b"# Demo repo\n")],
    commit_message="add readme",
)

print(api.list_repo_files())
print(api.read_bytes("README.md").decode("utf-8").strip())

download_path = api.hf_hub_download("artifacts/model.safetensors")
print(Path(download_path).as_posix().endswith("artifacts/model.safetensors"))

snapshot_dir = api.snapshot_download()
print(snapshot_dir)

print(api.quick_verify().ok)

Use the CLI when you want a git-like shell workflow without a Git workspace:

hubvault init demo-repo
printf 'weights-v1' > model.bin

hubvault -C demo-repo commit -m "add weights" --add artifacts/model.bin=./model.bin
hubvault -C demo-repo ls-tree
hubvault -C demo-repo download artifacts/model.bin
hubvault -C demo-repo verify

hubvault and hv point to the same CLI entry point. Current commands include init, commit, branch, tag, merge, log, ls-tree, download, snapshot, verify, reset, and status.

What hubvault Is For

hubvault is for users who need a durable repository for deep learning artifacts without first operating heavyweight infrastructure. It lets you keep large model weights, datasets, evaluation outputs, and experiment bundles in a local repository that still behaves like a normal movable directory.

The strongest fit is environments where a hosted Hub, a Docker or Kubernetes stack, or an external object storage service such as OSS or S3 would add too much operational cost, would not work offline, or would be constrained by free-tier resource limits. hubvault gives you a repo-local alternative: no server process, no cluster, no repo-external metadata database, and no object-store bucket are required.

It is especially useful when you need:

- persistent maintenance of large deep-learning artifacts across many generations
- explicit commits, refs, rollback, and verification instead of an ad-hoc cache directory
- atomic repository mutations where interrupted writes roll back rather than leaving a half-published state
- stable committed data with detached read paths, so downloaded files cannot silently mutate repository truth
- customizable resource release through get_storage_overview(), gc(), and squash_history()
- Hugging Face style file operations on top of a local embedded repository
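
The atomic-mutation point in the list above corresponds to a common pattern: stage new bytes in a temporary file, then publish them with a single rename. The sketch below uses only the Python standard library. `atomic_write` is a hypothetical helper written for illustration, not hubvault's implementation, and the real transaction mechanism may differ.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stage in the same directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".staging-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination; on POSIX the rename is atomic.
        os.replace(tmp_name, target)
    except BaseException:
        # An interrupted write leaves the previous target untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    path = Path(root) / "artifacts" / "model.bin"
    atomic_write(path, b"weights-v1")
    atomic_write(path, b"weights-v2")
    print(path.read_bytes())
```

The staging file never shares a name with the published file, so a crash between the write and the rename leaves at most a stray `.staging-*` file rather than a truncated artifact.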

hubvault is not:

- a hosted Hub service
- a Git remote / PR / review platform
- a Git workspace or staging-area replacement
- a writable cache that returns raw repository-truth file paths

Performance Snapshot

The numbers below are a benchmark snapshot taken on a Linux x86_64 machine running CPython 3.10.10. They are absolute measured throughput values, shown next to local filesystem sequential read/write baselines from the same run.

Treat these as a concrete reference point, not as a universal guarantee. Warm-cache rows can exceed the raw disk-read baseline because they mostly measure detached-view reuse and cache hits rather than physical disk reads.

Byte-Oriented Workloads

| Workload | Benchmark profile | Measured throughput | Same-run disk baseline | Approx. ratio |
| --- | --- | --- | --- | --- |
| Local filesystem sequential read | standard | `9296.92 MiB/s` | read baseline | `100.00%` |
| Local filesystem sequential write | standard | `360.61 MiB/s` | write baseline | `100.00%` |
| Large file upload | standard | `230.69 MiB/s` | write `360.61 MiB/s` | `63.97%` |
| Large range read | standard | `1113.59 MiB/s` | read `9296.92 MiB/s` | `11.98%` |
| Cold file download | standard | `846.98 MiB/s` | read `9296.92 MiB/s` | `9.11%` |
| Warm file download | standard | `13761.47 MiB/s` | read `9296.92 MiB/s` | `148.02%` |
| Cache-heavy warm download | standard | `19704.43 MiB/s` | read `9296.92 MiB/s` | `211.95%` |
| Large file upload | pressure | `332.13 MiB/s` | write `360.22 MiB/s` | `92.20%` |
| Large range read | pressure | `910.23 MiB/s` | read `9532.68 MiB/s` | `9.55%` |
| Cold file download | pressure | `422.80 MiB/s` | read `9532.68 MiB/s` | `4.44%` |
| Warm file download | pressure | `637608.97 MiB/s` | read `9532.68 MiB/s` | cache/view hit |
| Cache-heavy warm download | pressure | `39457.46 MiB/s` | read `9532.68 MiB/s` | `413.92%` |
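
The "Approx. ratio" column is measured throughput divided by the same-run baseline for the matching direction (write for uploads, read for downloads). A minimal sketch of that arithmetic, using figures copied from the table above; `ratio_percent` is an illustrative helper name, not part of hubvault:

```python
def ratio_percent(measured_mib_s: float, baseline_mib_s: float) -> float:
    """Return measured throughput as a percentage of the baseline."""
    return round(measured_mib_s / baseline_mib_s * 100.0, 2)


# Standard-profile rows from the table above.
print(ratio_percent(230.69, 360.61))     # large file upload vs. write baseline
print(ratio_percent(846.98, 9296.92))    # cold file download vs. read baseline
print(ratio_percent(13761.47, 9296.92))  # warm file download vs. read baseline
```

A ratio above 100% on the warm rows does not mean the disk was outrun; it means most of the bytes never came from disk in that run.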

Metadata and Maintenance Workloads

These workloads are not pure byte-stream reads or writes, so comparing them directly to raw disk bandwidth is misleading. They are included because they are the operations that usually make a versioned artifact repository feel fast or slow once history grows.

| Workload | Public API surface | Measured result | Wall time |
| --- | --- | --- | --- |
| Deep history listing | `list_repo_commits` / `list_repo_refs` / `list_repo_reflog` | `15221.94 ops/s` | `4.40 s` |
| Recursive nested tree listing | `list_repo_tree(recursive=True)` | `31185.03 ops/s` | `0.88 s` |
| Heavy non-fast-forward merge | `merge` | `126.65 MiB/s` | `0.43 s` |
| Squash history with follow-up cleanup | `squash_history` | `146.83 MiB/s` | `1.48 s` |
| Chunk threshold sweep | `upload_file` + `get_paths_info` | `74.20 MiB/s` | `0.27 s` |
| Small-file read-all path | `read_bytes` | `5.76 MiB/s`, `1473.64 ops/s` | `0.91 s` |

The practical reading is straightforward: large uploads reach roughly 64% (standard) to 92% (pressure) of the measured write baseline, range reads and cold downloads are real byte-moving workloads with non-trivial repository overhead (about 4% to 12% of the read baseline), and warm downloads are cache/view-hit paths. The clearest remaining performance work is small-file hot reads and warm-path metadata short-circuiting.

What You Get Today

- repository metadata, refs, reflog, transaction state, chunk visibility, and object metadata live in repo-root metadata.sqlite3
- payload bytes remain as ordinary filesystem data:
  - objects/blobs/*.data
  - chunks/packs/*.pack
- repository-wide public concurrency is serialized by locks/repo.lock
- read APIs return detached user views rather than writable aliases of repository truth
- quick_verify(), full_verify(), gc(), squash_history(), and get_storage_overview() are available as public maintenance APIs

In practice, you get a repo-local metadata database with filesystem-managed payload storage. You do not need to operate the database directly; the public API stays focused on repository operations.

Core Strengths

1. The repository root is the product

All durable state stays under the repository root. A repo remains valid after:

- moving it to another absolute path
- archiving and restoring it later
- handing the directory to another process or machine

Repository truth does not depend on absolute paths, host-local registries, or external sidecar databases.
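
One way to picture why a repository survives a move: if the metadata database records only repo-relative paths, every lookup can be resolved against wherever the root currently lives. The sketch below is a toy model built on the standard library. The table name, column names, and file names are hypothetical and do not reflect hubvault's actual schema.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def create_toy_repo(root: Path) -> None:
    """Create a toy repo: payload on disk, repo-relative paths in SQLite."""
    (root / "objects").mkdir(parents=True)
    (root / "objects" / "a.data").write_bytes(b"weights-v1")
    conn = sqlite3.connect(root / "metadata.sqlite3")
    try:
        conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, rel_path TEXT)")
        # Only a repo-relative path is recorded, never an absolute one.
        conn.execute("INSERT INTO files VALUES (?, ?)", ("model.bin", "objects/a.data"))
        conn.commit()
    finally:
        conn.close()


def read_file(root: Path, name: str) -> bytes:
    """Resolve the stored relative path against the repo's current location."""
    conn = sqlite3.connect(root / "metadata.sqlite3")
    try:
        row = conn.execute(
            "SELECT rel_path FROM files WHERE name = ?", (name,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError(name)
    return (root / row[0]).read_bytes()


def demo() -> bytes:
    """Create the toy repo, move it to a new absolute path, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        old_root = Path(tmp) / "old" / "repo"
        create_toy_repo(old_root)
        new_root = Path(tmp) / "moved" / "repo"
        new_root.parent.mkdir()
        shutil.move(str(old_root), str(new_root))
        return read_file(new_root, "model.bin")


print(demo())
```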

2. Git-like history semantics without pretending to be a Git workspace

hubvault exposes:

- Git-style 40-hex commit / tree / blob OIDs
- branches, tags, and reflog
- fast-forward, merge-commit, and conflict merge outcomes
- explicit commit APIs rather than implicit staging-area behavior

The mental model is closer to "a local artifact repository with Git-like history" than "Git transplanted onto large-file storage."
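
The three merge outcomes listed above can be told apart from the commit graph alone, plus a check for conflicting paths. The following is a conceptual sketch over a toy in-memory graph, not hubvault's merge code. The commit ids, the `classify_merge` helper, and the extra "already-up-to-date" case are assumptions added for illustration.

```python
from typing import Dict, List, Set

# Toy commit graph: commit id -> parent ids (ids shortened for readability).
PARENTS: Dict[str, List[str]] = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],  # continues the main line
    "d1": ["c2"],  # diverges from the main line at c2
}


def ancestors(commit: str) -> Set[str]:
    """Return the commit itself plus every commit reachable through parents."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def classify_merge(target: str, source: str, conflicting_paths: Set[str]) -> str:
    """Classify what merging source into target would produce."""
    if source in ancestors(target):
        return "already-up-to-date"
    if target in ancestors(source):
        return "fast-forward"
    if conflicting_paths:
        return "conflict"
    return "merge-commit"


print(classify_merge("c2", "c3", set()))          # target is behind source
print(classify_merge("c3", "d1", set()))          # diverged, no path conflicts
print(classify_merge("c3", "d1", {"README.md"}))  # diverged, same path changed
```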

3. Hugging Face style file APIs

The public surface is centered on HubVaultApi, including:

- upload_file() / upload_folder()
- hf_hub_download() / snapshot_download()
- list_repo_files() / list_repo_tree() / get_paths_info()
- list_repo_commits() / list_repo_refs() / list_repo_reflog()

Where alignment with huggingface_hub improves usability, hubvault follows it closely. Parameters that would be meaningless no-ops for a local embedded repository are intentionally omitted.

4. Detached read views are a first-class rule

- hf_hub_download("artifacts/model.safetensors") preserves the repo-relative suffix in the returned path
- the returned path is a user-facing readable view
- editing or deleting that path does not corrupt committed repository truth
- the system can materialize the view again when needed

In other words, read APIs expose safe views, not writable aliases of committed truth.
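
The detached-view rule can be modeled with a plain copy: committed bytes live in one place, and callers receive a separate file whose path ends with the repo-relative name. This is a standard-library sketch of the idea only. `materialize_view` and the `store` / `views` directory names are hypothetical, and hubvault may build its views differently.

```python
import shutil
import tempfile
from pathlib import Path


def materialize_view(store_file: Path, views_root: Path, path_in_repo: str) -> Path:
    """Copy committed bytes into a separate view path and return that path."""
    view_path = views_root / path_in_repo
    view_path.parent.mkdir(parents=True, exist_ok=True)
    # A real copy (not a symlink or hard link) keeps edits away from the store.
    shutil.copyfile(store_file, view_path)
    return view_path


def demo() -> tuple:
    """Tamper with a view, delete it, rebuild it, and report what survived."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store_file = root / "store" / "blob-0001.data"
        store_file.parent.mkdir()
        store_file.write_bytes(b"weights-v1")

        name = "artifacts/model.safetensors"
        view = materialize_view(store_file, root / "views", name)
        view.write_bytes(b"tampered")  # the caller edits the returned path
        view.unlink()                  # and then deletes it

        # The stored bytes are unchanged, and the view can be rebuilt.
        rebuilt = materialize_view(store_file, root / "views", name)
        return (
            store_file.read_bytes(),
            rebuilt.read_bytes(),
            rebuilt.as_posix().endswith(name),
        )


print(demo())
```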

5. Small and large files share one versioned model

- small files can be stored as ordinary versioned objects
- large files switch to chunk / pack storage after the configured threshold
- public metadata still exposes HF-style oid / blob_id / sha256
- internal addressing remains decoupled from the public file model
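
To make the split between public identifiers and internal storage concrete, the sketch below computes a Git-style blob id and a SHA-256 digest for a payload, and separately decides whether the bytes would be stored whole or in chunks. The blob-id formula is Git's own; whether hubvault derives its OIDs exactly this way is an assumption. The threshold, chunk size, and `plan_storage` helper are hypothetical and deliberately tiny.

```python
import hashlib
from typing import List, Tuple

CHUNK_THRESHOLD = 8  # hypothetical threshold in bytes, tiny for demonstration
CHUNK_SIZE = 4       # hypothetical fixed chunk size in bytes


def git_blob_oid(data: bytes) -> str:
    """Git's blob id: SHA-1 over a NUL-terminated 'blob <size>' header plus content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def plan_storage(data: bytes) -> Tuple[str, List[str]]:
    """Return the storage mode and the SHA-256 digest of each stored piece."""
    if len(data) <= CHUNK_THRESHOLD:
        return "whole", [hashlib.sha256(data).hexdigest()]
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return "chunked", [hashlib.sha256(piece).hexdigest() for piece in pieces]


for payload in (b"small", b"0123456789"):
    mode, digests = plan_storage(payload)
    # The public ids depend only on the content, not on how it is stored.
    print(mode, len(digests), git_blob_oid(payload), hashlib.sha256(payload).hexdigest())
```

The point of the sketch is that the identifiers a caller sees are functions of the file content alone, so the chunking policy can change without changing them.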

Runtime Layout

The current layout is best understood like this:

repo/
├── FORMAT
├── metadata.sqlite3
├── locks/
│   └── repo.lock
├── objects/
│   └── blobs/
│       └── ... *.data
├── chunks/
│   └── packs/
│       └── ... *.pack
├── cache/
├── txn/
└── quarantine/

You usually do not need to inspect these files directly. The layout is shown to explain why the repository can be copied, archived, and reopened as one directory.

Good Fits and Non-Goals

Good fits:

- local model repositories
- dataset and evaluation snapshot archives
- training outputs and reproducible experiment bundles
- offline artifact repositories that need branch / merge / verify / GC behavior

Current non-goals:

- remote sync protocols
- multi-tenant server deployment
- a Git workspace or staging compatibility layer
- storing all payload bytes directly inside SQLite

Docs and Contributor Entry Points

- English docs: https://hubvault.readthedocs.io/en/latest/
- Chinese docs: https://hubvault.readthedocs.io/zh/latest/
- Contribution guide: CONTRIBUTING.md
- Repository collaboration rules: AGENTS.md
- Benchmark records: build/benchmark/

Project Status

The current published version is 0.0.2, and the project remains pre-stable. That said, the following capabilities are already implemented:

- SQLite truth-store
- detached read views
- local history / refs / merge / reflog
- verify / gc / squash / storage overview
- both Python API and CLI entry points

If you need a local, portable, ML-artifact-oriented versioned repository, hubvault is already a serious experimental foundation. If you need a mature remote collaboration platform or fully optimized hot-read performance, the project is still converging.

Project details

These details have not been verified by PyPI

Release history

- 0.0.2 (this version): Apr 10, 2026
- 0.0.1: Apr 10, 2026

Download files

Source Distribution

hubvault-0.0.2.tar.gz (167.5 kB)

Uploaded Apr 10, 2026 Source

Built Distribution

hubvault-0.0.2-py3-none-any.whl (137.9 kB)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file hubvault-0.0.2.tar.gz.

File metadata

Download URL: hubvault-0.0.2.tar.gz
Upload date: Apr 10, 2026
Size: 167.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hubvault-0.0.2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `c970604f9af9553db03469bce35eeffaad584b609251cf227591b2de8729cc29` |
| MD5 | `d22e4e619813ce0198be5fe3a6579365` |
| BLAKE2b-256 | `fa56f696c9741db6ad94d35e7ebf2da386c156ac811cc9366f967dfa21a57964` |

File details

Details for the file hubvault-0.0.2-py3-none-any.whl.

File metadata

Download URL: hubvault-0.0.2-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 137.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hubvault-0.0.2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `79f673425ef2266dfd5463b42b252db135b1aaace15a0955ad7fb7190008e61c` |
| MD5 | `e69e5f194044f31d780c20d3ac39b00b` |
| BLAKE2b-256 | `c90e4396e5080819be6f5b4bd5edcc532eef8acf613df52baee96dcdb25cb691` |
