Reference-based file reconstruction with CDC chunking, SBD1 binary seed format, and IPFS transport
Seedbraid
Seedbraid is a reference-based reconstruction tool for large, similar binary artifacts.
It combines deterministic content-defined chunking (CDC), a compact binary SBD1 seed format, reusable genome storage, and optional IPFS transport so you can ship reconstruction intent instead of repeatedly shipping full blobs.
Why Seedbraid
Seedbraid is designed for workflows where ordinary file distribution becomes wasteful:
- large binary artifacts change often, but stay mostly similar
- fixed-size chunking loses reuse under shifted offsets
- you want compact transport plus bit-perfect restore guarantees
- you want one CLI surface for encode, verify, decode, publish, and fetch
In short: Seedbraid helps you move less data, reuse more content, and still verify exact reconstruction.
When Seedbraid Is a Good Fit
Seedbraid works especially well for:
- large binary versioning: datasets, ML models, media assets, VM images
- distribution of many similar files across releases
- shift-heavy changes such as insertions that break fixed chunk reuse
- IPFS-based distribution and retrieval with integrity validation
- environments where transfer size, dedup reuse, and reproducibility matter
Core Capabilities
- Lossless encode/decode with SHA-256 verification
- Deterministic chunking with fixed, cdc_buzhash, and cdc_rabin
- Genome storage backed by SQLite for deduplicated chunk reuse
- SBD1 binary seed container with manifest, recipe, optional RAW, and integrity data
- IPFS publish/fetch transport
- Optional remote pin integration
- Strict verification mode for production-grade restore checks
- Optional signing and encryption support
Installation
pip
pip install seedbraid
pipx
pipx install seedbraid
seedbraid --help
uvx
uvx seedbraid --help
uvx seedbraid doctor
Optional extras
# pip
pip install "seedbraid[zstd]"
pip install "seedbraid[crypto]" # encryption / signing support
# pipx
pipx install "seedbraid[zstd]"
pipx install "seedbraid[crypto]"
# uvx
uvx --from "seedbraid[zstd]" seedbraid doctor
uvx --from "seedbraid[crypto]" seedbraid doctor
Quick Start
1. Encode a file into a seed
seedbraid encode input.bin --genome ./genome --out seed.sbd --portable
2. Verify the seed
seedbraid verify seed.sbd --genome ./genome --strict
3. Decode the file back
seedbraid decode seed.sbd --genome ./genome --out recovered.bin
4. Compare the result
cmp -s input.bin recovered.bin && echo "bit-perfect roundtrip: OK"
Note: If you installed via uvx, prefix commands with uvx (e.g. uvx seedbraid encode ...). For development builds, use uv run --no-editable seedbraid instead.
Typical Workflow
A common Seedbraid workflow looks like this:
- Prime or learn reusable chunks into a genome
- Encode a target artifact into a compact SBD1 seed
- Verify integrity before distribution
- Publish the seed if needed, including via IPFS
- Fetch and decode later using the genome
- Run strict verification when exact restore is required
Stability
Seedbraid v2.0.0 is production-ready.
Before deploying to your environment, validate behavior in your own runtime, storage, and network configuration.
Treat successful verify --strict and bit-perfect restore checks as release gates.
Production Validation Checklist
Before using Seedbraid in CI/CD or production pipelines, run a strict smoke workflow like this:
uv sync --no-editable --extra dev
workdir="$(mktemp -d)"
python3 - <<'PY' "$workdir/input.bin"
from pathlib import Path
import sys
out = Path(sys.argv[1])
payload = (b"seedbraid-beta-smoke" * 20000) + bytes(range(256)) * 200
out.write_bytes(payload)
print(f"wrote {out} bytes={len(payload)}")
PY
uv run --no-sync --no-editable seedbraid encode "$workdir/input.bin" \
--genome "$workdir/genome" \
--out "$workdir/seed.sbd" \
--chunker cdc_buzhash \
--avg 65536 --min 16384 --max 262144 \
--learn --portable --compression zlib
uv run --no-sync --no-editable seedbraid verify "$workdir/seed.sbd" \
--genome "$workdir/genome" \
--strict
uv run --no-sync --no-editable seedbraid decode "$workdir/seed.sbd" \
--genome "$workdir/genome" \
--out "$workdir/decoded.bin"
cmp -s "$workdir/input.bin" "$workdir/decoded.bin" \
&& echo "bit-perfect roundtrip: OK"
CLI Reference
All examples below use bare seedbraid. If you installed via uvx, prefix with uvx. For development builds, use uv run --no-editable seedbraid.
Core Commands
Encode
seedbraid encode input.bin --genome ./genome --out seed.sbd
seedbraid encode input.bin --genome ./genome --out seed.sbd \
--chunker cdc_buzhash --avg 65536 --min 16384 --max 262144 \
--learn --no-portable --compression zlib
seedbraid encode input.bin --genome ./genome --out seed.private.sbd \
--manifest-private
export SB_ENCRYPTION_KEY='your-secret-passphrase'
seedbraid encode input.bin --genome ./genome --out seed.encrypted.sbd \
--encrypt --manifest-private
Decode
seedbraid decode seed.sbd --genome ./genome --out recovered.bin
seedbraid decode seed.encrypted.sbd --genome ./genome --out recovered.bin \
--encryption-key "$SB_ENCRYPTION_KEY"
Verify
seedbraid verify seed.sbd --genome ./genome
seedbraid verify seed.sbd --genome ./genome --strict
seedbraid verify seed.sbd --genome ./genome --require-signature --signature-key "$SB_SIGNING_KEY"
seedbraid verify seed.encrypted.sbd --genome ./genome --strict \
--encryption-key "$SB_ENCRYPTION_KEY"
verify supports two modes:
- Quick mode: checks seed integrity and required chunk availability
- Strict mode: reconstructs all content and enforces source size and SHA-256 match
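
The difference between the two modes can be illustrated with a toy model. This is a minimal sketch, not Seedbraid's implementation: `ToySeed`, `make_seed`, `verify_quick`, and `verify_strict` are hypothetical names, the genome is a plain dict, and the quick check here covers only chunk availability (not seed integrity). It shows why a chunk that is present but corrupted passes an availability check yet fails a full reconstruction check.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ToySeed:
    """Hypothetical stand-in for a seed: ordered chunk IDs plus source metadata."""
    recipe: list[str]      # hex SHA-256 of each chunk, in reconstruction order
    source_size: int
    source_sha256: str


def make_seed(chunks: list[bytes]) -> tuple[ToySeed, dict[str, bytes]]:
    """Build a toy seed and a dict-based 'genome' from a list of chunks."""
    genome = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    data = b"".join(chunks)
    recipe = [hashlib.sha256(c).hexdigest() for c in chunks]
    return ToySeed(recipe, len(data), hashlib.sha256(data).hexdigest()), genome


def verify_quick(seed: ToySeed, genome: dict[str, bytes]) -> bool:
    """Availability only: every chunk ID in the recipe must be present."""
    return all(cid in genome for cid in seed.recipe)


def verify_strict(seed: ToySeed, genome: dict[str, bytes]) -> bool:
    """Reconstruct everything, then enforce source size and SHA-256."""
    if not verify_quick(seed, genome):
        return False
    digest, size = hashlib.sha256(), 0
    for cid in seed.recipe:
        chunk = genome[cid]
        digest.update(chunk)
        size += len(chunk)
    return size == seed.source_size and digest.hexdigest() == seed.source_sha256


seed, genome = make_seed([b"header", b"payload-1", b"payload-2"])
corrupted = dict(genome)
corrupted[seed.recipe[1]] = b"payload-X"  # same ID and length, wrong bytes

print("quick :", verify_quick(seed, corrupted))
print("strict:", verify_strict(seed, corrupted))
```

In this toy model the corrupted store passes the quick check and fails the strict one, which is the reason to treat strict verification as the release gate.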
Prime
seedbraid prime "./dataset/**/*" --genome ./genome --chunker cdc_buzhash
Doctor
seedbraid doctor --genome ./genome
doctor checks:
- Python runtime compatibility (>=3.12)
- kubo API reachability (SB_KUBO_API)
- IPFS_PATH state
- genome path writability
- compression support (zlib, optional zstd)
Advanced Commands
Genome Snapshot / Restore
seedbraid genome snapshot --genome ./genome --out genome.sgs
seedbraid genome restore genome.sgs --genome ./genome-dr --replace
Publish Chunks to IPFS
seedbraid publish-chunks seed.sbd --genome ./genome
seedbraid publish-chunks seed.sbd --genome ./genome \
--manifest-out chunks.json --workers 32
seedbraid publish-chunks seed.sbd --genome ./genome \
--pin --remote-pin \
--remote-endpoint https://pin.example/api/v1 \
--remote-token "$SB_PINNING_TOKEN"
publish-chunks publishes all CDC chunks referenced by a seed to IPFS as raw blocks, generates a chunk manifest sidecar (.sbd.chunks.json), and optionally pins the chunk DAG locally or via a remote pinning provider.
Fetch and Decode from IPFS
seedbraid fetch-decode seed.sbd --out recovered.bin
seedbraid fetch-decode seed.sbd --out recovered.bin \
--workers 64 --batch-size 200 --retries 5
seedbraid fetch-decode seed.sbd --out recovered.bin \
--gateway https://ipfs.io/ipfs
fetch-decode reads a seed and its chunk manifest, fetches all chunks from IPFS in parallel batches, and reconstructs the original file. Requires the chunk manifest sidecar (.sbd.chunks.json) alongside the seed.
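The "parallel batches" pattern can be sketched with the standard library. This is an illustrative sketch under stated assumptions, not Seedbraid's code: `fetch_in_batches` and `fetch_one` are hypothetical names, and the `workers` and `batch_size` parameters merely mirror the --workers and --batch-size options shown above. The key property is that chunks come back in manifest order even though each batch is fetched concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def fetch_in_batches(
    cids: list[str],
    fetch_one: Callable[[str], bytes],
    workers: int = 64,
    batch_size: int = 200,
) -> Iterator[bytes]:
    """Yield chunks in the order of `cids`, fetching one batch at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(cids), batch_size):
            batch = cids[start:start + batch_size]
            # Executor.map returns results in input order, so the caller
            # can append them to the output file as they arrive.
            yield from pool.map(fetch_one, batch)


# Stand-in for a network fetch: a dict keyed by hypothetical chunk IDs.
fake_network = {f"cid-{i}": f"chunk-{i};".encode() for i in range(10)}
order = [f"cid-{i}" for i in range(10)]

restored = b"".join(
    fetch_in_batches(order, fake_network.__getitem__, workers=4, batch_size=3)
)
print(restored.decode())
```

Bounding each batch keeps memory use predictable: at most one batch of chunks is in flight at any time.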
Decode with IPFS Genome
seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:///path/to/cache --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin \
--gateway https://ipfs.io/ipfs
Using --genome ipfs:// activates hybrid storage: chunks are fetched from IPFS with local SQLite caching. ipfs:// uses a temporary cache; ipfs:///path/to/cache persists fetched chunks for future reuse.
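A read-through cache is one way to picture this hybrid mode. The sketch below is an assumption-laden illustration, not Seedbraid's schema or API: `CachedChunkStore`, the `chunks` table, and `fetch_remote` are hypothetical. Passing ":memory:" loosely corresponds to the temporary cache of ipfs://, while a file path corresponds to the persistent cache of ipfs:///path/to/cache.

```python
import sqlite3
from typing import Callable


class CachedChunkStore:
    """Hypothetical read-through cache: local SQLite first, remote on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes], cache_path: str = ":memory:"):
        self._fetch_remote = fetch_remote
        self._db = sqlite3.connect(cache_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS chunks (cid TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )

    def get(self, cid: str) -> bytes:
        row = self._db.execute(
            "SELECT data FROM chunks WHERE cid = ?", (cid,)
        ).fetchone()
        if row is not None:
            return bytes(row[0])          # cache hit: no network access
        data = self._fetch_remote(cid)    # cache miss: fetch, then remember
        with self._db:                    # commits the insert on success
            self._db.execute(
                "INSERT OR REPLACE INTO chunks (cid, data) VALUES (?, ?)", (cid, data)
            )
        return data


# Stand-in for IPFS: a dict plus a log of which IDs were fetched remotely.
remote = {"cid-a": b"alpha", "cid-b": b"beta"}
remote_calls: list[str] = []


def fake_remote(cid: str) -> bytes:
    remote_calls.append(cid)
    return remote[cid]


store = CachedChunkStore(fake_remote)
store.get("cid-a")
store.get("cid-a")   # served from the SQLite cache
print(remote_calls)  # prints ['cid-a']
```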
Publish to IPFS
seedbraid publish seed.sbd --no-pin
seedbraid publish seed.sbd --pin
seedbraid publish seed.sbd --remote-pin \
--remote-endpoint https://pin.example/api/v1 --remote-token "$SB_PINNING_TOKEN"
publish emits a warning when the seed is unencrypted. For sensitive data, prefer:
seedbraid encode --encrypt --manifest-private ...
When --remote-pin is enabled, Seedbraid also registers the CID with a configured Pinning Services API-compatible provider.
Fetch from IPFS
seedbraid fetch <cid> --out fetched.sbd
seedbraid fetch <cid> --out fetched.sbd --retries 5 --backoff-ms 300
seedbraid fetch <cid> --out fetched.sbd --gateway https://ipfs.io/ipfs
fetch retries with exponential backoff via the kubo HTTP API and can fall back to an HTTP gateway.
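The retry-then-fallback behavior can be sketched as follows. This is a minimal illustration, not Seedbraid's implementation: `fetch_with_backoff`, `primary`, and `fallback` are hypothetical names, and the doubling schedule is an assumption (the source says only "exponential backoff"). The `retries` and `backoff_ms` parameters mirror the --retries and --backoff-ms options.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    primary: Callable[[], bytes],
    fallback: Optional[Callable[[], bytes]] = None,
    retries: int = 5,
    backoff_ms: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try `primary` up to `retries` times, doubling the delay between attempts.

    If every attempt fails and a `fallback` (e.g. an HTTP gateway) is given,
    try it once before giving up.
    """
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return primary()
        except OSError as exc:  # covers ConnectionError, TimeoutError, URLError
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff_ms * (2 ** attempt) / 1000)
    if fallback is not None:
        return fallback()
    raise RuntimeError("fetch failed after all retries") from last_error


# Demo: the primary source fails twice, then succeeds.
attempts = {"count": 0}
delays: list[float] = []


def flaky_primary() -> bytes:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("daemon unreachable")
    return b"seed-bytes"


result = fetch_with_backoff(flaky_primary, retries=5, backoff_ms=300, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the sketch testable without real waiting; with a 300 ms base the delays recorded here are 0.3 s and 0.6 s.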
Pin Health
seedbraid pin-health <cid>
Remote Pin Add
export SB_PINNING_ENDPOINT='https://pin.example/api/v1'
export SB_PINNING_TOKEN='your-api-token'
seedbraid pin remote-add <cid>
Sign Seed
export SB_SIGNING_KEY='your-shared-secret'
seedbraid sign seed.sbd --out seed.signed.sbd --key-env SB_SIGNING_KEY --key-id team-a
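Shared-secret signing is commonly built on a keyed hash. The sketch below assumes HMAC-SHA256, which the source does not confirm as Seedbraid's scheme; `sign_bytes`, `verify_signature`, and the returned field names are hypothetical. It illustrates only the general idea that the same secret both produces and checks the tag, and that a key ID lets a verifier pick the right secret.

```python
import hashlib
import hmac


def sign_bytes(payload: bytes, secret: str, key_id: str) -> dict[str, str]:
    """Return a hypothetical signature record for `payload` (HMAC-SHA256 assumed)."""
    tag = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return {"key_id": key_id, "algorithm": "hmac-sha256", "tag": tag}


def verify_signature(payload: bytes, signature: dict[str, str], secret: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature["tag"])


payload = b"toy seed bytes"
signature = sign_bytes(payload, "your-shared-secret", key_id="team-a")

print(verify_signature(payload, signature, "your-shared-secret"))
print(verify_signature(payload + b"!", signature, "your-shared-secret"))
```

With any symmetric scheme of this kind, everyone who can verify can also sign, so the secret should be distributed as carefully as an encryption key.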
Export / Import Genes
seedbraid export-genes seed.sbd --genome ./genome --out genes.pack
seedbraid import-genes genes.pack --genome ./another-genome
Generate an Encryption Key
Generate a high-entropy key for SB_ENCRYPTION_KEY:
seedbraid gen-encryption-key
Print shell export format:
seedbraid gen-encryption-key --shell
Set the current shell variable directly:
eval "$(seedbraid gen-encryption-key --shell)"
IPFS Setup
Start the kubo daemon:
ipfs daemon
By default, seedbraid connects to the kubo HTTP API at
http://127.0.0.1:5001/api/v0. Override with the SB_KUBO_API
environment variable:
export SB_KUBO_API=http://127.0.0.1:5001/api/v0
Run seedbraid doctor to verify connectivity.
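For scripts that need the same reachability check without the CLI, a rough equivalent can be written against the Kubo RPC API. This is a sketch, not what seedbraid doctor does internally: `kubo_version` is a hypothetical helper, and it assumes the daemon exposes the standard version endpoint, which Kubo serves over POST.

```python
import json
import os
import urllib.request
from typing import Optional

DEFAULT_KUBO_API = "http://127.0.0.1:5001/api/v0"


def kubo_version(api_base: Optional[str] = None, timeout: float = 3.0) -> Optional[str]:
    """Return the daemon's version string, or None if the API is unreachable."""
    base = (api_base or os.environ.get("SB_KUBO_API") or DEFAULT_KUBO_API).rstrip("/")
    # The Kubo RPC API expects POST requests, even for read-only calls.
    request = urllib.request.Request(f"{base}/version", data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response).get("Version")
    except (OSError, ValueError):  # URLError is an OSError; bad JSON is a ValueError
        return None


version = kubo_version(timeout=1.0)
print(f"kubo {version}" if version else "kubo API not reachable")
```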
Remote Pinning Setup
To use a remote pinning service, set the endpoint and token as environment variables.
Using a shell profile (~/.bashrc, ~/.zshrc):
export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'
Using direnv (.envrc in your project directory):
# .envrc
export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'
With these variables set, --remote-pin works without passing --remote-endpoint and --remote-token each time.
Verifying a Remote Pin
After publishing with --remote-pin, confirm the pin is active:
# 1. Check local pin and block availability
seedbraid pin-health <cid>
# 2. Verify the pinned content is fetchable from the network
seedbraid fetch <cid> --out /tmp/verify.sbd
seedbraid verify /tmp/verify.sbd --genome ./genome --strict
If pin-health reports the CID is pinned and fetch + verify --strict succeed, the remote pin is working correctly.
Common Failures
- kubo daemon not reachable
  - Install Kubo, start the daemon with ipfs daemon, and verify with seedbraid doctor
- Missing required chunk on decode or verify
  - Provide the correct --genome, or re-encode with --portable
- zstd compression error
  - Install optional dependency zstandard, or use --compression zlib
Data Recovery Guide
Reconstructing a file requires two things: a seed (the recipe describing chunk order) and the chunks themselves (the actual data). If either is missing, recovery is impossible.
When Recovery Succeeds
| Scenario | Why It Works |
|---|---|
| Seed on hand + local genome available | Recipe and ingredients are both local |
| Seed on hand + own IPFS node running with chunks pinned | Recipe is local; ingredients are in your node's storage |
| Seed on hand + chunks held by a pinning service (Pinata, etc.) | Recipe is local; ingredients are in a paid storage provider |
| Seed on hand + teammate's IPFS node holds the chunks | Recipe is local; ingredients are on a peer's node |
| Seed created with --portable (chunks embedded in seed) | Recipe and ingredients are bundled together in one file |
| Seed on hand + genome snapshot (.sgs backup) exists | Recipe is local; ingredients are in a backup archive |
When Recovery Fails
| Scenario | Why It Fails |
|---|---|
| Seed file lost | Without the recipe, there is no way to know which chunks to fetch or how to reassemble them |
| Seed exists, but genome deleted and chunks never published to IPFS | Recipe exists, but all ingredients have been discarded |
| Seed exists, but IPFS node stopped and no other node holds the chunks | Recipe exists, but the only store that had the ingredients is offline |
| Seed exists, but IPFS pin removed and garbage collection ran | Recipe exists, but automatic cleanup deleted the ingredients |
| Seed exists, but pinning service subscription expired | Recipe exists, but the storage provider disposed of the ingredients |
| Seed exists, but even one chunk is missing from all sources | Partial recovery is not supported; every chunk is required |
| Seed is encrypted and the encryption key is lost | The recipe is unreadable without the key |
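
The all-or-nothing rule in the tables above can be expressed as a small planning step: for each chunk in the recipe, find any source that still holds it, and refuse to proceed if even one chunk has no holder. The sketch is purely illustrative; `plan_recovery`, the source names, and the chunk IDs are hypothetical and do not correspond to Seedbraid functions.

```python
def plan_recovery(recipe: list[str], sources: dict[str, set[str]]) -> dict[str, str]:
    """Map each chunk ID to the first source that holds it.

    Raises LookupError if any chunk is missing from every source, because
    partial recovery is not supported.
    """
    plan: dict[str, str] = {}
    missing: list[str] = []
    for cid in recipe:
        holder = next((name for name, held in sources.items() if cid in held), None)
        if holder is None:
            missing.append(cid)
        else:
            plan[cid] = holder
    if missing:
        raise LookupError(f"{len(missing)} chunk(s) missing from all sources: {missing}")
    return plan


recipe = ["c1", "c2", "c3"]
sources = {
    "local genome": {"c1"},
    "own IPFS node": {"c1", "c2"},
    "pinning service": {"c2", "c3"},
}
print(plan_recovery(recipe, sources))

try:
    plan_recovery(recipe, {"local genome": {"c1"}, "own IPFS node": {"c2"}})
except LookupError as exc:
    print("recovery impossible:", exc)
```

Sources are consulted in the order given, so listing the cheapest source first (the local genome) avoids unnecessary network fetches.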
Protecting Against Data Loss
| Action | Risk Mitigated |
|---|---|
| Back up seed files | Prevents seed loss |
| Use --pin when publishing chunks | Prevents IPFS garbage collection |
| Use a pinning service (--remote-pin) | Survives local node shutdown |
| Encode with --portable | Self-contained seed; no external chunk source needed (seed size increases) |
| Keep encryption keys in a secret manager | Prevents key loss for encrypted seeds |
| Take genome snapshots (genome snapshot) | Preserves local chunk data independently of IPFS |
Safest option: --portable embeds all chunks in the seed, making it fully self-contained. The trade-off is that the seed grows to roughly the size of the original file, reducing the benefit of IPFS distribution.
Troubleshooting Matrix
| Symptom | Error Code | Next Action |
|---|---|---|
| Encryption requested but key missing | SB_E_ENCRYPTION_KEY_MISSING | Pass --encryption-key or set SB_ENCRYPTION_KEY. |
| Signing requested but key missing | SB_E_SIGNING_KEY_MISSING | Export signing key env var and retry seedbraid sign. |
| Kubo daemon unreachable | SB_E_IPFS_NOT_FOUND | Install Kubo, run ipfs daemon, set SB_KUBO_API if non-default endpoint. |
| IPFS fetch/publish failure | SB_E_IPFS_FETCH / SB_E_IPFS_PUBLISH | Check daemon/network, retry, use gateway fallback if needed. |
| Remote pin configuration missing | SB_E_REMOTE_PIN_CONFIG | Set endpoint/token env vars or pass options. |
| Remote pin auth failed | SB_E_REMOTE_PIN_AUTH | Verify provider token permissions and retry. |
| Remote pin request invalid | SB_E_REMOTE_PIN_REQUEST | Check CID/provider options and retry. |
| Remote pin timeout/failure | SB_E_REMOTE_PIN_TIMEOUT / SB_E_REMOTE_PIN | Increase retries/timeout or check provider health. |
| Seed parse/integrity failure | SB_E_SEED_FORMAT | Re-fetch/rebuild seed and verify source integrity. |
| IPFS chunk publish failed | SB_E_IPFS_CHUNK_PUT | Check IPFS daemon, retry, verify chunk availability. |
| IPFS chunk fetch failed | SB_E_IPFS_CHUNK_GET | Check daemon/network, retry, use --gateway fallback. |
| Chunk manifest invalid | SB_E_CHUNK_MANIFEST_FORMAT | Regenerate manifest with publish-chunks. |
| IPFS MFS operation failed | SB_E_IPFS_MFS | Verify daemon is running with seedbraid doctor. |
Development & Contributing
The sections below are for contributors and developers working on Seedbraid itself.
Development Setup
uv sync --no-editable --extra dev
Optional zstd support:
uv sync --no-editable --extra dev --extra zstd
Refresh the lockfile after dependency changes:
uv lock
Local Checks
UV_CACHE_DIR=.uv-cache uv run --no-editable ruff check .
PYTHONPATH=src uv run --no-editable python -m pytest
PYTHONPATH=src uv run --no-editable python -m pytest tests/test_compat_fixtures.py
IPFS tests auto-skip when the kubo daemon is not reachable.
Compatibility fixtures are stored in tests/fixtures/compat/v1/ and validated by tests/test_compat_fixtures.py.
To regenerate them intentionally:
uv run --no-editable python scripts/gen_compat_fixtures.py
CI
GitHub Actions workflows:
- .github/workflows/ci.yml
  - ruff check .
  - python -m pytest
  - compatibility fixtures validation
  - benchmark gate
- .github/workflows/publish-seed.yml
  - manual only, dry_run=true by default
  - generates a seed from source_path
  - runs seedbraid verify --strict
  - publishes to IPFS only when dry_run=false
  - installs Kubo when needed
  - verifies Kubo release signature status and checksum
  - supports pin, portable, manifest_private, and optional encrypt
Local parity commands:
uv sync --no-editable --extra dev
uv run --no-sync --no-editable ruff check .
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest tests/test_compat_fixtures.py
uv run --no-sync --no-editable python scripts/bench_gate.py \
--min-reuse-improvement-bps 1 \
--max-seed-size-ratio 1.20 \
--min-cdc-throughput-mib-s 0.10 \
--json-out .artifacts/bench-report.json
Benchmarking
1-byte insertion dedup benchmark
uv run --no-editable python scripts/bench_shifted_dedup.py
uv run --no-editable python scripts/bench_gate.py \
--min-reuse-improvement-bps 1 \
--max-seed-size-ratio 1.20 \
--min-cdc-throughput-mib-s 0.10 \
--json-out .artifacts/bench-report.json
Expected behavior:
- cdc_buzhash should show better reuse than fixed when a single-byte insertion shifts offsets
- bench_gate.py exits non-zero when configured thresholds are violated
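
The reason a single inserted byte hurts fixed chunking but not content-defined chunking can be reproduced with a toy experiment. The sketch below is not Seedbraid's cdc_buzhash chunker and not the benchmark script: `fixed_chunks`, `cdc_chunks`, and `reuse_ratio` are hypothetical helpers, and the chunker hashes a trailing window with BLAKE2b instead of a true rolling hash, which is slow but keeps the principle visible. Chunk sizes are tiny on purpose.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Toy content-defined chunker: cut where a hash of the trailing
    `window` bytes has its low bits all zero (or at `max_size`)."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        tail = data[max(0, i + 1 - window):i + 1]
        fingerprint = int.from_bytes(
            hashlib.blake2b(tail, digest_size=4).digest(), "big"
        )
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reuse_ratio(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of bytes in `new` that sit in chunks already seen in `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
shifted = b"\x00" + original  # single-byte insertion at the front

for name, chunker in (("fixed", fixed_chunks), ("toy cdc", cdc_chunks)):
    ratio = reuse_ratio(chunker(original), chunker(shifted))
    print(f"{name}: {ratio:.1%} of bytes reused after the insertion")
```

Because cut points depend only on nearby content, the content-defined boundaries realign shortly after the insertion, while every fixed-offset chunk shifts by one byte and no longer matches.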
Integrations
DVC Integration
- Minimal DVC bridge lives in examples/dvc/
- Pipeline stages are encode -> verify --strict -> fetch
- The integration recipe and artifact layout are documented in examples/dvc/README.md
OCI Integration
- ORAS bridge scripts and usage docs live in examples/oci/
- Default OCI metadata convention:
  - artifact type: application/vnd.seedbraid.seed.v1
  - layer media type: application/vnd.seedbraid.seed.layer.v1+sbd
  - annotations: source SHA-256, chunker, manifest-private flag, seed title
- Push/pull scripts:
  - examples/oci/scripts/push_seed.sh <seed.sbd> <registry/repository:tag>
  - examples/oci/scripts/pull_seed.sh <registry/repository:tag> <out.sbd>
- After pull, run strict verification:
  seedbraid verify <out.sbd> --genome <genome-path> --strict
ML Tooling Hooks
- Scripts for MLflow metadata logging and Hugging Face upload live in examples/ml/
- MLflow hook logs seed metadata fields
- Hugging Face hook uploads seed.sbd and a metadata sidecar
- Restore workflow is documented in examples/ml/README.md
Roadmap
Current adoption priorities include:
- a faster onboarding path
- stronger benchmark evidence versus alternatives
- security and operator tooling such as signing, encryption, doctor, snapshot, and restore
- stable format governance and backward-compatibility policy for long-lived seed archives
Project Documents
- Format spec: docs/FORMAT.md
- Design rationale: docs/DESIGN.md
- Threat model: docs/THREAT_MODEL.md
- Error codes: docs/ERROR_CODES.md
- Performance gates: docs/PERFORMANCE.md
- DVC example: examples/dvc/README.md
- OCI example: examples/oci/README.md
- ML tooling example: examples/ml/README.md
Support Seedbraid
Seedbraid is maintained as an open-source project.
If Seedbraid helps your workflow, please consider supporting the project through the repository Sponsor button. Support goes directly toward maintenance, documentation, and compatibility/performance validation.
Open Source Governance
- License: MIT (LICENSE)
- Security policy: SECURITY.md
- Contributing guide: CONTRIBUTING.md
- Code of Conduct: CODE_OF_CONDUCT.md