Seedbraid: reference-based reconstruction with CDC and IPFS seed transport

Seedbraid

Seedbraid provides reference-based reconstruction with deterministic content-defined chunking (CDC), a binary SBD1 seed format, and IPFS publish/fetch transport.

Beta Status (Read First)

  • Seedbraid is currently in beta.
  • Before production use, run strict validation in your own runtime/storage/network environment.
  • Treat successful verify --strict and bit-perfect restore checks as release gates for your team.

Strict Validation Workflow (Required Before Production)

Run the following smoke workflow before relying on Seedbraid in CI/CD or production pipelines:

uv sync --no-editable --extra dev

workdir="$(mktemp -d)"
python3 - <<'PY' "$workdir/input.bin"
from pathlib import Path
import sys

out = Path(sys.argv[1])
payload = (b"seedbraid-beta-smoke" * 20000) + bytes(range(256)) * 200
out.write_bytes(payload)
print(f"wrote {out} bytes={len(payload)}")
PY

uv run --no-sync --no-editable seedbraid encode "$workdir/input.bin" \
  --genome "$workdir/genome" \
  --out "$workdir/seed.sbd" \
  --chunker cdc_buzhash \
  --avg 65536 --min 16384 --max 262144 \
  --learn --portable --compression zlib

uv run --no-sync --no-editable seedbraid verify "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --strict

uv run --no-sync --no-editable seedbraid decode "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --out "$workdir/decoded.bin"

cmp -s "$workdir/input.bin" "$workdir/decoded.bin" \
  && echo "bit-perfect roundtrip: OK"

UV_CACHE_DIR=.uv-cache uv run --no-sync --no-editable ruff check .
PYTHONPATH=src UV_CACHE_DIR=.uv-cache uv run --no-sync --no-editable python -m pytest

Features

  • Lossless encode/decode with SHA-256 verification.
  • Chunkers: fixed, cdc_buzhash, cdc_rabin.
  • Genome storage (SQLite) for deduplicated chunk reuse.
  • SBD1 binary seed container (manifest + recipe + optional RAW + integrity).
  • IPFS CLI integration (publish, fetch).
  • Optional remote pin integration (pin remote-add, publish-time remote pin).
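
To make the deduplication idea concrete, here is a minimal sketch of a content-addressed chunk store on SQLite. The `ChunkStore` class, the table name, and the schema are hypothetical and do not describe Seedbraid's actual genome layout; the sketch only shows how keying chunks by their SHA-256 digest means a repeated chunk is stored once.

```python
import hashlib
import sqlite3


class ChunkStore:
    """Toy content-addressed store: one row per unique chunk (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "digest BLOB PRIMARY KEY, data BLOB NOT NULL)"
        )

    def put(self, chunk: bytes) -> bytes:
        digest = hashlib.sha256(chunk).digest()
        # INSERT OR IGNORE keeps the first copy and skips duplicates.
        self.db.execute(
            "INSERT OR IGNORE INTO chunks VALUES (?, ?)", (digest, chunk)
        )
        self.db.commit()
        return digest

    def get(self, digest: bytes) -> bytes:
        row = self.db.execute(
            "SELECT data FROM chunks WHERE digest = ?", (digest,)
        ).fetchone()
        if row is None:
            raise KeyError(digest.hex())
        return row[0]

    def count(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]


store = ChunkStore()
# The "recipe" is just the ordered list of digests needed to rebuild the input.
recipe = [store.put(c) for c in (b"alpha", b"beta", b"alpha")]
restored = b"".join(store.get(d) for d in recipe)
print(store.count(), restored)
```

Three chunks go in, but only two rows are stored, and the ordered digest list is enough to rebuild the original bytes.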

Why Seedbraid

  • Seed-first architecture: reconstruction intent is shipped as a compact SBD1 seed (manifest + recipe) instead of shipping full blobs repeatedly.
  • End-to-end integrity posture: strict verify mode, compatibility fixtures, and performance gates are built into the project workflow.
  • Practical Web3 distribution: CID publish/fetch is part of the same CLI surface as encode/decode, reducing operational handoffs.
  • Shift-resilient dedup by default: CDC is first-class and benchmarked against fixed chunking with reproducible scripts.

Best-Fit Use Cases

  • Large binary versioning: datasets, ML models, media assets, and VM images.
  • Distribution of many similar files: share a common genome and distribute compact seeds.
  • IPFS-based distribution and retrieval: distribute by CID and verify reconstruction integrity.
  • Shift-heavy changes (for example, single-byte insertion): CDC improves reuse over fixed chunking.

What It Takes for OSS Adoption

  • A 5-minute onboarding path (installation + first encode/decode tutorial).
  • Benchmark evidence that Seedbraid wins against alternatives on size, transfer time, and restore speed.
  • Security and operations readiness: signing/encryption and operator tooling (doctor, snapshot, restore).
  • Stable format governance and backward-compatibility policy for long-lived seed archives.

Installation

Note: PyPI publishing is currently on hold. pip install seedbraid is not yet available. Please install from source.

Quick Start

uv sync --no-editable --extra dev

Optional zstd support:

uv sync --no-editable --extra dev --extra zstd

Refresh lockfile after dependency changes:

uv lock

Generate Encryption Key

Generate a high-entropy key for SB_ENCRYPTION_KEY:

uv run --no-editable seedbraid gen-encryption-key

Print shell export format:

uv run --no-editable seedbraid gen-encryption-key --shell

Set the variable in the current shell directly:

eval "$(uv run --no-editable seedbraid gen-encryption-key --shell)"
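
Where the CLI is not available yet, a key with comparable entropy can be produced with Python's standard `secrets` module. This is a hedged sketch: the 32-byte length, the URL-safe encoding, and the `generate_key` helper are assumptions, and the actual output format of gen-encryption-key may differ.

```python
import secrets


def generate_key(num_bytes: int = 32) -> str:
    """Return a URL-safe string carrying num_bytes of OS-provided randomness."""
    return secrets.token_urlsafe(num_bytes)


key = generate_key()
# Shell export form, similar in spirit to --shell (exact CLI format is assumed).
print(f"export SB_ENCRYPTION_KEY='{key}'")
```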

CLI

Encode

uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.sbd \
  --chunker cdc_buzhash --avg 65536 --min 16384 --max 262144 \
  --learn --no-portable --compression zlib

uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.private.sbd \
  --manifest-private

export SB_ENCRYPTION_KEY='your-secret-passphrase'
uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.encrypted.sbd \
  --encrypt --manifest-private

Decode

uv run --no-editable seedbraid decode seed.sbd --genome ./genome --out recovered.bin
uv run --no-editable seedbraid decode seed.encrypted.sbd --genome ./genome --out recovered.bin \
  --encryption-key "$SB_ENCRYPTION_KEY"

Verify

uv run --no-editable seedbraid verify seed.sbd --genome ./genome
uv run --no-editable seedbraid verify seed.sbd --genome ./genome --strict
uv run --no-editable seedbraid verify seed.sbd --genome ./genome --require-signature --signature-key "$SB_SIGNING_KEY"
uv run --no-editable seedbraid verify seed.encrypted.sbd --genome ./genome --strict \
  --encryption-key "$SB_ENCRYPTION_KEY"

verify supports two modes:

  • Quick mode (default): checks seed integrity and required chunk availability.
  • Strict mode (--strict): reconstructs all content and enforces source size and SHA-256 match.
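
The strict check comes down to comparing the size and digest of the restored bytes against the source. The sketch below illustrates that comparison on two files. `file_digest` and `same_content` are hypothetical helpers, not part of Seedbraid, and they mirror the `cmp -s` step of the smoke workflow rather than the internals of verify --strict.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, block_size: int = 1 << 20) -> tuple[int, str]:
    """Stream a file and return (size in bytes, SHA-256 hex digest)."""
    sha, size = hashlib.sha256(), 0
    with path.open("rb") as fh:
        while block := fh.read(block_size):
            sha.update(block)
            size += len(block)
    return size, sha.hexdigest()


def same_content(source: Path, restored: Path) -> bool:
    """True only when both size and SHA-256 match."""
    return file_digest(source) == file_digest(restored)


with tempfile.TemporaryDirectory() as tmp:
    src, good, bad = (
        Path(tmp) / name for name in ("input.bin", "decoded.bin", "corrupt.bin")
    )
    src.write_bytes(b"seedbraid" * 1000)
    good.write_bytes(b"seedbraid" * 1000)
    bad.write_bytes(b"seedbraid" * 999 + b"seedbraiD")  # one flipped byte
    roundtrip_ok = same_content(src, good)
    corruption_detected = not same_content(src, bad)

print(roundtrip_ok, corruption_detected)
```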

Prime

uv run --no-editable seedbraid prime "./dataset/**/*" --genome ./genome --chunker cdc_buzhash

Genome Snapshot / Restore

uv run --no-editable seedbraid genome snapshot --genome ./genome --out genome.sgs
uv run --no-editable seedbraid genome restore genome.sgs --genome ./genome-dr --replace

Publish (IPFS)

uv run --no-editable seedbraid publish seed.sbd --no-pin
uv run --no-editable seedbraid publish seed.sbd --pin
uv run --no-editable seedbraid publish seed.sbd --remote-pin \
  --remote-endpoint https://pin.example/api/v1 --remote-token "$SB_PINNING_TOKEN"

publish emits a warning when the seed is unencrypted. For sensitive data, run seedbraid encode --encrypt --manifest-private ... before publishing. When --remote-pin is enabled, Seedbraid also registers the CID with the configured remote pin provider (Pinning Services API-compatible).

Fetch (IPFS)

uv run --no-editable seedbraid fetch <cid> --out fetched.sbd
uv run --no-editable seedbraid fetch <cid> --out fetched.sbd --retries 5 --backoff-ms 300
uv run --no-editable seedbraid fetch <cid> --out fetched.sbd --gateway https://ipfs.io/ipfs

fetch retries ipfs cat with exponential backoff and can fall back to an HTTP gateway.
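
The retry-then-fallback pattern can be sketched in a few lines. This is an illustration under stated assumptions, not Seedbraid's implementation: `fetch_with_retry` is a hypothetical helper, the delay is assumed to double on each attempt starting from `backoff_ms`, and the primary and fallback sources are passed in as plain callables.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    primary: Callable[[], bytes],
    fallback: Optional[Callable[[], bytes]] = None,
    retries: int = 5,
    backoff_ms: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try primary up to `retries` times, doubling the delay between attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return primary()
        except OSError as exc:  # ConnectionError and TimeoutError are OSErrors
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff_ms * (2 ** attempt) / 1000)
    if fallback is not None:
        return fallback()  # e.g. an HTTP gateway download
    raise RuntimeError("all attempts failed") from last_error


# Demo: the primary source fails twice, then succeeds; delays are recorded
# instead of slept so the example runs instantly.
delays = []
calls = {"n": 0}


def flaky() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("daemon not reachable")
    return b"seed bytes"


result = fetch_with_retry(flaky, retries=5, backoff_ms=300, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the helper testable; in real use the default `time.sleep` applies the waits.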

Pin Health (IPFS)

uv run --no-editable seedbraid pin-health <cid>

Remote Pin Add (IPFS)

export SB_PINNING_ENDPOINT='https://pin.example/api/v1'
export SB_PINNING_TOKEN='your-api-token'
uv run --no-editable seedbraid pin remote-add <cid>
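
For orientation, a Pinning Services API-compatible provider accepts a pin request as a POST to the `/pins` path with a bearer token. The sketch below only builds such a request and does not send it. `build_pin_request` is a hypothetical helper, the CID is a placeholder, and how Seedbraid itself constructs the call is not shown here.

```python
import json
import urllib.request
from typing import Optional


def build_pin_request(
    endpoint: str, token: str, cid: str, name: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a Pinning Services API 'add pin' request."""
    body = {"cid": cid}
    if name:
        body["name"] = name
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/pins",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; substitute a real endpoint, token, and CID.
req = build_pin_request(
    "https://pin.example/api/v1", "your-api-token", "example-cid-placeholder"
)
print(req.get_method(), req.full_url)
# Sending is left out on purpose: urllib.request.urlopen(req) would perform the call.
```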

Doctor

uv run --no-editable seedbraid doctor --genome ./genome

doctor checks:

  • Python runtime compatibility (>=3.12)
  • IPFS CLI availability/version
  • IPFS_PATH state
  • genome path writability
  • compression support (zlib, optional zstd)
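
Roughly equivalent checks can be run by hand with the standard library, which helps when diagnosing an environment before Seedbraid is installed. This is a sketch under assumptions: `preflight` and its report keys are hypothetical, and the real doctor command may check more or report differently.

```python
import importlib.util
import os
import shutil
import sys
import tempfile


def preflight(genome_dir: str) -> dict:
    """Collect the environment facts listed above into a plain dict."""
    return {
        "python_ok": sys.version_info >= (3, 12),
        "ipfs_cli": shutil.which("ipfs"),          # None when not on PATH
        "ipfs_path": os.environ.get("IPFS_PATH"),  # None when unset
        "genome_writable": os.access(genome_dir, os.W_OK),
        "zlib": importlib.util.find_spec("zlib") is not None,
        "zstd": importlib.util.find_spec("zstandard") is not None,
    }


# A temporary directory stands in for the genome path in this demo.
with tempfile.TemporaryDirectory() as tmp:
    report = preflight(tmp)

for name, value in report.items():
    print(f"{name}: {value}")
```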

Sign Seed (optional)

export SB_SIGNING_KEY='your-shared-secret'
uv run --no-editable seedbraid sign seed.sbd --out seed.signed.sbd --key-env SB_SIGNING_KEY --key-id team-a
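
The example above uses a shared secret, which suggests a keyed-hash (HMAC) style signature. That is an assumption made here for illustration; the authoritative scheme is whatever docs/FORMAT.md specifies. Under that assumption, signing and verifying a payload with Python's standard library looks like this (`sign` and `verify` are hypothetical helpers):

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over the payload (assumed scheme)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(sign(payload, key), signature)


key = b"your-shared-secret"
tag = sign(b"seed bytes", key)
valid = verify(b"seed bytes", key, tag)
tampered = verify(b"tampered seed bytes", key, tag)
print(valid, tampered)
```

With a shared secret, anyone who can verify can also sign, so key distribution should be limited to parties trusted to do both.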

Export / Import Genes (optional)

uv run --no-editable seedbraid export-genes seed.sbd --genome ./genome --out genes.pack
uv run --no-editable seedbraid import-genes genes.pack --genome ./another-genome

IPFS Installation/Check

Check if IPFS CLI is available:

ipfs --version

If missing, install Kubo (IPFS CLI) and ensure ipfs is on your PATH.

Common Failures

  • ipfs CLI not found:
    • Install IPFS and verify with ipfs --version.
  • Missing required chunk on decode/verify:
    • Provide the correct --genome, or re-encode with --portable.
  • zstd compression error:
    • Install optional dependency zstandard, or use --compression zlib.
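
The missing-chunk failure is easier to reason about with a toy model of reference-based reconstruction. The sketch assumes a recipe is an ordered list of chunk digests and that a portable seed carries the raw chunk payloads it needs. `reconstruct`, `MissingChunk`, and the dict-based genome are all hypothetical and much simpler than the real SBD1 format.

```python
import hashlib


class MissingChunk(KeyError):
    """Raised when a recipe entry cannot be resolved (hypothetical error type)."""


def digest_of(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def reconstruct(recipe, genome, embedded=None) -> bytes:
    """Resolve each digest from the genome first, then from embedded payloads."""
    embedded = embedded or {}
    out = bytearray()
    for digest in recipe:
        chunk = genome.get(digest)
        if chunk is None:
            chunk = embedded.get(digest)
        if chunk is None:
            raise MissingChunk(digest)
        out += chunk
    return bytes(out)


chunks = [b"header", b"body", b"footer"]
recipe = [digest_of(c) for c in chunks]
full_genome = {digest_of(c): c for c in chunks}
partial_genome = {digest_of(b"header"): b"header"}   # wrong or incomplete genome
portable_payload = {digest_of(c): c for c in chunks}  # raw chunks shipped in seed

print(reconstruct(recipe, full_genome))
print(reconstruct(recipe, partial_genome, portable_payload))
try:
    reconstruct(recipe, partial_genome)
except MissingChunk as exc:
    print("missing chunk:", exc.args[0][:12])
```

In this model, the correct genome or an embedded payload both resolve the recipe, which corresponds to the two remedies listed above.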

Troubleshooting Matrix

Symptom                              | Error Code                                | Next Action
Encryption requested but key missing | SB_E_ENCRYPTION_KEY_MISSING               | Pass --encryption-key or set SB_ENCRYPTION_KEY.
Signing requested but key missing    | SB_E_SIGNING_KEY_MISSING                  | Export signing key env var and retry seedbraid sign.
IPFS CLI missing                     | SB_E_IPFS_NOT_FOUND                       | Install Kubo and confirm ipfs --version.
IPFS fetch/publish failure           | SB_E_IPFS_FETCH / SB_E_IPFS_PUBLISH       | Check daemon/network, retry, use gateway fallback if needed.
Remote pin configuration missing     | SB_E_REMOTE_PIN_CONFIG                    | Set endpoint/token env vars or pass options.
Remote pin auth failed               | SB_E_REMOTE_PIN_AUTH                      | Verify provider token permissions and retry.
Remote pin request invalid           | SB_E_REMOTE_PIN_REQUEST                   | Check CID/provider options and retry.
Remote pin timeout/failure           | SB_E_REMOTE_PIN_TIMEOUT / SB_E_REMOTE_PIN | Increase retries/timeout or check provider health.
Seed parse/integrity failure         | SB_E_SEED_FORMAT                          | Re-fetch/rebuild seed and verify source integrity.

CI (SBD-ECO-001)

GitHub Actions workflows:

  • .github/workflows/ci.yml
    • Lint: ruff check .
    • Test: python -m pytest
    • Compatibility fixtures: python -m pytest tests/test_compat_fixtures.py
    • Benchmark gate: python scripts/bench_gate.py ...
  • .github/workflows/publish-seed.yml (manual only, dry_run=true default)
    • Generates seed from source_path via seedbraid encode
    • Runs strict integrity check via seedbraid verify --strict
    • Publishes to IPFS only when dry_run=false
    • Installs Kubo (ipfs CLI) on runner when dry_run=false (version configurable via kubo_version)
    • Verifies Kubo release tag signature status via GitHub API before install
    • Verifies downloaded Kubo archive checksum (sha512) before extraction
    • Supports pin, portable, manifest_private, and optional encrypt (SB_ENCRYPTION_KEY secret required when encrypt=true)

Local parity commands:

uv sync --no-editable --extra dev
uv run --no-sync --no-editable ruff check .
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest tests/test_compat_fixtures.py
uv run --no-sync --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json

DVC Integration (SBD-ECO-003)

  • Minimal DVC bridge lives in examples/dvc/.
  • Pipeline stages are encode -> verify --strict -> fetch.
  • The verify stage is strict and fails pipeline reproduction on an integrity mismatch.
  • Integration recipe and artifact layout are documented in examples/dvc/README.md.

OCI Integration (SBD-ECO-004)

  • ORAS bridge scripts and usage docs live in examples/oci/.
  • Default OCI metadata convention:
    • artifact type: application/vnd.seedbraid.seed.v1
    • layer media type: application/vnd.seedbraid.seed.layer.v1+sbd
    • annotations: source SHA-256, chunker, manifest-private flag, seed title
  • Push/pull scripts:
    • examples/oci/scripts/push_seed.sh <seed.sbd> <registry/repository:tag>
    • examples/oci/scripts/pull_seed.sh <registry/repository:tag> <out.sbd>
  • After pull, run strict verification:
    • seedbraid verify <out.sbd> --genome <genome-path> --strict

ML Tooling Hooks (SBD-ECO-005)

  • Scripts for MLflow metadata logging and Hugging Face upload live in examples/ml/.
  • MLflow hook logs seed metadata fields (seed digest, manifest provenance, optional transport refs).
  • Hugging Face hook uploads seed.sbd + metadata sidecar with env-provided token credentials.
  • Restore workflow from logged metadata is documented in examples/ml/README.md.

Tests and CI-Equivalent Local Commands

uv run --no-editable ruff check .
uv run --no-editable python -m pytest
uv run --no-editable python -m pytest tests/test_compat_fixtures.py

IPFS tests auto-skip when ipfs is not installed. Compatibility fixtures are stored in tests/fixtures/compat/v1/ and are validated by tests/test_compat_fixtures.py. Regenerate intentionally with: uv run --no-editable python scripts/gen_compat_fixtures.py.

1-byte Insertion Dedup Benchmark

Run:

uv run --no-editable python scripts/bench_shifted_dedup.py
uv run --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json

Expected behavior:

  • cdc_buzhash should show better reuse than fixed when a single-byte insertion shifts offsets.
  • bench_gate.py exits non-zero when configured thresholds are violated.
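
The expected behavior can be reproduced in miniature without the benchmark scripts. The sketch below compares fixed-size chunking with a toy content-defined chunker after a single-byte insertion. It uses a simple gear-style rolling hash, not Seedbraid's cdc_buzhash implementation, and every name and parameter in it (`cdc_chunks`, `GEAR`, the sizes and mask) is an assumption chosen for illustration.

```python
import hashlib
import random

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 1024) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, min_size=256, max_size=4096, mask=0x3FF) -> list:
    """Cut where the rolling hash's low bits are zero (content-defined)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the low bits, so cuts depend on local content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reuse_ratio(old_chunks: list, new_chunks: list) -> float:
    """Fraction of new chunks whose SHA-256 already exists among old chunks."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(100_000))
shifted = original[:1000] + b"\x00" + original[1000:]  # single-byte insertion

fixed_reuse = reuse_ratio(fixed_chunks(original), fixed_chunks(shifted))
cdc_reuse = reuse_ratio(cdc_chunks(original), cdc_chunks(shifted))
print(f"fixed reuse: {fixed_reuse:.2%}  cdc reuse: {cdc_reuse:.2%}")
```

Fixed-size chunking loses every chunk after the insertion because all later boundaries move by one byte, while content-defined boundaries realign shortly after the edit.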

Project Documents

  • Format spec: docs/FORMAT.md
  • Design rationale: docs/DESIGN.md
  • Threat model: docs/THREAT_MODEL.md
  • Error codes: docs/ERROR_CODES.md
  • Performance gates: docs/PERFORMANCE.md
  • DVC workflow bridge example: examples/dvc/README.md
  • OCI/ORAS distribution example: examples/oci/README.md
  • ML tooling hooks example: examples/ml/README.md

Support Seedbraid

  • Seedbraid is maintained as an open-source project.
  • If Seedbraid helps your workflow, please consider donating via the repository Sponsor button.
  • Donations directly support maintenance, documentation, and compatibility/performance validation.

Open Source Governance

  • License: MIT (LICENSE)
  • Security policy: SECURITY.md
  • Contributing guide: CONTRIBUTING.md
  • Code of Conduct: CODE_OF_CONDUCT.md
