
Content-addressed chunk storage with deduplication


chunkstore


Embeddable content-addressed chunk storage (CAS) with byte-level deduplication, reference-count GC, and bindings for Rust, Python, and Go.

Drop a dedup layer into your app, not a backup CLI. When uploads, versions, and templates share bytes, chunkstore stores each unique chunk once; in the shared chunk-pool benchmark this cuts bytes on disk/S3 by ~90% (see Benchmarks).




Quick start

30 seconds, Python (filesystem backend). Build and install the bindings:

cd python && maturin develop --release

Then, in Python:

from chunkstore import ChunkStore, FilesystemBackend

store = ChunkStore.open(FilesystemBackend("/data/chunks"))
store.ingest("doc_v1", b"hello world")
assert store.read("doc_v1") == b"hello world"
print(store.stats())  # savings_pct grows as you deduplicate

Rust:

Build and test the core crate:

cargo build --release && cargo test -p chunkstore-core

Then, in Rust:

use chunkstore::{ChunkStore, FsBackend};

let store = ChunkStore::open(FsBackend::new("/data/chunks")?)?;
store.ingest("doc_v1", b"hello world")?;
println!("{:?}", store.stats()?);

Go (build core first):

CARGO_TARGET_DIR=target cargo build --release -p chunkstore-core
cd go/chunkstore && go test -v

Then, in Go:

store, _ := chunkstore.OpenFilesystem("/data/chunks")
defer store.Close()
store.Ingest("doc_v1", []byte("hello world"))

Cross-language smoke test: pytest -m cross_lang (Python write → Go read/delete → Python stats).


Design principles

Patterns borrowed from Restic, RocksDB, and zstd — applied to an embeddable dedup layer:

| Principle | What it means for you |
| --- | --- |
| Embed | Library in your process. No daemon, no S3 proxy, no separate backup agent. |
| Content-addressed | Chunk key = full SHA-256 hex (64 chars). Same bytes → same key → automatic dedup. |
| Verifiable | Every read checks sha256(data) == digest. Bit rot and tampering surface immediately. |
| Efficient | Refcount + GC: deleting a file drops only the chunks nothing else references. |
| Portable | One on-disk format across Rust, Python, and Go (_manifest/, _refcount/, chunk blobs). |
| Honest scope | Byte-identical dedup only. No perceptual hashing, no distributed metadata. |

Public API boundary: language wrappers call the Rust core (PyO3 / cgo / direct crate). Backend layout and metadata keys are stable; internal Rust modules may change between minor versions until v1.0.
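
The "content-addressed" and "verifiable" principles can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not chunkstore's implementation: `put_chunk`, `get_chunk`, and the in-memory `blobs` dict are hypothetical names standing in for the real backend.

```python
import hashlib

blobs: dict[str, bytes] = {}  # hypothetical in-memory backend: digest -> chunk bytes


def put_chunk(data: bytes) -> str:
    """Store a chunk under its full SHA-256 hex digest (64 chars)."""
    digest = hashlib.sha256(data).hexdigest()
    blobs.setdefault(digest, data)  # same bytes -> same key -> stored once
    return digest


def get_chunk(digest: str) -> bytes:
    """Return the chunk, re-hashing it first so corruption is detected."""
    data = blobs[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError(f"integrity check failed for chunk {digest}")
    return data


key_a = put_chunk(b"hello world")
key_b = put_chunk(b"hello world")  # duplicate upload reuses the same key
```

Both calls return the same 64-character key and only one blob is stored; if the stored bytes are later altered, `get_chunk` raises instead of returning bad data.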


Should I use chunkstore?

flowchart TD
    start[Need to store files in my app] --> q1{Same bytes across uploads or versions?}
    q1 -->|No - unique photos/video| no[Skip dedup - use raw S3/FS]
    q1 -->|Yes - duplicates, templates, versions| q2{Want a backup CLI?}
    q2 -->|Yes| backup[Use Restic/Borg/Kopia]
    q2 -->|No - embed in upload service| q3{Edits at file start?}
    q3 -->|Often| cdc[chunkstore + CDC chunking]
    q3 -->|Rare - bulk immutable blobs| fixed[chunkstore + fixed 4 MiB]

| Your situation | Recommendation |
| --- | --- |
| Document CMS, PDF versions, rescans | chunkstore + CDC |
| Idempotent re-upload of same file | chunkstore (near 100% savings on re-upload) |
| Template / boilerplate libraries | chunkstore (shared chunks across files) |
| Unique camera rolls, encodes, ZIPs | Raw blob storage (dedup ~0%) |
| Disaster recovery, off-site backups | Restic / Borg, not chunkstore |
| S3-compatible object store for everything | SeaweedFS / MinIO, not chunkstore |

The problem

Most apps store files naively: every upload writes a full copy to disk or S3. With document versions, rescans, shared templates, and duplicate uploads, the same bytes are stored many times. Blob storage has no dedup API — you either accept the cost or build your own layer.

| Naive storage | What you pay |
| --- | --- |
| User uploads the same PDF twice | 2× disk / S3 bytes |
| 100 document versions with 90% shared content | ~100× tail, ~100× shared prefix |
| 1000 files built from 200 shared 4 MiB blocks | Full logical size on disk |

chunkstore splits files into chunks, addresses each chunk by SHA-256, stores each unique chunk once, and tracks references. Identical bytes → one physical object, refcount++. Delete a file → refcount-- → garbage-collect chunks at zero.
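
The ingest/delete cycle described above can be sketched with three dictionaries. This is an illustrative model under stated assumptions, not the library's code: the names `chunks`, `refcount`, `manifests`, `ingest`, and `delete` are hypothetical, the chunk size is shrunk to 4 bytes, and each `file_id` is assumed to be ingested only once.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; the documented default is 4 MiB

chunks: dict[str, bytes] = {}         # digest -> chunk bytes
refcount: dict[str, int] = {}         # digest -> number of references
manifests: dict[str, list[str]] = {}  # file_id -> ordered digests


def ingest(file_id: str, data: bytes) -> None:
    """Split into fixed chunks, store unique ones, and count references."""
    digests = []
    for off in range(0, len(data), CHUNK_SIZE):
        piece = data[off:off + CHUNK_SIZE]
        digest = hashlib.sha256(piece).hexdigest()
        chunks.setdefault(digest, piece)                # store unique bytes once
        refcount[digest] = refcount.get(digest, 0) + 1  # refcount++
        digests.append(digest)
    manifests[file_id] = digests


def delete(file_id: str) -> None:
    """Drop the manifest and garbage-collect chunks whose count hits zero."""
    for digest in manifests.pop(file_id):
        refcount[digest] -= 1                           # refcount--
        if refcount[digest] == 0:
            del refcount[digest]
            del chunks[digest]


ingest("v1", b"AAAABBBBCCCC")
ingest("v2", b"AAAABBBBDDDD")  # shares two of three chunks with v1
```

After both ingests, four unique chunks back six logical chunks. Deleting `v1` removes only the `CCCC` chunk, because `AAAA` and `BBBB` are still referenced by `v2`.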


What you get

| Benefit | How |
| --- | --- |
| Lower storage bill | Unique chunks stored once; shared prefixes and duplicates reused |
| Safe reads | sha256(chunk_bytes) == digest on every read |
| Embeddable library | Call from upload/versioning code, like embedding RocksDB, not running a DB server |
| Cross-language format | Same directory from Python and Go |
| Pluggable backend | FS and S3 in wrappers; core uses callbacks or built-in FS |
| Two chunking modes | Fixed (fast) and CDC (edit-friendly) |
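
To make the "pluggable backend" idea concrete, here is a sketch of the kind of key/value interface a dedup layer needs. The `Backend` protocol, its method names, `MemoryBackend`, and `store_once` are all assumptions for illustration; the actual wrapper API may look different.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal key/value interface a dedup layer needs."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...
    def exists(self, key: str) -> bool: ...


class MemoryBackend:
    """In-memory stand-in; an FS or S3 backend would expose the same methods."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def exists(self, key: str) -> bool:
        return key in self._objects


def store_once(backend: Backend, key: str, data: bytes) -> bool:
    """Write only if the key is absent; return True when bytes were written."""
    if backend.exists(key):
        return False
    backend.put(key, data)
    return True
```

Because chunk keys are content digests, a "write if absent" helper like `store_once` is all a backend needs to support deduplicated writes.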

Storage impact (measured)

Scenario: 1000 files from a pool of 200 × 4 MiB chunks (~7.8 GiB logical).

Naive storage vs chunkstore dedup

| Approach | Bytes on disk | Savings vs naive |
| --- | --- | --- |
| Naive per-file copy | ~7.8 GiB | n/a (baseline) |
| chunkstore dedup | ~0.78 GiB | ~90% |
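
The ~90% figure follows from the ratio of unique to logical chunks. The sketch below replays the scenario at a reduced scale. It assumes two chunks per file (which is what ~7.8 GiB across 1000 files of 4 MiB chunks implies) and a deterministic chunk selection; the real benchmark may pick chunks differently.

```python
import hashlib

POOL_SIZE = 200      # distinct chunks in the pool
FILES = 1000         # files built from the pool
CHUNKS_PER_FILE = 2  # assumption: ~7.8 GiB / 1000 files / 4 MiB per chunk
# Each pool chunk is 64 bytes here, scaled down from 4 MiB so the sketch runs instantly.
pool = [hashlib.sha256(i.to_bytes(4, "big")).digest() * 2 for i in range(POOL_SIZE)]

logical = 0
unique: dict[str, int] = {}  # digest -> chunk size, one entry per unique chunk
for f in range(FILES):
    for c in range(CHUNKS_PER_FILE):
        # Deterministic pick so every pool chunk is used at least once.
        chunk = pool[(f * CHUNKS_PER_FILE + c) % POOL_SIZE]
        logical += len(chunk)
        unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)

physical = sum(unique.values())
savings_pct = 100 * (1 - physical / logical)
print(f"logical={logical} physical={physical} savings={savings_pct:.1f}%")
```

At full scale the same ratio gives 200 × 4 MiB ≈ 0.78 GiB stored against 2000 × 4 MiB ≈ 7.8 GiB logical.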

Savings by workload

[Figure: bar chart of dedup savings percent by workload type]

| Savings band | Meaning |
| --- | --- |
| < 10% | Noise: dedup overhead may not pay off |
| 20–30% | Noticeable in disk/S3 billing |
| 50%+ | Strong fit: duplicates, clones, versions |
| 80%+ | Shared templates, chunk pools |

Who it's for

| Use case | Why chunkstore |
| --- | --- |
| Document management | PDF versions, scans, templates: large shared prefixes |
| File upload services | Same file uploaded twice; resumable re-uploads |
| File versioning | v2 = v1 + small edit; CDC keeps chunk boundaries |
| Multi-service stacks | Python API writes, Go worker reads the same chunk directory |

Dogfood target: document storage / docs_service — versioned PDFs, scans, boilerplate templates.


What this is / is not

| chunkstore is | chunkstore is not |
| --- | --- |
| Embeddable CAS/dedup library | Backup CLI (Restic / Borg / Kopia) |
| Dedup layer over FS or S3 | S3 gateway or reverse proxy |
| Byte-identical dedup (SHA-256) | Perceptual / similarity dedup |
| Manifest + refcount + GC | Distributed multi-node store |
| Shared format across Python / Go / Rust | Encryption or compliance product |

Current limits: single-process store lock; refcount and metadata on the backend — fine for one app node, not for concurrent multi-writer clusters without external coordination.


Compared to alternatives

vs raw blob storage (S3 / local FS)

| | Raw S3/FS | chunkstore |
| --- | --- | --- |
| Dedup | None | SHA-256 chunk-level |
| Integration | Native SDK | Library in your app |
| Versioning | Your DB/metadata | file_id → manifest |
| Best for | Unique blobs, media | Repeated bytes across files |

vs backup tools (Restic, Borg, Kopia)

| | Backup tool | chunkstore |
| --- | --- | --- |
| Model | init / backup / restore | ingest / read / delete in-process |
| Encryption / retention | Built-in | Not included |
| Dedup + CDC | Yes (Restic uses Rabin CDC) | Yes (similar ideas, different format) |
| Fit | Ops / DR | Application embed |

Restic (~35k GitHub stars) solves backup. chunkstore solves "my upload service stores too many duplicate bytes" — complementary, not competing.

vs object stores (SeaweedFS, MinIO)

| | Object store | chunkstore |
| --- | --- | --- |
| API | S3 HTTP | In-process library |
| Scope | Cluster, buckets, ACLs | Single-store dedup layer |

When to use both: chunkstore under your app; object store as backend via S3 wrapper.

How it works

Architecture flow

file → chunks → SHA-256 (64-char hex) → backend
                ↓
         manifest (file_id → [digests])
         refcount (digest → count) → GC when count == 0

| Layer | Role |
| --- | --- |
| core/ | Rust: chunking, hashing, manifests, refcount, C-API |
| python/ | PyO3/maturin wrapper + FS/S3 backends |
| go/ | cgo wrapper + FS backend |

Metadata keys (shared across languages):

| Key | Content |
| --- | --- |
| _manifest/{file_id} | JSON: ordered digests + file_bytes |
| _refcount/{digest} | JSON: reference count |
| {digest} (64 hex) | Raw chunk bytes |
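
As a rough picture of what these keys look like for one file, the sketch below builds the key → bytes mapping a single ingest might produce. The key patterns come from the table above; the JSON field names (`digests`, `file_bytes`, `count`) and the `layout` helper are assumptions, so check the real format before relying on them.

```python
import hashlib
import json


def layout(file_id: str, data: bytes, chunk_size: int) -> dict[str, bytes]:
    """Return the backend keys one ingest would produce, per the key table."""
    objects: dict[str, bytes] = {}
    counts: dict[str, int] = {}
    digests: list[str] = []
    for off in range(0, len(data), chunk_size):
        piece = data[off:off + chunk_size]
        digest = hashlib.sha256(piece).hexdigest()
        objects[digest] = piece  # {digest} -> raw chunk bytes
        counts[digest] = counts.get(digest, 0) + 1
        digests.append(digest)
    # Field names "digests", "file_bytes", and "count" are assumed, not confirmed.
    manifest = {"digests": digests, "file_bytes": len(data)}
    objects[f"_manifest/{file_id}"] = json.dumps(manifest).encode()
    for digest, n in counts.items():
        objects[f"_refcount/{digest}"] = json.dumps({"count": n}).encode()
    return objects


objs = layout("doc_v1", b"abcdabcdwxyz", chunk_size=4)
for key in sorted(objs):
    print(key)
```

The 12-byte input yields three manifest entries but only two chunk blobs, because `abcd` appears twice and is stored once with a reference count of 2.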

When dedup works — and when it doesn't

Works well

| Workload | Typical savings | Notes |
| --- | --- | --- |
| Pool 200×4 MiB chunks, 1000 files | ~90% | Benchmark scenario |
| Full duplicate file (2 copies) | ~50% | Second copy reuses all chunks |
| Partial overlap (shared prefix) | 30–45% | Only tail chunks are new |
| Document versions (90/10) | ~40% | 20 MiB files |
| Prefix insert 1 byte + CDC | ~45% | 4/5 CDC chunks reused |
| Re-upload of identical file | ~100% | All chunks exist |
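
Savings here means the share of logical bytes that never had to be stored. The arithmetic below shows one plausible way the ~50% and ~40% rows arise. It assumes two 20 MiB files per scenario and fixed 4 MiB chunks; the exact benchmark setup lives in workload_analysis and may differ.

```python
MIB = 1024 * 1024
CHUNK = 4 * MIB


def savings_pct(logical_bytes: int, physical_bytes: int) -> float:
    """Share of logical bytes that never had to be stored, in percent."""
    return round(100 * (1 - physical_bytes / logical_bytes), 1)


# Full duplicate: two copies of one 20 MiB file; the second copy adds no chunks.
dup = savings_pct(logical_bytes=2 * 20 * MIB, physical_bytes=20 * MIB)

# Versions 90/10 (assumed: two 20 MiB versions, fixed 4 MiB chunks).
# The first 18 MiB match, so 4 of v2's 5 chunks already exist and 1 is new.
versions = savings_pct(logical_bytes=2 * 5 * CHUNK, physical_bytes=(5 + 1) * CHUNK)

print(dup, versions)
```

Note that with fixed chunks the 90% shared prefix only pays off for whole chunks that fall entirely inside the shared region.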

Poor or zero savings

| Workload | Typical savings | Why |
| --- | --- | --- |
| Prefix insert 1 byte + fixed 4 MiB | ~0% | All boundaries shift |
| Random unique binaries | ~0% | No overlap |
| Unique photos / video | ~0% | High entropy |
| JPEG, MP4, ZIP | ~0% cross-file | Already compressed |

[Figure: fixed vs CDC on prefix insert]


Chunking: fixed vs CDC

| | Fixed (default 4 MiB) | CDC (FastCDC v2020) |
| --- | --- | --- |
| Speed | Fastest | Moderate |
| Chunk sizes | Uniform | min 256 KiB, avg 4 MiB, max 8 MiB |
| Best for | Bulk uploads | Versioned docs, edits at start |
| Prefix insert (+1 byte) | ~0% savings | ~45% savings (4 of 5 chunks reused) |
| API | ingest / Ingest | ingest_cdc / IngestCDC |
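
The difference on a prefix insert can be reproduced with a toy experiment. The chunker below is a deliberately simplified gear-hash CDC, not FastCDC v2020: it has no min/max chunk size, averages about 1 KiB instead of 4 MiB, and all names (`fixed_chunks`, `cdc_chunks`, `reuse_ratio`) are hypothetical.

```python
import hashlib

# Pseudo-random but deterministic test data (64 KiB) and gear table.
DATA = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big") for b in range(256)]
MASK = (1 << 10) - 1  # cut when the low 10 hash bits are zero (~1 KiB average)


def fixed_chunks(data: bytes, size: int = 1024) -> list[bytes]:
    """Cut at fixed offsets; any insert shifts every later boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Toy gear-hash CDC: cut wherever the rolling hash matches MASK."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # depends only on recent bytes
        if (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def reuse_ratio(chunker, old: bytes, new: bytes) -> float:
    """Fraction of the new file's chunks whose digest already exists."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


edited = b"X" + DATA  # insert one byte at the start of the file
fixed_reuse = reuse_ratio(fixed_chunks, DATA, edited)
cdc_reuse = reuse_ratio(cdc_chunks, DATA, edited)
print(f"fixed: {fixed_reuse:.0%}  cdc: {cdc_reuse:.0%}")
```

With fixed chunks, every chunk of the edited file contains shifted bytes, so none match. With content-defined cuts, boundaries depend only on nearby bytes, so they realign right after the insert and most chunks are reused. A real CDC with minimum and maximum sizes trades some of that resilience for bounded chunk sizes.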

Multi-language

| Language | Open | Upload | Read | Delete |
| --- | --- | --- | --- | --- |
| Rust | ChunkStore::open(FsBackend::new(path)?) | ingest | read | delete |
| Python | ChunkStore.open(FilesystemBackend(path)) | ingest | read | delete |
| Go | chunkstore.OpenFilesystem(path) | Ingest | Read | Delete |

CI verifies: Python write → Go read/delete → Python stats (pytest -m cross_lang).


Installation

| Language | Requirements | Command |
| --- | --- | --- |
| Rust | Rust 1.70+ | cargo build --release -p chunkstore-core |
| Python | Python 3.10+, Rust (maturin) | cd python && maturin develop --release |
| Go | Go 1.22+, built libchunkstore.a | See go/README.md |
| Dev | pytest, Go toolchain | pip install ".[dev]" in python/ |

# Full local verify (from repo root)
CARGO_TARGET_DIR=target cargo build --release -p chunkstore-core
cd python && maturin develop --release && pytest -q
cd ../go/chunkstore && go test -v

Examples

| Example | Description |
| --- | --- |
| examples/fastapi/ | Upload / download / delete HTTP API |
| examples/go-http/ | Go HTTP service (planned) |

cd python && maturin develop --release && pip install ".[fastapi]"
PYTHONPATH=../examples/fastapi uvicorn main:app --reload

Endpoints: POST /files/{id}, GET /files/{id}, DELETE /files/{id}, GET /stats.


Benchmarks

Reproducible workloads via workload_analysis — run locally:

cargo run -p chunkstore-core --example workload_analysis --release
cargo bench -p chunkstore-core

| Workload | Savings |
| --- | --- |
| Dedup pool 200×4 MiB, 1000 files | 90.0% |
| Versions 90/10 | 40.0% |
| Prefix insert + CDC | 45.1% |
| Full duplicate (2 copies) | 50.0% |

On the document-style workloads above (versions, prefix insert + CDC, full duplicate), chunkstore averages ~45% savings; the shared chunk-pool scenario reaches 90%, roughly twice that.

xychart-beta
    title "Dedup savings % (document workloads)"
    x-axis ["Pool 1k", "CDC +1B", "Versions", "Dup 2x"]
    y-axis "Savings %" 0 --> 100
    bar [90, 45, 40, 50]

Contributing

Want to fix a bug, add a feature, or improve bindings? See CONTRIBUTING.md for:

  • local setup and full CI commands
  • repository layout and where to put changes
  • test requirements (including cross-language)
  • Rust / Python / Go style rules
  • on-disk format stability rules

PyPI releases: docs/PYPI.md

Roadmap: docs/ROADMAP.md


License

MIT — see LICENSE.
