Content-addressed chunk storage with deduplication

Project description

chunkstore

Embeddable content-addressed chunk storage (CAS) with byte-level deduplication, reference-count GC, and bindings for Rust, Python, and Go.

Drop a dedup layer into your app, not a backup CLI. When uploads, versions, and templates share bytes, chunkstore stores each unique chunk once and cuts disk/S3 usage by up to ~90% in the shared chunk-pool benchmark (see Benchmarks).

Quick start
Design principles
Should I use chunkstore?
The problem
What you get
Who it's for
What this is / is not
Compared to alternatives
How it works
When dedup works
Chunking: fixed vs CDC
Multi-language
Installation
Examples
Benchmarks
Contributing
Roadmap
License

Quick start

30 seconds — Python (filesystem backend):

cd python && maturin develop --release

from chunkstore import ChunkStore, FilesystemBackend

store = ChunkStore.open(FilesystemBackend("/data/chunks"))
store.ingest("doc_v1", b"hello world")
assert store.read("doc_v1") == b"hello world"
print(store.stats())  # savings_pct grows as you deduplicate

Rust:

cargo build --release && cargo test -p chunkstore-core

use chunkstore::{ChunkStore, FsBackend};

let store = ChunkStore::open(FsBackend::new("/data/chunks")?)?;
store.ingest("doc_v1", b"hello world")?;
println!("{:?}", store.stats()?);

Go (build core first):

CARGO_TARGET_DIR=target cargo build --release -p chunkstore-core
cd go/chunkstore && go test -v

store, err := chunkstore.OpenFilesystem("/data/chunks")
if err != nil {
	log.Fatal(err) // do not silently ignore a failed open
}
defer store.Close()
store.Ingest("doc_v1", []byte("hello world"))

Cross-language smoke test: pytest -m cross_lang (Python write → Go read/delete → Python stats).

Design principles

Patterns borrowed from Restic, RocksDB, and zstd — applied to an embeddable dedup layer:

Principle	What it means for you
Embed	Library in your process. No daemon, no S3 proxy, no separate backup agent.
Content-addressed	Chunk key = full SHA-256 hex (64 chars). Same bytes → same key → automatic dedup.
Verifiable	Every read checks `sha256(data) == digest`. Bit rot and tampering surface immediately.
Efficient	Refcount + GC: deleting a file drops only the chunks that nothing else references.
Portable	One on-disk format across Rust, Python, and Go (`_manifest/`, `_refcount/`, chunk blobs).
Honest scope	Byte-identical dedup only. No perceptual hashing, no distributed metadata.

Public API boundary: language wrappers call the Rust core (PyO3 / cgo / direct crate). Backend layout and metadata keys are stable; internal Rust modules may change between minor versions until v1.0.
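
The "Content-addressed" and "Verifiable" rows can be illustrated with the standard library alone. This is a minimal sketch of the idea, not chunkstore's implementation: `chunk_key`, `verified_read`, and the in-memory `blobs` dict are hypothetical names used only for illustration.

```python
import hashlib


def chunk_key(data: bytes) -> str:
    """Full SHA-256 hex digest (64 characters) used as the chunk key."""
    return hashlib.sha256(data).hexdigest()


def verified_read(blobs: dict[str, bytes], digest: str) -> bytes:
    """Return the chunk stored under `digest`, re-hashing it before use."""
    data = blobs[digest]
    if chunk_key(data) != digest:
        raise ValueError(f"chunk {digest[:12]} failed its integrity check")
    return data


# Hypothetical in-memory stand-in for a backend.
blobs: dict[str, bytes] = {}
key = chunk_key(b"hello world")
blobs[key] = b"hello world"
clean = verified_read(blobs, key)

# Simulate bit rot: the stored bytes no longer match their key.
blobs[key] = b"hello w0rld"
try:
    verified_read(blobs, key)
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

Because the key is derived from the bytes, two callers that store the same chunk arrive at the same key without any coordination, which is what makes dedup automatic.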

Should I use chunkstore?

flowchart TD
    start[Need to store files in my app] --> q1{Same bytes across uploads or versions?}
    q1 -->|No - unique photos/video| no[Skip dedup - use raw S3/FS]
    q1 -->|Yes - duplicates, templates, versions| q2{Want a backup CLI?}
    q2 -->|Yes| backup[Use Restic/Borg/Kopia]
    q2 -->|No - embed in upload service| q3{Edits at file start?}
    q3 -->|Often| cdc[chunkstore + CDC chunking]
    q3 -->|Rare - bulk immutable blobs| fixed[chunkstore + fixed 4 MiB]

Your situation	Recommendation
Document CMS, PDF versions, rescans	chunkstore + CDC
Idempotent re-upload of same file	chunkstore (near 100% savings on re-upload)
Template / boilerplate libraries	chunkstore (shared chunks across files)
Unique camera rolls, encodes, ZIPs	Raw blob storage — dedup ~0%
Disaster recovery, off-site backups	Restic / Borg — not chunkstore
S3-compatible object store for everything	SeaweedFS / MinIO — not chunkstore

The problem

Most apps store files naively: every upload writes a full copy to disk or S3. With document versions, rescans, shared templates, and duplicate uploads, the same bytes are stored many times. Blob storage has no dedup API — you either accept the cost or build your own layer.

Naive storage	What you pay
User uploads the same PDF twice	2× disk / S3 bytes
100 document versions with 90% shared content	100 full copies: the shared 90% stored 100×, plus 100 unique tails
1000 files built from 200 shared 4 MiB blocks	Full logical size on disk

chunkstore splits files into chunks, addresses each chunk by SHA-256, stores each unique chunk once, and tracks references. Identical bytes → one physical object, refcount++. Delete a file → refcount-- → garbage-collect chunks at zero.
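
The ingest/delete cycle described above can be sketched as a toy in-memory store. `ToyDedupStore` is a hypothetical class written for this illustration; it uses a tiny fixed chunk size and plain dicts, and it is not the chunkstore API or its on-disk format.

```python
import hashlib


class ToyDedupStore:
    """Illustrative dedup store: chunks keyed by SHA-256, with refcounts."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.blobs: dict[str, bytes] = {}          # digest -> chunk bytes
        self.refcount: dict[str, int] = {}         # digest -> references
        self.manifests: dict[str, list[str]] = {}  # file_id -> ordered digests

    def ingest(self, file_id: str, data: bytes) -> None:
        if file_id in self.manifests:
            # Toy choice: replace the old version so refcounts stay exact.
            self.delete(file_id)
        digests = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = chunk  # first copy: store the bytes
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        self.manifests[file_id] = digests

    def read(self, file_id: str) -> bytes:
        return b"".join(self.blobs[d] for d in self.manifests[file_id])

    def delete(self, file_id: str) -> None:
        for digest in self.manifests.pop(file_id):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                # Nothing references this chunk any more: collect it.
                del self.refcount[digest]
                del self.blobs[digest]


store = ToyDedupStore(chunk_size=4)
store.ingest("a", b"AAAABBBBCCCC")
store.ingest("b", b"AAAABBBBDDDD")
unique_after_ingest = len(store.blobs)   # AAAA, BBBB, CCCC, DDDD

store.delete("a")
unique_after_delete = len(store.blobs)   # CCCC is gone, shared chunks stay
remaining = store.read("b")
```

Two 12-byte files share two of their three chunks, so only four chunks are stored; deleting one file removes just the chunk that was unique to it.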

What you get

Benefit	How
Lower storage bill	Unique chunks stored once; shared prefixes and duplicates reused
Safe reads	`sha256(chunk_bytes) == digest` on every read
Embeddable library	Call from upload/versioning code — like embedding RocksDB, not running a DB server
Cross-language format	Same chunk directory readable from Rust, Python, and Go
Pluggable backend	FS in the Python and Go wrappers, S3 in the Python wrapper; core uses callbacks or built-in FS
Two chunking modes	Fixed (fast) and CDC (edit-friendly)

Storage impact (measured)

Scenario: 1000 files from a pool of 200 × 4 MiB chunks (~7.8 GiB logical).

Figure: naive storage vs chunkstore dedup.

Approach	Bytes on disk	Savings vs naive
Naive per-file copy	~7.8 GiB	—
chunkstore dedup	~0.78 GiB	~90%
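
The figures in this table follow from simple arithmetic. The sketch below assumes two 4 MiB chunks per file (implied by the ~7.8 GiB logical size, not stated explicitly) and that every chunk in the pool is used at least once.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

chunk_size = 4 * MIB
pool_chunks = 200
files = 1000
chunks_per_file = 2  # assumption: 1000 files x 2 x 4 MiB is about 7.8 GiB

logical = files * chunks_per_file * chunk_size  # what naive storage writes
physical = pool_chunks * chunk_size             # each unique chunk stored once
savings_pct = 100 * (1 - physical / logical)

print(f"{logical / GIB:.2f} GiB logical, "
      f"{physical / GIB:.2f} GiB physical, "
      f"{savings_pct:.1f}% saved")
```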

Savings by workload

Figure: dedup savings (%) by workload type.

Savings band	Meaning
< 10%	Noise — dedup overhead may not pay off
20–30%	Noticeable in disk/S3 billing
50%+	Strong fit: duplicates, clones, versions
80%+	Shared templates, chunk pools

Who it's for

Use case	Why chunkstore
Document management	PDF versions, scans, templates — large shared prefixes
File upload services	Same file uploaded twice; resumable re-uploads
File versioning	v2 = v1 + small edit; CDC keeps chunk boundaries
Multi-service stacks	Python API writes, Go worker reads — same chunk directory

Dogfood target: document storage / docs_service — versioned PDFs, scans, boilerplate templates.

What this is / is not

chunkstore is	chunkstore is not
Embeddable CAS/dedup library	Backup CLI (Restic / Borg / Kopia)
Dedup layer over FS or S3	S3 gateway or reverse proxy
Byte-identical dedup (SHA-256)	Perceptual / similarity dedup
Manifest + refcount + GC	Distributed multi-node store
Shared format across Python / Go / Rust	Encryption or compliance product

Current limits: the store takes a single-process lock, and refcounts and metadata live on the backend. That is fine for one app node, but concurrent multi-writer clusters need external coordination.

Compared to alternatives

vs raw blob storage (S3 / local FS)

	Raw S3/FS	chunkstore
Dedup	None	SHA-256 chunk-level
Integration	Native SDK	Library in your app
Versioning	Your DB/metadata	`file_id` → manifest
Best for	Unique blobs, media	Repeated bytes across files

vs backup tools (Restic, Borg, Kopia)

	Backup tool	chunkstore
Model	`init` → `backup` → `restore`	`ingest` / `read` / `delete` in-process
Encryption / retention	Built-in	Not included
Dedup + CDC	Yes (Restic uses Rabin CDC)	Yes — similar ideas, different format
Fit	Ops / DR	Application embed

Restic (~35k GitHub stars) solves backup. chunkstore solves "my upload service stores too many duplicate bytes" — complementary, not competing.

vs object stores (SeaweedFS, MinIO)

	Object store	chunkstore
API	S3 HTTP	In-process library
Scope	Cluster, buckets, ACLs	Single-store dedup layer
When to use both	—	chunkstore under your app; object store as backend via S3 wrapper

How it works

Architecture flow

file → chunks → SHA-256 (64-char hex) → backend
                ↓
         manifest (file_id → [digests])
         refcount (digest → count) → GC when count == 0

Layer	Role
`core/`	Rust: chunking, hashing, manifests, refcount, C-API
`python/`	PyO3/maturin wrapper + FS/S3 backends
`go/`	cgo wrapper + FS backend

Metadata keys (shared across languages):

Key	Content
`_manifest/{file_id}`	JSON: ordered digests + `file_bytes`
`_refcount/{digest}`	JSON: reference count
`{digest}` (64 hex)	Raw chunk bytes
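
To make the key layout concrete, here is a rough sketch that writes the three kinds of entries into a temporary directory. Only the `_manifest/` and `_refcount/` prefixes, the 64-hex blob names, and the `file_bytes` field come from the table above. The `digests` key, the `{"count": n}` refcount shape, and the `write_file` helper are assumptions made for illustration and may not match the real format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_file(root: Path, file_id: str, chunks: list[bytes]) -> None:
    """Store chunks, bump refcounts, and write a manifest (hypothetical shapes)."""
    (root / "_manifest").mkdir(parents=True, exist_ok=True)
    (root / "_refcount").mkdir(parents=True, exist_ok=True)
    digests = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        digests.append(digest)
        blob = root / digest
        if not blob.exists():
            blob.write_bytes(chunk)  # raw chunk bytes under the 64-hex key
        rc_path = root / "_refcount" / digest
        count = json.loads(rc_path.read_text())["count"] if rc_path.exists() else 0
        rc_path.write_text(json.dumps({"count": count + 1}))
    manifest = {"digests": digests, "file_bytes": sum(len(c) for c in chunks)}
    (root / "_manifest" / file_id).write_text(json.dumps(manifest))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_file(root, "doc_v1", [b"header", b"body-1"])
    write_file(root, "doc_v2", [b"header", b"body-2"])

    header = hashlib.sha256(b"header").hexdigest()
    header_refs = json.loads((root / "_refcount" / header).read_text())["count"]
    blob_count = sum(1 for p in root.iterdir() if p.is_file())
    manifest_v2 = json.loads((root / "_manifest" / "doc_v2").read_text())
```

The shared `header` chunk ends up as one blob with a reference count of two, while each manifest still lists its own ordered digests.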

When dedup works — and when it doesn't

Works well

Workload	Typical savings	Notes
Pool 200×4 MiB chunks, 1000 files	~90%	Benchmark scenario
Full duplicate file (2 copies)	~50%	Second copy reuses all chunks
Partial overlap (shared prefix)	30–45%	Only tail chunks are new
Document versions (90/10)	~40%	20 MiB files
Prefix insert 1 byte + CDC	~45%	4/5 CDC chunks reused
Re-upload of identical file	~100%	All chunks exist
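
Two of the rows above can be reproduced by counting chunks. The sketch assumes fixed 4 MiB chunks, 20 MiB files (five chunks each), exactly two versions, and an edit confined to the last chunk. The chunk labels and the `savings_pct` helper are hypothetical.

```python
def savings_pct(files: list[list[str]], chunk_mib: int = 4) -> float:
    """Percent of logical bytes saved when each unique chunk is stored once."""
    logical = sum(len(chunks) for chunks in files) * chunk_mib
    unique = len({c for chunks in files for c in chunks}) * chunk_mib
    return round(100 * (1 - unique / logical), 1)


# A 20 MiB file as five 4 MiB chunks, identified by placeholder labels.
v1 = ["c0", "c1", "c2", "c3", "c4"]

# Full duplicate: the second copy adds no new chunks.
duplicate = savings_pct([v1, v1])

# Two versions where only the last chunk changed.
v2 = ["c0", "c1", "c2", "c3", "c4-edited"]
versions = savings_pct([v1, v2])

print(duplicate, versions)
```

With more than two versions, or edits that touch several chunks, the percentages move accordingly, so the table values are best read as typical rather than guaranteed.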

Poor or zero savings

Workload	Typical savings	Why
Prefix insert 1 byte + fixed 4 MiB	~0%	All boundaries shift
Random unique binaries	~0%	No overlap
Unique photos / video	~0%	High entropy
JPEG, MP4, ZIP	~0% cross-file	Already compressed

Figure: fixed vs CDC chunking on prefix insert.

Chunking: fixed vs CDC

	Fixed (default 4 MiB)	CDC (FastCDC v2020)
Speed	Fastest	Moderate
Chunk sizes	Uniform	min 256 KiB, avg 4 MiB, max 8 MiB
Best for	Bulk uploads	Versioned docs, edits at start
Prefix insert (+1 byte)	~0% savings	~45% savings
API	`ingest` / `Ingest`	`ingest_cdc` / `IngestCDC`
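
The prefix-insert row is easier to see with a small experiment. The sketch below compares fixed-size chunking with a toy content-defined chunker that cuts wherever a CRC32 of the last 16 bytes hits a bit pattern. This is deliberately simpler than FastCDC (no rolling gear hash, no min/max sizes) and uses tiny chunk sizes; all function names are hypothetical.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, avg: int = 1024) -> list[bytes]:
    """Cut where a hash of the trailing window matches; avg must be a power of two."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        # The decision depends only on the last `window` bytes, so cut
        # points move with the content when bytes are inserted earlier.
        if zlib.crc32(data[end - window:end]) & (avg - 1) == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reused(original: list[bytes], edited: list[bytes]) -> int:
    """Count original chunks whose digest also appears in the edited file."""
    seen = {hashlib.sha256(c).digest() for c in edited}
    return sum(hashlib.sha256(c).digest() in seen for c in original)


data = random.Random(0).randbytes(64 * 1024)
edited = b"\x00" + data  # insert one byte at the start

fixed_reuse = reused(fixed_chunks(data, 1024), fixed_chunks(edited, 1024))
cdc_original = cdc_chunks(data)
cdc_reuse = reused(cdc_original, cdc_chunks(edited))

print(f"fixed: {fixed_reuse} chunks reused")
print(f"cdc:   {cdc_reuse} of {len(cdc_original)} chunks reused")
```

With fixed chunking the inserted byte shifts every boundary, so no chunk matches. With content-defined cuts only the first chunk changes, and every later chunk keeps its digest.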

Multi-language

Language	Open	Upload	Read	Delete
Rust	`ChunkStore::open(FsBackend::new(path)?)`	`ingest`	`read`	`delete`
Python	`ChunkStore.open(FilesystemBackend(path))`	`ingest`	`read`	`delete`
Go	`chunkstore.OpenFilesystem(path)`	`Ingest`	`Read`	`Delete`

CI verifies: Python write → Go read/delete → Python stats (pytest -m cross_lang).

Installation

Language	Requirements	Command
Rust	Rust 1.70+	`cargo build --release -p chunkstore-core`
Python	Python 3.10+, Rust (maturin)	`cd python && maturin develop --release`
Go	Go 1.22+, built `libchunkstore.a`	See `go/README.md`
Dev	pytest, go toolchain	`pip install ".[dev]"` in `python/`

# Full local verify (from repo root)
CARGO_TARGET_DIR=target cargo build --release -p chunkstore-core
cd python && maturin develop --release && pytest -q
cd ../go/chunkstore && go test -v

Examples

Example	Description
`examples/fastapi/`	Upload / download / delete HTTP API
`examples/go-http/`	Go HTTP service (planned)

cd python && maturin develop --release && pip install ".[fastapi]"
PYTHONPATH=../examples/fastapi uvicorn main:app --reload

Endpoints: POST /files/{id}, GET /files/{id}, DELETE /files/{id}, GET /stats.

Benchmarks

Reproducible workloads via workload_analysis — run locally:

cargo run -p chunkstore-core --example workload_analysis --release
cargo bench -p chunkstore-core

Workload	Savings
Dedup pool 200×4 MiB, 1000 files	90.0%
Versions 90/10	40.0%
Prefix insert + CDC	45.1%
Full duplicate (2 copies)	50.0%

The three document-style workloads above (versions, prefix insert with CDC, full duplicate) average about 45% savings. The shared chunk-pool scenario reaches 90%, roughly twice that.

xychart-beta
    title "Dedup savings % (document workloads)"
    x-axis ["Pool 1k", "CDC +1B", "Versions", "Dup 2x"]
    y-axis "Savings %" 0 --> 100
    bar [90, 45, 40, 50]

Contributing

Want to fix a bug, add a feature, or improve bindings? See CONTRIBUTING.md for:

local setup and full CI commands
repository layout and where to put changes
test requirements (including cross-language)
Rust / Python / Go style rules
on-disk format stability rules

PyPI releases: docs/PYPI.md

Roadmap: docs/ROADMAP.md

License

MIT — see LICENSE.

Project details

Maintainers

MuratovER

Release history

This version

0.1.0

Jun 23, 2026

Download files

Source Distribution

chunkstore-0.1.0.tar.gz (38.7 kB)

Uploaded Jun 23, 2026 Source

Built Distributions

chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (453.4 kB)

Uploaded Jun 23, 2026, CPython 3.10+, manylinux: glibc 2.17+ x86-64

chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (431.9 kB)

Uploaded Jun 23, 2026, CPython 3.10+, manylinux: glibc 2.17+ ARM64

File details

Details for the file chunkstore-0.1.0.tar.gz.

File metadata

Download URL: chunkstore-0.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 38.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for chunkstore-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e6843dba4e73a8e12d65554f1363146f223c149cfb23a115b75ab1e0871f4026`
MD5	`12fb4125fc5e19d7b662743f7ad168a7`
BLAKE2b-256	`1399515281ba997623cce9c8e706c47af38cb3e34e190b4831aafac603942220`

Provenance

The following attestation bundles were made for chunkstore-0.1.0.tar.gz:

Publisher: pypi.yml on MuratovER/chunkstore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chunkstore-0.1.0.tar.gz
- Subject digest: e6843dba4e73a8e12d65554f1363146f223c149cfb23a115b75ab1e0871f4026
- Sigstore transparency entry: 1931515099
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: MuratovER/chunkstore@dad569c4af630fc4c332440c1d915ec86af7c99a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/MuratovER
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@dad569c4af630fc4c332440c1d915ec86af7c99a
- Trigger Event: workflow_dispatch

File details

Details for the file chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jun 23, 2026
Size: 453.4 kB
Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`8d55c28386a9b5020da87812b69aeda19bede3dcbb44f0f735cedd76079a689c`
MD5	`82b5bf4ee286ebbeb92a9d7c43a51a53`
BLAKE2b-256	`6b2f76a9f6984c97b32727d1e0919ca56f3d1997086b439a4316536f3c2d0904`

Provenance

The following attestation bundles were made for chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: pypi.yml on MuratovER/chunkstore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chunkstore-0.1.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: 8d55c28386a9b5020da87812b69aeda19bede3dcbb44f0f735cedd76079a689c
- Sigstore transparency entry: 1931515392
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: MuratovER/chunkstore@dad569c4af630fc4c332440c1d915ec86af7c99a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/MuratovER
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@dad569c4af630fc4c332440c1d915ec86af7c99a
- Trigger Event: workflow_dispatch

File details

Details for the file chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jun 23, 2026
Size: 431.9 kB
Tags: CPython 3.10+, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`90c7aed8803e0aacf33364af549878cd13cc910e2d1bb7d99ba7961eec1d12bc`
MD5	`497856a3dbdb1f5f70fbf2bab49fcb90`
BLAKE2b-256	`71a0847167bc357ae41eaaebf0fb88f4de5086e92fe9111898620b71c000234d`

Provenance

The following attestation bundles were made for chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: pypi.yml on MuratovER/chunkstore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chunkstore-0.1.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Subject digest: 90c7aed8803e0aacf33364af549878cd13cc910e2d1bb7d99ba7961eec1d12bc
- Sigstore transparency entry: 1931515258
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: MuratovER/chunkstore@dad569c4af630fc4c332440c1d915ec86af7c99a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/MuratovER
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yml@dad569c4af630fc4c332440c1d915ec86af7c99a
- Trigger Event: workflow_dispatch
