Extremely Lightweight Lightning-Fast In-Memory Database for Python
SnapDB
A zero-dependency, pure-Python embedded database for single-writer, local Python applications. Columnar analytics engine, row store, memory-mapped files, lightweight column compression, and precompiled struct codecs — built for maximum speed at minimum memory within a minimal-dependency footprint.
Niche: compact test fixtures, small operational datasets, embedded Python tools, and NumPy-friendly analytical helpers where pulling in SQLite or a heavy binary extension is undesirable. SnapDB is intentionally not a SQLite or DuckDB replacement.
pip install pysnapdb
Contents
- Key Innovations
- Installation
- Quick Start
- Storage Modes
- Dictionary Encoding · Delta Encoding
- Vectorized Filtering · Auto-Indexing · NumPy / Zero-Copy Export
- Benchmarks
- Architecture · Supported Types
- Development
- Roadmap & Known Limitations
- License
Key Innovations
- Columnar engine: column-oriented, per-column `array.array` storage; full-scan aggregation ~27× faster than SQLite with NumPy installed (~3× in pure Python) at a fraction of the memory
- NumPy-accelerated aggregates (optional, v0.8.0): when NumPy is installed, `aggregate()` runs over the zero-copy column buffer (~530M rows/s, on par with pandas); pure-Python remains the zero-dependency default
- NumPy-accelerated filters (optional, v0.9.0): `select_where()` builds masks vectorially; `count_where()` (filtered count, no row materialization) hits ~314M rows/s on numeric predicates (~166× the pure-Python path)
- Lowest memory footprint in the comparison: ~2.2 MB / 100K rows vs SQLite 2.9 MB, pandas 11 MB, plain `dict` 22 MB (benchmarks)
- Vectorized multi-condition filters (v0.6.0): `select_where()` combines per-column bitmasks with C-speed big-integer `AND`/`OR` (~2× faster selective `WHERE`)
- O(1) delta-encoded reads (v0.6.0): a lazy reconstruction cache turns delta scans from O(n²) into O(n) (orders of magnitude faster)
- Auto-indexing (v0.6.0): `auto_index=True` builds a hash index for a column once it's queried often enough
- Zero-copy NumPy export (v0.6.0): `to_numpy()` / `buffer()` (PEP 688) share raw column memory with NumPy without copying
- Dictionary encoding (v0.4.0): transparent per-column dictionary for low-cardinality strings, ~3× memory reduction
- Delta encoding (v0.5.0): base + deltas for monotonic columns (timestamps, IDs)
- Bit-packed booleans: Python `int` bitmask, ~8× smaller than `array('b')`
- Hash index: `create_index()` / `lookup()` / `find()`, kept in sync on every insert / update / delete
- Range index: `create_range_index()` / `range_find()` for ordered numeric windows without adding a B-tree dependency
- Durability safeguards: row-store transactions log row data to a replayed WAL; committed transactions recover after abrupt process exit
- Operational safety: per-file advisory locks, explicit `backup()` and `compact()`, optional at-rest encryption, CDC stream, Prometheus-style metrics
- Zero dependencies: stdlib only (NumPy is optional, used for zero-copy export and the accelerated aggregate/filter paths)
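To make the bit-packed boolean claim concrete, here is a minimal sketch of a boolean column stored in a single Python `int`. The class `BitPackedBools` and its methods are hypothetical names for illustration, not SnapDB's internal API; it only shows why one bit per row is roughly 8× smaller than one byte per row.

```python
class BitPackedBools:
    """Hypothetical boolean column backed by one arbitrary-precision int."""

    def __init__(self):
        self._bits = 0   # bit i holds the value of row i
        self._len = 0    # number of rows stored

    def append(self, value: bool) -> None:
        if value:
            self._bits |= 1 << self._len
        self._len += 1

    def __getitem__(self, i: int) -> bool:
        if not 0 <= i < self._len:
            raise IndexError(i)
        return bool((self._bits >> i) & 1)

    def __len__(self) -> int:
        return self._len

    def count_true(self) -> int:
        # Population count over the whole column in one pass.
        return bin(self._bits).count("1")


flags = BitPackedBools()
for i in range(1000):
    flags.append(i % 2 == 0)

packed_bytes = (len(flags) + 7) // 8   # payload bits rounded up to whole bytes
byte_per_row = len(flags)              # what array('b') would need: 1 byte per row
```

For 1,000 rows the payload is 125 bytes instead of 1,000, ignoring the fixed object overhead of the `int` itself.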
Installation
pip install pysnapdb
The PyPI distribution is named pysnapdb (snapdb was already taken), but the
import name is unchanged: import snapdb.
Or from source:
git clone https://github.com/hussain-alsaibai/snapdb.git
cd snapdb
pip install -e .
Quick Start
from snapdb import SnapDB, Schema, ColumnDef
# Define schema
schema = Schema([
ColumnDef("id", "i32"),
ColumnDef("email", "bytes:32"),
ColumnDef("score", "f32"),
ColumnDef("active", "bool"),
])
# Create database (columnar mode for analytics)
db = SnapDB("data.snap", schema, storage_type="columnar")
# Insert
db.insert({"id": 1, "email": "alice@test.com", "score": 100.0, "active": True})
# Fast columnar aggregate (~59M rows/sec full scan)
total = db.aggregate("score", "sum")
# Vectorized multi-condition filter (v0.6.0)
hot = db.select_where([("score", ">", 90.0), ("active", "==", True)])
# Create index for O(1) lookups
db.create_index("id")
result = db.lookup("id", 1)
# Batch insert for speed
db.batch_insert([
{"id": i, "email": f"user_{i}@test.com", "score": i * 10.0, "active": i % 2 == 0}
for i in range(1000)
])
# Prometheus-style metrics: close the first handle before reopening the same file
from snapdb import Metrics
db.close()
db = SnapDB("data.snap", schema, metrics=Metrics())
Storage Modes
| Mode | Best For | Strengths |
|---|---|---|
| `storage_type="columnar"` | OLAP / analytics | Fast full-scan aggregation (~59M rows/s), vectorized filters, column compression, lowest memory, snapshot persistence on close/backup |
| `storage_type="row"` | OLTP / full-row point access | Zero-copy `get_raw()`, replayed WAL transactions, hash indexes, CDC, compaction |
Data Safety APIs
schema = Schema([
ColumnDef("id", "i32", primary_key=True), # primary_key => unique + not_null
ColumnDef("email", "bytes:64", unique=True),
ColumnDef("score", "f32"),
])
db = SnapDB("users.snap", schema, encryption_key="optional-secret")
with db.transaction():
    db.insert({"id": 1, "email": "alice@test.com", "score": 100.0})
# Consistent hot backup: flushes/snapshots before copying.
db.backup("users.backup.snap")
# Reclaim deleted row space in the row store.
reclaimed_bytes = db.compact()
# Lightweight integrity report and best-effort metadata repair.
report = db.fsck()
repair_report = db.repair()
- Missing required columns and `None` values fail before encoding.
- `unique=True` and `primary_key=True` are enforced on insert, batch insert, update, reopen, and WAL recovery.
- A database file cannot be opened twice for writing in one process, and supported platforms use an advisory sidecar lock to reject concurrent writer processes.
- `encryption_key` encrypts row payloads, columnar snapshots, and WAL records at rest. It is intended to prevent casual raw-file secret recovery for embedded deployments; it is not a replacement for OS key management or a full RBAC/auth system.
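The WAL recovery behavior described above (committed transactions survive, uncommitted ones are discarded) can be sketched in a few lines. The record layout `(op, txid, ...)` and the `replay()` function are assumptions made for illustration; SnapDB's actual on-disk WAL format is not documented here.

```python
def replay(wal_records):
    """Apply only the writes of transactions that reached a commit record."""
    committed = {txid for op, txid, *_ in wal_records if op == "commit"}
    rows = {}
    for op, txid, *payload in wal_records:
        if txid not in committed:
            continue  # transaction never committed: ignore its writes
        if op == "insert":
            key, row = payload
            rows[key] = row
        elif op == "delete":
            (key,) = payload
            rows.pop(key, None)
    return rows


# Hypothetical log: transaction 1 committed, transaction 2 was cut off.
wal = [
    ("begin", 1),
    ("insert", 1, 1, {"id": 1, "email": "alice@test.com"}),
    ("commit", 1),
    ("begin", 2),
    ("insert", 2, 2, {"id": 2, "email": "bob@test.com"}),
    # process exits here, so transaction 2 has no commit record
]
recovered = replay(wal)
```

Only row 1 is present after replay, which matches the guarantee that committed transactional writes recover after an abrupt exit.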
See Benchmarks for measured throughput and memory.
Dictionary Encoding (v0.4.0)
For columns with few unique string values (status, category, type, country), dictionary encoding reduces memory by 3×:
from snapdb import ColumnarTable
schema = [
("id", "i32"),
("status", "bytes:20"), # "active", "inactive", "pending" — 3 unique
("category", "bytes:20"), # e.g. "electronics", "books", "clothing" (5 unique values)
("score", "f32"),
]
# Enable dict encoding on low-cardinality columns
db = ColumnarTable("products", schema, dict_columns=["status", "category"])
| Metric | Raw | Dict-Encoded | Improvement |
|---|---|---|---|
| Memory (100K rows) | 4.0 MB | 1.34 MB | 3.0× reduction |
| Insert | 0.137s | 0.159s | ~15% overhead (acceptable) |
| Data integrity | — | ✅ 100% | Verified |
- Transparent: insert/query work with raw strings
- Auto-fallback: switches to raw when unique count > threshold (default 256)
- Per-column: specify which columns to encode via `dict_columns=[]`
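The three properties above (transparent reads, threshold fallback, per-column state) can be illustrated with a small pure-Python sketch. `DictColumn` and its attributes are hypothetical names, and storing codes in an `array("H")` is an assumption; this is not SnapDB's implementation.

```python
from array import array


class DictColumn:
    """Hypothetical dictionary-encoded column for low-cardinality byte strings."""

    def __init__(self, threshold=256):
        self.threshold = threshold
        self.values = []           # code -> original value
        self.codes_by_value = {}   # original value -> code
        self.codes = array("H")    # one small unsigned code per row
        self.raw = None            # filled only after falling back to raw storage

    def append(self, value: bytes) -> None:
        if self.raw is not None:
            self.raw.append(value)
            return
        code = self.codes_by_value.get(value)
        if code is None:
            if len(self.values) >= self.threshold:
                # Unique count would exceed the threshold: decode and go raw.
                self.raw = [self.values[c] for c in self.codes]
                self.raw.append(value)
                return
            code = len(self.values)
            self.values.append(value)
            self.codes_by_value[value] = code
        self.codes.append(code)

    def __getitem__(self, i: int) -> bytes:
        # Callers always get the original string back, encoded or not.
        if self.raw is not None:
            return self.raw[i]
        return self.values[self.codes[i]]


status = DictColumn()
for s in [b"active", b"inactive", b"pending", b"active"]:
    status.append(s)

tiny = DictColumn(threshold=2)   # tiny threshold to demonstrate the fallback
for s in [b"a", b"b", b"c"]:
    tiny.append(s)
```

Each repeated string costs one small integer code instead of a fixed-width byte field, which is where the memory reduction comes from.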
Delta Encoding (v0.5.0)
For monotonic columns (timestamps, auto-increment IDs, sequences), delta encoding reduces memory by storing differences instead of full values:
from snapdb import ColumnarTable
schema = [
("id", "i32"),
("timestamp", "i64"), # Monotonic timestamps → delta-encoded
("seq", "u32"), # Auto-increment IDs → delta-encoded
("value", "f32"),
]
# Enable delta encoding on monotonic columns
db = ColumnarTable("events", schema, delta_columns=["timestamp", "seq"])
| Metric | Raw | Delta-Encoded | Improvement |
|---|---|---|---|
| Memory (100K rows) | 2.29 MB | 1.91 MB | 1.2× reduction |
| Insert | 0.128s | 0.148s | ~16% overhead |
| Data integrity | — | ✅ 100% | Verified |
- Auto-detects: samples first 50 rows for monotonicity
- Auto-fallback: switches to raw if non-monotonic data detected
- Per-column: specify which columns via `delta_columns=[]`
- Auto-upgrade: dynamically upgrades delta typecode if deltas overflow
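As a rough illustration of base-plus-deltas storage with a lazily rebuilt cache, consider the sketch below. `DeltaColumn` is a hypothetical class; it assumes every delta fits a signed C `int` (`array("i")`) and omits the monotonicity sampling, raw fallback, and typecode upgrade described above.

```python
from array import array
from itertools import accumulate


class DeltaColumn:
    """Hypothetical delta-encoded integer column with a lazy read cache."""

    def __init__(self):
        self.base = None
        self.deltas = array("i")   # assumption: differences fit a C int
        self._last = None
        self._cache = None         # absolute values, rebuilt on demand

    def append(self, value: int) -> None:
        if self.base is None:
            self.base = value
        else:
            self.deltas.append(value - self._last)
        self._last = value
        self._cache = None         # any write invalidates the cache

    def _materialize(self):
        if self._cache is None and self.base is not None:
            # One O(n) pass: running sum starting from the base value.
            self._cache = list(accumulate(self.deltas, initial=self.base))
        return self._cache or []

    def __getitem__(self, i: int) -> int:
        return self._materialize()[i]

    def __len__(self) -> int:
        return 0 if self.base is None else len(self.deltas) + 1


ts = DeltaColumn()
for v in [1000, 1005, 1007, 1020]:
    ts.append(v)
decoded = [ts[i] for i in range(len(ts))]
```

Without the cache, each indexed read would re-sum the deltas from the base, so a full scan would cost O(n²); with it, the first read pays O(n) once and later reads are O(1).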
Frame-of-Reference Encoding (v0.7.0)
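For numeric columns with bounded ranges (ages 0-120, scores 0-100, ratings 1-5), Frame-of-Reference (FOR) stores the minimum value once, then bit-packs each value's offset from that minimum into the fewest bits that cover the range. For 32-bit columns this gives roughly 4–8× memory reduction depending on the range (3.9–4.5× measured in the table below):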
from snapdb import ColumnarTable
schema = [
("user_id", "i32"),
("age", "i32"), # Ages 18-65 → 6 bits per value
("rating", "i32"), # Ratings 1-5 → 3 bits per value
("score", "i32"), # Scores 0-100 → 7 bits per value
]
# Enable FOR encoding on bounded numeric columns
db = ColumnarTable("survey", schema, for_columns=["age", "rating", "score"])
| Metric | Raw | FOR-Encoded | Improvement |
|---|---|---|---|
| Memory (100K rows, range 0-100) | 400 KB | ~88 KB | 4.5× reduction |
| Memory (100K rows, range 0-120) | 400 KB | ~103 KB | 3.9× reduction |
| Insert overhead | — | ~10% | Sampling cost |
| Data integrity | — | ✅ 100% | Verified |
- Auto-detects: samples first N rows (default 50) to measure range
- Auto-fallback: switches to raw if range exceeds 16 bits (saves <50%)
- Per-column: specify which columns via `for_columns=[]`
- Bit-packed: Python `int` bitmask (same technique as v0.3.2 booleans)
- Transparent: reads return full values, no API changes
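The bit widths quoted in the schema comments (6, 3, and 7 bits) follow from the value range. Below is a minimal sketch of Frame-of-Reference packing into a single Python `int`; `for_encode` and `for_get` are hypothetical helpers, and building the integer with repeated `|=` is chosen for clarity, not speed.

```python
def for_encode(values):
    """Return (minimum, bits per value, packed int) for a list of ints."""
    lo, hi = min(values), max(values)
    width = max(1, (hi - lo).bit_length())   # bits needed for the largest offset
    packed = 0
    for i, v in enumerate(values):
        packed |= (v - lo) << (i * width)    # store the offset from the minimum
    return lo, width, packed


def for_get(lo, width, packed, i):
    """Read value i back: extract its offset and add the minimum."""
    mask = (1 << width) - 1
    return lo + ((packed >> (i * width)) & mask)


ages = [18, 42, 65, 30]
lo, width, packed = for_encode(ages)
decoded = [for_get(lo, width, packed, i) for i in range(len(ages))]

rating_bits = for_encode([1, 5, 3])[1]   # range 1-5
score_bits = for_encode([0, 100])[1]     # range 0-100
```

At 7 bits per value, 100,000 scores need about 87.5 KB of payload versus 400 KB as 32-bit integers, which is consistent with the ~88 KB row in the table.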
Vectorized Filtering (v0.6.0, NumPy-accelerated in v0.9.0)
select_where() evaluates each condition column-at-a-time into a mask and
combines them with AND/OR. With NumPy installed the masks are built
vectorially over the column buffers (pure-Python big-integer masks otherwise).
For filtered counts, count_where() skips row materialization entirely and runs
at ~314M rows/s on numeric predicates (~166× the pure-Python path).
db = SnapDB("events.snap", schema, storage_type="columnar")
# (column, op, value) triples — op ∈ eq/ne/gt/gte/lt/lte/in/between
rows = db.select_where(
[("age", ">", 30), ("status", "==", b"active")],
columns=["id", "age"], limit=100,
)
# OR semantics, ranges and membership
db.select_where([("age", "<", 18), ("age", ">", 65)], combine="or")
db.select_where([("age", "between", (30, 40)), ("country", "in", [b"US", b"CA"])])
# dict shorthand
db.select_where({"status": b"active", "age": {"gte": 21}})
# fast filtered count — no rows materialized (NumPy-accelerated)
db.count_where([("age", ">", 30), ("temp", "<", 35.0)])
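The pure-Python path described above (one mask per condition, combined with big-integer `AND`/`OR`) can be sketched as follows. `column_mask` and `matching_rows` are hypothetical helpers operating on plain lists, not SnapDB functions; only three operators are included to keep the example short.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def column_mask(column, op, value):
    """Build an int whose bit i is set when row i satisfies the condition."""
    test = OPS[op]
    mask = 0
    for i, cell in enumerate(column):
        if test(cell, value):
            mask |= 1 << i
    return mask


def matching_rows(mask):
    """Yield the row numbers of the set bits, lowest first."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1


age = [25, 41, 17, 70, 33]
status = [b"active", b"active", b"inactive", b"active", b"inactive"]

# AND: both conditions must hold. OR: either condition may hold.
both = column_mask(age, ">", 30) & column_mask(status, "==", b"active")
either = column_mask(age, "<", 18) | column_mask(age, ">", 65)

filtered_count = bin(both).count("1")   # count without materializing rows
```

Combining masks is a single integer operation regardless of row count, and a filtered count only needs a population count, which is the idea behind a `count_where()`-style call.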
Batch Updates, Grouping, and Joins
# Update many rows without hand-written per-row loops.
db.batch_update(lambda row: row["score"] < 50, {"active": False})
# Small grouped aggregates.
totals = db.group_by("country", "score", "sum")
# In-memory equi-join between two SnapDB instances.
pairs = users.join(departments, "dept_id", "id")
# Ordered row-store windows without a heavyweight query planner.
db.create_range_index("score")
top_band = db.range_find("score", 90.0, 100.0)
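A sorted range index of the kind used for `range_find()` can be built from the stdlib `bisect` module alone. The `RangeIndex` class below is a hypothetical sketch; in particular, treating both bounds as inclusive is an assumption, since the bound semantics of SnapDB's `range_find()` are not stated here.

```python
import bisect


class RangeIndex:
    """Hypothetical sorted index of (value, row_id) pairs."""

    def __init__(self):
        self._entries = []   # kept sorted by value, then row_id

    def add(self, value, row_id):
        bisect.insort(self._entries, (value, row_id))

    def remove(self, value, row_id):
        i = bisect.bisect_left(self._entries, (value, row_id))
        if i < len(self._entries) and self._entries[i] == (value, row_id):
            del self._entries[i]

    def range_find(self, low, high):
        """Row ids whose value lies in the closed interval [low, high]."""
        # (low,) sorts before every (low, row_id), so low itself is included.
        start = bisect.bisect_left(self._entries, (low,))
        # (high, inf) sorts after every (high, row_id), so high is included.
        stop = bisect.bisect_right(self._entries, (high, float("inf")))
        return [row_id for _, row_id in self._entries[start:stop]]


idx = RangeIndex()
for row_id, score in enumerate([95.0, 42.0, 90.0, 100.0, 89.9]):
    idx.add(score, row_id)

top_band = idx.range_find(90.0, 100.0)
idx.remove(95.0, 0)
after_delete = idx.range_find(90.0, 100.0)
```

Lookups are two binary searches plus a slice, so no tree structure or extra dependency is needed; the trade-off is O(n) insertion into the sorted list.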
Auto-Indexing (v0.6.0)
Let SnapDB index the columns you actually query, so you never forget a
create_index() for a hot path:
db = SnapDB("users.snap", schema, auto_index=True, auto_index_threshold=8)
# after the 8th equality query on a column, a hash index is built automatically
for uid in stream:
    db.find(email=uid) # transparently O(1) once the index materializes
find() also works without any index (scan fallback), so correctness never
depends on remembering to index.
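The mechanism can be pictured as a per-column query counter that triggers an index build once a threshold is reached, with a scan as the fallback. `AutoIndexedTable` below is a hypothetical in-memory sketch over a list of dicts, not SnapDB's code.

```python
from collections import defaultdict


class AutoIndexedTable:
    """Hypothetical table that indexes a column after enough equality queries."""

    def __init__(self, rows, threshold=8):
        self.rows = rows
        self.threshold = threshold
        self.query_counts = defaultdict(int)
        self.indexes = {}   # column -> {value: [row positions]}

    def find(self, column, value):
        index = self.indexes.get(column)
        if index is not None:
            # Indexed path: one hash lookup.
            return [self.rows[i] for i in index.get(value, [])]
        self.query_counts[column] += 1
        if self.query_counts[column] >= self.threshold:
            self._build_index(column)   # later queries take the indexed path
        # Scan fallback: always correct, just slower.
        return [r for r in self.rows if r[column] == value]

    def _build_index(self, column):
        index = defaultdict(list)
        for i, row in enumerate(self.rows):
            index[row[column]].append(i)
        self.indexes[column] = dict(index)


rows = [{"id": i, "email": f"user_{i}@test.com"} for i in range(100)]
table = AutoIndexedTable(rows, threshold=8)

for _ in range(7):
    table.find("email", "user_5@test.com")
indexed_after_7 = "email" in table.indexes

table.find("email", "user_5@test.com")           # 8th query builds the index
indexed_after_8 = "email" in table.indexes
hit = table.find("email", "user_5@test.com")     # served from the index
```

Both paths return the same rows, which is why results do not depend on whether the index exists yet.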
NumPy / Zero-Copy Export (v0.6.0)
Hand raw column memory to NumPy without copying (PEP 688 buffer protocol). NumPy is an optional dependency — only needed if you call these methods.
col = db.to_numpy("temperature") # safe copy (works for any column)
view = db.to_numpy("temperature", zero_copy=True) # shares memory, no copy
mv = db.column_buffer("temperature") # raw memoryview for advanced use
Plain numeric columns export a true zero-copy view; encoded columns (dictionary/delta) transparently fall back to a materialized copy.
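The difference between a zero-copy view and a materialized copy can be shown with the standard library alone. This sketch uses `array.array` and `memoryview` as stand-ins for a column and its exported buffer; with NumPy installed, `np.frombuffer(temperature, dtype=np.float32)` would wrap the same memory in the same way. The variable names are illustrative only.

```python
from array import array

temperature = array("f", [20.5, 21.0, 19.5])

view = memoryview(temperature)    # shares the array's memory, no copy
copy = array("f", temperature)    # independent, materialized copy

temperature[0] = 99.0             # in-place write to the original column

# While a buffer is exported, the stdlib refuses to resize the array,
# because growing it could move the memory the view points at.
resize_blocked = False
try:
    temperature.append(22.0)
except BufferError:
    resize_blocked = True

view_first = view[0]              # the view observes the in-place write
copy_first = copy[0]              # the copy keeps the old value

view.release()                    # once released, the array can grow again
temperature.append(22.0)
```

A zero-copy view therefore trades safety for speed: it sees later in-place writes, whereas a copy is a stable snapshot.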
Benchmarks
SnapDB's headline strength is memory efficiency — the columnar store is the lightest engine in this comparison while staying fully analytical:
~5× lighter than pandas and ~10× lighter than a plain dict — with zero dependencies.
Reproduce locally (numbers below are from the environment noted in the table):
python benchmarks/bench_suite.py --rows 100000 --markdown bench.md
100,000 rows · 50,000 point reads · best of 5 · Python 3.13 · win32 (NumPy installed → accelerated aggregate). Higher is better except Memory (lower is better).
| Workload | Unit | SnapDB (columnar) | SnapDB (row) | sqlite3 (:memory:) | pandas | dict (baseline) |
|---|---|---|---|---|---|---|
| Bulk insert | rows/s | 467,309 | 287,230 | 770,788 | 794,461 | 11,139,083 |
| Point read (PK) | ops/s | 86,243 | 87,836 | 370,698 | 32,296 | 5,494,807 |
| Full scan + SUM | rows/s | 529,660,985 | 483,067 | 19,910,403 | 513,874,544 | 19,488,619 |
| 3-cond filter | rows/s | 2,259,928 | 470,223 | 11,842,168 | 19,827,894 | 13,811,773 |
| Memory footprint | MB | 2.2 | n/a | 2.9 | 11.0 | 22.5 |
Where SnapDB wins (honestly):
- Memory: the columnar store is the lightest here, ~5× smaller than pandas and ~10× smaller than a plain `dict`, with zero dependencies.
- Full-scan aggregation: on par with pandas (~530M rows/s) and ~27× faster than in-memory SQLite. With NumPy installed, `aggregate()` runs over the zero-copy column buffer (issue #14); without NumPy the pure-Python path still does ~58M rows/s (~3× SQLite).
- Embeddable: a single mmap-backed file, no server, no C extensions.
Where it doesn't (also honestly): SQLite still wins on ACID semantics, SQL coverage, B-tree point reads, migrations, and ecosystem integration. DuckDB still wins on analytical SQL, joins, vectorized scans, and Parquet/Arrow workloads. Both win on multi-condition filter throughput. SnapDB's value is the zero-dependency footprint, direct Python dict/row APIs, and the columnar memory efficiency — not replacing either engine.
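The headline ratios quoted in this section can be recomputed directly from the benchmark table. The snippet below only copies the table's numbers into dictionaries (the key names are arbitrary) and divides them.

```python
# Full scan + SUM throughput in rows/s, copied from the table above.
full_scan = {
    "snapdb_columnar": 529_660_985,
    "sqlite3": 19_910_403,
    "pandas": 513_874_544,
}
# Memory footprint in MB, copied from the table above.
memory_mb = {"snapdb_columnar": 2.2, "sqlite3": 2.9, "pandas": 11.0, "dict": 22.5}

scan_vs_sqlite = full_scan["snapdb_columnar"] / full_scan["sqlite3"]
scan_vs_pandas = full_scan["snapdb_columnar"] / full_scan["pandas"]
mem_vs_pandas = memory_mb["pandas"] / memory_mb["snapdb_columnar"]
mem_vs_dict = memory_mb["dict"] / memory_mb["snapdb_columnar"]

print(f"scan vs SQLite: {scan_vs_sqlite:.1f}x, vs pandas: {scan_vs_pandas:.2f}x")
print(f"memory vs pandas: {mem_vs_pandas:.1f}x, vs dict: {mem_vs_dict:.1f}x")
```

The scan ratio against SQLite comes out near 26.6, which rounds to the ~27× figure, and the memory ratios round to ~5× and ~10×.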
CI runs this suite on every push and publishes a fresh table to the workflow run summary (Actions → CI → Benchmark).
Encoding memory (100K rows)
| Encoding | Raw | Encoded | Reduction |
|---|---|---|---|
| Frame-of-Reference (bounded numeric) | 400 KB | ~88 KB | ~4.5× |
| Dictionary (low-cardinality strings) | 4.0 MB | 1.34 MB | ~3.0× |
| Delta (monotonic integers) | 2.29 MB | 1.91 MB | ~1.2× |
Architecture
SnapDB
├── core.py — Slab storage, Schema, CRUD, WAL
├── columnar.py — column-oriented analytical engine
├── metrics.py — Prometheus-style metrics collector
├── index.py — Hash + multi-column indexes
├── query.py — SQL-like query builder
├── wal.py — Write-ahead log for transactions
└── document_store.py — MongoDB-style DocumentStore API
Supported Types
| Type | Bytes | Use Case |
|---|---|---|
| `i8` / `u8` | 1 | Flags, small counters |
| `i16` / `u16` | 2 | IDs, ports |
| `i32` / `u32` | 4 | Integers, IDs |
| `i64` / `u64` | 8 | Timestamps, large IDs |
| `f32` | 4 | ML scores, prices |
| `f64` | 8 | Scientific, financial |
| `bool` | ~0.125 | Bit-packed bitmask |
| `bytes:N` | N | Strings, hashes, fixed data |
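To show how type names like these can map onto a precompiled struct codec, here is a minimal sketch using the stdlib `struct` module. The `FORMATS` mapping and `compile_codec()` are assumptions for illustration; SnapDB's real codec layout (byte order, padding, boolean handling) may differ.

```python
import struct

# Assumed mapping from the type names above to struct format codes.
FORMATS = {
    "i8": "b", "u8": "B", "i16": "h", "u16": "H",
    "i32": "i", "u32": "I", "i64": "q", "u64": "Q",
    "f32": "f", "f64": "d", "bool": "?",
}


def compile_codec(columns):
    """Build one reusable struct.Struct for a list of (name, type) pairs."""
    parts = []
    for _, type_name in columns:
        if type_name.startswith("bytes:"):
            parts.append(type_name.split(":")[1] + "s")   # fixed-width bytes
        else:
            parts.append(FORMATS[type_name])
    # "<" selects little-endian with no alignment padding.
    return struct.Struct("<" + "".join(parts))


columns = [("id", "i32"), ("email", "bytes:32"), ("score", "f32"), ("active", "bool")]
codec = compile_codec(columns)   # compiled once, reused for every row

packed = codec.pack(1, b"alice@test.com", 100.0, True)
row = codec.unpack(packed)
email = row[1].rstrip(b"\x00")   # fixed-width fields are NUL-padded
```

In this row-oriented sketch a boolean occupies a full byte; the ~0.125 bytes in the table refers to the bit-packed columnar representation.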
Development
# Install with dev + optional extras
pip install -e ".[dev,numpy]"
# Lint (same config CI uses)
ruff check .
# Unit tests
pytest tests/ -q
# Legacy script-style suites (encoding/codec checks)
python tests/test_delta_encoding.py
python tests/test_dict_encoding.py
# Benchmark suite (writes a Markdown table you can drop into the README)
python benchmarks/bench_suite.py --rows 100000 --json bench.json --markdown bench.md
Continuous integration (.github/workflows/ci.yml) runs ruff, the test matrix
on Linux (3.9–3.13) and Windows, and the benchmark on every push and PR.
Version History
- v0.13.0: Speed, lightweight, and reliability micro-pass
  - `_xor_stream` now XORs full 32-byte SHA-256 blocks with a single 256-bit integer operation instead of a 32-iteration Python byte loop, which is significantly faster for encrypted row/WAL/blob operations
  - `Schema.decode_row()` accepts `memoryview` directly without an intermediate `bytes()` copy, reducing per-row allocations on every read path
  - `Slab.iter_rows()` inlines the hot read path to avoid redundant per-row bounds and liveness checks
  - README and roadmap updated to reflect honest niche positioning per re-evaluation (single-writer embedded Python database; not a SQLite or DuckDB replacement)
- v0.12.1: Niche performance gap closing
  - Added stdlib-only sorted range indexes for row-store ordered lookups (`create_range_index()` / `range_find()`), kept in sync across insert/update/delete/compact
  - Columnar `batch_update()` and `group_by()` now use column-oriented helpers when constraints/index/CDC hooks do not require the generic row path
- v0.12.0: Production-readiness hardening
  - Row-store transactions now append row-level WAL records and replay committed transactions on open, so committed transactional writes recover after abrupt process exit
  - `close()` inside an open transaction rolls back instead of committing partial work; nested transactions now fail loudly
  - Per-instance `RLock`, same-process double-open guard, and cross-process advisory sidecar lock prevent the demonstrated write races/corruption
  - `ColumnDef(unique=True)` and `ColumnDef(primary_key=True)` enforce uniqueness; missing required columns and `None` values fail before binary encoding
  - Columnar `SnapDB(..., storage_type="columnar")` persists snapshots to the provided path; `backup()` flushes/snapshots before copying; `compact()` reclaims deleted row space; `fsck()` / `repair()` provide lightweight integrity tooling
  - `limit <= 0` returns no rows consistently; `batch_update()`, `group_by()`, and in-memory equi-`join()` added
  - Optional `encryption_key` encrypts row payloads, columnar snapshots, and WAL records at rest
  - DocumentStore JSON export/import preserves list/dict fields instead of stringifying Python reprs
- v0.11.0: NumPy-accelerated string filtering
  - `select_where()` / `count_where()` on dict-encoded string columns compare integer dict codes via NumPy for `eq`/`ne`/`in` instead of per-row string comparison: ~300×+ faster (dict `==` count ~969M rows/s); a mixed numeric+string filtered count now runs ~143× faster. Exact parity verified; ordering ops and non-dict bytes columns keep the Python path
- v0.10.0: Fast row-store bulk insert (#13)
  - `batch_insert()` now grows the backing file in a single truncate + remap for the whole batch instead of one per slab: ~26× faster (100K rows: ~5.8s → ~0.29s, now in the same ballpark as SQLite/pandas). On-disk format and durability guarantees unchanged
- v0.9.0: NumPy-accelerated filters (#14)
  - `select_where()` builds condition masks vectorially over the column buffers when NumPy is installed (~2× faster); `use_numpy=False` forces the pure-Python path
  - New `count_where()`: filtered row count with no materialization, ~314M rows/s on numeric predicates (~166×). Exact parity with the pure-Python path verified
  - Bytes/encoded conditions fall back to the Python mask; mixed queries still accelerate their numeric conditions
- v0.8.0: Optional NumPy-accelerated aggregates (#14)
  - `aggregate()` runs `sum`/`min`/`max`/`avg` over the zero-copy column buffer with NumPy when it's installed: ~13–27× faster (full-scan SUM ~530M rows/s, on par with pandas)
  - Auto-enabled when NumPy is present; `use_numpy=False` forces the pure-Python path; exact parity verified (integers exact, floats within tolerance)
  - Zero-dependency default unchanged; encoded (delta/FOR) and 64-bit-int-sum cases fall through to the exact Python path
- v0.7.0: Frame-of-Reference encoding
  - New: Frame-of-Reference (FOR) + bit packing for bounded numeric columns (ages, scores, ratings): 4–8× memory reduction
  - Auto-detects after sampling threshold (default 50 rows), auto-fallback when range exceeds 16 bits
  - Per-column via `for_columns=[]`, transparent API, update fallback to raw
  - 6 new tests, zero regressions
- v0.6.0: Performance, correctness & features
  - New: vectorized multi-condition `select_where()` (bitmask `AND`/`OR`), auto-indexing (`auto_index=True`), zero-copy NumPy export (`to_numpy()` / `buffer()`, PEP 688)
  - Delta-encoded column reads are now O(1)/O(n) (lazy reconstruction cache) instead of O(n)/O(n²), making delta scans/aggregates orders of magnitude faster
  - Hash indexes are genuinely kept in sync on insert / `batch_insert` / update / delete (previously went stale after the first build); single unified `create_index()` for row and columnar storage; `find()` gained a scan fallback
  - Fixed data corruption: deleting/nulling a delta-encoded row no longer shifts other rows' values
  - Transaction rollback now actually undoes writes (and restores indexes)
  - Durability fix: multi-slab row databases now survive `close()`/reopen; the on-disk bitmap geometry and slab high-water marks are persisted correctly (previously reopening a >1-slab database lost data)
  - Vectorized aggregates (array-level `sum`/`min`/`max`) for null-free numeric columns
  - `__slots__` on hot classes; `close()` reliably releases the mmap (Windows file locks)
  - Tooling: reproducible benchmark suite, GitHub Actions CI (ruff + test matrix + benchmark), `ruff`-clean codebase
- v0.5.0: Delta encoding (1.2× memory reduction for monotonic numeric columns)
- v0.4.0: Dictionary encoding (3× memory reduction for low-cardinality strings)
- v0.3.2: Precompiled struct format, hash index, bit-packed booleans
- v0.3.1: Batch insert, optimized columnar, comprehensive benchmarks
- v0.3.0: Columnar engine, metrics, CDC
- v0.2.0: Query engine, hash indexes, WAL transactions, DocumentStore
- v0.1.0: Initial release
Roadmap & Known Limitations
Design boundary — SnapDB is a single-writer, local-file embedded database. The following are intentional non-goals; they will not be added:
- No SQL planner, MVCC, or CHECK/FOREIGN KEY constraints
- No server mode, RBAC, network encryption, ODBC/JDBC/ADBC, or SQLAlchemy dialect
- No DuckDB-style analytical engine or Parquet/Arrow integration
- Joins are in-memory equi-joins only, not a cost-based optimizer
Current limitations:
- The optional `encryption_key` protects raw files/WAL from casual plaintext recovery; it is not a substitute for OS key management or encrypted volumes.
- Multi-version snapshot isolation is not implemented; the file lock model is single-writer oriented.
Near-term reliability focus (per evaluation guidance):
- Lightweight per-operation benchmarks with loose CI thresholds (insert / query / range / group-by timing and memory)
- Additional `fsck`/`repair` fixtures for corruption recovery paths
- Narrow helper improvements (batch paths, range windows, zero-copy buffers) only where they reduce per-row Python overhead without adding dependencies
License
MIT — see LICENSE