# s4db - Simple DB on S3

A lightweight key-value database backed by S3.
A lightweight key-value store where keys and values are strings. Data is written to numbered binary files on disk and synced to S3. Values are Snappy-compressed. An in-memory index tracks the exact file and byte offset for every live key, so reads never scan - they seek directly.
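The core idea can be sketched in a few lines of plain Python. This is a toy illustration only — the names and record layout here are hypothetical, not s4db's actual internal format — but it shows why reads seek rather than scan: the index already knows exactly where each live value lives.

```python
import os
import tempfile

# Toy in-memory index: key -> (file_number, byte_offset, length).
index = {}

data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "data_000001.s4db")

# Append values to a data file, recording where each one landed.
with open(path, "wb") as f:
    for key, value in {"hello": "world", "k2": "v2"}.items():
        payload = value.encode()
        index[key] = (1, f.tell(), len(payload))
        f.write(payload)

def get(key):
    """Seek straight to the recorded offset -- no scanning."""
    if key not in index:
        return None
    _file_no, offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode()
```

The same (file, offset, length) triple works whether the bytes are read from a local file or fetched from S3 with a range request.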
## Installation

```sh
pip install s4db
```

s4db requires python-snappy, which links against the native Snappy C library:

```sh
# macOS
brew install snappy

# Ubuntu / Debian
apt-get install libsnappy-dev
```
## Quick start

```python
from s4db import S4DB

db = S4DB(
    bucket="my-bucket",
    prefix="my-db/",           # S3 key prefix; include a trailing slash
    region_name="ap-south-1",  # any extra kwargs go to boto3.client("s3", ...)
)

db.put({"hello": "world"})
print(db.get("hello"))  # "world"

db.delete(["hello"])
print(db.get("hello"))  # None
```
On `__init__`, the index is downloaded from S3 into memory. If no index exists, the database starts empty. No local directory is created or used until a write operation (`put` / `delete`) is called.
## API reference

### `__init__(bucket, prefix, local_dir=None, max_file_size=...)`

```python
db = S4DB(
    bucket="my-bucket",
    prefix="my-db/",
    local_dir="/tmp/my-db",      # optional; a temp dir is created automatically if omitted
    max_file_size=64*1024*1024,  # optional, default 64 MB
    region_name="ap-south-1",    # any extra kwargs go to boto3.client("s3", ...)
)
```

- `local_dir` is optional. If not provided, no directory is touched until a `put()` or `delete()` is called, at which point a temporary directory is created automatically.
- Read-only operations (`get`, `keys`) never require a local directory - they use the in-memory index and S3 range requests.
- The index is always loaded from S3 into memory on init; it is never read from a local file.
### `put(items: dict[str, str]) -> None`

Writes one or more key/value pairs in a single append to the current data file.

```python
db.put({"key1": "value1", "key2": "value2"})
```

- Overwrites any existing value for a key.
- If the current data file would exceed `max_file_size`, a new file is opened before writing.
- Creates `local_dir` (or a temp dir) on first call if none was provided.
- Does not push to S3 automatically - call `upload()` when ready to sync.
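To make "a single append" concrete, here is one plausible way a batch could be laid out on disk. The framing below is a guess for illustration — s4db's real wire format is internal and may differ — and zlib stands in for Snappy so the sketch needs no native library:

```python
import struct
import zlib

def encode_record(key: str, value: str) -> bytes:
    """Hypothetical length-prefixed record: key length, compressed-value
    length, key bytes, compressed value bytes. zlib stands in for Snappy."""
    k = key.encode()
    v = zlib.compress(value.encode())
    return struct.pack(">II", len(k), len(v)) + k + v

def decode_record(buf: bytes) -> tuple:
    """Return (key, value, bytes_consumed) for the record at the start of buf."""
    klen, vlen = struct.unpack(">II", buf[:8])
    key = buf[8:8 + klen].decode()
    value = zlib.decompress(buf[8 + klen:8 + klen + vlen]).decode()
    return key, value, 8 + klen + vlen

# A put() of several pairs becomes one contiguous blob, appended in one write.
batch = b"".join(
    encode_record(k, v) for k, v in {"key1": "value1", "key2": "value2"}.items()
)
```

Because each record is self-delimiting, the index only needs an offset and length per key to read any value back.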
### `get(key: str) -> str | None`

Returns the value for a key, or `None` if the key does not exist or has been deleted.

```python
value = db.get("key1")
```

- Looks up the key in the index to get the file number and byte offset.
- If `local_dir` is set and the data file is present there, reads exactly those bytes from disk.
- Otherwise fetches only that entry's bytes from S3 using a range request - the full file is never downloaded, and no local directory is needed.
- Call `download()` first if you want all reads served from disk.
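The range-request path is easy to picture: an entry at `(offset, length)` maps to the HTTP `Range` header `bytes=offset..offset+length-1`. The sketch below simulates the S3 object as a byte string (the index layout here is a toy, not s4db's real one):

```python
# Two concatenated entries, as they would sit inside one S3 data file.
s3_object = b"worldvalue2"
index = {"hello": (0, 5), "k2": (5, 6)}  # key -> (offset, length)

def range_header(offset: int, length: int) -> str:
    # The header a ranged GET (e.g. boto3 get_object(Range=...)) would carry.
    return f"bytes={offset}-{offset + length - 1}"

def ranged_get(key):
    """Fetch exactly one entry's bytes -- the rest of the file is never read."""
    if key not in index:
        return None
    offset, length = index[key]
    return s3_object[offset:offset + length].decode()
```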
### `keys() -> list[str]`

Returns a list of all live keys currently in the database.

```python
all_keys = db.keys()
```

- Reads directly from the in-memory index - no disk or S3 access.
- Only returns keys that are live (not deleted). Tombstoned keys are never included.
- The order of the returned list is not guaranteed.
### `iter(local=False) -> Generator[tuple[str, str], None, None]`

Yields `(key, value)` pairs for every live key in the database.

```python
for key, value in db.iter():
    print(key, value)
```

The `local` parameter controls how values are read:

- `local=False` (default) - for each key, calls `get()`, which fetches only that entry's bytes from S3 using a range request. No files are downloaded. Use this for sparse access or when disk space is limited.
- `local=True` - before iteration, downloads all data files referenced by the index that are not already present in `local_dir`. Existing local files are not replaced. Values are then read from disk - no S3 calls during iteration itself. Use this when iterating over many keys to avoid one S3 request per key.

```python
# S3 range request per key (default)
for key, value in db.iter():
    process(key, value)

# Download missing files first, then read from disk
for key, value in db.iter(local=True):
    process(key, value)
```

- Deleted keys are never yielded.
- The iteration order is not guaranteed.
- `iter(local=True)` creates `local_dir` (or a temp dir) if none was provided.
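The `local=True` pre-step boils down to a set difference: files the index references minus files already on disk. A minimal sketch, with hypothetical names:

```python
# key -> (file_number, offset, length); file 2 holds no live entries,
# so it is never referenced and never downloaded.
index = {"a": (1, 0, 5), "b": (3, 0, 5)}
local_files = {"data_000001.s4db"}  # already present in local_dir

def files_to_download(index, local_files):
    """Only fetch data files that live keys point at and that are missing."""
    referenced = {f"data_{file_no:06d}.s4db" for file_no, _, _ in index.values()}
    return sorted(referenced - local_files)
```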
### `delete(keys: list[str]) -> None`

Writes tombstone entries for each key that exists in the index.

```python
db.delete(["key1", "key2"])
```

- Keys not present in the index are silently skipped; no tombstone is written for them.
- Removes the keys from the in-memory index immediately.
- Tombstones consume space until `compact()` is run.
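Tombstone semantics can be shown with an append-only log standing in for the data files (all names here are illustrative, not s4db internals):

```python
TOMBSTONE = None  # sentinel: a record with no value marks a deletion

log = []    # append-only record log, standing in for the data files
index = {}  # key -> position of the live record in the log

def put(items):
    for k, v in items.items():
        log.append((k, v))
        index[k] = len(log) - 1

def delete(keys):
    for k in keys:
        if k in index:                   # unknown keys: silently skipped
            log.append((k, TOMBSTONE))   # consumes space until compaction
            index.pop(k)                 # removed from the index immediately

put({"key1": "value1", "key2": "value2"})
delete(["key1", "ghost"])  # "ghost" is not indexed: no tombstone written
```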
### `download() -> None`

Downloads all data files and the index from S3 into `local_dir`.

```python
db.download()
```

- Creates `local_dir` (or a temp dir) if none was provided.
- Use this when you want all subsequent reads served from disk with no S3 round trips.
- Overwrites any local files with the same name.
### `upload() -> None`

Pushes all local data files and the in-memory index to S3.

```python
db.upload()
```

- The index is serialized directly from memory - no local index file is required.
- If `local_dir` is not set, only the index is uploaded (no local data files exist).
- Useful after bulk operations like `compact()` or `rebuild_index()` to force a full re-sync.
- Does not check whether S3 already has the latest version - it uploads everything.
### `flush() -> None`

Writes the in-memory index to disk.

```python
db.flush()
```

- Creates `local_dir` (or a temp dir) if none was provided.
- `put()` and `delete()` already call `flush()` internally.
### `compact() -> None`

Rewrites all data files to reclaim space from deleted and overwritten entries.

```python
db.compact()
```

- Reads every entry from every local data file.
- Retains only entries whose (file number, byte offset) still matches the in-memory index - stale overwrites and tombstones are dropped.
- Writes the surviving entries into new sequentially numbered files, respecting `max_file_size`.
- Clears and rebuilds the index from the new locations, saves it, removes the old local files, deletes the old S3 objects, and uploads the new files and index.
- All data files must be present locally; compaction does not fetch missing files from S3. Run `download()` first if `local_dir` may be out of date.
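The retention rule is the heart of compaction: a record survives only if the index still points at its exact location. A toy version over in-memory "files" (slot numbers stand in for byte offsets, and `max_file_size` is ignored for brevity):

```python
# file_number -> list of (key, value) records, in append order.
files = {
    1: [("a", "old"), ("b", "v-b")],   # ("a", "old") was later overwritten
    2: [("a", "v-a"), ("c", "gone")],  # ("c", ...) was later deleted
}
index = {"a": (2, 0), "b": (1, 1)}     # key -> (file_number, slot)

def compact(files, index):
    """Keep only records the index still references, then renumber."""
    survivors = [(k, files[fno][slot][1]) for k, (fno, slot) in sorted(index.items())]
    new_files = {1: survivors}  # real code would split on max_file_size
    new_index = {k: (1, i) for i, (k, _) in enumerate(survivors)}
    return new_files, new_index

new_files, new_index = compact(files, index)
```

Stale overwrites (`("a", "old")`) and deleted entries (`("c", "gone")`) simply fail the "still in the index" check and are dropped.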
### `rebuild_index() -> None`

Reconstructs the index by replaying all local data files from scratch.

```python
db.rebuild_index()
```

- Scans every `data_*.s4db` file in `local_dir` in order, applying puts and tombstones sequentially. Later entries correctly overwrite earlier ones for the same key.
- Saves the rebuilt index to disk. Does not push to S3 automatically.
- Use this for recovery when the index file is lost or corrupted.
- Run `download()` first to ensure all data files are present locally.
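Replay order is what makes recovery correct: later records win, and a tombstone erases the key. A sketch with slot numbers standing in for byte offsets (names illustrative):

```python
# Files listed in sequence order; None marks a tombstone record.
files = [
    ("data_000001.s4db", [("a", "1"), ("b", "2")]),
    ("data_000002.s4db", [("a", "3"), ("b", None)]),  # overwrite a, delete b
]

def rebuild_index(files):
    """Replay every record in file order; later entries overwrite earlier ones."""
    index = {}
    for fname, records in files:
        for slot, (key, value) in enumerate(records):
            if value is None:
                index.pop(key, None)   # tombstone removes the key
            else:
                index[key] = (fname, slot)
    return index
```

This is also why interrupted writes are recoverable: any records the old index never referenced are still replayed here.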
## Context manager

S4DB supports the context manager protocol. `__exit__` is a no-op - there is no connection to close - but the pattern keeps resource handling consistent.

```python
with S4DB("my-bucket", "my-db/") as db:
    db.put({"k": "v"})
    print(db.get("k"))
```
## S3 layout

Given `bucket="my-bucket"` and `prefix="my-db/"`:

```
my-bucket/
  my-db/
    index.idx
    data_000001.s4db
    data_000002.s4db
    ...
```

Data files are named `data_NNNNNN.s4db` with zero-padded six-digit sequence numbers. The index file is always `index.idx`.
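The naming scheme above is a one-liner with Python's format mini-language:

```python
def data_file_name(n: int) -> str:
    # Zero-padded six-digit sequence number, as in the layout above.
    return f"data_{n:06d}.s4db"
```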
## Typical workflows

### Read-only from S3 - no local directory needed

```python
db = S4DB("my-bucket", "my-db/")
# Index is loaded from S3 into memory; gets use S3 range requests
print(db.get("some-key"))
print(db.keys())
```
### Write locally, sync later

```python
db = S4DB("my-bucket", "my-db/", local_dir="/tmp/my-db")
db.put({"a": "1", "b": "2"})
db.delete(["a"])
db.upload()  # push everything to S3 when done
```
### Write without specifying `local_dir` (temp dir created automatically)

```python
db = S4DB("my-bucket", "my-db/")
db.put({"a": "1"})  # temp dir created here on first write
db.upload()
```
### Full local mirror

```python
db = S4DB("my-bucket", "my-db/", local_dir="/tmp/my-db")
db.download()  # pull everything local
print(db.get("some-key"))  # served from disk, no S3 call
```
### Iterate over all key/value pairs

```python
# One S3 range request per key (no local files needed)
db = S4DB("my-bucket", "my-db/")
for key, value in db.iter():
    print(key, value)

# Download missing files first, then read entirely from disk
db = S4DB("my-bucket", "my-db/", local_dir="/tmp/my-db")
for key, value in db.iter(local=True):
    print(key, value)
```
### Periodic compaction

```python
db = S4DB("my-bucket", "my-db/", local_dir="/tmp/my-db")
db.download()  # ensure all data files are present
db.compact()   # rewrite, clean up S3, upload new files
```
### Index recovery

```python
db = S4DB("my-bucket", "my-db/", local_dir="/tmp/my-db")
db.download()       # pull all data files
db.rebuild_index()  # reconstruct index from data files
db.upload()         # push repaired index to S3
```
## Edge cases and gotchas

- `local_dir` is not required for read-only usage. A temporary directory is created automatically on the first `put()` or `delete()` call if none was provided.
- `put()` and `delete()` do not push to S3 automatically. Call `upload()` explicitly.
- `get()` on a key whose data file is not local will make a ranged S3 request on every call. Use `download()` if you expect repeated access to the same keys.
- `compact()` and `rebuild_index()` require all data files to be present in `local_dir`. Always run `download()` first if you are not certain the local directory is up to date.
- `delete()` silently skips keys that are not in the index. It never writes unnecessary tombstones.
- If the process is interrupted during `put()` or `delete()`, the data file may contain entries that the index does not reference. `rebuild_index()` will recover them.
- `max_file_size` is a soft limit. An entry is never split across files, but a single oversized entry can make a file exceed the limit slightly.
- `iter(local=False)` makes one S3 range request per key. For large datasets, prefer `iter(local=True)` to batch the S3 downloads upfront.
- `iter(local=True)` only downloads files referenced by the current in-memory index. Files that contain only deleted or overwritten entries are not downloaded.
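The soft-limit behavior of `max_file_size` can be sketched as a simple placement rule: roll to a new file before a write that would exceed the limit, but never split a single entry (so one oversized entry still lands whole, pushing its file past the limit). This is an illustration of the documented behavior, not s4db's actual code:

```python
def place_entries(sizes, max_file_size=10):
    """Return the file number each entry of the given byte sizes lands in."""
    placements, file_no, used = [], 1, 0
    for size in sizes:
        if used and used + size > max_file_size:
            file_no, used = file_no + 1, 0  # open a new file before writing
        placements.append(file_no)
        used += size                        # an entry is never split
    return placements
```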
## Dependencies

- boto3 >= 1.26
- python-snappy >= 0.6

## Development

```sh
pip install -e ".[dev]"
pytest tests/ -v
```

Tests use moto to mock S3 - no real AWS credentials required.

## License

MIT
## File details

### s4db-0.8.0.tar.gz (source distribution, 21.2 kB)

Uploaded via twine/6.2.0 on CPython/3.12.3; Trusted Publishing not used.

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8ffa1847fc2924e3d30eb52fc02c69a6f4c373b2d1fbd304ef00188ff939a196` |
| MD5 | `17015c5085d9ac5a422c4f73d4a25f68` |
| BLAKE2b-256 | `73b4a42528bc1860f6cd1afbd667a5ae0e241f3c7bbe070a980b1ce757967d92` |
### s4db-0.8.0-py3-none-any.whl (built distribution, Python 3, 15.2 kB)

Uploaded via twine/6.2.0 on CPython/3.12.3; Trusted Publishing not used.

| Algorithm | Hash digest |
|---|---|
| SHA256 | `0846492728f0dd0a199c4e42729f3f28efb0a5a3b7c7271074e5318115030b6b` |
| MD5 | `51700452be5b9e9331665a6c9ab4071b` |
| BLAKE2b-256 | `42791f1b373e2a0a7f69b5007c7d47ff7a8ce05b7944a2d33ef7e0497db379e7` |