A python key-value file database
Introduction
Booklet is a pure Python key-value file database. It allows for multiple serializers for both the keys and values. Booklet uses the MutableMapping class API, which is the same as Python’s dictionary, in addition to some dbm methods (i.e. sync and prune). It is thread-safe (using thread locks on writes) and multiprocessing-safe (using file locks).
Deletes do not remove data from the file directly. Similarly, reassigning a value to an existing key adds a new key/value pair to the file instead of overwriting the old one. During normal usage, the user will not notice a difference when requesting a key/value pair, but the file size will grow. If size becomes an issue because of many deletes or reassignments, the user should create a new file by iterating through the original and writing its key/value pairs to the new file.
When an error occurs and is caught by the module (e.g. trying to access a key that doesn’t exist), booklet will properly close the file and remove the file locks. This will not sync any changes, so the user will lose any changes that were not synced. Some errors are not caught by the module, and in those circumstances there are no guarantees about what happens to the file.
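As a rough sketch of the "create a new file by iterating through the original" step, the helper below copies every key/value pair from one mapping into another. The function name copy_live_entries and the file names in the comments are hypothetical; the demo uses plain dicts so that it runs without booklet installed, and it assumes that iterating over an open booklet file yields only the current keys.

```python
from collections.abc import Mapping, MutableMapping


def copy_live_entries(src: Mapping, dst: MutableMapping) -> int:
    """Copy every key/value pair visible in src into dst; return the count."""
    count = 0
    for key, value in src.items():
        dst[key] = value
        count += 1
    return count


# Stand-in demo with plain dicts (any MutableMapping behaves the same way).
old = {'test_key': ['one', 2, 'three', 4], '2nd_test_key': ['five', 6]}
new = {}
n_copied = copy_live_entries(old, new)

# Hypothetical usage with booklet (file names are placeholders):
# import booklet
# with booklet.open('old.blt') as old_db, \
#      booklet.open('new.blt', 'n', key_serializer='str', value_serializer='pickle') as new_db:
#     copy_live_entries(old_db, new_db)
```

The new file only receives the pairs that are still reachable, so it should end up smaller than a file that has accumulated many deletes and reassignments.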
Installation
Install via pip:
pip install booklet
Or conda:
conda install -c mullenkamp booklet
I’ll probably put it on conda-forge once I feel like it’s up to an appropriate standard…
Serialization
Both the keys and values stored in Booklet must be bytes when written to disk. This is also what “open” expects by default: unless a serializer is given, keys and values must be passed as bytes. Booklet allows for various serializers to be used for taking input keys and values and converting them to bytes. There are many built-in serializers. Check the booklet.available_serializers list for what’s available. Some serializers require additional packages to be installed (e.g. orjson, zstd, etc). If you want to serialize to JSON, it is highly recommended to use orjson or msgpack, as they are substantially faster than the standard json Python module.
If built-in serializers are assigned when the file is first created, they are saved in the file and reused on future reads and writes of the same file (i.e. they don’t need to be passed after the first time). Setting a serializer to None will not do any serializing, and the input must be bytes.
The user can also pass custom serializers to the key_serializer and value_serializer parameters. These must have “dumps” and “loads” static methods. This allows the user to chain a serializer and a compressor together if desired. Custom serializers must be passed every time the file is opened for writing or reading, as they are not stored in the booklet file.
import booklet
print(booklet.available_serializers)
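The “dumps”/“loads” contract for custom serializers can be sketched with the standard library alone. The JsonZlib class below is a hypothetical example, not part of booklet: it chains the json module with zlib compression so that dumps returns bytes and loads reverses it.

```python
import json
import zlib


class JsonZlib:
    """Hypothetical custom serializer: JSON text compressed with zlib."""

    @staticmethod
    def dumps(obj) -> bytes:
        # Serialize to JSON, encode to bytes, then compress.
        return zlib.compress(json.dumps(obj).encode('utf-8'))

    @staticmethod
    def loads(data: bytes):
        # Reverse the steps: decompress, decode, parse.
        return json.loads(zlib.decompress(data).decode('utf-8'))


original = {'name': 'test', 'values': [1, 2, 3]}
encoded = JsonZlib.dumps(original)
decoded = JsonZlib.loads(encoded)
```

A class like this could presumably be passed as value_serializer=JsonZlib, and, being custom, it would have to be passed again each time the file is opened.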
Usage
The docstrings have a lot of info about the classes and methods. Files should be opened with the booklet.open function. Read the docstrings of the open function for more details.
Write data using the context manager
import booklet
with booklet.open('test.blt', 'n', value_serializer='pickle', key_serializer='str', n_buckets=12007) as db:
    db['test_key'] = ['one', 2, 'three', 4]
Read data using the context manager
with booklet.open('test.blt', 'r') as db:
    test_data = db['test_key']
Notice that you don’t need to pass serializer parameters when reading (or when writing additional data) if built-in serializers are used. Booklet stores this info when the file is first created.
In most cases, the user should use Python’s context manager “with” when reading and writing data. This will ensure data is properly written and locks are released on the file. If the context manager is not used, then the user must be sure to call db.sync() (or db.close()) at the end of a series of writes to ensure the data has been fully written to disk. Only after the writes have been synced can additional reads occur. Make sure you close your file or you’ll run into file deadlocks!
Write data without using the context manager
import booklet
db = booklet.open('test.blt', 'n', value_serializer='pickle', key_serializer='str', n_buckets=12007)
db['test_key'] = ['one', 2, 'three', 4]
db['2nd_test_key'] = ['five', 6, 'seven', 8]
db.sync() # Normally not necessary if the user closes the file after writing
db.close() # Will also run sync as part of the closing process
Read data without using the context manager
db = booklet.open('test.blt') # 'r' is the default flag
test_data1 = db['test_key']
test_data2 = db['2nd_test_key']
db.close()
Custom serializers
import booklet
import orjson

class Orjson:
    @staticmethod
    def dumps(obj):
        return orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS | orjson.OPT_OMIT_MICROSECONDS | orjson.OPT_SERIALIZE_NUMPY)

    @staticmethod
    def loads(obj):
        return orjson.loads(obj)

with booklet.open('test.blt', 'n', value_serializer=Orjson, key_serializer='str') as db:
    db['test_key'] = ['one', 2, 'three', 4]
The Orjson class is actually already built into the package. You can pass the string ‘orjson’ to either serializer parameter to use the above serializer. This is just an example of a custom serializer.
Here’s another example with compression.
import booklet
import orjson
import zstandard as zstd

class OrjsonZstd:
    @staticmethod
    def dumps(obj):
        # Serialize with orjson first, then compress the resulting bytes
        return zstd.compress(orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS | orjson.OPT_OMIT_MICROSECONDS | orjson.OPT_SERIALIZE_NUMPY))

    @staticmethod
    def loads(obj):
        # Decompress first, then deserialize
        return orjson.loads(zstd.decompress(obj))

with booklet.open('test.blt', 'n', value_serializer=OrjsonZstd, key_serializer='str') as db:
    db['big_test'] = list(range(1000000))

with booklet.open('test.blt', 'r', value_serializer=OrjsonZstd) as db:
    big_test_data = db['big_test']
If you use a custom serializer, then you’ll always need to pass it to booklet.open for additional reading and writing.
The open flag follows the standard dbm options:
Value | Meaning
---|---
'r' | Open existing database for reading only (default)
'w' | Open existing database for reading and writing
'c' | Open database for reading and writing, creating it if it doesn’t exist
'n' | Always create a new, empty database, open for reading and writing
Design
There are two groups in a booklet file plus some initial bytes for parameters (the sub index). The sub index is 200 bytes long, but currently only 37 bytes are used. The two other groups are the bucket index group and the data block group. The bucket index group contains the “hash table”: a fixed number of buckets (n_buckets), each of which holds a 6-byte integer giving the position of the first data block associated with that bucket. When the user requests the value for a key, the key is hashed and the hash modulo n_buckets determines which bucket to read. The 6 bytes are read from that bucket and converted to an integer, which tells booklet where the first data block is located in the file. The data block group contains all of the data blocks, each of which contains the key hash, next data block pos, key length, value length, timestamp (if the file was initialised with timestamps), key, and value (in this order).
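The bucket lookup described above can be sketched against an in-memory buffer. Several details here are assumptions rather than facts about booklet's file format: blake2b truncated to 13 bytes stands in for the real hash function, integers are taken to be little-endian, and the bucket index is assumed to start immediately after the 200-byte sub index. All function names are hypothetical.

```python
import hashlib

SUB_INDEX_LEN = 200   # bytes reserved for parameters at the start of the file
BUCKET_LEN = 6        # each bucket stores one 6-byte file position
KEY_HASH_LEN = 13     # key hash size listed for data blocks


def hash_key(key: bytes) -> bytes:
    # Assumption: blake2b truncated to 13 bytes stands in for booklet's hash.
    return hashlib.blake2b(key, digest_size=KEY_HASH_LEN).digest()


def bucket_offset(key: bytes, n_buckets: int) -> int:
    # Hash the key, take the modulus, and locate that bucket in the buffer.
    bucket = int.from_bytes(hash_key(key), 'little') % n_buckets
    return SUB_INDEX_LEN + bucket * BUCKET_LEN


def first_block_pos(buf: bytes, key: bytes, n_buckets: int) -> int:
    # Read the bucket's 6 bytes and convert them to an integer position.
    start = bucket_offset(key, n_buckets)
    return int.from_bytes(buf[start:start + BUCKET_LEN], 'little')


# In-memory stand-in for the sub index plus an empty bucket index.
n_buckets = 12007
buf = bytearray(SUB_INDEX_LEN + n_buckets * BUCKET_LEN)

# Pretend the first data block for b'test_key' starts right after the index.
pos = len(buf)
start = bucket_offset(b'test_key', n_buckets)
buf[start:start + BUCKET_LEN] = pos.to_bytes(BUCKET_LEN, 'little')

found = first_block_pos(bytes(buf), b'test_key', n_buckets)
```

Because every bucket has the same width, finding a bucket is a single multiplication and one 6-byte read, regardless of how many keys the file holds.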
The number of bytes for each field of a data block:
- key hash: 13
- next data block pos: 6
- key length: 2
- value length: 4
- timestamp: 0 (if the file was initialised without timestamps) or 7
- key: variable
- value: variable
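To make the field sizes concrete, here is a minimal sketch that packs one data block in the order listed. The little-endian byte order, the integer timestamp, and the names header_len and pack_block are assumptions made for illustration; booklet's actual encoding may differ.

```python
from typing import Optional

KEY_HASH_LEN = 13
POS_LEN = 6
KEY_LEN_LEN = 2
VALUE_LEN_LEN = 4
TIMESTAMP_LEN = 7


def header_len(with_timestamp: bool) -> int:
    # Fixed-size part of a block: everything before the key and value.
    base = KEY_HASH_LEN + POS_LEN + KEY_LEN_LEN + VALUE_LEN_LEN
    return base + (TIMESTAMP_LEN if with_timestamp else 0)


def pack_block(key_hash: bytes, next_pos: int, key: bytes, value: bytes,
               timestamp: Optional[int] = None) -> bytes:
    if len(key_hash) != KEY_HASH_LEN:
        raise ValueError('key hash must be 13 bytes')
    parts = [
        key_hash,
        next_pos.to_bytes(POS_LEN, 'little'),
        len(key).to_bytes(KEY_LEN_LEN, 'little'),
        len(value).to_bytes(VALUE_LEN_LEN, 'little'),
    ]
    if timestamp is not None:
        # Assumption: the timestamp is stored as a 7-byte unsigned integer.
        parts.append(timestamp.to_bytes(TIMESTAMP_LEN, 'little'))
    parts += [key, value]
    return b''.join(parts)


plain = pack_block(bytes(13), 1, b'test_key', b'payload')
stamped = pack_block(bytes(13), 1, b'test_key', b'payload',
                     timestamp=1_700_000_000_000_000)
```

With these sizes the fixed header is 25 bytes without timestamps and 32 bytes with them, so the 8-byte key and 7-byte value above give blocks of 40 and 47 bytes.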
When the first data block pos is determined through the initial key hashing and bucket reading, the first 19 bytes (key hash and next data block pos) are read. Booklet then checks the next data block pos (ndbp). If the ndbp is 0, then it has been assigned the delete flag and is ignored. The key hash from the data block is compared to the key hash from the input. If they are the same, then this is the data block we want. If they are different, then we look again at the ndbp. If the ndbp is 1, then this is the last data block associated with the key hash and the input key hash doesn’t exist. If the ndbp is > 1, then we move to the next data block based on the ndbp and try the cycle again until either we hit a dead end or we find the same key hash.
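The chain walk described above can be sketched as a loop over the 19-byte block heads. This is a simplified model under stated assumptions: integers are little-endian, blocks in the demo contain only the head (no lengths, key, or value), and a block whose ndbp is 0 is treated as a dead end because the text does not say how the search continues past a deleted block. The names find_block and make_head are hypothetical.

```python
KEY_HASH_LEN = 13
POS_LEN = 6
HEAD_LEN = KEY_HASH_LEN + POS_LEN  # the 19 bytes read first
DELETED = 0        # ndbp value used as the delete flag
END_OF_CHAIN = 1   # ndbp value marking the last block of a bucket


def find_block(buf: bytes, first_pos: int, wanted_hash: bytes):
    """Return the position of the block holding wanted_hash, or None."""
    pos = first_pos
    while True:
        head = buf[pos:pos + HEAD_LEN]
        block_hash = head[:KEY_HASH_LEN]
        ndbp = int.from_bytes(head[KEY_HASH_LEN:], 'little')
        if ndbp == DELETED:
            return None      # assumption: a deleted block ends the search
        if block_hash == wanted_hash:
            return pos       # same key hash: this is the block we want
        if ndbp == END_OF_CHAIN:
            return None      # last block reached, key hash not present
        pos = ndbp           # follow the pointer to the next block


def make_head(key_hash: bytes, ndbp: int) -> bytes:
    return key_hash + ndbp.to_bytes(POS_LEN, 'little')


# Three made-up hashes; only the first two are stored in the chain.
hash_a, hash_b, hash_c = (bytes([n]) * KEY_HASH_LEN for n in (1, 2, 3))
pad = bytes(32)  # keeps block positions above the reserved values 0 and 1
chain = pad + make_head(hash_a, 32 + HEAD_LEN) + make_head(hash_b, END_OF_CHAIN)
deleted = pad + make_head(hash_a, DELETED)
```

Here find_block(chain, 32, hash_b) follows one pointer from the block at position 32 to the block at position 51, while a search for hash_c reaches the end-of-chain marker and gives up.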
When we find the identical key hash, Booklet reads the next 6 bytes (key length and value length) to determine how many bytes need to be read to get the key and value (since they are variable). Depending on whether the user wants the key, value, and/or timestamp, Booklet then reads the 7-byte timestamp (if the file has timestamps) plus the number of bytes for the key and value.
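Once the matching block is found, reading the entry is a matter of stepping through the fixed-width fields. The sketch below assumes little-endian integers and an integer timestamp, and always reads every field for simplicity; read_entry is a hypothetical name, not a booklet function.

```python
KEY_HASH_LEN, POS_LEN = 13, 6
KEY_LEN_LEN, VALUE_LEN_LEN, TIMESTAMP_LEN = 2, 4, 7


def read_entry(buf: bytes, pos: int, with_timestamp: bool):
    """Return (key, value, timestamp) for the block starting at pos."""
    cursor = pos + KEY_HASH_LEN + POS_LEN  # skip the 19-byte head
    key_len = int.from_bytes(buf[cursor:cursor + KEY_LEN_LEN], 'little')
    cursor += KEY_LEN_LEN
    value_len = int.from_bytes(buf[cursor:cursor + VALUE_LEN_LEN], 'little')
    cursor += VALUE_LEN_LEN
    timestamp = None
    if with_timestamp:
        timestamp = int.from_bytes(buf[cursor:cursor + TIMESTAMP_LEN], 'little')
        cursor += TIMESTAMP_LEN
    key = buf[cursor:cursor + key_len]
    cursor += key_len
    value = buf[cursor:cursor + value_len]
    return key, value, timestamp


# Hand-built block with a zeroed key hash and an end-of-chain pointer.
key, value = b'test_key', b'payload'
block = (
    bytes(KEY_HASH_LEN)
    + (1).to_bytes(POS_LEN, 'little')
    + len(key).to_bytes(KEY_LEN_LEN, 'little')
    + len(value).to_bytes(VALUE_LEN_LEN, 'little')
    + (123456).to_bytes(TIMESTAMP_LEN, 'little')
    + key
    + value
)
entry = read_entry(block, 0, with_timestamp=True)
```

A real reader could skip the slices it does not need, for example seeking past the key when only the value is requested.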
Deletes…
Limitations
The only current limitation is that the user should assign an appropriate n_buckets. This should be approximately the same as the expected number of keys/values. The default is 12007. Automatic re-indexing should come eventually.
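One way to pick n_buckets is sketched below. With evenly distributed hashes, the average number of keys sharing a bucket is roughly the number of keys divided by n_buckets, which is why the two should be of similar size. Rounding up to a prime is purely an assumption of this sketch (the default 12007 happens to be prime); booklet does not state that a prime is required. Both function names are hypothetical.

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def suggest_n_buckets(expected_keys: int) -> int:
    # Assumption: round the expected key count up to the next prime.
    n = max(expected_keys, 2)
    while not is_prime(n):
        n += 1
    return n


def avg_keys_per_bucket(n_keys: int, n_buckets: int) -> float:
    # Average number of keys that share one bucket.
    return n_keys / n_buckets


small = suggest_n_buckets(12001)
crowded = avg_keys_per_bucket(1_000_000, 12007)
roomy = avg_keys_per_bucket(1_000_000, suggest_n_buckets(1_000_000))
```

Storing a million keys with the default n_buckets would put about 83 keys in each bucket on average, so each lookup may have to walk a long chain; a bucket count near a million keeps the average just under one.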
Benchmarks
From my initial tests, the performance is comparable to other very fast key-value databases (e.g. gdbm, lmdb).