Compact, read-only nested dictionary backed by succinct data structures
Project description
compact-tree
Compact, read-only nested dictionary backed by succinct data structures.
CompactTree stores a nested Python dict using a LOUDS-encoded trie with DAWG-style key/value deduplication, enabling low-memory random access and efficient serialization.
Features
- Memory-efficient: Uses succinct data structures (LOUDS trie + MarisaTrie deduplication)
- Fast lookups: O(1) rank/select operations via Poppy bit vectors
- High-performance builds: LRU-cached MarisaTrie lookups and O(1) label access for 183x faster construction with key reuse
- Serializable: Save and load from disk with efficient binary format
- Gzip compression: Optional gzip compression for even smaller files on disk
- Pickle support: Fully serializable via Python's
picklemodule - Read-only: Optimized for lookup-heavy workloads
- Storage-agnostic: Works with local files and remote storage via
fsspec - Dict-like interface: Supports
[],in,len(), iteration,repr(), andstr()
Installation
pip install savov-compact-tree
Or install from source:
git clone https://github.com/andrey-savov/compact-tree.git
cd compact-tree
pip install -e .
Quick Start
from compact_tree import CompactTree
# Build from a nested dict
tree = CompactTree.from_dict({
"a": {
"x": "1",
"y": "2"
},
"b": "3"
})
# Access like a normal dict
print(tree["a"]["x"]) # "1"
print(tree["b"]) # "3"
print("a" in tree) # True
print(len(tree)) # 2
print(list(tree)) # ["a", "b"]
# String representations
print(str(tree)) # {'a': {'x': '1', 'y': '2'}, 'b': '3'}
print(repr(tree)) # CompactTree.from_dict({'a': {'x': '1', ...}, 'b': '3'})
# Serialize to file
tree.serialize("tree.ctree")
# Load from file
loaded_tree = CompactTree("tree.ctree")
# Serialize with gzip compression
tree.serialize("tree.ctree.gz", storage_options={"compression": "gzip"})
loaded_gz = CompactTree("tree.ctree.gz", storage_options={"compression": "gzip"})
# Pickle support
import pickle
data = pickle.dumps(tree)
tree2 = pickle.loads(data)
# Convert back to plain dict
plain_dict = loaded_tree.to_dict()
How It Works
LOUDS (Level-Order Unary Degree Sequence)
LOUDS is a succinct tree representation that encodes an ordered tree into a single bit string using roughly 2n bits for n nodes (close to the information-theoretic minimum).
Encoding rule: Traverse the tree in breadth-first (level) order. For each node, write d 1-bits (where d is the node's number of children) followed by a single 0-bit.
Example for a root with children A (2 kids) and B (0 kids):
root -> 1 1 0 (2 children, then 0-terminator)
A -> 1 1 0 (2 children)
B -> 0 (leaf)
Navigation relies on rank and select queries:
| Operation | Description |
|---|---|
first_child(v) |
Find the (v-1)-th 0, check next position |
next_sibling(v) |
Find the 1-bit for node v, check next position |
MarisaTrie
MarisaTrie is a compact word-to-index mapping built on a LOUDS-encoded trie with path compression and minimal perfect hashing (MPH). CompactTree uses two MarisaTrie instances -- one for keys and one for values -- to provide DAWG-style deduplication.
- Path compression: single-child edges are merged for compactness
- Dense indexing: every unique word gets an index in
[0, N) - Reverse lookup: recover the original word from its index
DAWG-Style Deduplication
- Keys are collected, sorted, and deduplicated via a
MarisaTrie - Values (leaves) are similarly deduplicated via a second
MarisaTrie - Edge labels store integer IDs rather than raw strings
- Same key/value appearing multiple times is stored only once
Architecture
CompactTree
|
+-- louds : LOUDS bit-vector tree topology (Poppy rank/select)
+-- elbl : bytes edge labels (uint32 key ids, 4 bytes per node)
+-- vcol : bytes value column (uint32: value id or 0xFFFFFFFF for internal nodes)
+-- _key_trie : MarisaTrie key vocabulary (word <-> dense index)
+-- _val_trie : MarisaTrie value vocabulary (word <-> dense index)
Binary Format (v3)
Magic : 5 bytes "CTree"
Version : 8 bytes uint64 LE (always 3)
Header : 5 x 8 bytes lengths of: keys, values, louds, vcol, elbl
Payload : keys_bytes | val_bytes | louds_bytes | vcol_bytes | elbl_bytes
keys_bytes and val_bytes are serialised MarisaTrie instances. louds_bytes is the raw bitarray, vcol_bytes and elbl_bytes are packed uint32 arrays.
Dependencies
bitarray— Mutable bit arrayssuccinct(Poppy) — Rank/select in O(1)fsspec— Filesystem abstraction for local and remote storage
Testing
pytest test_compact_tree.py test_marisa_trie.py
Benchmarks
Run performance benchmarks with pytest-benchmark:
pytest test_compact_tree.py::TestLoadPerformance --benchmark-only -v
See BENCHMARK_RESULTS.md for detailed results and OPTIMIZATIONS.md for optimization history.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Based on LOUDS (Level-Order Unary Degree Sequence) tree representation
- Uses Poppy rank/select implementation from the
succinctlibrary - Inspired by DAWG (Directed Acyclic Word Graph) compression techniques
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file savov_compact_tree-1.1.0.tar.gz.
File metadata
- Download URL: savov_compact_tree-1.1.0.tar.gz
- Upload date:
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9c8179e1d83837e695a1809b97d40e070ed448e22bb83aef457932c61184a947
|
|
| MD5 |
4fdabec2a0104f7b67eba20890622cc3
|
|
| BLAKE2b-256 |
cee67794c1974356353e799a5dc0255fa13870d072f0849462046d537b442138
|
Provenance
The following attestation bundles were made for savov_compact_tree-1.1.0.tar.gz:
Publisher:
publish.yml on andrey-savov/compact-tree
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
savov_compact_tree-1.1.0.tar.gz -
Subject digest:
9c8179e1d83837e695a1809b97d40e070ed448e22bb83aef457932c61184a947 - Sigstore transparency entry: 954399709
- Sigstore integration time:
-
Permalink:
andrey-savov/compact-tree@8fa2898007d25709ee36b010307fb8a9e945d5c0 -
Branch / Tag:
refs/tags/1.1.0 - Owner: https://github.com/andrey-savov
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@8fa2898007d25709ee36b010307fb8a9e945d5c0 -
Trigger Event:
release
-
Statement type:
File details
Details for the file savov_compact_tree-1.1.0-py3-none-any.whl.
File metadata
- Download URL: savov_compact_tree-1.1.0-py3-none-any.whl
- Upload date:
- Size: 14.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7e745fa96263d320c1eb05e3ac167987d38b7fae41dd5c916ac53c89abe39d42
|
|
| MD5 |
18c7eff84b3375eb49829a65adf5aa04
|
|
| BLAKE2b-256 |
00be1d301826fda95854fa187988f5da3258c693ed21cdef294eaa87237d0a4b
|
Provenance
The following attestation bundles were made for savov_compact_tree-1.1.0-py3-none-any.whl:
Publisher:
publish.yml on andrey-savov/compact-tree
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
savov_compact_tree-1.1.0-py3-none-any.whl -
Subject digest:
7e745fa96263d320c1eb05e3ac167987d38b7fae41dd5c916ac53c89abe39d42 - Sigstore transparency entry: 954399712
- Sigstore integration time:
-
Permalink:
andrey-savov/compact-tree@8fa2898007d25709ee36b010307fb8a9e945d5c0 -
Branch / Tag:
refs/tags/1.1.0 - Owner: https://github.com/andrey-savov
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@8fa2898007d25709ee36b010307fb8a9e945d5c0 -
Trigger Event:
release
-
Statement type: