Skip to main content

Compact, read-only nested dictionary backed by succinct data structures

Project description

compact-tree

Tests License: MIT

Compact, read-only nested dictionary backed by succinct data structures.

CompactTree stores a nested Python dict using a LOUDS-encoded trie with DAWG-style key/value deduplication, enabling low-memory random access and efficient serialization.

Features

  • Memory-efficient: Uses succinct data structures (LOUDS trie + MarisaTrie deduplication)
  • Fast lookups: O(1) rank/select operations via Poppy bit vectors
  • High-performance builds: 2.2× faster from_dict at 173K keys (v1.2.0); eliminates LOUDS rank/select during build entirely via intermediate-trie word index
  • Serializable: Save and load from disk with efficient binary format
  • Gzip compression: Optional gzip compression for even smaller files on disk
  • Pickle support: Fully serializable via Python's pickle module
  • Read-only: Optimized for lookup-heavy workloads
  • Storage-agnostic: Works with local files and remote storage via fsspec
  • Dict-like interface: Supports [], in, len(), iteration, repr(), and str()

Installation

pip install savov-compact-tree

Or install from source:

git clone https://github.com/andrey-savov/compact-tree.git
cd compact-tree
pip install -e .

Quick Start

from compact_tree import CompactTree

# Build from a nested dict
tree = CompactTree.from_dict({
    "a": {
        "x": "1",
        "y": "2"
    },
    "b": "3"
})

# Access like a normal dict
print(tree["a"]["x"])   # "1"
print(tree["b"])        # "3"
print("a" in tree)      # True
print(len(tree))        # 2
print(list(tree))       # ["a", "b"]

# String representations
print(str(tree))        # {'a': {'x': '1', 'y': '2'}, 'b': '3'}
print(repr(tree))       # CompactTree.from_dict({'a': {'x': '1', ...}, 'b': '3'})

# Serialize to file
tree.serialize("tree.ctree")

# Load from file
loaded_tree = CompactTree("tree.ctree")

# Serialize with gzip compression
tree.serialize("tree.ctree.gz", storage_options={"compression": "gzip"})
loaded_gz = CompactTree("tree.ctree.gz", storage_options={"compression": "gzip"})

# Pickle support
import pickle
data = pickle.dumps(tree)
tree2 = pickle.loads(data)

# Convert back to plain dict
plain_dict = loaded_tree.to_dict()

How It Works

LOUDS (Level-Order Unary Degree Sequence)

LOUDS is a succinct tree representation that encodes an ordered tree into a single bit string using roughly 2n bits for n nodes (close to the information-theoretic minimum).

Encoding rule: Traverse the tree in breadth-first (level) order. For each node, write d 1-bits (where d is the node's number of children) followed by a single 0-bit.

Example for a root with children A (2 kids) and B (0 kids):

root  ->  1 1 0      (2 children, then 0-terminator)
A     ->  1 1 0      (2 children)
B     ->  0          (leaf)

Navigation relies on rank and select queries:

Operation Description
first_child(v) Find the (v-1)-th 0, check next position
next_sibling(v) Find the 1-bit for node v, check next position

MarisaTrie

MarisaTrie is a compact word-to-index mapping built on a LOUDS-encoded trie with path compression and minimal perfect hashing (MPH). CompactTree uses two MarisaTrie instances -- one for keys and one for values -- to provide DAWG-style deduplication.

  • Path compression: single-child edges are merged for compactness
  • Dense indexing: every unique word gets an index in [0, N)
  • Reverse lookup: recover the original word from its index
  • Bulk enumeration: to_dict() returns {word: index} for all words in O(N) via a single DFS, with a zero-LOUDS fast path for newly constructed tries
  • C-level LRU cache: per-instance functools.lru_cache on index() lookups; cache size set automatically from vocabulary size or via the vocabulary_size hint on from_dict()

DAWG-Style Deduplication

  • Keys are collected, sorted, and deduplicated via a MarisaTrie
  • Values (leaves) are similarly deduplicated via a second MarisaTrie
  • Edge labels store integer IDs rather than raw strings
  • Same key/value appearing multiple times is stored only once

Architecture

CompactTree
  |
  +-- louds      : LOUDS       bit-vector tree topology (Poppy rank/select)
  +-- elbl       : bytes       edge labels  (uint32 key ids, 4 bytes per node)
  +-- vcol       : bytes       value column (uint32: value id or 0xFFFFFFFF for internal nodes)
  +-- _key_trie  : MarisaTrie  key vocabulary (word <-> dense index)
  +-- _val_trie  : MarisaTrie  value vocabulary (word <-> dense index)

Binary Format (v4)

Magic   : 5 bytes   "CTree"
Version : 8 bytes   uint64 LE (always 4; v3 files are still readable)
Header  : 7 x 8 bytes  lengths of: keys, values, louds, vcol, elbl,
                         key_vocab_size, val_vocab_size
Payload : keys_bytes | val_bytes | louds_bytes | vcol_bytes | elbl_bytes

keys_bytes and val_bytes are serialised MarisaTrie instances. louds_bytes is the raw bitarray, vcol_bytes and elbl_bytes are packed uint32 arrays. key_vocab_size and val_vocab_size record the LRU cache sizes used during from_dict and are restored on load so query-time caches are immediately correctly sized.

Dependencies

  • bitarray — Mutable bit arrays
  • succinct (Poppy) — Rank/select in O(1)
  • fsspec — Filesystem abstraction for local and remote storage

Testing

pytest test_compact_tree.py test_marisa_trie.py

Benchmarks

Run performance benchmarks with pytest-benchmark:

pytest test_compact_tree.py::TestLoadPerformance --benchmark-only -v

See BENCHMARK_RESULTS.md for detailed results and OPTIMIZATIONS.md for optimization history.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Based on LOUDS (Level-Order Unary Degree Sequence) tree representation
  • Uses Poppy rank/select implementation from the succinct library
  • Inspired by DAWG (Directed Acyclic Word Graph) compression techniques

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

savov_compact_tree-1.2.1.tar.gz (26.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

savov_compact_tree-1.2.1-py3-none-any.whl (17.4 kB view details)

Uploaded Python 3

File details

Details for the file savov_compact_tree-1.2.1.tar.gz.

File metadata

  • Download URL: savov_compact_tree-1.2.1.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for savov_compact_tree-1.2.1.tar.gz
Algorithm Hash digest
SHA256 3851669783f0635b75e52d38a792bcc110c5dc598bcc09148646543fea4bd657
MD5 b4fe2257828a951a988cc2dd80933de3
BLAKE2b-256 8091563562648549ac3b7a14edb27ef5c01a5bb71844766c786187a16b764d07

See more details on using hashes here.

Provenance

The following attestation bundles were made for savov_compact_tree-1.2.1.tar.gz:

Publisher: publish.yml on andrey-savov/compact-tree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file savov_compact_tree-1.2.1-py3-none-any.whl.

File metadata

File hashes

Hashes for savov_compact_tree-1.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e891ca22d3fa04df434169a3d0efa6654b013256bc6cdc25c6cc47d1a28cda69
MD5 5b75c6373b8c09a74d1cf8a569143250
BLAKE2b-256 31b52bc94342bd5e3d7611ad08593e4604a7eac34386e76259014af82f8ec80b

See more details on using hashes here.

Provenance

The following attestation bundles were made for savov_compact_tree-1.2.1-py3-none-any.whl:

Publisher: publish.yml on andrey-savov/compact-tree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page