# ContextFit

A token-native knowledge base designed for LLM scale.
ContextFit keeps everything—storage, indexing, search, relationships, traversal, and commonality detection—inside discrete token-ID space until the very last step, when you decode only the final retrieved token chunks for the LLM's output.
## Why Token-Native?
- ~2× smaller storage than raw text (no repeated tokenization)
- Blazing-fast integer-only operations (no float embeddings)
- Hierarchical "geo-map-style" traversal for multi-hop reasoning
- Neural-network-like chunk relationships via token overlap graphs
- Automatic commonality discovery without vector spaces
- Direct LLM injection — feed `input_ids` directly, no conversion
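The ~2× storage figure follows from simple arithmetic. A sketch of the comparison, where the 4-bytes-of-text-per-token average is an illustrative assumption rather than a measured constant:

```python
from array import array

# Compare raw UTF-8 text size with the same content stored as uint16
# token IDs, assuming a typical BPE tokenizer emits roughly one token
# per 4 bytes of English text (illustrative figure, not measured).
text = "token-native storage keeps everything as integer IDs " * 200
text_bytes = len(text.encode("utf-8"))

approx_tokens = text_bytes // 4              # ~4 bytes of text per token
ids = array("H", [i % 65536 for i in range(approx_tokens)])  # uint16 IDs
id_bytes = len(ids) * ids.itemsize           # 2 bytes per ID

print(f"text: {text_bytes} B, IDs: {id_bytes} B, "
      f"ratio: {text_bytes / id_bytes:.1f}x")  # ratio: 2.0x
```

The ratio also explains the uint16/uint32 split below: a 2-byte ID covers vocabularies up to 65,536 tokens, beyond which 4-byte IDs are needed and the savings shrink.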
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                           ContextFit                            │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Storage   │  │    Index    │  │          Graph          │  │
│  │             │  │             │  │                         │  │
│  │ Token Arrays│  │ Inverted    │  │ Chunk Relationships     │  │
│  │ Chunk Store │  │ Suffix/FM   │  │ Community Detection     │  │
│  │ Compression │  │ BM25 Tokens │  │ Commonality Mining      │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│                                                                 │
│  ┌─────────────────────────────┐  ┌─────────────────────────┐   │
│  │          Hierarchy          │  │        Retrieval        │   │
│  │                             │  │                         │   │
│  │ Level 0: Raw Chunks         │  │ Query Tokenization      │   │
│  │ Level 1+: Summary Clusters  │  │ Graph Traversal         │   │
│  │ Geo-Map Navigation          │  │ Direct input_ids Output │   │
│  └─────────────────────────────┘  └─────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                     Semantic IDs (SIDs)                     ││
│  │                                                             ││
│  │ Hierarchical token sequences → generative retrieval         ││
│  │ Similar chunks share prefixes → trie-like navigation        ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. Storage Layer
- Token arrays (uint16/uint32 IDs)
- Memory-mapped files for large corpora
- Delta encoding + Zstd compression
- Chunk metadata headers
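The delta-encode-then-compress step can be sketched as follows. This is a minimal illustration, not ContextFit's actual implementation: it uses stdlib `zlib` in place of Zstd, and the chunk's token IDs are invented.

```python
import zlib
from array import array

def delta_encode(ids):
    """Replace each token ID with its difference from the previous one;
    runs of nearby IDs become small, repetitive deltas that compress well."""
    out, prev = [], 0
    for t in ids:
        out.append(t - prev)
        prev = t
    return out

def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out

# Hypothetical chunk of token IDs (signed 32-bit, since deltas can be negative).
chunk = [1024, 1031, 1030, 2048, 2049, 2051, 1024, 1031]
packed = zlib.compress(array("i", delta_encode(chunk)).tobytes())
restored = delta_decode(list(array("i", zlib.decompress(packed))))
assert restored == chunk  # lossless round trip
```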
### 2. Index Layer
- Inverted Index: tokenID → [(chunkID, positions)] using Roaring bitmaps
- Suffix Array / FM-Index: Instant exact n-gram search
- BM25 on Tokens: TF-IDF scoring with token IDs as terms
- Binary postings pack: one compact `postings.bin` instead of JSON-per-token files
### 3. Graph Layer
- Nodes = chunks (or Semantic IDs)
- Edges = token n-gram overlap, Jaccard similarity, co-occurrence
- MinHash + LSH for fast similarity without floats
- Community detection for commonality discovery
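The MinHash step can be sketched like this. Signature construction is integer-only; the final comparison divides to report an estimate, purely for readability. The hash family and set sizes are illustrative, not ContextFit's actual parameters.

```python
import random

random.seed(0)
PRIME = (1 << 61) - 1  # Mersenne prime for the universal hash family
HASHES = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(128)]

def minhash(token_ids):
    """128 integer minima per chunk; equal slots signal shared tokens."""
    distinct = set(token_ids)
    return [min((a * t + b) % PRIME for t in distinct) for a, b in HASHES]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(range(0, 100))
b = minhash(range(50, 150))  # true Jaccard = 50 / 150 ≈ 0.33
print(round(estimate_jaccard(a, b), 2))
```

LSH then bands these signatures so that chunks agreeing on a band hash into the same bucket, turning all-pairs similarity search into bucket lookups.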
### 4. Hierarchy Layer
- Level 0: Raw token chunks (256–1024 tokens each)
- Level 1+: Clustered summaries as token sequences
- GraphRAG-style community summaries
- Integer pointers for zoom navigation
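A toy sketch of the zoom idea (the structure and token IDs are assumed for illustration, not taken from ContextFit's code): each level-1 summary keeps integer pointers to its level-0 members, so zooming in is a plain list lookup.

```python
# Level 0: raw token chunks (token IDs invented for the example).
level0 = [[101, 7, 42], [101, 9, 42], [300, 301], [300, 305]]

# Level 1: summary clusters, themselves token sequences, holding
# integer pointers to their children.
level1 = [
    {"summary_tokens": [500, 501], "children": [0, 1]},
    {"summary_tokens": [600, 601], "children": [2, 3]},
]

def zoom_in(node_id):
    """Follow integer pointers from a summary node down to raw chunks."""
    return [level0[i] for i in level1[node_id]["children"]]

print(zoom_in(1))  # -> [[300, 301], [300, 305]]
```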
### 5. Retrieval Layer
- Tokenize query → search indexes → traverse graph → collect token IDs
- Feed directly as `input_ids` to any LLM
- No detokenization until final generation
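The whole pipeline in miniature, with a toy whitespace "tokenizer" standing in for a real BPE tokenizer (vocabulary and chunks invented for the example):

```python
VOCAB = {}

def toy_tokenize(text):
    """Stand-in for a real tokenizer: one integer ID per distinct word."""
    return [VOCAB.setdefault(w, len(VOCAB)) for w in text.lower().split()]

chunks = {
    0: toy_tokenize("contextfit stores token ids"),
    1: toy_tokenize("floats live somewhere else"),
}
postings = {}
for cid, ids in chunks.items():
    for tok in ids:
        postings.setdefault(tok, set()).add(cid)

def retrieve(query):
    """Tokenize, intersect postings, and return matched chunks' token IDs."""
    q = toy_tokenize(query)
    hits = set.intersection(*(postings.get(t, set()) for t in q)) if q else set()
    # Concatenated token IDs are ready to use as input_ids -- no decoding.
    return [t for cid in sorted(hits) for t in chunks[cid]]

print(retrieve("token ids"))  # -> [0, 1, 2, 3]
```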
### 6. Semantic IDs
- Assign each chunk a short hierarchical SID token sequence
- Similar chunks share prefixes via MinHash-band residual buckets
- Resolve generated/predicted SID prefixes through a trie with prefix backoff
- Retrieval mode: `--method sid` or hybrid SID + BM25
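Trie resolution with prefix backoff can be sketched as follows (the SID shapes and chunk names are hypothetical):

```python
# Chunk SIDs: short hierarchical token sequences (values invented).
sids = {(3, 1, 4): "chunk-a", (3, 1, 5): "chunk-b", (2, 7, 7): "chunk-c"}

trie = {}
for sid, chunk in sids.items():
    node = trie
    for tok in sid:
        node = node.setdefault(tok, {})
    node["$"] = chunk  # terminal marker holding the chunk ID

def resolve(predicted):
    """Walk the trie along the predicted SID; on a miss, back off and
    return every chunk under the deepest matching prefix."""
    node = trie
    for tok in predicted:
        if tok not in node:
            break  # prefix backoff: stop at the deepest match
        node = node[tok]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, val in current.items():
            if key == "$":
                found.append(val)
            else:
                stack.append(val)
    return sorted(found)

print(resolve((3, 1, 9)))  # -> ['chunk-a', 'chunk-b'] via prefix (3, 1)
```

Because similar chunks share SID prefixes, the backoff naturally returns a cluster of related chunks rather than failing on an inexact prediction.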
### 7. SID Generator
- Predicts SID prefixes from query tokens without detokenizing
- Combines BM25 candidate chunks, MinHash similarity, and LSH neighbors
- Candidate chunks vote for hierarchical SID prefixes
- Returns generated SID predictions plus resolved chunk IDs
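The voting step, sketched with invented candidates and scores: each candidate chunk supports every prefix of its own hierarchical SID, weighted by its retrieval score, and the best-supported prefix wins.

```python
from collections import Counter

# Candidate chunks (keyed by SID) with retrieval scores; the values
# here are invented for illustration.
candidates = {
    (3, 1, 4): 2.0,  # strong BM25 match
    (3, 1, 5): 1.5,  # MinHash neighbor
    (2, 7, 7): 1.8,  # LSH neighbor
}

votes = Counter()
for sid, score in candidates.items():
    # Each candidate votes for every prefix of its hierarchical SID.
    for depth in range(1, len(sid) + 1):
        votes[sid[:depth]] += score

best_prefix, support = votes.most_common(1)[0]
print(best_prefix, support)  # the best-supported SID prefix
```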
### 8. Learned SID Generator
- Trains a sparse token→SID associative model from stored chunks
- Uses beam search over valid SID prefixes
- No neural dependency yet; still token-native and deterministic
- CLI: `contextfit ingest ./docs --train-sid-generator`
## Getting Started

```bash
# Install dependencies
pip install -e .

# Ingest a knowledge base
contextfit ingest ./documents --tokenizer tiktoken

# Query
contextfit query "What is ContextFit?"

# Query through Semantic IDs
contextfit query "async retrieval" --method sid

# Agent-friendly machine-readable output
contextfit query "What is ContextFit?" --method hybrid --json
contextfit stats --json

# Run a deterministic sample benchmark
python examples/benchmark_sample_corpus.py --docs-per-topic 100 --json

# Run a needle-in-a-haystack benchmark
python examples/benchmark_needle_haystack.py --needles 20 --distractors 200 --top-k 5 --json

# Ingest and train the learned SID generator
contextfit ingest ./documents --train-sid-generator
```
For installing on a MacBook/OpenClaw node, see `docs/MACBOOK_CLI_DEPLOY.md`.
For OpenClaw integration, including the `contextfit_search` tool and the ContextFit context engine plugin, see `docs/OPENCLAW_INTEGRATION.md`.
`--json` is intended for OpenClaw/agent use. Query JSON includes `input_ids`, retrieved chunk metadata, SID predictions, semantic IDs, and decoded previews.
## Current Storage Layout

```
contextfit_kb/
  chunks/
    chunks.bin    # zstd-compressed token-array records
    index.json    # chunk_id → byte offset/length
  inverted/
    meta.json     # corpus/index metadata
    postings.bin  # compact binary token → roaring bitmap + positions pack
  sid/
    semantic_ids.json
    learned_sid_generator.json
```
The inverted index now saves as a single binary postings pack by default. Legacy JSON-per-token indexes still load for compatibility.
## Project Status
🚧 Early Development — Architecture phase
## References
- TERAG: Token-Efficient GraphRAG (3–11% token reduction)
- Semantic IDs / Generative Retrieval
- GraphRAG community detection
- Letta's token-space learning
## License
MIT
## File details

Details for the file `contextfit-0.1.0.tar.gz`.

### File metadata

- Download URL: contextfit-0.1.0.tar.gz
- Upload date:
- Size: 678.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9224389fd81f92da67b348b62bb1f09c7775f66ad367a5d52c000ae5ddde6fe1` |
| MD5 | `567167f834df8054f0c677e074b751a8` |
| BLAKE2b-256 | `f4ab659dfb063c4f15515089c560ea0124149067f12b4bac90481ff44b94f305` |
## File details

Details for the file `contextfit-0.1.0-py3-none-any.whl`.

### File metadata

- Download URL: contextfit-0.1.0-py3-none-any.whl
- Upload date:
- Size: 94.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `831aa1e8e541b9a1fb7f5096c06cc5d1dc68a655cbc1e897dc18849fa74a31fe` |
| MD5 | `ae21cca64ac9d64df10afb08b9ede2fb` |
| BLAKE2b-256 | `28bf4e73c02e728f338dcd5a2bde825328516488dc320b96e45c08b57e55c0b6` |