ContextFit

A token-native knowledge base designed for LLM scale.

ContextFit keeps everything—storage, indexing, search, relationships, traversal, and commonality detection—inside discrete token-ID space until the very last step, when the final retrieved token chunks are either decoded for display or fed straight to the LLM as input_ids.

Why Token-Native?

  • ~2× smaller storage than raw text (no repeated tokenization)
  • Blazing-fast integer-only operations (no float embeddings)
  • Hierarchical "geo-map-style" traversal for multi-hop reasoning
  • Neural-network-like chunk relationships via token overlap graphs
  • Automatic commonality discovery without vector spaces
  • Direct LLM injection — feed input_ids directly, no conversion
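The last bullet is the core trick: because chunks are stored as token IDs, retrieval output can be spliced straight into a model's input. A minimal sketch (the store, chunk IDs, and separator token are illustrative, not the ContextFit API):

```python
# Retrieved chunks are already token IDs, so assembling the model input is
# list concatenation; no text round-trip. Store, IDs, and the separator
# token are illustrative, not the ContextFit API.

def build_input_ids(query_ids, retrieved_chunks, sep_id=0):
    """Splice query tokens and retrieved chunk tokens into one input_ids list."""
    input_ids = list(query_ids)
    for chunk in retrieved_chunks:
        input_ids.append(sep_id)   # separator before each chunk
        input_ids.extend(chunk)
    return input_ids

store = {7: [101, 102, 103], 9: [104, 105]}   # chunk_id -> stored token IDs
input_ids = build_input_ids([55, 56], [store[7], store[9]])
```

The resulting list is ready to pass to any model that accepts raw input_ids.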

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         ContextFit                               │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Storage   │  │   Index     │  │        Graph            │  │
│  │             │  │             │  │                         │  │
│  │ Token Arrays│  │ Inverted    │  │ Chunk Relationships     │  │
│  │ Chunk Store │  │ Suffix/FM   │  │ Community Detection     │  │
│  │ Compression │  │ BM25 Tokens │  │ Commonality Mining      │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│                                                                  │
│  ┌─────────────────────────────┐  ┌─────────────────────────┐   │
│  │        Hierarchy            │  │       Retrieval         │   │
│  │                             │  │                         │   │
│  │ Level 0: Raw Chunks         │  │ Query Tokenization      │   │
│  │ Level 1+: Summary Clusters  │  │ Graph Traversal         │   │
│  │ Geo-Map Navigation          │  │ Direct input_ids Output │   │
│  └─────────────────────────────┘  └─────────────────────────┘   │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Semantic IDs (SIDs)                      ││
│  │                                                             ││
│  │  Hierarchical token sequences → generative retrieval        ││
│  │  Similar chunks share prefixes → trie-like navigation       ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Core Components

1. Storage Layer

  • Token arrays (uint16/uint32 IDs)
  • Memory-mapped files for large corpora
  • Delta encoding + Zstd compression
  • Chunk metadata headers
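A minimal sketch of the record format these bullets describe, packing uint16 token IDs and compressing them; zlib stands in for Zstd (which is not in the stdlib), and delta encoding is omitted for brevity:

```python
import struct
import zlib

# Token IDs packed as little-endian uint16, then compressed. zlib stands in
# for Zstd; delta encoding and metadata headers are omitted for brevity.

def pack_chunk(token_ids):
    """Serialize one chunk of uint16 token IDs into a compressed record."""
    raw = struct.pack(f"<{len(token_ids)}H", *token_ids)
    return zlib.compress(raw)

def unpack_chunk(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

chunk = [1012, 1013, 1013, 7, 42000]
blob = pack_chunk(chunk)
```

Fixed-width integers are what makes memory-mapped access practical: a record's size is known from its token count.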

2. Index Layer

  • Inverted Index: tokenID → [(chunkID, positions)] using Roaring bitmaps
  • Suffix Array / FM-Index: Fast exact n-gram search
  • BM25 on Tokens: TF-IDF scoring with token IDs as terms
  • Binary postings pack: one compact postings.bin instead of JSON-per-token files

3. Graph Layer

  • Nodes = chunks (or Semantic IDs)
  • Edges = token n-gram overlap, Jaccard similarity, co-occurrence
  • MinHash + LSH for fast similarity without floats
  • Community detection for commonality discovery
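How float-free similarity can work, as a sketch: MinHash signatures are built with integer hashes only, and the fraction of matching signature slots estimates Jaccard similarity. The hash family and seed count here are illustrative:

```python
import random

# Integer-only MinHash: each chunk's token set is reduced to a signature of
# per-seed minima; matching slots estimate Jaccard similarity, with floats
# appearing only in the final division. Hash family and seed count are
# illustrative choices.

PRIME = 2_147_483_647

def minhash(token_set, seeds):
    return [min((a * t + b) % PRIME for t in token_set) for a, b in seeds]

def est_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(0)
seeds = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(64)]
sig_a = minhash({1, 2, 3, 4, 5}, seeds)
sig_b = minhash({1, 2, 3, 4, 6}, seeds)   # true Jaccard = 4/6
```

Banding the signature (LSH) then groups chunks that share a band, yielding candidate edges without all-pairs comparison.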

4. Hierarchy Layer

  • Level 0: Raw token chunks (256–1024 tokens each)
  • Level 1+: Clustered summaries as token sequences
  • GraphRAG-style community summaries
  • Integer pointers for zoom navigation
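A sketch of the zoom navigation, under the assumption of a toy node table (real levels hold summary token sequences): each node stores integer pointers to its children, so drilling down is pure pointer chasing:

```python
# Each hierarchy node stores integer pointers to its children; IDs absent
# from the table are level-0 raw chunks. Node IDs and levels are toy data.

hierarchy = {
    # node_id: (level, child_ids)
    100: (1, [1, 2]),      # level-1 summary over raw chunks 1 and 2
    101: (1, [3]),
    200: (2, [100, 101]),  # level-2 summary over the level-1 nodes
}

def zoom(node_id):
    """Expand one level down: the geo-map 'zoom in' step."""
    return hierarchy[node_id][1]

def drill_to_chunks(node_id):
    """Follow pointers all the way down to raw chunk IDs."""
    if node_id not in hierarchy:   # leaf: a raw level-0 chunk
        return [node_id]
    out = []
    for child in hierarchy[node_id][1]:
        out.extend(drill_to_chunks(child))
    return out
```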

5. Retrieval Layer

  • Tokenize query → search indexes → traverse graph → collect token IDs
  • Feed directly as input_ids to any LLM
  • No detokenization until final generation
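The pipeline above, end to end, as a hedged sketch: overlap counting stands in for BM25 scoring and graph traversal, and the chunk store is a toy dict rather than the real storage layer:

```python
# Query token IDs -> score chunks -> concatenate the winners' token IDs.
# Overlap counting stands in for BM25 and graph traversal; the chunk store
# is a toy dict, not the real storage layer.

def retrieve(query_ids, chunks, top_k=2):
    scores = {cid: len(set(query_ids) & set(toks)) for cid, toks in chunks.items()}
    ranked = sorted(scores, key=lambda c: (-scores[c], c))[:top_k]
    input_ids = [t for cid in ranked for t in chunks[cid]]
    return ranked, input_ids

chunks = {1: [5, 6, 7], 2: [6, 8], 3: [9]}
ranked, input_ids = retrieve([5, 6], chunks)
```

Every step operates on integers; text never appears between query tokenization and final generation.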

6. Semantic IDs

  • Assign each chunk a short hierarchical SID token sequence
  • Similar chunks share prefixes via MinHash-band residual buckets
  • Resolve generated/predicted SID prefixes through a trie with prefix backoff
  • Retrieval mode: --method sid or hybrid SID + BM25
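A sketch of prefix resolution with backoff; a linear scan stands in for the trie, and the SID values are arbitrary toy tuples:

```python
# Chunks keyed by hierarchical SID tuples; a predicted prefix is resolved by
# trying the full prefix first, then backing off to shorter ones. A linear
# scan stands in for the trie; SID values are toy data.

def resolve(prefix, sids):
    """Return chunk IDs under the longest matching prefix of `prefix`."""
    for cut in range(len(prefix), 0, -1):
        hits = [cid for cid, sid in sids.items() if sid[:cut] == tuple(prefix[:cut])]
        if hits:
            return sorted(hits)
    return sorted(sids)   # no shared prefix at all: widest fallback

sids = {7: (3, 1, 4), 8: (3, 1, 5), 9: (2, 7, 1)}
```

Because similar chunks share prefixes, a partially wrong prediction still lands in the right neighborhood instead of failing outright.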

7. SID Generator

  • Predicts SID prefixes from query tokens without detokenizing
  • Combines BM25 candidate chunks, MinHash similarity, and LSH neighbors
  • Candidate chunks vote for hierarchical SID prefixes
  • Returns generated SID predictions plus resolved chunk IDs
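The voting step can be sketched as follows; candidate selection (BM25, MinHash, LSH) is assumed to have already produced the candidate list, and ties prefer the longer, more specific prefix. All values are toy data:

```python
from collections import Counter

# Candidate chunks vote for every prefix of their SID; the best-supported,
# longest prefix becomes the prediction. Candidates and SIDs are toy data.

def vote_prefix(candidates, sids):
    votes = Counter()
    for cid in candidates:
        sid = sids[cid]
        for cut in range(1, len(sid) + 1):
            votes[sid[:cut]] += 1
    # highest vote count wins; ties go to the longer (more specific) prefix
    return max(votes, key=lambda p: (votes[p], len(p)))

sids = {7: (3, 1, 4), 8: (3, 1, 5), 9: (2, 7, 1)}
predicted = vote_prefix([7, 8, 9], sids)
```

Here chunks 7 and 8 agree on the prefix (3, 1), which outvotes any complete SID, so the predicted prefix names their shared neighborhood.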

8. Learned SID Generator

  • Trains a sparse token→SID associative model from stored chunks
  • Uses beam search over valid SID prefixes
  • No neural dependency yet; still token-native and deterministic
  • CLI: contextfit ingest ./docs --train-sid-generator

Getting Started

# Install dependencies
pip install -e .

# Ingest a knowledge base
contextfit ingest ./documents --tokenizer tiktoken

# Query
contextfit query "What is ContextFit?"

# Query through Semantic IDs
contextfit query "async retrieval" --method sid

# Agent-friendly machine-readable output
contextfit query "What is ContextFit?" --method hybrid --json
contextfit stats --json

# Run a deterministic sample benchmark
python examples/benchmark_sample_corpus.py --docs-per-topic 100 --json

# Run needle-in-a-haystack benchmark
python examples/benchmark_needle_haystack.py --needles 20 --distractors 200 --top-k 5 --json

# Ingest and train the learned SID generator
contextfit ingest ./documents --train-sid-generator

For installing on a MacBook/OpenClaw node, see docs/MACBOOK_CLI_DEPLOY.md.

For OpenClaw integration, including the contextfit_search tool and contextfit context engine plugin, see docs/OPENCLAW_INTEGRATION.md.

--json is intended for OpenClaw/agent use. Query JSON includes input_ids, retrieved chunk metadata, SID predictions, semantic IDs, and decoded previews.

Current Storage Layout

contextfit_kb/
  chunks/
    chunks.bin        # zstd-compressed token-array records
    index.json        # chunk_id → byte offset/length
  inverted/
    meta.json         # corpus/index metadata
    postings.bin      # compact binary token → roaring bitmap + positions pack
  sid/
    semantic_ids.json
    learned_sid_generator.json

The inverted index is now saved as a single binary postings pack by default. Legacy JSON-per-token indexes still load for compatibility.
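A sketch of reading and writing the chunks/ part of this layout; zlib stands in for zstd, and the record format (uint32 IDs, offset/length pairs) is illustrative rather than ContextFit's actual encoding:

```python
import json
import os
import struct
import tempfile
import zlib

# index.json maps chunk_id -> [offset, length] into chunks.bin, so one seek
# and one read fetch any chunk. zlib and uint32 records are stand-ins.

def write_store(chunks, bin_path, idx_path):
    index, offset = {}, 0
    with open(bin_path, "wb") as f:
        for cid, toks in chunks.items():
            rec = zlib.compress(struct.pack(f"<{len(toks)}I", *toks))
            f.write(rec)
            index[cid] = [offset, len(rec)]
            offset += len(rec)
    with open(idx_path, "w") as f:
        json.dump(index, f)

def read_chunk(cid, bin_path, idx_path):
    with open(idx_path) as f:
        off, length = json.load(f)[str(cid)]   # JSON keys are strings
    with open(bin_path, "rb") as f:
        f.seek(off)
        raw = zlib.decompress(f.read(length))
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))

kb = tempfile.mkdtemp()
bin_path = os.path.join(kb, "chunks.bin")
idx_path = os.path.join(kb, "index.json")
write_store({1: [10, 20], 2: [30]}, bin_path, idx_path)
restored = read_chunk(2, bin_path, idx_path)
```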

Project Status

🚧 Early Development — Architecture phase

References

  • TERAG: Token-Efficient GraphRAG (3–11% token reduction)
  • Semantic IDs / Generative Retrieval
  • GraphRAG community detection
  • Letta's token-space learning

License

MIT
