ContextFit

A token-native knowledge base designed for LLM scale.

ContextFit keeps everything—storage, indexing, search, relationships, traversal, and commonality detection—inside discrete token-ID space until the very last step, when the final retrieved token chunks are either decoded for display or fed straight to the LLM as input_ids.

Why Token-Native?

  • ~2× smaller storage than raw text (no repeated tokenization)
  • Blazing-fast integer-only operations (no float embeddings)
  • Hierarchical "geo-map-style" traversal for multi-hop reasoning
  • Neural-network-like chunk relationships via token overlap graphs
  • Automatic commonality discovery without vector spaces
  • Direct LLM injection — feed input_ids directly, no conversion
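The last bullet is the core trick: because chunks are stored as token IDs, retrieval output can be spliced straight into a model's input. A minimal sketch (the store, chunk IDs, and separator token are illustrative, not the ContextFit API):

```python
# Retrieved chunks are already token IDs, so assembling the model input is
# list concatenation; no text round-trip. Store, IDs, and the separator
# token are illustrative, not the ContextFit API.

def build_input_ids(query_ids, retrieved_chunks, sep_id=0):
    """Splice query tokens and retrieved chunk tokens into one input_ids list."""
    input_ids = list(query_ids)
    for chunk in retrieved_chunks:
        input_ids.append(sep_id)   # separator before each chunk
        input_ids.extend(chunk)
    return input_ids

store = {7: [101, 102, 103], 9: [104, 105]}   # chunk_id -> stored token IDs
input_ids = build_input_ids([55, 56], [store[7], store[9]])
```

The resulting list is ready to pass to any model that accepts raw input_ids.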

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         ContextFit                               │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Storage   │  │   Index     │  │        Graph            │  │
│  │             │  │             │  │                         │  │
│  │ Token Arrays│  │ Inverted    │  │ Chunk Relationships     │  │
│  │ Chunk Store │  │ Suffix/FM   │  │ Community Detection     │  │
│  │ Compression │  │ BM25 Tokens │  │ Commonality Mining      │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│                                                                  │
│  ┌─────────────────────────────┐  ┌─────────────────────────┐   │
│  │        Hierarchy            │  │       Retrieval         │   │
│  │                             │  │                         │   │
│  │ Level 0: Raw Chunks         │  │ Query Tokenization      │   │
│  │ Level 1+: Summary Clusters  │  │ Graph Traversal         │   │
│  │ Geo-Map Navigation          │  │ Direct input_ids Output │   │
│  └─────────────────────────────┘  └─────────────────────────┘   │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Semantic IDs (SIDs)                      ││
│  │                                                             ││
│  │  Hierarchical token sequences → generative retrieval        ││
│  │  Similar chunks share prefixes → trie-like navigation       ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Core Components

1. Storage Layer

  • Token arrays (uint16/uint32 IDs)
  • Memory-mapped files for large corpora
  • Delta encoding + Zstd compression
  • Chunk metadata headers
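A minimal sketch of the record format these bullets describe, packing uint16 token IDs and compressing them; zlib stands in for Zstd (which is not in the stdlib), and delta encoding is omitted for brevity:

```python
import struct
import zlib

# Token IDs packed as little-endian uint16, then compressed. zlib stands in
# for Zstd; delta encoding and metadata headers are omitted for brevity.

def pack_chunk(token_ids):
    """Serialize one chunk of uint16 token IDs into a compressed record."""
    raw = struct.pack(f"<{len(token_ids)}H", *token_ids)
    return zlib.compress(raw)

def unpack_chunk(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

chunk = [1012, 1013, 1013, 7, 42000]
blob = pack_chunk(chunk)
```

Fixed-width integers are what makes memory-mapped access practical: a record's size is known from its token count.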

2. Index Layer

  • Inverted Index: tokenID → [(chunkID, positions)] using Roaring bitmaps
  • Suffix Array / FM-Index: Fast exact n-gram search
  • BM25 on Tokens: TF-IDF scoring with token IDs as terms
  • Binary postings pack: one compact postings.bin instead of JSON-per-token files

3. Graph Layer

  • Nodes = chunks (or Semantic IDs)
  • Edges = token n-gram overlap, Jaccard similarity, co-occurrence
  • MinHash + LSH for fast similarity without floats
  • Community detection for commonality discovery
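How float-free similarity can work, as a sketch: MinHash signatures are built with integer hashes only, and the fraction of matching signature slots estimates Jaccard similarity. The hash family and seed count here are illustrative:

```python
import random

# Integer-only MinHash: each chunk's token set is reduced to a signature of
# per-seed minima; matching slots estimate Jaccard similarity, with floats
# appearing only in the final division. Hash family and seed count are
# illustrative choices.

PRIME = 2_147_483_647

def minhash(token_set, seeds):
    return [min((a * t + b) % PRIME for t in token_set) for a, b in seeds]

def est_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(0)
seeds = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(64)]
sig_a = minhash({1, 2, 3, 4, 5}, seeds)
sig_b = minhash({1, 2, 3, 4, 6}, seeds)   # true Jaccard = 4/6
```

Banding the signature (LSH) then groups chunks that share a band, yielding candidate edges without all-pairs comparison.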

4. Hierarchy Layer

  • Level 0: Raw token chunks (256–1024 tokens each)
  • Level 1+: Clustered summaries as token sequences
  • GraphRAG-style community summaries
  • Integer pointers for zoom navigation
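A sketch of the zoom navigation, under the assumption of a toy node table (real levels hold summary token sequences): each node stores integer pointers to its children, so drilling down is pure pointer chasing:

```python
# Each hierarchy node stores integer pointers to its children; IDs absent
# from the table are level-0 raw chunks. Node IDs and levels are toy data.

hierarchy = {
    # node_id: (level, child_ids)
    100: (1, [1, 2]),      # level-1 summary over raw chunks 1 and 2
    101: (1, [3]),
    200: (2, [100, 101]),  # level-2 summary over the level-1 nodes
}

def zoom(node_id):
    """Expand one level down: the geo-map 'zoom in' step."""
    return hierarchy[node_id][1]

def drill_to_chunks(node_id):
    """Follow pointers all the way down to raw chunk IDs."""
    if node_id not in hierarchy:   # leaf: a raw level-0 chunk
        return [node_id]
    out = []
    for child in hierarchy[node_id][1]:
        out.extend(drill_to_chunks(child))
    return out
```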

5. Retrieval Layer

  • Tokenize query → search indexes → traverse graph → collect token IDs
  • Feed directly as input_ids to any LLM
  • No detokenization until final generation
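The pipeline above, end to end, as a hedged sketch: overlap counting stands in for BM25 scoring and graph traversal, and the chunk store is a toy dict rather than the real storage layer:

```python
# Query token IDs -> score chunks -> concatenate the winners' token IDs.
# Overlap counting stands in for BM25 and graph traversal; the chunk store
# is a toy dict, not the real storage layer.

def retrieve(query_ids, chunks, top_k=2):
    scores = {cid: len(set(query_ids) & set(toks)) for cid, toks in chunks.items()}
    ranked = sorted(scores, key=lambda c: (-scores[c], c))[:top_k]
    input_ids = [t for cid in ranked for t in chunks[cid]]
    return ranked, input_ids

chunks = {1: [5, 6, 7], 2: [6, 8], 3: [9]}
ranked, input_ids = retrieve([5, 6], chunks)
```

Every step operates on integers; text never appears between query tokenization and final generation.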

6. Semantic IDs

  • Assign each chunk a short hierarchical SID token sequence
  • Similar chunks share prefixes via MinHash-band residual buckets
  • Resolve generated/predicted SID prefixes through a trie with prefix backoff
  • Retrieval mode: --method sid or hybrid SID + BM25
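A sketch of prefix resolution with backoff; a linear scan stands in for the trie, and the SID values are arbitrary toy tuples:

```python
# Chunks keyed by hierarchical SID tuples; a predicted prefix is resolved by
# trying the full prefix first, then backing off to shorter ones. A linear
# scan stands in for the trie; SID values are toy data.

def resolve(prefix, sids):
    """Return chunk IDs under the longest matching prefix of `prefix`."""
    for cut in range(len(prefix), 0, -1):
        hits = [cid for cid, sid in sids.items() if sid[:cut] == tuple(prefix[:cut])]
        if hits:
            return sorted(hits)
    return sorted(sids)   # no shared prefix at all: widest fallback

sids = {7: (3, 1, 4), 8: (3, 1, 5), 9: (2, 7, 1)}
```

Because similar chunks share prefixes, a partially wrong prediction still lands in the right neighborhood instead of failing outright.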

7. SID Generator

  • Predicts SID prefixes from query tokens without detokenizing
  • Combines BM25 candidate chunks, MinHash similarity, and LSH neighbors
  • Candidate chunks vote for hierarchical SID prefixes
  • Returns generated SID predictions plus resolved chunk IDs
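The voting step can be sketched as follows; candidate selection (BM25, MinHash, LSH) is assumed to have already produced the candidate list, and ties prefer the longer, more specific prefix. All values are toy data:

```python
from collections import Counter

# Candidate chunks vote for every prefix of their SID; the best-supported,
# longest prefix becomes the prediction. Candidates and SIDs are toy data.

def vote_prefix(candidates, sids):
    votes = Counter()
    for cid in candidates:
        sid = sids[cid]
        for cut in range(1, len(sid) + 1):
            votes[sid[:cut]] += 1
    # highest vote count wins; ties go to the longer (more specific) prefix
    return max(votes, key=lambda p: (votes[p], len(p)))

sids = {7: (3, 1, 4), 8: (3, 1, 5), 9: (2, 7, 1)}
predicted = vote_prefix([7, 8, 9], sids)
```

Here chunks 7 and 8 agree on the prefix (3, 1), which outvotes any complete SID, so the predicted prefix names their shared neighborhood.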

8. Learned SID Generator

  • Trains a sparse token→SID associative model from stored chunks
  • Uses beam search over valid SID prefixes
  • No neural dependency yet; still token-native and deterministic
  • CLI: contextfit ingest ./docs --train-sid-generator

Getting Started

# Install dependencies
pip install -e .

# Ingest a knowledge base
contextfit ingest ./documents --tokenizer tiktoken

# Query
contextfit query "What is ContextFit?"

# Query through Semantic IDs
contextfit query "async retrieval" --method sid

# Agent-friendly machine-readable output
contextfit query "What is ContextFit?" --method hybrid --json
contextfit stats --json

# Run a deterministic sample benchmark
python examples/benchmark_sample_corpus.py --docs-per-topic 100 --json

# Run needle-in-a-haystack benchmark
python examples/benchmark_needle_haystack.py --needles 20 --distractors 200 --top-k 5 --json

# Ingest and train the learned SID generator
contextfit ingest ./documents --train-sid-generator

For installing on a MacBook/OpenClaw node, see docs/MACBOOK_CLI_DEPLOY.md.

For OpenClaw integration, including the contextfit_search tool and contextfit context engine plugin, see docs/OPENCLAW_INTEGRATION.md.

--json is intended for OpenClaw/agent use. Query JSON includes input_ids, retrieved chunk metadata, SID predictions, semantic IDs, and decoded previews.

Current Storage Layout

contextfit_kb/
  chunks/
    chunks.bin        # zstd-compressed token-array records
    index.json        # chunk_id → byte offset/length
  inverted/
    meta.json         # corpus/index metadata
    postings.bin      # compact binary token → roaring bitmap + positions pack
  sid/
    semantic_ids.json
    learned_sid_generator.json

The inverted index is now saved as a single binary postings pack by default. Legacy JSON-per-token indexes still load for compatibility.
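A sketch of reading and writing the chunks/ part of this layout; zlib stands in for zstd, and the record format (uint32 IDs, offset/length pairs) is illustrative rather than ContextFit's actual encoding:

```python
import json
import os
import struct
import tempfile
import zlib

# index.json maps chunk_id -> [offset, length] into chunks.bin, so one seek
# and one read fetch any chunk. zlib and uint32 records are stand-ins.

def write_store(chunks, bin_path, idx_path):
    index, offset = {}, 0
    with open(bin_path, "wb") as f:
        for cid, toks in chunks.items():
            rec = zlib.compress(struct.pack(f"<{len(toks)}I", *toks))
            f.write(rec)
            index[cid] = [offset, len(rec)]
            offset += len(rec)
    with open(idx_path, "w") as f:
        json.dump(index, f)

def read_chunk(cid, bin_path, idx_path):
    with open(idx_path) as f:
        off, length = json.load(f)[str(cid)]   # JSON keys are strings
    with open(bin_path, "rb") as f:
        f.seek(off)
        raw = zlib.decompress(f.read(length))
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))

kb = tempfile.mkdtemp()
bin_path = os.path.join(kb, "chunks.bin")
idx_path = os.path.join(kb, "index.json")
write_store({1: [10, 20], 2: [30]}, bin_path, idx_path)
restored = read_chunk(2, bin_path, idx_path)
```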

Project Status

🚧 Early Development — Architecture phase

References

  • TERAG: Token-Efficient GraphRAG (3–11% token reduction)
  • Semantic IDs / Generative Retrieval
  • GraphRAG community detection
  • Letta's token-space learning

License

MIT
