Fast, local, zero-setup RAG with no embedding model — BM25 + the LLM you plug in.
Project description
Fast, local, zero-setup RAG with no embedding model. The LLM compiles your documents once, offline; queries stay pure BM25: ~1 ms, $0, no GPU, no vector database.
Performance · Quickstart · How to use · Evaluation · Cost · How it works
Performance
Can a retriever with no embedding model compete with dense embeddings? On BEIR scifact, yes.
The honest headline. In the cost-fair tier (retrieval only, no per-query model), Nrag doc2query×2 (no embeddings, no GPU, no vector DB) hits nDCG@10 0.7291 / MRR 0.7042. That statistically ties qwen3-embedding-4b (0.7308) and beats it on MRR, while clearly beating text-embedding-3-small and bge-m3, all at $0 per query. Only the 8B embedder is decisively ahead. Full 30-row, cost-tiered table with ablations: benchmarks/scifact_results.md.
And on BRIGHT, the reasoning-intensive benchmark where off-the-shelf dense collapses (the #1 MTEB model scores just 18.3), embedding-free is exactly where the frontier lives:
| nDCG@10 (avg) | BM25 zero-shot | off-the-shelf dense | BM25 + GPT-4 CoT | LATTICE (emb-free SOTA) |
|---|---|---|---|---|
| 14.3 | 18.3 ⬇ | 27.0 | 46.7 |
Compiled Retrieval's target: beat off-the-shelf dense at $0 query cost, and approach reasoning systems while staying ~1 ms.
Quickstart
pip install nrag
from nrag import Nrag
rag = Nrag(preset="fast") # pure lexical, no LLM, no setup, no models
rag.add_texts(["Dijkstra finds shortest paths.", "Tomato soup needs basil."])
print(rag.search("shortest path", k=1)[0].text)
# -> Dijkstra finds shortest paths.
That's the whole install: no model download, no Java, no GPU, no vector DB. LLM features are pure add-ons. Plug one in when you want compiled enrichment and grounded answers.
How to use
1 · Pure lexical (no LLM)
from nrag import Nrag, Document
rag = Nrag(preset="fast", path="./idx") # on-disk; omit path for in-memory
rag.add("docs/") # a dir, file, glob, texts, or Documents
rag.add("report.pdf") # needs nrag[pdf]
rag.add_texts(["first passage", "second passage"])
rag.add([Document(doc_id="d1", text="...", source="d1.md", metadata={"team": "billing"})])
for h in rag.search("how do refunds work?", k=5):
print(f"{h.score:.3f} {h.source}\n {h.text[:100]}")
rag.close()
2 · Plug in any LLM
The contract is one method, complete. Use the built-in OpenAI-compatible adapter, wrap a function, or pass nothing.
from nrag.llm import OpenAICompatLLM, CallableLLM
# Any OpenAI-compatible endpoint: OpenAI, Ollama, vLLM, llama.cpp, LM Studio, Together, Groq...
llm = OpenAICompatLLM(base_url="http://localhost:11434/v1/", model="llama3.2", api_key="ollama")
# ...or wrap any callable (called with just the prompt string by default)
llm = CallableLLM(lambda prompt: my_model(prompt))
# ...or no LLM at all; pure lexical retrieval still works
3 · Presets
| Preset | LLM | What runs | For |
|---|---|---|---|
fast |
no | pure lexical (BM25 + trigrams + title) | sub-10 ms retrieval, no LLM |
quality (default) |
✔ | + contextual indexing (offline) + query expansion + grounded answers | best general RAG |
compiled |
✔ | + the index-time compiler (CSC) + the adaptive router | reasoning corpora, $0 queries |
Any field is overridable: Nrag(preset="compiled", consensus_k=5, engine="sqlite").
4 · Compiled Retrieval
Retrieval intelligence is a compile-time problem, not a serve-time problem.
One cached offline pass per chunk emits an enrichment bundle: all plain text that lands in the lexical index, never in the cited text:
| Pillar | What the compiler adds | Prior art |
|---|---|---|
| blurb | a chunk-specific context sentence | Anthropic Contextual Retrieval, 2024 |
| questions | the queries this chunk answers | doc2query / docTTTTTquery |
| propositions | atomic, decontextualized facts | Dense X, EMNLP 2024 |
| reasoning | multi-hop bridges not lexically present | the BRIGHT-winning signal, precomputed |
CSC, Consensus Sparse Compilation: the compiler is sampled k times; a term's weight is its agreement across samples, a training-free learned-sparse weighter that doubles as a self-consistency filter (hallucinated terms appear once and are dropped; entailed terms recur and are promoted). Literal anchoring keeps every source-literal term (IDs, error codes) at a floor weight, so exact match is structurally protected.
llm = OpenAICompatLLM(base_url="...", model="...", api_key="...")
rag = Nrag(llm=llm, preset="compiled", path="./idx")
rag.compile("docs/") # offline, cached by content-hash
print(rag.query("does this scale to a billion rows?").answer)
rag = Nrag.open("./idx") # reopen with NO llm, it still serves
Adaptive router: the only query-time LLM use, and it's gated. The first lexical pass is ~1 ms and $0; a cheap confidence signal decides whether to spend one LLM call escalating (expansion + re-search). Short queries are treated as precise and never escalated, dodging the expansion precision trap.
rag.search("how can I get my money back?", k=5)
print(rag.last_route) # RouterDecision(escalate=..., reason='no_hits'|'low_margin'|'confident', ...)
5 · Persistent, incremental, portable
rag = Nrag.open("./idx", llm=llm)
rag.sync("docs/") # re-index only changed files; drop deleted ones
rag.remove("d1") # delete one document
Compile once, serve anywhere (air-gapped). The serving index is a plain lexical artifact. Bundle it and ship it to an on-prem / offline box:
nrag export --index ./idx --out ship.nrag.tgz # portable bundle (drops the LLM cache)
nrag import ship.nrag.tgz --index ./served # unpack on the target machine
nrag query "how do refunds work?" --index ./served # $0, ~1 ms, no model, no network
Or run the hosted compilation service. Clients POST docs and get back a serving bundle; no embedding model ever crosses the wire:
nrag serve --base-url http://localhost:11434/v1/ --model llama3.2 # POST /compile, GET /bundle/<job>
6 · Answers, citations, streaming
res = rag.query("How do refunds work?", k=8)
print(res.answer) # grounded answer (None if no LLM)
for c in res.citations:
print(c.marker, c.source, f"{c.score:.3f}")
for tok in rag.query_stream("How do refunds work?"):
print(tok, end="")
7 · Drop into LangChain / LlamaIndex
from nrag.integrations import to_langchain_retriever, to_llamaindex_retriever
lc = to_langchain_retriever(rag, k=5) # a LangChain BaseRetriever
li = to_llamaindex_retriever(rag, k=5) # a LlamaIndex BaseRetriever
8 · Command line
nrag compile ./docs --index ./idx --base-url http://localhost:11434/v1/ --model llama3.2
nrag query "how do refunds work?" --index ./idx
nrag stats --index ./idx
nrag tco --queries-per-month 5000000 --months 36 # cost model (below)
Evaluation
Nrag ships its own cost-tiered evaluation harness (nrag.eval). The rule: never compare a $0-per-query lexical system against one that pays a model per query without labelling the tier. Credibility is the moat.
The metrics module (pure-Python, no deps)
nrag.eval.ir_metrics implements the standard IR metrics with zero dependencies. Metric strings: ndcg@k, recall@k, precision@k, hit@k, mrr, map.
from nrag.eval import evaluate_run
qrels = {"q1": {"docA": 1, "docC": 1}} # ground-truth relevance
run = {"q1": {"docA": 9.1, "docB": 4.2, "docC": 2.0}} # your system's doc -> score
print(evaluate_run(qrels, run, metrics=("ndcg@10", "recall@10", "mrr")))
# {'ndcg@10': 0.92, 'recall@10': 1.0, 'mrr': 1.0}
Evaluate Nrag on your own labelled queries:
rag = Nrag(preset="fast"); rag.add("corpus/")
run = {}
for qid, text in my_queries.items():
scores = {}
for h in rag.search(text, k=100):
did = h.chunk_id.split("::", 1)[0] # chunk -> parent doc
scores[did] = max(scores.get(did, -1e9), h.score) # max-pool chunks
run[qid] = scores
print(evaluate_run(my_qrels, run, ("ndcg@10", "recall@100", "mrr")))
BEIR & BRIGHT runners
Install the extra (pip install "nrag[eval]"), then build a fresh index via a factory and score it:
from nrag.eval import run_beir, run_bright, run_bright_all
# BEIR: breadth / parity, scored against the published BM25 anchor
print(run_beir(lambda: Nrag(preset="fast"), dataset="scifact", split="test"))
# BRIGHT: the reasoning-intensive hero benchmark (needs an LLM for the compiled preset)
print(run_bright(lambda: Nrag(llm=llm, preset="compiled"), subset="biology"))
results = run_bright_all(lambda: Nrag(llm=llm, preset="compiled")) # all 12 subsets
What the harness taught us (findings, not vibes):
- Query-side expansion is a trap on precise retrieval: LLM query2doc (−0.015 nDCG) and RM3 (−0.14 to −0.19) both crater precision. Enrich the corpus, never the query, which is exactly why the router only expands weak/ambiguous queries.
- Anticipatory indexing (doc2query) is the cost-fair lever: the LLM writes, at index time, the claims each doc answers; queries stay pure BM25. Best no-embedding retrieval-only result, free at query time.
Live compiled-retrieval benchmark
python benchmarks/csc_eval.py baseline # pure-lexical, free
OPENROUTER_API_KEY=... python benchmarks/csc_eval.py smoke # compile a few docs; print bundle + weights
OPENROUTER_API_KEY=... python benchmarks/csc_eval.py compiled --index ./idx_csc --k 3
Reproducing
pip install "nrag[eval]"
export NRAG_LLM_BASE_URL=... NRAG_LLM_MODEL=... NRAG_LLM_API_KEY=... # any OpenAI-compatible endpoint
python -m pytest # 87 passing, 3 opt-in skipped
Cost
Evaluation isn't only quality; it's the bill. Nrag pays the smart compute once, at compile time; a dense + vector-DB stack pays it on every query, forever, plus RAM to hold vectors resident.
nrag tco --docs 1000000 --queries-per-month 5000000 --months 36
from nrag.tco import TCOInputs, compute_tco, format_report
print(format_report(TCOInputs(), compute_tco(TCOInputs())))
Every rate is an overridable input (defaults cite the strategy brief). Plug in your own numbers, get your own break-even.
How it works
flowchart LR
D[docs] -->|offline compile · cached| C{{Compiler · CSC}}
C --> A[Leg A<br/>BM25 + trigrams<br/>over enriched text]
C --> B[Leg B<br/>consensus sparse<br/>term weights]
Q([query]) --> R{Adaptive router}
R -->|confident · ~1 ms · $0| F[RRF / convex fusion]
R -.->|weak · 1 LLM call| E[expand + re-search]
E --> F
A --> F
B --> F
F --> H[ranked hits<br/>+ citations]
- Structure-aware chunking with span-exact offsets;
indexed_text(enriched, searched) is kept separate fromraw_text(clean, cited). - Two sparse legs, fused by RRF (zero-tuning) or convex combination: hybrid's complementarity with no dense leg.
- Offline compiler with a content-hash cache (re-indexing is free) and a cost guard.
- Adaptive router spends an LLM call only when the cheap path is unsure.
Engines
Swap the lexical backend without touching anything else:
| Engine | Install | Notes |
|---|---|---|
tantivy (default) |
core | fast, persistent, multi-field scoring |
sqlite |
core | FTS5, zero extra deps, portable single file |
bm25s |
nrag[bm25s] |
in-memory, pure-NumPy, fast batch |
rag = Nrag(preset="fast", engine="sqlite", path="./idx")
Install extras
pip install nrag # core: tantivy + stemmer + http client. No models, ever.
pip install "nrag[openai]" # openai SDK + tiktoken (exact token counts)
pip install "nrag[bm25s]" # in-memory bm25s engine
pip install "nrag[pdf,html]" # PDF text + fast HTML loaders
pip install "nrag[eval]" # ranx / pytrec_eval / BEIR / RAGAS / datasets
Design guarantees
- No LLM required: pure-lexical retrieval always works; LLM features disable by construction when no LLM is supplied.
- All LLM cost is offline (index-time, cached) or a single gated query-time call (the router).
- Portable & explainable: deterministic scores; the serving index is a plain directory you can archive and ship.
License
MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nrag-0.1.1.tar.gz.
File metadata
- Download URL: nrag-0.1.1.tar.gz
- Upload date:
- Size: 65.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
91e606ed50aba218565b6729eb7be3f4ae3a1cd2e01b7ccd2b4e44111d6ab016
|
|
| MD5 |
ab91f6567ac965f05899844721bd7766
|
|
| BLAKE2b-256 |
547b8f150d256081275c62cb915ee844cb70bf008c657b29ebb02980f197045f
|
Provenance
The following attestation bundles were made for nrag-0.1.1.tar.gz:
Publisher:
python-publish.yml on NineNatthanarong/NRAG
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
nrag-0.1.1.tar.gz -
Subject digest:
91e606ed50aba218565b6729eb7be3f4ae3a1cd2e01b7ccd2b4e44111d6ab016 - Sigstore transparency entry: 2033004412
- Sigstore integration time:
-
Permalink:
NineNatthanarong/NRAG@0496eef643886e74fa70744638ec326f72dff0e2 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/NineNatthanarong
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@0496eef643886e74fa70744638ec326f72dff0e2 -
Trigger Event:
release
-
Statement type:
File details
Details for the file nrag-0.1.1-py3-none-any.whl.
File metadata
- Download URL: nrag-0.1.1-py3-none-any.whl
- Upload date:
- Size: 87.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
701c66b11d1979aa47d797f68fb0ba3b9bacd75aa0e5710194145184effff5b2
|
|
| MD5 |
e7b46764b1753f442b772adca0835fd6
|
|
| BLAKE2b-256 |
2e5eb5c4c2e7ffeb235022d74e99ea945f49434a8325f02f818457b206f95ee4
|
Provenance
The following attestation bundles were made for nrag-0.1.1-py3-none-any.whl:
Publisher:
python-publish.yml on NineNatthanarong/NRAG
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
nrag-0.1.1-py3-none-any.whl -
Subject digest:
701c66b11d1979aa47d797f68fb0ba3b9bacd75aa0e5710194145184effff5b2 - Sigstore transparency entry: 2033004589
- Sigstore integration time:
-
Permalink:
NineNatthanarong/NRAG@0496eef643886e74fa70744638ec326f72dff0e2 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/NineNatthanarong
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@0496eef643886e74fa70744638ec326f72dff0e2 -
Trigger Event:
release
-
Statement type: