Open-source GraphRAG with self-validating trees, agentic retrieval, and pixel-level citations.
NanoIndex
Open-source agentic harness for long documents. Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.
| Benchmark | Accuracy |
|---|---|
| FinanceBench (84 SEC filings, avg 143 pages) | 94.5% |
| DocBench Legal (51 court filings, avg 54 pages) | 96.0% |
If NanoIndex is useful, a ⭐ helps others find it.
The problem
Most RAG systems chop documents into chunks and turn them into embeddings. Two things break.
Structure is lost. A 200-page filing has a table of contents, numbered sections, tables with rows and columns. Chunking throws all of that away. Section 3.2 is no longer inside Section 3. A balance sheet table gets split across two chunks. The hierarchy the author wrote is gone.
Multi-hop questions fail. Many real questions need data from multiple sections. Computing a ratio requires the income statement and the balance sheet. Checking a legal clause means reading the clause, its definitions, and its exceptions. A chunk retriever finds one section, not the three you need, because the question doesn't match all of them equally in embedding space.
The result: wrong answers with citations that say "chunk_47" instead of a page and location an auditor can verify.
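The first failure mode can be reproduced in a few lines. This is a toy sketch, not NanoIndex code: the `Section` class, the section texts, and the 50-character chunk size are all made up for illustration. It flattens three sections into one string, cuts fixed-size chunks, and reports which sections each chunk overlaps.

```python
from dataclasses import dataclass


@dataclass
class Section:
    number: str  # dotted section number, e.g. "3.2" (hypothetical sample data)
    title: str
    text: str


def sections_per_chunk(sections: list[Section], size: int) -> list[set[str]]:
    """For each fixed-size chunk, report which section numbers it overlaps."""
    # Record the character span each section occupies in the flattened text.
    spans = []
    offset = 0
    for s in sections:
        spans.append((offset, offset + len(s.text), s.number))
        offset += len(s.text)
    total = offset

    result = []
    for start in range(0, total, size):
        end = min(start + size, total)
        # A section overlaps the chunk if their half-open spans intersect.
        result.append({num for (a, b, num) in spans if a < end and b > start})
    return result


sections = [
    Section("3", "Financial Statements", "A" * 30),
    Section("3.1", "Income Statement", "B" * 50),
    Section("3.2", "Balance Sheet", "C" * 40),
]
overlaps = sections_per_chunk(sections, size=50)
for i, nums in enumerate(overlaps):
    print(f"chunk_{i}: sections {sorted(nums)}")
```

In this toy input the income statement is split across `chunk_0` and `chunk_1`, and no chunk records that 3.1 and 3.2 both sit under Section 3.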
Who is this for?
- Developers building RAG over long, structured documents (10-Ks, contracts, medical records)
- Teams where citation accuracy is a compliance or audit requirement
- Anyone hitting the limits of chunk-and-embed on multi-section documents
Not the right fit if: you're querying short documents (<10 pages) or need sub-second latency.
Part 1: Querying within a single long document
NanoIndex preserves document structure instead of destroying it. Nanonets OCR-3 extracts the table of contents, section hierarchy, and heading structure. NanoIndex builds a tree from these.
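To make "builds a tree" concrete, here is a minimal sketch of nesting headings by their dotted section numbers. It is an illustration only: the `Node` class, `build_tree`, and the sample headings are hypothetical, and NanoIndex's own tree format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    number: str
    title: str
    page: int
    children: list["Node"] = field(default_factory=list)


def build_tree(headings: list[tuple[str, str, int]]) -> Node:
    """Nest headings by dotted section number, so "3.2" lands under "3"."""
    root = Node(number="", title="Document", page=1)
    by_number = {"": root}
    for number, title, page in headings:
        node = Node(number, title, page)
        # The parent number is everything before the last dot; top-level
        # sections such as "3" have no dot and attach to the root.
        parent_number = number.rpartition(".")[0]
        parent = by_number.get(parent_number, root)
        parent.children.append(node)
        by_number[number] = node
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Render the tree as an indented outline, one line per node."""
    label = f"{node.number} {node.title}".strip()
    lines = [f"{'  ' * depth}{label} (p. {node.page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical headings: (section number, title, page).
headings = [
    ("1", "Business", 3),
    ("3", "Financial Statements", 40),
    ("3.1", "Income Statement", 41),
    ("3.2", "Balance Sheet", 45),
]
tree = build_tree(headings)
print("\n".join(outline(tree)))
```

Unlike a list of chunks, the result keeps Section 3.2 inside Section 3, and every node carries the page it starts on.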
| Document type | Examples | How NanoIndex navigates |
|---|---|---|
| Structured | 10-K filings, contracts, research papers | Uses the table of contents. Agent reads the outline, goes straight to the right section. |
| Semi-structured | Earnings releases, quarterly reports | Disambiguates repetitive headings ("Reconciliation" x8 becomes "Reconciliation: Q2 2023 Segment Data"). |
| Unstructured | Transcripts, scans, flat reports | Splits by page, extracts entities (people, companies, dates, amounts). The entity graph becomes the map. |
When you ask a question, an LLM agent navigates this tree across multiple rounds. It reads page images directly. It verifies its calculations. It cites every answer with the exact page and pixel coordinates.
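The multi-round navigation can be pictured as one decision per tree level. The sketch below is a toy under loud assumptions: the dict-based tree, `navigate`, and `keyword_chooser` are hypothetical, and a word-overlap scorer stands in for the LLM. It shows only the control flow, not NanoIndex's actual agent, page-image reading, or verification.

```python
import re
from typing import Callable


def describe(node: dict) -> str:
    """Summarize a node as its title plus all descendant titles (its outline)."""
    parts = [node["title"]]
    for child in node.get("children", []):
        parts.append(describe(child))
    return " ".join(parts)


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_chooser(question: str, options: list[str]) -> int:
    """Stand-in for the LLM: pick the option sharing the most words with the question."""
    q = words(question)
    scores = [len(q & words(option)) for option in options]
    return scores.index(max(scores))


def navigate(tree: dict, question: str,
             choose: Callable[[str, list[str]], int] = keyword_chooser,
             max_rounds: int = 8) -> list[dict]:
    """Walk from the root toward a leaf, making one choice per round."""
    path = [tree]
    node = tree
    for _ in range(max_rounds):
        children = node.get("children", [])
        if not children:
            break  # reached a leaf section
        node = children[choose(question, [describe(c) for c in children])]
        path.append(node)
    return path


# Hypothetical document tree.
doc = {
    "title": "10-K", "page": 1, "children": [
        {"title": "Business Overview", "page": 3},
        {"title": "Financial Statements", "page": 40, "children": [
            {"title": "Income Statement", "page": 41},
            {"title": "Balance Sheet", "page": 45},
            {"title": "Cash Flow Statement", "page": 48},
        ]},
    ],
}
path = navigate(doc, "What was the operating cash flow?")
print(" > ".join(n["title"] for n in path), "-> page", path[-1]["page"])
```

Swapping `choose` for a function that calls a real model keeps the loop unchanged, which is the point of the design: the tree bounds what the agent has to look at in each round.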
Quick start
pip install nanoindex
export NANONETS_API_KEY=your_key # free at docstrange.nanonets.com (10K pages)
export ANTHROPIC_API_KEY=your_key # or OPENAI_API_KEY, GOOGLE_API_KEY
from nanoindex import NanoIndex
# Pick your LLM
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
# ni = NanoIndex(llm="openai:gpt-5.4")
# ni = NanoIndex(llm="gemini:gemini-2.5-flash")
# ni = NanoIndex(llm="ollama:llama3") # fully local
# Index a document
tree = ni.index("10k_filing.pdf")
answer = ni.ask("What was the free cash flow?", tree)
print(answer.content) # computed answer with reasoning
print(answer.citations[0].pages) # [52]
print(answer.citations[0].bounding_boxes) # exact coordinates on the page
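For audit logs it can help to list every citation rather than only the first. The helper below is a sketch that relies only on the attributes shown in the quick start (`citations`, `pages`, `bounding_boxes`). It assumes `bounding_boxes` is a sized sequence and makes no assumption about the shape of each box. The `SimpleNamespace` object is a stand-in so the example runs without an API key, and its coordinate tuple is invented.

```python
from types import SimpleNamespace


def format_citations(answer) -> list[str]:
    """Return one summary line per citation on an answer-like object."""
    lines = []
    for i, citation in enumerate(answer.citations, start=1):
        pages = ", ".join(str(p) for p in citation.pages)
        box_count = len(citation.bounding_boxes)
        lines.append(f"[{i}] page(s) {pages}; {box_count} bounding box(es)")
    return lines


# Stand-in for a real answer; the box format here is hypothetical.
fake_answer = SimpleNamespace(
    content="(answer text)",
    citations=[
        SimpleNamespace(pages=[52], bounding_boxes=[(10, 20, 300, 40)]),
    ],
)
for line in format_citations(fake_answer):
    print(line)
```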
Build entity graph (optional)
By default, index() builds only the tree. To also extract entities and relationships:
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", build_graph=True)
tree = ni.index("10k_filing.pdf") # tree + entity graph
graph = ni.get_graph(tree) # 921 entities, 103 relationships
The entity graph enables fast_vision and agentic_graph_vision modes. Without it, agentic_vision (the default) works fine using tree navigation alone.
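One way to picture how an entity graph can seed a search is an index from entity names to the pages that mention them. This is a simplified illustration with made-up entities and a naive substring match. It is not how NanoIndex stores or queries its graph, and `build_entity_index` and `seed_pages` are hypothetical names.

```python
from collections import defaultdict


def build_entity_index(mentions: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Map each entity name (lowercased) to the sorted pages that mention it."""
    index = defaultdict(set)
    for entity, page in mentions:
        index[entity.lower()].add(page)
    return {name: sorted(pages) for name, pages in index.items()}


def seed_pages(question: str, index: dict[str, list[int]]) -> list[int]:
    """Return pages for every indexed entity whose name appears in the question."""
    q = question.lower()
    pages = set()
    for name, entity_pages in index.items():
        if name in q:
            pages.update(entity_pages)
    return sorted(pages)


# Hypothetical (entity, page) mentions extracted from a document.
mentions = [
    ("Acme Corp", 1),
    ("Jane Doe", 12),
    ("Acme Corp", 52),
    ("Health Care segment", 30),
    ("Jane Doe", 87),
]
index = build_entity_index(mentions)
print(seed_pages("What did Jane Doe say about growth?", index))
```

Starting from pages 12 and 87 instead of the root is what lets a graph-seeded mode skip some navigation rounds.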
Save and reload trees
Index once, query many times. Trees and graphs are JSON files you can save and load:
from nanoindex.utils.tree_ops import save_tree, load_tree, load_graph
# Save after indexing
save_tree(tree, "3M_2018_10K.json")
# Load later — no re-indexing needed
tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
answer = ni.ask("What was the operating margin?", tree)
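The "index once, query many times" pattern can be wrapped in a small cache helper. The sketch below is generic: `load_or_build` is a hypothetical name, and the demo uses JSON stand-ins for the build, save, and load steps so it runs without NanoIndex installed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(path: Path, build: Callable[[], T],
                  save: Callable[[T, str], None],
                  load: Callable[[str], T]) -> T:
    """Load a cached artifact if present; otherwise build it and save it."""
    if path.exists():
        return load(str(path))
    value = build()
    save(value, str(path))
    return value


# Stand-ins for the indexing and tree_ops functions.
def _save_json(value, path):
    Path(path).write_text(json.dumps(value))


def _load_json(path):
    return json.loads(Path(path).read_text())


calls = []


def _fake_index():
    calls.append("index")  # count how often the expensive step runs
    return {"title": "demo", "children": []}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "tree.json"
    first = load_or_build(cache, _fake_index, _save_json, _load_json)
    second = load_or_build(cache, _fake_index, _save_json, _load_json)

print(len(calls))  # the expensive step ran once
```

Going by the signatures shown above, the NanoIndex call would presumably be `load_or_build(Path("3M_2018_10K.json"), lambda: ni.index("10k_filing.pdf"), save_tree, load_tree)`, though that has not been verified against the library.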
Query modes
| Mode | LLM calls | Best for |
|---|---|---|
| agentic_vision (default) | 5-8 | Highest accuracy. Agent navigates tree, reads page images. |
| agentic_graph_vision | 4-6 | Entity graph seeds the search, agent reasons from there. |
| fast_vision | 2 | Simple fact lookups. Cheapest. |
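The LLM-call counts in the query-modes table translate directly into a budget for a batch of questions. The sketch below copies the per-question ranges from the table and multiplies them out. The batch size of 100 is an arbitrary example, and `call_budget` is a hypothetical helper, not part of NanoIndex.

```python
# LLM calls per question (low, high), copied from the query-modes table.
CALLS_PER_QUESTION = {
    "agentic_vision": (5, 8),
    "agentic_graph_vision": (4, 6),
    "fast_vision": (2, 2),
}


def call_budget(mode: str, questions: int) -> tuple[int, int]:
    """Return the (low, high) total LLM calls for a batch of questions."""
    low, high = CALLS_PER_QUESTION[mode]
    return low * questions, high * questions


budgets = {mode: call_budget(mode, 100) for mode in CALLS_PER_QUESTION}
for mode, (low, high) in budgets.items():
    print(f"{mode}: {low}-{high} LLM calls for 100 questions")
```

Actual cost also depends on tokens per call (page images are larger than text), which the table does not specify.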
Part 2: Querying across multiple documents (Karpathy-inspired wiki)
The harder problem is synthesis across documents: "How has 3M's revenue changed over 5 years?" or "Which company in my portfolio has the highest ROA?"
Inspired by Karpathy's LLM wiki pattern, NanoIndex compiles documents into a persistent, interlinked wiki that gets richer with every source you add and every question you ask.
from nanoindex.kb import KnowledgeBase
kb = KnowledgeBase("./sec-filings")
kb.add("3M_2018_10K.pdf") # extracts entities, builds concept pages
kb.add("3M_2019_10K.pdf") # updates existing concepts, flags changes
kb.add("3M_2020_10K.pdf") # cross-references across all three years
answer = kb.ask("How has 3M's revenue changed from 2018 to 2020?")
kb.lint() # find contradictions, stale claims, orphan pages
Add pre-built trees and graphs directly:
from nanoindex.utils.tree_ops import load_tree, load_graph
tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
kb.add_tree(tree, graph)
The wiki is a directory of markdown files. Open it in Obsidian and browse concept pages with [[backlinks]], entity graphs, and an activity log.
Three layers:
- Raw sources — your PDFs, immutable, never modified
- The wiki — markdown pages with cross-references. The LLM writes and maintains all of it.
- The schema — how the wiki is structured, what entity types to track, domain conventions
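Because the wiki is plain markdown with `[[wikilinks]]`, checks such as "orphan pages" can be approximated with ordinary file tooling. The sketch below is an independent illustration, not the implementation of `kb.lint()`. It assumes a flat directory of `.md` files whose names match link targets, and the three sample pages are invented.

```python
import re
import tempfile
from pathlib import Path

# Capture the target of [[Target]], [[Target|alias]], or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def backlinks(wiki_dir: Path) -> dict[str, set[str]]:
    """Map each page name to the set of other pages that link to it."""
    pages = {p.stem: p.read_text(encoding="utf-8") for p in wiki_dir.glob("*.md")}
    incoming = {name: set() for name in pages}
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            # Ignore links to missing pages and self-links.
            if target in incoming and target != name:
                incoming[target].add(name)
    return incoming


def orphans(wiki_dir: Path) -> list[str]:
    """Pages that no other page links to."""
    return sorted(name for name, sources in backlinks(wiki_dir).items() if not sources)


with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    (wiki / "3M.md").write_text("Revenue is tracked in [[Revenue]].", encoding="utf-8")
    (wiki / "Revenue.md").write_text("Reported by [[3M]] each year.", encoding="utf-8")
    (wiki / "Litigation.md").write_text("See [[3M|the company page]].", encoding="utf-8")
    links = backlinks(wiki)
    orphan_pages = orphans(wiki)

print(orphan_pages)
```

Here `Litigation` links out to `3M` but nothing links back to it, so it is the only page reported as an orphan.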
How it compares
| | Chunk + Embed | Microsoft GraphRAG | PageIndex | NanoIndex |
|---|---|---|---|---|
| Indexing | Chunk text, embed | LLM per chunk | LLM per page | 1 OCR API call |
| Structure | Lost | Lost | Tree | Tree + entity graph |
| Navigation | Similarity search | Map-reduce | LLM tree walk | Multi-round agent |
| Multi-document | Vector DB | No | No | Wiki with [[backlinks]] |
| Citations | Chunk ID | None | Page number | Pixel coordinates |
| Vision | No | No | No | Page images to LLM |
| Cost per doc | Low | High | High | Low |
Roadmap
- Agentic extraction: self-correcting structured extraction for tables and forms (invoice line items, insurance loss runs, bank statement reconciliation)
- Real-world long document benchmarks: bank statement reconciliation, insurance loss run extraction, multi-document contract analysis
- Streaming tree building: real-time tree construction as pages are parsed
- Multi-agent wiki: multiple agents maintaining different sections of the wiki concurrently
CLI
nanoindex index report.pdf -o tree.json
nanoindex ask report.pdf "What was the revenue?"
nanoindex viz tree.json
Development
git clone https://github.com/nanonets/nanoindex.git && cd nanoindex
uv sync --extra dev && uv run pytest # or: pip install -e ".[dev]" && pytest
Entity extraction: pip install nanoindex[gliner] (CPU) or pip install nanoindex[gliner-gpu] (GPU).
Apache 2.0. Built on Nanonets OCR-3.