Open-source GraphRAG with self-validating trees, agentic retrieval, and pixel-level citations.

NanoIndex

Open-source agentic harness for long documents. Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.

| Benchmark | Accuracy |
| --- | --- |
| FinanceBench (84 SEC filings, avg 143 pages) | 94.5% |
| DocBench Legal (51 court filings, avg 54 pages) | 96.0% |

If NanoIndex is useful, a star helps others find it.


The problem

Most RAG systems chop documents into chunks and turn them into embeddings. Two things break.

Structure is lost. A 200-page filing has a table of contents, numbered sections, tables with rows and columns. Chunking throws all of that away. Section 3.2 is no longer inside Section 3. A balance sheet table gets split across two chunks. The hierarchy the author wrote is gone.

Multi-hop questions fail. Many real questions need data from multiple sections. Computing a ratio requires the income statement and the balance sheet. Checking a legal clause means reading the clause, its definitions, and its exceptions. A chunk retriever finds one section, not the three you need, because the question doesn't match all of them equally in embedding space.
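To make the multi-hop point concrete, here is a small sketch with invented figures and hypothetical section names (none of it is NanoIndex code). Return on assets needs one number from the income statement and one from the balance sheet, so a retriever that returns only the best-matching section cannot finish the calculation.

```python
# Illustrative only: section names and figures are invented for the example.
sections = {
    "income_statement": {"net_income": 50.0},
    "balance_sheet": {"total_assets": 400.0},
    "risk_factors": {},
}


def return_on_assets(retrieved):
    """Compute ROA from retrieved sections, or None if a needed section is missing."""
    income = retrieved.get("income_statement", {}).get("net_income")
    assets = retrieved.get("balance_sheet", {}).get("total_assets")
    if income is None or assets is None:
        return None
    return income / assets


# Top-1 retrieval returns a single section: the ratio cannot be computed.
top1 = {"income_statement": sections["income_statement"]}
# Retrieving both required sections makes the answer computable.
both = {k: sections[k] for k in ("income_statement", "balance_sheet")}

print(return_on_assets(top1))  # None
print(return_on_assets(both))  # 0.125
```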

The result: wrong answers with citations that say "chunk_47" instead of a page and location an auditor can verify.


Who is this for?

  • Developers building RAG over long, structured documents (10-Ks, contracts, medical records)
  • Teams where citation accuracy is a compliance or audit requirement
  • Anyone hitting the limits of chunk-and-embed on multi-section documents

Not the right fit if: you're querying short documents (<10 pages) or need sub-second latency.


Part 1: Querying within a single long document

NanoIndex preserves document structure instead of destroying it. Nanonets OCR-3 extracts the table of contents, section hierarchy, and heading structure. NanoIndex builds a tree from these.
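As a rough illustration of what "builds a tree" can mean, the sketch below nests headings by their dotted section numbers. It assumes the extractor yields (number, title, page) tuples; the `Node` class and `build_tree` function are hypothetical names for this example, not NanoIndex's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    number: str  # e.g. "3.2"; empty string for the root
    title: str
    page: int
    children: list = field(default_factory=list)


def build_tree(headings):
    """Nest (number, title, page) headings by their dotted section number."""
    root = Node("", "Document", 0)
    stack = [root]  # stack[d] is the most recent node seen at depth d
    for number, title, page in headings:
        depth = number.count(".") + 1
        # Discard siblings and deeper nodes so the parent is on top of the stack.
        del stack[depth:]
        node = Node(number, title, page)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Invented headings for illustration.
tree = build_tree([
    ("1", "Business", 3),
    ("3", "Financial Statements", 40),
    ("3.1", "Income Statement", 41),
    ("3.2", "Balance Sheet", 45),
    ("4", "Notes", 60),
])
print([child.number for child in tree.children])              # ['1', '3', '4']
print([child.number for child in tree.children[1].children])  # ['3.1', '3.2']
```

Unlike a flat list of chunks, Section 3.2 stays a child of Section 3, so a reader (or an agent) can reach it by walking the outline.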

| Document type | Examples | How NanoIndex navigates |
| --- | --- | --- |
| Structured | 10-K filings, contracts, research papers | Uses the table of contents. Agent reads the outline, goes straight to the right section. |
| Semi-structured | Earnings releases, quarterly reports | Disambiguates repetitive headings ("Reconciliation" x8 becomes "Reconciliation: Q2 2023 Segment Data"). |
| Unstructured | Transcripts, scans, flat reports | Splits by page, extracts entities (people, companies, dates, amounts). The entity graph becomes the map. |
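The heading disambiguation for semi-structured documents can be pictured with the sketch below. It assumes each section comes with a short context string (for instance, taken from its first line); how NanoIndex actually derives that context is not described here, and `disambiguate` is a hypothetical helper.

```python
from collections import Counter


def disambiguate(sections):
    """sections: list of (heading, context) pairs. Qualify headings that repeat."""
    counts = Counter(heading for heading, _ in sections)
    titles = []
    for heading, context in sections:
        if counts[heading] > 1 and context:
            titles.append(f"{heading}: {context}")
        else:
            titles.append(heading)
    return titles


# Invented section list; only repeated headings get a qualifier.
sections = [
    ("Highlights", "Q2 2023"),
    ("Reconciliation", "Q2 2023 Segment Data"),
    ("Reconciliation", "Q2 2023 Free Cash Flow"),
]
titles = disambiguate(sections)
print(titles)
# ['Highlights', 'Reconciliation: Q2 2023 Segment Data', 'Reconciliation: Q2 2023 Free Cash Flow']
```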

When you ask a question, an LLM agent navigates this tree across multiple rounds. It reads page images directly. It verifies its calculations. It cites every answer with the exact page and pixel coordinates.
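The multi-round navigation can be sketched as a loop that descends one level per round. This is a toy under stated assumptions: the outline is a nested dict, and a keyword-overlap function stands in for the LLM's choice. The real agent also reads page images and verifies calculations, which this sketch does not attempt.

```python
def navigate(tree, choose, max_rounds=8):
    """Walk a nested-dict outline; `choose` stands in for the LLM's decision."""
    node, path = tree, [tree["title"]]
    for _ in range(max_rounds):
        children = node.get("children", [])
        if not children:
            break  # reached a leaf section: this is where the agent would read pages
        node = choose(children)
        path.append(node["title"])
    return node, path


def keyword_chooser(keywords):
    """Toy stand-in for an LLM: pick the child whose title shares the most words."""
    def choose(children):
        return max(
            children,
            key=lambda c: len(keywords & set(c["title"].lower().split())),
        )
    return choose


# Invented outline and page ranges for illustration.
outline = {
    "title": "10-K",
    "children": [
        {"title": "Business", "pages": [3, 20]},
        {"title": "Financial Statements", "children": [
            {"title": "Income Statement", "pages": [41, 44]},
            {"title": "Cash Flow Statement", "pages": [50, 53]},
        ]},
    ],
}

chooser = keyword_chooser({"cash", "flow", "financial", "statements"})
leaf, path = navigate(outline, chooser)
print(path)           # ['10-K', 'Financial Statements', 'Cash Flow Statement']
print(leaf["pages"])  # [50, 53]
```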

Quick start

pip install nanoindex
export NANONETS_API_KEY=your_key    # free at docstrange.nanonets.com (10K pages)
export ANTHROPIC_API_KEY=your_key   # or OPENAI_API_KEY, GOOGLE_API_KEY

from nanoindex import NanoIndex

# Pick your LLM
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
# ni = NanoIndex(llm="openai:gpt-5.4")
# ni = NanoIndex(llm="gemini:gemini-2.5-flash")
# ni = NanoIndex(llm="ollama:llama3")  # fully local

# Index a document
tree = ni.index("10k_filing.pdf")
answer = ni.ask("What was the free cash flow?", tree)

print(answer.content)                     # computed answer with reasoning
print(answer.citations[0].pages)          # [52]
print(answer.citations[0].bounding_boxes) # exact coordinates on the page
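If you want every citation rather than only the first, a small formatter like the one below may help. It relies only on the attributes shown above (`citations`, `pages`, `bounding_boxes`); the `format_citations` helper is hypothetical, and the bounding-box layout in the stand-in object is a placeholder, not the library's documented format.

```python
from types import SimpleNamespace


def format_citations(answer):
    """Render one line per citation using only .pages and .bounding_boxes."""
    lines = []
    for i, citation in enumerate(answer.citations, start=1):
        pages = ", ".join(str(p) for p in citation.pages)
        lines.append(f"[{i}] pages {pages}; boxes {citation.bounding_boxes!r}")
    return lines


# Stand-in object so the sketch runs without API keys.
# The box tuple below is an assumed (x0, y0, x1, y1) placeholder.
fake_answer = SimpleNamespace(
    content="example answer text",
    citations=[SimpleNamespace(pages=[52], bounding_boxes=[(10, 20, 300, 48)])],
)

for line in format_citations(fake_answer):
    print(line)  # [1] pages 52; boxes [(10, 20, 300, 48)]
```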

Build entity graph (optional)

By default, index() builds only the tree. To also extract entities and relationships:

ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
ni.config.build_graph = True
tree = ni.index("10k_filing.pdf")  # tree + entity graph
graph = ni.get_graph(tree)         # 921 entities, 103 relationships

The entity graph enables fast_vision and agentic_graph_vision modes. Without it, agentic_vision (the default) works fine using tree navigation alone.

Save and reload trees

Index once, query many times. Trees and graphs are JSON files you can save and load:

from nanoindex.utils.tree_ops import save_tree, load_tree, load_graph

# Save after indexing
save_tree(tree, "3M_2018_10K.json")

# Load later — no re-indexing needed
tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
answer = ni.ask("What was the operating margin?", tree)
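The "index once, query many times" pattern is essentially a file cache. The sketch below shows the idea with the standard `json` module and a hypothetical `load_or_build` helper; in real use you would presumably pass a function that calls `ni.index(...)` and rely on `save_tree` / `load_tree` instead of raw JSON calls.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return the cached JSON at `path`, or call `build()` once and save the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = build()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


calls = []


def expensive_index():
    """Stand-in for indexing a document; records each time it actually runs."""
    calls.append(1)
    return {"title": "3M 2018 10-K", "children": []}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "tree.json")
    first = load_or_build(cache, expensive_index)   # builds and writes the file
    second = load_or_build(cache, expensive_index)  # served from the file

print(len(calls), first == second)  # 1 True
```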

Query modes

| Mode | LLM calls | Best for |
| --- | --- | --- |
| agentic_vision (default) | 5-8 | Highest accuracy. Agent navigates tree, reads page images. |
| agentic_graph_vision | 4-6 | Entity graph seeds the search, agent reasons from there. |
| fast_vision | 2 | Simple fact lookups. Cheapest. |
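One way to reason about the table is as a budgeted choice: take the most thorough mode whose worst-case call count fits and whose prerequisites are met. The `pick_mode` helper below is hypothetical and only returns a mode name. Treating agentic_graph_vision as more thorough than fast_vision is an assumption of this sketch, and how a mode is passed to the library is not shown here.

```python
# (mode name, worst-case LLM calls from the table, needs an entity graph),
# ordered from most to least thorough; the ordering of the last two is assumed.
MODES = [
    ("agentic_vision", 8, False),
    ("agentic_graph_vision", 6, True),
    ("fast_vision", 2, True),
]


def pick_mode(max_calls, has_graph):
    """Return the first mode that fits the call budget and graph availability."""
    for name, calls, needs_graph in MODES:
        if calls <= max_calls and (has_graph or not needs_graph):
            return name
    return None  # nothing fits: raise the budget or build the graph


print(pick_mode(10, has_graph=False))  # agentic_vision
print(pick_mode(6, has_graph=True))    # agentic_graph_vision
print(pick_mode(3, has_graph=True))    # fast_vision
print(pick_mode(3, has_graph=False))   # None
```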

Part 2: Querying across multiple documents (Karpathy-inspired wiki)

The harder problem is synthesis across documents: "How has 3M's revenue changed over 5 years?" or "Which company in my portfolio has the highest ROA?"

Inspired by Karpathy's LLM wiki pattern, NanoIndex compiles documents into a persistent, interlinked wiki that gets richer with every source you add and every question you ask.

from nanoindex.kb import KnowledgeBase

kb = KnowledgeBase("./sec-filings")
kb.add("3M_2018_10K.pdf")     # extracts entities, builds concept pages
kb.add("3M_2019_10K.pdf")     # updates existing concepts, flags changes
kb.add("3M_2020_10K.pdf")     # cross-references across all three years

answer = kb.ask("How has 3M's revenue changed from 2018 to 2020?")
kb.lint()  # find contradictions, stale claims, orphan pages

Add pre-built trees and graphs directly:

from nanoindex.utils.tree_ops import load_tree, load_graph

tree = load_tree("3M_2018_10K.json")
graph = load_graph("3M_2018_10K_graph.json")
kb.add_tree(tree, graph)

The wiki is a directory of markdown files. Open it in Obsidian and browse concept pages with [[backlinks]], entity graphs, and an activity log.
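Because the wiki is plain markdown with [[backlinks]], checks such as "orphan pages" can be reasoned about with ordinary text tooling. The sketch below is not how `kb.lint()` is implemented (that is not documented here); it is a hypothetical `find_orphans` helper over an in-memory dict of pages, using Obsidian-style links including the `[[Page|alias]]` and `[[Page#Heading]]` forms.

```python
import re

# Capture the page name of a wiki link, stopping at "]", "|" (alias) or "#" (heading).
LINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages):
    """pages: {page_name: markdown_text}. Return pages that no other page links to."""
    linked = set()
    for name, text in pages.items():
        for target in LINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not count as an inbound link
                linked.add(target)
    return sorted(set(pages) - linked)


# Invented pages for illustration.
pages = {
    "3M": "See [[Revenue]] and [[Operating Margin|margin]].",
    "Revenue": "Reported by [[3M]].",
    "Operating Margin": "Derived from [[Revenue]].",
    "Goodwill": "No inbound links yet.",
}
print(find_orphans(pages))  # ['Goodwill']
```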

Three layers:

  • Raw sources — your PDFs, immutable, never modified
  • The wiki — markdown pages with cross-references. The LLM writes and maintains all of it.
  • The schema — how the wiki is structured, what entity types to track, domain conventions

How it compares

| | Chunk + Embed | Microsoft GraphRAG | PageIndex | NanoIndex |
| --- | --- | --- | --- | --- |
| Indexing | Chunk text, embed | LLM per chunk | LLM per page | 1 OCR API call |
| Structure | Lost | Lost | Tree | Tree + entity graph |
| Navigation | Similarity search | Map-reduce | LLM tree walk | Multi-round agent |
| Multi-document | Vector DB | No | No | Wiki with [[backlinks]] |
| Citations | Chunk ID | None | Page number | Pixel coordinates |
| Vision | No | No | No | Page images to LLM |
| Cost per doc | Low | High | High | Low |

Roadmap

  • Agentic extraction: self-correcting structured extraction for tables and forms (invoice line items, insurance loss runs, bank statement reconciliation)
  • Real-world long document benchmarks: bank statement reconciliation, insurance loss run extraction, multi-document contract analysis
  • Streaming tree building: real-time tree construction as pages are parsed
  • Multi-agent wiki: multiple agents maintaining different sections of the wiki concurrently

CLI

nanoindex index report.pdf -o tree.json
nanoindex ask report.pdf "What was the revenue?"
nanoindex viz tree.json

Development

git clone https://github.com/nanonets/nanoindex.git && cd nanoindex
uv sync --extra dev && uv run pytest    # or: pip install -e ".[dev]" && pytest

Entity extraction: pip install "nanoindex[gliner]" (CPU) or pip install "nanoindex[gliner-gpu]" (GPU). The quotes keep shells such as zsh from interpreting the square brackets.


Apache 2.0. Built on Nanonets OCR-3.
