Turn any PDF into searchable trees and knowledge graphs. No vectors, no chunks.
NanoIndex
Turn any PDF into searchable trees and visual knowledge graphs. Ask questions, get answers with page citations.
| Benchmark | Documents | Avg Pages | Accuracy |
|---|---|---|---|
| FinanceBench (SEC 10-K filings) | 84 | 143 | 94.5% |
| DocBench Legal (court filings, legislation) | 51 | 54 | 96.0% |
No vector databases. No chunk tuning. No embeddings.
NanoIndex reads your document, understands its structure (headings, sections, tables, figures), and builds a tree you can search with plain English. Built on Nanonets OCR-3 for extraction. Fully open source.
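To make the idea of a document tree concrete, here is a minimal sketch of how a flat list of detected headings could be nested into sections. The `Node` class, `build_tree`, and `outline` are hypothetical names for illustration only, and the headings are made-up sample data; this is not NanoIndex's actual schema or algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document (hypothetical structure, not NanoIndex's schema)."""
    title: str
    level: int
    page: int
    children: list["Node"] = field(default_factory=list)


def build_tree(headings):
    """Nest (level, title, page) headings under a synthetic root using a stack."""
    root = Node(title="Document", level=0, page=1)
    stack = [root]
    for level, title, page in headings:
        node = Node(title=title, level=level, page=page)
        # Pop until the top of the stack is a shallower heading: that is the parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Return the tree as indented lines, one per section."""
    lines = [f"{'  ' * depth}{node.title} (p. {node.page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up sample headings: (heading level, title, starting page).
headings = [
    (1, "Business", 3),
    (2, "Overview", 3),
    (2, "Risk Factors", 9),
    (1, "Financial Statements", 40),
    (2, "Income Statement", 45),
]
tree_sketch = build_tree(headings)
print("\n".join(outline(tree_sketch)))
```

Because each node keeps its title and starting page, a question can be routed to a section first and to a page second, which is what makes page-level citations possible.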
Quick Start
1. Install
pip install nanoindex
2. Set your API keys
export NANONETS_API_KEY=your_key # Get free at docstrange.nanonets.com/app (10K pages free)
export OPENAI_API_KEY=your_key # Or ANTHROPIC_API_KEY, GOOGLE_API_KEY, GROQ_API_KEY
3. Go
from nanoindex import NanoIndex
ni = NanoIndex()
tree = ni.index("report.pdf")
answer = ni.ask("What was the revenue?", tree)
print(answer.content)
That's it. API keys are detected automatically from the environment, and the LLM is selected from whichever keys are available.
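Key auto-detection of this kind can be pictured as a lookup over environment variables. The sketch below is an assumption about how such a lookup might work, not NanoIndex's implementation: `detect_provider` is a hypothetical helper, the precedence order is invented for illustration, and the mapping of `GOOGLE_API_KEY` to the `gemini` prefix is inferred from the examples in this README.

```python
import os

# Assumed precedence; the real library may check keys in a different order.
PROVIDER_KEYS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GOOGLE_API_KEY"),
    ("groq", "GROQ_API_KEY"),
]


def detect_provider(env=None):
    """Return the first provider whose key is set and non-empty, else None."""
    env = os.environ if env is None else env
    for provider, var in PROVIDER_KEYS:
        if env.get(var):
            return provider
    return None


# Passing a dict instead of os.environ keeps the example deterministic.
print(detect_provider({"ANTHROPIC_API_KEY": "sk-example"}))
```

If you have several keys set and want a specific model, pass `llm=` explicitly as shown under "Pick your LLM" rather than relying on detection.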
What Can You Do With It
Ask questions, get cited answers
answer = ni.ask("What was Q3 gross margin?", tree)
print(answer.content) # "Gross margin was 42.3% in Q3..."
print(answer.citations) # [Citation(title="Income Statement", pages=[45, 46])]
From the command line
nanoindex index report.pdf -o tree.json
nanoindex ask report.pdf "What was the revenue?"
nanoindex viz tree.json
Pick your LLM
ni = NanoIndex(llm="openai:gpt-4o")
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
ni = NanoIndex(llm="gemini:gemini-2.5-flash")
ni = NanoIndex(llm="groq:llama-3.3-70b-versatile")
ni = NanoIndex(llm="ollama:llama3")
Or just set the env var and NanoIndex picks the right one:
export ANTHROPIC_API_KEY=... # NanoIndex uses Claude automatically
Save and reuse trees
from nanoindex.utils.tree_ops import save_tree, load_tree
save_tree(tree, "my_tree.json")
tree = load_tree("my_tree.json") # instant, no API call
Search across multiple documents
from nanoindex import DocumentStore
store = DocumentStore()
for pdf in ["q1.pdf", "q2.pdf", "q3.pdf"]:
    store.add(ni.index(pdf))
answer = ni.multi_ask("Compare revenue across quarters", store)
Build a Knowledge Base
from nanoindex import KnowledgeBase
kb = KnowledgeBase("./my-research")
kb.add("report1.pdf")
kb.add("report2.pdf")
answer = kb.ask("How do these compare?") # answers filed back into wiki
Open the my-research/ folder in Obsidian to browse the compiled wiki with [[backlinks]].
How It Works
Indexing: PDF to tree + graph
Querying: Agentic Mode (default)
Querying: Fast Mode (graph-based, cheaper)
Query Modes
| Mode | How it works | Best for |
|---|---|---|
| agentic_vision (default) | LLM navigates full tree + reads page images | Highest accuracy |
| agentic | Same as agentic_vision, without page images | Text-heavy docs |
| fast | Graph entity lookup, LLM sees ~20 nodes | Cheapest, fastest |
| fast_vision | Same as fast, plus page images | Charts and figures |
# Default — agentic with vision
answer = ni.ask("What was revenue?", tree, pdf_path="report.pdf")
# Fast mode — graph-based, 3x cheaper
answer = ni.ask("What was revenue?", tree, mode="fast")
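One way to read the table is as a two-question decision: does the answer depend on page images, and is cost the priority? The helper below is a hypothetical convenience function, not part of NanoIndex; only the four mode names come from the table above.

```python
def choose_mode(needs_images: bool, prefer_cheap: bool) -> str:
    """Map two yes/no questions onto the four documented mode names."""
    if prefer_cheap:
        # Graph-based retrieval, with page images only if charts or figures matter.
        return "fast_vision" if needs_images else "fast"
    # Full tree navigation, with page images unless the document is text-heavy.
    return "agentic_vision" if needs_images else "agentic"


print(choose_mode(needs_images=False, prefer_cheap=True))
```

The returned string is meant to be passed as the `mode=` argument shown above.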
Entity Graph
Every indexed document gets a knowledge graph built automatically using spaCy NLP (free, local, no API calls). Entities (companies, people, dates, money) and relationships are extracted from the tree.
If a reasoning LLM is configured, graph quality is enhanced with LLM-extracted entities on top of spaCy.
The graph powers:
- Fast mode retrieval (entity keyword match + relationship expansion)
- Knowledge Base concept articles
- The Entities tab in the visualization dashboard
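The fast-mode retrieval idea (entity keyword match followed by relationship expansion) can be sketched with plain dictionaries. Everything here is illustrative: the graph contents, node ids, and the `fast_retrieve` function are assumptions, and the real graph format is not documented in this README. The `limit=20` default mirrors the "~20 nodes" figure from the Query Modes table.

```python
# Toy graph with made-up data: entity -> ids of tree nodes that mention it.
entity_nodes = {
    "revenue": {"n12", "n45"},
    "gross margin": {"n45"},
    "acme corp": {"n3"},
    "subsidiary ltd": {"n7"},
}
# Entity -> directly related entities (one hop).
relations = {"acme corp": {"subsidiary ltd"}}


def fast_retrieve(question, entity_nodes, relations, limit=20):
    """Match entities by keyword, expand one hop, and collect candidate node ids."""
    q = question.lower()
    # Step 1: keyword match of known entity names against the question.
    matched = [entity for entity in entity_nodes if entity in q]
    # Step 2: relationship expansion to neighbouring entities.
    expanded = set(matched)
    for entity in matched:
        expanded |= relations.get(entity, set())
    # Step 3: gather the tree nodes that mention any selected entity.
    node_ids = set()
    for entity in expanded:
        node_ids |= entity_nodes.get(entity, set())
    return sorted(node_ids)[:limit]


print(fast_retrieve("What was Acme Corp revenue?", entity_nodes, relations))
```

Only the nodes returned by a step like this would be shown to the LLM, which is why this style of retrieval is cheaper than navigating the full tree.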
Bounding Boxes and Citations
Every answer includes citations with exact bounding box coordinates:
for citation in answer.citations:
    print(f"Section: {citation.title}, Pages: {citation.pages}")
    for bb in citation.bounding_boxes:
        print(f" Page {bb.page}: ({bb.x:.2f}, {bb.y:.2f}) — {bb.text}")
The citation resolver matches answer text back to specific regions on the PDF page, so you can highlight exactly where the answer came from.
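As a rough illustration of matching answer text back to a page region, the sketch below scores each region's text against an answer snippet with the standard library's `difflib` and returns the closest one. This swaps in simple string similarity as a stand-in; it is not the resolver NanoIndex uses. The `best_region` function, the region dictionaries, and all values in them are made up for the example.

```python
from difflib import SequenceMatcher


def best_region(answer_snippet, regions):
    """Return the region whose text is most similar to the answer snippet."""
    def score(region):
        return SequenceMatcher(
            None, answer_snippet.lower(), region["text"].lower()
        ).ratio()
    return max(regions, key=score)


# Made-up regions: page number, top-left coordinates, and extracted text.
regions = [
    {"page": 45, "x": 0.12, "y": 0.30, "text": "Total revenue 1,234"},
    {"page": 45, "x": 0.12, "y": 0.48, "text": "Gross margin 42.3%"},
    {"page": 46, "x": 0.10, "y": 0.22, "text": "Operating expenses 512"},
]
hit = best_region("gross margin was 42.3%", regions)
print(hit["page"], hit["x"], hit["y"], hit["text"])
```

The coordinates of the winning region are what a viewer would use to draw a highlight on the page.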
Open-Source Mode (No API Key for Parsing)
ni = NanoIndex(parser="pymupdf")
tree = ni.index("report.pdf") # no API key needed
PyMuPDF gives basic text and table extraction. The resulting tree is simpler because there is no heading detection, but it works for quick experiments. For production, use Nanonets OCR-3.
Benchmarks
| Benchmark | Documents | Avg Pages | Accuracy |
|---|---|---|---|
| FinanceBench (SEC 10-K filings) | 84 | 143 | 94.5% |
| DocBench Legal (court filings, legislation) | 51 | 54 | 96.0% |
Evidence page retrieval: 93.3%
How It Compares
| | Traditional RAG | NanoIndex |
|---|---|---|
| Indexing | Chunk + embed + vector DB | Extract + build tree |
| Retrieval | Similarity search | LLM reasons over structure |
| Tables | Poorly handled | Natively extracted |
| Figures | Not supported | Vision mode |
| Scanned docs | Needs separate OCR | Built-in |
| Structure-aware | No | Yes |
| Citations | Approximate | Exact page + bounding box |
Development
git clone https://github.com/nanonets/nanoindex.git
cd nanoindex
pip install -e ".[dev]"
pytest
License
Apache 2.0 — see LICENSE.