NanoIndex

Turn any PDF into searchable trees and visual knowledge graphs. Ask questions, get answers with page citations.

| Benchmark | Documents | Avg Pages | Accuracy |
|---|---|---|---|
| FinanceBench (SEC 10-K filings) | 84 | 143 | 94.5% |
| DocBench Legal (court filings, legislation) | 51 | 54 | 96.0% |

No vector databases. No chunk tuning. No embeddings.

NanoIndex reads your document, understands its structure (headings, sections, tables, figures), and builds a tree you can search with plain English. Built on Nanonets OCR-3 for extraction. Fully open source.
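
To make the idea of a searchable tree concrete, here is a minimal sketch of how a document outline might be represented and walked. The `Node` class, its fields, and `find_sections` are hypothetical names used for illustration and are not NanoIndex's actual data model. Plain keyword matching stands in for the LLM's reasoning over the structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: one heading plus the pages it spans."""
    title: str
    pages: Tuple[int, int]  # inclusive (first_page, last_page)
    children: List["Node"] = field(default_factory=list)


def find_sections(node, keyword, path=()):
    """Depth-first walk returning (breadcrumb, pages) for titles containing keyword."""
    here = path + (node.title,)
    hits = []
    if keyword.lower() in node.title.lower():
        hits.append((" > ".join(here), node.pages))
    for child in node.children:
        hits.extend(find_sections(child, keyword, here))
    return hits


# Made-up outline for illustration only.
doc = Node("Annual Report", (1, 143), [
    Node("Business Overview", (3, 20)),
    Node("Financial Statements", (40, 90), [
        Node("Income Statement", (45, 46)),
        Node("Balance Sheet", (47, 48)),
    ]),
])

print(find_sections(doc, "income"))
```

Because every hit carries its breadcrumb and page range, an answer built from it can cite the section and pages it came from.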


Quick Start

1. Install

pip install nanoindex

2. Set your API keys

export NANONETS_API_KEY=your_key    # Get free at docstrange.nanonets.com/app (10K pages free)
export OPENAI_API_KEY=your_key      # Or ANTHROPIC_API_KEY, GOOGLE_API_KEY, GROQ_API_KEY

3. Go

from nanoindex import NanoIndex

ni = NanoIndex()
tree = ni.index("report.pdf")
answer = ni.ask("What was the revenue?", tree)
print(answer.content)

That's it. API keys are read from the environment, and the LLM is selected automatically based on which keys are available.
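
As a rough illustration of key-based auto-selection, the sketch below picks a provider from whichever environment variables are set. The function name `detect_provider`, the priority order, and the pairing of `GOOGLE_API_KEY` with the `gemini` prefix are assumptions, not NanoIndex's actual logic.

```python
import os

# Hypothetical priority order; the real library may check keys differently.
PROVIDER_KEYS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GOOGLE_API_KEY"),
    ("groq", "GROQ_API_KEY"),
]


def detect_provider(env=None):
    """Return the first provider whose API key is set and non-empty, else None."""
    env = os.environ if env is None else env
    for provider, key_name in PROVIDER_KEYS:
        if env.get(key_name):
            return provider
    return None


print(detect_provider({"ANTHROPIC_API_KEY": "sk-example"}))  # anthropic
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching real credentials.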


What Can You Do With It

Ask questions, get cited answers

answer = ni.ask("What was Q3 gross margin?", tree)
print(answer.content)     # "Gross margin was 42.3% in Q3..."
print(answer.citations)   # [Citation(title="Income Statement", pages=[45, 46])]

From the command line

nanoindex index report.pdf -o tree.json
nanoindex ask report.pdf "What was the revenue?"
nanoindex viz tree.json

Pick your LLM

ni = NanoIndex(llm="openai:gpt-4o")
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
ni = NanoIndex(llm="gemini:gemini-2.5-flash")
ni = NanoIndex(llm="groq:llama-3.3-70b-versatile")
ni = NanoIndex(llm="ollama:llama3")

Or just set the env var and NanoIndex picks the right one:

export ANTHROPIC_API_KEY=...   # NanoIndex uses Claude automatically

Save and reuse trees

from nanoindex.utils.tree_ops import save_tree, load_tree

save_tree(tree, "my_tree.json")
tree = load_tree("my_tree.json")  # instant, no API call
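
A common pattern built on save and load is "index once, reuse afterwards". The sketch below shows that pattern with plain JSON and a stand-in build function so it runs without NanoIndex. `load_or_build` and `fake_index` are hypothetical names; with the real library, the build step would presumably wrap `ni.index(...)` and the JSON calls would be replaced by `save_tree` and `load_tree`.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached JSON object if present; otherwise build, save, and return it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    obj = build()
    path.write_text(json.dumps(obj, indent=2), encoding="utf-8")
    return obj


calls = []


def fake_index():
    """Stand-in for an expensive indexing call; records each invocation."""
    calls.append(1)
    return {"title": "report.pdf", "children": []}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "my_tree.json"
    first = load_or_build(cache, fake_index)   # builds and writes the file
    second = load_or_build(cache, fake_index)  # reads the file, no rebuild

print(len(calls), first == second)  # 1 True
```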

Search across multiple documents

from nanoindex import DocumentStore

store = DocumentStore()
for pdf in ["q1.pdf", "q2.pdf", "q3.pdf"]:
    store.add(ni.index(pdf))

answer = ni.multi_ask("Compare revenue across quarters", store)

Build a Knowledge Base

from nanoindex import KnowledgeBase

kb = KnowledgeBase("./my-research")
kb.add("report1.pdf")
kb.add("report2.pdf")

answer = kb.ask("How do these compare?")  # answers filed back into wiki

Open the my-research/ folder in Obsidian to browse the compiled wiki with [[backlinks]].
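
If you want to inspect the link structure outside Obsidian, a backlink index can be derived from the `[[...]]` links themselves. This is a minimal sketch that assumes the wiki pages are text files using Obsidian-style wikilinks. The page names and contents below are made up, and `backlink_index` is a hypothetical helper, not part of NanoIndex.

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures only Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def backlink_index(pages):
    """Map each link target to the sorted list of pages that link to it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            index[target.strip()].add(name)
    return {target: sorted(sources) for target, sources in index.items()}


# Made-up page contents for illustration only.
pages = {
    "report1": "Revenue grew, see [[Revenue]] and [[Acme Corp]].",
    "report2": "Compare with [[Revenue|last year's revenue]].",
    "Revenue": "Concept article.",
}
print(backlink_index(pages))
```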


How It Works

Indexing: PDF to tree + graph

[Figure: Ingestion Pipeline]

Querying: Agentic Mode (default)

[Figure: Query Pipeline - Agentic Mode]

Querying: Fast Mode (graph-based, cheaper)

[Figure: Query Pipeline - Fast Mode]


Query Modes

| Mode | How it works | Best for |
|---|---|---|
| agentic_vision (default) | LLM navigates full tree + reads page images | Highest accuracy |
| agentic | Same without images | Text-heavy docs |
| fast | Graph entity lookup, LLM sees ~20 nodes | Cheapest, fastest |
| fast_vision | Same + page images | Charts and figures |

# Default — agentic with vision
answer = ni.ask("What was revenue?", tree, pdf_path="report.pdf")

# Fast mode — graph-based, 3x cheaper
answer = ni.ask("What was revenue?", tree, mode="fast")
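
The mode table can be read as two coarse decisions: whether cost matters more than accuracy, and whether the answer depends on page images. The helper below encodes that reading. `choose_mode` is a hypothetical convenience function, not part of NanoIndex; only the four mode strings come from the table.

```python
def choose_mode(needs_visuals: bool, prefer_cheap: bool) -> str:
    """Pick one of the four documented mode strings from two coarse preferences."""
    if prefer_cheap:
        return "fast_vision" if needs_visuals else "fast"
    return "agentic_vision" if needs_visuals else "agentic"


print(choose_mode(needs_visuals=True, prefer_cheap=False))  # agentic_vision
```

The result is meant to be passed as the `mode` argument shown above.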

Entity Graph

Every indexed document gets a knowledge graph built automatically using spaCy NLP (free, local, no API calls). Entities (companies, people, dates, money) and relationships are extracted from the tree.

If a reasoning LLM is configured, graph quality is enhanced with LLM-extracted entities on top of spaCy.

The graph powers:

  • Fast mode retrieval (entity keyword match + relationship expansion)
  • Knowledge Base concept articles
  • The Entities tab in the visualization dashboard
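
Fast-mode retrieval, described above as entity keyword match plus relationship expansion, can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the triples, the `entity_nodes` mapping from entity to tree-node IDs, the one-hop expansion, and the function names are all hypothetical. The default limit of 20 mirrors the "~20 nodes" figure from the Query Modes table.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency map from (entity, relation, entity) triples."""
    graph = defaultdict(set)
    for a, _relation, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def fast_retrieve(question, graph, entity_nodes, limit=20):
    """Match entity names in the question, expand one hop, collect their node IDs."""
    q = question.lower()
    matched = {e for e in entity_nodes if e.lower() in q}
    expanded = set(matched)
    for e in matched:
        expanded |= graph.get(e, set())  # relationship expansion
    node_ids = set()
    for e in expanded:
        node_ids.update(entity_nodes.get(e, ()))
    return sorted(node_ids)[:limit]


# Made-up entities and node IDs for illustration only.
edges = [
    ("Acme Corp", "reported", "Q3 revenue"),
    ("Acme Corp", "led_by", "Jane Doe"),
]
entity_nodes = {"Acme Corp": ["1.0"], "Q3 revenue": ["4.2"], "Jane Doe": ["2.1"]}
graph = build_graph(edges)

print(fast_retrieve("Who runs Acme Corp?", graph, entity_nodes))
```

Only the returned nodes would then be handed to the LLM, which is what keeps this path cheap compared with navigating the full tree.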

Bounding Boxes and Citations

Every answer includes citations with exact bounding box coordinates:

for citation in answer.citations:
    print(f"Section: {citation.title}, Pages: {citation.pages}")
    for bb in citation.bounding_boxes:
        print(f"  Page {bb.page}: ({bb.x:.2f}, {bb.y:.2f}) — {bb.text}")

The citation resolver matches answer text back to specific regions on the PDF page, so you can highlight exactly where the answer came from.
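
One way to picture the resolver is as fuzzy matching between an answer snippet and the text of each extracted page region. The sketch below uses the standard library's `difflib.SequenceMatcher` for the similarity score. The `Region` class, the threshold, and `resolve_citation` are assumptions for illustration; the actual resolver may work quite differently.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Region:
    """Hypothetical page region: position plus the text extracted from it."""
    page: int
    x: float
    y: float
    text: str


def resolve_citation(snippet, regions, threshold=0.6):
    """Return the region most similar to snippet, or None if below threshold."""
    best, best_score = None, 0.0
    for region in regions:
        score = SequenceMatcher(None, snippet.lower(), region.text.lower()).ratio()
        if score > best_score:
            best, best_score = region, score
    return best if best_score >= threshold else None


# Made-up regions for illustration only.
regions = [
    Region(45, 0.12, 0.40, "Gross margin was 42.3% in Q3"),
    Region(45, 0.12, 0.55, "Operating expenses rose 8%"),
]
match = resolve_citation("gross margin was 42.3% in q3", regions)
print(match.page, match.x, match.y)  # 45 0.12 0.4
```

The matched region's coordinates are what a viewer would use to draw the highlight on the page.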


Open-Source Mode (No API Key for Parsing)

ni = NanoIndex(parser="pymupdf")
tree = ni.index("report.pdf")  # no API key needed

PyMuPDF provides basic text and table extraction. The resulting tree is simpler (no heading detection), but it works for quick experiments. For production, use Nanonets OCR-3.


Benchmarks

| Benchmark | Documents | Avg Pages | Accuracy |
|---|---|---|---|
| FinanceBench (SEC 10-K filings) | 84 | 143 | 94.5% |
| DocBench Legal (court filings, legislation) | 51 | 54 | 96.0% |

Evidence page retrieval: 93.3%
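
The README does not define how evidence page retrieval is scored. One plausible definition, shown here purely as an assumption, is the fraction of questions for which at least one cited page appears among the gold evidence pages. The function name and the toy data are hypothetical and are not benchmark results.

```python
def evidence_page_hit_rate(examples):
    """Fraction of (cited_pages, gold_pages) pairs that share at least one page."""
    if not examples:
        return 0.0
    hits = sum(1 for cited, gold in examples if set(cited) & set(gold))
    return hits / len(examples)


# Toy data for illustration only; these are not benchmark results.
toy = [
    ([45, 46], [45]),  # hit: page 45 overlaps
    ([12], [30, 31]),  # miss: no shared page
    ([7, 8], [8]),     # hit
    ([3], [3]),        # hit
]
print(f"{evidence_page_hit_rate(toy):.1%}")  # 75.0%
```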

[Figure: FinanceBench Architecture]


How It Compares

| | Traditional RAG | NanoIndex |
|---|---|---|
| Indexing | Chunk + embed + vector DB | Extract + build tree |
| Retrieval | Similarity search | LLM reasons over structure |
| Tables | Poorly handled | Natively extracted |
| Figures | Not supported | Vision mode |
| Scanned docs | Needs separate OCR | Built-in |
| Structure-aware | No | Yes |
| Citations | Approximate | Exact page + bounding box |

Development

git clone https://github.com/nanonets/nanoindex.git
cd nanoindex
pip install -e ".[dev]"
pytest

License

Apache 2.0 — see LICENSE.
