

RNSR - Recursive Neural-Symbolic Retriever

A state-of-the-art document retrieval system that preserves hierarchical structure for superior RAG performance. Combines PageIndex, Recursive Language Models (RLM), Knowledge Graphs, and Tree of Thoughts navigation.

Overview

RNSR combines neural and symbolic approaches to achieve accurate document understanding:

  • Font Histogram Algorithm - Automatically detects document hierarchy from font sizes (no training required)
  • Skeleton Index Pattern - Lightweight summaries with KV store for efficient retrieval
  • Tree-of-Thoughts Navigation - LLM reasons about document structure to find answers
  • RLM Unified Extraction - LLM writes extraction code, grounded in actual text
  • Knowledge Graph - Entity and relationship storage for cross-document linking
  • Self-Reflection Loop - Iterative answer improvement through self-critique
  • Adaptive Learning - System learns from your document workload over time

Key Features

| Feature | Description |
|---|---|
| Hierarchical Extraction | Preserves document structure (sections, subsections, paragraphs) |
| RLM Unified Extractor | LLM writes extraction code + ToT validation (grounded, no hallucination) |
| Provenance System | Every answer traces back to exact document citations |
| LLM Response Cache | Semantic-aware caching for 10x cost/speed improvement |
| Self-Reflection | Iterative self-correction improves answer quality |
| Reasoning Memory | Learns successful query patterns for faster future queries |
| Query Clarification | Detects ambiguous queries and asks clarifying questions |
| Table/Chart Parsing | SQL-like queries over tables, chart trend analysis |
| Adaptive Learning | 6 registries that learn from usage and persist to disk |
| Multi-Document Detection | Automatically splits bundled PDFs |
| Vision Mode | OCR-free analysis for scanned documents and charts |

Installation

# Clone the repository
git clone https://github.com/theeufj/RNSR.git
cd RNSR

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install with all LLM providers
pip install -e ".[all]"

# Or install with specific provider
pip install -e ".[openai]"      # OpenAI only
pip install -e ".[anthropic]"   # Anthropic only
pip install -e ".[gemini]"      # Google Gemini only

Quick Start

1. Set up API keys

Create a .env file:

cp .env.example .env
# Edit .env with your API keys
# Choose your preferred LLM provider
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
# or
GOOGLE_API_KEY=AI...

# Optional: Override default models
LLM_PROVIDER=anthropic
SUMMARY_MODEL=claude-sonnet-4-5

2. Use the Python API

from rnsr import RNSRClient

# Simple one-line Q&A
client = RNSRClient()
answer = client.ask("contract.pdf", "What are the payment terms?")
print(answer)

# Advanced navigation with verification and self-reflection
result = client.ask_advanced(
    "complex_report.pdf",
    "Compare liability clauses in sections 5 and 8",
    enable_verification=True,
    enable_self_reflection=True,
    max_recursion_depth=3,
)

3. Run the Demo UI

python demo.py
# Open http://localhost:7860 in your browser

New Features

Provenance System

Every answer includes traceable citations:

from rnsr.agent import ProvenanceTracker, format_citations_for_display

tracker = ProvenanceTracker(kv_store=kv_store, skeleton=skeleton)
record = tracker.create_provenance_record(
    answer="The payment terms are net 30.",
    question="What are the payment terms?",
    variables=navigation_variables,
)

print(f"Confidence: {record.aggregate_confidence:.0%}")
print(format_citations_for_display(record.citations))
# Output:
# **Sources:**
# 1. [contract.pdf] Section: Payment Terms, Page 5: "Payment shall be due within 30 days..."

LLM Response Caching

Automatic caching reduces costs and latency:

from rnsr.agent import wrap_llm_with_cache, get_global_cache

# Wrap any LLM function with caching
cached_llm = wrap_llm_with_cache(llm.complete, ttl_seconds=3600)

# Use cached LLM - repeated prompts hit cache
response = cached_llm("What is 2+2?")  # Calls LLM
response = cached_llm("What is 2+2?")  # Returns cached (instant)

# Check cache stats
print(get_global_cache().get_stats())
# {'entries': 150, 'hits': 89, 'hit_rate': 0.59}

Self-Reflection Loop

Answers are automatically critiqued and improved:

from rnsr.agent import SelfReflectionEngine, reflect_on_answer

# Quick one-liner
result = reflect_on_answer(
    answer="The contract expires in 2024.",
    question="When does the contract expire?",
    evidence="Contract dated 2023, 2-year term...",
)

print(f"Improved: {result.improved}")
print(f"Final answer: {result.final_answer}")
print(f"Iterations: {result.total_iterations}")

Reasoning Chain Memory

The system learns from successful queries:

from rnsr.agent import get_reasoning_memory, find_similar_chains

# Find similar past queries
matches = find_similar_chains("What is the liability cap?")
for match in matches:
    print(f"Similar query: {match.chain.query}")
    print(f"Similarity: {match.similarity:.0%}")
    print(f"Past answer: {match.chain.answer}")

Table Parsing

Extract and query tables from documents:

from rnsr.ingestion import TableParser, TableQueryEngine

parser = TableParser()
tables = parser.parse_from_text(document_text)

# SQL-like queries
engine = TableQueryEngine(tables[0])
results = engine.select(
    columns=["Name", "Amount"],
    where={"Status": "Active"},
    order_by="Amount",
)

# Aggregations
total = engine.aggregate("Amount", "sum")

Query Clarification

Handle ambiguous queries gracefully:

from rnsr.agent import QueryClarifier, needs_clarification

# Check if query needs clarification
is_ambiguous, analysis = needs_clarification(
    "What does it say about the clause?"
)

if is_ambiguous:
    print(f"Ambiguity: {analysis.ambiguity_type}")
    print(f"Clarifying question: {analysis.suggested_clarification}")
    # "What does 'it' refer to in your question?"

Adaptive Learning

RNSR learns from your document workload. All learned data persists in ~/.rnsr/:

~/.rnsr/
├── learned_entity_types.json       # New entity types discovered
├── learned_relationship_types.json # New relationship types
├── learned_normalization.json      # Title/suffix patterns
├── learned_stop_words.json         # Domain-specific stop words
├── learned_header_thresholds.json  # Document-type font thresholds
├── learned_query_patterns.json     # Successful query patterns
├── reasoning_chains.json           # Successful reasoning chains
└── llm_cache.db                    # LLM response cache

The more you use RNSR, the better it gets at understanding your domain.
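
Because the registries are plain JSON files, you can inspect what has been learned so far. A minimal sketch (the internal schema of each file is not documented here, so this simply previews whatever is stored):

import json
from pathlib import Path

# The persisted learning registries live under ~/.rnsr/ (see the tree above).
rnsr_dir = Path.home() / ".rnsr"

for registry in sorted(rnsr_dir.glob("learned_*.json")):
    data = json.loads(registry.read_text())
    print(f"--- {registry.name} ---")
    print(json.dumps(data, indent=2)[:400])  # preview the first 400 characters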

How It Works

Document Ingestion Pipeline

PDF → Font Analysis → Header Classification → Tree Building → Skeleton Index
         ↓                    ↓                    ↓              ↓
   Detect font sizes   Classify H1/H2/H3    Build hierarchy   Create summaries
                                                  ↓
                                        Multi-doc detection
                                        (page number resets)
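
To make the font-analysis step concrete, here is a toy sketch of the font-histogram idea, independent of rnsr.ingestion.font_histogram (which handles many more cases): the most common font size is treated as body text, and progressively larger sizes map to heading levels.

from collections import Counter

def classify_headers(spans):
    """Toy font-histogram classifier (illustrative, not RNSR's implementation).

    spans: list of (text, font_size) tuples, e.g. extracted from a PDF.
    The most frequent size is assumed to be body text; larger sizes are
    mapped to heading levels, largest first (H1, H2, ...).
    """
    sizes = Counter(size for _, size in spans)
    body_size = sizes.most_common(1)[0][0]          # dominant size = body text
    header_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {size: f"H{i + 1}" for i, size in enumerate(header_sizes)}
    return [(text, level_of.get(size, "body")) for text, size in spans]

spans = [
    ("ANNUAL REPORT", 24.0),
    ("1. Overview", 18.0),
    ("Revenue grew 12% year over year...", 11.0),
    ("1.1 Segments", 14.0),
    ("Consumer and enterprise segments...", 11.0),
]
for text, label in classify_headers(spans):
    print(f"{label:<4} {text}")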

Query Processing

Question → Clarify → Pre-Filter → Tree Navigation → Answer → Self-Reflect → Verify
              ↓           ↓              ↓             ↓           ↓           ↓
        Ask if ambig  Keyword scan  ToT reasoning  Synthesize  Critique   Fact-check
                                         ↓                        ↓
                                  Sub-LLM recursion        Improve answer
                                  (complex queries)        (if issues)
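
As rough intuition for the pre-filter step, a cheap keyword scan over the skeleton summaries can narrow the tree before the more expensive ToT navigation runs. A simplified sketch (RNSR's actual pre-filter and skeleton index are richer than this):

import re

def prefilter(question, summaries, top_k=3):
    """Rank section summaries by keyword overlap with the question.

    summaries maps node_id -> short summary text. Illustrative only.
    """
    keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
    scored = []
    for node_id, summary in summaries.items():
        words = set(re.findall(r"[a-z]+", summary.lower()))
        scored.append((len(keywords & words), node_id))
    scored.sort(reverse=True)
    return [node_id for score, node_id in scored[:top_k] if score > 0]

summaries = {
    "sec_2": "Payment terms, invoicing schedule and late fees",
    "sec_5": "Limitation of liability and indemnification",
    "sec_9": "Termination and notice periods",
}
print(prefilter("What is the liability cap?", summaries))  # ['sec_5']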

Entity Extraction (RLM Unified)

Document → LLM writes code → Execute on DOC_VAR → ToT validation → Cross-validate
              ↓                     ↓                   ↓               ↓
     Generates regex/Python   Grounded results   Probability scores  Entity↔Relationship
                                    ↓
                            All tied to exact text spans
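
The important property in this diagram is grounding: every extracted value is tied back to an exact character span in the source text. A minimal illustration of that idea, where the sample text and regex stand in for the extraction code the LLM would generate (this is not the RLMUnifiedExtractor API):

import re

# DOC_VAR plays the role of the document text the generated code runs against.
DOC_VAR = (
    "The Service Provider, Acme Analytics Pty Ltd, shall invoice monthly. "
    "Payment shall be due within 30 days of invoice."
)

# Pretend this pattern was written by the LLM for "What are the payment terms?"
pattern = re.compile(r"due within (\d+) days")

findings = [
    {
        "value": match.group(1),
        "span": match.span(),                        # exact character offsets in DOC_VAR
        "evidence": DOC_VAR[match.start():match.end()],
    }
    for match in pattern.finditer(DOC_VAR)
]
print(findings)  # each finding carries its value, offsets, and verbatim evidence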

RLM Navigation Architecture (ToT + REPL Integration)

RNSR uses a unique combination of Tree of Thoughts (ToT) reasoning and a REPL (Read-Eval-Print Loop) environment for document navigation. This is what sets RNSR apart from naive RAG approaches:

The Problem with Naive RAG: Traditional RAG splits documents into chunks, embeds them, and retrieves based on similarity. This loses hierarchical structure and often retrieves irrelevant chunks for complex queries.

RNSR's RLM Navigation Solution:

Query → NavigationREPL → LLM Generates Code → Execute → Findings → ToT Validation → Answer
           ↓                    ↓                ↓          ↓            ↓
    Expose document       search_tree()      Find relevant  Store     Verify with
    as environment        navigate_to()      nodes          findings  probabilities
                          get_content()

How it works:

  1. Document as Environment: The document tree is exposed as a programmable environment through NavigationREPL. The LLM can write Python code to search, navigate, and extract information.

  2. Code Generation Navigation: Instead of keyword matching, the LLM writes code like:

    # LLM-generated code to find CEO salary
    results = search_tree(r"CEO|chief executive|compensation|salary")
    for match in results[:3]:
        navigate_to(match.node_id)
        content = get_node_content(match.node_id)
        if "salary" in content.lower():
            store_finding("ceo_salary", content, match.node_id)
    ready_to_synthesize()
    
  3. Iterative Search: The LLM can execute multiple rounds of code, drilling deeper into promising sections, just like a human would browse a document.

  4. ToT Validation: Findings are validated using Tree of Thoughts - each potential answer gets a probability score based on how well it matches the query and document evidence.

  5. Grounded Answers: All answers are tied to specific document sections. If the LLM can't find reliable information, it honestly reports "Unable to find reliable information" rather than hallucinating.
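
For intuition, the ToT validation in step 4 can be pictured as scoring each stored finding against the query and keeping only well-supported ones. A deliberately simplified sketch; llm_score is a placeholder for the LLM-based probability estimate, not an RNSR function:

def validate_findings(question, findings, llm_score, threshold=0.6):
    """Toy Tree-of-Thoughts style validation (illustrative only).

    findings: dict mapping a label to (content, node_id).
    llm_score: callable(question, content) -> probability in [0, 1] that the
               content answers the question (in RNSR this judgment is made by the LLM).
    """
    scored = sorted(
        ((llm_score(question, content), label, node_id)
         for label, (content, node_id) in findings.items()),
        reverse=True,
    )
    supported = [item for item in scored if item[0] >= threshold]
    if not supported:
        return "Unable to find reliable information"
    return supported  # highest-probability, grounded findings feed answer synthesis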

Available NavigationREPL Functions:

| Function | Description |
|---|---|
| search_content(pattern) | Regex search within current node |
| search_children(pattern) | Search direct children |
| search_tree(pattern) | Search entire subtree with relevance scoring |
| navigate_to(node_id) | Move to a specific section |
| go_back() | Return to previous section |
| go_to_root() | Return to document root |
| get_node_content(node_id) | Get full text of a section |
| store_finding(key, content, node_id) | Save relevant information |
| ready_to_synthesize() | Signal that enough info has been gathered |
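
As a further illustration, here is the kind of code the navigator might generate for a cross-reference question, using only the functions listed above. The query, regexes, and attribute access on the returned matches mirror the earlier example and are assumptions, not a verified API:

# Hypothetical LLM-generated navigation for:
# "What does the appendix say about the claim in section 2?"
claim_hits = search_tree(r"section 2|claim")
for match in claim_hits[:2]:
    navigate_to(match.node_id)
    store_finding("claim_context", get_node_content(match.node_id), match.node_id)

go_to_root()
appendix_hits = search_children(r"appendix")
if appendix_hits:
    navigate_to(appendix_hits[0].node_id)
    if search_content(r"claim|warranty"):
        store_finding(
            "appendix_discussion",
            get_node_content(appendix_hits[0].node_id),
            appendix_hits[0].node_id,
        )

ready_to_synthesize()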

Why This Outperforms Naive RAG:

  • Hierarchical Understanding: RNSR understands that "Section 42" might contain the CEO salary even if the query doesn't mention "Section 42"
  • Multi-hop Reasoning: Can navigate from a table of contents to a specific subsection to find buried information
  • Document Length Agnostic: Works equally well on 10-page and 1000-page documents - the LLM navigates to relevant sections rather than trying to fit everything in context
  • No Hallucination: If information isn't found through code execution, the system admits it rather than making up answers

Architecture

rnsr/
├── agent/                   # Query processing
│   ├── rlm_navigator.py     # Main navigation agent (RLM + ToT)
│   ├── nav_repl.py          # NavigationREPL for code-based navigation (NEW)
│   ├── repl_env.py          # Base REPL environment
│   ├── provenance.py        # Citation tracking
│   ├── llm_cache.py         # Response caching
│   ├── self_reflection.py   # Answer improvement
│   ├── reasoning_memory.py  # Chain memory
│   ├── query_clarifier.py   # Ambiguity handling
│   ├── graph.py             # LangGraph workflow
│   └── variable_store.py    # Context management
├── extraction/              # Entity/relationship extraction
│   ├── rlm_unified_extractor.py  # Best extractor (NEW)
│   ├── learned_types.py     # Adaptive type learning
│   ├── entity_linker.py     # Cross-document linking
│   └── models.py            # Entity/Relationship models
├── indexing/                # Index construction
│   ├── skeleton_index.py    # Summary generation
│   ├── knowledge_graph.py   # Entity/relationship storage
│   ├── kv_store.py          # SQLite/in-memory storage
│   └── semantic_search.py   # Optional vector search
├── ingestion/               # Document processing
│   ├── pipeline.py          # Main ingestion orchestrator
│   ├── font_histogram.py    # Font-based structure detection
│   ├── header_classifier.py # H1/H2/H3 classification
│   ├── table_parser.py      # Table extraction (NEW)
│   ├── chart_parser.py      # Chart interpretation (NEW)
│   └── tree_builder.py      # Hierarchical tree construction
├── llm.py                   # Multi-provider LLM abstraction
├── client.py                # High-level API
└── models.py                # Data structures

API Reference

High-Level API

from rnsr import RNSRClient

client = RNSRClient(
    llm_provider="anthropic",  # or "openai", "gemini"
    llm_model="claude-sonnet-4-5"
)

# Simple query
answer = client.ask("document.pdf", "What is the main topic?")

# Vision mode (for scanned docs)
answer = client.ask_vision("scanned.pdf", "What does the chart show?")

Low-Level API

from rnsr import (
    ingest_document,
    build_skeleton_index,
    run_rlm_navigator,
    SQLiteKVStore
)
from rnsr.extraction import RLMUnifiedExtractor
from rnsr.agent import ProvenanceTracker, SelfReflectionEngine

# Step 1: Ingest document
result = ingest_document("document.pdf")
print(f"Extracted {result.tree.total_nodes} nodes")

# Step 2: Build index
kv_store = SQLiteKVStore("./data/index.db")
skeleton = build_skeleton_index(result.tree, kv_store)

# Step 3: Extract entities (grounded, no hallucination)
extractor = RLMUnifiedExtractor()
extraction = extractor.extract(
    node_id="section_1",
    doc_id="document",
    header="Introduction",
    content="..."
)

# Step 4: Query with provenance
question = "What are the key findings?"
answer = run_rlm_navigator(
    question=question,
    skeleton=skeleton,
    kv_store=kv_store
)

# Step 5: Get citations (variables holds the navigation findings,
# as in the Provenance example above)
tracker = ProvenanceTracker(kv_store=kv_store)
record = tracker.create_provenance_record(answer, question, variables)

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| LLM_PROVIDER | Primary LLM provider | auto (detect from keys) |
| SUMMARY_MODEL | Model for summarization | Provider default |
| AGENT_MODEL | Model for navigation | Provider default |
| EMBEDDING_MODEL | Embedding model | text-embedding-3-small |
| KV_STORE_PATH | SQLite database path | ./data/kv_store.db |
| LOG_LEVEL | Logging verbosity | INFO |
| RNSR_LLM_CACHE_PATH | Custom cache location | ~/.rnsr/llm_cache.db |
| RNSR_REASONING_MEMORY_PATH | Custom memory location | ~/.rnsr/reasoning_chains.json |
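
For example, a .env that pins the provider and moves the KV store and cache off their defaults might look like this (all values are illustrative):

# .env - illustrative overrides
LLM_PROVIDER=openai
AGENT_MODEL=gpt-4o-mini
KV_STORE_PATH=./data/contracts_index.db
RNSR_LLM_CACHE_PATH=/var/cache/rnsr/llm_cache.db
LOG_LEVEL=DEBUG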

Supported Models

| Provider | Models |
|---|---|
| OpenAI | gpt-5.2, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4o-mini |
| Anthropic | claude-opus-4-5, claude-sonnet-4-5, claude-haiku-4-5 |
| Gemini | gemini-3-pro-preview, gemini-3-flash-preview, gemini-2.5-pro, gemini-2.5-flash |

Benchmarks

RNSR is designed for complex document understanding tasks:

  • Multi-document PDFs - Automatically detects and separates bundled documents
  • Hierarchical queries - "Compare section 3.2 with section 5.1"
  • Cross-reference questions - "What does the appendix say about the claim in section 2?"
  • Entity extraction - Grounded extraction with ToT validation (no hallucination)
  • Table queries - "What is the total for Q4 2024?"

Sample Documents

RNSR includes sample documents for testing and demonstration:

Synthetic Documents (samples/)

| File | Type | Features Demonstrated |
|---|---|---|
| sample_contract.md | Legal contract | Entities (people, orgs), relationships, payment tables, legal terms |
| sample_financial_report.md | Financial report | Financial tables, metrics, executive names, quarterly data |
| sample_research_paper.md | Academic paper | Citations, hierarchical sections, technical content, tables |

Real Test Documents (rnsr/test-documents/)

Legal documents from the Djokovic visa case (public court records) for testing with actual PDFs:

  • Affidavits and court applications
  • Legal submissions and orders
  • Interview transcripts

Using Sample Documents

from pathlib import Path
from rnsr.ingestion import TableParser
from rnsr.extraction import CandidateExtractor

# Parse a sample document
sample = Path("samples/sample_contract.md").read_text()

# Extract tables
parser = TableParser()
tables = parser.parse_from_text(sample)
print(f"Found {len(tables)} tables")

# Extract entities
extractor = CandidateExtractor()
candidates = extractor.extract_candidates(sample)
print(f"Found {len(candidates)} entity candidates")

Testing

Test Suite Overview

RNSR has comprehensive test coverage with 281+ tests:

# Run all tests
pytest tests/ -v

# Run specific feature tests
pytest tests/test_provenance.py tests/test_llm_cache.py -v

# Run end-to-end workflow tests
pytest tests/test_e2e_workflow.py -v

# Run with coverage
pytest tests/ --cov=rnsr --cov-report=html

Test Categories

| Test File | Tests | Coverage |
|---|---|---|
| test_e2e_workflow.py | 18 | Full pipeline: ingestion → extraction → KG → query → provenance |
| test_provenance.py | 17 | Citations, contradictions, provenance records |
| test_llm_cache.py | 17 | Cache get/set, TTL, persistence |
| test_self_reflection.py | 13 | Critique, refinement, iteration limits |
| test_reasoning_memory.py | 15 | Chain storage, similarity matching |
| test_query_clarifier.py | 19 | Ambiguity detection, clarification |
| test_table_parser.py | 26 | Markdown/ASCII tables, SQL-like queries |
| test_chart_parser.py | 16 | Chart detection, trend analysis |
| test_rlm_unified.py | 13 | REPL execution, code cleaning |
| test_learned_types.py | 13 | Adaptive learning registries |

End-to-End Workflow Tests

The test_e2e_workflow.py demonstrates the complete pipeline:

# Tests cover:
# 1. Document Ingestion - Parse structure and tables
# 2. Entity Extraction - Pattern-based grounded extraction  
# 3. Knowledge Graph - Store entities and relationships
# 4. Query Processing - Ambiguity detection, table queries
# 5. Provenance - Citations and evidence tracking
# 6. Self-Reflection - Answer improvement loop
# 7. Reasoning Memory - Learn from successful queries
# 8. LLM Cache - Response caching
# 9. Adaptive Learning - Type discovery
# 10. Full Workflow - Contract and financial analysis

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run linting
ruff check .

# Type checking
mypy rnsr/

Requirements

  • Python 3.10+
  • At least one LLM API key (OpenAI, Anthropic, or Gemini)

License

MIT License - see LICENSE for details.

Contributing

See CONTRIBUTING.md for guidelines.

Research

RNSR is inspired by:

  • PageIndex
  • Recursive Language Models (RLM)
  • Tree of Thoughts (ToT)
  • Knowledge Graphs
