Skip to main content

AI-powered document querying with citations

Project description

DocNav: AI-Powered Document Querying with Citations

PyPI version Python versions License Downloads

DocNav is a professional, industry-grade document management and querying system that enables you to ask questions about your documents and get accurate answers with source citations. Built for both CLI and Python API usage.

โœจ Features

  • ๐Ÿ“š Multi-format Support: PDF, DOCX, TXT, MD, CSV, Excel, PowerPoint
  • ๐Ÿง  Smart Chunking: Intelligent document segmentation for better context
  • ๐Ÿ” Vector Search: Fast similarity-based document retrieval
  • ๐Ÿค– Multiple LLMs: OpenAI, Gemini, Claude support
  • ๐Ÿ“ Citations: Answers include source document references
  • โšก Fast Processing: Parallel document processing with progress bars
  • ๐ŸŽฏ Industry Ready: Production-grade with error handling and logging
  • ๐Ÿ”ง Flexible: CLI tool and Python API

๐Ÿš€ Quick Start

Installation

# Basic installation
pip install docnav

# Full installation with all dependencies
pip install docnav[full]

# With OCR support for scanned PDFs
pip install docnav[full,ocr]

# Development installation
pip install docnav[dev]

CLI Usage

# Create a new corpus
docnav new mydocs

# Add documents
docnav add mydocs documents/ reports.pdf

# Query your documents
docnav query mydocs "What are the main findings?"

# Use different LLM providers
docnav query mydocs "Summarize the budget" --provider gemini --model gemini-2.5-flash
docnav query mydocs "Extract key dates" --provider claude --model claude-3-haiku-20240307

# List documents
docnav list mydocs

# Get statistics
docnav stats mydocs

# Quick query without creating corpus
docnav quick document.pdf "What is this about?"

Python API Usage

from docnav import Corpus, DocumentChunk

# Create or load a corpus
corpus = Corpus("mydocs")

# Add documents
corpus.add(["document.pdf", "report.docx"])

# Ask questions
answer = corpus.ask("What are the main findings?")
print(answer.text)

# Access sources
for source in answer.sources:
    print(f"Source: {source.metadata['file_name']}")
    print(f"Content: {source.text[:200]}...")

# List all documents
documents = corpus.list()
for doc in documents:
    print(f"{doc['file_name']} ({doc['chunks']} chunks)")

# Get statistics
stats = corpus.stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")

๐Ÿ“‹ Commands Reference

Corpus Management

  • docnav new <name> - Create new corpus
  • docnav add <corpus> <files> - Add documents to corpus
  • docnav list <corpus> - List documents in corpus
  • docnav stats <corpus> - Show corpus statistics
  • docnav remove <corpus> <file> - Remove specific document
  • docnav clear <corpus> - Clear entire corpus
  • docnav corpora - List all available corpora

Querying

  • docnav query <corpus> "<question>" - Ask question about corpus
  • docnav quick <file> "<question>" - Quick query single document

Options

  • --provider <openai|gemini|claude> - LLM provider
  • --model <model_name> - Specific model to use
  • --api-key <key> - API key (overrides environment)
  • --top-k <number> - Number of chunks to consider (default: 5)
  • --use-ocr - Use OCR for scanned PDFs
  • --details - Show detailed information

๐Ÿ”ง Configuration

Environment Variables

Set these for different LLM providers:

# OpenAI
export OPENAI_API_KEY="your-openai-key"

# Google Gemini
export GOOGLE_API_KEY="your-gemini-key"

# Anthropic Claude
export ANTHROPIC_API_KEY="your-claude-key"

Default Models

  • OpenAI: gpt-3.5-turbo
  • Gemini: gemini-2.5-flash
  • Claude: claude-3-haiku-20240307

๐Ÿ“ Storage

DocNav stores corpora in ~/.docnav/corpora/ by default:

~/.docnav/
โ”œโ”€โ”€ corpora/
โ”‚   โ”œโ”€โ”€ mydocs/
โ”‚   โ”‚   โ”œโ”€โ”€ corpus_index.pkl
โ”‚   โ”‚   โ””โ”€โ”€ metadata.json
โ”‚   โ””โ”€โ”€ another_corpus/
โ”‚       โ”œโ”€โ”€ corpus_index.pkl
โ”‚       โ””โ”€โ”€ metadata.json

๐ŸŽฏ Advanced Usage

Custom Chunking

from docnav import Corpus

# Custom chunk size
corpus = Corpus("mydocs", chunk_size=2000)

# Add with custom chunking
corpus.add(["large_document.pdf"], chunk_size=1500)

Filtering Queries

# Query with metadata filters
answer = corpus.ask(
    "Budget information",
    where={"type": "pdf", "file_name": "budget_report.pdf"}
)

Batch Processing

# Process multiple files efficiently
files = [
    "reports/q1.pdf",
    "reports/q2.pdf", 
    "reports/q3.pdf"
]
corpus.add(files, use_ocr=True)

๐Ÿ”Œ API Integration

OpenAI Integration

# Using OpenAI with custom model
answer = corpus.ask(
    "Analyze the trends",
    llm_provider="openai",
    llm_model="gpt-4-turbo",
    api_key="your-key"
)

Gemini Integration

# Using Google Gemini
answer = corpus.ask(
    "Extract insights",
    llm_provider="gemini", 
    llm_model="gemini-2.5-flash",
    api_key="your-gemini-key"
)

Claude Integration

# Using Anthropic Claude
answer = corpus.ask(
    "Summarize findings",
    llm_provider="claude",
    llm_model="claude-3-sonnet-20240229",
    api_key="your-claude-key"
)

๐Ÿ› ๏ธ Development

Setup Development Environment

# Clone repository
git clone https://github.com/Mukesh-Anand-G/DocNav.git
cd DocNav

# Install in development mode
pip install -e .[dev]

# Run tests
pytest

# Format code
black docnav/

Project Structure

docnav/
โ”œโ”€โ”€ docnav/
โ”‚   โ”œโ”€โ”€ __init__.py      # Package initialization
โ”‚   โ”œโ”€โ”€ core.py          # Core functionality
โ”‚   โ”œโ”€โ”€ cli.py           # Command-line interface
โ”‚   โ””โ”€โ”€ handlers.py      # CLI command handlers
โ”œโ”€โ”€ setup.py             # Package setup
โ”œโ”€โ”€ pyproject.toml       # Modern Python packaging
โ”œโ”€โ”€ requirements.txt     # Dependencies
โ””โ”€โ”€ README.md           # This file

๐Ÿ“Š Performance

  • Processing Speed: ~1000 pages/minute (depends on hardware)
  • Memory Usage: ~50MB for 1000 documents
  • Search Latency: <100ms for typical queries
  • Supported Formats: 10+ document types

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

๐Ÿ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • OpenAI for GPT models
  • Google for Gemini models
  • Anthropic for Claude models
  • Sentence Transformers team for embedding models
  • All contributors and users

๐Ÿ—บ๏ธ Roadmap

  • Web interface
  • Real-time document monitoring
  • Advanced filtering
  • Graph visualization
  • Plugin system
  • Multi-language support

Made with โค๏ธ by [Mukesh Anand G]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docnav-1.0.1.tar.gz (21.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docnav-1.0.1-py3-none-any.whl (18.5 kB view details)

Uploaded Python 3

File details

Details for the file docnav-1.0.1.tar.gz.

File metadata

  • Download URL: docnav-1.0.1.tar.gz
  • Upload date:
  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for docnav-1.0.1.tar.gz
Algorithm Hash digest
SHA256 a52bfc765a6d14f34411d8a52e26bc6f13a700efdbd09422ec3c5bcdd31a47d5
MD5 c7ccb3c5b2a2c1f6de9c5ee7ebeb6983
BLAKE2b-256 4fee3c43219bdcfde986a2a6121bd23fc4810b18054d122f0cc61d1d54dc975d

See more details on using hashes here.

File details

Details for the file docnav-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: docnav-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 18.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for docnav-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 51714713e5650f01a4bf05814fdccd12a5560216c9dee7d1a28ba74103c5fccd
MD5 8aefd7117827399e16de4760911446f3
BLAKE2b-256 8778b07275606da27b34ae66ce90fe010700962ac6c0eae6759450521e3cda3b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page