DocNav: AI-Powered Document Querying with Citations
DocNav is a document management and querying system that lets you ask questions about your documents and get answers with source citations. It can be used as a CLI tool or through a Python API.
Features
- Multi-format Support: PDF, DOCX, TXT, MD, CSV, Excel, PowerPoint
- Smart Chunking: Intelligent document segmentation for better context
- Vector Search: Fast similarity-based document retrieval
- Multiple LLMs: OpenAI, Gemini, Claude support
- Citations: Answers include source document references
- Fast Processing: Parallel document processing with progress bars
- Industry Ready: Production-grade with error handling and logging
- Flexible: CLI tool and Python API
Quick Start
Installation
# Basic installation
pip install docnav
# Full installation with all dependencies
pip install "docnav[full]"
# With OCR support for scanned PDFs
pip install "docnav[full,ocr]"
# Development installation
pip install "docnav[dev]"
CLI Usage
# Create a new corpus
docnav new mydocs
# Add documents
docnav add mydocs documents/ reports.pdf
# Query your documents
docnav query mydocs "What are the main findings?"
# Use different LLM providers
docnav query mydocs "Summarize the budget" --provider gemini --model gemini-2.5-flash
docnav query mydocs "Extract key dates" --provider claude --model claude-3-haiku-20240307
# List documents
docnav list mydocs
# Get statistics
docnav stats mydocs
# Quick query without creating corpus
docnav quick document.pdf "What is this about?"
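When the CLI is called from a script, it can help to build the argument list programmatically. The sketch below assembles a `docnav query` command from flags documented in this README (`--provider`, `--model`, `--top-k`). The helper name `build_query_command` is hypothetical and not part of DocNav, and the sketch only constructs the arguments rather than running them.

```python
import shlex


def build_query_command(corpus, question, provider=None, model=None, top_k=None):
    """Return the argv list for a `docnav query` call (hypothetical helper)."""
    cmd = ["docnav", "query", corpus, question]
    if provider is not None:
        cmd += ["--provider", provider]
    if model is not None:
        cmd += ["--model", model]
    if top_k is not None:
        cmd += ["--top-k", str(top_k)]
    return cmd


cmd = build_query_command("mydocs", "Summarize the budget", provider="gemini", top_k=3)
# shlex.join quotes the question so the line can be pasted into a shell
print(shlex.join(cmd))
```

With DocNav installed, the same list could be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems in the question text.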
Python API Usage
from docnav import Corpus, DocumentChunk
# Create or load a corpus
corpus = Corpus("mydocs")
# Add documents
corpus.add(["document.pdf", "report.docx"])
# Ask questions
answer = corpus.ask("What are the main findings?")
print(answer.text)
# Access sources
for source in answer.sources:
    print(f"Source: {source.metadata['file_name']}")
    print(f"Content: {source.text[:200]}...")
# List all documents
documents = corpus.list()
for doc in documents:
    print(f"{doc['file_name']} ({doc['chunks']} chunks)")
# Get statistics
stats = corpus.stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
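As an illustration of working with the returned sources, the sketch below turns a list of source objects into numbered citation lines. It relies only on the `text` and `metadata['file_name']` attributes used in the example above. The helper `format_citations` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the snippet runs without DocNav or an API key.

```python
from types import SimpleNamespace


def format_citations(sources, preview_chars=80):
    """Build one numbered citation line per source (hypothetical helper)."""
    lines = []
    for number, source in enumerate(sources, start=1):
        # Collapse runs of whitespace so the preview stays on one line
        preview = " ".join(source.text.split())[:preview_chars]
        lines.append(f"[{number}] {source.metadata['file_name']}: {preview}")
    return lines


# Stand-in objects shaped like the entries of answer.sources
demo_sources = [
    SimpleNamespace(text="Revenue grew  in Q2.", metadata={"file_name": "report.docx"}),
    SimpleNamespace(text="Costs fell.", metadata={"file_name": "document.pdf"}),
]
for line in format_citations(demo_sources):
    print(line)
```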
Commands Reference
Corpus Management
- docnav new <name> - Create new corpus
- docnav add <corpus> <files> - Add documents to corpus
- docnav list <corpus> - List documents in corpus
- docnav stats <corpus> - Show corpus statistics
- docnav remove <corpus> <file> - Remove specific document
- docnav clear <corpus> - Clear entire corpus
- docnav corpora - List all available corpora
Querying
- docnav query <corpus> "<question>" - Ask question about corpus
- docnav quick <file> "<question>" - Quick query single document
Options
- --provider <openai|gemini|claude> - LLM provider
- --model <model_name> - Specific model to use
- --api-key <key> - API key (overrides environment)
- --top-k <number> - Number of chunks to consider (default: 5)
- --use-ocr - Use OCR for scanned PDFs
- --details - Show detailed information
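The `--top-k` option controls how many chunks are considered when answering. As a rough illustration of what top-k retrieval means, the sketch below ranks chunk vectors by cosine similarity to a query vector and keeps the best k. This is not DocNav's implementation; the function names and the toy two-dimensional vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return (chunk_id, score) pairs for the k most similar chunks."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real embeddings have hundreds of dimensions
chunk_vecs = {"intro": [1.0, 0.0], "budget": [0.6, 0.8], "appendix": [0.0, 1.0]}
best = top_k_chunks([0.0, 1.0], chunk_vecs, k=2)
print(best)
```

A larger k gives the language model more context at the cost of a longer prompt, which is the trade-off the option exposes.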
Configuration
Environment Variables
Set these for different LLM providers:
# OpenAI
export OPENAI_API_KEY="your-openai-key"
# Google Gemini
export GOOGLE_API_KEY="your-gemini-key"
# Anthropic Claude
export ANTHROPIC_API_KEY="your-claude-key"
Default Models
- OpenAI: gpt-3.5-turbo
- Gemini: gemini-2.5-flash
- Claude: claude-3-haiku-20240307
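To keep provider selection in one place, a script might map each provider to the environment variable and default model listed above. The helper `resolve_llm_settings` below is hypothetical and not part of DocNav. It returns keyword arguments named `llm_provider`, `llm_model`, and `api_key`, matching the parameter names used in this README's `corpus.ask` examples.

```python
import os

# Environment variable and default model per provider, as listed in this README
PROVIDERS = {
    "openai": ("OPENAI_API_KEY", "gpt-3.5-turbo"),
    "gemini": ("GOOGLE_API_KEY", "gemini-2.5-flash"),
    "claude": ("ANTHROPIC_API_KEY", "claude-3-haiku-20240307"),
}


def resolve_llm_settings(provider, model=None, env=None):
    """Return ask() keyword arguments for a provider (hypothetical helper)."""
    env = os.environ if env is None else env
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    key_var, default_model = PROVIDERS[provider]
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"Set {key_var} before using the {provider} provider")
    return {"llm_provider": provider, "llm_model": model or default_model, "api_key": api_key}


# Example with an explicit mapping instead of the real environment
settings = resolve_llm_settings("claude", env={"ANTHROPIC_API_KEY": "your-claude-key"})
print(settings["llm_model"])
```

Failing early on a missing key gives a clearer error than a rejected request from the provider.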
Storage
DocNav stores corpora in ~/.docnav/corpora/ by default:
~/.docnav/
├── corpora/
│   ├── mydocs/
│   │   ├── corpus_index.pkl
│   │   └── metadata.json
│   └── another_corpus/
│       ├── corpus_index.pkl
│       └── metadata.json
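Given this layout, the corpora on disk can be found by scanning the storage directory. The sketch below is a standalone illustration that assumes each corpus is a subdirectory containing `corpus_index.pkl`, as in the tree above. The function name `find_corpora` is hypothetical, and the `docnav corpora` command remains the documented way to list corpora.

```python
from pathlib import Path


def find_corpora(root=None):
    """Return names of subdirectories that contain a corpus_index.pkl file."""
    root = Path(root) if root is not None else Path.home() / ".docnav" / "corpora"
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and (entry / "corpus_index.pkl").is_file()
    )


# Uses the default location; prints an empty list if nothing has been created yet
print(find_corpora())
```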
Advanced Usage
Custom Chunking
from docnav import Corpus
# Custom chunk size
corpus = Corpus("mydocs", chunk_size=2000)
# Add with custom chunking
corpus.add(["large_document.pdf"], chunk_size=1500)
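To make the effect of `chunk_size` concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes sizes are measured in characters, which this README does not state, and it is not DocNav's actual chunking logic. The names `chunk_text` and `overlap` are invented for illustration.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters
chunks = chunk_text(sample, chunk_size=20, overlap=5)
print([len(c) for c in chunks])
```

Smaller chunks tend to give more precise citations, while larger chunks keep more surrounding context in each retrieved passage.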
Filtering Queries
# Query with metadata filters
answer = corpus.ask(
    "Budget information",
    where={"type": "pdf", "file_name": "budget_report.pdf"}
)
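One plausible reading of the `where` argument is that a chunk qualifies only when every listed key matches its metadata. The sketch below illustrates that AND-style matching on plain dictionaries. The exact semantics DocNav applies are not documented here, so treat `matches_where` as a hypothetical illustration and not as the library's behavior.

```python
def matches_where(metadata, where):
    """True when every key in `where` is present in metadata with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in where.items())


# Metadata dictionaries standing in for stored chunks
chunk_metadata = [
    {"type": "pdf", "file_name": "budget_report.pdf"},
    {"type": "pdf", "file_name": "q1.pdf"},
    {"type": "docx", "file_name": "notes.docx"},
]
where = {"type": "pdf", "file_name": "budget_report.pdf"}
selected = [meta for meta in chunk_metadata if matches_where(meta, where)]
print(selected)
```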
Batch Processing
# Process multiple files efficiently
files = [
    "reports/q1.pdf",
    "reports/q2.pdf",
    "reports/q3.pdf"
]
corpus.add(files, use_ocr=True)
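For larger batches, the file list can be gathered from a directory instead of typed by hand. The sketch below collects files by extension with `pathlib`. The extension set is an assumption derived from the formats named under Features (the exact extensions DocNav accepts are not listed), and `collect_documents` is a hypothetical helper.

```python
from pathlib import Path

# Assumed extensions for the formats named in the feature list
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".csv", ".xlsx", ".pptx"}


def collect_documents(folder):
    """Recursively collect files under `folder` whose extension looks supported."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


files = collect_documents("reports")
print(f"Found {len(files)} candidate files")
```

The resulting list can be passed to `corpus.add(files)` in the same way as the hand-written list above.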
API Integration
OpenAI Integration
# Using OpenAI with custom model
answer = corpus.ask(
    "Analyze the trends",
    llm_provider="openai",
    llm_model="gpt-4-turbo",
    api_key="your-key"
)
Gemini Integration
# Using Google Gemini
answer = corpus.ask(
    "Extract insights",
    llm_provider="gemini",
    llm_model="gemini-2.5-flash",
    api_key="your-gemini-key"
)
Claude Integration
# Using Anthropic Claude
answer = corpus.ask(
    "Summarize findings",
    llm_provider="claude",
    llm_model="claude-3-sonnet-20240229",
    api_key="your-claude-key"
)
Development
Setup Development Environment
# Clone repository
git clone https://github.com/Mukesh-Anand-G/DocNav.git
cd DocNav
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black docnav/
Project Structure
docnav/
├── docnav/
│   ├── __init__.py      # Package initialization
│   ├── core.py          # Core functionality
│   ├── cli.py           # Command-line interface
│   └── handlers.py      # CLI command handlers
├── setup.py             # Package setup
├── pyproject.toml       # Modern Python packaging
├── requirements.txt     # Dependencies
└── README.md            # This file
Performance
- Processing Speed: ~1000 pages/minute (depends on hardware)
- Memory Usage: ~50MB for 1000 documents
- Search Latency: <100ms for typical queries
- Supported Formats: 10+ document types
Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- OpenAI for GPT models
- Google for Gemini models
- Anthropic for Claude models
- Sentence Transformers team for embedding models
- All contributors and users
Roadmap
- Web interface
- Real-time document monitoring
- Advanced filtering
- Graph visualization
- Plugin system
- Multi-language support
Made with ❤️ by Mukesh Anand G