ARF - Advanced Retrieval Framework
ARF (Advanced Retrieval Framework) is a production-ready RAG system built around the R-Flow multi-stage pipeline and designed to minimize cost and hallucination. It is optimized for legal document search and analysis across multiple domains.
Table of Contents
- Summary
- Live Demo
- Overview
- Architecture
- Installation
- Configuration
- Usage
- Components
- Development
- Evaluation & Benchmarks
- MLP Reranker
- Contributing
Summary
What makes ARF different from other RAG systems:
Most RAG pipelines rely on expensive LLM calls to rerank and verify retrieval results. ARF proves this is unnecessary. A lightweight MLP reranker (128-64-32 neurons, <5ms, $0.00/query) trained on domain-specific features outperforms LLM-based reranking (GPT-4o, ~500ms, $0.004/query) by a wide margin.
Key innovations:
- Learned retrieval > LLM reranking — A small MLP trained on 3,600 labeled pairs achieves +40% MRR over LLM verification, at zero cost. The MLP sees the entire candidate distribution and learns cross-feature relevance patterns that per-document LLM scoring cannot capture.
- R-Flow pipeline — Each stage filters candidates so the next stage does less work:
- Keyword matching — Structured pattern detection (e.g., "Article I Section 8", "14th Amendment") maps directly to known documents, bypassing semantic search entirely for exact references.
- Threshold gates (ABC Gates) — Score-based routing: ≥ 0.85 → accept immediately, < 0.70 → reject immediately, 0.70–0.85 → pass to the next stage. Eliminates ~60% of candidates without any LLM call.
- MLP Reranker — A 128-64-32 MLP trained on 3,600 labeled pairs scores borderline candidates in <5ms at $0.00. Confidently accepts (p ≥ 0.6) or rejects (p ≤ 0.4) ~80% of remaining candidates.
- LLM Fallback — Only the ~20% of candidates where the MLP is uncertain (0.4 < p < 0.6) go to the LLM verifier. This is the only stage that costs money, and it handles the smallest batch.
- Domain-specific thresholds — Each legal domain (US Constitution, CFR, US Code, USCIS Policy) has independently tuned thresholds and bias maps, avoiding one-size-fits-all degradation.
- Aggressive caching — In-memory + MongoDB caching makes repeated/similar queries cost $0.00 with 335ms latency (faster than raw MongoDB Atlas at 410ms). Cost stays flat as query volume grows.
- Automated retraining — Monthly pipeline exports new LLM judgments from production, retrains the MLP, and only deploys if performance improves.
🚀 Live Demo
Experience ARF in action: KnowYourRights.ai — an AI-powered legal rights search and case intake platform powered by ARF.
Overview
ARF is a production-ready RAG framework built for legal document retrieval across 6 domains: US Constitution, US Code, Code of Federal Regulations, USCIS Policy Manual, Supreme Court Cases, and Client Cases.
Core Capabilities
- Multi-Strategy Retrieval — Semantic vector search (Voyage-3-large, 1024d) + keyword matching + alias search + exact patterns, combined per domain
- MLP Reranker — Learned second-stage reranker that outperforms LLM verification (MRR 0.933 vs 0.665) at zero cost
- R-Flow Pipeline — Multi-stage filtering eliminates unnecessary computation: only ~20% of candidates reach the LLM
- Domain-Specific Tuning — Each domain has independent thresholds, bias maps, and field mappings
- Aggressive Caching — Embedding, result, and summary caching; repeated queries cost $0.00
- Bilingual Support — English/Spanish query processing and response generation
- Automated Retraining — Monthly MLP retraining from production LLM judgments
Supported Domains
| Domain | Collection | Features |
|---|---|---|
| US Constitution | `us_constitution` | Alias search, keyword matching, structured articles/sections |
| US Code | `us_code` | Large-scale (54 titles), clause-level search |
| Code of Federal Regulations | `code_of_federal_regulations` | Hierarchical part/chapter/section, section-level search |
| USCIS Policy Manual | `uscis_policy` | Automatic weekly updates, CFR reference tracking |
| Supreme Court Cases | `supreme_court_cases` | Case-to-constitutional provision mapping |
| Client Cases | `client_cases` | SQL-based private case search |
Architecture
System Components
```
ARF/
├── RAG_interface.py           # Main orchestrator class
├── config.py                  # Configuration and collection definitions
├── rag_dependencies/          # Core RAG components
│   ├── mongo_manager.py       # MongoDB connection and query management
│   ├── vector_search.py       # MongoDB Atlas Vector Search implementation
│   ├── query_manager.py       # Query processing and normalization
│   ├── query_processor.py     # End-to-end query pipeline
│   ├── alias_manager.py       # Alias/keyword search for US Constitution
│   ├── keyword_matcher.py     # Structured keyword matching
│   ├── llm_verifier.py        # LLM-based result reranking
│   ├── mlp_reranker.py        # MLP-based learned reranker (cost optimizer)
│   ├── feature_extractor.py   # Feature engineering for MLP reranker
│   ├── openai_service.py      # OpenAI API integration
│   └── ai_service.py          # AI service abstraction
├── models/                    # Trained ML models
│   └── mlp_reranker.joblib    # Trained MLP reranker model
├── benchmarks/                # Evaluation and benchmarking
│   ├── run_eval.py            # Full evaluation runner
│   ├── run_baseline.py        # Baseline measurement (before MLP)
│   ├── run_ablation_full.py   # Full benchmark (7 strategies)
│   ├── run_benchmark.py       # Basic strategy comparison
│   ├── train_reranker.py      # MLP training pipeline
│   ├── retrain_monthly.py     # Automated monthly retraining
│   ├── cost_comparison.py     # Cost savings analysis
│   ├── metrics.py             # Retrieval metrics (P@k, R@k, MRR, NDCG)
│   ├── cost_tracker.py        # Query cost tracking
│   ├── hallucination_eval.py  # Faithfulness evaluation
│   ├── benchmark_queries.json # Benchmark query dataset
│   └── eval_dataset.json      # Labeled evaluation dataset (200+ queries)
└── preprocess/                # Data ingestion scripts
    ├── us_constitution/       # US Constitution ingestion
    ├── us_code/               # US Code ingestion
    ├── cfr/                   # CFR ingestion
    ├── uscis_policy_manual/   # USCIS Policy Manual ingestion
    ├── supreme_court_cases/   # Supreme Court cases ingestion
    └── [other sources]/       # Additional data sources
```
Query Processing Flow
Pipeline Stages
- Query Input — Normalize, detect language (en/es), generate embedding
- Cache Check — Return cached results if available (zero API calls)
- Multi-Strategy Search — Semantic vector search + alias/keyword matching
- MLP Reranking — Feature extraction (15 features) + MLP scoring + blended reranking
- LLM Fallback — Only for MLP-uncertain candidates (~20%)
- Summary & Cache — Generate bilingual summary, cache for reuse
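The stage ordering above can be sketched in a few lines. This is a minimal illustration, assuming a simple normalized-query cache key — the function names, stubs, and cache scheme are invented, not ARF's real internals (which live in query_processor.py):

```python
# Placeholder stages so the sketch runs; ARF implements these for real.
def multi_strategy_search(query): return [query]   # vector + alias/keyword
def mlp_rerank(candidates): return candidates      # 15 features + MLP scoring
def llm_fallback(candidates): return candidates    # uncertain ~20% only

_result_cache: dict = {}

def handle_query(query: str, language: str = "en") -> list:
    key = (query.strip().lower(), language)    # 1. normalize the query
    if key in _result_cache:                   # 2. cache hit: zero API calls
        return _result_cache[key]
    candidates = multi_strategy_search(query)  # 3. multi-strategy search
    ranked = mlp_rerank(candidates)            # 4. MLP reranking
    verified = llm_fallback(ranked)            # 5. LLM only when uncertain
    _result_cache[key] = verified              # 6. cache for reuse
    return verified
```

Because the cache check happens before any search, repeated or trivially rephrased queries skip every paid stage.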
Installation
Prerequisites
- Python 3.10+
- MongoDB Atlas account with vector search enabled
- OpenAI API key
- Voyage AI API key
Setup
1. Clone the repository:

```bash
git clone <repository-url>
cd arf
```

2. Install dependencies:

```bash
pip install -e ".[dev]"
```

3. Configure environment variables — create a `.env` file (or `.env.local`, `.env.dev`, `.env.production`) with:

```bash
OPENAI_API_KEY=your_openai_api_key
VOYAGE_API_KEY=your_voyage_api_key
MONGO_URI=your_mongodb_atlas_connection_string
```

4. Set up MongoDB Atlas:
   - Create vector search indexes on your collections
   - Index name: `vector_index` (default)
   - Vector field: `embedding`
   - Dimensions: 1024
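For step 4, the index definition below matches the settings above (field `embedding`, 1024 dimensions, index name `vector_index`). The `"cosine"` similarity choice is an assumption — this README does not state it — and the commented pymongo call is one possible way to create the index, not ARF's setup script:

```python
# Atlas Vector Search index definition matching the settings above.
# "cosine" similarity is an assumption; this README does not specify it.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",      # vector field name from step 4
            "numDimensions": 1024,    # Voyage-3-large embedding size
            "similarity": "cosine",
        }
    ]
}

# With a recent pymongo you could create it like this (not run here):
# from pymongo.operations import SearchIndexModel
# collection.create_search_index(
#     SearchIndexModel(definition=vector_index_definition,
#                      name="vector_index", type="vectorSearch"))
```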
Configuration
Collection Configuration
Collections are defined in config.py with domain-specific settings:
```python
COLLECTION = {
    "US_CONSTITUTION_SET": {
        "db_name": "public",
        "main_collection_name": "us_constitution",
        "document_type": "US Constitution",
        "use_alias_search": True,
        "use_keyword_matcher": True,
        "thresholds": DOMAIN_THRESHOLDS["us_constitution"],
        # ... additional settings
    },
    # ... other collections
}
```
Domain-Specific Thresholds
Each domain has optimized thresholds for:
- `query_search`: Initial semantic search threshold
- `alias_search`: Alias matching threshold
- `RAG_SEARCH_min`: Minimum score to continue processing
- `LLM_VERIFICATION`: Threshold for LLM reranking
- `RAG_SEARCH`: High-confidence result threshold
- `confident`: Threshold for saving summaries
- `FILTER_GAP`: Maximum score gap between results
- `LLM_SCORE`: LLM reranking score adjustment
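For illustration, a hypothetical `DOMAIN_THRESHOLDS` entry might look like the sketch below. The values are made up — the actual tuned numbers live in config.py:

```python
# Illustrative values only -- the tuned per-domain numbers live in config.py.
DOMAIN_THRESHOLDS = {
    "us_constitution": {
        "query_search": 0.60,       # initial semantic search threshold
        "alias_search": 0.70,       # alias matching threshold
        "RAG_SEARCH_min": 0.50,     # minimum score to continue processing
        "LLM_VERIFICATION": 0.70,   # threshold for LLM reranking
        "RAG_SEARCH": 0.85,         # high-confidence result threshold
        "confident": 0.80,          # threshold for saving summaries
        "FILTER_GAP": 0.15,         # maximum score gap between results
        "LLM_SCORE": 0.05,          # LLM reranking score adjustment
    },
}
```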
Environment Selection
The framework supports multiple environments:
- `--production`: Uses `.env.production`
- `--dev`: Uses `.env.dev`
- `--local`: Uses `.env.local`
- Auto-detection: Based on Docker environment and file existence
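The flag-to-file mapping can be sketched as below. The `/.dockerenv` check stands in for the auto-detection step; the real detection logic (Docker environment plus file existence) is not shown in this README, so treat it as an assumption:

```python
import os

def pick_env_file(argv: list[str]) -> str:
    """Map CLI flags to .env files, as described above.

    The auto-detection fallback here (a /.dockerenv marker check) is an
    assumption, not ARF's actual detection logic.
    """
    flags = {
        "--production": ".env.production",
        "--dev": ".env.dev",
        "--local": ".env.local",
    }
    for flag, path in flags.items():
        if flag in argv:
            return path
    # Auto-detect: assume production inside Docker, else a plain .env.
    if os.path.exists("/.dockerenv"):
        return ".env.production"
    return ".env"
```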
Usage
Basic Usage
```python
from RAG_interface import RAG
from config import COLLECTION

# Initialize RAG for a specific collection
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=False)

# Process a query
results, query = rag.process_query(
    query="What does the 14th Amendment say about equal protection?",
    language="en"
)

# Get summary for a specific result
summary = rag.process_summary(
    query=query,
    result_list=results,
    index=0,
    language="en"
)
```
Advanced Usage
```python
# With jurisdiction filtering
results, query = rag.process_query(
    query="immigration policy",
    jurisdiction="federal",
    language="en"
)

# Bilingual summary
insight_en, insight_es = rag.process_summary_bilingual(
    query=query,
    result_list=results,
    index=0,
    language="es"  # Returns both English and Spanish
)

# SQL-based client case search
rag_sql = RAG(COLLECTION["CLIENT_CASES"], debug_mode=False)
results = rag_sql.process_query(
    query="asylum case",
    filtered_cases=["case_id_1", "case_id_2"]
)
```
Query Processing Options
- `skip_pre_checks`: Skip initial query validation
- `skip_cases_search`: Skip Supreme Court case search
- `filtered_cases`: Filter results to specific case IDs (SQL path)
- `jurisdiction`: Filter by jurisdiction
- `language`: Query language (`"en"` or `"es"`)
Components
RAG Interface (RAG_interface.py)
Main orchestrator class that wires all subsystems together:
- Collection configuration management
- Domain-specific threshold selection
- Component initialization
- Public API for query processing
Query Processor (query_processor.py)
End-to-end query processing pipeline:
- Query normalization and expansion
- Multi-stage search execution
- Result filtering and ranking
- Summary generation and caching
- Case-to-document mapping
Vector Search (vector_search.py)
MongoDB Atlas Vector Search implementation:
- Native `$vectorSearch` aggregation
- Score bias adjustments
- Efficient similarity search
- Error handling and retries
Query Manager (query_manager.py)
Query processing utilities:
- Text normalization
- Pattern matching
- Query rephrasing
- Domain detection
Alias Manager (alias_manager.py)
Alias-based search for US Constitution:
- Keyword/alias embeddings
- Fast alias matching
- Score boosting for exact matches
Keyword Matcher (keyword_matcher.py)
Structured keyword matching:
- Article/section pattern matching
- Hierarchical document navigation
- Exact match detection
LLM Verifier (llm_verifier.py)
LLM-based result verification (fallback for MLP-uncertain candidates):
- Only invoked for ~20% of borderline candidates (MLP handles the rest)
- Relevance scoring (0-9) with multiplier-based score adjustment
- Sequential or parallel verification modes
MLP Reranker (mlp_reranker.py)
Learned reranker that reduces LLM verification costs:
- scikit-learn MLPClassifier (128-64-32 hidden layers)
- Isotonic calibration for well-calibrated output probabilities
- Configurable uncertainty threshold for LLM fallback
- <5ms inference time per batch
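The reranker's shape can be reproduced with scikit-learn directly. Only the 128-64-32 layer sizes and isotonic calibration are stated in this README — the remaining hyperparameters, the training loop, and the synthetic data below are illustrative assumptions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neural_network import MLPClassifier

def build_reranker(seed: int = 0) -> CalibratedClassifierCV:
    """128-64-32 MLP wrapped in isotonic calibration (shape per this README).

    max_iter and other hyperparameters are assumptions for the sketch.
    """
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                        max_iter=300, random_state=seed)
    # Isotonic calibration maps raw MLP outputs to well-calibrated
    # probabilities, which the 0.4/0.6 routing band depends on.
    return CalibratedClassifierCV(mlp, method="isotonic", cv=3)

# Tiny synthetic demo: 15-dimensional features, binary relevance labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 15))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = build_reranker().fit(X, y)
proba = model.predict_proba(X[:5])[:, 1]  # calibrated P(relevant) per doc
```

Because inference is a single forward pass over a small network, batch scoring stays in the low-millisecond range on CPU.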
Feature Extractor (feature_extractor.py)
Extracts 15-dimensional feature vectors for query-document pairs:
| Feature | Description |
|---|---|
| `semantic_score` | Raw cosine similarity from vector search |
| `bm25_score` | Term-frequency based relevance approximation |
| `alias_match` | Whether query matches a document alias |
| `keyword_match` | Whether query matches via keyword pattern |
| `domain_type` | Encoded domain (0-3) |
| `document_length` | Log-scaled character count |
| `query_length` | Query character count |
| `section_depth` | Depth in legal hierarchy |
| `embedding_cosine_similarity` | Direct embedding cosine similarity |
| `match_type` | 0=none, 1=partial, 2=exact |
| `score_gap_from_top` | Gap from highest-scored document |
| `query_term_coverage` | Fraction of query terms in document |
| `title_similarity` | Jaccard similarity between query and title |
| `has_nested_content` | Whether document has clauses/sections |
| `bias_adjustment` | Domain-specific bias applied |
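A few of these features are simple to compute. The sketches below assume whitespace tokenization, which may differ from what feature_extractor.py actually does:

```python
import math

def query_term_coverage(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def title_similarity(query: str, title: str) -> float:
    """Jaccard similarity between query and title token sets."""
    a, b = set(query.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def document_length_feature(text: str) -> float:
    """Log-scaled character count, as in the table above."""
    return math.log1p(len(text))
```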
Mongo Manager (mongo_manager.py)
MongoDB connection and query management:
- Database connections
- Collection access
- Query caching
- User query history
Development
Running Tests
```bash
# Unit + integration tests (no API keys needed)
pytest tests/ -v

# Live integration tests (requires API keys + MongoDB)
ARF_LIVE_TESTS=1 pytest tests/test_integration.py -v

# Lint check
ruff check config.py config_schema.py rag_dependencies/ tests/

# Validate config schemas
python config_schema.py
```
Docker
Run tests, lint, or the full framework without installing anything locally:
```bash
# Build the image (installs all dependencies + runs lint check)
docker build -t arf .

# Run tests
docker compose up tests

# Run lint only
docker compose run lint

# Validate config
docker compose run validate-config

# Run benchmarks (requires .env with API keys)
docker compose --profile benchmark up benchmark

# Interactive shell
docker compose run arf bash
```
Note: Copy `.env.example` to `.env` and fill in your API keys before running services that require MongoDB or OpenAI access.
Adding New Data Sources
- Create a new directory in `preprocess/`
- Implement fetch and ingest scripts
- Add collection configuration to `config.py`
- Define domain-specific thresholds
- Create vector search indexes in MongoDB Atlas
Debugging
Enable debug mode for detailed logging:
```python
rag = RAG(COLLECTION["US_CONSTITUTION_SET"], debug_mode=True)
```
Evaluation & Benchmarks
Benchmark
Measured on 15 US Constitution benchmark queries. Each strategy runs in its own isolated RAG instance — strategies build incrementally to show the marginal gain of each layer. Latency measured with in-memory query cache enabled.
| Strategy | MRR | P@1 | P@5 | R@5 | NDCG@5 | LLM% | Latency |
|---|---|---|---|---|---|---|---|
| Semantic Only | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 0% | 410 ms |
| + Keyword | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 0% | 437 ms |
| + Threshold (ABC Gates) | 0.665 | 0.600 | 0.147 | 0.613 | 0.603 | 100% | 453 ms |
| + MLP Reranker | 0.933 | 0.933 | 0.267 | 0.900 | 0.908 | 0% | 714 ms |
| + MLP + LLM Fallback | 0.933 | 0.933 | 0.253 | 0.867 | 0.882 | 20% | 768 ms |
| Full ARF Pipeline (cold) | 0.489 | 0.400 | 0.133 | 0.580 | 0.503 | — | 807 ms |
| Full ARF Pipeline (cached) | 0.679 | 0.571 | 0.171 | 0.743 | 0.682 | 0% | 335 ms |
Key findings:
- Semantic search provides a solid baseline (MRR 0.665) — but the right answer is often not at rank 1.
- Keyword matching adds no measurable gain on this query set (US Constitution queries are predominantly semantic, not keyword-based).
- Threshold filtering adds quality gates but no ranking improvement — and incurs 100% LLM verification calls in the borderline band.
- MLP Reranker is the breakthrough: MRR jumps from 0.665 to 0.933 (+40%), P@1 from 0.600 to 0.933, and R@5 from 0.613 to 0.900 — with zero LLM calls. The MLP learns which features predict relevance and reranks candidates by blending semantic score with learned probability.
- In-memory cache brings cached query latency to 335 ms — faster than raw MongoDB Atlas (410 ms) — with zero API calls and $0.00 cost.
- MLP + LLM Fallback matches MLP-only quality while using LLM verification on only 20% of candidates (those the MLP is uncertain about). This is the production configuration.
MLP model: 128-64-32 MLP with isotonic calibration. Trained on 3,600 labeled query-document pairs across 4 legal domains. F1=0.940, AUC-ROC=0.983.
Run `python benchmarks/run_ablation_full.py --production` to reproduce.
Cost, Latency & Quality Over Volume
As query volume grows and cache warms, ARF latency drops from ~800ms (cold) to 335ms (cached) — faster than raw MongoDB Atlas (410ms). Cached queries cost $0.00 (zero API calls). MRR improves from 0.489 (cold) to 0.679 (cached) as verified results are reused.
Cost Analysis
Key findings from real measurement:
- Cached queries cost $0.00 — zero API calls, 335ms avg latency with in-memory cache (280ms min). All 20 "cold" queries hit cache from prior runs.
- Similar queries mostly miss cache (29% hit rate) — rephrased queries go through the full pipeline including Voyage batch embedding (~47 texts/call for alias search). OpenAI chat/moderation calls were zero on this run because threshold gates resolved all queries without LLM reranking.
- Voyage embedding is the only cost — $0.000926 total for 200 queries. The batch embedding (~47 texts/call) is the real cost driver, not LLM calls.
Cost at Scale (Measured + Extrapolated)
| Query Volume | Cache Hit Rate | Total API Cost | Cost/Query |
|---|---|---|---|
| 20 (cold, cached) | 100% | $0.000000 | $0.000000 |
| 200 (20 + 180 similar) | 36% | $0.000926 | $0.000005 |
| 1000 (100 + 900 similar) | ~36% | ~$0.005 | ~$0.000005 |
Cost thesis: ARF's API cost is dominated by Voyage batch embedding ($0.06/1M tokens). Cached queries cost $0.00 — zero external calls. At scale, cost grows only with the number of genuinely new queries that miss cache. For 1,000 queries where ~36% hit cache, total cost is ~$0.005 (half a cent). The MLP reranker runs locally at $0.00, and LLM reranking is reserved as a fallback for uncertain candidates (~20% of borderline cases).
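A back-of-the-envelope model consistent with the measured run above: 200 queries at a 36% hit rate means 128 cache misses for $0.000926, i.e. roughly $7.2e-6 per missing query. The per-miss constant below is derived from those figures, not measured directly:

```python
# Derived from the measured 200-query run above: $0.000926 spread over
# 200 * (1 - 0.36) = 128 cache misses. An approximation, not a measurement.
COST_PER_MISS = 0.000926 / (200 * (1 - 0.36))

def projected_cost(total_queries: int, cache_hit_rate: float) -> float:
    """Cache hits cost $0.00; only misses touch the Voyage embedding API."""
    misses = total_queries * (1 - cache_hit_rate)
    return misses * COST_PER_MISS
```

Plugging in 1,000 queries at a ~36% hit rate lands near the ~$0.005 figure quoted above, since cost grows only with genuinely new queries.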
MLP Reranker
A lightweight learned reranker (128-64-32 MLP, <5ms, $0.00/query) that replaces expensive LLM verification calls. Trained on 3,600 labeled query-document pairs across 4 legal domains. The LLM is reserved as a fallback for only ~20% of uncertain candidates.
Architecture
```
┌──────────────────────┐
│    Vector Search     │
│   (MongoDB Atlas)    │
└──────────┬───────────┘
           │ candidates with scores
┌──────────▼───────────┐
│  Feature Extractor   │
│    (15 features)     │
└──────────┬───────────┘
           │ feature vectors
┌──────────▼───────────┐
│     MLP Reranker     │
│   (128→64→32 MLP)    │
│  + isotonic calib.   │
│ F1=0.940  AUC=0.983  │
└──────────┬───────────┘
    ┌──────┼──────┐
p≥0.6      │      p≤0.4
    │  0.4<p<0.6  │
 Accept ┌──▼──┐ Reject
 (free) │ LLM │ (free)
        │Verif│
        │(20%)│
        └──┬──┘
      Accept/Reject
```
Why the MLP Wins
The MLP blends 15 features into a single relevance probability that captures signals the LLM cannot efficiently process:
- Semantic score + BM25 score — combines dense and sparse retrieval signals
- Match type (exact/partial/none) — structural pattern the LLM ignores
- Score gap from top — relative positioning in the candidate list
- Section depth — legal hierarchy structure (Title > Chapter > Section)
- Domain type — domain-specific relevance patterns
The LLM verifier sees one document at a time and rates it 0-9. The MLP sees the entire candidate distribution and learns which features predict relevance across domains.
How It Works
For each candidate document, the pipeline:
- Extracts 15 features — semantic score, BM25, keyword/alias match, document structure, query-document similarity
- MLP predicts — calibrated probability of relevance (0-1)
- Blends scores — `0.4 * semantic_score + 0.6 * mlp_probability` for reranking
- Routes by confidence:
  - p ≥ 0.6: Accept without LLM call (free, instant)
  - p ≤ 0.4: Reject without LLM call (free, instant)
  - 0.4 < p < 0.6: Uncertain — escalate to LLM verifier (~20% of candidates)
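The blend and routing steps can be written out directly from the formula and bands above (function names are illustrative, not ARF's API):

```python
def rerank_score(semantic_score: float, mlp_probability: float) -> float:
    """Blended reranking score: 0.4 * semantic + 0.6 * MLP probability."""
    return 0.4 * semantic_score + 0.6 * mlp_probability

def confidence_route(p: float) -> str:
    """Route a candidate by its calibrated MLP probability."""
    if p >= 0.6:
        return "accept"   # free, instant
    if p <= 0.4:
        return "reject"   # free, instant
    return "llm"          # uncertain: escalate to the LLM verifier
```

Weighting the learned probability above the raw semantic score is what lets the MLP reorder candidates the vector search ranked incorrectly.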
Training
```bash
# Train from evaluation dataset (requires MongoDB)
python benchmarks/train_reranker.py --dataset benchmarks/eval_dataset.json \
    --features-cache benchmarks/features_cache.json --production

# Retrain from cached features (no MongoDB needed)
python benchmarks/train_reranker.py --retrain --features-cache benchmarks/features_cache.json
```
The pipeline compares 3 architectures and picks the best:
| Model | F1 | AUC-ROC | Precision | Recall |
|---|---|---|---|---|
| Logistic Regression | 0.931 | 0.981 | 0.962 | 0.902 |
| MLP (64, 32) | 0.935 | 0.983 | 0.972 | 0.902 |
| MLP (128, 64, 32) | 0.948 | 0.988 | 0.977 | 0.921 |
Automated Monthly Retraining
```bash
python benchmarks/retrain_monthly.py --production --dry-run  # Check what would change
python benchmarks/retrain_monthly.py --production            # Retrain and deploy
```
The retraining pipeline:
- Exports recent LLM verifier judgments from MongoDB (last 30 days)
- Generates features for new query-document pairs
- Merges with existing training data (deduplicates)
- Retrains MLP on expanded dataset
- Only deploys new model if F1 >= old model
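The merge/deduplicate and deploy-if-better steps above can be sketched as follows. The `(query, doc_id, label)` pair representation and function names are assumptions, not the retrain_monthly.py internals:

```python
def merge_training_data(existing: list[tuple], new_pairs: list[tuple]) -> list[tuple]:
    """Merge (query, doc_id, label) triples, deduplicating on (query, doc_id).

    When the same pair appears twice, the newer label wins.
    """
    merged = {(q, d): label for q, d, label in existing}
    merged.update({(q, d): label for q, d, label in new_pairs})
    return [(q, d, label) for (q, d), label in merged.items()]

def should_deploy(new_f1: float, old_f1: float) -> bool:
    """Deploy only if the retrained model is at least as good (step 5)."""
    return new_f1 >= old_f1
```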
Running Benchmarks
```bash
# Full benchmark (all strategies compared incrementally)
python benchmarks/run_ablation_full.py --production

# Baseline measurement (current pipeline without MLP)
python benchmarks/run_baseline.py --production

# With hallucination evaluation
python benchmarks/run_baseline.py --production --eval-faithfulness
```
Metrics reported: P@k, R@k, MRR, NDCG@k, latency (p50/p95/p99), LLM call frequency, cost-per-query.
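The rank-based metrics reported are standard; minimal reference implementations (not ARF's metrics.py) for MRR and P@k:

```python
def reciprocal_rank(ranked_ids: list, relevant_ids: set) -> float:
    """1/rank of the first relevant result, or 0.0 if none retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(d in relevant_ids for d in ranked_ids[:k]) / k

def mean_reciprocal_rank(runs: list[tuple]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```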
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Code Style
- Follow PEP 8 Python style guide
- Use type hints where appropriate
- Add docstrings to public functions
- Include logging for important operations
License
This project is licensed under the MIT License — see LICENSE for details.
Acknowledgments
- MongoDB Atlas for vector search capabilities
- Voyage AI for embedding models
- OpenAI for LLM services
For detailed information on data ingestion, see preprocess/README.md.