Multi-agent RAG system for RBAC-secured financial document Q&A. 72.7% on FinanceBench. Ships a CLI client + self-hostable FastAPI backend.
Reason this release was yanked:
Install path broken on Apple Silicon — superseded by 0.1.5. See https://github.com/Rishabhmannu/financebench-rag-agent/releases/tag/v0.1.5
FinanceBench RAG Agent
A multi-agent RAG system for role-based access-controlled financial document Q&A. It achieves a 72.7% correctness pass rate on the public FinanceBench benchmark using selective agentic retrieval, a BGE cross-encoder reranker, and a self-hosted LLM observability stack.
Try it
pip install financebench-rag-agent
financebench setup # brings up the 4-service docker stack, seeds a sample corpus
financebench login -u analyst # password analyst123
financebench chat
The multi-party human-in-the-loop (HITL) approval workflow and conversation memory each have a walkthrough in docs/cli.md. Self-hosting the backend (env vars, full vs. minimal stack, production hardening) is covered in docs/deploy.md.
Architecture
The mermaid diagram below renders as a flowchart on GitHub. PyPI's markdown renderer doesn't support mermaid — readers there see the source and the prose summary that follows.
flowchart TD
Q([Query + JWT]) --> RBAC[rbac_gate<br/>JWT to Qdrant filter]
RBAC --> Guard[guardrails<br/>regex to LLM Guard to LLM classifier]
Guard -->|blocked| Block([blocked])
Guard --> Route{router}
Route -->|simple_lookup| Direct[retrieval → reranker → grader → generator]
Route -->|research_required| Agent[[research_agent subgraph<br/>decompose → retrieve → grade → sufficiency → synthesize<br/>5-turn cap]]
Direct --> Halu[hallucination_checker]
Agent --> Halu
Halu -->|ungrounded, retry up to 2| Direct
Halu --> HITL{hitl_gate}
HITL -->|amount above role threshold| Pause([pause for human approval])
HITL --> Out([Answer + sources])
A router classifies each query as a simple lookup or research-required. Simple lookups take the fast direct path (retrieval → BGE reranker → grader → Claude generator); research queries enter a multi-turn subgraph that decomposes the question, retrieves per sub-question, grades sufficiency, and synthesizes a final answer. RBAC is enforced at the Qdrant payload-filter level — agentic queries cannot bypass access control. High-stakes answers (above a per-role dollar threshold) pause via LangGraph's interrupt() for multi-party human approval, with state checkpointed to Postgres so the workflow survives container restarts.
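As a rough illustration of the two enforcement points described above, the sketch below builds a Qdrant-style payload filter from a caller's role and checks a dollar amount against a per-role threshold. The payload field `allowed_roles`, the `Claims` shape, and the threshold values are all assumptions made for this example; the project's actual names and numbers are not given in this README.

```python
from dataclasses import dataclass

# Hypothetical per-role approval thresholds in USD; the real values are not
# stated in this README.
APPROVAL_THRESHOLD_USD = {"analyst": 1_000_000, "manager": 10_000_000}


@dataclass(frozen=True)
class Claims:
    """Subset of decoded JWT claims this sketch relies on (assumed shape)."""
    sub: str
    role: str


def rbac_filter(claims: Claims) -> dict:
    """Build a Qdrant-style payload filter limiting hits to the caller's role.

    `allowed_roles` is an assumed payload field name.
    """
    return {"must": [{"key": "allowed_roles", "match": {"any": [claims.role]}}]}


def needs_approval(claims: Claims, amount_usd: float) -> bool:
    """Return True when the amount exceeds the role's threshold.

    Unknown roles fall back to a threshold of 0, so any positive amount
    pauses for human review.
    """
    return amount_usd > APPROVAL_THRESHOLD_USD.get(claims.role, 0)
```

If the same filter is attached to every retrieval call, including the ones the research agent issues per sub-question, the agent has no code path that reaches documents outside the caller's role.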
Tech stack
- Backend — FastAPI · LangGraph · Qdrant · PostgreSQL · Redis · PyJWT
- Client — financebench CLI: typer · rich · prompt_toolkit · httpx-sse · token-streaming over SSE
- LLMs — Claude Sonnet 4.6 · gpt-4o-mini · Llama 3.3 (via Groq, optional)
- Retrieval — OpenAI text-embedding-3-small or voyage-finance-2 · BGE-reranker-v2-m3 cross-encoder
- Observability — self-hosted LiteLLM proxy + Langfuse v3 + Redis semantic cache (full stack only)
- Safety — Microsoft Presidio PII detection · LLM Guard · LLM classifier (3-layer cascade)
- Evaluation — RAGAS · DeepEval · custom LLM correctness judge
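The 3-layer safety cascade in the list above can be pictured as a chain of checks that stops at the first block, so cheaper layers shield the more expensive ones. This is a minimal sketch under that assumption: the `Layer` type, `regex_layer`, `run_cascade`, and the single regex pattern are illustrative stand-ins, not the project's actual guardrail code or rules.

```python
import re
from typing import Callable, Optional

# Each layer returns a reason string when it blocks, or None to pass the query on.
Layer = Callable[[str], Optional[str]]

# Illustrative pattern only; the project's real rules are not shown in this README.
_INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def regex_layer(query: str) -> Optional[str]:
    """Cheapest layer: block on a known prompt-injection phrase."""
    return "regex: prompt-injection pattern" if _INJECTION.search(query) else None


def run_cascade(query: str, layers: list[Layer]) -> Optional[str]:
    """Run layers in order and stop at the first one that blocks."""
    for layer in layers:
        reason = layer(query)
        if reason is not None:
            return reason
    return None
```

A later layer (for example an LLM-based classifier) would be passed as another callable in `layers`; it is only invoked when every earlier layer returns `None`.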
Evaluation results
Evaluated on the FinanceBench benchmark (150 questions across 32 companies):
| Metric | Value |
|---|---|
| Correctness pass rate | 72.7% (109/150) |
| Refusal rate | 6.7% (10/150) |
| RAGAS faithfulness | 0.747 |
| DeepEval faithfulness | 0.844 |
| DeepEval contextual recall | 0.768 |
Per-slice pass rate: lookup 68.6% (n=86), multi-hop 84.6% (n=13), calc 76.5% (n=51).
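As a quick consistency check, the per-slice figures can be recombined into the headline number. The passing counts per slice are not stated in this README; the sketch infers them by rounding rate × n, which is an assumption.

```python
# (pass rate, n) per slice, as reported above.
slices = {"lookup": (0.686, 86), "multi-hop": (0.846, 13), "calc": (0.765, 51)}

# Passing counts are inferred by rounding rate * n; they are not given in the README.
passed = {name: round(rate * n) for name, (rate, n) in slices.items()}
total_n = sum(n for _, n in slices.values())
total_passed = sum(passed.values())
overall = total_passed / total_n

print(passed, total_passed, total_n, f"{overall:.1%}")
# {'lookup': 59, 'multi-hop': 11, 'calc': 39} 109 150 72.7%
```

The slices sum to 150 questions and 109 passes, matching the 72.7% (109/150) in the table.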
The correctness judge is a Claude Sonnet 4.6 + structured-prompt setup calibrated to Cohen's κ = 0.932 against an 89-question hand-labeled set with an adversarial leniency guard. Full methodology, per-judge scores, and reproduction commands are in docs/evaluation.md.
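For readers unfamiliar with the agreement statistic, here is a minimal sketch of how Cohen's κ is computed between a judge's labels and human labels. The `cohens_kappa` helper and the ten example labels are hypothetical; they are not the project's calibration script or its 89-question set.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of each rater's label frequency.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 9 of 10 agree, chance agreement is 0.5.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] * 5
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))  # 0.8
```

κ corrects raw agreement for the agreement two raters would reach by chance, which is why it is a stricter calibration target than simple percent agreement.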
Comparison with published systems on FinanceBench
| System | Approach | Accuracy |
|---|---|---|
| Mafin 2.5 / PageIndex | Vectorless reasoning over hierarchical document tree | 98.7% |
| DANA | Domain-aware neurosymbolic agent with deterministic operators | 94.3% |
| GPT-4-Turbo · long context (128k) | Whole-document prompting | ~79% |
| Claude-2 · long context (100k) | Whole-document prompting | ~76% |
| This project | Multi-agent RAG with selective research-agent subgraph + RBAC + HITL | 72.7% |
| FinanceBench paper baselines | Vector retrieval + GPT-4 / Llama-2 | 38–43% |
| GPT-4-Turbo · top-k vector RAG | Standard retrieval, no agent | ~19% |
Long-context approaches score higher but are hard to deploy in an enterprise setting: 10-K filings frequently exceed 128k tokens, and whole-document prompting is slow and costly at scale. The 72.7% here is measured on a production-shaped pipeline (fixed institutional corpus, batched retrieval, RBAC at the storage layer, HITL on high-stakes outputs).
Known limitations
- Not deployed to production: runs locally via docker compose up -d. No public URL or live traffic.
- CLI is the canonical client today. A Next.js web frontend is in progress in web/ but not wired into the deployment story.
- Below the top published systems (Mafin 2.5 at 98.7%, DANA at 94.3%); see the comparison table above for context.
Running from source
git clone https://github.com/Rishabhmannu/financebench-rag-agent.git
cd financebench-rag-agent
pip install -e ".[backend,dev]" && cp .env.example .env # backend extras + dev tools
financebench setup # docker compose + seed corpus
For self-hosting the full 11-service stack (LiteLLM + Langfuse), upgrade flows, and production hardening, see docs/deploy.md and docs/upgrade.md.
Documentation
- docs/cli.md — CLI reference, slash commands, multi-party HITL workflow
- docs/deploy.md — Self-host: stack profiles, env vars, backup, hardening
- docs/upgrade.md — Upgrade cookbook by change type
- docs/evaluation.md — Methodology, results, reproduction
- docs/engineering-log.md — Engineering decisions and tradeoffs
- docs/setup.md — Test accounts, environment, dev commands
- docs/architecture.md · docs/api-reference.md · docs/rbac-matrix.md · web/README.md
License
MIT
Download files
File details
Details for the file financebench_rag_agent-0.1.3.tar.gz.
File metadata
- Download URL: financebench_rag_agent-0.1.3.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5191b020271c469cb423b36ec95cfd4ef2edddd055c366281108eb6499c0d5a |
| MD5 | da35ec7672d7d3515e03a64a2743158f |
| BLAKE2b-256 | 55a500d75609e7394f9082d57147eb6c8f0e4f0d4e9cf5a44c527db1400d7258 |
File details
Details for the file financebench_rag_agent-0.1.3-py3-none-any.whl.
File metadata
- Download URL: financebench_rag_agent-0.1.3-py3-none-any.whl
- Upload date:
- Size: 205.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83e7f8181522055778a5c414f41329bdea327917a6d19eb4831684b1baff415b |
| MD5 | 5dabd3f9f6e5cec4810b8b927de6e766 |
| BLAKE2b-256 | 2f7f8a28f91ef27b4dae7432d5e50c504b15ebfc788f64dbb0d28578c6166dd1 |