GPU-first RAG engine with cross-encoder reranking and quality validation. One engine, any domain.
What is this?
Most RAG engines are built to impress in demos. This one is built to run in production.
Henko is the core retrieval engine powering every Suyven product. It's GPU-first, domain-agnostic, and designed so that anyone — a startup, an enterprise, a solo dev — can plug in their knowledge and get a system that actually works. One engine. Unlimited domains. Zero compromise.
It's fast because it has to be. It's precise because anything less is useless. And it's modular because the future will require things we haven't thought of yet.
"The base layer of every intelligent system we'll ever build." — Suyven
Why it exists
Knowledge is scattered. Teams drown in documents, wikis, PDFs, codebases. LLMs hallucinate. Semantic search alone isn't enough.
Henko exists because retrieval is the hardest part, and most solutions skip the hard work. We didn't.
- GPU-accelerated embeddings at 38x CPU speed
- Hybrid search combining semantic understanding with keyword precision
- Multi-domain isolation so your knowledge stays clean
- Auto-evaluation that flags failures before your users do
Benchmarks
Measured on NVIDIA RTX 5070 · CUDA 12.8 · PyTorch 2.12
| Metric | Result |
|---|---|
| NDCG@10 | 0.909 — 209-query ground-truth suite |
| Embedding throughput | 2,960 chunks/s — 38x CPU baseline |
| FP16 VRAM savings | −48% vs FP32 (671 MB vs 1,290 MB) |
| FP16 retrieval fidelity | 99.3% Recall@10 — near-zero quality loss |
| Reranker latency | 8.8 ms/query — FP16 on GPU |
| Test suite | 209/209 tests passing |
Tech Stack
| Layer | Technology |
|---|---|
| API | FastAPI 0.115 + SSE streaming + Uvicorn |
| Embeddings | BAAI/bge-m3 · 568M params · 1024-dim · multilingual · FP16 |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 · FP16 on GPU |
| Vector Store | ChromaDB 1.5+ · cosine HNSW · multi-collection |
| Hybrid Search | BM25Okapi + Reciprocal Rank Fusion (RRF) |
| LLM | Any OpenAI-compatible endpoint · default: Groq llama-3.3-70b |
| GPU Monitoring | pynvml 13.0+ |
| Infrastructure | Docker · CPU + GPU variants |
| Runtime | Python 3.12 · PyTorch 2.12+ · CUDA 12.8 |
Architecture
```
henko/
├── api.py # FastAPI + SSE streaming
├── app.py # Streamlit UI + GPU dashboard
├── ingest.py # CLI ingestion (3-phase pipeline)
├── query.py # CLI query tool
│
├── rag/ # Core pipeline — 21 modules
│ ├── config.py # Centralized config (env vars + Docker secrets)
│ ├── llm.py # LLM abstraction (Ollama + OpenAI-compatible)
│ ├── agents.py # 4-agent pipeline (Router → Retriever → Generator → Evaluator)
│ ├── orchestrator.py # RoutePlan + hybrid search (BM25 + dense) + RRF fusion
│ ├── model_registry.py # Embed/reranker singleton registry
│ ├── index_registry.py # ChromaDB collection registry (static + dynamic)
│ ├── domain_registry.py # Domain CRUD + isolation
│ ├── store.py # Embedding + ChromaDB storage (FP16 GPU)
│ ├── pipeline.py # Shared read+chunk pipeline
│ ├── chunker.py # Character + paragraph + sentence chunking
│ ├── loader.py # Multi-format reader (MD, TXT, PDF, PY, JSONL)
│ ├── eval.py # Auto-evaluation + quality flagging + query log
│ ├── gap_tracker.py # Knowledge gap analysis from query logs
│ ├── monitoring.py # GPU telemetry via pynvml
│ ├── observability.py # Structured logging + metrics + request tracing
│ ├── security.py # Auth, CORS, rate limiting, input validation
│ ├── self_improve.py # Auto-improvement from eval data
│ └── vector_store.py # Vector store abstraction
│
├── finetune/ # Embedding fine-tuning (LoRA, A/B testing)
├── tests/ # 209-test pytest suite
├── benchmarks/ # Performance & quality benchmarks
├── docs/ # Architecture, roadmap, benchmark reports
├── frontend/ # React + Vite frontend
├── loadtest/ # Locust load testing
├── scripts/ # Deployment scripts
│
└── data/
├── chroma/ # ChromaDB persistent storage (HNSW)
├── domains/ # Domain configs + isolated indexes
├── eval/ # Query evaluation logs (JSONL)
    └── knowledge/       # Source documents for ingestion
```
How it works
Ingestion — 3 phases
Documents → Parallel chunk (8 workers) → GPU embed (FP16) → ChromaDB index
Files are discovered automatically. Chunks are deduplicated by MD5. The embedding model loads once and stays warm.
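A minimal sketch of those three phases, assuming sentence-transformers and the chromadb client; names like `chunk_file` and the file paths are illustrative, not Henko's actual internals:

```python
# Sketch of the 3-phase ingestion flow (illustrative, not Henko's internals).
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

def chunk_file(path: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Phase 1: fixed-size character chunks with overlap (cf. CHUNK_SIZE/CHUNK_OVERLAP)."""
    text = Path(path).read_text(encoding="utf-8")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Parallel chunking; threads stand in for the 8-worker phase above
with ThreadPoolExecutor(max_workers=8) as pool:
    per_file = pool.map(chunk_file, ["data/knowledge/a.md", "data/knowledge/b.md"])
chunks = [c for file_chunks in per_file for c in file_chunks]

# Deduplicate by MD5 of the chunk text
unique = {hashlib.md5(c.encode()).hexdigest(): c for c in chunks}

# Phase 2: embed on GPU in FP16; the model loads once and stays warm
model = SentenceTransformer("BAAI/bge-m3", device="cuda")
model.half()
vectors = model.encode(list(unique.values()), batch_size=256,
                       normalize_embeddings=True)

# Phase 3: index into a persistent ChromaDB collection (cosine HNSW)
client = chromadb.PersistentClient(path="data/chroma")
col = client.get_or_create_collection("my-domain",
                                      metadata={"hnsw:space": "cosine"})
col.add(ids=list(unique.keys()), documents=list(unique.values()),
        embeddings=vectors.tolist())
```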
Query — 5 stages
Query → Route → Dense retrieval → BM25 → RRF fusion → Rerank → Answer
- Planning — deterministic routing, no LLM calls, zero latency overhead
- Dense retrieval — bge-m3 semantic search, 6x overfetch
- BM25 — keyword fallback, catches names, acronyms, exact matches
- Hybrid merge — RRF fusion + source diversity cap (max 3 chunks/source)
- Cross-encoder reranking — ms-marco-MiniLM, +7.0% Precision@5
Optional: LLM query expansion with late RRF fusion (~200ms on Groq).
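RRF makes the hybrid merge simple because it uses only ranks, never raw scores. A self-contained sketch of the standard formula (k=60 is the conventional constant; Henko's exact parameters aren't specified here):

```python
# Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per doc,
# so differing score scales across retrievers never need calibrating.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["c3", "c1", "c7", "c2"]  # semantic (bge-m3) ranking
bm25_hits  = ["c1", "c9", "c3", "c4"]  # keyword (BM25Okapi) ranking
print(rrf_fuse([dense_hits, bm25_hits]))  # c1 and c3 rise to the top
```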
The 4-agent pipeline
| Agent | Role |
|---|---|
| Router | Classifies complexity, picks strategy (dense / hybrid / category-filtered) |
| Retriever | Multi-tool reasoning — semantic search, entity search, sub-query decomposition, adjacent chunk expansion |
| Generator | Streams tokens, adapts prompt to retrieval quality in real time |
| Evaluator | Flags issues, escalates strategy on retry (dense → hybrid → no-category dense) |
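The Evaluator's escalation ladder can be pictured as a loop over strategies; a hypothetical sketch, where `generate()` is a stand-in and the other calls mirror the API reference below:

```python
# Hypothetical sketch of the Evaluator's retry escalation, reusing the
# orchestrator/eval calls from the API reference; generate() is a stand-in.
STRATEGY_LADDER = ["dense", "hybrid", "dense_no_category"]

def answer_with_escalation(query: str, category: str | None) -> str:
    answer = ""
    for strategy in STRATEGY_LADDER:
        # The final rung drops the category filter entirely
        cat = None if strategy == "dense_no_category" else category
        results = execute_search(query, route=strategy, category=cat,
                                 use_expansion=False)
        answer = generate(query, format_context(results))
        flags = compute_flags({"query": query, "answer": answer,
                               "results": results})
        if not flags:
            return answer  # Evaluator raised no issues
    return answer  # best effort after the last escalation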
Key design decisions
| Decision | Why |
|---|---|
| FP16 embeddings | +109% throughput, −48% VRAM, 99.3% recall parity — no downside |
| Overfetch ×6 | Tested 4, 6, 8, 10 — 6 is optimal. 8+ hurts NDCG |
| Max 3 chunks/source | Prevents one document from dominating context. 2 was too aggressive |
| ms-marco reranker | Outperforms bge-reranker-v2 on this corpus |
| RRF over score averaging | Rank-based, robust against score scale differences across retrievers |
| Groq for LLM | 70B quality at speed. GPU VRAM stays reserved for embed + reranker |
| Domain isolation | Specialized indexes prevent cross-domain contamination |
| ReAct retriever | Heuristic reasoning — catches entity queries without LLM cost |
| No fine-tuning yet | Evidence-gated. Only when NDCG plateaus on real production queries |
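Two of these decisions (FP16 everywhere and the ms-marco reranker) meet in one place: loading the cross-encoder at half precision. A sketch using sentence-transformers; the query and candidate passages are illustrative:

```python
from sentence_transformers import CrossEncoder

# Load the reranker once and cast it to FP16 on GPU
# (8.8 ms/query in the benchmarks above).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
reranker.model.half()

query = "What does RRF fusion do?"
candidates = ["RRF merges ranked lists using ranks only.",
              "BM25 scores documents by term frequency."]
scores = reranker.predict([(query, passage) for passage in candidates],
                          batch_size=32)
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
```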
Quick start
```bash
# Clone
git clone https://github.com/suyven-core/rag-engine
cd rag-engine
# Configure
cp .env.example .env
# Edit .env with your LLM API key
# Run (GPU)
docker compose -f docker-compose.gpu.yml up
# Ingest your knowledge
python ingest.py --domain my-domain --path ./data/knowledge/
# Query
python query.py --domain my-domain "What is...?"
```
Configuration
```bash
# Embeddings
EMBED_MODEL=BAAI/bge-m3
EMBED_BATCH=256
# Chunking
CHUNK_SIZE=600
CHUNK_OVERLAP=80
# Retrieval
TOP_K=5
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
OVERFETCH_FACTOR=6
RERANKER_BATCH_SIZE=32
# LLM
LLM_PROVIDER=openai
LLM_MODEL=llama-3.3-70b-versatile
LLM_API_URL=https://api.groq.com/openai/v1
LLM_API_KEY=<your-key>
# Fallback LLM
FALLBACK_PROVIDER=openai
FALLBACK_MODEL=gemini-2.5-flash
FALLBACK_API_URL=https://generativelanguage.googleapis.com/v1beta/openai
FALLBACK_API_KEY=<your-key>
# Security
AUTH_ENABLED=false
API_KEYS=key1,key2
RATE_LIMIT_RPM=60
RATE_LIMIT_BURST=10
MAX_QUERY_LENGTH=2000
MAX_TOP_K=20
CORS_ORIGINS=http://localhost:5173
# Observability
LOG_FORMAT=text # use "json" in production
WORKERS=8
```
Docker secrets at /run/secrets/<NAME> override all env vars.
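That precedence is a small pattern; a sketch of the idea, assuming the standard Docker secrets mount (not Henko's actual config.py):

```python
import os
from pathlib import Path

def setting(name: str, default: str | None = None) -> str | None:
    """A Docker secret at /run/secrets/<NAME> wins over the env var."""
    secret = Path("/run/secrets") / name
    if secret.is_file():
        return secret.read_text().strip()
    return os.environ.get(name, default)

LLM_API_KEY = setting("LLM_API_KEY")
RATE_LIMIT_RPM = int(setting("RATE_LIMIT_RPM", "60"))
```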
API reference
```
# Orchestrator
plan(query, category, top_k) → RoutePlan
execute_search(query, route, category, use_expansion) → [results]
format_context(results) → str
# Embeddings & storage
get_embed_model() → SentenceTransformer
embed_batch(texts) → [[float], ...]
add_chunks(col, path, chunks, knowledge_dir) → (added, skipped)
# Registries
get_index(name) → chromadb.Collection
get_embed_model(name) → SentenceTransformer
get_reranker(name) → CrossEncoder
# Domains
create_domain(name, description, system_prompt, categories) → DomainConfig
get_domain(slug) → DomainConfig
list_domains() → [DomainConfig]
# Evaluation
compute_flags(record) → [flags]
log_eval(record) → None
analyze_gaps(entries, top_n) → GapReport
# LLM
quick_complete(prompt, ...) → str
stream_chat(query, context, system_prompt, ...) → Generator[str]
```
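A hypothetical end-to-end call sequence over those signatures (the question and prompt strings are illustrative):

```python
# Hypothetical usage of the orchestrator + LLM calls above.
question = "How does hybrid search combine BM25 with dense retrieval?"

route = plan(question, category=None, top_k=5)
results = execute_search(question, route=route, category=None,
                         use_expansion=False)
context = format_context(results)

for token in stream_chat(question, context=context,
                         system_prompt="Answer strictly from the context."):
    print(token, end="", flush=True)
```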
Roadmap
- Multi-modal ingestion (images, tables, charts)
- Embedding fine-tuning when production data justifies it
- GraphRAG layer for entity-relationship queries
- REST API authentication + multi-tenant key management
- Hosted version — plug in your docs, get an endpoint