GPU-first RAG engine with cross-encoder reranking and quality validation. One engine, any domain.
What is this?
Most RAG engines are built to impress in demos. This one is built to run in production.
Henko is the core retrieval engine powering every Suyven product. It's GPU-first, domain-agnostic, and designed so that anyone — a startup, an enterprise, a solo dev — can plug in their knowledge and get a system that actually works. One engine. Unlimited domains. Zero compromise.
It's fast because it has to be. It's precise because anything less is useless. And it's modular because the future will require things we haven't thought of yet.
"The base layer of every intelligent system we'll ever build." — Suyven
Why it exists
Knowledge is scattered. Teams drown in documents, wikis, PDFs, codebases. LLMs hallucinate. Semantic search alone isn't enough.
Henko exists because retrieval is the hardest part, and most solutions skip the hard work. We didn't.
- GPU-accelerated embeddings at 38x CPU speed
- Hybrid search combining semantic understanding with keyword precision
- Multi-domain isolation so your knowledge stays clean
- Auto-evaluation that flags failures before your users do
Benchmarks
Measured on NVIDIA RTX 5070 · CUDA 12.8 · PyTorch 2.12
| Metric | Result |
|---|---|
| NDCG@10 | 0.909 — 209-query ground-truth suite |
| Embedding throughput | 2,960 chunks/s — 38x CPU baseline |
| FP16 VRAM savings | −48% vs FP32 (671 MB vs 1,290 MB) |
| FP16 retrieval fidelity | 99.3% Recall@10 — near-zero quality loss |
| Reranker latency | 8.8 ms/query — FP16 on GPU |
| Test suite | 209/209 tests passing |
Tech Stack
| Layer | Technology |
|---|---|
| API | FastAPI 0.115 + SSE streaming + Uvicorn |
| Embeddings | BAAI/bge-m3 · 568M params · 1024-dim · multilingual · FP16 |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 · FP16 on GPU |
| Vector Store | ChromaDB 1.5+ · cosine HNSW · multi-collection |
| Hybrid Search | BM25Okapi + Reciprocal Rank Fusion (RRF) |
| LLM | Any OpenAI-compatible endpoint · default: Groq llama-3.3-70b |
| GPU Monitoring | pynvml 13.0+ |
| Infrastructure | Docker · CPU + GPU variants |
| Runtime | Python 3.12 · PyTorch 2.12+ · CUDA 12.8 |
Architecture
```
henko/
├── api.py # FastAPI + SSE streaming
├── app.py # Streamlit UI + GPU dashboard
├── ingest.py # CLI ingestion (3-phase pipeline)
├── query.py # CLI query tool
│
├── rag/ # Core pipeline — 21 modules
│ ├── config.py # Centralized config (env vars + Docker secrets)
│ ├── llm.py # LLM abstraction (Ollama + OpenAI-compatible)
│ ├── agents.py # 4-agent pipeline (Router → Retriever → Generator → Evaluator)
│ ├── orchestrator.py # RoutePlan + hybrid search (BM25 + dense) + RRF fusion
│ ├── model_registry.py # Embed/reranker singleton registry
│ ├── index_registry.py # ChromaDB collection registry (static + dynamic)
│ ├── domain_registry.py # Domain CRUD + isolation
│ ├── store.py # Embedding + ChromaDB storage (FP16 GPU)
│ ├── pipeline.py # Shared read+chunk pipeline
│ ├── chunker.py # Character + paragraph + sentence chunking
│ ├── loader.py # Multi-format reader (MD, TXT, PDF, PY, JSONL)
│ ├── eval.py # Auto-evaluation + quality flagging + query log
│ ├── gap_tracker.py # Knowledge gap analysis from query logs
│ ├── monitoring.py # GPU telemetry via pynvml
│ ├── observability.py # Structured logging + metrics + request tracing
│ ├── security.py # Auth, CORS, rate limiting, input validation
│ ├── self_improve.py # Auto-improvement from eval data
│ └── vector_store.py # Vector store abstraction
│
├── finetune/ # Embedding fine-tuning (LoRA, A/B testing)
├── tests/ # 209-test pytest suite
├── benchmarks/ # Performance & quality benchmarks
├── docs/ # Architecture, roadmap, benchmark reports
├── frontend/ # React + Vite frontend
├── loadtest/ # Locust load testing
├── scripts/ # Deployment scripts
│
└── data/
├── chroma/ # ChromaDB persistent storage (HNSW)
├── domains/ # Domain configs + isolated indexes
├── eval/ # Query evaluation logs (JSONL)
    └── knowledge/       # Source documents for ingestion
```
How it works
Ingestion — 3 phases
Documents → Parallel chunk (8 workers) → GPU embed (FP16) → ChromaDB index
Files are discovered automatically. Chunks are deduplicated by MD5. The embedding model loads once and stays warm.
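A minimal sketch of those three phases, assuming sentence-transformers and the chromadb client; names like `chunk_file` and the file paths are illustrative, not Henko's actual internals:

```python
# Sketch of the 3-phase ingestion flow (illustrative, not Henko's internals).
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

def chunk_file(path: str, size: int = 600, overlap: int = 80) -> list[str]:
    """Phase 1: fixed-size character chunks with overlap (cf. CHUNK_SIZE/CHUNK_OVERLAP)."""
    text = Path(path).read_text(encoding="utf-8")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Parallel chunking; threads stand in for the 8-worker phase above
with ThreadPoolExecutor(max_workers=8) as pool:
    per_file = pool.map(chunk_file, ["data/knowledge/a.md", "data/knowledge/b.md"])
chunks = [c for file_chunks in per_file for c in file_chunks]

# Deduplicate by MD5 of the chunk text
unique = {hashlib.md5(c.encode()).hexdigest(): c for c in chunks}

# Phase 2: embed on GPU in FP16; the model loads once and stays warm
model = SentenceTransformer("BAAI/bge-m3", device="cuda")
model.half()
vectors = model.encode(list(unique.values()), batch_size=256,
                       normalize_embeddings=True)

# Phase 3: index into a persistent ChromaDB collection (cosine HNSW)
client = chromadb.PersistentClient(path="data/chroma")
col = client.get_or_create_collection("my-domain",
                                      metadata={"hnsw:space": "cosine"})
col.add(ids=list(unique.keys()), documents=list(unique.values()),
        embeddings=vectors.tolist())
```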
Query — 5 stages
Query → Route → Dense retrieval → BM25 → RRF fusion → Rerank → Answer
- Planning — deterministic routing, no LLM calls, zero latency overhead
- Dense retrieval — bge-m3 semantic search, 6x overfetch
- BM25 — keyword fallback, catches names, acronyms, exact matches
- Hybrid merge — RRF fusion + source diversity cap (max 3 chunks/source)
- Cross-encoder reranking — ms-marco-MiniLM, +7.0% Precision@5
Optional: LLM query expansion with late RRF fusion (~200ms on Groq).
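RRF makes the hybrid merge simple because it uses only ranks, never raw scores. A self-contained sketch of the standard formula (k=60 is the conventional constant; Henko's exact parameters aren't specified here):

```python
# Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per doc,
# so differing score scales across retrievers never need calibrating.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["c3", "c1", "c7", "c2"]  # semantic (bge-m3) ranking
bm25_hits  = ["c1", "c9", "c3", "c4"]  # keyword (BM25Okapi) ranking
print(rrf_fuse([dense_hits, bm25_hits]))  # c1 and c3 rise to the top
```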
The 4-agent pipeline
| Agent | Role |
|---|---|
| Router | Classifies complexity, picks strategy (dense / hybrid / category-filtered) |
| Retriever | Multi-tool reasoning — semantic search, entity search, sub-query decomposition, adjacent chunk expansion |
| Generator | Streams tokens, adapts prompt to retrieval quality in real time |
| Evaluator | Flags issues, escalates strategy on retry (dense → hybrid → no-category dense) |
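The Evaluator's escalation ladder can be pictured as a loop over strategies; a hypothetical sketch, where `generate()` is a stand-in and the other calls mirror the API reference below:

```python
# Hypothetical sketch of the Evaluator's retry escalation, reusing the
# orchestrator/eval calls from the API reference; generate() is a stand-in.
STRATEGY_LADDER = ["dense", "hybrid", "dense_no_category"]

def answer_with_escalation(query: str, category: str | None) -> str:
    answer = ""
    for strategy in STRATEGY_LADDER:
        # The final rung drops the category filter entirely
        cat = None if strategy == "dense_no_category" else category
        results = execute_search(query, route=strategy, category=cat,
                                 use_expansion=False)
        answer = generate(query, format_context(results))
        flags = compute_flags({"query": query, "answer": answer,
                               "results": results})
        if not flags:
            return answer  # Evaluator raised no issues
    return answer  # best effort after the last escalation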
Key design decisions
| Decision | Why |
|---|---|
| FP16 embeddings | +109% throughput, −48% VRAM, 99.3% recall parity — no downside |
| Overfetch ×6 | Tested 4, 6, 8, 10 — 6 is optimal. 8+ hurts NDCG |
| Max 3 chunks/source | Prevents one document from dominating context. 2 was too aggressive |
| ms-marco reranker | Outperforms bge-reranker-v2 on this corpus |
| RRF over score averaging | Rank-based, robust against score scale differences across retrievers |
| Groq for LLM | 70B quality at speed. GPU VRAM stays reserved for embed + reranker |
| Domain isolation | Specialized indexes prevent cross-domain contamination |
| ReAct retriever | Heuristic reasoning — catches entity queries without LLM cost |
| No fine-tuning yet | Evidence-gated. Only when NDCG plateaus on real production queries |
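Two of these decisions (FP16 everywhere and the ms-marco reranker) meet in one place: loading the cross-encoder at half precision. A sketch using sentence-transformers; the query and candidate passages are illustrative:

```python
from sentence_transformers import CrossEncoder

# Load the reranker once and cast it to FP16 on GPU
# (8.8 ms/query in the benchmarks above).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cuda")
reranker.model.half()

query = "What does RRF fusion do?"
candidates = ["RRF merges ranked lists using ranks only.",
              "BM25 scores documents by term frequency."]
scores = reranker.predict([(query, passage) for passage in candidates],
                          batch_size=32)
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
```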
Quick start
```bash
# Clone
git clone https://github.com/suyven-core/rag-engine
cd rag-engine
# Configure
cp .env.example .env
# Edit .env with your LLM API key
# Run (GPU)
docker compose -f docker-compose.gpu.yml up
# Ingest your knowledge
python ingest.py --domain my-domain --path ./data/knowledge/
# Query
python query.py --domain my-domain "What is...?"
```
Configuration
```bash
# Embeddings
EMBED_MODEL=BAAI/bge-m3
EMBED_BATCH=256
# Chunking
CHUNK_SIZE=600
CHUNK_OVERLAP=80
# Retrieval
TOP_K=5
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
OVERFETCH_FACTOR=6
RERANKER_BATCH_SIZE=32
# LLM
LLM_PROVIDER=openai
LLM_MODEL=llama-3.3-70b-versatile
LLM_API_URL=https://api.groq.com/openai/v1
LLM_API_KEY=<your-key>
# Fallback LLM
FALLBACK_PROVIDER=openai
FALLBACK_MODEL=gemini-2.5-flash
FALLBACK_API_URL=https://generativelanguage.googleapis.com/v1beta/openai
FALLBACK_API_KEY=<your-key>
# Security
AUTH_ENABLED=false
API_KEYS=key1,key2
RATE_LIMIT_RPM=60
RATE_LIMIT_BURST=10
MAX_QUERY_LENGTH=2000
MAX_TOP_K=20
CORS_ORIGINS=http://localhost:5173
# Observability
LOG_FORMAT=text # use "json" in production
WORKERS=8
```
Docker secrets at /run/secrets/<NAME> override all env vars.
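That precedence is a small pattern; a sketch of the idea, assuming the standard Docker secrets mount (not Henko's actual config.py):

```python
import os
from pathlib import Path

def setting(name: str, default: str | None = None) -> str | None:
    """A Docker secret at /run/secrets/<NAME> wins over the env var."""
    secret = Path("/run/secrets") / name
    if secret.is_file():
        return secret.read_text().strip()
    return os.environ.get(name, default)

LLM_API_KEY = setting("LLM_API_KEY")
RATE_LIMIT_RPM = int(setting("RATE_LIMIT_RPM", "60"))
```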
API reference
```
# Orchestrator
plan(query, category, top_k) → RoutePlan
execute_search(query, route, category, use_expansion) → [results]
format_context(results) → str
# Embeddings & storage
get_embed_model() → SentenceTransformer
embed_batch(texts) → [[float], ...]
add_chunks(col, path, chunks, knowledge_dir) → (added, skipped)
# Registries
get_index(name) → chromadb.Collection
get_embed_model(name) → SentenceTransformer
get_reranker(name) → CrossEncoder
# Domains
create_domain(name, description, system_prompt, categories) → DomainConfig
get_domain(slug) → DomainConfig
list_domains() → [DomainConfig]
# Evaluation
compute_flags(record) → [flags]
log_eval(record) → None
analyze_gaps(entries, top_n) → GapReport
# LLM
quick_complete(prompt, ...) → str
stream_chat(query, context, system_prompt, ...) → Generator[str]
```
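A hypothetical end-to-end call sequence over those signatures (the question and prompt strings are illustrative):

```python
# Hypothetical usage of the orchestrator + LLM calls above.
question = "How does hybrid search combine BM25 with dense retrieval?"

route = plan(question, category=None, top_k=5)
results = execute_search(question, route=route, category=None,
                         use_expansion=False)
context = format_context(results)

for token in stream_chat(question, context=context,
                         system_prompt="Answer strictly from the context."):
    print(token, end="", flush=True)
```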
Roadmap
- Multi-modal ingestion (images, tables, charts)
- Embedding fine-tuning when production data justifies it
- GraphRAG layer for entity-relationship queries
- REST API authentication + multi-tenant key management
- Hosted version — plug in your docs, get an endpoint