Production RAG pipelines without the abstraction tax
Project description
rag-kit
Production RAG pipelines without the abstraction tax.
from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
rag = RAGPipeline(
embedder=OpenAIEmbedder(api_key="sk-..."),
generator=GroqGenerator(api_key="gsk_..."),
)
rag.ingest("handbook.pdf")
result = rag.query("What is the refund policy?")
print(result.answer)
# → "Refunds are available within 7 days of purchase. Contact support@..."
print(result.sources)
# → [Chunk(text="Refunds are available...", metadata={"source": "handbook.pdf"})]
No LangChain. No magic. Every line of the pipeline is readable Python you can modify.
Before You Start
What you need
| Requirement | Minimum | How to check |
|---|---|---|
| Python | 3.10+ | python --version |
| pip | any | pip --version |
| An LLM API key | Groq (free) | console.groq.com |
| An embedding API key | OpenAI or free local model | platform.openai.com |
Don't have Python? Download it from python.org. Pick any version ≥ 3.10.
What is an API key?
An API key is a password that lets your code talk to an AI service (like Groq or OpenAI). You get one by creating a free account on their website.
- Groq — free, no credit card needed. Go to console.groq.com → API Keys → Create. Looks like
gsk_abc123... - OpenAI — needs a paid account for embeddings. Go to platform.openai.com → Create new secret key. Looks like
sk-abc123... - No money? Use
LocalEmbedderinstead ofOpenAIEmbedder— runs on your own computer, completely free. See Quick Start #3.
Set up your environment (recommended)
# Create a virtual environment so rag-kit doesn't conflict with other packages
python -m venv venv
# Activate it
source venv/bin/activate # Mac / Linux
venv\Scripts\activate # Windows
# Install rag-kit with the providers you want
pip install rag-kit[openai,groq]
# Put your API keys in a .env file (never commit this to git)
cp .env.example .env
# Open .env and fill in your keys
What is RAG?
LLMs like GPT-4 and Llama3 are trained on public internet data up to a cutoff date. They know nothing about:
- Your company's internal documents
- Data created after their training cutoff
- Private knowledge bases
RAG (Retrieval-Augmented Generation) solves this by giving the LLM the right context at query time, instead of baking knowledge into model weights.
User asks: "What's our refund policy?"
Without RAG:
LLM: "I don't have information about your specific refund policy."
(or worse — it hallucinates a plausible-sounding policy)
With RAG:
1. Retrieve: find the paragraph about refunds from your policy PDF
2. Generate: "Here is the relevant section: [paragraph]. Based on this, ..."
LLM: "Refunds are available within 7 days. Contact support@yourco.com."
This is what every AI assistant with "chat with your docs" capability uses under the hood — Notion AI, GitHub Copilot's context, Cursor, Claude Projects, all of them.
How RAG Works — The Full Pipeline
Understanding this pipeline is more valuable than any certification.
INGESTION (run once per document)
──────────────────────────────────────────────────────────────────
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ LOADER │──▶│ CHUNKER │──▶│ EMBEDDER │──▶│ STORE │
│ │ │ │ │ │ │ │
│ PDF/TXT/URL │ │ Split text │ │ text → vec │ │ Save vectors │
│ → Document │ │ into Chunks │ │ (numbers) │ │ to DB/memory │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
QUERYING (run for every user question)
──────────────────────────────────────────────────────────────────
User Question
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ EMBEDDER │──▶│ RETRIEVER │──▶│ (RERANKER) │──▶│ GENERATOR │
│ │ │ │ │ optional │ │ │
│ query → vec │ │ find similar │ │ LLM re-scores│ │ LLM answers │
│ │ │ chunks by │ │ for precision│ │ using chunks │
│ │ │ vector dist │ │ │ │ as context │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│
▼
Answer + Sources
Let's go through each step.
Step 1: Loading
The loader converts any input (PDF, URL, plain text) into a uniform Document object. This abstraction means the rest of the pipeline doesn't care whether your document was a PDF or a webpage.
from ragkit.loaders import load_pdf, load_url, load_text
doc = load_pdf("report.pdf")
# doc.text = "Q1 revenue was $4.2M..."
# doc.source = "report.pdf"
doc = load_url("https://docs.yourapp.com/api")
# doc.text = "API Reference\n\nEndpoints:\n..."
# doc.source = "https://docs.yourapp.com/api"
Step 2: Chunking
Every LLM has a context window — the maximum amount of text it can read at one time. Think of it like working memory: a human can hold a few paragraphs in their head while answering a question, but not an entire book.
Most models allow 8,000–128,000 tokens (roughly 6,000–96,000 words). A 200-page PDF is ~100,000 words. Even if it fit, sending the entire document every time a user asks a question would be very slow and very expensive.
Chunking solves this: instead of sending everything, you only send the 3–5 most relevant pieces.
Chunking splits the document into small, overlapping pieces that can be retrieved and fed to the LLM individually.
Why overlap?
Original text: "...the payment is processed. Refunds take 3-5 business days to appear..."
↑ chunk boundary without overlap
Chunk 1: "...the payment is processed."
Chunk 2: "Refunds take 3-5 business days to appear..."
Without overlap, "Refunds take 3-5..." has no context for what payment method, what product, what country. With overlap:
Chunk 1: "...the payment is processed."
Chunk 2: "processed. Refunds take 3-5 business days to appear..."
↑ tail of chunk 1 gives context
Three chunking strategies:
| Strategy | How it splits | Best for |
|---|---|---|
fixed |
Every N characters, always | Uniform text, simple baseline |
recursive |
Paragraphs → sentences → words → chars | Most documents (default) |
semantic |
Where meaning shifts (needs embedder) | High-precision knowledge bases |
from ragkit.chunkers import recursive_chunker, fixed_chunker, semantic_chunker
chunks = recursive_chunker(doc, chunk_size=500, overlap=50)
# chunks[0].text = "First paragraph..."
# chunks[0].metadata = {"source": "report.pdf", "chunk_index": 0, "total_chunks": 42}
How to pick chunk_size:
- Too small (< 100 chars): chunks lose context, embeddings become noisy
- Too large (> 1000 chars): chunks cover multiple topics, retrieval is imprecise
- Sweet spot: 300–700 characters for most documents
Step 3: Embedding
An embedding converts text into a list of numbers (a vector) that represents its meaning. Similar texts produce similar vectors.
"refund policy" → [0.21, -0.54, 0.88, 0.12, ...] (1536 numbers)
"money back" → [0.19, -0.51, 0.85, 0.14, ...] (similar direction)
"pizza recipes" → [-0.72, 0.33, -0.41, 0.65, ...] (different direction)
This is what makes semantic search work. "Refund" and "money back" don't share any words, but their embeddings are close — so a search for "refund policy" will find a chunk that says "money back guarantee."
Keyword search (like SQL LIKE '%refund%') would miss it.
from ragkit.embedders import OpenAIEmbedder, LocalEmbedder
# Cloud: best quality, small cost
embedder = OpenAIEmbedder(api_key="sk-...")
vector = embedder.embed("refund policy")
# → list of 1536 floats
# Local: free, private, slightly lower quality
embedder = LocalEmbedder(model_name="all-MiniLM-L6-v2")
vector = embedder.embed("refund policy")
# → list of 384 floats, runs on your CPU
Choosing an embedding model:
| Model | Dims | Cost | Quality | Use when |
|---|---|---|---|---|
text-embedding-3-small |
1536 | ~$0.00002/1K tokens | Excellent | Default choice |
text-embedding-3-large |
3072 | ~$0.00013/1K tokens | Best | Legal/medical precision |
all-MiniLM-L6-v2 (local) |
384 | Free | Good | Privacy, no API key |
BAAI/bge-small-en (local) |
384 | Free | Very good | Best free model |
Step 4: Vector Store
Once you have vectors, you need to store them so you can search them later.
from ragkit.stores import MemoryStore, SupabaseStore
# Development: fast, in-process, no setup
store = MemoryStore()
# Production: persistent, searchable across restarts, scales to millions
store = SupabaseStore(url="https://xxxx.supabase.co", key="eyJ...")
How vector search works:
The store computes cosine similarity between your query vector and every stored chunk vector. Chunks most similar in direction to the query are returned.
query vector: [0.21, -0.54, 0.88, ...]
chunk A: [0.19, -0.51, 0.85, ...] → similarity: 0.98 ✓ very similar
chunk B: [0.20, -0.52, 0.87, ...] → similarity: 0.97 ✓ similar
chunk C: [-0.72, 0.33, -0.41, ...] → similarity: 0.12 ✗ unrelated
MemoryStore does a linear scan (O(n)) — fine for < 10K chunks.
SupabaseStore uses an HNSW index — sub-10ms at millions of vectors.
What is HNSW?
Hierarchical Navigable Small World. A graph-based index where each node connects to its nearest neighbors. Search navigates the graph instead of scanning every vector. O(log n) instead of O(n). You never need to build or maintain it — pgvector handles it automatically.
Step 5: Retrieval
Given a query vector, return the most relevant chunks.
from ragkit.retrievers import topk_retriever, mmr_retriever
# Simple: return the 5 most similar chunks
chunks = topk_retriever(store, query_vector, top_k=5)
# Advanced: return 5 diverse chunks (avoids redundant results)
chunks = mmr_retriever(store, query_vector, top_k=5, lambda_mult=0.5)
Top-K vs MMR:
Top-K returns the 5 most similar chunks. If your document says "refund" 10 times across different sections, you'll get 5 near-duplicate chunks. The LLM gets confused by repetition and wastes context.
MMR (Maximal Marginal Relevance) picks chunks one at a time, penalizing choices that are too similar to what was already picked. Each selected chunk must contribute new information.
Query: "refund policy"
Top-K results: MMR results:
1. "Refunds in 7 days" 1. "Refunds in 7 days" ← most relevant
2. "Refunds in 7 days" 2. "Contact support for..." ← new info
3. "7 day refund limit" 3. "Cancellations vs refunds" ← new info
4. "7 day refund limit" 4. "Razorpay processes..." ← new info
5. "Refunds available" 5. "Exceptions to refunds" ← new info
Use MMR when your documents have repetitive content. Use Top-K otherwise.
Step 6 (Optional): Reranking
Vector similarity measures "are these about the same topic?" — not "does this directly answer the question?"
The reranker reads each chunk and the query, then scores relevance directly. More expensive (1 LLM call per chunk), but significantly more precise.
from ragkit.rerankers import llm_reranker
# Initial retrieval: 8 candidates by vector similarity
candidates = topk_retriever(store, query_vector, top_k=8)
# Rerank: LLM scores each on 1-10 relevance, return top 3
final = llm_reranker(candidates, query="refund policy", llm=generator, top_k=3)
Use reranking when:
- Answer accuracy matters more than speed/cost
- You're seeing the LLM use slightly-wrong chunks
- Your documents have many similar-sounding sections
Skip reranking when:
- You're building a high-QPS API (latency will hurt)
- Your queries are simple and retrieval quality is already good
Step 7: Generation
Feed the retrieved chunks + the user's question to an LLM.
from ragkit.generators import GroqGenerator
generator = GroqGenerator(api_key="gsk_...")
answer = generator.generate(
query="What is the refund policy?",
chunks=retrieved_chunks,
)
The generator formats chunks into a numbered context block and passes it to the LLM with a system prompt that says: "Answer using ONLY the provided context. Never make up information."
This grounding instruction is critical. Without it, the LLM will blend retrieved facts with its training data and hallucinate confidently.
The answer will cite [1], [2], etc. corresponding to the numbered chunks. Show these citations to your users so they can verify the source.
Installation
# Minimal (choose your providers):
pip install rag-kit[groq] # + Groq LLM
pip install rag-kit[openai] # + OpenAI embeddings + GPT
pip install rag-kit[anthropic] # + Claude
# Add-ons:
pip install rag-kit[pdf] # PDF loading
pip install rag-kit[url] # URL/webpage loading
pip install rag-kit[supabase] # Persistent vector store
pip install rag-kit[local] # Local embeddings (no API key)
# Everything:
pip install rag-kit[all]
Quick Start
1. Basic (in-memory, no persistence)
import os
from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
rag = RAGPipeline(
embedder=OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"]),
generator=GroqGenerator(api_key=os.environ["GROQ_API_KEY"]),
)
rag.ingest("company_handbook.pdf") # PDF
rag.ingest("https://docs.myapp.com") # URL
rag.ingest_text("Prices: Pro = ₹399/mo") # Raw string
result = rag.query("What are the pricing tiers?")
print(result.answer)
for chunk in result.sources:
print(f" Source: {chunk.metadata['source']}")
2. Production (Supabase persistence)
from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
from ragkit.stores import SupabaseStore
rag = RAGPipeline(
embedder=OpenAIEmbedder(api_key="sk-..."),
generator=GroqGenerator(api_key="gsk_..."),
store=SupabaseStore(url="https://xxxx.supabase.co", key="eyJ..."),
)
First, run the setup SQL in your Supabase SQL editor:
from ragkit.stores.supabase import SETUP_SQL
print(SETUP_SQL) # copy and run this in Supabase
3. Free (local embeddings, no API key for embedding)
from ragkit import RAGPipeline
from ragkit.embedders import LocalEmbedder
from ragkit.generators import GroqGenerator # Groq free tier is generous
rag = RAGPipeline(
embedder=LocalEmbedder("all-MiniLM-L6-v2"), # runs on your CPU, free
generator=GroqGenerator(api_key="gsk_..."), # Groq free tier
)
4. Advanced (MMR + reranking for maximum quality)
from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
from ragkit.stores import SupabaseStore
rag = RAGPipeline(
embedder=OpenAIEmbedder(api_key="sk-..."),
generator=GroqGenerator(api_key="gsk_...", model="llama3-70b-8192"),
store=SupabaseStore(url="...", key="..."),
chunker="recursive",
chunk_size=600,
chunk_overlap=80,
retriever="mmr", # diverse retrieval
top_k=6,
reranker=True, # LLM re-scores for precision
)
API Reference
RAGPipeline
RAGPipeline(
embedder, # Required: OpenAIEmbedder | LocalEmbedder
generator, # Required: GroqGenerator | OpenAIGenerator | AnthropicGenerator
store=None, # MemoryStore() by default; pass SupabaseStore for persistence
chunker="recursive", # "fixed" | "recursive" | "semantic"
chunk_size=500, # characters per chunk
chunk_overlap=50, # overlap between consecutive chunks
retriever="topk", # "topk" | "mmr"
top_k=5, # number of chunks to retrieve
min_score=0.0, # discard chunks below this similarity (0.0–1.0)
reranker=None, # True to enable LLM reranking
)
| Method | Returns | Description |
|---|---|---|
.ingest(source) |
int |
Load, chunk, embed, store a file/URL. Returns chunk count. |
.ingest_text(text, source_label) |
int |
Same but from a raw string. |
.query(question) |
QueryResult |
Embed query, retrieve, generate, return answer + sources. |
QueryResult
result.answer # str — the LLM's answer, grounded in retrieved chunks
result.sources # list[Chunk] — the chunks that were used
Chunk
chunk.text # str — the text content
chunk.metadata["source"] # str — file path or URL
chunk.metadata["chunk_index"] # int — position in original document
chunk.metadata["total_chunks"] # int — total chunks in this document
chunk.metadata["strategy"] # str — "fixed" | "recursive" | "semantic"
Common Patterns
Show citations in a chat UI
result = rag.query(user_message)
response_parts = [result.answer, "\n\n**Sources:**"]
for i, chunk in enumerate(result.sources, 1):
source = chunk.metadata.get("source", "unknown")
preview = chunk.text[:120].strip().replace("\n", " ")
response_parts.append(f"[{i}] {source}: _{preview}..._")
final_response = "\n".join(response_parts)
Ingest only new documents (avoid duplicates)
already_ingested = {"report_q1.pdf", "handbook.pdf"}
for file in Path("docs").glob("*.pdf"):
if file.name not in already_ingested:
n = rag.ingest(str(file))
print(f"Ingested {file.name}: {n} chunks")
Filter by source document
With SupabaseStore, you can search only within a specific document:
results = store.search(
query_embedding,
top_k=5,
filter={"source": "hr-policy.pdf"},
)
Streaming responses
# GroqGenerator supports streaming
from groq import Groq
client = Groq(api_key="gsk_...")
stream = client.chat.completions.create(
model="llama3-8b-8192",
messages=[...],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Choosing Your Stack
| Need | Pick |
|---|---|
| Fastest setup | MemoryStore + OpenAIEmbedder + GroqGenerator |
| Zero API cost | MemoryStore + LocalEmbedder + GroqGenerator (free tier) |
| Production persistence | SupabaseStore + any embedder/generator |
| Maximum accuracy | SupabaseStore + OpenAIEmbedder + mmr retriever + reranker + GPT-4o |
| Private / on-premise | MemoryStore + LocalEmbedder + local Ollama generator |
What You've Learned
If you read this far and ran the examples, you understand:
- RAG architecture — loader → chunker → embedder → store → retriever → generator
- Chunking strategies — fixed, recursive, semantic; why overlap matters
- Embeddings — what they are, how cosine similarity works, how to pick a model
- Vector search — how HNSW indexing works at scale
- Retrieval strategies — Top-K vs MMR, when diversity matters
- Reranking — LLM-as-judge, when precision > speed
- Generation — how to write grounding prompts, how to show citations
This is the complete knowledge stack behind every "chat with your docs" product, every enterprise knowledge base, and every AI assistant with document understanding. No paid course required.
What's Next
Now that you understand RAG, the natural next steps:
-
Agents — instead of one retrieval+generation step, let the LLM decide when to retrieve and what to do with the result. See agent-loop.
-
Memory — give your RAG system episodic memory (remember past conversations) and semantic memory (retrieve relevant facts from prior sessions). See mem-store.
-
Evals — measure whether your RAG pipeline is actually answering correctly. See eval-bench.
Troubleshooting
These are the errors every beginner hits. Fixes are here so you don't lose an hour to them.
ModuleNotFoundError: No module named 'ragkit'
You haven't installed the library yet, or your virtual environment isn't activated.
# Make sure your venv is active first
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows
# Then install
pip install rag-kit[openai,groq]
ModuleNotFoundError: No module named 'openai' (or groq, supabase, etc.)
rag-kit has zero mandatory dependencies. You only get the extras you ask for.
pip install rag-kit[openai] # for OpenAIEmbedder / OpenAIGenerator
pip install rag-kit[groq] # for GroqGenerator
pip install rag-kit[supabase] # for SupabaseStore
pip install rag-kit[pdf] # for load_pdf()
pip install rag-kit[url] # for load_url()
pip install rag-kit[local] # for LocalEmbedder
pip install rag-kit[all] # everything at once
AuthenticationError / 401 Unauthorized
Your API key is wrong or not set.
# Bad — key hardcoded with a typo or expired key
embedder = OpenAIEmbedder(api_key="sk-abc123WRONG")
# Good — read from environment variable
import os
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])
Double-check:
- You copied the full key (they're long — don't cut it off)
- The key is for the right service (OpenAI key ≠ Groq key)
- Your
.envfile is loaded before your script runs:export OPENAI_API_KEY=sk-... # Mac/Linux — run this in your terminal first set OPENAI_API_KEY=sk-... # Windows CMD $env:OPENAI_API_KEY="sk-..." # Windows PowerShell
SyntaxError or TypeError on Python 3.9 or below
rag-kit uses list[str] and dict | None type hints, which require Python 3.10+.
python --version # must show 3.10, 3.11, 3.12, or 3.13
# If you're on 3.9 or below, upgrade Python at python.org
result.answer is "I couldn't find relevant information"
This means no chunks passed the similarity threshold. Three possible causes:
1. Your document wasn't ingested yet.
rag.ingest("your_file.pdf") # do this before rag.query()
result = rag.query("your question")
2. The question phrasing is too different from the document language.
Try rephrasing. "What is the money-back guarantee?" retrieves better than "refund" if the document uses the phrase "money-back guarantee."
3. min_score is set too high.
# Default min_score is 0.0 — everything passes through
# If you set it high (e.g. 0.8), lower it while debugging
rag = RAGPipeline(..., min_score=0.0)
SupabaseStore not finding results after ingestion
Make sure you ran the setup SQL first. Open your Supabase SQL editor and run:
from ragkit.stores.supabase import SETUP_SQL
print(SETUP_SQL)
Copy the output and paste it into Supabase → SQL Editor → Run. This creates the table, HNSW index, and match_chunks function that the store depends on.
Still stuck?
Open an issue at github.com/iamadhitya1/rag-kit/issues with:
- Your Python version (
python --version) - The full error message (copy-paste, don't screenshot)
- The 5 lines of code that triggered it
License
MIT © 2025 M Adhitya
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ragkit_adhitya-0.1.0.tar.gz.
File metadata
- Download URL: ragkit_adhitya-0.1.0.tar.gz
- Upload date:
- Size: 39.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4254343b8652bc92dbe2c59a99baacb0422a16d7088620c0cd968ffd61b50b5b
|
|
| MD5 |
43c06691cd01d9d97e9b850b261ca508
|
|
| BLAKE2b-256 |
54a16dae56abcfebc13a978cff7b25404259684d57e9267716607ba495d3ebf2
|
File details
Details for the file ragkit_adhitya-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ragkit_adhitya-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ed6d2506ef067078efc38eecc6213125ac36b9d2a1598d43e0af6714145dcc3f
|
|
| MD5 |
a02ce2901dbe021f1f8c2077df4cee02
|
|
| BLAKE2b-256 |
789c7473280fd86c1c4c596490845b072f1e10f532b17d4ae8ba8630723dd5b5
|