Skip to main content

Production RAG pipelines without the abstraction tax

Project description

rag-kit

Production RAG pipelines without the abstraction tax.

from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator

rag = RAGPipeline(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    generator=GroqGenerator(api_key="gsk_..."),
)
rag.ingest("handbook.pdf")
result = rag.query("What is the refund policy?")
print(result.answer)
# → "Refunds are available within 7 days of purchase. Contact support@..."
print(result.sources)
# → [Chunk(text="Refunds are available...", metadata={"source": "handbook.pdf"})]

No LangChain. No magic. Every line of the pipeline is readable Python you can modify.


Before You Start

What you need

Requirement Minimum How to check
Python 3.10+ python --version
pip any pip --version
An LLM API key Groq (free) console.groq.com
An embedding API key OpenAI or free local model platform.openai.com

Don't have Python? Download it from python.org. Pick any version ≥ 3.10.

What is an API key?

An API key is a password that lets your code talk to an AI service (like Groq or OpenAI). You get one by creating a free account on their website.

  • Groq — free, no credit card needed. Go to console.groq.com → API Keys → Create. Looks like gsk_abc123...
  • OpenAI — needs a paid account for embeddings. Go to platform.openai.com → Create new secret key. Looks like sk-abc123...
  • No money? Use LocalEmbedder instead of OpenAIEmbedder — runs on your own computer, completely free. See Quick Start #3.

Set up your environment (recommended)

# Create a virtual environment so rag-kit doesn't conflict with other packages
python -m venv venv

# Activate it
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

# Install rag-kit with the providers you want
pip install rag-kit[openai,groq]

# Put your API keys in a .env file (never commit this to git)
cp .env.example .env
# Open .env and fill in your keys

What is RAG?

LLMs like GPT-4 and Llama3 are trained on public internet data up to a cutoff date. They know nothing about:

  • Your company's internal documents
  • Data created after their training cutoff
  • Private knowledge bases

RAG (Retrieval-Augmented Generation) solves this by giving the LLM the right context at query time, instead of baking knowledge into model weights.

User asks: "What's our refund policy?"

Without RAG:
  LLM: "I don't have information about your specific refund policy."
  (or worse — it hallucinates a plausible-sounding policy)

With RAG:
  1. Retrieve: find the paragraph about refunds from your policy PDF
  2. Generate: "Here is the relevant section: [paragraph]. Based on this, ..."
  LLM: "Refunds are available within 7 days. Contact support@yourco.com."

This is what every AI assistant with "chat with your docs" capability uses under the hood — Notion AI, GitHub Copilot's context, Cursor, Claude Projects, all of them.


How RAG Works — The Full Pipeline

Understanding this pipeline is more valuable than any certification.

INGESTION (run once per document)
──────────────────────────────────────────────────────────────────
                                                                  
  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │    LOADER    │──▶│   CHUNKER    │──▶│   EMBEDDER   │──▶│    STORE     │
  │              │   │              │   │              │   │              │
  │ PDF/TXT/URL  │   │ Split text   │   │ text → vec   │   │ Save vectors │
  │ → Document   │   │ into Chunks  │   │ (numbers)    │   │ to DB/memory │
  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘

QUERYING (run for every user question)
──────────────────────────────────────────────────────────────────
                                                                  
  User Question                                                   
       │                                                          
       ▼                                                          
  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │   EMBEDDER   │──▶│  RETRIEVER   │──▶│  (RERANKER)  │──▶│  GENERATOR  │
  │              │   │              │   │   optional   │   │              │
  │ query → vec  │   │ find similar │   │ LLM re-scores│   │ LLM answers  │
  │              │   │ chunks by    │   │ for precision│   │ using chunks │
  │              │   │ vector dist  │   │              │   │ as context   │
  └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
                                                                  │
                                                                  ▼
                                                            Answer + Sources

Let's go through each step.


Step 1: Loading

The loader converts any input (PDF, URL, plain text) into a uniform Document object. This abstraction means the rest of the pipeline doesn't care whether your document was a PDF or a webpage.

from ragkit.loaders import load_pdf, load_url, load_text

doc = load_pdf("report.pdf")
# doc.text = "Q1 revenue was $4.2M..."
# doc.source = "report.pdf"

doc = load_url("https://docs.yourapp.com/api")
# doc.text = "API Reference\n\nEndpoints:\n..."
# doc.source = "https://docs.yourapp.com/api"

Step 2: Chunking

Every LLM has a context window — the maximum amount of text it can read at one time. Think of it like working memory: a human can hold a few paragraphs in their head while answering a question, but not an entire book.

Most models allow 8,000–128,000 tokens (roughly 6,000–96,000 words). A 200-page PDF is ~100,000 words. Even if it fit, sending the entire document every time a user asks a question would be very slow and very expensive.

Chunking solves this: instead of sending everything, you only send the 3–5 most relevant pieces.

Chunking splits the document into small, overlapping pieces that can be retrieved and fed to the LLM individually.

Why overlap?

Original text: "...the payment is processed. Refunds take 3-5 business days to appear..."
                                            ↑ chunk boundary without overlap

Chunk 1: "...the payment is processed."
Chunk 2: "Refunds take 3-5 business days to appear..."

Without overlap, "Refunds take 3-5..." has no context for what payment method, what product, what country. With overlap:

Chunk 1: "...the payment is processed."
Chunk 2: "processed. Refunds take 3-5 business days to appear..."
             ↑ tail of chunk 1 gives context

Three chunking strategies:

Strategy How it splits Best for
fixed Every N characters, always Uniform text, simple baseline
recursive Paragraphs → sentences → words → chars Most documents (default)
semantic Where meaning shifts (needs embedder) High-precision knowledge bases
from ragkit.chunkers import recursive_chunker, fixed_chunker, semantic_chunker

chunks = recursive_chunker(doc, chunk_size=500, overlap=50)
# chunks[0].text = "First paragraph..."
# chunks[0].metadata = {"source": "report.pdf", "chunk_index": 0, "total_chunks": 42}

How to pick chunk_size:

  • Too small (< 100 chars): chunks lose context, embeddings become noisy
  • Too large (> 1000 chars): chunks cover multiple topics, retrieval is imprecise
  • Sweet spot: 300–700 characters for most documents

Step 3: Embedding

An embedding converts text into a list of numbers (a vector) that represents its meaning. Similar texts produce similar vectors.

"refund policy"   → [0.21, -0.54, 0.88, 0.12, ...]   (1536 numbers)
"money back"      → [0.19, -0.51, 0.85, 0.14, ...]   (similar direction)
"pizza recipes"   → [-0.72, 0.33, -0.41, 0.65, ...]  (different direction)

This is what makes semantic search work. "Refund" and "money back" don't share any words, but their embeddings are close — so a search for "refund policy" will find a chunk that says "money back guarantee."

Keyword search (like SQL LIKE '%refund%') would miss it.

from ragkit.embedders import OpenAIEmbedder, LocalEmbedder

# Cloud: best quality, small cost
embedder = OpenAIEmbedder(api_key="sk-...")
vector = embedder.embed("refund policy")
# → list of 1536 floats

# Local: free, private, slightly lower quality
embedder = LocalEmbedder(model_name="all-MiniLM-L6-v2")
vector = embedder.embed("refund policy")
# → list of 384 floats, runs on your CPU

Choosing an embedding model:

Model Dims Cost Quality Use when
text-embedding-3-small 1536 ~$0.00002/1K tokens Excellent Default choice
text-embedding-3-large 3072 ~$0.00013/1K tokens Best Legal/medical precision
all-MiniLM-L6-v2 (local) 384 Free Good Privacy, no API key
BAAI/bge-small-en (local) 384 Free Very good Best free model

Step 4: Vector Store

Once you have vectors, you need to store them so you can search them later.

from ragkit.stores import MemoryStore, SupabaseStore

# Development: fast, in-process, no setup
store = MemoryStore()

# Production: persistent, searchable across restarts, scales to millions
store = SupabaseStore(url="https://xxxx.supabase.co", key="eyJ...")

How vector search works:

The store computes cosine similarity between your query vector and every stored chunk vector. Chunks most similar in direction to the query are returned.

query vector: [0.21, -0.54, 0.88, ...]

chunk A: [0.19, -0.51, 0.85, ...]  → similarity: 0.98  ✓ very similar
chunk B: [0.20, -0.52, 0.87, ...]  → similarity: 0.97  ✓ similar  
chunk C: [-0.72, 0.33, -0.41, ...] → similarity: 0.12  ✗ unrelated

MemoryStore does a linear scan (O(n)) — fine for < 10K chunks.
SupabaseStore uses an HNSW index — sub-10ms at millions of vectors.

What is HNSW?
Hierarchical Navigable Small World. A graph-based index where each node connects to its nearest neighbors. Search navigates the graph instead of scanning every vector. O(log n) instead of O(n). You never need to build or maintain it — pgvector handles it automatically.


Step 5: Retrieval

Given a query vector, return the most relevant chunks.

from ragkit.retrievers import topk_retriever, mmr_retriever

# Simple: return the 5 most similar chunks
chunks = topk_retriever(store, query_vector, top_k=5)

# Advanced: return 5 diverse chunks (avoids redundant results)
chunks = mmr_retriever(store, query_vector, top_k=5, lambda_mult=0.5)

Top-K vs MMR:

Top-K returns the 5 most similar chunks. If your document says "refund" 10 times across different sections, you'll get 5 near-duplicate chunks. The LLM gets confused by repetition and wastes context.

MMR (Maximal Marginal Relevance) picks chunks one at a time, penalizing choices that are too similar to what was already picked. Each selected chunk must contribute new information.

Query: "refund policy"

Top-K results:          MMR results:
1. "Refunds in 7 days"  1. "Refunds in 7 days"     ← most relevant
2. "Refunds in 7 days"  2. "Contact support for..."  ← new info
3. "7 day refund limit" 3. "Cancellations vs refunds" ← new info
4. "7 day refund limit" 4. "Razorpay processes..."   ← new info
5. "Refunds available"  5. "Exceptions to refunds"  ← new info

Use MMR when your documents have repetitive content. Use Top-K otherwise.


Step 6 (Optional): Reranking

Vector similarity measures "are these about the same topic?" — not "does this directly answer the question?"

The reranker reads each chunk and the query, then scores relevance directly. More expensive (1 LLM call per chunk), but significantly more precise.

from ragkit.rerankers import llm_reranker

# Initial retrieval: 8 candidates by vector similarity
candidates = topk_retriever(store, query_vector, top_k=8)

# Rerank: LLM scores each on 1-10 relevance, return top 3
final = llm_reranker(candidates, query="refund policy", llm=generator, top_k=3)

Use reranking when:

  • Answer accuracy matters more than speed/cost
  • You're seeing the LLM use slightly-wrong chunks
  • Your documents have many similar-sounding sections

Skip reranking when:

  • You're building a high-QPS API (latency will hurt)
  • Your queries are simple and retrieval quality is already good

Step 7: Generation

Feed the retrieved chunks + the user's question to an LLM.

from ragkit.generators import GroqGenerator

generator = GroqGenerator(api_key="gsk_...")
answer = generator.generate(
    query="What is the refund policy?",
    chunks=retrieved_chunks,
)

The generator formats chunks into a numbered context block and passes it to the LLM with a system prompt that says: "Answer using ONLY the provided context. Never make up information."

This grounding instruction is critical. Without it, the LLM will blend retrieved facts with its training data and hallucinate confidently.

The answer will cite [1], [2], etc. corresponding to the numbered chunks. Show these citations to your users so they can verify the source.


Installation

# Minimal (choose your providers):
pip install rag-kit[groq]       # + Groq LLM
pip install rag-kit[openai]     # + OpenAI embeddings + GPT
pip install rag-kit[anthropic]  # + Claude

# Add-ons:
pip install rag-kit[pdf]        # PDF loading
pip install rag-kit[url]        # URL/webpage loading
pip install rag-kit[supabase]   # Persistent vector store
pip install rag-kit[local]      # Local embeddings (no API key)

# Everything:
pip install rag-kit[all]

Quick Start

1. Basic (in-memory, no persistence)

import os
from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator

rag = RAGPipeline(
    embedder=OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"]),
    generator=GroqGenerator(api_key=os.environ["GROQ_API_KEY"]),
)

rag.ingest("company_handbook.pdf")       # PDF
rag.ingest("https://docs.myapp.com")     # URL
rag.ingest_text("Prices: Pro = ₹399/mo") # Raw string

result = rag.query("What are the pricing tiers?")
print(result.answer)

for chunk in result.sources:
    print(f"  Source: {chunk.metadata['source']}")

2. Production (Supabase persistence)

from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
from ragkit.stores import SupabaseStore

rag = RAGPipeline(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    generator=GroqGenerator(api_key="gsk_..."),
    store=SupabaseStore(url="https://xxxx.supabase.co", key="eyJ..."),
)

First, run the setup SQL in your Supabase SQL editor:

from ragkit.stores.supabase import SETUP_SQL
print(SETUP_SQL)  # copy and run this in Supabase

3. Free (local embeddings, no API key for embedding)

from ragkit import RAGPipeline
from ragkit.embedders import LocalEmbedder
from ragkit.generators import GroqGenerator  # Groq free tier is generous

rag = RAGPipeline(
    embedder=LocalEmbedder("all-MiniLM-L6-v2"),  # runs on your CPU, free
    generator=GroqGenerator(api_key="gsk_..."),   # Groq free tier
)

4. Advanced (MMR + reranking for maximum quality)

from ragkit import RAGPipeline
from ragkit.embedders import OpenAIEmbedder
from ragkit.generators import GroqGenerator
from ragkit.stores import SupabaseStore

rag = RAGPipeline(
    embedder=OpenAIEmbedder(api_key="sk-..."),
    generator=GroqGenerator(api_key="gsk_...", model="llama3-70b-8192"),
    store=SupabaseStore(url="...", key="..."),
    chunker="recursive",
    chunk_size=600,
    chunk_overlap=80,
    retriever="mmr",          # diverse retrieval
    top_k=6,
    reranker=True,            # LLM re-scores for precision
)

API Reference

RAGPipeline

RAGPipeline(
    embedder,                # Required: OpenAIEmbedder | LocalEmbedder
    generator,               # Required: GroqGenerator | OpenAIGenerator | AnthropicGenerator
    store=None,              # MemoryStore() by default; pass SupabaseStore for persistence
    chunker="recursive",     # "fixed" | "recursive" | "semantic"
    chunk_size=500,          # characters per chunk
    chunk_overlap=50,        # overlap between consecutive chunks
    retriever="topk",        # "topk" | "mmr"
    top_k=5,                 # number of chunks to retrieve
    min_score=0.0,           # discard chunks below this similarity (0.0–1.0)
    reranker=None,           # True to enable LLM reranking
)
Method Returns Description
.ingest(source) int Load, chunk, embed, store a file/URL. Returns chunk count.
.ingest_text(text, source_label) int Same but from a raw string.
.query(question) QueryResult Embed query, retrieve, generate, return answer + sources.

QueryResult

result.answer   # str — the LLM's answer, grounded in retrieved chunks
result.sources  # list[Chunk] — the chunks that were used

Chunk

chunk.text                         # str — the text content
chunk.metadata["source"]           # str — file path or URL
chunk.metadata["chunk_index"]      # int — position in original document
chunk.metadata["total_chunks"]     # int — total chunks in this document
chunk.metadata["strategy"]         # str — "fixed" | "recursive" | "semantic"

Common Patterns

Show citations in a chat UI

result = rag.query(user_message)

response_parts = [result.answer, "\n\n**Sources:**"]
for i, chunk in enumerate(result.sources, 1):
    source = chunk.metadata.get("source", "unknown")
    preview = chunk.text[:120].strip().replace("\n", " ")
    response_parts.append(f"[{i}] {source}: _{preview}..._")

final_response = "\n".join(response_parts)

Ingest only new documents (avoid duplicates)

already_ingested = {"report_q1.pdf", "handbook.pdf"}

for file in Path("docs").glob("*.pdf"):
    if file.name not in already_ingested:
        n = rag.ingest(str(file))
        print(f"Ingested {file.name}: {n} chunks")

Filter by source document

With SupabaseStore, you can search only within a specific document:

results = store.search(
    query_embedding,
    top_k=5,
    filter={"source": "hr-policy.pdf"},
)

Streaming responses

# GroqGenerator supports streaming
from groq import Groq

client = Groq(api_key="gsk_...")
stream = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[...],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Choosing Your Stack

Need Pick
Fastest setup MemoryStore + OpenAIEmbedder + GroqGenerator
Zero API cost MemoryStore + LocalEmbedder + GroqGenerator (free tier)
Production persistence SupabaseStore + any embedder/generator
Maximum accuracy SupabaseStore + OpenAIEmbedder + mmr retriever + reranker + GPT-4o
Private / on-premise MemoryStore + LocalEmbedder + local Ollama generator

What You've Learned

If you read this far and ran the examples, you understand:

  • RAG architecture — loader → chunker → embedder → store → retriever → generator
  • Chunking strategies — fixed, recursive, semantic; why overlap matters
  • Embeddings — what they are, how cosine similarity works, how to pick a model
  • Vector search — how HNSW indexing works at scale
  • Retrieval strategies — Top-K vs MMR, when diversity matters
  • Reranking — LLM-as-judge, when precision > speed
  • Generation — how to write grounding prompts, how to show citations

This is the complete knowledge stack behind every "chat with your docs" product, every enterprise knowledge base, and every AI assistant with document understanding. No paid course required.


What's Next

Now that you understand RAG, the natural next steps:

  1. Agents — instead of one retrieval+generation step, let the LLM decide when to retrieve and what to do with the result. See agent-loop.

  2. Memory — give your RAG system episodic memory (remember past conversations) and semantic memory (retrieve relevant facts from prior sessions). See mem-store.

  3. Evals — measure whether your RAG pipeline is actually answering correctly. See eval-bench.


Troubleshooting

These are the errors every beginner hits. Fixes are here so you don't lose an hour to them.


ModuleNotFoundError: No module named 'ragkit'

You haven't installed the library yet, or your virtual environment isn't activated.

# Make sure your venv is active first
source venv/bin/activate       # Mac/Linux
venv\Scripts\activate          # Windows

# Then install
pip install rag-kit[openai,groq]

ModuleNotFoundError: No module named 'openai' (or groq, supabase, etc.)

rag-kit has zero mandatory dependencies. You only get the extras you ask for.

pip install rag-kit[openai]     # for OpenAIEmbedder / OpenAIGenerator
pip install rag-kit[groq]       # for GroqGenerator
pip install rag-kit[supabase]   # for SupabaseStore
pip install rag-kit[pdf]        # for load_pdf()
pip install rag-kit[url]        # for load_url()
pip install rag-kit[local]      # for LocalEmbedder
pip install rag-kit[all]        # everything at once

AuthenticationError / 401 Unauthorized

Your API key is wrong or not set.

# Bad — key hardcoded with a typo or expired key
embedder = OpenAIEmbedder(api_key="sk-abc123WRONG")

# Good — read from environment variable
import os
embedder = OpenAIEmbedder(api_key=os.environ["OPENAI_API_KEY"])

Double-check:

  1. You copied the full key (they're long — don't cut it off)
  2. The key is for the right service (OpenAI key ≠ Groq key)
  3. Your .env file is loaded before your script runs:
    export OPENAI_API_KEY=sk-...   # Mac/Linux — run this in your terminal first
    set OPENAI_API_KEY=sk-...      # Windows CMD
    $env:OPENAI_API_KEY="sk-..."   # Windows PowerShell
    

SyntaxError or TypeError on Python 3.9 or below

rag-kit uses list[str] and dict | None type hints, which require Python 3.10+.

python --version   # must show 3.10, 3.11, 3.12, or 3.13

# If you're on 3.9 or below, upgrade Python at python.org

result.answer is "I couldn't find relevant information"

This means no chunks passed the similarity threshold. Three possible causes:

1. Your document wasn't ingested yet.

rag.ingest("your_file.pdf")   # do this before rag.query()
result = rag.query("your question")

2. The question phrasing is too different from the document language.

Try rephrasing. "What is the money-back guarantee?" retrieves better than "refund" if the document uses the phrase "money-back guarantee."

3. min_score is set too high.

# Default min_score is 0.0 — everything passes through
# If you set it high (e.g. 0.8), lower it while debugging
rag = RAGPipeline(..., min_score=0.0)

SupabaseStore not finding results after ingestion

Make sure you ran the setup SQL first. Open your Supabase SQL editor and run:

from ragkit.stores.supabase import SETUP_SQL
print(SETUP_SQL)

Copy the output and paste it into Supabase → SQL Editor → Run. This creates the table, HNSW index, and match_chunks function that the store depends on.


Still stuck?

Open an issue at github.com/iamadhitya1/rag-kit/issues with:

  1. Your Python version (python --version)
  2. The full error message (copy-paste, don't screenshot)
  3. The 5 lines of code that triggered it

License

MIT © 2025 M Adhitya

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragkit_adhitya-0.1.0.tar.gz (39.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ragkit_adhitya-0.1.0-py3-none-any.whl (34.4 kB view details)

Uploaded Python 3

File details

Details for the file ragkit_adhitya-0.1.0.tar.gz.

File metadata

  • Download URL: ragkit_adhitya-0.1.0.tar.gz
  • Upload date:
  • Size: 39.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for ragkit_adhitya-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4254343b8652bc92dbe2c59a99baacb0422a16d7088620c0cd968ffd61b50b5b
MD5 43c06691cd01d9d97e9b850b261ca508
BLAKE2b-256 54a16dae56abcfebc13a978cff7b25404259684d57e9267716607ba495d3ebf2

See more details on using hashes here.

File details

Details for the file ragkit_adhitya-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: ragkit_adhitya-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 34.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for ragkit_adhitya-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ed6d2506ef067078efc38eecc6213125ac36b9d2a1598d43e0af6714145dcc3f
MD5 a02ce2901dbe021f1f8c2077df4cee02
BLAKE2b-256 789c7473280fd86c1c4c596490845b072f1e10f532b17d4ae8ba8630723dd5b5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page