
retrievalagent — multi-backend retrieval-augmented generation with LangGraph


retrievalagent

An autonomous retrieval-augmented generation agent. Plug in any vector store, any LLM, any reranker. Hybrid search, reranking, query rewriting, an LLM quality gate, and an autonomous retry loop — built on LangGraph.

PyPI Python License: MIT CI


from retrievalagent import init_agent

rag = init_agent("documents", model="openai:gpt-5.4", backend="qdrant")
state = rag.chat("What is the status of operation overlord?")
print(state.answer)

Scope — Retrieval, Not Ingestion

retrievalagent is built for retrieval quality at query time — hybrid search, reranking, query rewriting, an autonomous retry loop, and an LLM quality gate.

Ingestion is out of scope. The library does not chunk, clean, embed-at-scale, or index your corpus. Use a dedicated tool for that — Docling, Unstructured, LlamaIndex, a Databricks job, or a custom script — then point retrievalagent at the resulting index. Every backend exposes a minimal add_documents() helper for convenience and smoke tests; it is not meant to replace a real ingestion pipeline.

The narrow surface is deliberate: one thing, done well.
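
A minimal sketch of that split, assuming the in-memory backend can be constructed without an embed function for BM25-only search (per the base install below): add_documents() covers a throwaway smoke test, while a real corpus is indexed by your own pipeline and the agent simply points at the existing collection.

from retrievalagent import Agent, InMemoryBackend, init_agent

# Smoke test: a handful of docs through the convenience helper
backend = InMemoryBackend()
backend.add_documents([{"content": "hello retrieval", "source": "smoke"}])
rag = Agent(index="smoke", backend=backend)

# Production: the corpus was chunked, embedded, and indexed elsewhere;
# the agent only queries the existing collection
rag = init_agent("documents", backend="qdrant", backend_url="http://localhost:6333")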


What it does

Most retrieval systems do a single search pass. retrievalagent runs a state machine that retrieves, evaluates the result, rewrites when needed, and retries — all autonomously, up to max_iter rounds.

Per query the agent will:

  1. Understand the intent — rewrite the question into precise search keywords, detect keyword vs. semantic intent, pick the hybrid ratio, extract negated terms ("cola nicht zero", German for "cola, not zero" → filter out "zero").
  2. Search in parallel — keyword search (BM25 on the original query) and synonym search (BM25 across spell-corrected form + synonym/alias expansions) run concurrently. Semantic backup fires only when BM25 scores are low.
  3. Merge and diversify — deduplicate all pools, run an MMR diversity pass to remove near-duplicates, then rerank.
  4. Evaluate — an LLM quality gate decides whether the retrieved docs actually answer the question.
  5. Adapt — if not, rewrite the query (passing the top-3 doc snippets so the LLM can reason about what went wrong) and retry.
  6. Generate — only once the evidence holds, produce a cited, grounded answer.
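
For example, the loop's outcome can be inspected through the RAGState fields documented in the API reference further down:

from retrievalagent import init_agent

rag = init_agent("documents", backend="qdrant", backend_url="http://localhost:6333")

state = rag.invoke("status of operation overlord")
print(state.answer)          # cited, grounded answer
print(state.query)           # the rewritten search query the agent settled on
print(state.iterations)      # how many retrieve-judge-rewrite rounds it took
print(len(state.documents))  # evidence that passed the quality gate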

Features

  • Fully async pipeline — parallel keyword + synonym fan-out, zero blocking calls; every public op has sync and async variants.
  • Lean parallel pipeline — prepare → [keyword ∥ synonym] → quality_gate → [semantic_backup →] merge_rerank → generate; semantic backup only fires when BM25 scores are low.
  • Synonym + spell-correct — one LLM call per query expands synonyms/aliases, runs the spell-corrected form as a parallel BM25 query (original query preserved), and extracts negated terms.
  • Negative filter extraction — "cola aber nicht zero" ("cola but not zero") → excluded_terms=["zero"]; post-filtered in merge; works for any negated concept.
  • MMR diversity — bag-of-words Maximal Marginal Relevance (lam=0.7) removes near-duplicate docs before reranking.
  • LLM quality gate — rejects weak results, drives the rewrite loop until the evidence holds.
  • Retry reasoning — rewrite passes top-3 doc snippets to the LLM so it can diagnose why the previous results were wrong.
  • Autonomous retry loop — retrieve → judge → rewrite → retry, up to max_iter rounds.
  • Hybrid search — BM25 + vector, fused with RRF or DBSF.
  • HyDE — hypothetical document embeddings for vague queries.
  • Tool-calling agent — get_index_settings, get_filter_values, search_hybrid, search_bm25, rerank_results; the LLM picks tools dynamically.
  • Multiple rerankers — Cohere, HuggingFace, Jina, ColBERT, RankGPT, embed-anything, or a custom callable.
  • 8 search backends — Meilisearch, Azure AI Search, ChromaDB, LanceDB, Qdrant, pgvector, DuckDB, InMemory.
  • Any LLM — OpenAI, Azure, Anthropic, Ollama, Vertex AI, or any LangChain BaseChatModel.
  • One-line init — init_agent("docs", model="openai:gpt-5.4", backend="qdrant").
  • Multi-turn chat — conversation history with citation-aware answers.
  • Auto-strategy — the agent samples your collection at init and tunes itself.
  • Custom instructions — append domain-specific rules to any prompt via instructions= or RAGConfig.custom_instructions. Full guide.
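
For readers unfamiliar with MMR, the diversity pass works roughly like the following bag-of-words sketch (illustrative of the technique, not the library's implementation):

import math
from collections import Counter

def _cosine(a: Counter, b: Counter) -> float:
    # cosine similarity over term-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_select(query: str, docs: list[str], lam: float = 0.7, k: int = 10) -> list[str]:
    # greedily pick docs that are relevant to the query yet dissimilar to what is already selected
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    selected: list[int] = []
    while len(selected) < min(k, len(docs)):
        best, best_score = -1, float("-inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            relevance = _cosine(q, v)
            redundancy = max((_cosine(v, vecs[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [docs[i] for i in selected]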

Install

# Recommended — Meilisearch + Cohere reranker + interactive CLI
pip install retrievalagent[recommended]

# Base only — in-memory backend, BM25 keyword search
pip install retrievalagent
Extra What you get Command
recommended Meilisearch + Cohere reranker + Rich CLI pip install retrievalagent[recommended]
cli Interactive CLI with guided setup wizard pip install retrievalagent[cli]
all Every backend + reranker + CLI pip install retrievalagent[all]
Individual backends & rerankers
pip install retrievalagent[meilisearch]
pip install retrievalagent[azure]
pip install retrievalagent[chromadb]
pip install retrievalagent[lancedb]
pip install retrievalagent[pgvector]
pip install retrievalagent[qdrant]
pip install retrievalagent[duckdb]
pip install retrievalagent[cohere]
pip install retrievalagent[huggingface]
pip install retrievalagent[jina]
pip install retrievalagent[rerankers]      # ColBERT, Flashrank, RankGPT, …
pip install retrievalagent[embed-anything] # Local Rust-accelerated embeddings + reranking

Mix and match: pip install retrievalagent[qdrant,cohere,cli]


Quick Start

One-liner with init_agent

The fastest way to get started — no provider imports, string aliases for everything:

from retrievalagent import init_agent

# Minimal — in-memory backend, LLM from env vars
rag = init_agent("docs")

# OpenAI + Qdrant + Cohere reranker
rag = init_agent(
    "my-collection",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    reranker="cohere",
)

# Anthropic + Azure AI Search (native vectorisation, no client-side embeddings)
rag = init_agent(
    "my-index",
    model="anthropic:claude-sonnet-4-6",
    gen_model="anthropic:claude-opus-4-6",
    backend="azure",
    backend_url="https://my-search.search.windows.net",
    reranker="huggingface",
    auto_strategy=True,
)

# Fully local — Ollama + ChromaDB + HuggingFace cross-encoder
rag = init_agent(
    "docs",
    model="ollama:llama3",
    backend="chroma",
    reranker="huggingface",
    reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
)

Multi-collection routing

Pass several collections and let the agent decide which to search. The LLM picks the relevant subset before retrieval, using either the collection names alone or optional natural-language descriptions.

from retrievalagent import init_agent

# List form — LLM routes by name only
rag = init_agent(
    collections=["products", "faq", "policies"],
    backend="qdrant",
    backend_url="http://localhost:6333",
    model="openai:gpt-5.4",
)

# Dict form — LLM routes using descriptions (better precision)
rag = init_agent(
    collections={
        "products": "Product catalog: SKUs, prices, specs, availability",
        "faq":      "Customer-facing FAQ, troubleshooting, return policy",
        "policies": "Internal HR/legal/compliance policy documents",
    },
    backend="qdrant",
    backend_url="http://localhost:6333",
    model="openai:gpt-5.4",
)

rag.invoke("What's our return policy?")       # → routes to faq / policies
rag.invoke("Price of SKU 12345?")              # → routes to products

Each retrieved document carries its origin in metadata["_collection"] so you can merge, filter, or attribute citations downstream. One backend instance is built per collection; they share the same backend type and URL.
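
A sketch of using that marker downstream, with the retrieve-only API shown later in this README:

from collections import defaultdict

query, docs = rag.retrieve_documents("What's our return policy?")

by_collection = defaultdict(list)
for doc in docs:
    by_collection[doc.metadata["_collection"]].append(doc)

for name, hits in by_collection.items():
    print(name, len(hits))   # e.g. attribute citations per source collection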

Backend aliases

Alias Class Extra
"memory" / "in_memory" InMemoryBackend (none)
"chroma" / "chromadb" ChromaDBBackend retrievalagent[chromadb]
"qdrant" QdrantBackend retrievalagent[qdrant]
"lancedb" / "lance" LanceDBBackend retrievalagent[lancedb]
"duckdb" DuckDBBackend retrievalagent[duckdb]
"pgvector" / "pg" PgvectorBackend retrievalagent[pgvector]
"meilisearch" MeilisearchBackend retrievalagent[meilisearch]
"azure" AzureAISearchBackend retrievalagent[azure]

Reranker aliases

Alias Class reranker_model Extra
"cohere" CohereReranker Cohere model name (default: rerank-v3.5) retrievalagent[cohere]
"huggingface" / "hf" HuggingFaceReranker HF model name (default: cross-encoder/ms-marco-MiniLM-L-6-v2) retrievalagent[huggingface]
"jina" JinaReranker Jina model name (default: jina-reranker-v2-base-multilingual) retrievalagent[jina]
"llm" LLMReranker (uses the agent's LLM) (none)
"rerankers" RerankersReranker Any model from the rerankers library retrievalagent[rerankers]
"embed-anything" EmbedAnythingReranker ONNX reranker model (default: jina-reranker-v1-turbo-en) retrievalagent[embed-anything]
# Cohere (default model)
rag = init_agent("docs", model="openai:gpt-5.4", reranker="cohere")

# HuggingFace — multilingual model
rag = init_agent("docs", model="openai:gpt-5.4", reranker="huggingface",
                 reranker_model="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

# Jina
rag = init_agent("docs", model="openai:gpt-5.4", reranker="jina")  # uses JINA_API_KEY

# ColBERT via rerankers library
rag = init_agent("docs", model="openai:gpt-5.4", reranker="rerankers",
                 reranker_model="colbert-ir/colbertv2.0",
                 reranker_kwargs={"model_type": "colbert"})

# Pass a pre-built reranker instance directly
from retrievalagent import CohereReranker
rag = init_agent("docs", reranker=CohereReranker(model="rerank-v3.5", api_key="..."))

Model strings: any "provider:model-name" from LangChain's init_chat_model — openai, anthropic, azure_openai, google_vertexai, ollama, groq, mistralai, and more.

Manual setup

from retrievalagent import Agent, InMemoryBackend

backend = InMemoryBackend(embed_fn=my_embed_fn)
backend.add_documents([
    {"content": "RAG combines retrieval with generation", "source": "wiki"},
    {"content": "Vector search finds similar embeddings", "source": "docs"},
])

rag = Agent(index="demo", backend=backend)

# Single query → full answer
state = rag.invoke("What is retrieval-augmented generation?")
print(state.answer)

# Retrieve only — documents without LLM answer
query, docs = rag.retrieve_documents("What is retrieval-augmented generation?")
for doc in docs:
    print(doc.page_content)

# Override top-K at call time
query, docs = rag.retrieve_documents("hybrid search", top_k=3)

Agent.from_model — model string with explicit backend

from retrievalagent import Agent, QdrantBackend

rag = Agent.from_model(
    "openai:gpt-5.4-mini",          # fast model for routing & rewriting
    index="docs",
    gen_model="openai:gpt-5.4",     # powerful model for the final answer
    backend=QdrantBackend("docs", url="http://localhost:6333"),
)

Multi-turn Chat

from retrievalagent import Agent, ConversationTurn

rag = Agent(index="articles")
history: list[ConversationTurn] = []

state = rag.chat("What is hybrid search?", history)
history.append(ConversationTurn(question="What is hybrid search?", answer=state.answer))

state = rag.chat("How does it compare to pure vector search?", history)
print(state.answer)
print(f"Sources: {len(state.documents)}")

Async variant:

state = await rag.achat("What is hybrid search?", history)

Search-knowledge memory with mem0

history= only carries the current session. For long-term search knowledge that improves retrieval on the same corpus over time, plug mem0 into the agent. The store grows into a corpus-aware glossary of term mappings the agent has learned — informal-to-formal terms, brand spellings, aliases, common typos. It is not a user-preferences store.

When a user query resolves through a non-trivial term expansion (the matching documents used a different surface form than the query), the agent's grader flags it for storage. On future queries, mem0 recalls the relevant mapping and feeds it into BM25 so the same expansion happens automatically.

pip install mem0ai
from retrievalagent import init_agent

rag = init_agent("articles", memory=True)

cfg = {"configurable": {"user_id": "alice"}}
rag.invoke("...", config=cfg)
# → grader may store a search-fact (synonym / alias / typo mapping)
rag.invoke("...", config=cfg)
# → if a stored mapping clears the relevance gate it is injected
#   into BM25 and the system prompt

Two thresholds gate the flow:

  • memory_relevance_threshold (env RAG_MEMORY_RELEVANCE_THRESHOLD, default 0.7) — mem0 cosine score the recall must exceed before a stored fact reaches retrieval/generation.
  • memory_storage_threshold (env RAG_MEMORY_STORAGE_THRESHOLD, default 0.85) — LLM memory_confidence the grader must report before a new fact is persisted.

Writes are fire-and-forget: the graph schedules mem0.add(...) as a background asyncio task; the user-facing response never waits on memory I/O. await rag.adrain_background() before shutdown if you need the writes to land before exit.

state.trace carries the decision events (read_memory with memories/n_kept/threshold or skipped: below_threshold; final_grade with memory_should_store/memory_confidence) so you can tune the thresholds for your corpus. See docs/memory.md for the full memory matrix (history vs. checkpointer vs. memory_store vs. mem0).
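
To shift the two gates for a specific deployment, the documented environment variables can be set before the agent is constructed (they are read through the RAG_* config layer described under Configuration); values here are illustrative:

import os

# Raise the recall gate, lower the storage gate
os.environ["RAG_MEMORY_RELEVANCE_THRESHOLD"] = "0.8"
os.environ["RAG_MEMORY_STORAGE_THRESHOLD"] = "0.8"

from retrievalagent import init_agent

rag = init_agent("articles", memory=True)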


Architecture

retrievalagent has two operating modes — both fully autonomous:

Graph mode (rag.chat / rag.invoke)

The default. A LangGraph state machine that runs the full agentic pipeline:

Query
  │
  ▼
[Prepare]
  Contextualize, preprocess, filter-intent detection
  Tier-0 (ID fast-track) → generate directly
  │
  ├────────────────────────┐
  ▼                        ▼
[keyword_search]    [synonym_search]
 BM25 on              LLM: spell-correct + synonyms
 original query       + negation extract
                      BM25 fan-out (parallel)
  │                        │
  └───────────┬────────────┘
              ▼
          [evaluate]
         (fan-in sync)
              │
        ┌─────┴──────┐
      (score         (score
       high)          low)
        │              │
        │         [semantic_backup]
        │          Full hybrid search
        │              │
        └──────┬────────┘
               ▼
         [merge_rerank]
          Dedup + MMR diversity
          + rerank + boost
          + exclude negated terms
               │
         ┌─────┴──────────┐
       (ok)           (weak/empty)
         │                │
         ▼                ▼
     [generate]       [rewrite]
      Cited answer      LLM + top-3 doc
                        snippets → retry
         │
         ▼
  [final_grade] (skipped for Tier 0/1)
   LLM judges answer quality
         │
  Answer + [n] inline citations

Memory nodes (read_memory / write_memory) wrap the pipeline when mem0_memory= is set.

Tool-calling agent mode (rag.invoke_agent)

The agent receives a set of tools and reasons step-by-step, calling them in whatever order makes sense for the question. No fixed pipeline — pure field improvisation:

Query
  │
  ▼
[LLM Agent]  ◄──────────────────────────────────────┐
  Thinks: "What do I need to answer this?"           │
  │                                                  │
  ├── get_index_settings()                           │
  │   Discover filterable / sortable / boost fields  │
  │                                                  │
  ├── get_filter_values(field)                       │
  │   Sample real stored values for a field          │
  │   → build precise filter expressions             │
  │                                                  │
  ├── search_hybrid(query, filter, sort_fields)      │
  │   BM25 + vector, optional filter + sort boost    │
  │                                                  │
  ├── search_bm25(query, filter)                     │
  │   Fallback pure keyword search                   │
  │                                                  │
  ├── rerank_results(query, hits)                    │
  │   Re-rank with configured reranker               │
  │                                                  │
  └── [needs more info?] ─────────────────────────► │

  [done]
  │
  ▼
Answer  (tool calls explained inline)

Use invoke_agent when questions involve dynamic filtering — the agent inspects the index schema, samples real field values, builds filters on the fly, and decides whether to sort by business signals like popularity or recency.


Examples

1. Knowledge base Q&A (InMemory, no external services)

from retrievalagent import AgenticRAG, InMemoryBackend
from langchain_openai import ChatOpenAI

docs = [
    {"id": "1", "content": "The Eiffel Tower was built in 1889 for the World's Fair in Paris.", "topic": "history"},
    {"id": "2", "content": "The Louvre is the world's largest art museum, located in Paris.", "topic": "art"},
    {"id": "3", "content": "Photosynthesis converts sunlight and CO2 into glucose and oxygen.", "topic": "science"},
    {"id": "4", "content": "The Python programming language was created by Guido van Rossum in 1991.", "topic": "tech"},
    {"id": "5", "content": "Machine learning is a subset of artificial intelligence.", "topic": "tech"},
]

backend = InMemoryBackend(documents=docs)
llm = ChatOpenAI(model="gpt-5.4-mini")

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)

state = rag.invoke("When was the Eiffel Tower built?")
print(state.answer)
# → "The Eiffel Tower was built in 1889 for the World's Fair in Paris. [1]"
print(state.query)        # rewritten query
print(state.iterations)   # how many retrieval rounds it took

2. Retrieve documents without generating an answer

Useful when you want the docs and will handle the answer yourself:

from retrievalagent import AgenticRAG, InMemoryBackend

rag = AgenticRAG(index="kb", backend=backend)

query, docs = rag.retrieve_documents("machine learning", top_k=3)
print(f"Rewritten query: {query}")
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)  # original fields + _rankingScore

3. Multi-turn chat

from retrievalagent import AgenticRAG, InMemoryBackend, ConversationTurn

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)
history: list[ConversationTurn] = []

q1 = "What is machine learning?"
s1 = rag.chat(q1, history)
history.append(ConversationTurn(question=q1, answer=s1.answer))
print(s1.answer)

q2 = "How does it relate to AI?"   # pronoun resolved from history
s2 = rag.chat(q2, history)
history.append(ConversationTurn(question=q2, answer=s2.answer))
print(s2.answer)

4. Always-on filter (e-commerce: in-stock items only)

from retrievalagent import AgenticRAG, MeilisearchBackend

backend = MeilisearchBackend(
    "products",
    url="http://localhost:7700",
    api_key="masterKey",
)

# Every search is scoped to in-stock items — no per-call boilerplate
rag = AgenticRAG(
    index="products",
    backend=backend,
    filter="is_in_stock = true",
    llm=llm,
    gen_llm=llm,
)

state = rag.invoke("red running shoes size 42")
for doc in state.documents:
    print(doc.metadata["product_name"], "|", doc.metadata["price"])

5. Filter + own-brand exclusion

# Exclude own-brand articles and search for third-party alternatives
rag = AgenticRAG(
    index="products",
    backend=backend,
    filter="is_own_brand = false",
    llm=llm,
    gen_llm=llm,
)

state = rag.invoke("Find alternatives to our house-brand brake cleaner 500ml")
print(state.answer)
# LLM strips the brand prefix, rewrites to "brake cleaner 500ml",
# filter ensures only third-party results are returned.

6. Async usage (FastAPI / Databricks / Jupyter)

import asyncio
from retrievalagent import AgenticRAG, InMemoryBackend

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)

# Async single query
state = await rag.ainvoke("What is photosynthesis?")
print(state.answer)

# Async batch — runs all queries in parallel
states = await rag.abatch([
    "What is photosynthesis?",
    "Who created Python?",
    "Where is the Louvre?",
])
for s in states:
    print(s.answer)

Sync variants work from any context including Databricks/Jupyter (running event loop is handled automatically):

# Safe to call from a notebook cell even with a running event loop
state = rag.invoke("What is photosynthesis?")
states = rag.batch(["question one", "question two"])

7. Tool-calling agent — dynamic filter discovery

When you don't know the filter values upfront, the agent inspects the schema and samples field values itself:

from retrievalagent import AgenticRAG, MeilisearchBackend

rag = AgenticRAG(
    index="products",
    backend=MeilisearchBackend("products", url="http://localhost:7700"),
    llm=llm,
    gen_llm=llm,
)

# Agent calls get_index_settings() → get_filter_values("brand") →
# search_hybrid(filter="brand = 'Bosch'", sort_fields=["popularity"])
result = rag.invoke_agent("Show me the most popular Bosch power tools")
print(result)

8. Streaming the final answer

async def stream_answer():
    async for chunk in rag.astream("Explain hybrid search in simple terms"):
        print(chunk, end="", flush=True)

asyncio.run(stream_answer())

9. Qdrant — vector search with metadata payloads

from retrievalagent import AgenticRAG, QdrantBackend
from qdrant_client import QdrantClient, models

# Insert docs (done once); embed is your embedding callable: (str) -> list[float]
client = QdrantClient("http://localhost:6333")
client.upsert("articles", points=[
    models.PointStruct(id=1, vector=embed("RAG combines retrieval and generation"),
                       payload={"content": "RAG combines retrieval and generation", "year": 2023}),
    models.PointStruct(id=2, vector=embed("Vector databases store high-dimensional embeddings"),
                       payload={"content": "Vector databases store high-dimensional embeddings", "year": 2022}),
])

rag = AgenticRAG(
    index="articles",
    backend=QdrantBackend("articles", url="http://localhost:6333", embed_fn=embed),
    llm=llm,
    gen_llm=llm,
)

# Query the Qdrant-backed index
state = rag.invoke("what is RAG?")

# Retrieve-only variant: documents without an LLM answer
_, docs = rag.retrieve_documents("vector databases")

10. Custom instructions (tone / domain)

rag = AgenticRAG(
    index="legal_docs",
    backend=backend,
    llm=llm,
    gen_llm=llm,
    instructions=(
        "You are a legal assistant. Answer in formal language. "
        "Always cite the article number when referencing a law. "
        "If the context is insufficient, say so explicitly."
    ),
)

state = rag.invoke("What are the notice periods for dismissal?")
print(state.answer)

Backends

Azure AI Search

Native hybrid search — no client-side embeddings needed when the index has an integrated vectorizer:

from retrievalagent import Agent, AzureAISearchBackend

# Native vectorization — service embeds the query server-side
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
    ),
)

# Client-side vectorization
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
        embed_fn=my_embed_fn,
    ),
)

# With Azure semantic reranking
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
        semantic_config="my-semantic-config",
    ),
)

Qdrant

from retrievalagent import Agent, QdrantBackend

rag = Agent(
    index="my_collection",
    backend=QdrantBackend("my_collection", url="http://localhost:6333", embed_fn=my_embed_fn),
)

ChromaDB

from retrievalagent import Agent, ChromaDBBackend

rag = Agent(
    index="my_collection",
    backend=ChromaDBBackend("my_collection", path="./chroma_db", embed_fn=my_embed_fn),
)

LanceDB

from retrievalagent import Agent, LanceDBBackend

rag = Agent(
    index="docs",
    backend=LanceDBBackend("docs", db_uri="./lancedb", embed_fn=my_embed_fn),
)

PostgreSQL + pgvector

from retrievalagent import Agent, PgvectorBackend

rag = Agent(
    index="documents",
    backend=PgvectorBackend(
        "documents",
        dsn="postgresql://user:pass@localhost:5432/mydb",
        embed_fn=my_embed_fn,
    ),
)

DuckDB

from retrievalagent import Agent, DuckDBBackend

rag = Agent(
    index="vectors",
    backend=DuckDBBackend("vectors", db_path="./my.duckdb", embed_fn=my_embed_fn),
)

Meilisearch

from retrievalagent import Agent, MeilisearchBackend

rag = Agent(
    index="articles",
    backend=MeilisearchBackend("articles", url="http://localhost:7700", api_key="masterKey"),
)

InMemory (default, zero dependencies)

from retrievalagent import Agent, InMemoryBackend

backend = InMemoryBackend(embed_fn=my_embed_fn)
backend.add_documents([
    {"content": "RAG combines retrieval with generation", "source": "wiki"},
    {"content": "Vector search finds similar embeddings", "source": "docs"},
])

rag = Agent(index="demo", backend=backend)

LLM Configuration

Pass a pre-built LangChain model or use init_agent / Agent.from_model for string-based init.
When using Agent directly, configure via env vars or pass an explicit model instance.

OpenAI

from langchain_openai import ChatOpenAI
from retrievalagent import Agent

rag = Agent(
    index="articles",
    llm=ChatOpenAI(model="gpt-5.4", api_key="sk-..."),
    gen_llm=ChatOpenAI(model="gpt-5.4", api_key="sk-..."),
)

Azure OpenAI (explicit keys)

from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

llm = AzureChatOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_deployment="gpt-5.4",
    api_key="...",
    api_version="2024-12-01-preview",
)
rag = Agent(index="articles", llm=llm, gen_llm=llm)

Azure OpenAI (env vars)

# Set: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT
from retrievalagent import Agent

rag = Agent(index="articles")  # auto-detected

Azure OpenAI with Managed Identity (no API key)

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
llm = AzureChatOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_deployment="gpt-5.4",
    azure_ad_token_provider=token_provider,
    api_version="2024-12-01-preview",
)
rag = Agent(index="articles", llm=llm, gen_llm=llm)

Anthropic Claude

pip install langchain-anthropic
from langchain_anthropic import ChatAnthropic
from retrievalagent import Agent

llm = ChatAnthropic(model="claude-sonnet-4-6", api_key="sk-ant-...")
rag = Agent(index="articles", llm=llm, gen_llm=llm)

Ollama (local, no API key)

pip install langchain-ollama
from langchain_ollama import ChatOllama
from retrievalagent import Agent

rag = Agent(
    index="articles",
    llm=ChatOllama(model="llama3.2", base_url="http://localhost:11434"),
    gen_llm=ChatOllama(model="llama3.2", base_url="http://localhost:11434"),
)

Google Vertex AI

pip install langchain-google-vertexai
from langchain_google_vertexai import ChatVertexAI
from retrievalagent import Agent

llm = ChatVertexAI(model="gemini-2.0-flash", project="my-gcp-project", location="us-central1")
rag = Agent(index="articles", llm=llm, gen_llm=llm)

Separate fast and generation models

Three LLM slots — llm (utility), gen_llm (generation), grader_llm (answer grader):

Slot Used for Recommended
llm synonym expand, spell-correct, rewrite, quality-gate cheap (gpt-5.4-mini)
gen_llm final cited answer powerful (gpt-5.5)
grader_llm answer grader — set via config.grader_model inherits gen_llm
from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

fast_llm = AzureChatOpenAI(azure_deployment="gpt-5.4-mini", api_key="...", api_version="2024-12-01-preview")
gen_llm  = AzureChatOpenAI(azure_deployment="gpt-5.5",      api_key="...", api_version="2024-12-01-preview")

rag = Agent(index="articles", llm=fast_llm, gen_llm=gen_llm)

Rerankers

Cohere

from retrievalagent import Agent, CohereReranker

rag = Agent(index="articles", reranker=CohereReranker(model="rerank-v3.5", api_key="..."))

HuggingFace cross-encoder (local, no API key)

pip install retrievalagent[huggingface]
from retrievalagent import Agent, HuggingFaceReranker

rag = Agent(index="articles", reranker=HuggingFaceReranker())

# Multilingual
rag = Agent(index="articles", reranker=HuggingFaceReranker(model="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"))

Jina (multilingual API)

pip install retrievalagent[jina]
from retrievalagent import Agent, JinaReranker

rag = Agent(index="articles", reranker=JinaReranker(api_key="..."))  # or JINA_API_KEY env var

rerankers — ColBERT / Flashrank / RankGPT / any cross-encoder

Unified bridge to the rerankers library by answer.ai:

pip install retrievalagent[rerankers]
from retrievalagent import Agent, RerankersReranker

rag = Agent(index="articles", reranker=RerankersReranker("cross-encoder/ms-marco-MiniLM-L-6-v2", model_type="cross-encoder"))
rag = Agent(index="articles", reranker=RerankersReranker("colbert-ir/colbertv2.0", model_type="colbert"))
rag = Agent(index="articles", reranker=RerankersReranker("flashrank", model_type="flashrank"))
rag = Agent(index="articles", reranker=RerankersReranker("gpt-5.4-mini", model_type="rankgpt", api_key="..."))

Embed-anything — Rust-accelerated local embeddings + reranking

Embeddings and reranking in a single Rust-powered package. Fully local — no API keys, no network calls. Powered by embed-anything.

pip install retrievalagent[embed-anything]
from retrievalagent import Agent, EmbedAnythingEmbedder, EmbedAnythingReranker

# Local embeddings — works as embed_fn (callable)
embedder = EmbedAnythingEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Local reranker — implements Reranker protocol
reranker = EmbedAnythingReranker("jinaai/jina-reranker-v1-turbo-en")

rag = Agent(
    index="articles",
    backend=QdrantBackend("articles", url="http://localhost:6333", embed_fn=embedder),
    embed_fn=embedder,
    reranker=reranker,
)

Mix and match freely — use embed-anything for one piece and a cloud provider for the other:

from retrievalagent import Agent, EmbedAnythingEmbedder, CohereReranker

# Local embeddings + cloud reranker
rag = Agent(index="docs", embed_fn=EmbedAnythingEmbedder(), reranker=CohereReranker())

# Cloud embeddings + local reranker
from retrievalagent import EmbedAnythingReranker
rag = Agent(index="docs", embed_fn=azure_embed_fn, reranker=EmbedAnythingReranker())

Custom reranker

from retrievalagent import Agent, RerankResult

class MyReranker:
    def rerank(self, query: str, documents: list[str], top_n: int) -> list[RerankResult]:
        # trivial example: keep the incoming order, score by rank
        return [RerankResult(index=i, relevance_score=1.0 / (i + 1)) for i in range(min(top_n, len(documents)))]

rag = Agent(index="articles", reranker=MyReranker())

Tools

When using invoke_agent, the LLM has access to a set of tools it can call in any order. No fixed pipeline — the agent decides what it needs.

Tool Description
get_index_settings() Discover filterable, searchable, sortable, and boost fields from the index schema
get_filter_values(field) Sample real stored values for a field — used to build precise filter expressions
search_hybrid(query, filter_expr, semantic_ratio, sort_fields) BM25 + vector hybrid search with optional filter and sort boost
search_bm25(query, filter_expr) Pure keyword search — fallback when hybrid returns poor results
rerank_results(query, hits) Re-rank a list of hits with the configured reranker

The agent follows this reasoning pattern:

  1. Call get_index_settings() to learn the schema
  2. If the question names a specific entity, call get_filter_values(field) to find the exact stored value
  3. Call search_hybrid() with a filter and/or sort if relevant, otherwise broad hybrid search
  4. Fall back to search_bm25() if results are thin
  5. Call rerank_results() to surface the most relevant hits
  6. Summarise — explaining which filters and signals influenced the answer
from retrievalagent import Agent

rag = Agent(index="products")

# Agent inspects schema, detects brand field, samples values,
# builds filter, sorts by popularity signal — all autonomously
result = rag.invoke_agent("Show me the most popular Bosch power tools")
print(result)

Constructor Reference

Agent(
    index="my_index",           # collection / index name
    backend=...,                # SearchBackend (default: InMemoryBackend)
    llm=...,                    # fast LLM — routing, rewrite, filter
    gen_llm=...,                # generation LLM — final answer
    reranker=...,               # Cohere / HuggingFace / Jina / custom
    top_k=10,                   # final result count            [RAG_TOP_K]
    rerank_top_n=5,             # reranker top-n                [RAG_RERANK_TOP_N]
    retrieval_factor=4,         # over-retrieval multiplier     [RAG_RETRIEVAL_FACTOR]
    max_iter=20,                # max retrieve-rewrite cycles   [RAG_MAX_ITER]
    semantic_ratio=0.5,         # hybrid semantic weight        [RAG_SEMANTIC_RATIO]
    fusion="rrf",               # "rrf" or "dbsf"               [RAG_FUSION]
    instructions="",            # extra system prompt for generation
    embed_fn=None,              # (str) -> list[float]
    boost_fn=None,              # (doc_dict) -> float score boost
    filter=None,                # always-on Meilisearch filter expr (e.g. "brand = 'Bosch'")
    category_fields=None,       # fields used by alternative retrieve (None → auto-detect via regex)
    hyde_min_words=8,           # min words to trigger HyDE     [RAG_HYDE_MIN_WORDS]
    hyde_style_hint="",         # style hint for HyDE prompt
    auto_strategy=True,         # auto-tune from index samples
)
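
For example, boost_fn receives the stored document dict and returns a numeric boost; how that float is folded into the retrieval score is the library's concern, so treat this as a shape sketch:

from retrievalagent import Agent

def boost_recent(doc: dict) -> float:
    # signature per the reference above: stored document dict in, float boost out
    return 0.2 if doc.get("year", 0) >= 2024 else 0.0

rag = Agent(index="articles", boost_fn=boost_recent)   # backend defaults to InMemoryBackend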

Always-on filter

Pin every search to a subset of the index with filter — Meilisearch syntax, AND-joined with any per-call filter (intent, language, ...):

rag = init_agent("products", filter="brand = 'Bosch'")
# every BM25 + vector + hybrid search scoped to Bosch only

The legacy base_filter kwarg still works but emits a DeprecationWarning — migrate to filter at your convenience.
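
The rename is mechanical (a sketch, assuming the legacy kwarg is accepted wherever filter is):

# Legacy spelling: still works, emits DeprecationWarning
rag = init_agent("products", base_filter="brand = 'Bosch'")

# Preferred spelling
rag = init_agent("products", filter="brand = 'Bosch'")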

Category fields (alternative retrieve)

The alternative-retrieve fallback broadens the search by pivoting on category-like fields (product groups, taxonomy levels, sections, ...). By default, retrievalagent auto-detects them from the index schema via regex — matching names like category, product_group_l3, article_group_name, kategorie, family, section, ... — and prioritises deeper taxonomy levels (_l3 > _l2 > _l1).

Override explicitly when your schema uses unusual names:

rag = init_agent(
    "products",
    category_fields=["taxonomy_leaf", "taxonomy_parent", "department"],
)

Pass category_fields=[] to disable the fallback entirely.


API Reference

Method Returns Description
rag.invoke(query) RAGState Full RAG pipeline (sync)
rag.ainvoke(query) RAGState Full RAG pipeline (async)
rag.chat(query, history) RAGState Multi-turn chat (sync)
rag.achat(query, history) RAGState Multi-turn chat (async)
rag.retrieve_documents(query, top_k) (str, list[Document]) Retrieve only, no answer
rag.query(query) str Answer string directly
rag.invoke_agent(query) str Tool-calling agent mode (sync)
rag.ainvoke_agent(query) str Tool-calling agent mode (async)

RAGState fields: answer · documents · query · question · history · iterations · excluded_terms · synonyms · tier · trace
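
For example, invoke() hands back the full RAGState, while query() is the shortcut when only the answer string matters:

state = rag.invoke("What is hybrid search?")
print(state.answer, state.iterations, state.tier)

answer = rag.query("What is hybrid search?")   # plain answer string, no RAGState
print(answer)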


Environment Variables

Variable Description Default
AZURE_OPENAI_ENDPOINT Azure OpenAI endpoint
AZURE_OPENAI_API_KEY Azure OpenAI API key
AZURE_OPENAI_DEPLOYMENT Default deployment
AZURE_OPENAI_FAST_DEPLOYMENT Fast model deployment DEPLOYMENT
AZURE_OPENAI_GENERATION_DEPLOYMENT Generation deployment DEPLOYMENT
AZURE_OPENAI_API_VERSION API version 2024-12-01-preview
OPENAI_API_KEY OpenAI API key (fallback)
OPENAI_MODEL OpenAI model name gpt-5.4
AZURE_COHERE_ENDPOINT Azure Cohere endpoint
AZURE_COHERE_API_KEY Azure Cohere API key
COHERE_API_KEY Cohere API key (fallback)
JINA_API_KEY Jina reranker API key
MEILI_URL Meilisearch URL http://localhost:7700
MEILI_KEY Meilisearch API key masterKey
RAG_TOP_K Final result count 10
RAG_RERANK_TOP_N Reranker top-n 5
RAG_RETRIEVAL_FACTOR Over-retrieval multiplier 4
RAG_SEMANTIC_RATIO Hybrid semantic weight 0.5
RAG_FUSION Fusion strategy rrf
RAG_HYDE_MIN_WORDS Min words to trigger HyDE 8

Configuration

retrievalagent ships with sensible defaults. Adjust via RAGConfig:

from retrievalagent import AgenticRAG, RAGConfig

rag = AgenticRAG(
    index="my_index",
    backend=backend,
    config=RAGConfig.auto(),  # discovers pyproject.toml → retrievalagent.config.toml → env
)

Config discovery order

  1. Runtime kwarg — AgenticRAG(config=RAGConfig(top_k=15, ...))
  2. retrievalagent.config.toml — per-deployment local override (gitignored by default)
  3. [tool.retrievalagent] in pyproject.toml — shared team defaults
  4. RAG_* env vars — container / CI overrides
  5. Library defaults — fallback

Key parameters

Parameter Env var Default Description
top_k RAG_TOP_K 10 Final results returned
max_iter RAG_MAX_ITER 3 Max retrieve-rewrite cycles
semantic_ratio RAG_SEMANTIC_RATIO 0.5 BM25 ⇄ vector balance
n_synonym_variants RAG_N_SYNONYM_VARIANTS 3 Synonym swarm width
query_languages RAG_QUERY_LANGUAGES "en" Comma-separated language codes
custom_instructions RAG_CUSTOM_INSTRUCTIONS "" Domain rules appended to prompts
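
A sketch of setting these at runtime, assuming the table's parameter names are accepted as RAGConfig keyword arguments (as in the discovery-order example above):

from retrievalagent import AgenticRAG, RAGConfig

cfg = RAGConfig(
    top_k=15,
    max_iter=5,
    semantic_ratio=0.7,
    n_synonym_variants=2,
    query_languages="en,de",
    custom_instructions="Prefer the most recent documents when sources conflict.",
)
rag = AgenticRAG(index="my_index", backend=backend, config=cfg)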

Disabling optional stages

TOML has no null, so disabled fields go in a disable list:

[tool.retrievalagent]
top_k = 10
semantic_ratio = 0.5
disable = ["bm25_fallback_threshold", "hyde_min_words"]

RAGConfig.from_toml() / from_pyproject() translates disable = [...] into None on load.
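
A sketch of the load step, assuming from_pyproject() needs no arguments when pyproject.toml sits in the working directory:

from retrievalagent import RAGConfig

cfg = RAGConfig.from_pyproject()
print(cfg.hyde_min_words)   # None: listed in the disable list above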


CLI

pip install retrievalagent[recommended]

# Guided setup wizard — choose LLM, embedder, backend, reranker
retrievalagent

# Interactive chat — full agentic pipeline (TUI)
retrievalagent --chat -c my_index

# Single query → answer (stdout, scriptable)
retrievalagent --query "What is X?" -c my_index

# Pure retrieval — top-k documents, no LLM generation
retrievalagent --retrieve "What is X?" -c my_index

# Plain REPL (no TUI)
retrievalagent --plain -c my_index

# Skip wizard, use env vars
retrievalagent --skip-wizard -c my_index

The wizard guides you through:

  1. LLM provider — OpenAI, Anthropic, Ollama, or env default
  2. Embedding model — OpenAI, Azure OpenAI, Ollama, or none (BM25 only)
  3. Vector store — InMemory, Meilisearch, ChromaDB, Qdrant, pgvector, DuckDB, LanceDB, Azure AI Search
  4. Reranker — Cohere, Jina, HuggingFace, LLM-based, or none
  5. Mode — Chat (with answers) or Retriever (documents only)

License

MIT — Licence to code.
