Autonomous Agentic QA System for testing RAG pipelines and LLM systems.

🛡️ Agentic QA: Autonomous Multi-Agent Testing for RAG & LLMs

Agentic QA is a Python library that autonomously generates adversarial test cases, executes them against your RAG/LLM system, evaluates the results, and self-improves its testing coverage, all without human intervention.

Unlike evaluation frameworks such as RAGAS or TruLens, which score outputs against static, human-written inputs, Agentic QA acts as an active red team: it dynamically generates the edge cases most likely to break your system.

Python LangGraph LangSmith Streamlit


🚀 Quick Start

1. Installation

Clone the repository and install the library in editable mode:

git clone https://github.com/yourusername/multi-agent-qa.git
cd multi-agent-qa
pip install -e .

Then create a .env file and add your API keys:

cp .env.example .env
# Edit .env and provide OPENAI_API_KEY

2. Using the Python Library

You can test any RAG or LLM pipeline in just a few lines of code.

Option A: Testing a Python Function

If your RAG system is a Python function in your codebase:

import agentic_qa

# Your existing RAG or Chatbot function
def my_custom_rag(query: str) -> str:
    # Example: return my_langchain_pipeline.invoke(query)
    return "This is my AI response."

# Run the autonomous testing loop
report = agentic_qa.run_autonomous_test(
    target_function=my_custom_rag,
    system_name="YouTube Video Q&A",
    system_description="A chatbot that answers questions about YouTube transcripts.",
    domain="video content",
    max_iterations=3,          # How many times agents learn and retry
    tests_per_iteration=5      # Tests generated per round
)
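If your pipeline does not already have the `query: str -> str` shape shown above (for example, it returns a dict with sources), a thin wrapper can adapt it. The sketch below is illustrative only: `fake_pipeline`, `make_target`, and the `answer` key are hypothetical names, not part of Agentic QA.

```python
from typing import Any, Callable, Dict


def fake_pipeline(question: str, top_k: int = 3) -> Dict[str, Any]:
    """Hypothetical stand-in for a RAG pipeline that returns a structured result."""
    return {"answer": f"Echo: {question}", "sources": ["doc-1", "doc-2"][:top_k]}


def make_target(
    pipeline: Callable[..., Dict[str, Any]], **fixed_kwargs: Any
) -> Callable[[str], str]:
    """Adapt a pipeline to the query -> response string shape used in Option A."""

    def target(query: str) -> str:
        try:
            result = pipeline(query, **fixed_kwargs)
        except Exception as exc:
            # Return the crash as text so a test run can record it and continue.
            return f"[pipeline error] {type(exc).__name__}: {exc}"
        # Assumption: the pipeline stores its final text under the "answer" key.
        return str(result.get("answer", ""))

    return target


my_custom_rag = make_target(fake_pipeline, top_k=1)
print(my_custom_rag("What is the video about?"))
```

The resulting `my_custom_rag` can then be passed as `target_function`. Returning errors as strings is a design choice for this sketch; you may prefer to let exceptions propagate.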

Option B: Testing an API Endpoint

If your system is deployed behind a REST API (FastAPI, Flask, LangServe):

import agentic_qa

report = agentic_qa.run_autonomous_test(
    api_endpoint="http://localhost:8000/api/chat",
    system_name="Customer Support Bot",
    system_description="An AI that resolves customer support tickets.",
    domain="customer support"
)

3. Using the Streamlit UI

If you prefer a visual dashboard for monitoring the agents in real time, run the included Streamlit app:

streamlit run app.py

From the UI, you can connect your API endpoint or use the built-in mock system for a demonstration.


🏗️ Architecture

The framework is powered by 5 autonomous agents built with LangGraph:

START ──▶ 🔴 Red-Team Agent ──▶ ⚡ Executor Agent ──▶ ⚖️ Judge Agent ──▶ Decision
              ▲                                                            │
              │                                                      ┌─────┴─────┐
              │                                                      ▼           ▼
              └──────────────── 🔧 Refiner Agent               📊 Reporter Agent
                                   (loop back)                       (END)

Agent Roles

| Agent | Role |
|---|---|
| 🔴 Red-Team | Generates adversarial test inputs targeting edge cases (prompt injections, boundary values, etc.). |
| ⚡ Executor | Runs tests through the target system and captures the outputs. |
| ⚖️ Judge | Evaluates the outputs using an LLM-as-a-Judge pattern with strict pass/fail criteria. |
| 🔧 Refiner | Analyzes the judge's failure patterns and instructs the Red-Team on how to exploit weaknesses in the next iteration. |
| 📊 Reporter | Compiles a comprehensive final Markdown QA report. |
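The control flow between these agents can be sketched in plain Python. This is not the library's implementation (which uses LangGraph and LLM calls): every function below is a hypothetical stub, and the decision rule (stop once `max_iterations` is reached) is an assumption based on the parameter names in the Quick Start.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QAState:
    """Shared state passed between the stubbed agents."""
    iteration: int = 0
    hints: List[str] = field(default_factory=list)
    results: List[dict] = field(default_factory=list)


def red_team(state: QAState, n: int) -> List[str]:
    # Stub: a real agent would ask an LLM for adversarial inputs, guided by hints.
    focus = state.hints[-1] if state.hints else "general"
    return [f"[{focus}] probe {state.iteration}-{i}" for i in range(n)]


def executor(target: Callable[[str], str], tests: List[str]) -> List[dict]:
    # Run every test input through the system under test.
    return [{"input": t, "output": target(t)} for t in tests]


def judge(runs: List[dict]) -> List[dict]:
    # Stub rule: an empty answer fails. The real judge is an LLM with strict criteria.
    return [{**r, "passed": bool(r["output"].strip())} for r in runs]


def refiner(verdicts: List[dict]) -> str:
    # Turn the failures of the last round into a hint for the next one.
    failed = sum(1 for v in verdicts if not v["passed"])
    return f"{failed} failures last round" if failed else "no failures, broaden coverage"


def reporter(state: QAState) -> str:
    failed = sum(1 for r in state.results if not r["passed"])
    return (
        "# QA Report\n\n"
        f"- Iterations: {state.iteration}\n"
        f"- Tests: {len(state.results)}\n"
        f"- Failures: {failed}\n"
    )


def run_loop(target: Callable[[str], str], max_iterations: int = 3,
             tests_per_iteration: int = 5) -> str:
    state = QAState()
    while True:
        verdicts = judge(executor(target, red_team(state, tests_per_iteration)))
        state.results.extend(verdicts)
        state.iteration += 1
        if state.iteration >= max_iterations:  # Decision: stop and report
            return reporter(state)
        state.hints.append(refiner(verdicts))  # Decision: loop back via the Refiner


print(run_loop(lambda q: "This is my AI response."))
```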

🧠 What Makes This Novel

| Traditional Testing Tools (RAGAS, TruLens) | Agentic QA |
|---|---|
| Measures outputs against static inputs | Generates the adversarial inputs autonomously |
| Human writes test cases | AI agents write and refine test cases |
| One-shot evaluation | Self-improving loop with pattern learning |
| Relies heavily on reference data | Relies on behavioral boundaries and edge-case testing |

📂 Project Structure

multi-agent-qa/
├── agentic_qa/
│   ├── __init__.py           # Clean developer API (run_autonomous_test)
│   ├── agents/               # 5 LangGraph agent definitions
│   ├── graph/                # State definitions and LangGraph flow
│   ├── schemas/              # Pydantic validation models
│   ├── sut/                  # Adapters (API, Callable, Base)
│   └── utils/                # Prompt templates
├── setup.py                  # Package configuration
├── app.py                    # Streamlit Dashboard UI
├── .env                      # API Keys configuration
└── README.md
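The `sut/` adapters let the same agents drive either a Python callable or an HTTP endpoint. The sketch below shows one way such an adapter layer could look. It is a guess at the pattern, not the package's actual code: the class names, the `query` method, and the JSON fields `query` and `response` are all assumptions.

```python
import json
import urllib.request
from abc import ABC, abstractmethod
from typing import Callable


class BaseAdapter(ABC):
    """Hypothetical common interface for a system under test (SUT)."""

    @abstractmethod
    def query(self, text: str) -> str:
        """Send one test input and return the system's answer as text."""


class CallableAdapter(BaseAdapter):
    """Wraps an in-process Python function."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn

    def query(self, text: str) -> str:
        return str(self._fn(text))


class ApiAdapter(BaseAdapter):
    """Wraps a REST endpoint. The request/response schema here is assumed."""

    def __init__(self, endpoint: str, timeout: float = 30.0) -> None:
        self._endpoint = endpoint
        self._timeout = timeout

    def query(self, text: str) -> str:
        body = json.dumps({"query": text}).encode("utf-8")
        request = urllib.request.Request(
            self._endpoint,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self._timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
        return str(payload.get("response", ""))


adapter = CallableAdapter(lambda q: q.upper())
print(adapter.query("hello"))
```

With an interface like this, the Executor only needs to call `query` and does not care where the system runs.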

📡 LangSmith Monitoring

All agent interactions are traced automatically via LangSmith when the following variables are set in .env:

LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-api-key
LANGCHAIN_PROJECT=agentic-qa
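Before a long run, it can help to confirm that these variables are actually visible to the process. The helper below is a small sketch and not part of Agentic QA: `tracing_problems` is a hypothetical name, and it only inspects the environment mapping it is given, so values in .env must already have been loaded into the environment.

```python
import os
from typing import List, Mapping


def tracing_problems(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable problems with the tracing configuration."""
    problems = []
    if env.get("LANGCHAIN_TRACING_V2", "").strip().lower() != "true":
        problems.append("LANGCHAIN_TRACING_V2 is not set to 'true'")
    if not env.get("LANGCHAIN_API_KEY", "").strip():
        problems.append("LANGCHAIN_API_KEY is missing or empty")
    if not env.get("LANGCHAIN_PROJECT", "").strip():
        # Treated as a warning only: traces may land in a default project.
        problems.append("LANGCHAIN_PROJECT is not set")
    return problems


for problem in tracing_problems():
    print(f"tracing config: {problem}")
```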

📄 License

MIT
