LLM Quality Gate - A provider-agnostic evaluation framework for LLM applications

LLMQ

Regression Testing & Quality Gates for LLM Applications

Open-source framework to catch silent LLM failures before they reach production.

License: MIT · Python 3.8+ · PRs Welcome


The Problem

LLM applications fail silently. There's no stack trace when your summarizer starts hallucinating. No exception when your classifier drifts. A prompt change that works in development can quietly degrade production. Model updates break existing functionality overnight.

Without systematic testing, these regressions go undetected until users complain.

Common regressions LLMQ catches:

  • Prompt optimization improves one task but degrades another
  • Model updates change response formats, breaking downstream parsing
  • Provider API changes affect response quality
  • Temperature adjustments reduce output consistency
  • Context length changes truncate important information

Quick Start

Get running in under 5 minutes:

# Install
pip install llmq-gate

# Initialize project
llmq init

# Set your API key
echo "GROQ_API_KEY=your_key_here" >> .env

# Run your first evaluation
llmq eval --provider groq

View results in the browser:

llmq dashboard
# → http://localhost:8000

How It Works

Dataset → LLM Provider → Metrics Engine → Quality Gates → Pass / Fail

  1. Define test cases in evals/dataset.json with inputs, expected outputs, and context
  2. Run evaluations against any supported provider
  3. Metrics are computed automatically: task success, relevance, hallucination, and consistency
  4. Quality gates pass or fail based on your configured thresholds
  5. Results are stored for historical tracking and comparison
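
The flow above can be pictured with a small, self-contained sketch. This is not LLMQ's implementation: `fake_provider` and `run_pipeline` are hypothetical names, the provider is a hard-coded stub instead of a real API call, and task success is reduced to a case-insensitive exact match.

```python
from typing import Callable, Dict, List


def fake_provider(prompt: str) -> str:
    """Hypothetical stand-in for a real provider call (Groq, OpenAI, ...)."""
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


def run_pipeline(test_cases: List[Dict[str, str]],
                 provider: Callable[[str], str],
                 threshold: float) -> Dict[str, object]:
    """Dataset -> provider -> metric -> gate, reduced to a single metric."""
    if not test_cases:
        raise ValueError("dataset is empty")
    hits = 0
    for case in test_cases:
        response = provider(case["input"])
        # Simplified task-success check: case-insensitive exact match.
        if response.strip().lower() == case["expected_output"].strip().lower():
            hits += 1
    score = hits / len(test_cases)
    # The gate passes only if the score reaches the configured threshold.
    return {"task_success": score, "passed": score >= threshold}


cases = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Peru?", "expected_output": "Lima"},
]
result = run_pipeline(cases, fake_provider, threshold=0.8)
print(result)  # {'task_success': 0.5, 'passed': False}
```

The stub answers only one of the two questions, so the score is 0.5 and the gate fails against a 0.8 threshold.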

Supported Providers

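| Provider    | Models                  | API Key      | Cost      |
|-------------|-------------------------|--------------|-----------|
| Groq        | Llama 3.1, Mixtral      | Required     | Free tier |
| OpenAI      | GPT-3.5, GPT-4          | Required     | Paid      |
| Claude      | Claude 3 Haiku / Sonnet | Required     | Paid      |
| Gemini      | Gemini 1.5 Flash / Pro  | Required     | Free tier |
| HuggingFace | Open models             | Required     | Free      |
| OpenRouter  | 100+ models             | Required     | Varies    |
| Ollama      | Local models            | Not required | Free      |
| LocalAI     | Local models            | Not required | Free      |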

CI/CD Integration

Add quality gates to your pull request workflow. Builds fail automatically when LLM performance drops below your thresholds.

# .github/workflows/llm-quality-gate.yml
name: LLM Quality Gate

on:
  pull_request:
    paths: ['prompts/**', 'llm/**', 'llmq.yaml']

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install LLMQ
        run: pip install llmq-gate

      - name: Run Quality Gate
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: llmq eval --provider groq --fail-on-gate

Metrics

| Metric        | Method                            | Description                              |
|---------------|-----------------------------------|------------------------------------------|
| Task Success  | Exact match + semantic similarity | Did the model produce the correct answer? |
| Relevance     | Embedding-based cosine similarity | Is the response relevant to the input?   |
| Hallucination | LLM-as-judge detection            | Did the model fabricate information?     |
| Consistency   | Multi-run variance analysis       | Are responses stable across runs?        |
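
As an illustration of two of the methods named above, here is a minimal sketch of cosine similarity and multi-run variance in plain Python. The vectors and per-run scores are made-up toy values. How LLMQ produces embeddings, or turns variance into a consistency score, is not described here, so treat the function names as hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vector -> 0
    return dot / (norm_a * norm_b)


def run_variance(scores: Sequence[float]) -> float:
    """Population variance of one test case's scores across repeated runs."""
    return pvariance(scores)


# Toy "embeddings": identical directions score 1.0, orthogonal ones 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0

# Identical scores across runs give zero variance (fully stable output).
print(run_variance([0.8, 0.8, 0.8]))  # 0.0
```

A lower variance across runs means more stable responses, which is the property the Consistency metric is described as measuring.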

Dashboard

llmq dashboard

The interactive web dashboard provides historical performance tracking, provider comparison charts, quality gate pass/fail trends, and test case drill-down analysis.

v0.1.1 Highlights

  • Unified configuration filename: llmq.yaml everywhere.
  • Config auto-discovery from current directory upward (similar to pyproject.toml lookup).
  • llmq eval now supports standalone mode: it falls back to the local engine if the dashboard API is unavailable.
  • CLI exit codes are standardized:
    • 0: quality gate passed
    • 1: quality gate failed or runtime error
    • 2: configuration error (e.g., missing llmq.yaml)
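
Upward config discovery of the kind mentioned above is usually a short loop over parent directories. The sketch below shows the general technique, not LLMQ's actual lookup code; `find_config` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "llmq.yaml") -> Optional[Path]:
    """Search `start` and each parent directory for `filename`.

    Returns the first match, or None if the filesystem root is reached.
    """
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


# Where would a config be picked up from the current directory?
print(find_config(Path.cwd()))
```

Because the nearest match wins, a llmq.yaml in a subproject would shadow one at the repository root under this scheme.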

Configuration

llmq.yaml — project-level settings:

llm:
  default_provider: "groq"
  temperature: 0.0
  max_tokens: 1000

providers:
  groq:
    api_key_env: "GROQ_API_KEY"
    model: "llama-3.1-8b-instant"
  openai:
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-3.5-turbo"

quality_gates:
  task_success_threshold: 0.8
  relevance_threshold: 0.7
  hallucination_threshold: 0.1
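
The three thresholds above do not all point the same way: 0.8 and 0.7 read naturally as minimum scores, while 0.1 for hallucination reads as a maximum rate. The sketch below encodes that interpretation as an explicit assumption. It is not taken from LLMQ's source, and `check_gates` is a hypothetical name.

```python
from typing import Dict, List

# Assumption: hallucination is a rate to keep low; other metrics are minimums.
LOWER_IS_BETTER = {"hallucination"}
SUFFIX = "_threshold"


def check_gates(scores: Dict[str, float], gates: Dict[str, float]) -> List[str]:
    """Return the names of the metrics that violate their threshold."""
    failures = []
    for key, threshold in gates.items():
        # "task_success_threshold" -> "task_success"
        metric = key[: -len(SUFFIX)] if key.endswith(SUFFIX) else key
        score = scores[metric]
        if metric in LOWER_IS_BETTER:
            ok = score <= threshold
        else:
            ok = score >= threshold
        if not ok:
            failures.append(metric)
    return failures


gates = {
    "task_success_threshold": 0.8,
    "relevance_threshold": 0.7,
    "hallucination_threshold": 0.1,
}
scores = {"task_success": 0.85, "relevance": 0.72, "hallucination": 0.25}
print(check_gates(scores, gates))  # ['hallucination']
```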

evals/dataset.json — test cases:

{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
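
Before spending API calls on a run, it can help to sanity-check the dataset file. The following is a minimal validation sketch. Which fields LLMQ actually requires is not stated here, so `REQUIRED_FIELDS` is an assumption, and `load_test_cases` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

# Assumed minimum set of fields; adjust to match your dataset.
REQUIRED_FIELDS = ("id", "input", "expected_output")


def load_test_cases(raw: str) -> List[Dict[str, Any]]:
    """Parse dataset JSON and check each case for required fields and unique ids."""
    data = json.loads(raw)
    cases = data.get("test_cases") if isinstance(data, dict) else None
    if not isinstance(cases, list):
        raise ValueError('expected a top-level "test_cases" list')
    seen = set()
    for index, case in enumerate(cases):
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"test case {index} is missing {missing}")
        if case["id"] in seen:
            raise ValueError(f"duplicate id: {case['id']}")
        seen.add(case["id"])
    return cases


sample = """
{"test_cases": [{"id": "example_1",
                 "task_type": "question_answering",
                 "input": "What is the capital of France?",
                 "expected_output": "Paris"}]}
"""
print(len(load_test_cases(sample)))  # 1
```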

CLI Reference

# Setup
llmq init                                    # Initialize new project
llmq doctor                                  # Check system health

# Evaluation
llmq eval --provider groq                    # Run evaluation
llmq eval --provider openai --fail-on-gate   # CI mode (exit 1 on gate failure)
llmq compare                                 # Compare providers side-by-side

# Management
llmq providers                               # List provider status
llmq runs --limit 10                         # View recent runs
llmq dashboard                               # Start web dashboard
llmq settings --set '{"quality_gates": {"task_success_threshold": 0.9}}'

Migration Guide (<=0.1.0 -> 0.1.1)

  1. Rename existing config.yaml to llmq.yaml.
  2. Update scripts to use --config-path (or continue using --config) when you need an explicit location.
  3. Remove any hard dependency on llmq dashboard for CLI evaluations; llmq eval now runs standalone if the API is unavailable.
  4. If you parse CLI statuses in CI, adopt the documented exit codes (0/1/2).
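
For step 4, a wrapper script can map the documented exit codes to readable messages. This is a sketch under the assumption that the codes are exactly those listed in the v0.1.1 highlights; `describe_exit` and `run_gate` are hypothetical helpers, not part of the CLI.

```python
import subprocess
from typing import List

# Meanings as documented for v0.1.1.
EXIT_MEANINGS = {
    0: "quality gate passed",
    1: "quality gate failed or runtime error",
    2: "configuration error (e.g., missing llmq.yaml)",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(command: List[str]) -> int:
    """Run a command, report what its exit code means, and return the code."""
    completed = subprocess.run(command)
    print(describe_exit(completed.returncode))
    return completed.returncode
```

In a CI script, calling `run_gate(["llmq", "eval", "--provider", "groq", "--fail-on-gate"])` and passing the result to `sys.exit` would propagate the code to the build system.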

API

# Start an evaluation
curl -X POST http://localhost:8000/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"provider": "groq"}'

# Get run history
curl http://localhost:8000/api/v1/runs

# Compare providers
curl http://localhost:8000/api/v1/compare
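
The same endpoints can be called from Python using only the standard library. This sketch assumes the dashboard is running on localhost:8000 and that responses are JSON, which is not confirmed above; `build_request` and `call` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000/api/v1"


def build_request(path: str,
                  payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[Dict[str, Any]] = None,
         timeout: float = 30.0) -> Any:
    """Send the request and decode the body (assumed to be JSON)."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the dashboard running, `call("evaluate", {"provider": "groq"})` mirrors the first curl command and `call("runs")` mirrors the second.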

Contributing

Contributions are welcome — whether it's a bug fix, new provider integration, docs improvement, or feature request.

git clone https://github.com/Emart29/llm-quality-gate.git
cd llm-quality-gate
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -e .
llmq doctor               # Verify setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run tests: python -m pytest tests/ -v
  5. Submit a pull request

Roadmap

v1.1 — Custom metric plugins · Slack/Discord webhooks · A/B testing framework · Performance benchmarking

v1.2 — Multi-language datasets · Advanced regression analysis · Cost tracking per provider · Distributed evaluation

v2.0 — Visual prompt debugging · Automated prompt optimization · Enterprise SSO · Advanced analytics

License

MIT — see LICENSE for details.

