LLM Quality Gate - A provider-agnostic evaluation framework for LLM applications

LLMQ

An open-source LLM regression testing & CI quality gate framework.

Prevent prompt and model regressions before they reach production with automated testing across 8 LLM providers.

Why LLMQ?

LLM applications fail silently. A prompt change that works in development can degrade performance in production. Model updates can break existing functionality. Without systematic testing, these regressions go undetected until users complain.

Common LLM Regression Examples:

  • Prompt optimization improves one task but breaks another
  • Model updates change response format, breaking downstream parsing
  • Provider API changes affect response quality
  • Temperature adjustments reduce consistency
  • Context length changes truncate important information

LLMQ catches these issues before deployment with automated regression testing and quality gates.

Quick Start

Get running in 5 minutes:

# 1. Install (run from a clone of the repository)
pip install -e .

# 2. Initialize project
llmq init

# 3. Set API key (copy .env.example to .env)
echo "GROQ_API_KEY=your_key_here" >> .env

# 4. Run evaluation
llmq eval --provider groq

View results at http://localhost:8000 after running llmq dashboard.

CI Integration

Add to .github/workflows/llm-quality-gate.yml:

name: LLM Quality Gate

on:
  pull_request:
    paths: ['prompts/**', 'llm/**', 'llmq.yaml']

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install LLMQ
        run: pip install -e .
      
      - name: Run Quality Gate
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          llmq eval --provider groq --fail-on-gate

Architecture

Flow: Dataset → Provider → Metrics → Quality Gates → Results
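The flow above can be sketched in a few lines of Python. This is an illustration only, not LLMQ's internals: `fake_provider`, `exact_match`, `run_pipeline`, and the result layout are hypothetical names, and the provider is a stub so the sketch runs without an API key.

```python
from typing import Callable, Dict, List


def fake_provider(prompt: str) -> str:
    """Hypothetical stand-in for a real provider call; returns canned answers."""
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


def exact_match(output: str, expected: str) -> float:
    """Simplest possible metric: case-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_pipeline(
    dataset: List[Dict[str, str]],
    provider: Callable[[str], str],
    threshold: float,
) -> Dict[str, object]:
    # Dataset -> Provider -> Metrics: score one output per test case.
    scores = [
        exact_match(provider(case["input"]), case["expected_output"])
        for case in dataset
    ]
    task_success = sum(scores) / len(scores) if scores else 0.0
    # Metrics -> Quality Gates -> Results: compare the aggregate to the gate.
    return {"task_success": task_success, "gate_passed": task_success >= threshold}


dataset = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Peru?", "expected_output": "Lima"},
]
result = run_pipeline(dataset, fake_provider, threshold=0.8)
print(result)
```

The stub answers only one of the two questions, so the aggregate score falls below the threshold and the gate fails.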

Supported Providers

| Provider | Models | API Key Required | Cost |
|---|---|---|---|
| Groq | Llama 3.1, Mixtral | | Free tier |
| OpenAI | GPT-3.5, GPT-4 | | Paid |
| Claude | Claude 3 Haiku/Sonnet | | Paid |
| Gemini | Gemini 1.5 Flash/Pro | | Free tier |
| HuggingFace | Open models | | Free |
| OpenRouter | 100+ models | | Varies |
| Ollama | Local models | | Free |
| LocalAI | Local models | | Free |

CLI Commands

# Project setup
llmq init                           # Initialize new project
llmq doctor                         # Check system health

# Evaluation
llmq eval --provider groq           # Run evaluation
llmq eval --provider openai --fail-on-gate  # CI mode
llmq compare                        # Compare providers

# Management
llmq providers                      # List provider status
llmq runs --limit 10                # View recent runs
llmq dashboard                      # Start web interface
llmq settings --set '{"quality_gates": {"task_success_threshold": 0.9}}'
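The `--set` argument is a JSON string, which is easy to misquote by hand. As a small sketch, the same command line can be built from a Python dict with the standard library; the `settings` dict mirrors the example above, and nothing here is part of LLMQ itself.

```python
import json
import shlex

# Settings payload from the example above.
settings = {"quality_gates": {"task_success_threshold": 0.9}}

# Serialize to JSON, then quote each argument for a POSIX shell.
argv = ["llmq", "settings", "--set", json.dumps(settings)]
command = shlex.join(argv)
print(command)
```

When launching the command from Python, passing `argv` as a list to `subprocess.run` avoids shell quoting altogether.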

Dashboard

🎬 Interactive Demo — See the full CLI + Dashboard walkthrough.

Features:

  • Historical performance tracking
  • Provider comparison charts
  • Quality gate pass/fail trends
  • Test case drill-down analysis

Configuration

llmq.yaml:

llm:
  default_provider: "groq"
  temperature: 0.0
  max_tokens: 1000

providers:
  groq:
    api_key_env: "GROQ_API_KEY"
    model: "llama-3.1-8b-instant"
  openai:
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-3.5-turbo"

quality_gates:
  task_success_threshold: 0.8
  relevance_threshold: 0.7
  hallucination_threshold: 0.1
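
One way to read the `quality_gates` thresholds is sketched below. The direction of each comparison is an assumption, since it is not stated here: task success and relevance are treated as minimum scores, and hallucination as a maximum rate. `check_gates` and the metric names are hypothetical.

```python
from typing import Dict, List

# Thresholds copied from the quality_gates section of llmq.yaml.
THRESHOLDS = {
    "task_success_threshold": 0.8,
    "relevance_threshold": 0.7,
    "hallucination_threshold": 0.1,
}


def check_gates(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return human-readable gate failures; an empty list means all gates pass."""
    failures = []
    # Assumption: higher is better for task success and relevance.
    for name in ("task_success", "relevance"):
        limit = thresholds[f"{name}_threshold"]
        if metrics[name] < limit:
            failures.append(f"{name} {metrics[name]:.2f} < {limit:.2f}")
    # Assumption: lower is better for hallucination.
    limit = thresholds["hallucination_threshold"]
    if metrics["hallucination"] > limit:
        failures.append(f"hallucination {metrics['hallucination']:.2f} > {limit:.2f}")
    return failures


failures = check_gates(
    {"task_success": 0.85, "relevance": 0.65, "hallucination": 0.05},
    THRESHOLDS,
)
print(failures)
```

A CI wrapper built on this idea would exit non-zero whenever the list is non-empty, similar in spirit to `--fail-on-gate`.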

evals/dataset.json:

{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
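
Before spending API calls, it can help to check a dataset file for obvious mistakes. The sketch below validates the structure shown above; treating `id`, `input`, and `expected_output` as required is an assumption, and `validate_dataset` is a hypothetical helper, not an LLMQ function.

```python
import json
from typing import List

# Assumption: these fields are mandatory for every test case.
REQUIRED_FIELDS = ("id", "input", "expected_output")


def validate_dataset(raw: str) -> List[str]:
    """Return a list of problems found in a dataset JSON string."""
    data = json.loads(raw)
    problems = []
    seen_ids = set()
    for index, case in enumerate(data.get("test_cases", [])):
        # Flag fields that are absent or empty.
        for field in REQUIRED_FIELDS:
            if not case.get(field):
                problems.append(f"case {index}: missing {field}")
        # Flag ids that appear more than once.
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id}")
            seen_ids.add(case_id)
    return problems


sample = json.dumps({
    "test_cases": [
        {"id": "example_1", "input": "What is the capital of France?",
         "expected_output": "Paris"},
        {"id": "example_1", "input": "What is the capital of Peru?"},
    ]
})
problems = validate_dataset(sample)
print(problems)
```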

Metrics

  • Task Success: Exact match + semantic similarity
  • Relevance: Embedding-based cosine similarity
  • Hallucination: LLM-as-judge detection
  • Consistency: Multi-run variance analysis
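
The sketch below illustrates three of these ideas with the standard library only. It is not LLMQ's implementation: word-count vectors stand in for real embeddings, population variance stands in for whatever variance analysis the framework uses, and all function names are hypothetical. Hallucination detection is omitted because it needs a judge model.

```python
import math
import statistics
from collections import Counter
from typing import List


def exact_match(output: str, expected: str) -> float:
    """1.0 when the strings match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word-count vectors (a stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def consistency_variance(scores: List[float]) -> float:
    """Population variance of per-run scores; 0.0 means identical runs."""
    return statistics.pvariance(scores)


print(exact_match("paris", "Paris"))
print(cosine_similarity("Paris is the capital", "the capital is Paris"))
print(consistency_variance([1.0, 0.0]))
```

Word counts ignore word order and synonyms, so this similarity is far cruder than an embedding model; it only shows the shape of the calculation.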

API

# Start evaluation
curl -X POST http://localhost:8000/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"provider": "groq"}'

# Get results
curl http://localhost:8000/api/v1/runs

# Provider comparison
curl http://localhost:8000/api/v1/compare
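
The same calls can be made from Python without extra dependencies. The sketch below only builds the requests, so it runs without a server. The endpoints and payload come from the curl examples above, the helper names are hypothetical, and the response format is not documented here, so no parsing is shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_evaluate_request(provider: str) -> urllib.request.Request:
    """Build the POST request that starts an evaluation."""
    body = json.dumps({"provider": provider}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_runs_request() -> urllib.request.Request:
    """Build the GET request that lists runs."""
    return urllib.request.Request(f"{BASE_URL}/runs", method="GET")


req = build_evaluate_request("groq")
print(req.get_method(), req.full_url)

# To send a request, the server must be running:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.read().decode("utf-8"))
```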

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run tests: python -m pytest tests/ -v
  5. Submit a pull request

Development setup:

git clone https://github.com/Emart29/llm-quality-gate.git
cd llm-quality-gate
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -e .
llmq doctor  # Verify setup

Roadmap

v1.1

  • Custom metric plugins
  • Slack/Discord webhooks
  • A/B testing framework
  • Performance benchmarking

v1.2

  • Multi-language datasets
  • Advanced regression analysis
  • Cost tracking per provider
  • Distributed evaluation

v2.0

  • Visual prompt debugging
  • Automated prompt optimization
  • Enterprise SSO integration
  • Advanced analytics

License: MIT | Python: 3.8+ | Status: Production Ready
