LLM Quality Gate - A provider-agnostic evaluation framework for LLM applications
LLMQ
Regression Testing & Quality Gates for LLM Applications
Open-source framework to catch silent LLM failures before they reach production.
The Problem
LLM applications fail silently. There's no stack trace when your summarizer starts hallucinating. No exception when your classifier drifts. A prompt change that works in development can quietly degrade production. Model updates break existing functionality overnight.
Without systematic testing, these regressions go undetected until users complain.
Common regressions LLMQ catches:
- Prompt optimization improves one task but degrades another
- Model updates change response formats, breaking downstream parsing
- Provider API changes affect response quality
- Temperature adjustments reduce output consistency
- Context length changes truncate important information
Quick Start
Get running in under 5 minutes:
# Install
pip install llmq-gate
# Initialize project
llmq init
# Set your API key
echo "GROQ_API_KEY=your_key_here" >> .env
# Run your first evaluation
llmq eval --provider groq
View results in the browser:
llmq dashboard
# → http://localhost:8000
How It Works
Dataset → LLM Provider → Metrics Engine → Quality Gates → Pass / Fail
- Define test cases in evals/dataset.json with inputs, expected outputs, and context
- Run evaluations against any supported provider
- Metrics are computed automatically: task success, relevance, hallucination, and consistency
- Quality gates pass or fail based on your configured thresholds
- Results are stored for historical tracking and comparison
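The flow above can be pictured with a small, self-contained sketch. This is not LLMQ's internal code: `EvalCase`, `exact_match`, `run_pipeline`, and `fake_provider` are hypothetical names, the provider is a stub instead of a real API call, and only an exact-match task-success score is computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """Hypothetical stand-in for one entry in evals/dataset.json."""
    id: str
    input: str
    expected_output: str


def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 when the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_pipeline(cases: List[EvalCase], provider: Callable[[str], str], threshold: float) -> dict:
    """Dataset -> provider -> metric -> gate, reduced to a single metric."""
    scores = [exact_match(case.expected_output, provider(case.input)) for case in cases]
    task_success = sum(scores) / len(scores) if scores else 0.0
    return {"task_success": task_success, "passed": task_success >= threshold}


def fake_provider(prompt: str) -> str:
    """Stub provider (assumption): a real run would call an LLM API here."""
    return "Paris" if "France" in prompt else "unknown"


cases = [
    EvalCase("example_1", "What is the capital of France?", "Paris"),
    EvalCase("example_2", "What is the capital of Peru?", "Lima"),
]
result = run_pipeline(cases, fake_provider, threshold=0.8)
print(result)
```

With the stub provider answering one of two cases correctly, the task-success score is 0.5, which is below the 0.8 threshold, so the gate fails.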
Supported Providers
| Provider | Models | API Key | Cost |
|---|---|---|---|
| Groq | Llama 3.1, Mixtral | Required | Free tier |
| OpenAI | GPT-3.5, GPT-4 | Required | Paid |
| Claude | Claude 3 Haiku / Sonnet | Required | Paid |
| Gemini | Gemini 1.5 Flash / Pro | Required | Free tier |
| HuggingFace | Open models | Required | Free |
| OpenRouter | 100+ models | Required | Varies |
| Ollama | Local models | — | Free |
| LocalAI | Local models | — | Free |
CI/CD Integration
Add quality gates to your pull request workflow. Builds fail automatically when LLM performance drops below your thresholds.
# .github/workflows/llm-quality-gate.yml
name: LLM Quality Gate
on:
  pull_request:
    paths: ['prompts/**', 'llm/**', 'llmq.yaml']
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install LLMQ
        run: pip install llmq-gate
      - name: Run Quality Gate
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: llmq eval --provider groq --fail-on-gate
Metrics
| Metric | Method | Description |
|---|---|---|
| Task Success | Exact match + semantic similarity | Did the model produce the correct answer? |
| Relevance | Embedding-based cosine similarity | Is the response relevant to the input? |
| Hallucination | LLM-as-judge detection | Did the model fabricate information? |
| Consistency | Multi-run variance analysis | Are responses stable across runs? |
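To make two of the methods in the table concrete, here is a rough sketch of cosine similarity (the basis of the Relevance metric) and a variance check across runs (the idea behind Consistency). The function names and the toy vectors are assumptions for illustration. LLMQ's actual embedding model, score normalization, and variance formula are not documented here.

```python
import math
from statistics import pvariance


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_variance(scores):
    """Population variance of per-run scores; 0.0 means every run scored the same."""
    return pvariance(scores)


# Toy 3-dimensional "embeddings" (assumption): real ones come from an embedding model.
question_vec = [1.0, 0.0, 1.0]
answer_vec = [1.0, 0.0, 1.0]
off_topic_vec = [0.0, 1.0, 0.0]

relevance_on_topic = cosine_similarity(question_vec, answer_vec)
relevance_off_topic = cosine_similarity(question_vec, off_topic_vec)
stable = run_variance([0.9, 0.9, 0.9])
unstable = run_variance([1.0, 0.0])
```

A response whose embedding points the same way as the input scores close to 1.0, an orthogonal one scores 0.0, and a higher variance across repeated runs signals less stable output.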
Dashboard
llmq dashboard
The interactive web dashboard provides historical performance tracking, provider comparison charts, quality gate pass/fail trends, and test case drill-down analysis.
v0.1.1 Highlights
- Unified configuration filename: llmq.yaml everywhere.
- Config auto-discovery from current directory upward (similar to pyproject.toml lookup).
- llmq eval now supports standalone mode by falling back to local engine if dashboard API is unavailable.
- CLI exit codes are standardized:
  - 0: quality gate passed
  - 1: quality gate failed or runtime error
  - 2: configuration error (e.g., missing llmq.yaml)
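The upward config lookup mentioned above generally works like the following sketch. `find_config` is a hypothetical helper written for illustration, not a function exported by LLMQ, and the real lookup may differ in details such as where it stops searching.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, filename: str = "llmq.yaml") -> Optional[Path]:
    """Search `start` and then each parent directory for `filename`.

    Returns the first match, or None when the filesystem root is reached.
    """
    current = start.resolve()
    for directory in (current, *current.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


# Example: look for llmq.yaml starting from the current working directory.
config_path = find_config(Path.cwd())
print(config_path if config_path else "no llmq.yaml found")
```

A lookup like this lets you run the CLI from any subdirectory of a project and still pick up the project-level file.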
Configuration
llmq.yaml — project-level settings:
llm:
  default_provider: "groq"
  temperature: 0.0
  max_tokens: 1000
providers:
  groq:
    api_key_env: "GROQ_API_KEY"
    model: "llama-3.1-8b-instant"
  openai:
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-3.5-turbo"
quality_gates:
  task_success_threshold: 0.8
  relevance_threshold: 0.7
  hallucination_threshold: 0.1
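As a rough illustration of how thresholds like these could be applied, the sketch below compares a set of metric scores against the `quality_gates` values. It assumes that task success and relevance are minimums and that the hallucination threshold is a maximum, which matches the low 0.1 value but is not confirmed by the documentation. `check_gates` and the sample scores are hypothetical.

```python
def check_gates(metrics: dict, gates: dict) -> dict:
    """Return a pass/fail flag per gate.

    Assumption: task_success and relevance must be AT LEAST their thresholds,
    while hallucination must be AT MOST its threshold.
    """
    return {
        "task_success": metrics["task_success"] >= gates["task_success_threshold"],
        "relevance": metrics["relevance"] >= gates["relevance_threshold"],
        "hallucination": metrics["hallucination"] <= gates["hallucination_threshold"],
    }


# Threshold values copied from the quality_gates section of llmq.yaml.
gates = {
    "task_success_threshold": 0.8,
    "relevance_threshold": 0.7,
    "hallucination_threshold": 0.1,
}

# Made-up scores for illustration only.
metrics = {"task_success": 0.85, "relevance": 0.75, "hallucination": 0.15}

results = check_gates(metrics, gates)
all_passed = all(results.values())
print(results, all_passed)
```

Here the first two gates pass, but a hallucination score of 0.15 exceeds the 0.1 limit, so the overall gate fails.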
evals/dataset.json — test cases:
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
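Before running an evaluation, it can help to sanity-check the dataset file yourself. The sketch below parses the structure shown above and rejects cases with missing fields or duplicate ids. Which fields LLMQ actually requires is not stated here, so `REQUIRED_FIELDS` is an assumption, and `load_test_cases` is a hypothetical helper.

```python
import json

# Assumption: these three fields are treated as mandatory for this sketch only.
REQUIRED_FIELDS = ("id", "input", "expected_output")


def load_test_cases(text: str) -> list:
    """Parse dataset JSON and return its test cases after basic validation."""
    data = json.loads(text)
    cases = data.get("test_cases", [])
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            raise ValueError(f"test case #{index} is missing fields: {missing}")
        if case["id"] in seen_ids:
            raise ValueError(f"duplicate test case id: {case['id']}")
        seen_ids.add(case["id"])
    return cases


sample = """
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris"
    }
  ]
}
"""
cases = load_test_cases(sample)
print(len(cases), cases[0]["id"])
```

To check a real file, pass its contents, for example `load_test_cases(open("evals/dataset.json").read())`.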
CLI Reference
# Setup
llmq init # Initialize new project
llmq doctor # Check system health
# Evaluation
llmq eval --provider groq # Run evaluation
llmq eval --provider openai --fail-on-gate # CI mode (exit 1 on gate failure)
llmq compare # Compare providers side-by-side
# Management
llmq providers # List provider status
llmq runs --limit 10 # View recent runs
llmq dashboard # Start web dashboard
llmq settings --set '{"quality_gates": {"task_success_threshold": 0.9}}'
Migration Guide (<=0.1.0 -> 0.1.1)
- Rename existing config.yaml to llmq.yaml.
- Update scripts to use --config-path (or continue using --config) when you need an explicit location.
- Remove hard dependency on llmq dashboard for CLI evaluations; llmq eval now runs standalone if API is unavailable.
- If you parse CLI statuses in CI, adopt the documented exit codes (0/1/2).
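If a script wraps the CLI, the documented exit codes can be mapped to readable statuses along these lines. `run_and_classify` is a hypothetical wrapper, and the demo uses a stand-in Python command that exits with 2 so the sketch runs even where llmq is not installed.

```python
import subprocess
import sys

# Meanings taken from the documented 0.1.1 exit codes.
EXIT_MEANINGS = {
    0: "quality gate passed",
    1: "quality gate failed or runtime error",
    2: "configuration error",
}


def run_and_classify(command: list) -> tuple:
    """Run a command and return (exit_code, human-readable meaning)."""
    completed = subprocess.run(command, check=False)
    meaning = EXIT_MEANINGS.get(completed.returncode, "unexpected exit code")
    return completed.returncode, meaning


# Stand-in command (assumption) that simulates a configuration error.
code, meaning = run_and_classify([sys.executable, "-c", "raise SystemExit(2)"])
print(code, meaning)
```

In a real pipeline the command would be something like `["llmq", "eval", "--provider", "groq", "--fail-on-gate"]`. Note that code 1 covers both a failed gate and a runtime error, so a script cannot tell those two apart from the exit code alone.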
API
# Start an evaluation
curl -X POST http://localhost:8000/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"provider": "groq"}'
# Get run history
curl http://localhost:8000/api/v1/runs
# Compare providers
curl http://localhost:8000/api/v1/compare
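The same calls can be made from Python with only the standard library. The sketch below builds the requests without sending them. The endpoint paths and the `{"provider": ...}` body come from the curl examples above, while the helper names are hypothetical and the response schema is not documented here.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000/api/v1"


def build_evaluate_request(provider: str, base_url: str = BASE_URL) -> Request:
    """Build the POST request that starts an evaluation for one provider."""
    body = json.dumps({"provider": provider}).encode("utf-8")
    return Request(
        f"{base_url}/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_runs_request(base_url: str = BASE_URL) -> Request:
    """Build the GET request for run history."""
    return Request(f"{base_url}/runs", method="GET")


evaluate_request = build_evaluate_request("groq")
runs_request = build_runs_request()
print(evaluate_request.get_method(), evaluate_request.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen(...)` while the dashboard is running on port 8000, and decode the response body as JSON.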
Contributing
Contributions are welcome — whether it's a bug fix, new provider integration, docs improvement, or feature request.
git clone https://github.com/Emart29/llm-quality-gate.git
cd llm-quality-gate
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
llmq doctor # Verify setup
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make changes and add tests
- Run tests: python -m pytest tests/ -v
- Submit a pull request
Roadmap
v1.1 — Custom metric plugins · Slack/Discord webhooks · A/B testing framework · Performance benchmarking
v1.2 — Multi-language datasets · Advanced regression analysis · Cost tracking per provider · Distributed evaluation
v2.0 — Visual prompt debugging · Automated prompt optimization · Enterprise SSO · Advanced analytics
License
MIT — see LICENSE for details.