LLM Quality Gate - A provider-agnostic evaluation framework for LLM applications
LLMQ
An open-source LLM regression testing & CI quality gate framework.
Prevent prompt and model regressions before they reach production with automated testing across 8 LLM providers.
Why LLMQ?
LLM applications fail silently. A prompt change that works in development can degrade performance in production. Model updates can break existing functionality. Without systematic testing, these regressions go undetected until users complain.
Common LLM Regression Examples:
- Prompt optimization improves one task but breaks another
- Model updates change response format, breaking downstream parsing
- Provider API changes affect response quality
- Temperature adjustments reduce consistency
- Context length changes truncate important information
LLMQ catches these issues before deployment with automated regression testing and quality gates.
Quick Start
Get running in 5 minutes:
# 1. Install (run from a clone of the repository)
pip install -e .
# 2. Initialize project
llmq init
# 3. Set API key (copy .env.example to .env)
echo "GROQ_API_KEY=your_key_here" >> .env
# 4. Run evaluation
llmq eval --provider groq
View results at http://localhost:8000 after running llmq dashboard.
CI Integration
Add to .github/workflows/llm-quality-gate.yml:
name: LLM Quality Gate
on:
  pull_request:
    paths: ['prompts/**', 'llm/**', 'llmq.yaml']
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install LLMQ
        run: pip install -e .
      - name: Run Quality Gate
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          llmq eval --provider groq --fail-on-gate
Architecture
Flow: Dataset → Provider → Metrics → Quality Gates → Results
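The flow above can be illustrated with a minimal Python sketch. This is not LLMQ's internal code: `EvalCase`, `run_pipeline`, `fake_provider`, and `exact_match` are hypothetical names invented here, and the provider is a stub function rather than a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """Hypothetical test case; field names mirror evals/dataset.json."""
    id: str
    input: str
    expected_output: str


def run_pipeline(
    dataset: List[EvalCase],
    provider: Callable[[str], str],
    metrics: Dict[str, Callable[[str, EvalCase], float]],
    minimums: Dict[str, float],
) -> dict:
    """Dataset -> Provider -> Metrics -> Quality Gates -> Results."""
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for case in dataset:
        output = provider(case.input)  # Provider step
        for name, metric in metrics.items():  # Metrics step
            scores[name].append(metric(output, case))
    # Average each metric over the dataset (0.0 for an empty dataset).
    averages = {
        name: (sum(values) / len(values) if values else 0.0)
        for name, values in scores.items()
    }
    # Quality gate step: each gated metric must reach its minimum.
    gates = {name: averages[name] >= floor for name, floor in minimums.items()}
    return {"averages": averages, "gates": gates, "passed": all(gates.values())}


def fake_provider(prompt: str) -> str:
    """Stub standing in for a real LLM provider call."""
    return "Paris" if "France" in prompt else "unknown"


def exact_match(output: str, case: EvalCase) -> float:
    """1.0 when the output equals the expected answer, ignoring case."""
    same = output.strip().lower() == case.expected_output.strip().lower()
    return 1.0 if same else 0.0


result = run_pipeline(
    dataset=[
        EvalCase("example_1", "What is the capital of France?", "Paris"),
        EvalCase("example_2", "What is the capital of Peru?", "Lima"),
    ],
    provider=fake_provider,
    metrics={"task_success": exact_match},
    minimums={"task_success": 0.8},
)
print(result)
```

Here the stub answers one of two questions correctly, so the average task success is 0.5 and the 0.8 gate fails.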
Supported Providers
| Provider | Models | API Key Required | Cost |
|---|---|---|---|
| Groq | Llama 3.1, Mixtral | ✅ | Free tier |
| OpenAI | GPT-3.5, GPT-4 | ✅ | Paid |
| Claude | Claude 3 Haiku/Sonnet | ✅ | Paid |
| Gemini | Gemini 1.5 Flash/Pro | ✅ | Free tier |
| HuggingFace | Open models | ✅ | Free |
| OpenRouter | 100+ models | ✅ | Varies |
| Ollama | Local models | ❌ | Free |
| LocalAI | Local models | ❌ | Free |
CLI Commands
# Project setup
llmq init # Initialize new project
llmq doctor # Check system health
# Evaluation
llmq eval --provider groq # Run evaluation
llmq eval --provider openai --fail-on-gate # CI mode
llmq compare # Compare providers
# Management
llmq providers # List provider status
llmq runs --limit 10 # View recent runs
llmq dashboard # Start web interface
llmq settings --set '{"quality_gates": {"task_success_threshold": 0.9}}'
Dashboard
🎬 Interactive Demo: see the full CLI and dashboard walkthrough.
Features:
- Historical performance tracking
- Provider comparison charts
- Quality gate pass/fail trends
- Test case drill-down analysis
Configuration
llmq.yaml:
llm:
  default_provider: "groq"
  temperature: 0.0
  max_tokens: 1000
providers:
  groq:
    api_key_env: "GROQ_API_KEY"
    model: "llama-3.1-8b-instant"
  openai:
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-3.5-turbo"
quality_gates:
  task_success_threshold: 0.8
  relevance_threshold: 0.7
  hallucination_threshold: 0.1
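As a rough illustration of how the three quality gate thresholds could be applied, here is a small Python sketch. It assumes that task success and relevance are minimums and that the hallucination threshold is a maximum; the README does not spell this out, and `check_gates` and `exit_code` are hypothetical helpers, not LLMQ functions.

```python
from typing import Dict

# Threshold values copied from the llmq.yaml example.
THRESHOLDS = {
    "task_success_threshold": 0.8,
    "relevance_threshold": 0.7,
    "hallucination_threshold": 0.1,
}


def check_gates(scores: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Return pass/fail per gate.

    Assumption: task_success and relevance are floors (higher is better),
    while hallucination is a ceiling (lower is better).
    """
    return {
        "task_success": scores["task_success"] >= thresholds["task_success_threshold"],
        "relevance": scores["relevance"] >= thresholds["relevance_threshold"],
        "hallucination": scores["hallucination"] <= thresholds["hallucination_threshold"],
    }


def exit_code(gates: Dict[str, bool]) -> int:
    """Map gate results to a process exit code suitable for CI."""
    return 0 if all(gates.values()) else 1


# Hypothetical aggregate scores from one evaluation run.
run_scores = {"task_success": 0.85, "relevance": 0.75, "hallucination": 0.2}
gates = check_gates(run_scores, THRESHOLDS)
code = exit_code(gates)
print(gates, code)
```

In this example the hallucination rate of 0.2 exceeds the 0.1 ceiling, so the run would fail even though the other two gates pass.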
evals/dataset.json:
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
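Before running an evaluation it can help to sanity-check the dataset file. The sketch below parses the structure shown above using only the standard library. Which fields are mandatory is an assumption made here (`context` and `reference` are treated as optional), and `load_test_cases` is a hypothetical helper, not part of LLMQ.

```python
import json
from typing import Dict, List

# Assumption: these four fields are required; context/reference are optional.
REQUIRED_FIELDS = {"id", "task_type", "input", "expected_output"}

SAMPLE = """
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
"""


def load_test_cases(text: str) -> List[Dict[str, str]]:
    """Parse dataset JSON and check each test case for required fields."""
    data = json.loads(text)
    cases = data.get("test_cases")
    if not isinstance(cases, list):
        raise ValueError("dataset must contain a 'test_cases' list")
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"test case {index} is missing: {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"duplicate test case id: {case['id']}")
        seen_ids.add(case["id"])
    return cases


cases = load_test_cases(SAMPLE)
print(len(cases), cases[0]["id"])
```

To check a real file, pass the contents of evals/dataset.json, for example `load_test_cases(open("evals/dataset.json").read())`.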
Metrics
- Task Success: Exact match + semantic similarity
- Relevance: Embedding-based cosine similarity
- Hallucination: LLM-as-judge detection
- Consistency: Multi-run variance analysis
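The ideas behind three of these metrics can be sketched in plain Python. This is an illustration under stated assumptions, not LLMQ's implementation: the normalization in `exact_match`, the use of population variance for consistency, and all function names are choices made here. The embedding vectors are hard-coded, whereas a real run would obtain them from an embedding model. The LLM-as-judge hallucination check is not sketched because it requires a model call.

```python
import math
import statistics
from typing import Sequence


def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings match after trimming and lowercasing (assumed normalization)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # define similarity with a zero vector as 0.0


def consistency_variance(scores: Sequence[float]) -> float:
    """Population variance of one metric across repeated runs; lower is more consistent."""
    return statistics.pvariance(scores)


print(exact_match(" paris ", "Paris"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(consistency_variance([1.0, 1.0, 1.0]))
```

Identical vectors give a similarity of 1.0, orthogonal vectors give 0.0, and identical scores across runs give a variance of 0.0.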
API
# Start evaluation
curl -X POST http://localhost:8000/api/v1/evaluate \
-H "Content-Type: application/json" \
-d '{"provider": "groq"}'
# Get results
curl http://localhost:8000/api/v1/runs
# Provider comparison
curl http://localhost:8000/api/v1/compare
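The same calls can be made from Python with the standard library. This is a minimal sketch that assumes the dashboard server is running on localhost:8000 and that the endpoints return JSON (the README does not show the response format). `build_evaluate_request` and `send_json` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_evaluate_request(provider: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request equivalent to the curl evaluate call."""
    body = json.dumps({"provider": provider}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0):
    """Send a request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_evaluate_request("groq")
print(request.get_method(), request.full_url)

# With the server running, the three curl calls would become:
# started = send_json(request)
# runs = send_json(urllib.request.Request(f"{BASE_URL}/runs"))
# comparison = send_json(urllib.request.Request(f"{BASE_URL}/compare"))
```

Building the request separately from sending it makes the payload easy to inspect without a running server.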
Contributing
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make changes and add tests
- Run tests: python -m pytest tests/ -v
- Submit a pull request
Development setup:
git clone https://github.com/Emart29/llm-quality-gate.git
cd llm-quality-gate
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -e .
llmq doctor # Verify setup
Roadmap
v1.1
- Custom metric plugins
- Slack/Discord webhooks
- A/B testing framework
- Performance benchmarking
v1.2
- Multi-language datasets
- Advanced regression analysis
- Cost tracking per provider
- Distributed evaluation
v2.0
- Visual prompt debugging
- Automated prompt optimization
- Enterprise SSO integration
- Advanced analytics
License: MIT | Python: 3.8+ | Status: Production Ready