LLM Quality Gate - A provider-agnostic evaluation framework for LLM applications
LLMQ Gate
You changed one prompt. Summarization improved.
Classification silently broke. Nobody noticed for 3 days.
LLMQ Gate makes this a CI problem, not a production incident.
The Problem
LLM applications have no test suite. Changes that seem safe (a prompt tweak, a model version bump, a temperature adjustment) can silently degrade performance on tasks you didn't test manually. A model update changes response formats overnight. Nobody notices until users complain or metrics tank a week later.
Traditional software has pytest. CI/CD pipelines catch regressions before they ship. LLM apps have had nothing equivalent, until now.
LLMQ Gate brings the same regression-detection discipline to LLM applications. Define test cases, set quality thresholds, run evals against any provider, and fail CI builds automatically when quality drops below your standards.
How It Works
Dataset → LLM Provider → Metrics Engine → Quality Gates → Pass / Fail
- Define test cases in evals/dataset.json: inputs, expected outputs, context
- Run evals against any supported provider (Groq, OpenAI, Gemini, Ollama, and more)
- Four metrics computed automatically: task success, relevance, hallucination, consistency
- Quality gates pass or fail against your configured thresholds
- Results stored for historical tracking and regression detection
- CI build fails if quality drops, just like a broken unit test
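The flow above can be pictured with a toy version in Python. This is only an illustrative sketch, not LLMQ Gate's internals: `fake_provider`, `exact_match`, and `run_gate` are hypothetical names, the test cases are made up, and a single exact-match score stands in for the real metrics engine.

```python
from statistics import mean
from typing import Callable

# Hypothetical test cases, shaped like the entries in evals/dataset.json.
TEST_CASES = [
    {"id": "qa_1", "input": "What is the capital of France?", "expected_output": "Paris"},
    {"id": "qa_2", "input": "What is 2 + 2?", "expected_output": "4"},
]


def fake_provider(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: a provider maps prompt -> text)."""
    canned = {"What is the capital of France?": "Paris", "What is 2 + 2?": "5"}
    return canned.get(prompt, "")


def exact_match(response: str, expected: str) -> float:
    """Toy task-success metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


def run_gate(
    cases: list[dict], provider: Callable[[str], str], threshold: float
) -> tuple[float, bool]:
    """Run every case through the provider, average the scores, apply the threshold."""
    scores = [exact_match(provider(c["input"]), c["expected_output"]) for c in cases]
    score = mean(scores)
    return score, score >= threshold


score, passed = run_gate(TEST_CASES, fake_provider, threshold=0.8)
print(f"task_success={score:.2f} gate={'PASS' if passed else 'FAIL'}")
```

Here one of the two canned answers is wrong, so the averaged score falls below the 0.8 threshold and the gate fails, which is the condition a CI job would turn into a failed build.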
Quickstart
pip install llmq-gate && llmq init && llmq eval --provider groq
$ llmq eval --provider groq --fail-on-gate
Loading configuration... ✓
Connecting to Groq (llama-3.1-8b-instant)... ✓
Loading dataset (12 test cases)... ✓
Running evaluations:
✓ question_answering_1 Task Success: 0.95 Relevance: 0.92 Hallucination: 0.02
✓ summarization_1 Task Success: 0.88 Relevance: 0.94 Hallucination: 0.01
✓ classification_1 Task Success: 1.00 Relevance: 0.89 Hallucination: 0.00
✓ sentiment_analysis_1 Task Success: 0.72 Relevance: 0.85 Hallucination: 0.03
✓ code_generation_1 Task Success: 0.91 Relevance: 0.96 Hallucination: 0.01
... (7 more)
Metrics Summary:
Task Success: 0.87 (threshold: 0.80) ✓
Relevance: 0.91 (threshold: 0.70) ✓
Hallucination: 0.02 (threshold: 0.10) ✓
Consistency: 0.94 (threshold: 0.80) ✓
Quality Gate: PASS ✓
Metrics
| Metric | Method | What It Answers |
|---|---|---|
| Task Success | Exact match + semantic similarity | Did the model get it right? |
| Relevance | Embedding cosine similarity | Is the response on-topic? |
| Hallucination | LLM-as-judge | Did it fabricate information? |
| Consistency | Multi-run variance | Are outputs stable across runs? |
Each metric maps to a real failure mode teams encounter in production. Hallucination detection uses an LLM-as-judge pattern: a second model evaluates whether the response introduces facts not grounded in the provided context.
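To make the LLM-as-judge idea concrete, here is a minimal sketch. The prompt wording, the YES/NO protocol, and the names `JUDGE_TEMPLATE`, `hallucination_score`, and `stub_judge` are all assumptions made for illustration; the README does not show the judge prompt LLMQ Gate actually uses. The judge is passed in as a plain callable so the example runs without any API key.

```python
from typing import Callable

# Hypothetical judge prompt; the real prompt is not shown in this README.
JUDGE_TEMPLATE = """You are a strict fact checker.
Context:
{context}

Response:
{response}

Does the response state any fact that is not supported by the context?
Answer with exactly one word: YES or NO."""


def build_judge_prompt(context: str, response: str) -> str:
    """Fill the template with the grounding context and the response under test."""
    return JUDGE_TEMPLATE.format(context=context, response=response)


def hallucination_score(context: str, response: str, judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge flags ungrounded facts, else 0.0 (lower is better)."""
    verdict = judge(build_judge_prompt(context, response)).strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0


def stub_judge(prompt: str) -> str:
    """Stand-in for the second model: flags any response that mentions 'Lyon'."""
    response_part = prompt.split("Response:", 1)[1]
    return "YES" if "Lyon" in response_part else "NO"


context = "Paris is the capital of France."
print(hallucination_score(context, "The capital is Paris.", stub_judge))
print(hallucination_score(context, "The capital is Lyon.", stub_judge))
```

In a real setup `stub_judge` would be replaced by a call to a second model, and per-case scores could be averaged to produce a rate comparable to a hallucination threshold.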
Supported Providers
| Provider | Models | Cost |
|---|---|---|
| Groq | Llama 3.1, Mixtral | Free tier |
| OpenAI | GPT-3.5, GPT-4 | Paid |
| Claude | Claude 3 Haiku / Sonnet | Paid |
| Gemini | Gemini 1.5 Flash / Pro | Free tier |
| HuggingFace | Open models | Free |
| OpenRouter | 100+ models | Varies |
| Ollama | Local models | Free |
| LocalAI | Local models | Free |
CI/CD Integration
One workflow file. Builds fail automatically when LLM quality drops below your thresholds, the same way a broken unit test fails a build.
# .github/workflows/llm-quality-gate.yml
name: LLM Quality Gate
on:
  pull_request:
    paths: ['prompts/**', 'llm/**', 'llmq.yaml']
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install LLMQ Gate
        run: pip install llmq-gate
      - name: Run Quality Gate
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: llmq eval --provider groq --fail-on-gate
Exit codes: 0 pass · 1 fail or runtime error · 2 config error
Exit code 2 tells your pipeline the run was misconfigured, while exit code 1 signals a quality regression or a runtime failure, so a CI job can branch on the cause without parsing logs manually.
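A wrapper script can branch on these exit codes. The sketch below assumes only what is documented above: the `llmq eval --provider ... --fail-on-gate` command and the 0/1/2 codes. The helper names `describe_exit` and `run_quality_gate` are hypothetical, and `run_quality_gate` is defined but not called here, since it needs `llmq` installed and an API key.

```python
import subprocess

# Meanings taken from the exit-code line above; note that 1 covers two cases.
EXIT_MEANINGS = {
    0: "quality gate passed",
    1: "quality gate failed or runtime error",
    2: "configuration error",
}


def describe_exit(code: int) -> str:
    """Map an llmq exit code to a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_quality_gate(provider: str = "groq") -> int:
    """Run the documented CI command and return its exit code."""
    result = subprocess.run(
        ["llmq", "eval", "--provider", provider, "--fail-on-gate"],
        check=False,  # inspect the return code instead of raising
    )
    return result.returncode


for code in (0, 1, 2):
    print(code, describe_exit(code))
```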
Configuration
llmq.yaml
llm:
  default_provider: "groq"
  temperature: 0.0
  max_tokens: 1000
providers:
  groq:
    api_key_env: "GROQ_API_KEY"
    model: "llama-3.1-8b-instant"
  openai:
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-3.5-turbo"
quality_gates:
  task_success_threshold: 0.8
  relevance_threshold: 0.7
  hallucination_threshold: 0.1
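How the thresholds turn into a pass or fail can be sketched as follows. The direction of each comparison is an assumption inferred from the sample output (hallucination 0.02 passes against a 0.10 threshold, so it is treated as a maximum; the other metrics are treated as minimums). `check_gates` is a hypothetical helper, not part of the package.

```python
# Thresholds copied from the quality_gates section of llmq.yaml.
THRESHOLDS = {
    "task_success_threshold": 0.8,
    "relevance_threshold": 0.7,
    "hallucination_threshold": 0.1,
}

# Assumption: hallucination is "lower is better"; every other metric is "higher is better".
LOWER_IS_BETTER = {"hallucination"}


def check_gates(metrics: dict[str, float], thresholds: dict[str, float]) -> dict[str, bool]:
    """Compare each metric with its threshold and report pass/fail per metric."""
    results = {}
    for key, limit in thresholds.items():
        name = key.removesuffix("_threshold")
        value = metrics[name]
        results[name] = value <= limit if name in LOWER_IS_BETTER else value >= limit
    return results


# Aggregate values from the sample run shown in the Quickstart output.
metrics = {"task_success": 0.87, "relevance": 0.91, "hallucination": 0.02}
results = check_gates(metrics, THRESHOLDS)
print(results)
print("PASS" if all(results.values()) else "FAIL")
```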
evals/dataset.json
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
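Before running an eval it can help to sanity-check the dataset file. The following sketch parses the structure shown above with the standard `json` module. Which fields are mandatory is an assumption (`REQUIRED_FIELDS` is a guess for illustration), and `load_test_cases` is a hypothetical helper rather than the package's own loader.

```python
import json

# Assumption: these four fields are mandatory; the README does not say which are.
REQUIRED_FIELDS = {"id", "task_type", "input", "expected_output"}


def load_test_cases(raw: str) -> list[dict]:
    """Parse a dataset document and check each case for required fields and unique ids."""
    cases = json.loads(raw)["test_cases"]
    seen_ids = set()
    for case in cases:
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"{case.get('id', '<no id>')}: missing fields {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"duplicate id: {case['id']}")
        seen_ids.add(case["id"])
    return cases


RAW = """
{
  "test_cases": [
    {
      "id": "example_1",
      "task_type": "question_answering",
      "input": "What is the capital of France?",
      "expected_output": "Paris",
      "context": "Geography question",
      "reference": "Paris is the capital of France."
    }
  ]
}
"""

cases = load_test_cases(RAW)
print(len(cases), cases[0]["id"])
```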
Dashboard
llmq dashboard
# → http://localhost:8000
Historical performance tracking, provider comparisons, quality gate trends, and per-test-case drill-down, all in one view. See exactly which test cases are degrading and when the regression started.
CLI Reference
llmq init # Initialize project
llmq doctor # Check system health
llmq eval --provider groq # Run evaluation
llmq eval --provider openai --fail-on-gate # CI mode: exits 1 on fail
llmq compare # Compare providers side-by-side
llmq providers # List provider status
llmq runs --limit 10 # View recent runs
llmq dashboard # Start web dashboard
REST API
# Trigger an evaluation programmatically
curl -X POST http://localhost:8000/api/v1/evaluate \
-H "Content-Type: application/json" \
-d '{"provider": "groq"}'
# View recent runs
curl http://localhost:8000/api/v1/runs
# Compare providers
curl http://localhost:8000/api/v1/compare
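The same evaluation trigger can be issued from Python with the standard library. This sketch mirrors the first curl call above; the endpoint, header, and payload come from that example, while the response format is an assumption (the README does not document it), so `trigger_evaluation` simply tries to decode JSON. Only the request is built when the snippet runs; nothing is sent unless you call `trigger_evaluation` against a running dashboard.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_evaluate_request(provider: str) -> urllib.request.Request:
    """Build the POST /evaluate request shown in the curl example."""
    body = json.dumps({"provider": provider}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_evaluation(provider: str, timeout: float = 60.0) -> dict:
    """Send the request. Assumption: the server replies with a JSON body."""
    request = build_evaluate_request(provider)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_evaluate_request("groq")
print(req.get_method(), req.full_url)
```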
Project Structure
llm-quality-gate/
├── core/                # Metrics engine (task success, relevance,
│                        # hallucination, consistency)
├── llm/                 # Provider abstractions (Groq, OpenAI, etc.)
├── evals/               # Evaluation runner + dataset loader
├── cli/                 # llmq CLI commands
├── dashboard/           # Web dashboard (FastAPI + frontend)
├── storage/             # Run history + result persistence
├── ci/                  # CI integration helpers
├── tests/               # Pytest test suite
├── .github/workflows/   # GitHub Actions pipeline
├── llmq.yaml            # Project configuration
├── config.example.yaml  # Configuration reference
└── pyproject.toml       # Package metadata (published to PyPI)
Key Engineering Decisions
Why LLM-as-judge for hallucination detection? Simpler approaches (keyword matching, NLI models) miss subtle fabrications. Using a second LLM to evaluate grounding catches hallucinations that those methods skip, at the cost of one extra API call per evaluation.
Why semantic similarity for relevance instead of exact match? Exact match punishes correct answers that use different wording. Embedding cosine similarity measures whether the response is semantically on-topic, which is what relevance actually means for generative outputs.
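The difference between exact match and cosine similarity can be shown with a small example. Note the substitution: `toy_embed` below is a bag-of-words counter used as a stand-in so the snippet runs without a model download; it is not a semantic embedding and not what LLMQ Gate uses. Only the cosine formula itself carries over to real embedding vectors.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts. A stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """cos(a, b) = a.b / (|a| * |b|), computed over sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = "paris is the capital of france"
reworded = "the capital of france is paris"
off_topic = "bananas are yellow"

# Exact match rejects the reworded answer even though it is correct.
print(reworded == reference)
# Cosine similarity scores the reworded answer high and the off-topic one low.
print(round(cosine_similarity(toy_embed(reference), toy_embed(reworded)), 2))
print(round(cosine_similarity(toy_embed(reference), toy_embed(off_topic)), 2))
```

A real embedding model would also give credit to paraphrases that share no words with the reference, which this word-count stand-in cannot do.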
Why multi-run variance for consistency? LLMs are non-deterministic at temperature > 0. A model that gives correct answers 70% of the time is not production-ready. Consistency scoring surfaces this instability before it reaches users.
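One way to quantify this instability is to run the same input several times and measure how much the results disagree. The README does not give the formula behind the consistency score, so the two measures below (`agreement_rate` over outputs and `score_variance` over per-run scores) are illustrative assumptions, and the run data is made up.

```python
from collections import Counter
from statistics import pvariance


def agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs that produced the most common (normalized) output."""
    if not outputs:
        raise ValueError("need at least one run")
    normalized = [output.strip().lower() for output in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


def score_variance(scores: list[float]) -> float:
    """Population variance of per-run scores; 0.0 means perfectly stable."""
    return pvariance(scores)


# Hypothetical outputs from five runs of the same sentiment prompt.
runs = ["Positive", "positive", "Negative", "positive", "Positive "]
print(agreement_rate(runs))

# Hypothetical per-run task-success scores for the same five runs.
print(score_variance([1.0, 1.0, 0.0, 1.0, 1.0]))
```

Four of the five runs agree here, so the agreement rate is 0.8; a team could compare a number like this against a consistency threshold.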
Why fail the CI build instead of just reporting? Reporting creates alert fatigue: teams learn to ignore dashboards. Failing the build makes quality regression a blocking issue that must be resolved before merging, the same discipline applied to unit tests.
Installation & Version Notes
pip install llmq-gate
v0.1.1 (current)
- Unified config: llmq.yaml everywhere
- Config auto-discovery from current directory upward
- Standalone eval mode: no dashboard dependency
- Standardized exit codes (0/1/2)
Migrating from ≤0.1.0? Rename config.yaml → llmq.yaml.
That's the only breaking change.
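The "config auto-discovery from current directory upward" behavior listed above can be pictured as a walk through parent directories until an `llmq.yaml` is found. This is a sketch of that general technique with a hypothetical `find_config` helper, not the package's actual lookup code; details such as where the real search stops are not documented here.

```python
import tempfile
from pathlib import Path
from typing import Optional

CONFIG_NAME = "llmq.yaml"


def find_config(start: Path) -> Optional[Path]:
    """Return the nearest llmq.yaml in `start` or any of its parents, else None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None


# Demo in a throwaway directory tree: the config sits two levels above the start.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / CONFIG_NAME).write_text('llm:\n  default_provider: "groq"\n')
    nested = root / "src" / "pkg"
    nested.mkdir(parents=True)
    found = find_config(nested)
    print(found == root / CONFIG_NAME)
```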
Roadmap
v1.1 Custom metric plugins · Slack/Discord webhooks · A/B testing · Performance benchmarks
v1.2 Multi-language datasets · Regression analysis · Cost tracking · Distributed eval
v2.0 Visual prompt debugging · Automated prompt optimization · Enterprise SSO
Contributing
git clone https://github.com/Emart29/llm-quality-gate.git
cd llm-quality-gate
python -m venv venv && source venv/bin/activate
pip install -e . && llmq doctor
Fork → branch → test (python -m pytest tests/ -v) → PR.
See CONTRIBUTING.md for full guidelines.
License
MIT. See LICENSE for details.
Built by Emmanuel Nwanguma
Star on GitHub · PyPI · Website