Multi-LLM Response Validation & Selection Framework with RAG metrics evaluation
Evaluator Service
Multi-LLM Response Validation & Selection Framework with RAG metrics evaluation.
Features
- Single-call Custom Evaluator: Evaluates RAG metrics (faithfulness, context precision, context recall, relevance, hallucination risk) in a single LLM call
- Score Aggregation: Weighted scoring formula to combine multiple metrics into a final score
- LLM-as-a-Judge: Tie-breaking mechanism using LLM comparison when scores are close
- Parallel Processing: Evaluates multiple candidate responses concurrently
- Observability: MongoDB integration for storing evaluation traces
- FastAPI: RESTful API for easy integration
- Extensible: Pluggable architecture for different storage backends (MongoDB, Azure Blob, etc.)
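The parallel-processing feature can be pictured with a small sketch. This is not the package's implementation: `evaluate_all` and the `evaluate_one` callback are hypothetical names, and a thread pool is only one plausible way to run several blocking LLM calls at once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def evaluate_all(
    candidates: List[Dict[str, str]],
    evaluate_one: Callable[[Dict[str, str]], Dict[str, float]],
    max_workers: int = 4,
) -> List[Dict[str, float]]:
    """Run evaluate_one on every candidate concurrently.

    evaluate_one is a hypothetical callback standing in for one
    single-call metric evaluation of one candidate response.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input list
        return list(pool.map(evaluate_one, candidates))
```

Because results come back in input order, each result can be paired with its candidate by position.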
Installation
pip install evaluator-service
Configuration
Set the following environment variables:
# Pepgnix LLM Service Configuration
PEPGNIX_SERVICE_URL=https://pepgnix-service.example.com/api/v1/llm
PEPGNIX_TEAM_ID=your-team-id
PEPGNIX_PROJECT_ID=your-project-id
PEPGNIX_API_KEY=your-pepgnix-api-key
# MongoDB Configuration (for observability)
MONGODB_CONNECTION_STRING=mongodb://localhost:27017
MONGODB_DATABASE_NAME=evaluator_service
MONGODB_COLLECTION_NAME=evaluation_traces
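As a rough illustration of how these variables might be read at startup, the sketch below collects them into a dictionary. The function name `load_settings` is hypothetical, and treating the Pepgnix variables as required while the MongoDB ones fall back to the values shown above is an assumption, not documented behavior of the package.

```python
import os
from typing import Dict, Mapping, Optional

# Assumption: the Pepgnix variables have no sensible default and must be set.
REQUIRED_VARS = (
    "PEPGNIX_SERVICE_URL",
    "PEPGNIX_TEAM_ID",
    "PEPGNIX_PROJECT_ID",
    "PEPGNIX_API_KEY",
)

# Assumption: the MongoDB values listed above double as defaults.
MONGO_DEFAULTS = {
    "MONGODB_CONNECTION_STRING": "mongodb://localhost:27017",
    "MONGODB_DATABASE_NAME": "evaluator_service",
    "MONGODB_COLLECTION_NAME": "evaluation_traces",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the configuration values, failing early on missing ones."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    for name, default in MONGO_DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.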
Usage
As a Library
# LLMJudge is used below; importing it from the package root is an assumption
from evaluator_service import EvaluationOrchestrator, EvaluatorService, LLMJudge, WinnerSelector
from evaluator_service.clients import PepgnixClient, MongoObservabilityClient
from evaluator_service.models import EvalRequest, Candidate, ContextChunk
# Initialize clients
llm_client = PepgnixClient()
observability_client = MongoObservabilityClient()
# Initialize services
evaluator_service = EvaluatorService(llm_client)
llm_judge = LLMJudge(llm_client)
winner_selector = WinnerSelector(llm_judge)
orchestrator = EvaluationOrchestrator(evaluator_service, winner_selector, observability_client)
# Create evaluation request
request = EvalRequest(
    request_id="req-123",
    user_query="What was PepsiCo revenue in 2024?",
    context_chunks=[
        ContextChunk(
            chunk_id="doc-001-chunk-04",
            text="PepsiCo reported revenue of 91.8 billion USD in FY2024.",
            retrieval_score=0.94
        )
    ],
    candidates=[
        Candidate(model="gpt", response="PepsiCo reported revenue of 91.8 billion USD in FY2024."),
        Candidate(model="claude", response="According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.")
    ]
)
# Run evaluation
result = orchestrator.evaluate(request)
print(f"Winner: {result.winner.model}, Score: {result.score}")
As a REST API
# Start the server
evaluator-service
# Or using python
python -m evaluator_service.main
The API will be available at http://localhost:8080
API Endpoint
POST /api/v1/evaluate
Request body:
{
  "request_id": "req-123",
  "user_query": "What was PepsiCo revenue in 2024?",
  "context_chunks": [
    {
      "chunk_id": "doc-001-chunk-04",
      "text": "PepsiCo reported revenue of 91.8 billion USD in FY2024.",
      "retrieval_score": 0.94
    }
  ],
  "candidates": [
    {
      "model": "gpt",
      "response": "PepsiCo reported revenue of 91.8 billion USD in FY2024."
    },
    {
      "model": "claude",
      "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024."
    }
  ]
}
Response:
{
  "request_id": "req-123",
  "winner": {
    "model": "claude",
    "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.",
    "score": 0.85,
    "selection_method": "score_winner"
  },
  "all_scores": {
    "gpt": {
      "final": 0.82,
      "faithfulness": 0.9,
      "context_precision": 0.85,
      "context_recall": 0.8,
      "relevance": 0.95,
      "hallucination_risk": 0.1
    },
    "claude": {
      "final": 0.85,
      "faithfulness": 0.95,
      "context_precision": 0.9,
      "context_recall": 0.85,
      "relevance": 0.9,
      "hallucination_risk": 0.05
    }
  },
  "trace_id": "abc-123-def-456",
  "evaluated_at": "2024-01-15T10:30:00Z",
  "latency_ms": 2340
}
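For callers that prefer not to add an HTTP dependency, the endpoint can be reached with the standard library alone. The sketch below is a minimal client under a few assumptions: the helper names are hypothetical, the server accepts `application/json` without authentication, and it runs at the default address given above.

```python
import json
import urllib.request


def build_evaluate_request(
    payload: dict, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/evaluate."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the service expects a JSON body and no auth header.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_evaluate(
    payload: dict, base_url: str = "http://localhost:8080", timeout: float = 30.0
) -> dict:
    """Send the evaluation request and decode the JSON response."""
    request = build_evaluate_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_evaluate(payload)["winner"]["model"]` would read the selected model from a response shaped like the one above.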
Scoring Formula
The final score is calculated using the following weighted formula:
Final Score =
(0.35 × faithfulness)
+ (0.25 × context_recall)
+ (0.20 × relevance)
+ (0.20 × context_precision)
- (0.30 × hallucination_risk)
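The formula translates directly into code. The sketch below is not taken from the package; `final_score` is a hypothetical name, and it assumes each metric is a float in the range 0 to 1. Under that assumption the result lies between -0.30 and 1.00, since the four positive weights sum to 1.0. Whether the service clamps or rounds the value is not stated here.

```python
from typing import Mapping

# Positive weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "faithfulness": 0.35,
    "context_recall": 0.25,
    "relevance": 0.20,
    "context_precision": 0.20,
}
HALLUCINATION_PENALTY = 0.30


def final_score(metrics: Mapping[str, float]) -> float:
    """Weighted sum of the four quality metrics minus the hallucination penalty."""
    positive = sum(weight * metrics[name] for name, weight in WEIGHTS.items())
    return positive - HALLUCINATION_PENALTY * metrics["hallucination_risk"]
```

Applied to the gpt metrics from the sample response, this sketch gives 0.845 rather than the 0.82 shown there, so the sample figures are presumably illustrative, or the service applies an adjustment not described in this README.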
Tie-Breaking
When the difference between the top two scores is less than 0.05, the LLM Judge is invoked to compare the two answers based on:
- Accuracy
- Completeness
- Grounding
- Clarity
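The selection rule can be sketched as follows. `select_winner` and the `judge` callback are hypothetical names, not the package's API. The `"score_winner"` label comes from the sample response, while `"llm_judge"` is an assumed label for the tie-break path.

```python
from typing import Callable, Dict, Tuple

TIE_THRESHOLD = 0.05  # gap below which the judge is consulted


def select_winner(
    final_scores: Dict[str, float],
    judge: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (winning model, selection method).

    judge is a hypothetical callback that receives the top two model
    names and returns the one it prefers.
    """
    if not final_scores:
        raise ValueError("at least one candidate score is required")
    ranked = sorted(final_scores, key=final_scores.get, reverse=True)
    if len(ranked) == 1:
        return ranked[0], "score_winner"
    top, runner_up = ranked[0], ranked[1]
    if final_scores[top] - final_scores[runner_up] < TIE_THRESHOLD:
        # Scores are too close to call: defer to the LLM judge.
        return judge(top, runner_up), "llm_judge"
    return top, "score_winner"
```

Under this sketch, final scores of 0.85 and 0.82 differ by 0.03 and would be sent to the judge. Gaps that sit exactly on the threshold are sensitive to floating-point rounding, so a real implementation may want to round scores before comparing them.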
Development
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
# Lint
ruff check .
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
File details
Details for the file evaluator_service-0.1.1.tar.gz.
File metadata
- Download URL: evaluator_service-0.1.1.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 670f9378b5b96244f95833c3208cd7d0e9b57ab303eec77219bbefd06c8f6732 |
| MD5 | b3969e579dd6aab95d97e5a7939c8253 |
| BLAKE2b-256 | fe00edee748e3ce854ad84d68e80f6f67af7f9c4efb4ddf886b6be14ae12badf |
File details
Details for the file evaluator_service-0.1.1-py3-none-any.whl.
File metadata
- Download URL: evaluator_service-0.1.1-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7485f7376741c18060a4f2d712ca246abbc30a0af6ab5a58c7bc53674c31f73d |
| MD5 | 754b2dd37cc62b7f4a09c1f055a058d9 |
| BLAKE2b-256 | 30836489a0c35f4412e5cb51d5512f48f4c03130d55759dba3df79b5ba1db673 |