Multi-LLM Response Validation & Selection Framework with RAG metrics evaluation

Evaluator Service

Features

Single-call Custom Evaluator: Evaluates RAG metrics (faithfulness, context precision, context recall, relevance, hallucination risk) in a single LLM call
Score Aggregation: Weighted scoring formula to combine multiple metrics into a final score
LLM-as-a-Judge: Tie-breaking mechanism using LLM comparison when scores are close
Parallel Processing: Evaluates multiple candidate responses concurrently
Observability: MongoDB integration for storing evaluation traces
FastAPI: RESTful API for easy integration
Extensible: Pluggable architecture for different storage backends (MongoDB, Azure Blob, etc.)

Installation

pip install evaluator-service

Configuration

Set the following environment variables:

# Pepgnix LLM Service Configuration
PEPGNIX_SERVICE_URL=https://pepgnix-service.example.com/api/v1/llm
PEPGNIX_TEAM_ID=your-team-id
PEPGNIX_PROJECT_ID=your-project-id
PEPGNIX_API_KEY=your-pepgnix-api-key

# MongoDB Configuration (for observability)
MONGODB_CONNECTION_STRING=mongodb://localhost:27017
MONGODB_DATABASE_NAME=evaluator_service
MONGODB_COLLECTION_NAME=evaluation_traces
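
Before starting the service it can help to confirm that these variables are actually set. The following is a minimal pre-flight sketch that uses only the standard library; the variable names come from the listing above, while the helper name `missing_variables` is hypothetical and not part of the package. Whether the package treats every variable as required is not stated here, so the check only reports what is missing.

```python
import os
from typing import Mapping

# Variable names copied from the Configuration section.
PEPGNIX_VARS = (
    "PEPGNIX_SERVICE_URL",
    "PEPGNIX_TEAM_ID",
    "PEPGNIX_PROJECT_ID",
    "PEPGNIX_API_KEY",
)
MONGODB_VARS = (
    "MONGODB_CONNECTION_STRING",
    "MONGODB_DATABASE_NAME",
    "MONGODB_COLLECTION_NAME",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the expected variable names that are unset or empty in env."""
    return [name for name in PEPGNIX_VARS + MONGODB_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```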

Usage

As a Library

# LLMJudge is used below; importing it from the package root is an assumption.
from evaluator_service import EvaluationOrchestrator, EvaluatorService, LLMJudge, WinnerSelector
from evaluator_service.clients import PepgnixClient, MongoObservabilityClient
from evaluator_service.models import EvalRequest, Candidate, ContextChunk

# Initialize clients
llm_client = PepgnixClient()
observability_client = MongoObservabilityClient()

# Initialize services
evaluator_service = EvaluatorService(llm_client)
llm_judge = LLMJudge(llm_client)
winner_selector = WinnerSelector(llm_judge)
orchestrator = EvaluationOrchestrator(evaluator_service, winner_selector, observability_client)

# Create evaluation request
request = EvalRequest(
    request_id="req-123",
    user_query="What was PepsiCo revenue in 2024?",
    context_chunks=[
        ContextChunk(
            chunk_id="doc-001-chunk-04",
            text="PepsiCo reported revenue of 91.8 billion USD in FY2024.",
            retrieval_score=0.94
        )
    ],
    candidates=[
        Candidate(model="gpt", response="PepsiCo reported revenue of 91.8 billion USD in FY2024."),
        Candidate(model="claude", response="According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.")
    ]
)

# Run evaluation
result = orchestrator.evaluate(request)
print(f"Winner: {result.winner.model}, Score: {result.winner.score}")

As a REST API

# Start the server
evaluator-service

# Or using python
python -m evaluator_service.main

The API will be available at http://localhost:8080

API Endpoint

POST /api/v1/evaluate

Request body:

{
  "request_id": "req-123",
  "user_query": "What was PepsiCo revenue in 2024?",
  "context_chunks": [
    {
      "chunk_id": "doc-001-chunk-04",
      "text": "PepsiCo reported revenue of 91.8 billion USD in FY2024.",
      "retrieval_score": 0.94
    }
  ],
  "candidates": [
    {
      "model": "gpt",
      "response": "PepsiCo reported revenue of 91.8 billion USD in FY2024."
    },
    {
      "model": "claude",
      "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024."
    }
  ]
}

Response:

{
  "request_id": "req-123",
  "winner": {
    "model": "claude",
    "response": "According to the annual report, PepsiCo reported total revenue of 91.8B for FY2024.",
    "score": 0.85,
    "selection_method": "score_winner"
  },
  "all_scores": {
    "gpt": {
      "final": 0.82,
      "faithfulness": 0.9,
      "context_precision": 0.85,
      "context_recall": 0.8,
      "relevance": 0.95,
      "hallucination_risk": 0.1
    },
    "claude": {
      "final": 0.85,
      "faithfulness": 0.95,
      "context_precision": 0.9,
      "context_recall": 0.85,
      "relevance": 0.9,
      "hallucination_risk": 0.05
    }
  },
  "trace_id": "abc-123-def-456",
  "evaluated_at": "2024-01-15T10:30:00Z",
  "latency_ms": 2340
}
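
The endpoint can be called from Python without extra dependencies. The sketch below uses the standard library `urllib.request` and assumes the server is running locally on port 8080 as described above, and that no authentication header is required; neither the helper names nor the timeout value come from the package.

```python
import json
import urllib.request


def build_evaluate_request(
    payload: dict, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build a POST request for /api/v1/evaluate with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(
    payload: dict, base_url: str = "http://localhost:8080", timeout: float = 30.0
) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_evaluate_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `evaluate(payload)["winner"]["model"]` would return the selected model name, where `payload` is a dict shaped like the request body shown above.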

Scoring Formula

The final score is calculated using the following weighted formula:

Final Score =
  (0.35 × faithfulness)
+ (0.25 × context_recall)
+ (0.20 × relevance)
+ (0.20 × context_precision)
- (0.30 × hallucination_risk)
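
The formula can be written out directly in Python. This is a sketch of the arithmetic only, not the package's own implementation; the names `WEIGHTS`, `HALLUCINATION_PENALTY`, and `final_score` are hypothetical, and the metric keys follow the API response fields.

```python
from typing import Mapping

# Weights and penalty taken from the formula above.
WEIGHTS = {
    "faithfulness": 0.35,
    "context_recall": 0.25,
    "relevance": 0.20,
    "context_precision": 0.20,
}
HALLUCINATION_PENALTY = 0.30


def final_score(metrics: Mapping[str, float]) -> float:
    """Weighted sum of the four positive metrics minus the hallucination penalty."""
    weighted = sum(weight * metrics[name] for name, weight in WEIGHTS.items())
    return weighted - HALLUCINATION_PENALTY * metrics["hallucination_risk"]


gpt_metrics = {
    "faithfulness": 0.9,
    "context_precision": 0.85,
    "context_recall": 0.8,
    "relevance": 0.95,
    "hallucination_risk": 0.1,
}
print(round(final_score(gpt_metrics), 3))  # 0.845
```

The four positive weights sum to 1.0, so with every metric in the range 0 to 1 the result lies between -0.30 and 1.00. Applied to the metrics in the sample response, the formula gives 0.845 for "gpt" and 0.89 for "claude", so the "final" values shown in that sample (0.82 and 0.85) appear to be illustrative rather than computed from this formula.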

Tie-Breaking

When the difference between the top two scores is less than 0.05, the LLM Judge is invoked to compare the two answers based on:

Accuracy
Completeness
Grounding
Clarity
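
The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions: the judge is modeled as any callable that takes the two top model names and returns the preferred one, and the "llm_judge" label is hypothetical (only "score_winner" appears in the sample response). The package's actual WinnerSelector may behave differently.

```python
from typing import Callable, Mapping

TIE_THRESHOLD = 0.05  # from the Tie-Breaking rule above


def select_winner(
    final_scores: Mapping[str, float],
    judge: Callable[[str, str], str],
    threshold: float = TIE_THRESHOLD,
) -> tuple[str, str]:
    """Return (winning model, selection method) for the given final scores."""
    if not final_scores:
        raise ValueError("at least one candidate score is required")
    # Rank model names from highest to lowest final score.
    ranked = sorted(final_scores, key=final_scores.get, reverse=True)
    if len(ranked) == 1:
        return ranked[0], "score_winner"
    first, second = ranked[0], ranked[1]
    if final_scores[first] - final_scores[second] < threshold:
        # Scores are too close to call, so defer to the judge.
        return judge(first, second), "llm_judge"
    return first, "score_winner"


def prefer_second(first: str, second: str) -> str:
    """Stub judge used only for illustration; a real judge would call an LLM."""
    return second


print(select_winner({"a": 0.90, "b": 0.80}, prefer_second))  # ('a', 'score_winner')
print(select_winner({"a": 0.84, "b": 0.82}, prefer_second))  # ('b', 'llm_judge')
```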

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .

# Lint
ruff check .

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

Release history

0.1.1 (this version): Jun 8, 2026
0.1.0: Jun 8, 2026

Download files

Source Distribution

evaluator_service-0.1.1.tar.gz (14.4 kB), uploaded Jun 8, 2026, Source

Built Distribution

evaluator_service-0.1.1-py3-none-any.whl (17.8 kB), uploaded Jun 8, 2026, Python 3

File details

Details for the file evaluator_service-0.1.1.tar.gz.

File metadata

Download URL: evaluator_service-0.1.1.tar.gz
Upload date: Jun 8, 2026
Size: 14.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for evaluator_service-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`670f9378b5b96244f95833c3208cd7d0e9b57ab303eec77219bbefd06c8f6732`
MD5	`b3969e579dd6aab95d97e5a7939c8253`
BLAKE2b-256	`fe00edee748e3ce854ad84d68e80f6f67af7f9c4efb4ddf886b6be14ae12badf`

File details

Details for the file evaluator_service-0.1.1-py3-none-any.whl.

File metadata

Download URL: evaluator_service-0.1.1-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 17.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for evaluator_service-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7485f7376741c18060a4f2d712ca246abbc30a0af6ab5a58c7bc53674c31f73d`
MD5	`754b2dd37cc62b7f4a09c1f055a058d9`
BLAKE2b-256	`30836489a0c35f4412e5cb51d5512f48f4c03130d55759dba3df79b5ba1db673`
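
The SHA256 digests listed for the two files can be checked against a local download. The sketch below uses the standard library `hashlib`; the helper names are hypothetical and the file path in the usage note is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("evaluator_service-0.1.1-py3-none-any.whl", "7485f7376741c18060a4f2d712ca246abbc30a0af6ab5a58c7bc53674c31f73d")` should return True for an unmodified copy of the wheel.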
