# Model Evaluation Library

End-to-end LLM evaluation framework with synthetic data generation and multi-cloud support.

A production-grade Python framework for end-to-end evaluation of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.
This library provides a modular architecture for:
- Generating synthetic evaluation datasets from private documentation.
- Evaluating models using "LLM-as-a-Judge" and deterministic metrics.
- Benchmarking RAG retrieval accuracy and generation quality.
- Comparing models with statistical and narrative reporting.
## Installation

```bash
pip install model-evaluation
```
## Quick Start

Run a complete evaluation pipeline: generate a dataset, define metrics, and evaluate a target model.
```python
import asyncio

from model_evaluation import ModelFactory, EvaluationEngine, DatasetGenerator


async def main():
    # 1. Set up secure connectors (auto-infers provider from model name)
    # Ensure the OPENAI_API_KEY environment variable is set
    judge = ModelFactory.create_auto("gpt-4o")
    target = ModelFactory.create_auto("gpt-4o-mini")

    # 2. Generate synthetic test data from your documentation
    generator = DatasetGenerator(
        model_connector=judge,
        prompt_template="Create technical interview questions based on: {chunk}"
    )
    # Supports PDF, Markdown, and text files
    dataset = await generator.agenerate_from_file("docs/architecture.md", num_questions=10)

    # 3. Configure the Evaluation Engine
    engine = EvaluationEngine(
        judge_connector=judge,
        max_concurrency=5,  # Control parallelism
        metrics_config={
            "faithfulness": {"weight": 2.0},  # Assign higher importance
            "answer_relevance": {"weight": 1.0}
        }
    )

    # 4. Execute the evaluation
    results = await engine.aevaluate(
        dataset=dataset,
        target_model=target,
        metrics=["faithfulness", "answer_relevance", "compliance"]
    )

    # 5. Review the results
    print(f"Average Quality Score: {results['average_score']:.2f}")
    for metric, stats in results['metrics'].items():
        print(f"{metric}: {stats['score']:.2f} (min: {stats['min']:.2f})")


if __name__ == "__main__":
    asyncio.run(main())
```
## Advanced Usage & Extensibility

The framework is designed to be easily extended for custom enterprise needs.

### 1. Custom Model Connectors

Integrate with internal or unsupported model providers by subclassing `ModelConnector`.
```python
from model_evaluation.connectors.base import ModelConnector, PredictResult
from model_evaluation.connectors.factory import ModelFactory


class LocalLlamaConnector(ModelConnector):
    async def predict(self, input_text: str, system_prompt: str | None = None) -> PredictResult:
        # Your custom inference logic (e.g., requests to a local vLLM server)
        response_text = call_my_local_model(input_text)
        return PredictResult(output=response_text)


# Register globally
ModelFactory.register("local_llama", LocalLlamaConnector)

# Use in the pipeline
model = ModelFactory.create("local_llama", "llama-3-70b-instruct")
```
### 2. Custom Metrics

Define domain-specific evaluation logic.
```python
from model_evaluation.metrics import BaseLLMJudgeMetric, MetricResult, MetricRegistry


class SecurityComplianceMetric(BaseLLMJudgeMetric):
    @property
    def name(self) -> str:
        return "security_compliance"

    @property
    def prompt_template(self) -> str:
        return """
        Analyze if the answer reveals internal IP addresses or secrets.
        Context: {context}
        Answer: {answer}
        """

    def _parse_response(self, response: str) -> MetricResult:
        # Custom parsing logic
        return MetricResult(score=1.0, reasoning="No secrets found.")


# Register and use
MetricRegistry.register(SecurityComplianceMetric)
results = await engine.aevaluate(..., metrics=["security_compliance"])
```
### 3. RAG Evaluation (Retrieval + Generation)

Assess the retrieval component separately from the generation quality.
```python
from model_evaluation.metrics import RecallAtK, PrecisionAtK

# Measure if relevant chunks were retrieved
recall_metric = RecallAtK(k=5)
score = await recall_metric.evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)

# PrecisionAtK mirrors the same interface for precision@k
precision = await PrecisionAtK(k=5).evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)
```
### 4. Model Comparison & Narrative Reporting

Compare two models side by side with an AI-generated summary.
```python
from model_evaluation.reporting import NarrativeReporter

# Get results from two different models
results_a = await engine.aevaluate(dataset, model_a)
results_b = await engine.aevaluate(dataset, model_b)

# Generate a textual comparison (judge is the connector from the Quick Start)
reporter = NarrativeReporter(judge)
report = await reporter.generate_comparison(
    results_a, results_b,
    model_a_name="GPT-4",
    model_b_name="Llama-3"
)
print(report)
```
## Core Features
- Multi-Provider Support: Unified interface for OpenAI, Anthropic, Google Vertex AI, Groq, and custom endpoints.
- Intelligent Filtering: Automatically applies relevant metrics based on item type (e.g., CodeBLEU runs only on `type="code"` items).
- Cost Estimation: Calculate expected token usage and costs before running large evaluations (see the sketch after this list).
- Concurrency Control: Reliable async execution with semaphores to respect API rate limits.
- Schema Validation: Strict typing for inputs and outputs ensures pipeline reliability.
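
As a sketch of the cost-estimation workflow, assuming an entry point on the engine (the `estimate_cost` method and its return keys below are hypothetical illustrations, not a confirmed API):

```python
# Hypothetical sketch: `estimate_cost` and its return keys are assumptions
# for illustration; check the package for the actual entry point.
estimate = engine.estimate_cost(
    dataset=dataset,
    target_model=target,
    metrics=["faithfulness", "answer_relevance"]
)
print(f"Estimated tokens: {estimate['total_tokens']}")
print(f"Estimated cost (USD): {estimate['total_cost']:.2f}")
```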
## API Reference

### Evaluation Item Schema

The input dataset should follow this structure:
| Field | Type | Description |
|---|---|---|
| `question` | `str` | The input prompt or question. |
| `ground_truth` | `str`, optional | Reference answer for comparison. |
| `context` | `str`, optional | Source text for grounding/faithfulness checks. |
| `type` | `str`, optional | `text` or `code`. Controls which metrics run. |
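
For reference, a minimal item using the fields above, shown as a plain dict (a sketch; the library may also accept validated schema objects):

```python
item = {
    "question": "How does the service authenticate internal API calls?",
    "ground_truth": "Requests are signed with short-lived service tokens.",
    "context": "Internal calls use mTLS plus short-lived service tokens issued by the gateway.",
    "type": "text",  # "code" would route the item to code metrics such as codebleu
}
```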
### Available Metrics
| Type | Metric Names |
|---|---|
| Judge | faithfulness, compliance, answer_relevance, conciseness, toxicity, logical_consistency |
| Retrieval | recall, precision, ndcg |
| Similarity | semantic_similarity, codebleu, determinism |
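
Metric names can be mixed across families in a single run; a sketch, assuming (per the intelligent-filtering behavior above) that metrics which don't apply to a given item are skipped:

```python
# Sketch: judge and similarity metrics in one evaluation run.
# codebleu is applied only to items with type="code"; the other
# metrics score the remaining items.
results = await engine.aevaluate(
    dataset=dataset,
    target_model=target,
    metrics=["faithfulness", "semantic_similarity", "codebleu"]
)
```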
## Examples

Check the `examples/` directory for ready-to-run scripts:

- `quickstart_eval.py`: Basic evaluation loop.
- `vertex_rag_eval.py`: RAG evaluation with Google Vertex AI.
- `model_comparison.py`: Comparing two models.
- `cost_estimation.py`: Estimating API costs.
- `custom_gen_prompt.py`: Customizing dataset generation.
- `template_customization.py`: Customizing report outputs.