
End-to-end LLM evaluation framework with synthetic data generation and multi-cloud support


Model Evaluation Library

A production-grade Python framework for end-to-end evaluation of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.

This library provides a modular architecture for:

  • Generating synthetic evaluation datasets from private documentation.
  • Evaluating models using "LLM-as-a-Judge" and deterministic metrics.
  • Benchmarking RAG retrieval accuracy and generation quality.
  • Comparing models with statistical and narrative reporting.

Installation

pip install model-evaluation

Quick Start

Run a complete evaluation pipeline: generate a dataset, define metrics, and evaluate a target model.

import asyncio
from model_evaluation import ModelFactory, EvaluationEngine, DatasetGenerator

async def main():
    # 1. Set up connectors (the provider is auto-inferred from the model name)
    # Ensure OPENAI_API_KEY environment variable is set
    judge = ModelFactory.create_auto("gpt-4o")
    target = ModelFactory.create_auto("gpt-4o-mini")

    # 2. Generate synthetic test data from your documentation
    generator = DatasetGenerator(
        model_connector=judge,
        prompt_template="Create technical interview questions based on: {chunk}"
    )
    # Supports PDF, Markdown, and text files
    dataset = await generator.agenerate_from_file("docs/architecture.md", num_questions=10)

    # 3. Configure the Evaluation Engine
    engine = EvaluationEngine(
        judge_connector=judge,
        max_concurrency=5,  # Control parallelism
        metrics_config={
            "faithfulness": {"weight": 2.0},  # Assign higher importance
            "answer_relevance": {"weight": 1.0}
        }
    )

    # 4. Execute Evaluation
    results = await engine.aevaluate(
        dataset=dataset,
        target_model=target,
        metrics=["faithfulness", "answer_relevance", "compliance"]
    )

    # 5. Review Results
    print(f"Average Quality Score: {results['average_score']:.2f}")
    
    for metric, stats in results['metrics'].items():
        print(f"{metric}: {stats['score']:.2f} (min: {stats['min']:.2f})")

if __name__ == "__main__":
    asyncio.run(main())

Advanced Usage & Extensibility

The framework is designed to be easily extended for custom enterprise needs.

1. Custom Model Connectors

Integrate with internal or unsupported model providers by subclassing ModelConnector.

from model_evaluation.connectors.base import ModelConnector, PredictResult
from model_evaluation.connectors.factory import ModelFactory

class LocalLlamaConnector(ModelConnector):
    async def predict(self, input_text: str, system_prompt: str | None = None) -> PredictResult:
        # Your custom inference logic (e.g., requests to local vLLM)
        response_text = call_my_local_model(input_text)
        return PredictResult(output=response_text)

# Register globally
ModelFactory.register("local_llama", LocalLlamaConnector)

# Use in pipeline
model = ModelFactory.create("local_llama", "llama-3-70b-instruct")
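
The registered connector can then be passed anywhere a built-in connector is accepted, for example as the evaluation target. A minimal sketch, reusing the aevaluate call from the Quick Start (engine and dataset are assumed to be configured as shown there):

# Evaluate the locally hosted model with the engine and dataset
# configured in the Quick Start example.
results = await engine.aevaluate(
    dataset=dataset,
    target_model=model,
    metrics=["faithfulness", "answer_relevance"]
)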

2. Custom Metrics

Define domain-specific evaluation logic.

from model_evaluation.metrics import BaseLLMJudgeMetric, MetricResult, MetricRegistry

class SecurityComplianceMetric(BaseLLMJudgeMetric):
    @property
    def name(self) -> str:
        return "security_compliance"

    @property
    def prompt_template(self) -> str:
        return """
        Analyze if the answer reveals internal IP addresses or secrets.
        Context: {context}
        Answer: {answer}
        """

    def _parse_response(self, response: str) -> MetricResult:
        # Custom parsing logic (placeholder: always returns a passing score)
        return MetricResult(score=1.0, reasoning="No secrets found.")

# Register and use
MetricRegistry.register(SecurityComplianceMetric)
results = await engine.aevaluate(..., metrics=["security_compliance"])

3. RAG Evaluation (Retrieval + Generation)

Assess the retrieval component separately from the generation quality.

from model_evaluation.metrics import RecallAtK, PrecisionAtK

# Measure if relevant chunks were retrieved
recall_metric = RecallAtK(k=5)
score = await recall_metric.evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)
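
PrecisionAtK is imported above but not demonstrated. Assuming it mirrors the RecallAtK.evaluate signature shown here (an assumption, not confirmed by this README), a companion check might look like this:

# Sketch: measure how many of the retrieved chunks are actually relevant.
# Assumes PrecisionAtK.evaluate takes the same arguments as RecallAtK.evaluate.
precision_metric = PrecisionAtK(k=5)
precision = await precision_metric.evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)
print(f"recall@5: {score:.2f}, precision@5: {precision:.2f}")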

4. Model Comparison & Narrative Reporting

Compare two models side-by-side with an AI-generated summary.

from model_evaluation.reporting import NarrativeReporter

# Get results from two different models
results_a = await engine.aevaluate(dataset, model_a)
results_b = await engine.aevaluate(dataset, model_b)

# Generate textual comparison
reporter = NarrativeReporter(judge_llm)
report = await reporter.generate_comparison(
    results_a, results_b, 
    model_a_name="GPT-4", 
    model_b_name="Llama-3"
)
print(report)

Core Features

  • Multi-Provider Support: Unified interface for OpenAI, Anthropic, Google Vertex AI, Groq, and custom endpoints.
  • Intelligent Filtering: Automatically applies relevant metrics based on item type (e.g., CodeBLEU runs only on type="code" items).
  • Cost Estimation: Calculate expected token usage and costs before running large evaluations (see the sketch after this list).
  • Concurrency Control: Reliable async execution with semaphores to respect API rate limits.
  • Schema Validation: Strict typing for inputs and outputs ensures pipeline reliability.
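
The cost-estimation API itself is not shown in this README. Purely as a hypothetical sketch of the intent (the estimate_cost method name and its return fields are assumptions, not confirmed library API), a pre-flight check might look like the following, reusing the engine, dataset, and target from the Quick Start; see examples/cost_estimation.py for the actual usage shipped with the package:

# Hypothetical sketch only: `estimate_cost` and its return fields are
# assumed names, not documented API.
estimate = engine.estimate_cost(dataset=dataset, target_model=target)
print(f"Estimated tokens: {estimate['total_tokens']}")
print(f"Estimated cost (USD): {estimate['total_cost_usd']:.2f}")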

API Reference

Evaluation Item Schema

The input dataset should follow this structure:

Field          Type            Description
question       str             The input prompt or question.
ground_truth   str, optional   Reference answer for comparison.
context        str, optional   Source text for grounding/faithfulness checks.
type           str, optional   "text" or "code"; controls which metrics run.
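
For illustration, assuming items are supplied as plain dictionaries (the exact container type is not shown in this README), two entries conforming to the schema above might look like this:

# Illustrative items following the schema above (plain dicts assumed;
# the library's actual item/container types are not documented here).
dataset_items = [
    {
        "question": "What does the EvaluationEngine's max_concurrency setting control?",
        "ground_truth": "The number of evaluation requests executed in parallel.",
        "context": "EvaluationEngine uses a semaphore to cap concurrent judge calls.",
        "type": "text",
    },
    {
        "question": "Write a function that returns the nth Fibonacci number.",
        "ground_truth": "def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)",
        "type": "code",  # triggers code-specific metrics such as CodeBLEU
    },
]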

Available Metrics

Type         Metric Names
Judge        faithfulness, compliance, answer_relevance, conciseness, toxicity, logical_consistency
Retrieval    recall, precision, ndcg
Similarity   semantic_similarity, codebleu, determinism

Examples

Check the examples/ directory for ready-to-run scripts:

  • quickstart_eval.py: Basic evaluation loop.
  • vertex_rag_eval.py: RAG evaluation with Google Vertex AI.
  • model_comparison.py: Comparing two models.
  • cost_estimation.py: Estimating API costs.
  • custom_gen_prompt.py: Customizing dataset generation.
  • template_customization.py: Customizing report outputs.
