# Model Evaluation Library

End-to-end LLM evaluation framework with synthetic data generation and multi-cloud support.

A production-grade Python framework for end-to-end evaluation of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.
This library provides a modular architecture for:
- Generating synthetic evaluation datasets from private documentation.
- Evaluating models using "LLM-as-a-Judge" and deterministic metrics.
- Benchmarking RAG retrieval accuracy and generation quality.
- Comparing models with statistical and narrative reporting.
## Installation

```bash
pip install model-evaluation
```
## Quick Start

Run a complete evaluation pipeline: generate a dataset, define metrics, and evaluate a target model.
```python
import asyncio

from model_evaluation import ModelFactory, EvaluationEngine, DatasetGenerator


async def main():
    # 1. Set up secure connectors (auto-infers provider from model name)
    # Ensure the OPENAI_API_KEY environment variable is set
    judge = ModelFactory.create_auto("gpt-4o")
    target = ModelFactory.create_auto("gpt-4o-mini")

    # 2. Generate synthetic test data from your documentation
    generator = DatasetGenerator(
        model_connector=judge,
        prompt_template="Create technical interview questions based on: {chunk}"
    )
    # Supports PDF, Markdown, and text files
    dataset = await generator.agenerate_from_file("docs/architecture.md", num_questions=10)

    # 3. Configure the Evaluation Engine
    engine = EvaluationEngine(
        judge_connector=judge,
        max_concurrency=5,  # Control parallelism
        metrics_config={
            "faithfulness": {"weight": 2.0},  # Assign higher importance
            "answer_relevance": {"weight": 1.0}
        }
    )

    # 4. Execute the evaluation
    results = await engine.aevaluate(
        dataset=dataset,
        target_model=target,
        metrics=["faithfulness", "answer_relevance", "compliance"]
    )

    # 5. Review the results
    print(f"Average Quality Score: {results['average_score']:.2f}")
    for metric, stats in results['metrics'].items():
        print(f"{metric}: {stats['score']:.2f} (min: {stats['min']:.2f})")


if __name__ == "__main__":
    asyncio.run(main())
```
## Advanced Usage & Extensibility

The framework is designed to be easily extended for custom enterprise needs.

### 1. Custom Model Connectors

Integrate with internal or unsupported model providers by subclassing `ModelConnector`.
```python
from model_evaluation.connectors.base import ModelConnector, PredictResult
from model_evaluation.connectors.factory import ModelFactory


class LocalLlamaConnector(ModelConnector):
    async def predict(self, input_text: str, system_prompt: str | None = None) -> PredictResult:
        # Your custom inference logic (e.g., requests to a local vLLM server)
        response_text = call_my_local_model(input_text)
        return PredictResult(output=response_text)


# Register globally
ModelFactory.register("local_llama", LocalLlamaConnector)

# Use in the pipeline
model = ModelFactory.create("local_llama", "llama-3-70b-instruct")
```
### 2. Custom Metrics

Define domain-specific evaluation logic.
```python
from model_evaluation.metrics import BaseLLMJudgeMetric, MetricResult, MetricRegistry


class SecurityComplianceMetric(BaseLLMJudgeMetric):
    @property
    def name(self) -> str:
        return "security_compliance"

    @property
    def prompt_template(self) -> str:
        return """
        Analyze if the answer reveals internal IP addresses or secrets.
        Context: {context}
        Answer: {answer}
        """

    def _parse_response(self, response: str) -> MetricResult:
        # Custom parsing logic
        return MetricResult(score=1.0, reasoning="No secrets found.")


# Register and use
MetricRegistry.register(SecurityComplianceMetric)
results = await engine.aevaluate(..., metrics=["security_compliance"])
```
### 3. RAG Evaluation (Retrieval + Generation)

Assess the retrieval component separately from the generation quality.
```python
from model_evaluation.metrics import RecallAtK, PrecisionAtK

# Measure if relevant chunks were retrieved
recall_metric = RecallAtK(k=5)
score = await recall_metric.evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)

# PrecisionAtK mirrors the same interface for precision@k
precision = await PrecisionAtK(k=5).evaluate(
    context=ground_truth_document,
    retrieved_chunks=retrieved_snippets
)
```
### 4. Model Comparison & Narrative Reporting

Compare two models side by side with an AI-generated summary.
```python
from model_evaluation.reporting import NarrativeReporter

# Get results from two different models
results_a = await engine.aevaluate(dataset, model_a)
results_b = await engine.aevaluate(dataset, model_b)

# Generate a textual comparison (judge is the connector from the Quick Start)
reporter = NarrativeReporter(judge)
report = await reporter.generate_comparison(
    results_a, results_b,
    model_a_name="GPT-4",
    model_b_name="Llama-3"
)
print(report)
```
## Core Features
- Multi-Provider Support: Unified interface for OpenAI, Anthropic, Google Vertex AI, Groq, and custom endpoints.
- Intelligent Filtering: Automatically applies relevant metrics based on item type (e.g., CodeBLEU runs only on `type="code"` items).
- Cost Estimation: Calculate expected token usage and costs before running large evaluations (see the sketch after this list).
- Concurrency Control: Reliable async execution with semaphores to respect API rate limits.
- Schema Validation: Strict typing for inputs and outputs ensures pipeline reliability.
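
As a sketch of the cost-estimation workflow, assuming an entry point on the engine (the `estimate_cost` method and its return keys below are hypothetical illustrations, not a confirmed API):

```python
# Hypothetical sketch: `estimate_cost` and its return keys are assumptions
# for illustration; check the package for the actual entry point.
estimate = engine.estimate_cost(
    dataset=dataset,
    target_model=target,
    metrics=["faithfulness", "answer_relevance"]
)
print(f"Estimated tokens: {estimate['total_tokens']}")
print(f"Estimated cost (USD): {estimate['total_cost']:.2f}")
```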
## API Reference

### Evaluation Item Schema

The input dataset should follow this structure:
| Field | Type | Description |
|---|---|---|
| `question` | `str` | The input prompt or question. |
| `ground_truth` | `str`, optional | Reference answer for comparison. |
| `context` | `str`, optional | Source text for grounding/faithfulness checks. |
| `type` | `str`, optional | `text` or `code`. Controls which metrics run. |
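
For reference, a minimal item using the fields above, shown as a plain dict (a sketch; the library may also accept validated schema objects):

```python
item = {
    "question": "How does the service authenticate internal API calls?",
    "ground_truth": "Requests are signed with short-lived service tokens.",
    "context": "Internal calls use mTLS plus short-lived service tokens issued by the gateway.",
    "type": "text",  # "code" would route the item to code metrics such as codebleu
}
```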
### Available Metrics
| Type | Metric Names |
|---|---|
| Judge | faithfulness, compliance, answer_relevance, conciseness, toxicity, logical_consistency |
| Retrieval | recall, precision, ndcg |
| Similarity | semantic_similarity, codebleu, determinism |
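
Metric names can be mixed across families in a single run; a sketch, assuming (per the intelligent-filtering behavior above) that metrics which don't apply to a given item are skipped:

```python
# Sketch: judge and similarity metrics in one evaluation run.
# codebleu is applied only to items with type="code"; the other
# metrics score the remaining items.
results = await engine.aevaluate(
    dataset=dataset,
    target_model=target,
    metrics=["faithfulness", "semantic_similarity", "codebleu"]
)
```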
## Examples

Check the `examples/` directory for ready-to-run scripts:

- `quickstart_eval.py`: Basic evaluation loop.
- `vertex_rag_eval.py`: RAG evaluation with Google Vertex AI.
- `model_comparison.py`: Comparing two models.
- `cost_estimation.py`: Estimating API costs.
- `custom_gen_prompt.py`: Customizing dataset generation.
- `template_customization.py`: Customizing report outputs.