Project GenEval: A Unified Evaluation Framework for Generative AI Applications

GenEval

A unified evaluation framework for generative AI applications. GenEval provides a standardized interface for evaluating generative AI models across multiple evaluation frameworks, including RAGAS and DeepEval, with profile-driven evaluation for reproducible, reusable quality gates.

Overview

GenEval solves the problem of fragmented evaluation approaches in the RAG ecosystem. Instead of learning different APIs and managing separate evaluation workflows, GenEval provides a single interface that works with multiple evaluation frameworks.

Key benefits:

  • Profile-driven evaluation with reusable profiles, weighted scoring, and pass/fail verdicts
  • Automatic adapter routing -- metrics resolve to the best available framework (RAGAS or DeepEval) without manual tool selection
  • CI/CD integration -- exit codes, JSON output, and policy overrides for pipeline gates
  • Unified API for RAGAS and DeepEval metrics
  • Consistent output format across all frameworks
  • Support for 8 unique evaluation metrics
  • Config-driven LLM management supporting OpenAI, Anthropic, Google Gemini, Ollama, DeepSeek, Amazon Bedrock, Azure OpenAI, and vLLM

Supported Metrics

GenEval supports 8 unique metrics through its metric registry. Each metric automatically resolves to the best available adapter:

Metric                             RAGAS   DeepEval
---------------------------------------------------
faithfulness                       1st     2nd
answer_relevancy                   1st     2nd
context_recall                     1st     2nd
context_precision                  1st     2nd
context_precision_with_reference   yes     --
context_entity_recall              yes     --
noise_sensitivity                  yes     --
context_relevance                  --      yes

When a metric is available in both frameworks, the registry tries the higher-priority adapter first and falls back to the other if it fails.
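
To make the routing concrete, here is a minimal sketch of priority-based fallback. It is illustrative only: the adapter objects and their evaluate signature are hypothetical stand-ins, and the actual resolution logic lives in geneval/metric_registry.py.

# Illustrative sketch of priority-based adapter fallback -- not GenEval's
# real implementation. Adapter classes and signatures are hypothetical.
PRIORITY = {
    "faithfulness": ["ragas", "deepeval"],  # shared metric: RAGAS first
    "context_relevance": ["deepeval"],      # DeepEval-only metric
}

def resolve_and_run(metric, adapters, **inputs):
    """Try each adapter that supports the metric, in priority order."""
    errors = []
    for name in PRIORITY.get(metric, []):
        adapter = adapters.get(name)
        if adapter is None:
            continue
        try:
            return adapter.evaluate(metric, **inputs)
        except Exception as exc:  # fall back to the next adapter
            errors.append((name, exc))
    raise RuntimeError(f"No adapter could evaluate {metric!r}: {errors}")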

Installation

From Source

git clone https://github.com/savitharaghunathan/gen-eval.git
cd gen-eval
uv sync

# Set your API keys
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GOOGLE_API_KEY="your-google-api-key"
export DEEPSEEK_API_KEY="your-deepseek-api-key"

# For vLLM servers
export VLLM_BASE_URL="https://your-vllm-server.com"
export VLLM_API_PATH="/v1"

# For AWS Bedrock (optional)
export AWS_ACCESS_KEY_ID="your-aws-access-key"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-key"

Development Setup

uv sync --dev --all-extras
uv run pre-commit install
make lint
make test

Configuration

GenEval requires an LLM configuration file. API keys are read from environment variables -- never hardcoded.

Create llm_config.yaml:

providers:
  openai:
    enabled: true
    default: true
    api_key_env: "OPENAI_API_KEY"
    model: "gpt-4o-mini"

  anthropic:
    enabled: true
    default: false
    api_key_env: "ANTHROPIC_API_KEY"
    model: "claude-3-5-haiku-20241022"

  gemini:
    enabled: true
    default: false
    api_key_env: "GOOGLE_API_KEY"
    model: "gemini-1.5-flash"

  ollama:
    enabled: true
    default: false
    base_url: "http://localhost:11434"
    model: "llama3.2"

  deepseek:
    enabled: true
    default: false
    api_key_env: "DEEPSEEK_API_KEY"
    model: "deepseek-chat"

  vllm:
    enabled: true
    default: false
    base_url_env: "VLLM_BASE_URL"
    api_path_env: "VLLM_API_PATH"
    api_key_env: "OPENAI_API_KEY"
    model: "your-model-name"
    ssl_verify: false

  amazon_bedrock:
    enabled: true
    default: false
    model: "anthropic.claude-3-sonnet-20240229-v1:0"
    region_name: "us-east-1"

  azure_openai:
    enabled: true
    default: false
    model: "gpt-4"
    deployment_name: "your-deployment-name"
    azure_openai_api_key: "your-azure-openai-api-key"
    openai_api_version: "2025-01-01-preview"
    azure_endpoint: "https://your-resource.openai.azure.com/"

settings:
  temperature: 0.1
  max_tokens: 1000
  timeout: 30

Important: Only one provider should have default: true.

A sample config is available at config/llm_config.yaml.
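
Before running an evaluation, it can be worth checking the single-default rule programmatically. A minimal sketch using PyYAML (not part of GenEval; it only assumes the file layout shown above):

# Minimal sketch: verify exactly one enabled provider is marked default.
import yaml

with open("llm_config.yaml") as f:
    config = yaml.safe_load(f)

defaults = [
    name
    for name, cfg in config["providers"].items()
    if cfg.get("enabled") and cfg.get("default")
]
if len(defaults) != 1:
    raise SystemExit(f"expected exactly one default provider, found: {defaults}")
print(f"default provider: {defaults[0]}")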

Profile-Driven Evaluation

Profiles let you define evaluation standards once and reuse them across teams, workflows, and CI pipelines.

What's in a Profile

A profile specifies:

  • Metrics -- which metrics to evaluate
  • Weights -- how much each metric contributes to the composite score (must sum to 1.0; see the sketch after this list)
  • Criteria -- minimum threshold for each individual metric
  • Composite threshold -- minimum weighted composite score to pass
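
For example, with weights of 0.5/0.3/0.2 and metric scores of 0.96/0.88/0.82, the composite is 0.908. A minimal sketch of the calculation with those illustrative numbers (the real scoring lives in geneval/profile_manager.py):

# Illustrative composite-score arithmetic -- the numbers are made up.
weights = {"faithfulness": 0.5, "answer_relevancy": 0.3, "context_recall": 0.2}
scores  = {"faithfulness": 0.96, "answer_relevancy": 0.88, "context_recall": 0.82}

composite = sum(weights[m] * scores[m] for m in weights)
print(composite)  # 0.5*0.96 + 0.3*0.88 + 0.2*0.82 = 0.908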

Built-in Profiles

GenEval ships with three profiles:

Profile       Metrics                                           Composite Threshold   Use Case
------------------------------------------------------------------------------------------------------
rag_default   faithfulness, answer_relevancy, context_recall    0.70                  General RAG evaluation
strict        faithfulness, answer_relevancy, context_recall,   0.85                  Production deployments
              context_precision
smoke_test    faithfulness, answer_relevancy                    0.50                  Quick sanity checks

Custom Profiles

Create a profiles.yaml file:

profiles:
  medical_rag:
    description: "High-accuracy medical Q&A evaluation"
    metrics:
      - faithfulness
      - answer_relevancy
      - context_recall
    weights:
      faithfulness: 0.5
      answer_relevancy: 0.3
      context_recall: 0.2
    criteria:
      faithfulness: 0.95
      answer_relevancy: 0.85
      context_recall: 0.80
    composite_threshold: 0.90

Policies

Policies let you create runtime variations of a profile without redefining metrics or weights. Only criteria and composite_threshold can be overridden:

profiles:
  rag_default:
    metrics: [faithfulness, answer_relevancy, context_recall]
    weights: {faithfulness: 0.4, answer_relevancy: 0.3, context_recall: 0.3}
    criteria: {faithfulness: 0.7, answer_relevancy: 0.7, context_recall: 0.7}
    composite_threshold: 0.7

policies:
  ci_gate:
    profile: rag_default
    overrides:
      criteria:
        faithfulness: 0.9
      composite_threshold: 0.85

  nightly:
    profile: rag_default
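
Conceptually, resolving a policy is a copy of the base profile with only the permitted fields overlaid. A minimal sketch of that idea, assuming profiles and policies dicts parsed from the YAML above (not the library's actual implementation):

# Illustrative sketch of policy resolution: copy the base profile, then
# overlay only the overridable fields (criteria, composite_threshold).
import copy

def apply_policy(profiles, policies, policy_name):
    policy = policies[policy_name]
    resolved = copy.deepcopy(profiles[policy["profile"]])
    overrides = policy.get("overrides", {})
    resolved["criteria"].update(overrides.get("criteria", {}))
    if "composite_threshold" in overrides:
        resolved["composite_threshold"] = overrides["composite_threshold"]
    return resolved

# With the YAML above, ci_gate keeps rag_default's metrics and weights but
# raises faithfulness to 0.9 and the composite threshold to 0.85; nightly
# resolves to rag_default unchanged.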

Pass/Fail Logic

A test case passes when both conditions are met (a small check is sketched after this list):

  1. Every metric score meets its individual criteria threshold
  2. The weighted composite score meets the composite threshold
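
A minimal sketch of that rule, assuming plain dicts of scores, criteria, and weights (not the library's internals):

# Illustrative pass/fail check.
def passes(scores, criteria, weights, composite_threshold):
    # Condition 1: every metric clears its own threshold.
    per_metric_ok = all(scores[m] >= criteria[m] for m in criteria)
    # Condition 2: the weighted composite clears the composite threshold.
    composite = sum(weights[m] * scores[m] for m in weights)
    return per_metric_ok and composite >= composite_threshold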

CLI Usage

Run Evaluation

# Evaluate with a profile
geneval evaluate --profile rag_default --data test_data.yaml --config llm_config.yaml

# Evaluate with a policy
geneval evaluate --policy ci_gate --data test_data.yaml --profiles profiles.yaml

# Output as table
geneval evaluate --profile strict --data test_data.yaml --format table

# Write results to file
geneval evaluate --profile rag_default --data test_data.yaml --output results.json

The CLI exits with code 0 if all test cases pass, code 1 if any fail.

Manage Profiles

# List all available profiles
geneval profiles list

# List with custom profiles
geneval profiles list --profiles profiles.yaml

# Show profile details
geneval profiles show rag_default

# Validate a profiles file
geneval profiles validate profiles.yaml

Test Data Format

test_cases:
  - id: "test_001"
    user_input: "What is the capital of France?"
    retrieved_contexts: "France is a country in Europe. Its capital city is Paris."
    response: "Paris is the capital of France."
    reference: "Paris"

  - id: "test_002"
    user_input: "What is 2+2?"
    retrieved_contexts: "Basic arithmetic: 2+2 equals 4."
    response: "2+2 equals 4."
    reference: "4"

Python API

Profile-Based Evaluation

from geneval import GenEvalFramework

framework = GenEvalFramework(config_path="llm_config.yaml")

# Single test case
result = framework.evaluate_profile(
    profile="rag_default",
    question="What is the capital of France?",
    response="Paris is the capital of France.",
    reference="Paris",
    retrieval_context="France is a country in Europe. Its capital city is Paris.",
)

print(f"Passed: {result.overall_passed}")
print(f"Composite score: {result.composite_score:.2f}")
for mr in result.metric_results:
    print(f"  {mr.name}: {mr.score:.2f} (threshold: {mr.threshold}, passed: {mr.passed})")

# Batch evaluation
batch = framework.evaluate_profile_batch(
    data_path="test_data.yaml",
    profile="strict",
)

print(f"Pass rate: {batch.pass_rate:.0%}")
print(f"Overall: {'PASSED' if batch.overall_passed else 'FAILED'}")

Direct Metric Evaluation

from geneval import GenEvalFramework

framework = GenEvalFramework(config_path="llm_config.yaml")

# Evaluate with explicit metrics (runs on all adapters that support them)
results = framework.evaluate(
    question="What is the capital of France?",
    response="Paris is the capital of France.",
    reference="Paris",
    retrieval_context="France is a country in Europe. Its capital city is Paris.",
    metrics=["faithfulness", "answer_relevancy", "context_precision"]
)
print(results)

CI/CD Integration

GenEval is designed for pipeline integration. Use profiles to set quality gates:

# .github/workflows/eval.yml
- name: Run evaluation
  run: |
    geneval evaluate \
      --profile strict \
      --data tests/eval_data.yaml \
      --config config/llm_config.yaml \
      --output eval_results.json

The exit code (0 = pass, 1 = fail) integrates directly with CI systems.

Use policies to vary thresholds per environment without changing the profile:

# Strict gate for production
geneval evaluate --policy prod_gate --data test_data.yaml

# Relaxed gate for development
geneval evaluate --policy dev_gate --data test_data.yaml

Output Format

Profile Evaluation (JSON)

{
  "profile_name": "rag_default",
  "policy_name": null,
  "overall_passed": true,
  "composite_score": 0.87,
  "composite_threshold": 0.7,
  "composite_passed": true,
  "metric_results": [
    {
      "name": "faithfulness",
      "score": 0.95,
      "threshold": 0.7,
      "passed": true,
      "weight": 0.4,
      "weighted_score": 0.38,
      "adapter": "ragas",
      "details": "High faithfulness"
    }
  ],
  "metadata": {
    "timestamp": "2026-05-01T12:00:00+00:00"
  }
}
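
Because the shape is stable, downstream tooling can consume the file directly. A small sketch that summarizes a single-case results file written with --output (field names as shown above; the filename is whatever you passed on the command line):

# Minimal sketch: summarize a GenEval results file.
import json

with open("eval_results.json") as f:
    result = json.load(f)

print(f"{result['profile_name']}: composite {result['composite_score']:.2f} "
      f"(threshold {result['composite_threshold']})")
for mr in result["metric_results"]:
    status = "PASS" if mr["passed"] else "FAIL"
    print(f"  [{status}] {mr['name']}: {mr['score']:.2f} via {mr['adapter']}")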

Table Output

Profile: rag_default
Pass Rate: 100% (5/5)

Metric                         Mean Score   Threshold
------------------------------------------------------
faithfulness                   0.9200       0.70
answer_relevancy               0.8800       0.70
context_recall                 0.8500       0.70

Overall: PASSED

Project Structure

gen-eval/
├── pyproject.toml
├── README.md
├── Makefile
├── config/
│   └── llm_config.yaml
├── geneval/
│   ├── __init__.py
│   ├── schemas.py              # Pydantic models (Input, Output, ProfileResult, BatchResult)
│   ├── framework.py            # Main evaluation framework
│   ├── llm_manager.py          # LLM provider management
│   ├── profile_manager.py      # Profile/policy loading, validation, scoring
│   ├── metric_registry.py      # Metric-to-adapter resolution
│   ├── exceptions.py           # ProfileValidationError, UnknownMetricError, ProfileNotFoundError
│   ├── cli.py                  # Click CLI (evaluate, profiles list/show/validate)
│   ├── normalization.py        # Score normalization utilities
│   ├── profiles/
│   │   └── default_profiles.yaml  # Built-in profiles (rag_default, strict, smoke_test)
│   └── adapters/
│       ├── ragas_adapter.py    # RAGAS integration
│       └── deepeval_adapter.py # DeepEval integration
├── tests/
│   ├── test_framework.py
│   ├── test_ragas_adapter.py
│   ├── test_deepeval_adapter.py
│   ├── test_llm_manager.py
│   ├── test_schemas.py
│   ├── test_schemas_profile.py
│   ├── test_profile_manager.py
│   ├── test_metric_registry.py
│   ├── test_cli.py
│   ├── test_data.yaml
│   └── test_data_clean.yaml
└── uv.lock

Development

Code Quality

  • Black -- Code formatting
  • isort -- Import sorting
  • Ruff -- Fast linting
  • pre-commit -- Git hooks for automated checks

Available Commands

make install       # Install dependencies
make dev           # Install pre-commit hooks
make format        # Format code
make lint          # Run linting
make test          # Run tests with coverage
make test-ci       # Run tests without coverage (faster)
make pre-commit    # Run all hooks on all files
make pre-push      # Format + lint before pushing
make build         # Build the package
make security      # Run security checks
make clean         # Clean generated files

Running Tests

uv run pytest tests/ -v
uv run pytest tests/ --cov=geneval --cov-report=term-missing
uv run pytest tests/test_profile_manager.py -v
uv run pytest tests/test_ragas_adapter.py -v

Contributing

  1. Fork and clone the repo
  2. Set up dev environment: make install && make dev
  3. Create a feature branch: git checkout -b feature/your-feature
  4. Make changes, add tests, verify: make test && make lint
  5. Push and create a pull request

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gen_eval-0.0.2.tar.gz (201.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gen_eval-0.0.2-py3-none-any.whl (29.3 kB view details)

Uploaded Python 3

File details

Details for the file gen_eval-0.0.2.tar.gz.

File metadata

  • Download URL: gen_eval-0.0.2.tar.gz
  • Upload date:
  • Size: 201.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gen_eval-0.0.2.tar.gz
Algorithm Hash digest
SHA256 711d3340ef7d5c5844de715f8ac60f9fab04d2ffb7c0469f70ebb0e083cbd0ad
MD5 0983730996ca01f574f7d14dbb96a924
BLAKE2b-256 a13e6d1d514f7a6865c535b84c07fc708beb36c8520cf9303cf7fedca06ac589

See more details on using hashes here.

Provenance

The following attestation bundles were made for gen_eval-0.0.2.tar.gz:

Publisher: release.yml on savitharaghunathan/gen-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gen_eval-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: gen_eval-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 29.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gen_eval-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 0b269cac39c78127ee0ee82fb5fd0564a707dab2c78d081ea067a12276d19bbd
MD5 41860898f6217fe56098fb3537c67077
BLAKE2b-256 b7ca1435d9585472e6002c04e4c3e011dceeeb12714ecb1e24dece3ce2957651

See more details on using hashes here.

Provenance

The following attestation bundles were made for gen_eval-0.0.2-py3-none-any.whl:

Publisher: release.yml on savitharaghunathan/gen-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page