teval

Trivial LLM eval - as simple as possible (maybe even a bit more simple)

A lightweight, straightforward evaluation framework for LLM outputs using Yes/No metrics with mandatory and cumulative scoring.

Features

  • Two-tier metric system: Mandatory metrics (all must pass) and cumulative metrics (threshold-based scoring)
  • Simple Yes/No evaluations: Each metric is a binary pass/fail criterion
  • Count-based scoring: Cumulative metrics contribute to a total score based on the number passed
  • LLM integration ready: Generate prompts, JSON schemas, and Pydantic models for structured LLM evaluation
  • Dynamic Pydantic models: Automatically create type-safe Pydantic classes from rubrics
  • Flexible validation: Accepts both JSON strings and dictionaries for LLM response validation
  • Type safety: Full IDE autocomplete and type checking support
  • Minimal dependencies: Requires only Pydantic (>= 2.7.4, < 3.0.0)

Installation

Requirements: Python 3.10 - 3.14

This project uses uv for dependency management:

# Install dependencies
uv sync

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS

How It Works

Evaluation System

The framework uses two types of metrics:

  1. Mandatory Metrics: All must pass (Yes=1) for the evaluation to succeed
  2. Cumulative Metrics: Contribute to a total score (count of passed metrics)

An evaluation passes if:

  • ALL mandatory metrics pass, AND
  • The count of passed cumulative metrics meets or exceeds the passing_score_threshold
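
The pass rule above can be restated in a few lines of plain Python. This is a minimal sketch that does not use the package; the function name `evaluation_passes` and its arguments are hypothetical and only illustrate the logic.

```python
def evaluation_passes(
    mandatory: dict[str, bool],
    cumulative: dict[str, bool],
    passing_score_threshold: int,
) -> bool:
    """Illustrative restatement of the two-tier rule (not the library's code)."""
    # Tier 1: every mandatory metric must be True.
    if not all(mandatory.values()):
        return False
    # Tier 2: count the cumulative metrics that passed and compare to the threshold.
    score = sum(1 for passed in cumulative.values() if passed)
    return score >= passing_score_threshold


# Both mandatory metrics pass and 3 of 4 cumulative metrics pass.
passes_case = evaluation_passes(
    {"M1": True, "M2": True},
    {"C1": True, "C2": True, "C3": False, "C4": True},
    passing_score_threshold=3,
)
print(passes_case)  # True

# One mandatory metric fails, so the cumulative score is irrelevant.
fails_case = evaluation_passes(
    {"M1": True, "M2": False},
    {"C1": True, "C2": True, "C3": True, "C4": True},
    passing_score_threshold=3,
)
print(fails_case)  # False
```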

Example Structure

from tevak import EvaluationRubric, MetricDefinition

rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        # Mandatory metrics (must all pass)
        MetricDefinition(id="M1", rubric="Code compiles without errors", mandatory=True),
        MetricDefinition(id="M2", rubric="No security vulnerabilities detected", mandatory=True),
        # Cumulative metrics (contribute to score count)
        MetricDefinition(id="C1", rubric="Follows project style guide"),
        MetricDefinition(id="C2", rubric="Includes appropriate comments"),
        MetricDefinition(id="C3", rubric="Uses meaningful variable names"),
        MetricDefinition(id="C4", rubric="Has proper error handling"),
    ],
    passing_score_threshold=3  # At least 3 of 4 cumulative metrics must pass
)

# Access mandatory and cumulative metrics via properties
print(f"Mandatory: {len(rubric.mandatory_metrics)}")  # 2
print(f"Cumulative: {len(rubric.cumulative_metrics)}")  # 4

LLM Integration

The framework provides built-in methods to integrate with LLM APIs for automated evaluation.

1. Generate Prompt Text

Use to_prompt_text() to create formatted instructions for LLM evaluators:

prompt_text = rubric.to_prompt_text()
# Returns formatted markdown with:
# - Mandatory criteria (all must pass)
# - Cumulative criteria (with threshold)
# - Clear evaluation instructions

# Use in your LLM prompt:
evaluation_prompt = f"""
Evaluate the following code submission:

{rubric.to_prompt_text()}

Code to evaluate:
{code_to_evaluate}
"""

2. Generate JSON Schema or Pydantic Model

Option A: JSON Schema - Use to_json_schema() for structured LLM outputs:

schema = rubric.to_json_schema()
# Returns OpenAPI/Swagger-compatible JSON Schema with:
# - Boolean fields for each metric
# - Optional reasoning fields (metric_id + "_reasoning")
# - Proper required/optional specifications

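# Example with Gemini:
import google.generativeai as genai

# The model name is only an example; pick any Gemini model that supports JSON output
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    evaluation_prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema
    )
)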

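# Example with OpenAI:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # json_schema response formats need a model with Structured Outputs support
    messages=[{"role": "user", "content": evaluation_prompt}],
    # OpenAI expects the schema wrapped in an object that also carries a name
    # ("evaluation_result" is an arbitrary example name)
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "evaluation_result", "schema": schema},
    }
)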
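
The schema shape described above (one boolean per metric, an optional `<metric_id>_reasoning` string, and metric IDs listed as required) can be illustrated with a hand-built dictionary. This is a sketch based only on that description; `build_schema_sketch` is a hypothetical helper, and the real output of to_json_schema() may contain more keys or differ in detail.

```python
import json


def build_schema_sketch(metric_ids: list[str]) -> dict:
    """Hand-built approximation of the described schema (an assumption, not library output)."""
    properties = {}
    for metric_id in metric_ids:
        # One required Yes/No answer per metric.
        properties[metric_id] = {"type": "boolean"}
        # One optional free-text explanation per metric.
        properties[f"{metric_id}_reasoning"] = {"type": "string"}
    return {
        "type": "object",
        "properties": properties,
        # Only the boolean answers are required; reasoning fields stay optional.
        "required": list(metric_ids),
    }


schema_sketch = build_schema_sketch(["M1", "M2", "C1", "C2"])
print(json.dumps(schema_sketch, indent=2))
```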

Option B: Pydantic Model - Use to_pydantic_model() for type-safe validation (recommended):

ResultModel = rubric.to_pydantic_model()
# Returns a dynamically created Pydantic model class with:
# - Boolean fields for each metric (required)
# - Optional string fields for reasoning
# - Automatic validation and type checking

# Use with libraries that support Pydantic models
# Example with instructor (OpenAI):
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": evaluation_prompt}],
    response_model=ResultModel  # Type-safe response
)

# Direct access to fields with type safety
print(result.M1)  # IDE autocomplete works!
print(result.M1_reasoning)

# Or parse JSON manually (every metric in the rubric is a required field)
json_response = '{"M1": true, "M2": true, "C1": true, "C2": false, "C3": true, "C4": true}'
result = ResultModel.model_validate_json(json_response)

# Export to dict
result_dict = result.model_dump()
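
For readers curious how a model class like this can be produced at runtime, the sketch below uses Pydantic's `create_model`. It shows the general technique under stated assumptions and is not the package's actual implementation; `build_result_model` and `SketchResult` are hypothetical names, and forbidding extra fields is inferred from the behavior described in this README.

```python
from typing import Optional

from pydantic import ConfigDict, ValidationError, create_model


def build_result_model(metric_ids: list[str]):
    """Create a Pydantic model with a required bool and an optional reasoning string per metric."""
    fields = {}
    for metric_id in metric_ids:
        fields[metric_id] = (bool, ...)  # "..." marks the field as required
        fields[f"{metric_id}_reasoning"] = (Optional[str], None)  # defaults to None
    # extra="forbid" makes unknown keys a validation error (an assumption about the real model).
    return create_model("SketchResult", __config__=ConfigDict(extra="forbid"), **fields)


SketchModel = build_result_model(["M1", "C1"])

ok = SketchModel.model_validate_json('{"M1": true, "C1": false}')
print(ok.M1, ok.C1, ok.M1_reasoning)

try:
    SketchModel(M1=True)  # C1 is missing
except ValidationError as exc:
    print(f"rejected with {exc.error_count()} error(s)")
```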

3. Validate LLM Responses

Use validate_result() to check if the evaluation passes (accepts JSON string or dict):

# Option A: Pass JSON string directly
passes = rubric.validate_result(response.text)

# Option B: Pass parsed dictionary
import json
result_dict = json.loads(response.text)
passes = rubric.validate_result(result_dict)

if passes:
    print("✓ Evaluation passed!")
    # All mandatory metrics passed AND
    # Cumulative threshold met
else:
    print("✗ Evaluation failed")
    # Either a mandatory metric failed OR
    # Cumulative threshold not met
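
LLM responses are not always well-formed JSON. How validate_result() reacts to malformed input is not documented here, so a defensive pre-parse step may be useful. The helper below is a hypothetical sketch using only the standard library; `parse_llm_result` is not part of the package.

```python
import json
from typing import Any, Optional, Union


def parse_llm_result(raw: Union[str, dict, None]) -> Optional[dict[str, Any]]:
    """Return the response as a dict, or None if it is not a JSON object."""
    if isinstance(raw, dict):
        return raw
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON, or not a string at all (for example None).
        return None
    # Valid JSON that is not an object (a list, a number) is also unusable here.
    return parsed if isinstance(parsed, dict) else None


print(parse_llm_result('{"M1": true, "C1": false}'))
print(parse_llm_result("Sorry, I cannot evaluate this."))
```

A caller could treat a `None` return as a failed evaluation and pass any other value on to rubric.validate_result().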

Complete Example

from tevak import EvaluationRubric, MetricDefinition
import json

# 1. Define rubric
rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        MetricDefinition(id="M1", rubric="Code compiles", mandatory=True),
        MetricDefinition(id="M2", rubric="No security issues", mandatory=True),
        MetricDefinition(id="C1", rubric="Follows style guide"),
        MetricDefinition(id="C2", rubric="Has tests"),
    ],
    passing_score_threshold=1
)

# 2. Get prompt and schema for LLM
prompt = rubric.to_prompt_text()
schema = rubric.to_json_schema()

# 3. Get LLM evaluation (your API call here)
llm_response = '{"M1": true, "M2": true, "C1": true, "C2": false}'

# 4. Validate result
passes = rubric.validate_result(llm_response)
print(f"Result: {'PASS' if passes else 'FAIL'}")
# Output: Result: PASS
# (Both mandatory metrics passed, 1 of 2 cumulative metrics passed)

See example_usage.py and example_pydantic.py for complete working examples.
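
Because validate_result() returns a single boolean, it can help to see which part of the rule decided the outcome. The following sketch recomputes the breakdown by hand for the response used above; `explain_result` is a hypothetical helper, not a function provided by the package, and it treats a missing metric as not passed.

```python
import json


def explain_result(result: dict, mandatory_ids: list[str],
                   cumulative_ids: list[str], threshold: int) -> dict:
    """Break a result down into failed mandatory metrics and the cumulative score."""
    failed_mandatory = [m for m in mandatory_ids if not result.get(m, False)]
    score = sum(1 for c in cumulative_ids if result.get(c, False))
    return {
        "failed_mandatory": failed_mandatory,
        "cumulative_score": score,
        "threshold": threshold,
        "passes": not failed_mandatory and score >= threshold,
    }


llm_result = json.loads('{"M1": true, "M2": true, "C1": true, "C2": false}')
breakdown = explain_result(llm_result, ["M1", "M2"], ["C1", "C2"], threshold=1)
print(breakdown)
```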

Why Use Pydantic Models?

The to_pydantic_model() approach provides significant advantages over plain JSON schemas:

✅ Type Safety

# Assumes a small rubric with one mandatory metric (M1) and one cumulative metric (C1)
ResultModel = rubric.to_pydantic_model()
result = ResultModel(M1=True, C1=False)

# IDE knows these types:
result.M1  # bool (autocomplete works!)
result.M1_reasoning  # Optional[str]

✅ Automatic Validation

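# Wrong type - caught immediately
# (Pydantic's default mode can coerce strings such as "yes" to True,
# so this example uses a value that cannot be read as a boolean)
result = ResultModel(M1="maybe", C1=False)  # ❌ ValidationError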

# Missing required field - caught immediately
result = ResultModel(M1=True)  # ❌ ValidationError (missing C1)

# Extra fields - rejected
result = ResultModel(M1=True, C1=False, extra="bad")  # ❌ ValidationError

✅ Better Developer Experience

# Direct attribute access (not dict keys)
if result.M1:  # Clear and type-safe
    print(result.M1_reasoning)

# Easy serialization
json_str = result.model_dump_json()
dict_data = result.model_dump(exclude_none=True)

# Easy parsing
result = ResultModel.model_validate_json(llm_response)

✅ Integration with LLM Libraries

# Works seamlessly with instructor
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

# Returns a validated Pydantic instance
result = client.chat.completions.create(
    model="gpt-4",
    response_model=ResultModel,  # ← Type-safe!
    messages=[...]
)

When to Use What

  • Use to_pydantic_model() when:

    • You want type safety and IDE support
    • You're using libraries like instructor, marvin, or langchain
    • You want automatic validation
    • You prefer Python objects over dictionaries
  • Use to_json_schema() when:

    • You need a plain JSON schema for API specifications
    • You're working with non-Python systems
    • You need OpenAPI/Swagger documentation
    • The LLM API only accepts JSON schemas

Project Structure

teval/
├── tevak/                          # Main package
│   ├── __init__.py
│   └── metrics.py                  # Core evaluation framework
├── tests/                          # Test suite
│   ├── __init__.py
│   ├── test_metrics.py             # Unit tests
│   ├── test_integration_vertex_ai.py  # Vertex AI integration tests
│   └── INTEGRATION_TESTS.md        # Integration test documentation
├── example_usage.py                # Complete LLM integration example
├── example_pydantic.py             # Pydantic model examples
├── pyproject.toml                  # Project configuration
├── CLAUDE.md                       # Development guide
└── README.md

Testing

Unit Tests

Run the core framework tests (no external dependencies):

# Run all unit tests
uv run pytest tests/test_metrics.py -v

# Or exclude integration tests
uv run pytest -m "not integration" -v

Integration Tests

Integration tests with Vertex AI require Google Cloud credentials:

# Install integration test dependencies
uv sync --group integration-tests

# Set up credentials
export GOOGLE_CLOUD_PROJECT=your-project-id
gcloud auth application-default login

# Run integration tests
uv run pytest tests/test_integration_vertex_ai.py -v

See tests/INTEGRATION_TESTS.md for detailed setup instructions.

License

Apache License 2.0
