Trivial LLM eval - as simple as possible

Project description

teval

Trivial LLM eval - as simple as possible (maybe even a bit more simple)

A lightweight, straightforward evaluation framework for LLM outputs using Yes/No metrics with mandatory and cumulative scoring.

Features

Two-tier metric system: Mandatory metrics (all must pass) and cumulative metrics (threshold-based scoring)
Simple Yes/No evaluations: Each metric is a binary pass/fail criterion
Count-based scoring: Cumulative metrics contribute to a total score based on the number passed
LLM integration ready: Generate prompts, JSON schemas, and Pydantic models for structured LLM evaluation
Dynamic Pydantic models: Automatically create type-safe Pydantic classes from rubrics
Flexible validation: Accepts both JSON strings and dictionaries for LLM response validation
Type safety: Full IDE autocomplete and type checking support
Minimal dependencies: Requires only Pydantic (>= 2.7.4, < 3.0.0)

Installation

Requirements: Python 3.10 - 3.14

This project uses uv for dependency management:

# Install dependencies
uv sync

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS

How It Works

Evaluation System

The framework uses two types of metrics:

Mandatory Metrics: All must pass (Yes=1) for the evaluation to succeed
Cumulative Metrics: Contribute to a total score (count of passed metrics)

An evaluation passes if:

ALL mandatory metrics pass, AND
The count of passed cumulative metrics meets or exceeds the passing_score_threshold
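
The pass rule above can be written out in a few lines of plain Python. This is a minimal sketch of the logic as described, not the library's implementation; the function name `evaluation_passes` and its parameters are hypothetical, and treating a missing metric as a failure is an assumption of the sketch.

```python
def evaluation_passes(
    results: dict[str, bool],
    mandatory_ids: list[str],
    cumulative_ids: list[str],
    threshold: int,
) -> bool:
    """Apply the two-tier rule: all mandatory pass AND cumulative count >= threshold."""
    # Assumption of this sketch: a metric missing from the results counts as failed.
    all_mandatory = all(results.get(m, False) for m in mandatory_ids)
    cumulative_score = sum(1 for c in cumulative_ids if results.get(c, False))
    return all_mandatory and cumulative_score >= threshold


mandatory = ["M1", "M2"]
cumulative = ["C1", "C2", "C3", "C4"]

# Both mandatory metrics pass and 3 of 4 cumulative metrics pass.
good = {"M1": True, "M2": True, "C1": True, "C2": True, "C3": True, "C4": False}
# A perfect cumulative score cannot compensate for a failed mandatory metric.
bad = {"M1": True, "M2": False, "C1": True, "C2": True, "C3": True, "C4": True}

print(evaluation_passes(good, mandatory, cumulative, threshold=3))  # True
print(evaluation_passes(bad, mandatory, cumulative, threshold=3))   # False
```

The second case shows why the two tiers are kept separate: cumulative metrics are interchangeable, mandatory ones are not.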

Example Structure

from tevak import EvaluationRubric, MetricDefinition

rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        # Mandatory metrics (must all pass)
        MetricDefinition(id="M1", rubric="Code compiles without errors", mandatory=True),
        MetricDefinition(id="M2", rubric="No security vulnerabilities detected", mandatory=True),
        # Cumulative metrics (contribute to score count)
        MetricDefinition(id="C1", rubric="Follows project style guide"),
        MetricDefinition(id="C2", rubric="Includes appropriate comments"),
        MetricDefinition(id="C3", rubric="Uses meaningful variable names"),
        MetricDefinition(id="C4", rubric="Has proper error handling"),
    ],
    passing_score_threshold=3  # At least 3 of 4 cumulative metrics must pass
)

# Access mandatory and cumulative metrics via properties
print(f"Mandatory: {len(rubric.mandatory_metrics)}")  # 2
print(f"Cumulative: {len(rubric.cumulative_metrics)}")  # 4

LLM Integration

The framework provides built-in methods to integrate with LLM APIs for automated evaluation.

1. Generate Prompt Text

Use to_prompt_text() to create formatted instructions for LLM evaluators:

prompt_text = rubric.to_prompt_text()
# Returns formatted markdown with:
# - Mandatory criteria (all must pass)
# - Cumulative criteria (with threshold)
# - Clear evaluation instructions

# Use in your LLM prompt:
evaluation_prompt = f"""
Evaluate the following code submission:

{rubric.to_prompt_text()}

Code to evaluate:
{code_to_evaluate}
"""

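To give a feel for the kind of text such a prompt might contain, here is a standalone sketch that renders criteria into markdown. It does not reproduce the actual output of to_prompt_text(); the function `render_prompt`, its parameters, and the exact wording of the headings are all hypothetical.

```python
def render_prompt(mandatory: dict[str, str], cumulative: dict[str, str], threshold: int) -> str:
    """Render rubric criteria as markdown instructions for an LLM evaluator."""
    lines = ["## Mandatory criteria (all must pass)"]
    lines += [f"- {metric_id}: {text}" for metric_id, text in mandatory.items()]
    lines.append("")
    lines.append(
        f"## Cumulative criteria (at least {threshold} of {len(cumulative)} must pass)"
    )
    lines += [f"- {metric_id}: {text}" for metric_id, text in cumulative.items()]
    lines.append("")
    # Illustrative instruction only; the library's own wording may differ.
    lines.append("Answer Yes or No for every criterion ID listed above.")
    return "\n".join(lines)


print(render_prompt(
    mandatory={"M1": "Code compiles without errors"},
    cumulative={"C1": "Follows project style guide", "C2": "Includes appropriate comments"},
    threshold=1,
))
```
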
2. Generate JSON Schema or Pydantic Model

Option A: JSON Schema - Use to_json_schema() for structured LLM outputs:

schema = rubric.to_json_schema()
# Returns OpenAPI/Swagger-compatible JSON Schema with:
# - Boolean fields for each metric
# - Optional reasoning fields (metric_id + "_reasoning")
# - Proper required/optional specifications

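# Example with Gemini:
import google.generativeai as genai

# The model name is a placeholder; use any Gemini model that supports structured output
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    evaluation_prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema
    )
)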

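# Example with OpenAI:
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # json_schema output needs a model with Structured Outputs support
    messages=[{"role": "user", "content": evaluation_prompt}],
    # The API expects the schema wrapped in an object that also carries a name
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "code_review_v1", "schema": schema},
    },
)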
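
For orientation, a schema with the properties listed above (a boolean per metric, an optional `<metric_id>_reasoning` string, and only the booleans marked required) could look like the output of the following sketch. The helper `build_schema` is hypothetical, and the exact schema emitted by to_json_schema() may differ in detail.

```python
import json


def build_schema(metric_ids: list[str]) -> dict:
    """Build an object schema with a required boolean and optional reasoning per metric."""
    properties: dict[str, dict] = {}
    for metric_id in metric_ids:
        properties[metric_id] = {"type": "boolean"}
        properties[f"{metric_id}_reasoning"] = {"type": "string"}
    return {
        "type": "object",
        "properties": properties,
        "required": list(metric_ids),  # reasoning fields stay optional
    }


schema_sketch = build_schema(["M1", "C1"])
print(json.dumps(schema_sketch, indent=2))
```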

Option B: Pydantic Model - Use to_pydantic_model() for type-safe validation (recommended):

ResultModel = rubric.to_pydantic_model()
# Returns a dynamically created Pydantic model class with:
# - Boolean fields for each metric (required)
# - Optional string fields for reasoning
# - Automatic validation and type checking

# Use with libraries that support Pydantic models
# Example with instructor (OpenAI):
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": evaluation_prompt}],
    response_model=ResultModel  # Type-safe response
)

# Direct access to fields with type safety
print(result.M1)  # IDE autocomplete works!
print(result.M1_reasoning)

# Or parse JSON manually (every metric in the rubric is a required field)
json_response = '{"M1": true, "M2": true, "C1": true, "C2": false, "C3": true, "C4": true}'
result = ResultModel.model_validate_json(json_response)

# Export to dict
result_dict = result.model_dump()
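
A model of this shape can be built at runtime with Pydantic's `create_model`. The sketch below shows one way to do it, assuming Pydantic 2 is installed; it is not necessarily how to_pydantic_model() is implemented, and the names `build_result_model` and `ResultSketch` are hypothetical.

```python
from typing import Optional

from pydantic import ConfigDict, ValidationError, create_model


def build_result_model(metric_ids: list[str]):
    """Create a model with a required bool and an optional reasoning string per metric."""
    fields = {}
    for metric_id in metric_ids:
        fields[metric_id] = (bool, ...)  # Ellipsis marks the field as required
        fields[f"{metric_id}_reasoning"] = (Optional[str], None)
    # extra="forbid" rejects keys that are not part of the rubric.
    return create_model("ResultSketch", __config__=ConfigDict(extra="forbid"), **fields)


ResultSketch = build_result_model(["M1", "C1"])
parsed = ResultSketch.model_validate_json('{"M1": true, "C1": false}')
print(parsed.M1, parsed.C1, parsed.M1_reasoning)  # True False None

try:
    ResultSketch.model_validate_json('{"M1": true}')  # C1 is missing
except ValidationError as exc:
    print(f"{exc.error_count()} validation error(s)")
```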

3. Validate LLM Responses

Use validate_result() to check if the evaluation passes (accepts JSON string or dict):

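# Option A: Pass JSON string directly
# (response.text is the Gemini response body; with OpenAI, use response.choices[0].message.content)
passes = rubric.validate_result(response.text)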

# Option B: Pass parsed dictionary
import json
result_dict = json.loads(response.text)
passes = rubric.validate_result(result_dict)

if passes:
    print("✓ Evaluation passed!")
    # All mandatory metrics passed AND
    # Cumulative threshold met
else:
    print("✗ Evaluation failed")
    # Either a mandatory metric failed OR
    # Cumulative threshold not met
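
Accepting either a JSON string or a dictionary usually comes down to one normalization step before the checks run. The following is a minimal sketch of that step; `normalize_response` is a hypothetical helper and not part of the library's API.

```python
import json
from typing import Any, Union


def normalize_response(response: Union[str, dict[str, Any]]) -> dict[str, Any]:
    """Return the evaluation result as a dict, parsing it first if it is a JSON string."""
    data = json.loads(response) if isinstance(response, str) else response
    if not isinstance(data, dict):
        # A JSON array or scalar cannot be mapped onto metric IDs.
        raise TypeError(f"expected a JSON object, got {type(data).__name__}")
    return data


print(normalize_response('{"M1": true, "C1": false}'))
print(normalize_response({"M1": True, "C1": False}))
```

Both calls produce the same dictionary, so downstream checks only ever deal with one input type.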

Complete Example

from tevak import EvaluationRubric, MetricDefinition
import json

# 1. Define rubric
rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        MetricDefinition(id="M1", rubric="Code compiles", mandatory=True),
        MetricDefinition(id="M2", rubric="No security issues", mandatory=True),
        MetricDefinition(id="C1", rubric="Follows style guide"),
        MetricDefinition(id="C2", rubric="Has tests"),
    ],
    passing_score_threshold=1
)

# 2. Get prompt and schema for LLM
prompt = rubric.to_prompt_text()
schema = rubric.to_json_schema()

# 3. Get LLM evaluation (your API call here)
llm_response = '{"M1": true, "M2": true, "C1": true, "C2": false}'

# 4. Validate result
passes = rubric.validate_result(llm_response)
print(f"Result: {'PASS' if passes else 'FAIL'}")
# Output: Result: PASS
# (Both mandatory metrics passed, 1 of 2 cumulative metrics passed)

See example_usage.py and example_pydantic.py for complete working examples.
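
Because validate_result() is shown returning only a pass/fail value, a short breakdown can help when debugging a failed evaluation. The sketch below recomputes the two parts of the decision from the raw response of the complete example; the metric IDs and threshold are hard-coded here for illustration rather than read from the rubric.

```python
import json

llm_response = '{"M1": true, "M2": true, "C1": true, "C2": false}'
results = json.loads(llm_response)

# Hard-coded to match the rubric in the complete example (illustrative only).
mandatory_ids = ["M1", "M2"]
cumulative_ids = ["C1", "C2"]
threshold = 1

failed_mandatory = [m for m in mandatory_ids if not results[m]]
score = sum(results[c] for c in cumulative_ids)  # True counts as 1

print(f"Failed mandatory metrics: {failed_mandatory or 'none'}")
print(f"Cumulative score: {score}/{len(cumulative_ids)} (threshold {threshold})")
```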

Why Use Pydantic Models?

The to_pydantic_model() approach provides significant advantages over plain JSON schemas:

✅ Type Safety

# Assumes a small rubric with one mandatory metric (M1) and one cumulative metric (C1)
ResultModel = rubric.to_pydantic_model()
result = ResultModel(M1=True, C1=False)

# IDE knows these types:
result.M1  # bool (autocomplete works!)
result.M1_reasoning  # Optional[str]

✅ Automatic Validation

# Wrong type - caught immediately
result = ResultModel(M1="maybe", C1=False)  # ❌ ValidationError ("maybe" is not a boolean)

# Missing required field - caught immediately
result = ResultModel(M1=True)  # ❌ ValidationError (missing C1)

# Extra fields - rejected
result = ResultModel(M1=True, C1=False, extra="bad")  # ❌ ValidationError
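
Whether a string value counts as a wrong type depends on Pydantic's validation mode, and which mode the models generated by to_pydantic_model() use is not stated here. The sketch below uses two hand-written stand-in models (`LaxResult` and `StrictResult`, both hypothetical) to show the difference: in the default lax mode, Pydantic 2 coerces strings such as "yes" to True, while strict mode rejects them.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LaxResult(BaseModel):
    """Default (lax) mode: common truthy/falsy strings are coerced to bool."""

    M1: bool


class StrictResult(BaseModel):
    """Strict mode: only real bool values are accepted."""

    model_config = ConfigDict(strict=True)

    M1: bool


print(LaxResult(M1="yes").M1)  # True

try:
    StrictResult(M1="yes")
except ValidationError:
    print("strict mode rejects the string")
```

If an LLM occasionally returns "yes"/"no" strings instead of JSON booleans, this distinction decides whether such responses are accepted silently or flagged.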

✅ Better Developer Experience

# Direct attribute access (not dict keys)
if result.M1:  # Clear and type-safe
    print(result.M1_reasoning)

# Easy serialization
json_str = result.model_dump_json()
dict_data = result.model_dump(exclude_none=True)

# Easy parsing
result = ResultModel.model_validate_json(llm_response)

✅ Integration with LLM Libraries

# Works seamlessly with instructor
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

# Returns a validated Pydantic instance
result = client.chat.completions.create(
    model="gpt-4",
    response_model=ResultModel,  # ← Type-safe!
    messages=[...]
)

When to Use What

Use to_pydantic_model() when:
- You want type safety and IDE support
- You're using libraries like instructor, marvin, or langchain
- You want automatic validation
- You prefer Python objects over dictionaries

Use to_json_schema() when:
- You need a plain JSON schema for API specifications
- You're working with non-Python systems
- You need OpenAPI/Swagger documentation
- The LLM API only accepts JSON schemas

Project Structure

teval/
├── tevak/                          # Main package
│   ├── __init__.py
│   └── metrics.py                  # Core evaluation framework
├── tests/                          # Test suite
│   ├── __init__.py
│   ├── test_metrics.py             # Unit tests
│   ├── test_integration_vertex_ai.py  # Vertex AI integration tests
│   └── INTEGRATION_TESTS.md        # Integration test documentation
├── example_usage.py                # Complete LLM integration example
├── example_pydantic.py             # Pydantic model examples
├── pyproject.toml                  # Project configuration
├── CLAUDE.md                       # Development guide
└── README.md

Testing

Unit Tests

Run the core framework tests (no external dependencies):

# Run all unit tests
uv run pytest tests/test_metrics.py -v

# Or exclude integration tests
uv run pytest -m "not integration" -v

Integration Tests

Integration tests with Vertex AI require Google Cloud credentials:

# Install integration test dependencies
uv sync --group integration-tests

# Set up credentials
export GOOGLE_CLOUD_PROJECT=your-project-id
gcloud auth application-default login

# Run integration tests
uv run pytest tests/test_integration_vertex_ai.py -v

See tests/INTEGRATION_TESTS.md for detailed setup instructions.

License

Apache License 2.0

Release history

0.1.2: Jan 15, 2026
0.1.1: Dec 29, 2025 (this version)
