teval
Trivial LLM eval - as simple as possible (maybe even a bit more simple)
A lightweight, straightforward evaluation framework for LLM outputs using Yes/No metrics with mandatory and cumulative scoring.
Features
- Two-tier metric system: Mandatory metrics (all must pass) and cumulative metrics (threshold-based scoring)
- Simple Yes/No evaluations: Each metric is a binary pass/fail criterion
- Count-based scoring: Cumulative metrics contribute to a total score based on the number passed
- LLM integration ready: Generate prompts, JSON schemas, and Pydantic models for structured LLM evaluation
- Dynamic Pydantic models: Automatically create type-safe Pydantic classes from rubrics
- Flexible validation: Accepts both JSON strings and dictionaries for LLM response validation
- Type safety: Full IDE autocomplete and type checking support
- Minimal dependencies: Only requires Pydantic 2.7.4+ (< 3.0.0)
Quick Start
New to teval? Check out the 5-minute Quick Start Guide to get up and running fast.
Documentation
- Quick Start Guide - Get started in 5 minutes
- API Reference - Complete API documentation for all classes and methods
- Roadmap - Future development plans
For complete usage details, continue reading below.
Installation
Requirements: Python 3.10 - 3.13 (Python 3.14 support pending Pydantic compatibility)
This project uses uv for dependency management:
# Install dependencies
uv sync
# Activate virtual environment
source .venv/bin/activate # Linux/macOS
How It Works
Evaluation System
The framework uses two types of metrics:
- Mandatory Metrics: All must pass (Yes=1) for the evaluation to succeed
- Cumulative Metrics: Contribute to a total score (count of passed metrics)
An evaluation passes if:
- ALL mandatory metrics pass, AND
- The count of passed cumulative metrics meets or exceeds the passing_score_threshold
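The two-part rule above can be sketched in plain Python. This is only an illustration of the logic, not teval's actual implementation; `evaluation_passes` and its parameters are hypothetical names chosen for this example.

```python
def evaluation_passes(results, mandatory_ids, cumulative_ids, threshold):
    """Apply the two-part rule to a dict mapping metric id -> bool."""
    # Part 1: every mandatory metric must be True.
    mandatory_ok = all(results[m] for m in mandatory_ids)
    # Part 2: count the cumulative metrics that passed and compare to the threshold.
    score = sum(1 for c in cumulative_ids if results[c])
    return mandatory_ok and score >= threshold


mandatory = ["M1", "M2"]
cumulative = ["C1", "C2", "C3", "C4"]

good = {"M1": True, "M2": True, "C1": True, "C2": True, "C3": True, "C4": False}
blocked = {**good, "M2": False}                 # one mandatory metric fails
low_score = {**good, "C1": False, "C2": False}  # only 1 of 4 cumulative metrics passes

print(evaluation_passes(good, mandatory, cumulative, threshold=3))       # True
print(evaluation_passes(blocked, mandatory, cumulative, threshold=3))    # False
print(evaluation_passes(low_score, mandatory, cumulative, threshold=3))  # False
```

Note that a high cumulative score never compensates for a failed mandatory metric: the two conditions are combined with AND.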
Example Structure
from teval import EvaluationRubric, MetricDefinition
rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        # Mandatory metrics (must all pass)
        MetricDefinition(id="M1", rubric="Code compiles without errors", mandatory=True),
        MetricDefinition(id="M2", rubric="No security vulnerabilities detected", mandatory=True),
        # Cumulative metrics (contribute to score count)
        MetricDefinition(id="C1", rubric="Follows project style guide"),
        MetricDefinition(id="C2", rubric="Includes appropriate comments"),
        MetricDefinition(id="C3", rubric="Uses meaningful variable names"),
        MetricDefinition(id="C4", rubric="Has proper error handling"),
    ],
    passing_score_threshold=3 # At least 3 of 4 cumulative metrics must pass
)
# Access mandatory and cumulative metrics via properties
print(f"Mandatory: {len(rubric.mandatory_metrics)}") # 2
print(f"Cumulative: {len(rubric.cumulative_metrics)}") # 4
LLM Integration
The framework provides built-in methods to integrate with LLM APIs for automated evaluation.
1. Generate Prompt Text
Use to_prompt_text() to create formatted instructions for LLM evaluators:
prompt_text = rubric.to_prompt_text()
# Returns formatted markdown with:
# - Mandatory criteria (all must pass)
# - Cumulative criteria (with threshold)
# - Clear evaluation instructions
# Use in your LLM prompt:
evaluation_prompt = f"""
Evaluate the following code submission:
{rubric.to_prompt_text()}
Code to evaluate:
{code_to_evaluate}
"""
2. Generate JSON Schema or Pydantic Model
Option A: JSON Schema - Use to_json_schema() for structured LLM outputs:
schema = rubric.to_json_schema()
# Returns OpenAPI/Swagger-compatible JSON Schema with:
# - Boolean fields for each metric
# - Optional reasoning fields (metric_id + "_reasoning")
# - Proper required/optional specifications
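# Example with Gemini:
import google.generativeai as genai
# The model name is only an example; use any model that supports response_schema
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    evaluation_prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema
    )
)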
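# Example with OpenAI:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # json_schema response formats need a model with Structured Outputs support
    messages=[{"role": "user", "content": evaluation_prompt}],
    # OpenAI expects the schema wrapped in an object with "name" and "schema" keys;
    # the name "evaluation_result" is arbitrary
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "evaluation_result", "schema": schema},
    },
)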
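To make the shape of such a schema concrete, here is a rough sketch built only from the three properties listed above (a boolean per metric, an optional `<metric_id>_reasoning` string, and a `required` list). The exact output of `to_json_schema()` may differ, for example in descriptions or additional keywords; `sketch_schema` is a hypothetical helper, not part of teval.

```python
import json


def sketch_schema(metric_ids):
    """Build a JSON Schema with a boolean and an optional reasoning string per metric."""
    properties = {}
    for metric_id in metric_ids:
        properties[metric_id] = {"type": "boolean"}
        # Reasoning fields are optional, so they are left out of "required".
        properties[f"{metric_id}_reasoning"] = {"type": "string"}
    return {
        "type": "object",
        "properties": properties,
        "required": list(metric_ids),
    }


schema_sketch = sketch_schema(["M1", "C1"])
print(json.dumps(schema_sketch, indent=2))
```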
Option B: Pydantic Model - Use to_pydantic_model() for type-safe validation (recommended):
ResultModel = rubric.to_pydantic_model()
# Returns a dynamically created Pydantic model class with:
# - Boolean fields for each metric (required)
# - Optional string fields for reasoning
# - Automatic validation and type checking
# Use with libraries that support Pydantic models
# Example with instructor (OpenAI):
import instructor
from openai import OpenAI
client = instructor.from_openai(OpenAI())
result = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": evaluation_prompt}],
    response_model=ResultModel # Type-safe response
)
# Direct access to fields with type safety
print(result.M1) # IDE autocomplete works!
print(result.M1_reasoning)
# Or parse JSON manually (every metric in the rubric must be present)
json_response = '{"M1": true, "M2": true, "C1": true, "C2": false, "C3": true, "C4": true}'
result = ResultModel.model_validate_json(json_response)
# Export to dict
result_dict = result.model_dump()
3. Validate LLM Responses
Use validate_result() to check if the evaluation passes (accepts JSON string or dict):
# Option A: Pass JSON string directly
# (response.text is the Gemini response body; with the OpenAI client,
# use response.choices[0].message.content instead)
passes = rubric.validate_result(response.text)
# Option B: Pass parsed dictionary
import json
result_dict = json.loads(response.text)
passes = rubric.validate_result(result_dict)
if passes:
    print("✓ Evaluation passed!")
    # All mandatory metrics passed AND
    # Cumulative threshold met
else:
    print("✗ Evaluation failed")
    # Either a mandatory metric failed OR
    # Cumulative threshold not met
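Since the check above yields a single pass/fail value, it can help to see which condition caused a failure. The following is a minimal sketch of such a report, assuming the metric ids and threshold are known to the caller; `summarize_result` is a hypothetical helper and not part of teval.

```python
import json


def summarize_result(result, mandatory_ids, cumulative_ids, threshold):
    """Return a small report explaining a pass/fail outcome."""
    # Accept either a JSON string or an already-parsed dict.
    data = json.loads(result) if isinstance(result, str) else result
    # A metric counts as passed only when its value is exactly True.
    failed_mandatory = [m for m in mandatory_ids if data.get(m) is not True]
    score = sum(1 for c in cumulative_ids if data.get(c) is True)
    return {
        "failed_mandatory": failed_mandatory,
        "cumulative_score": score,
        "threshold": threshold,
        "passes": not failed_mandatory and score >= threshold,
    }


report = summarize_result(
    '{"M1": true, "M2": false, "C1": true, "C2": false}',
    mandatory_ids=["M1", "M2"],
    cumulative_ids=["C1", "C2"],
    threshold=1,
)
print(report)
# {'failed_mandatory': ['M2'], 'cumulative_score': 1, 'threshold': 1, 'passes': False}
```

Here the cumulative threshold is met, but the evaluation still fails because M2 is a mandatory metric.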
Complete Example
from teval import EvaluationRubric, MetricDefinition
import json
# 1. Define rubric
rubric = EvaluationRubric(
    rubric_id="code_review_v1",
    metrics=[
        MetricDefinition(id="M1", rubric="Code compiles", mandatory=True),
        MetricDefinition(id="M2", rubric="No security issues", mandatory=True),
        MetricDefinition(id="C1", rubric="Follows style guide"),
        MetricDefinition(id="C2", rubric="Has tests"),
    ],
    passing_score_threshold=1
)
# 2. Get prompt and schema for LLM
prompt = rubric.to_prompt_text()
schema = rubric.to_json_schema()
# 3. Get LLM evaluation (your API call here)
llm_response = '{"M1": true, "M2": true, "C1": true, "C2": false}'
# 4. Validate result
passes = rubric.validate_result(llm_response)
print(f"Result: {'PASS' if passes else 'FAIL'}")
# Output: Result: PASS
# (Both mandatory metrics passed, 1 of 2 cumulative metrics passed)
See example_usage.py and example_pydantic.py for complete working examples.
Why Use Pydantic Models?
The to_pydantic_model() approach provides significant advantages over plain JSON schemas:
✅ Type Safety
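# The snippets in this section assume a minimal rubric with one mandatory
# metric (M1) and one cumulative metric (C1)
ResultModel = rubric.to_pydantic_model()
result = ResultModel(M1=True, C1=False)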
# IDE knows these types:
result.M1 # bool (autocomplete works!)
result.M1_reasoning # Optional[str]
✅ Automatic Validation
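# Wrong type - caught immediately
# (unless strict mode is enabled, Pydantic coerces strings such as "yes" to True,
# so this example uses a value that cannot be read as a boolean)
result = ResultModel(M1="maybe", C1=False) # ❌ ValidationError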
# Missing required field - caught immediately
result = ResultModel(M1=True) # ❌ ValidationError (missing C1)
# Extra fields - rejected
result = ResultModel(M1=True, C1=False, extra="bad") # ❌ ValidationError
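When a result arrives as a plain dictionary and Pydantic is not in the loop (for example in a quick script), the same three checks can be approximated by hand. This is a sketch under that assumption; `find_shape_problems` is a hypothetical helper, not part of teval, and it is stricter than Pydantic's default mode because it accepts only real booleans. It does not check the types of the reasoning fields.

```python
def find_shape_problems(data, metric_ids):
    """List problems in a raw result dict; an empty list means the shape is fine."""
    problems = []
    # Allowed keys: each metric id plus its optional "<id>_reasoning" companion.
    allowed = set(metric_ids) | {f"{m}_reasoning" for m in metric_ids}
    for m in metric_ids:
        if m not in data:
            problems.append(f"missing required field: {m}")
        elif not isinstance(data[m], bool):
            problems.append(f"{m} must be a boolean, got {type(data[m]).__name__}")
    for key in sorted(set(data) - allowed):
        problems.append(f"unexpected field: {key}")
    return problems


metric_ids = ["M1", "C1"]
print(find_shape_problems({"M1": True, "C1": False}, metric_ids))
# []
print(find_shape_problems({"M1": "yes", "C1": False}, metric_ids))
# ['M1 must be a boolean, got str']
print(find_shape_problems({"M1": True}, metric_ids))
# ['missing required field: C1']
print(find_shape_problems({"M1": True, "C1": False, "extra": "bad"}, metric_ids))
# ['unexpected field: extra']
```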
✅ Better Developer Experience
# Direct attribute access (not dict keys)
if result.M1: # Clear and type-safe
    print(result.M1_reasoning)
# Easy serialization
json_str = result.model_dump_json()
dict_data = result.model_dump(exclude_none=True)
# Easy parsing
result = ResultModel.model_validate_json(llm_response)
✅ Integration with LLM Libraries
# Works seamlessly with instructor
import instructor
from openai import OpenAI
client = instructor.from_openai(OpenAI())
# Returns a validated Pydantic instance
result = client.chat.completions.create(
    model="gpt-4",
    response_model=ResultModel, # ← Type-safe!
    messages=[...]
)
When to Use What
- Use to_pydantic_model() when:
  - You want type safety and IDE support
  - You're using libraries like instructor, marvin, or langchain
  - You want automatic validation
  - You prefer Python objects over dictionaries
- Use to_json_schema() when:
  - You need a plain JSON schema for API specifications
  - You're working with non-Python systems
  - You need OpenAPI/Swagger documentation
  - The LLM API only accepts JSON schemas
Project Structure
teval/
├── teval/ # Main package
│ ├── __init__.py
│ └── metrics.py # Core evaluation framework
├── tests/ # Test suite
│ ├── __init__.py
│ ├── test_metrics.py # Unit tests
│ ├── test_integration_vertex_ai.py # Vertex AI integration tests
│ └── INTEGRATION_TESTS.md # Integration test documentation
├── example_usage.py # Complete LLM integration example
├── example_pydantic.py # Pydantic model examples
├── pyproject.toml # Project configuration
├── CLAUDE.md # Development guide
└── README.md
Testing
Unit Tests
Run the core framework tests (no external dependencies):
# Run all unit tests
uv run pytest tests/test_metrics.py -v
# Or exclude integration tests
uv run pytest -m "not integration" -v
Integration Tests
Integration tests with Vertex AI require Google Cloud credentials:
# Install integration test dependencies
uv sync --group integration-tests
# Set up credentials
export GOOGLE_CLOUD_PROJECT=your-project-id
gcloud auth application-default login
# Run integration tests
uv run pytest tests/test_integration_vertex_ai.py -v
See tests/INTEGRATION_TESTS.md for detailed setup instructions.
License
Apache License 2.0