
A modular evaluation framework for testing functions with YAML-based specifications


VOWEL - YAML-Based Evaluation Specification

A modular evaluation system that reads test specifications for your functions from YAML files and runs them.

🚀 Installation

Install from Source

# Install in development mode (editable mode)
pip install -e .

# Or normal installation
pip install .

Install from PyPI

# Install from PyPI with pip
pip install vowel

# Or install with uv
uv add vowel

🎯 Usage

After installation, the vowel command is available system-wide:

# Normal mode - logfire disabled (fast and clean output)
vowel multi_evals.yml

# Debug mode - logfire enabled (detailed logging)
vowel multi_evals.yml --debug

# Test only specific functions (comma-separated)
vowel multi_evals.yml -f make_list,make_square

# Test specific functions (multiple -f flags)
vowel multi_evals.yml -f make_list -f make_square

# Verbose mode - show error details
vowel multi_evals.yml -v

# Use all options together
vowel tests.yml -f uppercase,lowercase --debug -v

# Help
vowel --help

CLI Options

  • --debug: Enables debug mode with detailed logging via logfire
  • -f, --filter: Tests only the specified function(s). Use comma-separated values or multiple flags: -f func1,func2 or -f func1 -f func2
  • -v, --verbose: Shows detailed output including error reasons
  • --help: Shows help message

📝 YAML Syntax

Basic Structure

function_name:
  evals: # Global evaluators (optional)
    EvaluatorName:
      # evaluator specific parameters
  dataset: # Test cases
    - case:
        input: <input> # For single-parameter functions
        # or
        inputs: [<arg1>, <arg2>, ...] # For multi-parameter functions
        expected: <expected_output> # Optional
        assertion: <custom_check> # Optional, case-specific assertion
        # case-specific evaluators

Function Naming

vowel lets you target functions in three different ways:

# 1. Builtin functions (len, str, int, etc.)
len:
  dataset:
    - case:
        input: [1, 2, 3]
        expected: 3

# 2. Standard library (module.function format)
json.dumps:
  dataset:
    - case:
        input: { "key": "value" }
        expected: '{"key": "value"}'

os.path.join:
  dataset:
    - case:
        inputs: ["/home", "user", "file.txt"] # os.path.join("/home", "user", "file.txt")
        expected: "/home/user/file.txt"

# 3. Your own functions (via programmatic API)
# Function name in YAML is passed to run_evals with functions parameter
multiply:
  dataset:
    - case:
        inputs: [2, 3] # multiply(2, 3) - multi-parameter
        expected: 6

To use your own functions:

from vowel import run_evals

def multiply(x: int, y: int) -> int:
    return x * y

# With YAML file or dict
summary = run_evals(
    "evals.yml",  # or dict
    functions={"multiply": multiply}  # Pass function directly
)

Or with RunEvals Fluent API:

from vowel import RunEvals

def multiply(x: int, y: int) -> int:
    return x * y

# Cleaner and composable
summary = (
    RunEvals.from_file("evals.yml")
    .with_functions({"multiply": multiply})
    .debug()
    .run()
)

print(f"Passed: {summary.all_passed}")

For detailed information, see API_USAGE.md and RUNEVALS_GUIDE.md.

Input Formats

Important: The difference between input and inputs:

  • input: For single-parameter functions (the value is passed directly as the only argument)
  • inputs: For multi-parameter functions (the list elements are unpacked as separate arguments)

A minimal sketch of this dispatch follows the examples below.

# Single parameter - use 'input'
single_param:
  dataset:
    - case:
        input: 42 # Function: single_param(42)

# Multi-parameter - use 'inputs'
multi_param:
  dataset:
    - case:
        inputs: [10, 20, 30] # Function: multi_param(10, 20, 30)

# Single-parameter function but parameter is a list - use 'input'
list_param:
  dataset:
    - case:
        input: [1, 2, 3] # Function: list_param([1, 2, 3])

# Complex types - single parameter
complex_types:
  dataset:
    - case:
        input: [{ "name": "John", "age": 30 }, { "name": "Jane", "age": 25 }]

# Multi-parameter example - max function
max:
  dataset:
    - case:
        inputs: [5, 10, 3] # max(5, 10, 3)
        expected: 10

    - case:
        inputs: [-1, -5, -2] # max(-1, -5, -2)
        expected: -1

# Single-parameter example - len function
len:
  dataset:
    - case:
        input: [1, 2, 3, 4] # len([1, 2, 3, 4])
        expected: 4

    - case:
        input: "hello" # len("hello")
        expected: 5
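
Conceptually, input maps to a single positional argument while inputs is unpacked into multiple arguments. The sketch below only illustrates that convention; it is not vowel's actual implementation, and the case dict stands in for the parsed YAML mapping of one case.

def call_for_case(func, case: dict):
    """Illustrative only: how a case's arguments map onto the function call."""
    if "inputs" in case:
        return func(*case["inputs"])   # list elements become separate arguments
    return func(case["input"])         # the value is passed as a single argument

assert call_for_case(max, {"inputs": [5, 10, 3]}) == 10
assert call_for_case(len, {"input": [1, 2, 3, 4]}) == 4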

Complete Example

make_square:
  evals:
    # Global evaluators - apply to all cases
    IsNumber:
      type: "int | float" # Output must be int or float
    PositiveCheck:
      assertion: "output > 0" # Output must be positive
    FastEnough:
      duration: 0.01 # Maximum 0.01 seconds

  dataset:
    # Test case 1
    - case:
        input: 5
        expected: 25 # 5*5 = 25

    # Test case 2 - case-specific evaluator
    - case:
        input: -3
        expected: 9
        duration: 100 # 100ms limit for this case

    # Test case 3 - only global evaluators
    - case:
        input: 10
        # no expected, only global assertions are tested

📊 Evaluators

1. Assertion Evaluator

Evaluates a Python expression. The output variable holds the function's return value, and the case's input is also available in the expression.

function_name:
  evals:
    # Simple check
    PositiveNumber:
      assertion: "output > 0"

    # Complex check
    RangeCheck:
      assertion: "10 < output < 100"

    # List check
    ListLength:
      assertion: "len(output) == 5"

    # String check
    StartsWith:
      assertion: "output.startswith('hello')"

    # Type check
    IsList:
      assertion: "isinstance(output, list)"

Examples:

uppercase:
  evals:
    IsString:
      type: str
    AllCaps:
      assertion: "output.isupper()"
    SameLength:
      assertion: "len(input) == len(output)"
  dataset:
    - case:
        input: "hello"
        expected: "HELLO"

filter_positive:
  evals:
    IsList:
      type: list
    AllPositive:
      assertion: "all(x > 0 for x in output)"
  dataset:
    - case:
        input: [1, -2, 3, -4, 5]
        expected: [1, 3, 5]

2. Type Evaluator

Checks the type of the output. Supports union types.

function_name:
  evals:
    # Single type
    IsString:
      type: str

    # Union type
    IsNumber:
      type: "int | float"

    # List type
    IsList:
      type: list

    # Dict type
    IsDict:
      type: dict
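
A union spec such as "int | float" can be read as a tuple of builtin types for an isinstance check. The sketch below is only illustrative and assumes the spec uses builtin type names:

import builtins

def check_type(output, type_spec: str) -> bool:
    # "int | float" -> (int, float); each name is resolved among the builtins
    allowed = tuple(getattr(builtins, name.strip()) for name in type_spec.split("|"))
    return isinstance(output, allowed)

assert check_type(5.0, "int | float")
assert not check_type("5", "list")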

Examples:

make_list:
  evals:
    IsList:
      type: list
  dataset:
    - case:
        input: 5
        expected: [5]

divide:
  evals:
    IsNumber:
      type: "int | float" # accepts int or float
  dataset:
    - case:
        inputs: [10, 2]
        expected: 5.0

3. Duration Evaluator

Checks the execution time of the function.

# Global level (seconds)
function_name:
  evals:
    FastEnough:
      duration: 0.01  # maximum 0.01 seconds (10ms)

# Case level (milliseconds)
function_name:
  dataset:
    - case:
        input: 100
        duration: 50  # maximum 50 milliseconds
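
Timing itself is straightforward; the detail to keep in mind is the unit convention (seconds at the global level, milliseconds at the case level). A rough sketch, not vowel's internals:

import time

def run_timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start  # seconds
    return result, elapsed

result, elapsed = run_timed(len, [1, 2, 3])
assert elapsed <= 0.01          # global-style limit, expressed in seconds
assert elapsed * 1000 <= 50     # case-style limit, expressed in milliseconds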

Example:

fibonacci:
  evals:
    FastEnough:
      duration: 0.001 # 1ms limit
  dataset:
    - case:
        input: 10
        expected: 55

    - case:
        input: 20
        expected: 6765
        duration: 10 # 10ms limit for this case

4. Contains Input Evaluator

Checks if the input is contained in the output.

function_name:
  evals:
    ContainsInput:
      contains_input:
        case_sensitive: true # Case sensitive (default: true)
        as_strings: false # Convert to string and compare (default: false)

Examples:

wrap_string:
  evals:
    ContainsInput:
      contains_input:
        case_sensitive: true
  dataset:
    - case:
        input: "world"
        expected: "Hello, world!"

repeat_list:
  evals:
    ContainsInput:
      contains_input:
        as_strings: true # [1,2] → "[1, 2]" as string
  dataset:
    - case:
        input: [1, 2, 3]
        expected: [1, 2, 3, 1, 2, 3]
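
The containment check behind these options can be pictured as follows. This is an illustrative sketch of the described behaviour (as_strings falls back to str() before comparing, case_sensitive: false lowercases both sides), not the library's actual code:

def contains_input(output, input, case_sensitive=True, as_strings=False):
    haystack, needle = (str(output), str(input)) if as_strings else (output, input)
    if isinstance(haystack, str) and isinstance(needle, str) and not case_sensitive:
        haystack, needle = haystack.lower(), needle.lower()
    return needle in haystack

assert contains_input("Hello, world!", "world")
assert contains_input([1, 2, 3, 1, 2, 3], [1, 2, 3], as_strings=True)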

5. Expected Evaluator (Case Level)

Compares the expected output with the actual output.

function_name:
  dataset:
    - case:
        input: 5
        expected: 25 # output == 25

Examples:

add:
  dataset:
    - case:
        inputs: [2, 3]
        expected: 5

    - case:
        inputs: [10, -5]
        expected: 5

multiply:
  dataset:
    - case:
        inputs: [3, 4]
        expected: 12

6. Contains Evaluator (Case Level)

Checks if a specific value is contained in the output.

function_name:
  dataset:
    - case:
        input: "test"
        contains: "expected_substring"

Example:

generate_html:
  dataset:
    - case:
        input: "Title"
        contains: "<h1>Title</h1>" # This string must be in output

    - case:
        input: "Link"
        contains: "<a>" # This must also be in output

7. Pattern Matching Evaluator (Regex)

Validates that output matches a regular expression pattern. Works at both global and case levels.

# Global level - applies to all cases
function_name:
  evals:
    PatternName:
      pattern: "regex_pattern"
      case_sensitive: true  # Optional, default: true

# Case level - specific to one case
function_name:
  dataset:
    - case:
        input: "test"
        pattern: "regex_pattern"
        case_sensitive: false  # Optional
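
In regex terms, case_sensitive: false corresponds to re.IGNORECASE and the pattern is checked against the output as a string. A minimal sketch, assuming a search-style match rather than a full match:

import re

def matches_pattern(output, pattern: str, case_sensitive: bool = True) -> bool:
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, str(output), flags) is not None

assert matches_pattern("id: 123", r"^id: \d+$")
assert matches_pattern("HELLO WORLD", "hello", case_sensitive=False)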

Examples:

# Validate email format
validate_email:
  evals:
    HasAtSign:
      pattern: "@"
    ValidDomain:
      pattern: "\\.(com|org|net)$"
  dataset:
    - case:
        input: "test@example.com"
        expected: "test@example.com"

    - case:
        input: "admin@test.org"
        expected: "admin@test.org"
        pattern: "\\.org$" # Case-specific pattern

# Format validation
format_id:
  evals:
    CorrectFormat:
      pattern: "^id: \\d+$"
      case_sensitive: true
  dataset:
    - case:
        input: 123
        expected: "id: 123"

    - case:
        input: 456
        expected: "ID: 456"
        pattern: "^ID: \\d+$" # Different pattern for this case

# Case insensitive matching
normalize_text:
  dataset:
    - case:
        input: "Hello"
        expected: "HELLO WORLD"
        pattern: "hello"
        case_sensitive: false # Matches "HELLO" too

    - case:
        input: "test"
        expected: "TEST123"
        pattern: "^[A-Z]+\\d+$" # Only uppercase letters + digits

# Multiple patterns
phone_format:
  evals:
    HasDigits:
      pattern: "\\d+"
  dataset:
    - case:
        input: "1234567890"
        expected: "+1 (123) 456-7890"
        pattern: "^\\+\\d+ \\(\\d{3}\\) \\d{3}-\\d{4}$"

    - case:
        input: "9876543210"
        expected: "987-654-3210"
        pattern: "^\\d{3}-\\d{3}-\\d{4}$"

Common Regex Patterns:

# Numbers only
pattern: "^\\d+$"

# Uppercase letters only
pattern: "^[A-Z]+$"

# Email format
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"

# URL format
pattern: "^https?://.*"

# Contains specific word
pattern: "\\bword\\b"

# Starts with prefix
pattern: "^prefix"

# Ends with suffix
pattern: "suffix$"

8. Raises Evaluator (Exception Testing)

Tests that a function raises a specific exception. Similar to pytest.raises, this evaluator verifies the exception type and, optionally, a regex pattern against the exception message. This is a case-level-only evaluator.

function_name:
  dataset:
    - case:
        input: invalid_value
        raises: ExceptionType # Required: exception type name
        match: "pattern" # Optional: regex pattern for exception message

Important Notes:

  • raises is case-level only - cannot be used as a global evaluator
  • match can only be used together with raises
  • When raises is specified, the test expects an exception and will fail if the function returns normally
  • Global evaluators (type checks, assertions, etc.) are automatically skipped for exception cases

Examples:

# Basic exception testing
calculate_discount:
  evals:
    IsFloat:
      type: float
  dataset:
    - case:
        id: "valid_calculation"
        inputs: [100.0, 20.0]
        expected: 80.0

    - case:
        id: "negative_price"
        inputs: [-100.0, 20.0]
        raises: ValueError
        match: "must be positive" # Checks exception message

    - case:
        id: "invalid_discount"
        inputs: [100.0, 150.0]
        raises: ValueError # Just checks type, not message

# Division by zero
divide:
  evals:
    IsNumber:
      type: "int | float"
  dataset:
    - case:
        inputs: [10, 2]
        expected: 5.0

    - case:
        inputs: [10, 0]
        raises: ZeroDivisionError

# Type validation
parse_age:
  dataset:
    - case:
        input: "25"
        expected: 25

    - case:
        input: "invalid"
        raises: ValueError
        match: "invalid literal"

    - case:
        input: -5
        raises: ValueError
        match: "age must be positive"

# Key errors
get_config_value:
  dataset:
    - case:
        input: "api_key"
        expected: "secret_key_123"

    - case:
        input: "nonexistent_key"
        raises: KeyError
        match: "nonexistent_key"

# Multiple exception types
process_data:
  dataset:
    - case:
        input: { "valid": "data" }
        expected: "processed"

    - case:
        input: null
        raises: TypeError
        match: "NoneType"

    - case:
        input: []
        raises: ValueError
        match: "empty"

    - case:
        input: { "invalid": "format" }
        raises: KeyError

# Index errors
get_element:
  dataset:
    - case:
        inputs: [[1, 2, 3], 1]
        expected: 2

    - case:
        inputs: [[1, 2, 3], 10]
        raises: IndexError
        match: "out of range"

How it works:

  1. When raises is present in a case, the framework wraps the function call in a try/except block (see the sketch below)
  2. If an exception is raised:
    • Checks whether the exception type matches raises
    • If match is provided, validates the exception message against the regex pattern
    • Global evaluators are skipped (they would otherwise fail on the captured exception)
  3. If no exception is raised when raises is specified, the test fails
  4. If the exception type doesn't match, the test fails and shows actual vs expected
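
A plain-Python sketch of that logic, for illustration only:

import re

def check_raises(func, args, raises: str, match: str | None = None) -> bool:
    try:
        func(*args)
    except Exception as exc:
        if type(exc).__name__ != raises:
            return False                      # wrong exception type
        if match is not None and not re.search(match, str(exc)):
            return False                      # message does not match the pattern
        return True
    return False                              # expected an exception, got a normal return

assert check_raises(lambda a, b: a / b, (10, 0), "ZeroDivisionError")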

Common Exception Types:

  • ValueError: Invalid value/argument
  • TypeError: Wrong type
  • KeyError: Missing dictionary key
  • IndexError: List/array index out of range
  • ZeroDivisionError: Division by zero
  • AttributeError: Missing attribute
  • FileNotFoundError: File doesn't exist
  • RuntimeError: Generic runtime error

9. LLM Judge Evaluator

Uses a Language Model to evaluate outputs based on a custom rubric. Ideal for semantic evaluation, quality assessment, and cases where rule-based checking is insufficient.

function_name:
  evals:
    JudgeName:
      rubric: "Evaluation criteria/question for the LLM"
      include: # Optional, what context to provide to LLM
        - input # Include the input
        - expected_output # Include expected output
      config: # Model configuration
        model: "provider:model_name" # Required (or set JUDGE_MODEL env var)
        temperature: 0.7 # Optional
        max_tokens: 2096 # Optional
        # ... other optional model parameters
        # parameters will be passed into
        # pydantic_ai.settings.ModelSettings ctor

Configuration:

  • rubric: Required - The evaluation criteria or question for the LLM
  • include: Optional - List of context variables. Valid options:
    • input: Include function input
    • expected_output: Include expected output
    • Note: Output is always included automatically
  • config: Model configuration (see the sketch after this list)
    • model: Required (unless the JUDGE_MODEL environment variable is set)
    • All other parameters are optional (temperature, max_tokens, top_p, etc.)
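
The optional parameters under config map onto pydantic_ai's ModelSettings, as noted in the YAML comment above. A hedged sketch of that mapping; the agent construction shown here is an assumption about typical pydantic-ai usage, not vowel's internal code:

from pydantic_ai import Agent
from pydantic_ai.settings import ModelSettings

# The keys under config (other than model) become ModelSettings fields.
settings = ModelSettings(temperature=0.0, max_tokens=512)

# The model string selects the provider and model, e.g. "openai:gpt-4o-mini".
judge = Agent("openai:gpt-4o-mini", model_settings=settings)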

Examples:

# Basic semantic equivalence check
translate_to_english:
  evals:
    SemanticMatch:
      rubric: "Does the output have the same meaning as the expected output?"
      include:
        - expected_output
      config:
        model: "groq:qwen/qwen-2.5-72b-instruct"
        temperature: 0.0
  dataset:
    - case:
        input: "Bonjour"
        expected: "Hello"

    - case:
        input: "Comment allez-vous?"
        expected: "How are you?"

# Grammar and style checking
generate_response:
  evals:
    IsGrammaticallyCorrect:
      rubric: "Is the output grammatically correct and well-formatted?"
      config:
        model: "groq:qwen/qwen-2.5-72b-instruct"
        temperature: 0.0
        max_tokens: 512
  dataset:
    - case:
        input: "Write a greeting"
        expected: "Hello! How can I help you today?"

# Quality assessment with input context
summarize_text:
  evals:
    IsGoodSummary:
      rubric: "Is the output a good summary of the input text? Does it capture the main points?"
      include:
        - input
      config:
        model: "openai:gpt-4o-mini"
        temperature: 0.1
  dataset:
    - case:
        input: "Long article text here..."
        expected: "Brief summary..."

# Using environment variable for model
# Set: export JUDGE_MODEL="groq:qwen/qwen-2.5-72b-instruct"
check_correctness:
  evals:
    AnswerQuality:
      rubric: "Does the output correctly answer the question based on the input?"
      include:
        - input
        - expected_output
      config:
        temperature: 0.0 # model comes from JUDGE_MODEL env var
  dataset:
    - case:
        input: "What is 2+2?"
        expected: "4"

# Multiple criteria with different judges
write_code:
  evals:
    Correctness:
      rubric: "Is the code functionally correct?"
      include:
        - input
      config:
        model: "openai:gpt-4o"
        temperature: 0.0

    Readability:
      rubric: "Is the code well-structured and readable?"
      config:
        model: "openai:gpt-4o"
        temperature: 0.0
  dataset:
    - case:
        input: "Write a function to reverse a string"
        expected: "def reverse(s): return s[::-1]"

Supported Models:

  • OpenAI: openai:gpt-4o, openai:gpt-4o-mini, openai:gpt-4-turbo
  • Groq: groq:llama-3.3-70b-versatile, groq:qwen/qwen-2.5-72b-instruct
  • Anthropic: anthropic:claude-3-5-sonnet-20241022
  • Any model supported by pydantic-ai

Tips:

  • Use temperature: 0.0 for consistent evaluation
  • Be specific in your rubric - clear criteria = better evaluation
  • Use include: [input, expected_output] when the LLM needs full context
  • Set JUDGE_MODEL environment variable to avoid repeating model config
  • Combine with other evaluators for comprehensive testing

🎨 Advanced Examples

Using Multiple Evaluators

process_numbers:
  evals:
    # Global evaluators
    IsList:
      type: list
    NotEmpty:
      assertion: "len(output) > 0"
    AllPositive:
      assertion: "all(x > 0 for x in output)"
    Performance:
      duration: 0.01

  dataset:
    - case:
        input: [1, -2, 3, -4, 5]
        expected: [1, 3, 5] # Case-specific expected
        duration: 50 # Case-specific duration (ms)

Complex Test Scenario

json.loads:
  evals:
    IsDict:
      type: dict
    HasKey:
      assertion: "'name' in output"

  dataset:
    - case:
        input: '{"name": "John", "age": 30}'
        expected: { "name": "John", "age": 30 }

    - case:
        input: '{"name": "Jane"}'
        contains: "Jane" # converted to string with as_strings

str.split:
  evals:
    IsList:
      type: list

  dataset:
    - case:
        inputs: ["hello,world", ","]
        expected: ["hello", "world"]

    - case:
        inputs: ["a-b-c", "-"]
        expected: ["a", "b", "c"]
        assertion: "len(output) == 3" # Case-specific assertion

💡 Tips

1. Global vs Case Evaluators

# Global: Applies to all cases
evals:
  TypeCheck:
    type: int

# Case-specific: Applies only to that case
dataset:
  - case:
      input: 5
      expected: 25
      duration: 100

2. Testing Without Expected

# Test only with global evaluators
validate_format:
  evals:
    IsString:
      type: str
    HasPrefix:
      assertion: "output.startswith('prefix_')"

  dataset:
    - case:
        input: "test"
        # no expected, only the above checks are performed

3. Using Input in Assertions

double:
  evals:
    IsDouble:
      assertion: "output == input * 2"

  dataset:
    - case:
        input: 5
        # even without expected: 10, assertion checks it

4. Testing Multiple Functions

# test.yml
len:
  dataset:
    - case:
        input: [1, 2, 3]
        expected: 3

json.dumps:
  dataset:
    - case:
        input: { "key": "value" }
        expected: '{"key": "value"}'

make_list:
  dataset:
    - case:
        input: 5
        expected: [5]

# Test all
vowel test.yml

# Test only one
vowel test.yml -f len

# Test multiple (comma-separated)
vowel test.yml -f len,json.dumps

# Test multiple (separate flags)
vowel test.yml -f len -f json.dumps -f make_list

📚 Documentation

See API_USAGE.md and RUNEVALS_GUIDE.md for the programmatic API and the RunEvals fluent interface.

📄 License

Apache 2.0
