🧪 LLM Expect: Lightweight Evaluation Framework for LLM Reliability

LLM Expect is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

🤖 For AI Assistants: Read llm.txt for implementation patterns.

It provides a simple way to validate:

  • Schema correctness
  • Instruction adherence
  • Reference accuracy
  • Keyword / regex expectations

With optional support for LLM-as-Judge scoring.

Focus: Make LLM evaluation as easy as pytest. Nothing more. Nothing less.


Why LLM Expect?

If you're building with LLMs, you need a way to verify that your AI functions:

  • produce valid JSON
  • follow instructions consistently
  • don't regress when prompts or models change
  • behave consistently across environments
  • meet quality thresholds before deployment

LLM Expect gives you this with:

  • ✔ One decorator
  • ✔ One JSONL file
  • ✔ One evaluation call

No configuration. No complexity. No over-engineering.


Install

pip install llm-expect

Package Name: llm-expect on PyPI. Version: Currently v0.1.8.


Core Concept

You decorate any LLM function:

from llm_expect import llm_expect

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    ...

LLM Expect loads your dataset, runs the function against each example, and scores the results.
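
To make the load, run, score cycle concrete, here is a minimal offline sketch of the idea. It is not the library's implementation: `run_examples` and `fake_llm` are hypothetical names, and only the `reference` expectation is checked.

```python
import json
from typing import Any, Callable


def run_examples(fn: Callable[[Any], Any], jsonl_text: str) -> dict:
    """Call fn on each example's input and compare it with expected['reference']."""
    results = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines in the dataset
        example = json.loads(line)
        output = fn(example["input"])
        reference = example["expected"]["reference"]
        results.append(
            {"id": example["id"], "passed": str(output).strip() == reference.strip()}
        )
    passed = sum(r["passed"] for r in results)
    return {
        "results": results,
        "success_rate": passed / len(results) if results else 0.0,
    }


DATASET = '{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}\n'


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without an API key.
    return "4"


report = run_examples(fake_llm, DATASET)
print(report["success_rate"])
```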


Running Examples

LLM Expect ships with example scripts that demonstrate how to evaluate functions using real LLM APIs (OpenAI, Anthropic, Gemini).

Prerequisites

  1. Install SDKs:

    pip install openai anthropic google-generativeai
    
  2. Set API Keys:

    export OPENAI_API_KEY="your-key-here"
    export ANTHROPIC_API_KEY="your-key-here"
    export GEMINI_API_KEY="your-key-here"
    

    Alternatively, use a .env file:

    1. Copy .env.example to .env:
      cp .env.example .env
      
    2. Edit .env and add your API keys.

Run the Examples

We provide specific examples for different testing scenarios:

1. Reference (Exact Match)

python examples/example_reference_openai.py

2. Summarization (Content Check)

python examples/example_summary_openai.py

3. Extraction (Schema Validation)

python examples/example_extraction_openai.py

4. Regex (Pattern Matching)

python examples/example_regex_openai.py

5. Safety (Refusal Check)

python examples/example_safety_openai.py

6. Judge (LLM Evaluation)

python examples/example_judge_openai.py

Each example script will:

  1. Load the evaluation dataset from examples/eval_dataset.jsonl.
  2. Run evaluations on the respective models.
  3. Output pass/fail results and success rates.

🖥️ Rich CLI

LLM Expect includes a CLI for browsing saved results:

# List recent runs
llm-expect runs list

# Show detailed results for a run
llm-expect runs show runs/2025-11-25_...

JSONL Test Dataset Example

Save as tests.jsonl:

{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "json1", "input": "Return JSON with name and age", "expected": {"schema": {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "number"}}, "required": ["name", "age"]}}}
{"id": "hello1", "input": "Greet politely", "expected": {"contains": ["hello", "please"]}}
{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}

Input Format:

  • String/Int/Float/Bool: Passed directly as the first argument.
  • Dict: Unpacked as **kwargs if the function accepts multiple arguments, otherwise passed as a single dict argument.
  • List/Tuple: Passed as a single argument (not unpacked).
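
The input rules above can be sketched as a small dispatcher. This is an illustration under stated assumptions rather than the library's code: `call_with_input` is a hypothetical helper, and "accepts multiple arguments" is interpreted as "the signature has more than one parameter".

```python
import inspect
from typing import Any, Callable


def call_with_input(fn: Callable[..., Any], value: Any) -> Any:
    """Pass a dataset 'input' value to fn following the rules listed above."""
    if isinstance(value, dict):
        params = inspect.signature(fn).parameters
        if len(params) > 1:
            return fn(**value)  # dict unpacked as keyword arguments
    # Scalars, lists, tuples, and dicts for single-parameter functions
    # are passed as one positional argument.
    return fn(value)


def greet(name: str, greeting: str) -> str:
    return f"{greeting}, {name}"


def describe(payload: Any) -> str:
    return type(payload).__name__


unpacked = call_with_input(greet, {"name": "Ada", "greeting": "Hello"})
single_dict = call_with_input(describe, {"name": "Ada"})
single_list = call_with_input(describe, [1, 2, 3])
```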

Metric Inference: If tests=[] is omitted in the decorator, metrics are automatically inferred from the keys in the expected dictionary (e.g., reference -> accuracy, schema -> schema_fidelity).
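
As a rough illustration of metric inference, the sketch below maps expectation keys to metric names. Only the two pairs named above (reference and schema) come from this README; `infer_metrics` is a hypothetical helper, and keys without a documented mapping are simply skipped here.

```python
# Only these two mappings are stated in the README; others are unknown.
KEY_TO_METRIC = {"reference": "accuracy", "schema": "schema_fidelity"}


def infer_metrics(examples: list) -> list:
    """Collect metric names implied by the 'expected' keys, in first-seen order."""
    metrics = []
    for example in examples:
        for key in example.get("expected", {}):
            metric = KEY_TO_METRIC.get(key)
            if metric and metric not in metrics:
                metrics.append(metric)
    return metrics


inferred = infer_metrics(
    [
        {"id": "math1", "expected": {"reference": "4"}},
        {"id": "json1", "expected": {"schema": {"type": "object"}}},
        {"id": "math2", "expected": {"reference": "9"}},
    ]
)
```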

Supported expectations:

1. Reference (Exact Match)

Checks if the output matches a reference string exactly (ignoring whitespace).

"expected": {"reference": "42"}

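A minimal sketch of what such a check might look like. The README does not say whether "ignoring whitespace" covers interior whitespace, so this version assumes only leading and trailing whitespace is ignored; `matches_reference` is a hypothetical name.

```python
def matches_reference(output: str, reference: str) -> bool:
    # Assumption: only leading/trailing whitespace is ignored.
    return output.strip() == reference.strip()


exact = matches_reference(" 42\n", "42")
interior = matches_reference("4 2", "42")
```
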
2. Contains (Keywords)

Checks if the output contains all specified keywords (case-insensitive).

"expected": {"contains": ["hello", "world"]}

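The keyword check can be sketched as follows. `missing_keywords` is a hypothetical helper that returns the keywords not found, which is one way to produce a message such as "missing: please".

```python
def missing_keywords(output: str, keywords: list) -> list:
    """Return the keywords that do not appear in output (case-insensitive)."""
    lowered = output.lower()
    return [kw for kw in keywords if kw.lower() not in lowered]


all_present = missing_keywords("Hello there, World!", ["hello", "world"])
one_missing = missing_keywords("Hello, how can I help?", ["hello", "please"])
```

The check passes when the returned list is empty.
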
3. Regex (Pattern Matching)

Checks if the output matches a regular expression.

"expected": {"regex": "^\\d{4}-\\d{2}-\\d{2}$"}

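Because the dataset is JSON, every backslash in a pattern has to be doubled in the file. The sketch below decodes one dataset line and applies the pattern. It assumes the check behaves like `re.search`, which the README does not confirm; `matches_regex` is a hypothetical name.

```python
import json
import re

# Raw JSONL text: each regex backslash is doubled, as JSON requires.
line = r'{"id": "regex1", "input": "Give a date", "expected": {"regex": "^\\d{4}-\\d{2}-\\d{2}$"}}'
pattern = json.loads(line)["expected"]["regex"]  # decoded to ^\d{4}-\d{2}-\d{2}$


def matches_regex(output: str, pattern: str) -> bool:
    # Assumption: the pattern is searched for in the stripped output.
    return re.search(pattern, output.strip()) is not None


bare_date = matches_regex("2025-11-25", pattern)
wrapped_date = matches_regex("Date: 2025-11-25", pattern)
```

With the `^` and `$` anchors, extra text around the date causes the check to fail; drop the anchors to allow a date anywhere in the output.
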
4. Schema (JSON Validation)

Validates that the output is valid JSON and conforms to a JSON Schema.

"expected": {
  "schema": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"}
    },
    "required": ["name", "age"]
  }
}
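
To show the two stages (parse the JSON, then check it against the schema), here is a deliberately small standard-library sketch. It handles only `required` and flat `properties` types on a top-level object, so it is not a JSON Schema validator; a full validator such as the jsonschema package would normally do this job. `check_object` and `TYPE_MAP` are hypothetical names.

```python
import json

# Subset of JSON Schema type names mapped to Python types (illustrative only).
TYPE_MAP = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}

SCHEMA = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}


def check_object(raw_output: str, schema: dict) -> list:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    errors = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in data
    ]
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue
        type_name = spec.get("type")
        expected = TYPE_MAP.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        value = data[name]
        ok = isinstance(value, expected)
        # In Python, bool is a subclass of int, but JSON true/false is not a number.
        if isinstance(value, bool) and type_name in ("integer", "number"):
            ok = False
        if not ok:
            errors.append(f"{name}: expected {type_name}")
    return errors


valid = check_object('{"name": "Ada", "age": 36}', SCHEMA)
wrong_type = check_object('{"name": "Ada", "age": "36"}', SCHEMA)
```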

5. Safety (Harmful Content)

Checks for harmful content using a keyword list. Can be inverted.

"expected": {"safe": true}

6. Judge (LLM-as-a-Judge)

Uses an LLM to evaluate the output based on a custom prompt. Requires judge_provider to be configured.

"expected": {
  "judge": {
    "prompt": "Is this response polite and professional?"
  }
}

Judge Scoring:

  • Scale: 0.0 to 1.0 (Float).
  • Rubric: 5-point scale (Perfect=1.0, Good=0.8, Partial=0.6, Poor=0.4, None=0.0).
  • Default Threshold: 0.7.
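
The rubric and threshold combine as in the sketch below. The label-to-score mapping and the 0.7 default come from the list above; `judge_passes` is a hypothetical helper, and the comparison is assumed to be "score greater than or equal to threshold".

```python
RUBRIC = {"Perfect": 1.0, "Good": 0.8, "Partial": 0.6, "Poor": 0.4, "None": 0.0}
DEFAULT_THRESHOLD = 0.7


def judge_passes(label: str, threshold: float = DEFAULT_THRESHOLD) -> bool:
    # Assumption: a score equal to the threshold counts as a pass.
    return RUBRIC[label] >= threshold


passing_labels = [label for label in RUBRIC if judge_passes(label)]
```

Under these assumptions only "Perfect" and "Good" clear the default threshold; lowering the threshold to 0.6 would also admit "Partial".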

7. Safety (Built-in)

Uses a hybrid approach:

  1. Refusal Detection: If the model refuses (e.g., "I cannot help"), it is considered Safe (Score 1.0).
  2. Keyword Matching: Checks for harmful keywords.
  3. LLM Judge: (Optional) Can use an LLM to evaluate safety if configured.
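
The first two steps of the hybrid approach can be sketched as follows. The marker and keyword tuples are placeholders invented for this example, not the library's lists, and the optional LLM judge step is left out.

```python
# Hypothetical placeholder lists; the library's real lists are not documented here.
REFUSAL_MARKERS = ("i cannot help", "i can't help")
HARMFUL_KEYWORDS = ("harmful-term-placeholder",)


def safety_score(output: str) -> float:
    """Return 1.0 for safe output and 0.0 for output that hits a harmful keyword."""
    text = output.lower()
    # Step 1: a refusal is treated as safe, whatever else the text contains.
    if any(marker in text for marker in REFUSAL_MARKERS):
        return 1.0
    # Step 2: any harmful keyword marks the output as unsafe.
    if any(keyword in text for keyword in HARMFUL_KEYWORDS):
        return 0.0
    return 1.0


refusal = safety_score("I cannot help with that request.")
flagged = safety_score("Sure, here is the harmful-term-placeholder you asked for.")
```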

⚙️ Configuration

LLM Expect supports configuration through decorator parameters and environment variables.

Decorator Parameters

@llm_expect(
    dataset="path/to/dataset.jsonl",       # Required: Path to JSONL dataset
    tests=["accuracy", "schema_fidelity"], # Optional: Metrics to evaluate (default: [])
    thresholds={"accuracy": 0.9},          # Optional: Pass/fail thresholds (default: 0.8)
    judge_provider="openai",               # Optional: LLM judge provider
    judge_model="gpt-5.1",                 # Optional: Judge model name
    sample_size=10,                        # Optional: Number of examples to sample
    shuffle=True,                          # Optional: Shuffle before sampling (default: False)
    cache=True,                            # Optional: Cache results (default: True)
    cache_dir=".llm_expect_cache",         # Optional: Cache directory
    results_dir="runs",                    # Optional: Results directory
    fail_fast=False,                       # Optional: Stop on first failure (default: False)
    timeout=60,                            # Optional: Function timeout in seconds
    save_results=True,                     # Optional: Save detailed results (default: True)
    parallel=False                         # Optional: Run tests in parallel (default: False)
)
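
To illustrate how sample_size and shuffle interact, the sketch below shuffles first and then takes the first sample_size examples. This is an assumption about the behavior, not a description of the library's internals; `select_examples` and its `seed` parameter are hypothetical.

```python
import random
from typing import Optional


def select_examples(
    examples: list,
    sample_size: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> list:
    items = list(examples)  # copy so the caller's list is left untouched
    if shuffle:
        random.Random(seed).shuffle(items)
    # Assumption: without shuffling, sampling keeps the first N examples.
    return items if sample_size is None else items[:sample_size]


first_two = select_examples(["a", "b", "c", "d"], sample_size=2)
shuffled = select_examples(["a", "b", "c", "d"], shuffle=True, seed=0)
```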

Environment Variables

All configuration parameters can be set via environment variables with the LLM_EXPECT_ prefix:

| Variable | Type | Description | Default |
|---|---|---|---|
| LLM_EXPECT_TESTS | List | Comma-separated metrics (e.g., "accuracy,safety") | [] |
| LLM_EXPECT_THRESHOLD | Float | Global threshold for all metrics | 0.8 |
| LLM_EXPECT_THRESHOLD_ACCURACY | Float | Threshold for accuracy metric | 0.8 |
| LLM_EXPECT_THRESHOLD_SAFETY | Float | Threshold for safety metric | 1.0 |
| LLM_EXPECT_SAMPLE_SIZE | Int | Number of examples to sample | All |
| LLM_EXPECT_SHUFFLE | Bool | Shuffle examples (true/false) | false |
| LLM_EXPECT_CACHE | Bool | Enable caching | true |
| LLM_EXPECT_CACHE_DIR | String | Cache directory path | .llm_expect_cache |
| LLM_EXPECT_RESULTS_DIR | String | Results directory path | runs |
| LLM_EXPECT_FAIL_FAST | Bool | Stop on first failure | false |
| LLM_EXPECT_TIMEOUT | Int | Function timeout (seconds) | 60 |
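
The sketch below shows one way such variables could be parsed into typed values. The variable names come from the table above; `read_env_config`, the output keys, and the case-insensitive handling of "true" are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

THRESHOLD_PREFIX = "LLM_EXPECT_THRESHOLD_"


def read_env_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Parse a few LLM_EXPECT_* variables into typed configuration values."""
    env = os.environ if env is None else env
    config = {}
    if "LLM_EXPECT_TESTS" in env:
        # Comma-separated list, tolerating spaces around the commas.
        config["tests"] = [
            t.strip() for t in env["LLM_EXPECT_TESTS"].split(",") if t.strip()
        ]
    if "LLM_EXPECT_THRESHOLD" in env:
        config["threshold"] = float(env["LLM_EXPECT_THRESHOLD"])
    if "LLM_EXPECT_SHUFFLE" in env:
        # Assumption: "true" is matched case-insensitively.
        config["shuffle"] = env["LLM_EXPECT_SHUFFLE"].strip().lower() == "true"
    if "LLM_EXPECT_TIMEOUT" in env:
        config["timeout"] = int(env["LLM_EXPECT_TIMEOUT"])
    # Per-metric thresholds such as LLM_EXPECT_THRESHOLD_SAFETY.
    per_metric = {
        key[len(THRESHOLD_PREFIX):].lower(): float(value)
        for key, value in env.items()
        if key.startswith(THRESHOLD_PREFIX)
    }
    if per_metric:
        config["thresholds"] = per_metric
    return config


parsed = read_env_config(
    {
        "LLM_EXPECT_TESTS": "accuracy, safety",
        "LLM_EXPECT_SHUFFLE": "TRUE",
        "LLM_EXPECT_THRESHOLD_SAFETY": "1.0",
        "LLM_EXPECT_TIMEOUT": "30",
    }
)
```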

Judge Configuration

For LLM-as-judge metrics (instruction_adherence, safety, custom_judge):

| Variable | Description | Default |
|---|---|---|
| LLM_EXPECT_JUDGE_MODEL | Judge model name | Provider-specific |
| LLM_EXPECT_JUDGE_API_KEY | Judge API key | From provider env var |
| LLM_EXPECT_JUDGE_BASE_URL | Custom API base URL | Provider default |
| LLM_EXPECT_JUDGE_TIMEOUT | Judge request timeout | 30 |
| LLM_EXPECT_JUDGE_MAX_RETRIES | Max retry attempts | 3 |
| LLM_EXPECT_JUDGE_TEMPERATURE | Judge temperature | 0.0 |

Provider API Keys:

  • OpenAI: OPENAI_API_KEY
  • Anthropic: ANTHROPIC_API_KEY
  • Bedrock: AWS_ACCESS_KEY_ID

Configuration Precedence:

  1. Decorator arguments (highest priority)
  2. Configuration file (pyproject.toml, llm_expect.json, etc.)
  3. Environment variables (lowest priority)
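
The precedence order above amounts to a layered merge, sketched below. `resolve_config` is a hypothetical helper, and treating a value of None as "not set" is an assumption of this sketch.

```python
def resolve_config(
    decorator_args: dict, file_config: dict, env_config: dict, defaults: dict
) -> dict:
    """Merge settings so that later layers override earlier ones."""
    merged = dict(defaults)
    # Lowest priority first: environment, then config file, then decorator.
    for layer in (env_config, file_config, decorator_args):
        merged.update({key: value for key, value in layer.items() if value is not None})
    return merged


resolved = resolve_config(
    decorator_args={"timeout": 10},
    file_config={"timeout": 20, "cache": False},
    env_config={"timeout": 30, "cache": True, "shuffle": True},
    defaults={"timeout": 60, "cache": True, "shuffle": False},
)
```

Here the decorator's timeout wins, the config file's cache setting beats the environment, and shuffle falls through to the environment because nothing above it sets the value.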

Defaults:

  • Provider: OpenAI (gpt-4)
  • Judge: OpenAI (gpt-4) if not specified.

🧪 Decorating an LLM Function

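from llm_expect import llm_expect
import openai  # the module-level client reads OPENAI_API_KEY from the environment

@llm_expect(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    # Request a JSON object. OpenAI's JSON mode expects the word "JSON"
    # to appear somewhere in the messages.
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    # Wrap the raw model text so the function returns a dict.
    return {"response": response.choices[0].message.content}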

📊 Running Evaluations

results = generate.run_eval()

print("Passed:", results["passed"])
print("Success Rate:", results["summary"]["success_rate"])
print("Details saved to:", results["run_dir"])

Example output:

✔ math1
✔ json1
✖ hello1 - missing: please
✔ regex1

Overall: 3/4 passed (75%)

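🧱 Optional: LLM-as-Judge Scoring

Useful for long-form or fuzzy outputs. Judge scoring requires judge_provider to be set; the configuration below leaves it as None, so no judge calls are made.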

@llm_expect(
    dataset="tests.jsonl",
    tests=["schema", "contains", "reference"],
    thresholds={"success_rate": 0.9},
    sample_size=None,
    cache=False,
    judge_provider=None,
    parallel=True,
)
def summarize(text: str) -> str:
    return llm_summarize(text)

Most tests require no API calls.


CI/CD Integration

- name: Run LLM Expect Tests
  run: |
    python -c "
    import sys
    from my_llm import generate
    # run_eval() returns a dict with a 'passed' boolean
    result = generate.run_eval()
    # A non-zero exit code fails the CI step
    sys.exit(0 if result['passed'] else 1)
    "

Note: There is no llm-expect run CLI command. You must execute your Python script to run evaluations. The CLI is strictly for managing results and validating datasets.


Results Format

Each run produces:

runs/
└── 2025-11-21_12-01-44/
    ├── results.jsonl
    ├── summary.json
    └── metadata.json

⚙️ Configuration Reference

Decorator Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| dataset | str | Required | Path to JSONL file (relative or absolute). |
| tests | list[str] | [] | Metrics to evaluate: ["accuracy", "schema_fidelity", "safety", "custom_judge"]. |
| thresholds | dict | {"accuracy": 0.8} | Pass/fail thresholds per metric. |
| judge_provider | str | None | LLM judge provider: "openai", "anthropic", "bedrock". |
| judge_model | str | Provider default | Specific model for the judge (e.g., "gpt-4"). |
| sample_size | int | None (All) | Number of examples to sample from the dataset. |
| shuffle | bool | False | Whether to shuffle examples before sampling. |
| cache | bool | True | Enable caching of results to avoid re-running passed tests. |
| cache_dir | str | ".llm_expect_cache" | Directory for cache files. |
| results_dir | str | "runs" | Directory to save detailed evaluation results. |
| fail_fast | bool | False | Stop evaluation immediately on the first failure. |
| timeout | int | 60 | Timeout in seconds for the decorated function execution. |
| parallel | bool | False | Run tests in parallel using ThreadPoolExecutor. Max workers = min(len(examples), 10). |
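
The worker cap described for the parallel argument can be sketched with the standard library. `run_parallel` is a hypothetical helper showing the min(len(examples), 10) rule, not the library's own runner.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_parallel(fn: Callable[[Any], Any], inputs: list) -> list:
    """Run fn over inputs on a thread pool capped at 10 workers."""
    if not inputs:
        return []  # ThreadPoolExecutor rejects max_workers=0
    max_workers = min(len(inputs), 10)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whichever thread finishes first.
        return list(pool.map(fn, inputs))


outputs = run_parallel(str.upper, ["alpha", "beta", "gamma"])
```

Threads suit this workload because LLM calls spend most of their time waiting on network I/O.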

Environment Variables

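All configuration parameters can also be set via environment variables with the LLM_EXPECT_ prefix. Decorator arguments take precedence over environment variables (see Configuration Precedence).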

| Variable | Type | Description |
|---|---|---|
| LLM_EXPECT_TESTS | List | Comma-separated metrics (e.g., "accuracy,safety"). |
| LLM_EXPECT_THRESHOLD | Float | Global threshold applied to all metrics. |
| LLM_EXPECT_THRESHOLD_{METRIC} | Float | Specific threshold for a metric (e.g., LLM_EXPECT_THRESHOLD_SAFETY). |
| LLM_EXPECT_SAMPLE_SIZE | Int | Number of examples to sample. |
| LLM_EXPECT_SHUFFLE | Bool | Shuffle examples (true/false). |
| LLM_EXPECT_CACHE | Bool | Enable/disable caching. |
| LLM_EXPECT_CACHE_DIR | Str | Cache directory path. |
| LLM_EXPECT_RESULTS_DIR | Str | Results directory path. |
| LLM_EXPECT_FAIL_FAST | Bool | Stop on first failure. |
| LLM_EXPECT_TIMEOUT | Int | Function timeout in seconds. |

Judge Configuration

For metrics that require an LLM judge (instruction_adherence, safety, custom_judge):

| Variable | Description | Default |
|---|---|---|
| LLM_EXPECT_JUDGE_MODEL | Judge model name | Provider default (e.g., GPT-4) |
| LLM_EXPECT_JUDGE_API_KEY | Judge API key | OPENAI_API_KEY / ANTHROPIC_API_KEY |
| LLM_EXPECT_JUDGE_BASE_URL | Custom API base URL | Provider default |
| LLM_EXPECT_JUDGE_TIMEOUT | Judge request timeout | 30 |
| LLM_EXPECT_JUDGE_MAX_RETRIES | Max retry attempts | 3 |
| LLM_EXPECT_JUDGE_TEMPERATURE | Judge temperature | 0.0 |

Results Folder Structure

LLM Expect automatically saves evaluation results in a session-based hierarchy:

runs/
└── 2025-11-23_a1b2c3d4/              # Session folder (date + session_id)
    ├── extract_correct/              # Function-specific results
    │   ├── results.jsonl             # Detailed test results (one per line)
    │   ├── summary.json              # Aggregated statistics
    │   ├── metadata.json             # Run configuration and info
    │   └── report.txt                # Human-readable report
    ├── extract_incorrect/
    │   └── ...
    └── judge_correct/
        └── ...

Session Grouping: All functions evaluated in the same script run share a session ID and are grouped under one master folder.

Files:

  • results.jsonl: Line-delimited JSON with each test result
  • summary.json: Success rate, metrics, timing stats
  • metadata.json: Function name, config, timestamp
  • report.txt: Formatted text report
  • report.html: Visual HTML report with charts and tables
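
For post-processing a run, the JSON files can be read with the standard library alone. The sketch below assumes only the file names listed above; `load_run` is a hypothetical helper, and no particular field names inside the files are assumed.

```python
import json
from pathlib import Path


def load_run(function_dir: str) -> tuple:
    """Read summary.json and results.jsonl from one function's results folder."""
    folder = Path(function_dir)
    summary = json.loads((folder / "summary.json").read_text(encoding="utf-8"))
    with (folder / "results.jsonl").open(encoding="utf-8") as handle:
        # One JSON document per non-empty line.
        results = [json.loads(line) for line in handle if line.strip()]
    return summary, results
```

Pass it the path of a function-specific folder inside a session folder to get the summary as a dict and the per-test results as a list of dicts.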

Contributing

PRs welcome.


License

MIT License. Free and open source.
