Vald8 is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

🧪 Vald8: Lightweight Evaluation Framework for LLM Reliability

Vald8 is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

🤖 For AI Assistants: Read LLM_INSTRUCTIONS.md for implementation patterns.

It provides a simple way to validate:

  • Schema correctness
  • Instruction adherence
  • Reference accuracy
  • Keyword / regex expectations

With optional support for LLM-as-Judge scoring.

Focus: Make LLM evaluation as easy as pytest. Nothing more. Nothing less.


Why Vald8?

If you're building with LLMs, you need a way to verify that your AI functions:

  • produce valid JSON
  • follow instructions consistently
  • don't regress when prompts or models change
  • behave consistently across environments
  • meet quality thresholds before deployment

Vald8 gives you this with:

  • ✔ One decorator
  • ✔ One JSONL file
  • ✔ One evaluation call

No configuration. No complexity. No over-engineering.


Install

pip install vald8

Core Concept

You decorate any LLM function:

from vald8 import vald8

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    ...

Vald8 loads your dataset, runs the function against each example, and scores the results.
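
Conceptually, that load/run/score flow resembles the standard-library sketch below. It is not Vald8's implementation: load_jsonl, run_examples, and the check callback are hypothetical names, and the "LLM" is a stub that always answers "4".

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def load_jsonl(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def run_examples(
    fn: Callable[[Any], Any],
    examples: List[Dict[str, Any]],
    check: Callable[[Any, Dict[str, Any]], bool],
) -> Dict[str, Any]:
    """Call fn on each example's input and score the output with check."""
    results = []
    for example in examples:
        output = fn(example["input"])
        passed = bool(check(output, example["expected"]))
        results.append({"id": example["id"], "passed": passed})
    passed_count = sum(1 for r in results if r["passed"])
    rate = passed_count / len(results) if results else 0.0
    return {"results": results, "success_rate": rate}


# Hypothetical stand-ins: a stub "LLM" and an exact-match checker.
dataset = load_jsonl([
    '{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}',
    '{"id": "math2", "input": "What is 3+3?", "expected": {"reference": "6"}}',
])
report = run_examples(
    fn=lambda prompt: "4",
    examples=dataset,
    check=lambda output, expected: output == expected["reference"],
)
print(report["success_rate"])
```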


Running Examples

Vald8 ships with example scripts that demonstrate how to evaluate functions against real LLM APIs (OpenAI, Anthropic, Gemini).

Prerequisites

  1. Install SDKs:

    pip install openai anthropic google-generativeai
    
  2. Set API Keys:

    export OPENAI_API_KEY="your-key-here"
    export ANTHROPIC_API_KEY="your-key-here"
    export GEMINI_API_KEY="your-key-here"
    

    Alternatively, use a .env file:

    1. Copy .env.example to .env:
      cp .env.example .env
      
    2. Edit .env and add your API keys.

Run the Examples

We provide specific examples for different testing scenarios:

1. Reference (Exact Match)

python examples/example_reference_openai.py

2. Summarization (Content Check)

python examples/example_summary_openai.py

3. Extraction (Schema Validation)

python examples/example_extraction_openai.py

4. Regex (Pattern Matching)

python examples/example_regex_openai.py

5. Safety (Refusal Check)

python examples/example_safety_openai.py

6. Judge (LLM Evaluation)

python examples/example_judge_openai.py

Each example script will:

  1. Load the evaluation dataset from examples/eval_dataset.jsonl.
  2. Run evaluations on the respective models.
  3. Output pass/fail results and success rates.

🖥️ Rich CLI

Vald8 includes a beautiful CLI for managing results:

# List recent runs
vald8 runs list

# Show detailed results for a run
vald8 runs show runs/2025-11-25_...

📝 JSONL Test Dataset Example

Save as tests.jsonl:

{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "json1", "input": "Return JSON with name and age", "expected": {"schema": {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "number"}}, "required": ["name", "age"]}}}
{"id": "hello1", "input": "Greet politely", "expected": {"contains": ["hello", "please"]}}
{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}
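
Backslashes in regex patterns must be doubled inside JSON strings, which is easy to get wrong by hand. One way to sidestep that is to build the dataset in Python and let the json module do the escaping. The sketch below uses only the standard library; to_jsonl is a hypothetical helper, not part of Vald8.

```python
import json
from typing import Any, Dict, List

examples: List[Dict[str, Any]] = [
    {"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}},
    # A raw string keeps the pattern readable; json.dumps escapes the backslashes.
    {"id": "regex1", "input": "Give a date", "expected": {"regex": r"\d{4}-\d{2}-\d{2}"}},
]


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


jsonl_text = to_jsonl(examples)

# Round-trip check: every line must parse back to the original record.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(jsonl_text)
```

Writing jsonl_text to tests.jsonl (for example with pathlib.Path.write_text) then gives a file in the format shown above.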

Supported expectations:

1. Reference (Exact Match)

Checks if the output matches a reference string exactly (ignoring whitespace).

"expected": {"reference": "42"}

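As an illustration, one possible reading of "ignoring whitespace" is to trim both strings and collapse internal runs of whitespace before comparing. That reading is an assumption, and matches_reference is a hypothetical helper; Vald8's own normalization may differ.

```python
def matches_reference(output: str, reference: str) -> bool:
    """Compare two strings after trimming and collapsing whitespace runs."""
    def normalize(text: str) -> str:
        return " ".join(text.split())

    return normalize(output) == normalize(reference)


print(matches_reference("  42\n", "42"))
print(matches_reference("forty two", "42"))
```
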
2. Contains (Keywords)

Checks if the output contains all specified keywords (case-insensitive).

"expected": {"contains": ["hello", "world"]}

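A case-insensitive keyword check can be sketched as follows. The helper name missing_keywords is hypothetical; returning the missing keywords, rather than a bare boolean, makes it easy to report which expectation failed.

```python
from typing import List


def missing_keywords(output: str, keywords: List[str]) -> List[str]:
    """Return the keywords that do not occur in output, ignoring case."""
    lowered = output.lower()
    return [keyword for keyword in keywords if keyword.lower() not in lowered]


missing = missing_keywords("Hello there, World!", ["hello", "world", "please"])
print(missing)
```
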
3. Regex (Pattern Matching)

Checks if the output matches a regular expression.

"expected": {"regex": "^\\d{4}-\\d{2}-\\d{2}$"}

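The doubled backslashes are JSON escaping: after parsing, the pattern contains single backslashes. The sketch below decodes the expectation and applies it with re.search; whether Vald8 uses search or a full match internally is not stated here, so treat that choice as an assumption (the ^ and $ anchors make the two nearly equivalent for this pattern).

```python
import json
import re

# The JSON text as it would appear on a dataset line.
expected = json.loads(r'{"regex": "^\\d{4}-\\d{2}-\\d{2}$"}')
pattern = re.compile(expected["regex"])


def matches(output: str) -> bool:
    """Return True when the decoded pattern is found in output."""
    return pattern.search(output) is not None


print(expected["regex"])
print(matches("2025-11-25"), matches("Date: 2025-11-25"))
```
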
4. Schema (JSON Validation)

Validates that the output is valid JSON and conforms to a JSON Schema.

"expected": {
  "schema": {
    "type": "object",
    "properties": {
      "name": {"type": "string"},
      "age": {"type": "integer"}
    },
    "required": ["name", "age"]
  }
}
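
To show what such a check involves, here is a hand-rolled validator covering only a small subset of JSON Schema (type, properties, required). It is a sketch under that stated limitation, with hypothetical function names; a full validator, or whatever Vald8 uses internally, handles far more keywords.

```python
import json
from typing import Any, Dict

_JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def conforms(value: Any, schema: Dict[str, Any]) -> bool:
    """Check value against the type/properties/required keywords only."""
    expected_type = schema.get("type")
    if expected_type is not None:
        # bool is a subclass of int in Python but is not a JSON number.
        if isinstance(value, bool) and expected_type in ("number", "integer"):
            return False
        if not isinstance(value, _JSON_TYPES[expected_type]):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], subschema):
                return False
    return True


def output_matches_schema(raw_output: str, schema: Dict[str, Any]) -> bool:
    """Return False for invalid JSON, otherwise run the subset check."""
    try:
        value = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return conforms(value, schema)


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
print(output_matches_schema('{"name": "Ada", "age": 36}', schema))
```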

5. Safety (Harmful Content)

Checks for harmful content using a keyword list. Can be inverted.

"expected": {"safe": true}

6. Judge (LLM-as-a-Judge)

Uses an LLM to evaluate the output based on a custom prompt. Requires judge_provider to be configured.

"expected": {
  "judge": {
    "prompt": "Is this response polite and professional?"
  }
}

⚙️ Configuration

Vald8 supports configuration through decorator parameters and environment variables.

Decorator Parameters

@vald8(
    dataset="path/to/dataset.jsonl",       # Required: Path to JSONL dataset
    tests=["accuracy", "schema_fidelity"], # Optional: Metrics to evaluate (default: [])
    thresholds={"accuracy": 0.9},          # Optional: Pass/fail thresholds (default: 0.8)
    judge_provider="openai",               # Optional: LLM judge provider
    judge_model="gpt-5.1",                 # Optional: Judge model name
    sample_size=10,                        # Optional: Number of examples to sample
    shuffle=True,                          # Optional: Shuffle before sampling (default: False)
    cache=True,                            # Optional: Cache results (default: True)
    cache_dir=".vald8_cache",              # Optional: Cache directory
    results_dir="runs",                    # Optional: Results directory
    fail_fast=False,                       # Optional: Stop on first failure (default: False)
    timeout=60,                            # Optional: Function timeout in seconds
    save_results=True,                     # Optional: Save detailed results (default: True)
    parallel=False                         # Optional: Run tests in parallel (default: False)
)

Environment Variables

All configuration parameters can be set via environment variables with the VALD8_ prefix:

| Variable | Type | Description | Default |
| --- | --- | --- | --- |
| VALD8_TESTS | List | Comma-separated metrics (e.g., "accuracy,safety") | [] |
| VALD8_THRESHOLD | Float | Global threshold for all metrics | 0.8 |
| VALD8_THRESHOLD_ACCURACY | Float | Threshold for accuracy metric | 0.8 |
| VALD8_THRESHOLD_SAFETY | Float | Threshold for safety metric | 1.0 |
| VALD8_SAMPLE_SIZE | Int | Number of examples to sample | All |
| VALD8_SHUFFLE | Bool | Shuffle examples (true/false) | false |
| VALD8_CACHE | Bool | Enable caching | true |
| VALD8_CACHE_DIR | String | Cache directory path | .vald8_cache |
| VALD8_RESULTS_DIR | String | Results directory path | runs |
| VALD8_FAIL_FAST | Bool | Stop on first failure | false |
| VALD8_TIMEOUT | Int | Function timeout (seconds) | 60 |
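
Environment variables are always strings, so the types listed above imply some parsing. The sketch below shows one plausible way to turn a few of them into typed values. read_config is a hypothetical helper, and the exact parsing rules (for example, treating only "true" as truthy, case-insensitively) are assumptions rather than Vald8's documented behavior.

```python
import os
from typing import Any, Dict, Mapping


def read_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Parse a handful of VALD8_* variables, using the documented defaults."""
    raw_tests = env.get("VALD8_TESTS", "")
    return {
        # Comma-separated list; blank entries are dropped.
        "tests": [name.strip() for name in raw_tests.split(",") if name.strip()],
        "threshold": float(env.get("VALD8_THRESHOLD", "0.8")),
        "shuffle": env.get("VALD8_SHUFFLE", "false").strip().lower() == "true",
        "timeout": int(env.get("VALD8_TIMEOUT", "60")),
    }


config = read_config({"VALD8_TESTS": "accuracy, safety", "VALD8_SHUFFLE": "TRUE"})
print(config)
```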

Judge Configuration

For LLM-as-judge metrics (instruction_adherence, safety, custom_judge):

| Variable | Description | Default |
| --- | --- | --- |
| VALD8_JUDGE_MODEL | Judge model name | Provider-specific |
| VALD8_JUDGE_API_KEY | Judge API key | From provider env var |
| VALD8_JUDGE_BASE_URL | Custom API base URL | Provider default |
| VALD8_JUDGE_TIMEOUT | Judge request timeout | 30 |
| VALD8_JUDGE_MAX_RETRIES | Max retry attempts | 3 |
| VALD8_JUDGE_TEMPERATURE | Judge temperature | 0.0 |

Provider API Keys:

  • OpenAI: OPENAI_API_KEY
  • Anthropic: ANTHROPIC_API_KEY
  • Bedrock: AWS_ACCESS_KEY_ID

🧪 Decorating an LLM Function

from vald8 import vald8
import openai

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    return {"response": response.choices[0].message.content}

📊 Running Evaluations

results = generate.run_eval()

print("Passed:", results["passed"])
print("Success Rate:", results["summary"]["success_rate"])
print("Details saved to:", results["run_dir"])

Example output:

✔ math1
✔ json1
✖ hello1 - missing: please
✔ regex1

Overall: 3/4 passed (75%)

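🧱 Optional: LLM-as-Judge Scoring

Useful for long-form or fuzzy outputs. The judge is opt-in: with judge_provider=None, as in the configuration below, only the deterministic checks run.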

@vald8(
    dataset="tests.jsonl",
    tests=["schema", "contains", "reference"],
    thresholds={"success_rate": 0.9},
    sample_size=None,
    cache=False,
    judge_provider=None,
    parallel=True,
)
def summarize(text: str) -> str:
    return llm_summarize(text)

Most tests require no API calls.


CI/CD Integration

- name: Run Vald8 Tests
  run: |
    python -c "
    from my_llm import generate
    assert generate.run_eval()['passed']
    "

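If an inline assert is too terse for your pipeline, a small gate script can print a summary and return an explicit exit code. The sketch below relies only on the results["passed"] and results["summary"]["success_rate"] keys shown earlier; ci_exit_code and min_success_rate are hypothetical, and the success rate is assumed to be a fraction between 0 and 1.

```python
import sys
from typing import Any, Dict


def ci_exit_code(results: Dict[str, Any], min_success_rate: float = 0.9) -> int:
    """Return 0 when the run passed and met the minimum rate, else 1."""
    rate = results["summary"]["success_rate"]
    ok = bool(results["passed"]) and rate >= min_success_rate
    print(f"passed={results['passed']} success_rate={rate:.0%}", file=sys.stderr)
    return 0 if ok else 1


# Stand-in for the dictionary returned by generate.run_eval().
fake_results = {"passed": True, "summary": {"success_rate": 0.75}}
code = ci_exit_code(fake_results)
print(code)
```

In a real pipeline the last lines would become sys.exit(ci_exit_code(generate.run_eval())).
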
Results Format

Each run produces:

runs/
└── 2025-11-21_12-01-44/
    ├── results.jsonl
    ├── summary.json
    └── metadata.json

⚙️ Configuration Reference

Decorator Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | str | Required | Path to JSONL file (relative or absolute). |
| tests | list[str] | [] | Metrics to evaluate: ["accuracy", "schema_fidelity", "safety", "custom_judge"]. |
| thresholds | dict | {"accuracy": 0.8} | Pass/fail thresholds per metric. |
| judge_provider | str | None | LLM judge provider: "openai", "anthropic", "bedrock". |
| judge_model | str | Provider default | Specific model for the judge (e.g., "gpt-4"). |
| sample_size | int | None (All) | Number of examples to sample from the dataset. |
| shuffle | bool | False | Whether to shuffle examples before sampling. |
| cache | bool | True | Enable caching of results to avoid re-running passed tests. |
| cache_dir | str | ".vald8_cache" | Directory for cache files. |
| results_dir | str | "runs" | Directory to save detailed evaluation results. |
| fail_fast | bool | False | Stop evaluation immediately on the first failure. |
| timeout | int | 60 | Timeout in seconds for the decorated function execution. |

Environment Variables

All configuration can be overridden by environment variables with the VALD8_ prefix.

| Variable | Type | Description |
| --- | --- | --- |
| VALD8_TESTS | List | Comma-separated metrics (e.g., "accuracy,safety"). |
| VALD8_THRESHOLD | Float | Global threshold applied to all metrics. |
| VALD8_THRESHOLD_{METRIC} | Float | Specific threshold for a metric (e.g., VALD8_THRESHOLD_SAFETY). |
| VALD8_SAMPLE_SIZE | Int | Number of examples to sample. |
| VALD8_SHUFFLE | Bool | Shuffle examples (true/false). |
| VALD8_CACHE | Bool | Enable/disable caching. |
| VALD8_CACHE_DIR | Str | Cache directory path. |
| VALD8_RESULTS_DIR | Str | Results directory path. |
| VALD8_FAIL_FAST | Bool | Stop on first failure. |
| VALD8_TIMEOUT | Int | Function timeout in seconds. |
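
With both a global and a per-metric threshold variable available, some precedence rule is needed. The sketch below assumes the metric-specific variable wins over the global one, which in turn wins over the built-in defaults listed earlier (1.0 for safety, 0.8 otherwise). That ordering is an assumption, and threshold_for is a hypothetical helper.

```python
from typing import Mapping

# Defaults taken from the tables in this document.
BUILTIN_DEFAULTS = {"safety": 1.0}
FALLBACK_DEFAULT = 0.8


def threshold_for(metric: str, env: Mapping[str, str]) -> float:
    """Resolve a threshold: metric-specific, then global, then built-in."""
    specific = env.get(f"VALD8_THRESHOLD_{metric.upper()}")
    if specific is not None:
        return float(specific)
    global_value = env.get("VALD8_THRESHOLD")
    if global_value is not None:
        return float(global_value)
    return BUILTIN_DEFAULTS.get(metric, FALLBACK_DEFAULT)


env = {"VALD8_THRESHOLD": "0.7", "VALD8_THRESHOLD_SAFETY": "0.95"}
print(threshold_for("safety", env), threshold_for("accuracy", env))
```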

Judge Configuration

For metrics that require an LLM judge (instruction_adherence, safety, custom_judge):

| Variable | Description | Default |
| --- | --- | --- |
| VALD8_JUDGE_MODEL | Judge model name | Provider default (e.g., GPT-4) |
| VALD8_JUDGE_API_KEY | Judge API key | OPENAI_API_KEY / ANTHROPIC_API_KEY |
| VALD8_JUDGE_BASE_URL | Custom API base URL | Provider default |
| VALD8_JUDGE_TIMEOUT | Judge request timeout | 30 |
| VALD8_JUDGE_MAX_RETRIES | Max retry attempts | 3 |
| VALD8_JUDGE_TEMPERATURE | Judge temperature | 0.0 |

📁 Results Folder Structure

Vald8 automatically saves evaluation results in a session-based hierarchy:

runs/
└── 2025-11-23_a1b2c3d4/              # Session folder (date + session_id)
    ├── extract_correct/              # Function-specific results
    │   ├── results.jsonl             # Detailed test results (one per line)
    │   ├── summary.json              # Aggregated statistics
    │   ├── metadata.json             # Run configuration and info
    │   └── report.txt                # Human-readable report
    ├── extract_incorrect/
    │   └── ...
    └── judge_correct/
        └── ...

Session Grouping: All functions evaluated in the same script run share a session ID and are grouped under one master folder.

Files:

  • results.jsonl: Line-delimited JSON with each test result
  • summary.json: Success rate, metrics, timing stats
  • metadata.json: Function name, config, timestamp
  • report.txt: Formatted text report
  • report.html: Visual HTML report with charts and tables
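
Because these files are plain JSON and JSONL, they can be inspected with the standard library alone. The sketch below loads summary.json and results.jsonl from one function folder. load_function_results is a hypothetical helper, and the field names in the demo data (success_rate, id, passed) are assumptions about the file contents, so check a real run before relying on them.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple


def load_function_results(function_dir: Path) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Read summary.json and the line-delimited results.jsonl from one folder."""
    summary = json.loads((function_dir / "summary.json").read_text(encoding="utf-8"))
    with (function_dir / "results.jsonl").open(encoding="utf-8") as handle:
        results = [json.loads(line) for line in handle if line.strip()]
    return summary, results


# Demo against a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    function_dir = Path(tmp) / "extract_correct"
    function_dir.mkdir()
    (function_dir / "summary.json").write_text('{"success_rate": 0.5}', encoding="utf-8")
    (function_dir / "results.jsonl").write_text(
        '{"id": "a", "passed": true}\n{"id": "b", "passed": false}\n',
        encoding="utf-8",
    )
    summary, results = load_function_results(function_dir)

failed_ids = [row["id"] for row in results if not row["passed"]]
print(summary["success_rate"], failed_ids)
```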

Contributing

PRs welcome.


License

MIT License: free and open source.
