This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

🧪 Vald8 — Lightweight Evaluation Framework for LLM Reliability

Vald8 is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

It provides a simple way to validate:

  • Schema correctness
  • Instruction adherence
  • Reference accuracy
  • Keyword / regex expectations

LLM-as-Judge scoring is available as an optional extra.

Focus: Make LLM evaluation as easy as pytest. Nothing more. Nothing less.


🚀 Why Vald8?

If you're building with LLMs, you need a way to verify that your AI functions:

  • produce valid JSON
  • follow instructions consistently
  • don't regress when prompts or models change
  • behave consistently across environments
  • meet quality thresholds before deployment

Vald8 gives you this with:

  • ✔ One decorator
  • ✔ One JSONL file
  • ✔ One evaluation call

No configuration. No complexity. No over-engineering.


📦 Install

pip install vald8

🧩 Core Concept

You decorate any LLM function:

from vald8 import vald8

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    ...

Vald8 loads your dataset, runs the function against each example, and scores the results.
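
To make the load, run, and score loop concrete, here is a minimal sketch of the decorator pattern. It is not Vald8's implementation: `mini_eval` is a hypothetical name, only the exact-match "reference" check is modeled, and the shape of the returned dictionary is an assumption loosely based on the keys used elsewhere in this README.

```python
import functools
import json
import os
import tempfile
from pathlib import Path


def mini_eval(dataset: str):
    """Hypothetical stand-in for an evaluation decorator (not Vald8's code)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Calling the decorated function directly behaves as before.
            return func(*args, **kwargs)

        def run_eval() -> dict:
            # Load one JSON object per non-empty line.
            lines = Path(dataset).read_text(encoding="utf-8").splitlines()
            examples = [json.loads(line) for line in lines if line.strip()]
            records = []
            for example in examples:
                output = func(example["input"])
                # Only the exact-match "reference" check is sketched here.
                ok = str(output) == example["expected"]["reference"]
                records.append({"id": example["id"], "passed": ok})
            passed = sum(r["passed"] for r in records)
            rate = passed / len(records) if records else 0.0
            return {
                "passed": passed == len(records),
                "summary": {"success_rate": rate},
                "records": records,
            }

        wrapper.run_eval = run_eval
        return wrapper

    return decorator


# Demo with a temporary dataset and a stand-in for a real LLM call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tests.jsonl")
    Path(path).write_text(
        '{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}\n'
        '{"id": "math2", "input": "What is 3+3?", "expected": {"reference": "6"}}\n',
        encoding="utf-8",
    )

    @mini_eval(dataset=path)
    def answer(prompt: str) -> str:
        return "4"  # always answers "4", so math2 fails

    results = answer.run_eval()

print(results["summary"]["success_rate"])
```

Attaching `run_eval` as an attribute keeps the decorated function callable as usual, so production code and evaluation share the same entry point.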


🚀 Running Examples

Vald8 comes with a realistic example script that demonstrates how to evaluate functions using real LLM APIs (OpenAI, Anthropic, Gemini).

Prerequisites

  1. Install SDKs:

    pip install openai anthropic google-generativeai
    
  2. Set API Keys:

    export OPENAI_API_KEY="your-key-here"
    export ANTHROPIC_API_KEY="your-key-here"
    export GEMINI_API_KEY="your-key-here"
    

Run the Example

python examples/basic_example.py

This script will:

  1. Load the evaluation dataset from examples/eval_dataset.jsonl.
  2. Run evaluations on OpenAI GPT-5.1, Claude 3.5, and Gemini 1.5 (skipping any missing SDKs/keys).
  3. Output pass/fail results and success rates for each model.
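
The "skipping any missing SDKs/keys" behavior can be approximated as follows. This is a sketch, not the code in examples/basic_example.py: the `PROVIDERS` table and both function names are hypothetical, and the module names are simply the import names of the SDKs listed in the prerequisites.

```python
import importlib.util
import os
from typing import Mapping

# Provider name -> (importable module, API key environment variable).
PROVIDERS = {
    "openai": ("openai", "OPENAI_API_KEY"),
    "anthropic": ("anthropic", "ANTHROPIC_API_KEY"),
    "gemini": ("google.generativeai", "GEMINI_API_KEY"),
}


def sdk_installed(module_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing or malformed.
        return False


def available_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """List providers that have both an API key set and an SDK installed."""
    return [
        name
        for name, (module, key) in PROVIDERS.items()
        if env.get(key) and sdk_installed(module)
    ]


print(available_providers())
```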

📁 JSONL Test Dataset Example

Save as tests.jsonl:

{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "json1", "input": "Return JSON with name and age", "expected": {"schema": {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "number"}}, "required": ["name", "age"]}}}
{"id": "hello1", "input": "Greet politely", "expected": {"contains": ["hello", "please"]}}
{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}
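
Backslashes in regex patterns must be doubled inside JSON strings, which is easy to get wrong by hand. One way to avoid the problem is to generate the file with the standard `json` module, as in this sketch (the variable names are illustrative):

```python
import json

examples = [
    {"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}},
    # A raw string keeps the regex readable; json.dumps adds the JSON escapes.
    {"id": "regex1", "input": "Give a date", "expected": {"regex": r"\d{4}-\d{2}-\d{2}"}},
]

# One JSON object per line, newline-terminated.
lines = [json.dumps(example) for example in examples]
jsonl_text = "\n".join(lines) + "\n"

# Round-trip check: every line parses back to the original record.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]

print(jsonl_text, end="")
# To save: pathlib.Path("tests.jsonl").write_text(jsonl_text, encoding="utf-8")
```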

Supported expectations:

  • "reference": "exact value"
  • "contains": ["word1", "word2"]
  • "regex": "pattern"
  • "schema": {...}
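
As an illustration of what each expectation type means, the sketch below checks an output string against an `expected` object. It is not Vald8's scoring code: the function names, the case-insensitive "contains" match, and the tiny JSON Schema subset (only "type", "required", and "properties") are all assumptions made for the example.

```python
import json
import re
from typing import Any

_JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def matches_schema(value: Any, schema: dict) -> bool:
    """Check a tiny subset of JSON Schema: "type", "required", "properties"."""
    expected_type = schema.get("type")
    if expected_type in _JSON_TYPES:
        if not isinstance(value, _JSON_TYPES[expected_type]):
            return False
        # bool is a subclass of int in Python, but not a JSON number.
        if expected_type == "number" and isinstance(value, bool):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not matches_schema(value[key], sub_schema):
                return False
    return True


def check(output: str, expected: dict) -> list[str]:
    """Return failure messages; an empty list means every check passed."""
    failures = []
    if "reference" in expected and output.strip() != expected["reference"]:
        failures.append("reference mismatch")
    # Assumption: keyword matching ignores case.
    missing = [w for w in expected.get("contains", []) if w.lower() not in output.lower()]
    if missing:
        failures.append("missing: " + ", ".join(missing))
    if "regex" in expected and not re.search(expected["regex"], output):
        failures.append("regex not matched")
    if "schema" in expected:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            failures.append("output is not valid JSON")
        else:
            if not matches_schema(parsed, expected["schema"]):
                failures.append("schema mismatch")
    return failures


print(check("Hello there", {"contains": ["hello", "please"]}))
```

For full JSON Schema support, a dedicated validator library would be the better choice; the hand-rolled subset here only keeps the example dependency-free.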

🧪 Decorating an LLM Function

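from vald8 import vald8
import openai  # reads OPENAI_API_KEY from the environment

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # JSON mode requires the word "JSON" to appear in the messages.
            {"role": "system", "content": "Respond with a JSON object."},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    # Return the raw model text under a "response" key.
    return {"response": response.choices[0].message.content}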

📊 Running Evaluations

results = generate.run_eval()

print("Passed:", results["passed"])
print("Success Rate:", results["summary"]["success_rate"])
print("Details saved to:", results["run_dir"])

Example output:

✔ math1
✔ json1
✖ hello1 — missing: please
✔ regex1

Overall: 3/4 passed (75%)

🧱 Optional: LLM-as-Judge Scoring

Useful for long-form or fuzzy outputs.

@vald8(
    dataset="tests.jsonl",
    judge_provider="openai"   # or "anthropic", "local"
)
def summarize(text: str) -> str:
    return llm_summarize(text)

The judge is optional: schema, reference, contains, and regex checks need no judge API calls.
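
For readers new to the technique, LLM-as-judge scoring generally means asking a second model to grade an output and parsing a numeric score from its reply. The sketch below shows that general pattern only. It does not reflect how Vald8's judge works: the prompt wording, the 0 to 10 scale, and every function name are assumptions, and `fake_judge` stands in for a real model call.

```python
import re
from typing import Callable


def build_judge_prompt(task: str, output: str) -> str:
    """Assemble a grading prompt for the judge model."""
    return (
        "You are grading the output of another model.\n"
        f"Task: {task}\n"
        f"Output: {output}\n"
        "Reply with a single score from 0 to 10."
    )


def judge_score(task: str, output: str, ask: Callable[[str], str]) -> float:
    """Ask a judge for a 0-10 score and normalize it to the range 0.0-1.0."""
    reply = ask(build_judge_prompt(task, output))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"judge reply contained no score: {reply!r}")
    # Clamp so an out-of-range reply cannot exceed 1.0.
    return min(float(match.group()), 10.0) / 10.0


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, so the example runs offline."""
    return "Score: 8"


score = judge_score("Summarize the article", "A short summary.", fake_judge)
print(score)
```

Passing the judge in as a callable keeps the scoring logic testable without network access.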


🧩 CI/CD Integration

- name: Run Vald8 Tests
  run: |
    python -c "
    from my_llm import generate
    assert generate.run_eval()['passed']
    "
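
Because `assert` statements are skipped when Python runs with `-O`, a CI job may prefer an explicit exit code. The following is a sketch of such a gate script. The helper names are hypothetical, and it assumes only the "passed" and "summary" keys shown in this README.

```python
import sys


def exit_code(results: dict) -> int:
    """Map an evaluation result to a process exit code (0 means success)."""
    return 0 if results.get("passed") else 1


def main(results: dict) -> None:
    """Print a one-line summary and exit with the matching status."""
    rate = results.get("summary", {}).get("success_rate")
    print(f"passed={results.get('passed')} success_rate={rate}")
    sys.exit(exit_code(results))


# In a real pipeline (my_llm is the module from the workflow step above):
#     from my_llm import generate
#     main(generate.run_eval())
```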

📁 Results Format

Each run produces:

runs/
└── 2025-11-21_12-01-44/
    ├── results.jsonl
    ├── summary.json
    └── metadata.json
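
Since run directories are named with a sortable timestamp, the newest run can be found by sorting the names. The sketch below assumes that layout and nothing more. The fields inside summary.json are not documented here, so the demo writes a placeholder file, and both function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def latest_run_dir(root: str = "runs") -> Optional[Path]:
    """Return the newest run directory, relying on YYYY-MM-DD_HH-MM-SS names."""
    base = Path(root)
    if not base.is_dir():
        return None
    run_dirs = sorted(p for p in base.iterdir() if p.is_dir())
    return run_dirs[-1] if run_dirs else None


def load_summary(run_dir: Path) -> dict:
    """Parse summary.json from a run directory."""
    return json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))


# Demo against a temporary directory with two fake runs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2025-11-20_09-00-00", "2025-11-21_12-01-44"):
        run = Path(tmp) / name
        run.mkdir()
        # Placeholder content; the real summary schema is not specified here.
        (run / "summary.json").write_text(json.dumps({"run": name}), encoding="utf-8")
    newest = latest_run_dir(tmp)
    newest_name = newest.name
    summary = load_summary(newest)

print(newest_name)
```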

🔧 Configuration Options

@vald8(
    dataset="tests.jsonl",
    tests=["schema", "contains", "reference"],
    thresholds={"success_rate": 0.9},
    sample_size=None,
    cache=False,
    judge_provider=None,
)

All parameters are optional.
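
To illustrate how a `thresholds` setting could turn a summary into a pass/fail decision, here is a small sketch. Whether Vald8 compares with "at least" or "strictly greater" is not stated in this README, so the `>=` comparison and the function name are assumptions.

```python
def meets_thresholds(summary: dict, thresholds: dict) -> bool:
    """True when every thresholded metric is at or above its minimum."""
    return all(
        summary.get(metric, 0.0) >= minimum
        for metric, minimum in thresholds.items()
    )


# 3 of 4 examples passed, as in the example output earlier in this README.
summary = {"success_rate": 3 / 4}
gate = meets_thresholds(summary, {"success_rate": 0.9})
print(gate)
```

With a 0.9 threshold, a 75% success rate does not clear the gate, which is how a single flaky example can block a deployment.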


🛠 Minimal Feature Set (v0.1)

Included:

  • ✔ Test decorator
  • ✔ JSONL dataset loader
  • ✔ Schema validation
  • ✔ Contains / reference / regex checks
  • ✔ Optional LLM-as-judge
  • ✔ Clear results + artifacts
  • ✔ Offline mode
  • ✔ CI/CD-ready
  • ✔ Zero-config defaults

🤝 Contributing

PRs welcome.


📜 License

MIT License — free and open source.

Download files

Source Distribution

vald8-0.1.0.tar.gz (34.6 kB)

Uploaded Source

Built Distribution

vald8-0.1.0-py3-none-any.whl (33.4 kB)

Uploaded Python 3

File details

Details for the file vald8-0.1.0.tar.gz.

File metadata

  • Download URL: vald8-0.1.0.tar.gz
  • Upload date:
  • Size: 34.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vald8-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7e5f26369f708c6e4fd71042ccbd0ace7e71c652bccfd39d3d9daca8209c6e74
MD5 acec1601b4381039c7115e66695a0887
BLAKE2b-256 6022085445adeb451cfe70b54bd400b6f32bec4ab7f915c374b5b9fae0f5906d

File details

Details for the file vald8-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: vald8-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vald8-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 efead191bf9be23e379fd6c45e7c98a04bb59b0a358260358c1a77f392ef81c1
MD5 20fb4a8699e4c7f2b076c9d03f113151
BLAKE2b-256 46cc807f7ab6ea44ff194f0168e9bbaf71f4b183bc629181baa284f5a0b1b380
