Vald8 is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.
This project has been archived by its maintainers. No new releases are expected.
🧪 Vald8 — Lightweight Evaluation Framework for LLM Reliability
Vald8 provides a simple way to validate:
- Schema correctness
- Instruction adherence
- Reference accuracy
- Keyword / regex expectations
LLM-as-Judge scoring is available as an optional extra.
Focus: Make LLM evaluation as easy as pytest. Nothing more. Nothing less.
🚀 Why Vald8?
If you're building with LLMs, you need a way to verify that your AI functions:
- produce valid JSON
- follow instructions consistently
- don't regress when prompts or models change
- behave consistently across environments
- meet quality thresholds before deployment
Vald8 gives you this with:
- ✔ One decorator
- ✔ One JSONL file
- ✔ One evaluation call
No configuration. No complexity. No over-engineering.
📦 Install
pip install vald8
🧩 Core Concept
You decorate any LLM function:
from vald8 import vald8

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    ...
Vald8 loads your dataset, runs the function against each example, and scores the results.
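As a rough mental model of that loop, the sketch below reads a JSONL file, calls a function on each example's `input`, and keeps the output next to the `expected` block for later scoring. It is not Vald8's actual implementation; the helper names `load_jsonl` and `run_examples` are hypothetical, and only the `id`, `input`, and `expected` fields from the dataset example are assumed.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, List


def load_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed example per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def run_examples(func: Callable[[str], Any], path: str) -> List[Dict[str, Any]]:
    """Call func on every example input and pair the output with its expectation."""
    records = []
    for example in load_jsonl(path):
        records.append({
            "id": example["id"],
            "output": func(example["input"]),
            "expected": example.get("expected", {}),
        })
    return records
```

Keeping the raw output and the expectation side by side makes the scoring step a separate, easily testable function.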
🚀 Running Examples
Vald8 comes with a realistic example script that demonstrates how to evaluate functions using real LLM APIs (OpenAI, Anthropic, Gemini).
Prerequisites
- Install SDKs:
pip install openai anthropic google-generativeai
- Set API Keys:
export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
export GEMINI_API_KEY="your-key-here"
Run the Example
python examples/basic_example.py
This script will:
- Load the evaluation dataset from examples/eval_dataset.jsonl.
- Run evaluations on OpenAI GPT-5.1, Claude 3.5, and Gemini 1.5 (skipping any missing SDKs/keys).
- Output pass/fail results and success rates for each model.
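The skipping behaviour could look something like the sketch below, which checks for each provider whether its SDK is importable and its key is set. This is an illustration, not the code in examples/basic_example.py; the `PROVIDERS` table and both helper names are hypothetical, while the module names and environment variables follow the prerequisites above.

```python
import importlib.util
import os
from typing import Callable, Dict, List, Mapping, Tuple

# provider name -> (importable module, environment variable holding the key)
PROVIDERS: Dict[str, Tuple[str, str]] = {
    "openai": ("openai", "OPENAI_API_KEY"),
    "anthropic": ("anthropic", "ANTHROPIC_API_KEY"),
    "gemini": ("google.generativeai", "GEMINI_API_KEY"),
}


def module_installed(name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package such as "google" is missing.
        return False


def available_providers(
    env: Mapping[str, str] = os.environ,
    has_module: Callable[[str], bool] = module_installed,
) -> List[str]:
    """List providers whose SDK is installed and whose API key is set."""
    ready = []
    for provider, (module, key_var) in PROVIDERS.items():
        if not has_module(module):
            print(f"skipping {provider}: SDK '{module}' is not installed")
        elif not env.get(key_var):
            print(f"skipping {provider}: {key_var} is not set")
        else:
            ready.append(provider)
    return ready
```

Passing `env` and `has_module` as parameters keeps the check testable without touching the real environment.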
📁 JSONL Test Dataset Example
Save as tests.jsonl:
{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "json1", "input": "Return JSON with name and age", "expected": {"schema": {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "number"}}, "required": ["name", "age"]}}}
{"id": "hello1", "input": "Greet politely", "expected": {"contains": ["hello", "please"]}}
{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}
Supported expectations:
- "reference": "exact value"
- "contains": ["word1", "word2"]
- "regex": "pattern"
- "schema": {...}
🧪 Decorating an LLM Function
from vald8 import vald8
import openai

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return {"response": response.choices[0].message.content}
📊 Running Evaluations
results = generate.run_eval()
print("Passed:", results["passed"])
print("Success Rate:", results["summary"]["success_rate"])
print("Details saved to:", results["run_dir"])
Example output:
✔ math1
✔ json1
✖ hello1 — missing: please
✔ regex1
Overall: 3/4 passed (75%)
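To make a line such as `missing: please` concrete, here is a minimal sketch of how the `reference`, `contains`, and `regex` expectations could be checked with the standard library. The function `check_expectations` is hypothetical, and its matching rules (exact match after stripping whitespace, case-insensitive `contains`, `re.search` for patterns) are assumptions rather than Vald8's documented behaviour. Schema checks are left out because they would need a JSON Schema validator.

```python
import json
import re
from typing import Any, Dict, List


def check_expectations(output: str, expected: Dict[str, Any]) -> List[str]:
    """Return failure messages; an empty list means every check passed."""
    failures = []
    if "reference" in expected and output.strip() != expected["reference"]:
        failures.append(f"reference mismatch: expected {expected['reference']!r}")
    lowered = output.lower()
    missing = [word for word in expected.get("contains", []) if word.lower() not in lowered]
    if missing:
        failures.append("missing: " + ", ".join(missing))
    if "regex" in expected and re.search(expected["regex"], output) is None:
        failures.append(f"no match for pattern {expected['regex']!r}")
    return failures


# Backslashes must be doubled inside JSON strings, so \d is written as \\d.
line = r'{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}'
example = json.loads(line)
print(check_expectations("Today is 2025-11-21.", example["expected"]))  # []
print(check_expectations("Hello there", {"contains": ["hello", "please"]}))  # ['missing: please']
```

Returning messages instead of a single boolean makes it easy to report why an example failed.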
🧱 Optional: LLM-as-Judge Scoring
Useful for long-form or fuzzy outputs.
@vald8(
    dataset="tests.jsonl",
    judge_provider="openai",  # or "anthropic", "local"
)
def summarize(text: str) -> str:
    return llm_summarize(text)
Most tests need no judge API calls: schema, reference, contains, and regex checks are evaluated without one.
🧩 CI/CD Integration
- name: Run Vald8 Tests
  run: |
    python -c "
    from my_llm import generate
    assert generate.run_eval()['passed']
    "
📁 Results Format
Each run produces:
runs/
└── 2025-11-21_12-01-44/
├── results.jsonl
├── summary.json
└── metadata.json
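When several runs accumulate, a small helper can locate the newest one. The sketch below assumes that run folders are named with a timestamp in the `YYYY-MM-DD_HH-MM-SS` form shown in the tree above; `latest_run_dir` is a hypothetical helper, and nothing is assumed about the contents of the files inside.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

RUN_NAME_FORMAT = "%Y-%m-%d_%H-%M-%S"  # matches names like 2025-11-21_12-01-44


def latest_run_dir(root: str = "runs") -> Optional[Path]:
    """Return the most recent timestamped run directory, or None if there is none."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    candidates = []
    for child in root_path.iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, RUN_NAME_FORMAT)
        except ValueError:
            continue  # ignore directories that are not run folders
        candidates.append((stamp, child))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

Parsing the name rather than relying on file modification times keeps the result stable when run folders are copied between machines or restored from a CI cache.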
🔧 Configuration Options
@vald8(
    dataset="tests.jsonl",
    tests=["schema", "contains", "reference"],
    thresholds={"success_rate": 0.9},
    sample_size=None,
    cache=False,
    judge_provider=None,
)
All parameters are optional.
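To illustrate what a `thresholds` setting implies, the following sketch computes a success rate from per-example outcomes and compares it with a minimum. Both function names are hypothetical, and the inclusive comparison (at or above the threshold) is an assumption, not confirmed Vald8 behaviour.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of examples that passed; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def meets_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """True when every thresholded metric is at or above its minimum."""
    return all(metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items())


rate = success_rate([True, True, False, True])  # the 3/4 run shown earlier
print(rate)  # 0.75
print(meets_thresholds({"success_rate": rate}, {"success_rate": 0.9}))  # False
```

Under these assumptions, the 75% example run would fail a 0.9 threshold, which is the kind of result a CI step would treat as a blocking failure.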
🛠 Minimal Feature Set (v0.1)
Included:
- ✔ Test decorator
- ✔ JSONL dataset loader
- ✔ Schema validation
- ✔ Contains / reference / regex checks
- ✔ Optional LLM-as-judge
- ✔ Clear results + artifacts
- ✔ Offline mode
- ✔ CI/CD-ready
- ✔ Zero-config defaults
🤝 Contributing
PRs welcome.
📜 License
MIT License — free and open source.