🧪 Vald8 — Lightweight Evaluation Framework for LLM Reliability

Vald8 is a minimalist, developer-first SDK for testing LLM-powered Python functions using structured JSONL datasets.

It provides a simple way to validate:

Schema correctness
Instruction adherence
Reference accuracy
Keyword / regex expectations

LLM-as-Judge scoring is also supported as an optional feature.

Focus: Make LLM evaluation as easy as pytest. Nothing more. Nothing less.

🚀 Why Vald8?

If you're building with LLMs, you need a way to verify that your AI functions:

produce valid JSON
follow instructions consistently
don't regress when prompts or models change
behave consistently across environments
meet quality thresholds before deployment

Vald8 gives you this with:

✔ One decorator
✔ One JSONL file
✔ One evaluation call

No configuration. No complexity. No over-engineering.

📦 Install

pip install vald8

🧩 Core Concept

You decorate any LLM function:

from vald8 import vald8

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    ...

Vald8 loads your dataset, runs the function against each example, and scores the results.
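
The load, run, and score loop described above can be pictured with a minimal sketch. This is not Vald8's actual implementation: the `eval_sketch` decorator, its exact-match scoring, and the shape of the returned dictionary are assumptions chosen for illustration, and only the `reference` expectation is handled.

```python
import functools
import json
from pathlib import Path
from typing import Any, Callable


def eval_sketch(dataset: str) -> Callable:
    """Hypothetical stand-in for a dataset-driven evaluation decorator."""

    def decorator(func: Callable[[str], Any]) -> Callable[[str], Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Normal calls pass straight through to the wrapped function.
            return func(*args, **kwargs)

        def run_eval() -> dict:
            # Load one JSON object per non-empty line of the dataset.
            lines = Path(dataset).read_text(encoding="utf-8").splitlines()
            examples = [json.loads(line) for line in lines if line.strip()]

            outcomes = []
            for example in examples:
                output = func(example["input"])
                # Assumption: score by exact match against "reference" only.
                expected = example["expected"].get("reference")
                outcomes.append(
                    {"id": example["id"], "passed": str(output) == expected}
                )

            passed = sum(o["passed"] for o in outcomes)
            rate = passed / len(outcomes) if outcomes else 0.0
            return {
                "passed": passed == len(outcomes),
                "summary": {"success_rate": rate},
                "results": outcomes,
            }

        # Attach the evaluation entry point to the decorated function.
        wrapper.run_eval = run_eval
        return wrapper

    return decorator
```

Attaching `run_eval` as an attribute keeps the decorated function callable as usual, so the same code path serves both production calls and evaluation runs.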

🚀 Running Examples

Vald8 comes with a realistic example script that demonstrates how to evaluate functions using real LLM APIs (OpenAI, Anthropic, Gemini).

Prerequisites

Install SDKs:

pip install openai anthropic google-generativeai

Set API Keys:

export OPENAI_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
export GEMINI_API_KEY="your-key-here"

Run the Example

python examples/basic_example.py

This script will:

Load the evaluation dataset from examples/eval_dataset.jsonl.
Run evaluations on OpenAI GPT-5.1, Claude 3.5, and Gemini 1.5 (skipping any missing SDKs/keys).
Output pass/fail results and success rates for each model.

📁 JSONL Test Dataset Example

Save as tests.jsonl:

{"id": "math1", "input": "What is 2+2?", "expected": {"reference": "4"}}
{"id": "json1", "input": "Return JSON with name and age", "expected": {"schema": {"type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "number"}}, "required": ["name", "age"]}}}
{"id": "hello1", "input": "Greet politely", "expected": {"contains": ["hello", "please"]}}
{"id": "regex1", "input": "Give a date", "expected": {"regex": "\\d{4}-\\d{2}-\\d{2}"}}

Supported expectations:

"reference": "exact value"
"contains": ["word1", "word2"]
"regex": "pattern"
"schema": {...}
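
To make the four expectation types concrete, the following sketch shows one way such checks could be evaluated against a string output. The function name `check_expectations` and its semantics (exact string match for `reference`, case-insensitive substring match for `contains`, `re.search` for `regex`, and a check of the schema's `required` keys only) are assumptions for illustration. Vald8's own rules may differ, and a full implementation would presumably use a proper JSON Schema validator.

```python
import json
import re


def check_expectations(output: str, expected: dict) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    if "reference" in expected and output.strip() != expected["reference"]:
        failures.append(f"reference mismatch: expected {expected['reference']!r}")

    # Assumption: keyword matching ignores case.
    missing = [
        word for word in expected.get("contains", [])
        if word.lower() not in output.lower()
    ]
    if missing:
        failures.append("missing: " + ", ".join(missing))

    if "regex" in expected and not re.search(expected["regex"], output):
        failures.append(f"no match for pattern {expected['regex']!r}")

    if "schema" in expected:
        # Simplified: parse the JSON and check only the schema's "required" keys.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            failures.append("output is not valid JSON")
        else:
            keys = data if isinstance(data, dict) else {}
            absent = [
                key for key in expected["schema"].get("required", [])
                if key not in keys
            ]
            if absent:
                failures.append("missing required keys: " + ", ".join(absent))

    return failures


print(check_expectations("Hello there!", {"contains": ["hello", "please"]}))
# prints ['missing: please']
```

Returning messages rather than a bare boolean makes it straightforward to report why an example failed.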

🧪 Decorating an LLM Function

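from vald8 import vald8
import openai

@vald8(dataset="tests.jsonl")
def generate(prompt: str) -> dict:
    # The module-level client reads OPENAI_API_KEY from the environment.
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # JSON mode expects the word "JSON" to appear in the messages.
            {"role": "system", "content": "Respond with a JSON object."},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    # The message content is a JSON-formatted string.
    return {"response": response.choices[0].message.content}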

📊 Running Evaluations

results = generate.run_eval()

print("Passed:", results["passed"])
print("Success Rate:", results["summary"]["success_rate"])
print("Details saved to:", results["run_dir"])

Example output:

✔ math1
✔ json1
✖ hello1 — missing: please
✔ regex1

Overall: 3/4 passed (75%)

🧱 Optional: LLM-as-Judge Scoring

Useful for long-form or fuzzy outputs.

@vald8(
    dataset="tests.jsonl",
    judge_provider="openai"   # or "anthropic", "local"
)
def summarize(text: str) -> str:
    # llm_summarize is a placeholder for your own LLM call
    return llm_summarize(text)

Most tests require no judge API calls; LLM-as-judge scoring is only needed when you opt in with judge_provider.

🧩 CI/CD Integration

- name: Run Vald8 Tests
  run: |
    python -c "
    from my_llm import generate
    assert generate.run_eval()['passed']
    "
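
For a more readable failure message in CI logs, the inline assertion can be replaced by a small helper script. The sketch below assumes the results dictionary has the `passed` and `summary.success_rate` fields used in the Running Evaluations section; the helper name `exit_code_for` and the message format are hypothetical.

```python
def exit_code_for(results: dict) -> int:
    """Map an evaluation result to a process exit code (0 means success)."""
    rate = results["summary"]["success_rate"]
    if results["passed"]:
        print(f"Evaluation passed ({rate:.0%} success rate)")
        return 0
    print(f"Evaluation failed ({rate:.0%} success rate)")
    return 1


if __name__ == "__main__":
    # Stand-in result; in CI this would come from generate.run_eval().
    example = {"passed": False, "summary": {"success_rate": 0.75}}
    print("exit code:", exit_code_for(example))
```

In a real pipeline the script would end with `sys.exit(exit_code_for(generate.run_eval()))`, so that a failed evaluation fails the job.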

📁 Results Format

Each run produces:

runs/
└── 2025-11-21_12-01-44/
    ├── results.jsonl
    ├── summary.json
    └── metadata.json
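
Because run directories are named by timestamp in the layout above, the most recent run can be located by sorting directory names. The sketch below assumes that naming pattern holds for every run and that `summary.json` contains a JSON object. The helper `latest_summary` is hypothetical, and the fields inside the summary are not specified here.

```python
import json
from pathlib import Path
from typing import Optional


def latest_summary(runs_root: str = "runs") -> Optional[dict]:
    """Load summary.json from the newest timestamp-named run directory, if any."""
    root = Path(runs_root)
    if not root.is_dir():
        return None
    # YYYY-MM-DD_HH-MM-SS names sort chronologically as plain strings.
    run_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    for run_dir in reversed(run_dirs):
        summary_path = run_dir / "summary.json"
        if summary_path.is_file():
            return json.loads(summary_path.read_text(encoding="utf-8"))
    return None
```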

🔧 Configuration Options

@vald8(
    dataset="tests.jsonl",
    tests=["schema", "contains", "reference"],
    thresholds={"success_rate": 0.9},
    sample_size=None,
    cache=False,
    judge_provider=None,
)

All parameters are optional.

🛠 Minimal Feature Set (v0.1)

Included:

✔ Test decorator
✔ JSONL dataset loader
✔ Schema validation
✔ Contains / reference / regex checks
✔ Optional LLM-as-judge
✔ Clear results + artifacts
✔ Offline mode
✔ CI/CD-ready
✔ Zero-config defaults

🤝 Contributing

PRs welcome.

📜 License

MIT License — free and open source.

Release history

0.1.6 (Nov 26, 2025)
0.1.5 (Nov 24, 2025)
0.1.4 (Nov 24, 2025)
0.1.3 (Nov 24, 2025)
0.1.2 (Nov 24, 2025)
0.1.1 (Nov 24, 2025)
0.1.0 (Nov 24, 2025, this version)

Download files

Source Distribution

vald8-0.1.0.tar.gz (34.6 kB)

Uploaded Nov 24, 2025 Source

Built Distribution

vald8-0.1.0-py3-none-any.whl (33.4 kB)

Uploaded Nov 24, 2025 Python 3

File details

Details for the file vald8-0.1.0.tar.gz.

File metadata

Download URL: vald8-0.1.0.tar.gz
Upload date: Nov 24, 2025
Size: 34.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vald8-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`7e5f26369f708c6e4fd71042ccbd0ace7e71c652bccfd39d3d9daca8209c6e74`
MD5	`acec1601b4381039c7115e66695a0887`
BLAKE2b-256	`6022085445adeb451cfe70b54bd400b6f32bec4ab7f915c374b5b9fae0f5906d`

File details

Details for the file vald8-0.1.0-py3-none-any.whl.

File metadata

Download URL: vald8-0.1.0-py3-none-any.whl
Upload date: Nov 24, 2025
Size: 33.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for vald8-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`efead191bf9be23e379fd6c45e7c98a04bb59b0a358260358c1a77f392ef81c1`
MD5	`20fb4a8699e4c7f2b076c9d03f113151`
BLAKE2b-256	`46cc807f7ab6ea44ff194f0168e9bbaf71f4b183bc629181baa284f5a0b1b380`
