dataforge-evals
dataforge-evals is an agent-agnostic evaluation harness for data-quality repair agents.
It gives any agent the same task, accepts only proposed cell fixes, and lets the grader compute exact precision, recall, F1, steps, failures, and free-tier quota usage. The harness can load DataForge's canonical Hospital, Flights, and Beers benchmark tasks when dataforge_07 is installed, while the import namespace remains dataforge for the 0.1 line.
The PyPI package is not published yet; use the source install instructions
below until release ownership is configured.
pip install -e ".[dev]"
dataforge-evals run --agent mock --dataset synthetic --trials 3
Install
From source (development)
python -m venv .venv
# Linux/macOS:
source .venv/bin/activate
# Windows PowerShell:
.\.venv\Scripts\Activate.ps1
pip install -e ".[dev]"
With canonical DataForge datasets
pip install -e "../data_quality_env"
dataforge-evals run --agent mock --dataset hospital --trials 3
Run a provider
# Windows cmd:
set GROQ_API_KEY=...
# Linux/macOS:
export GROQ_API_KEY=...
dataforge-evals run --agent groq-llama-70b --dataset hospital --trials 3 --output reports/groq-hospital.md
Bounded Groq smoke test
Use a single synthetic trial to verify Groq wiring without turning the smoke check into a benchmark:
dataforge-evals run --agent groq-llama-70b --dataset synthetic --trials 1 --seed 0 --timeout-s 20 --output reports/groq-synthetic-smoke.md --output-json reports/groq-synthetic-smoke.json
For this smoke path, trials_completed=1 and Failures=none prove the
integration completed successfully. F1 is a quality signal for the model's
proposed repairs, not the API health check. The JSON report includes the
normalized proposed fixes for debugging; Markdown stays summary-only.
Built-in adapters
| Agent ID | Provider | Required Setup |
|---|---|---|
| mock | local deterministic oracle for tests | none |
| groq-llama-70b | Groq | GROQ_API_KEY |
| gemini-flash | Gemini | GEMINI_API_KEY |
| cerebras-llama | Cerebras | CEREBRAS_API_KEY |
| openrouter | OpenRouter | OPENROUTER_API_KEY |
| local-ollama | local Ollama OpenAI-compatible endpoint | Ollama server on localhost:11434 |
| hf-local | Hugging Face Transformers | optional HF_TOKEN; install .[hf] |
Evaluating the historical DataForge SFT checkpoint
Use hf-local for base-vs-SFT checks with the same exact-match grader used by
hosted providers:
pip install -e ".[hf]"
dataforge-evals run --agent hf-local --dataset synthetic --trials 1 \
  --model-id Praneshrajan15/DataForge-0.5B-SFT \
  --output reports/dataforge-sft-smoke.md \
  --output-json reports/dataforge-sft-smoke.json
If --model-id is omitted, the adapter uses DATAFORGE_EVAL_MODEL, then the
authenticated HF_TOKEN user's DataForge-0.5B-SFT, then
Praneshrajan15/DataForge-0.5B-SFT.
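The fallback order above can be sketched as a small resolver. This is an illustration only, not the adapter's actual code: the function name `resolve_model_id` and the injected `whoami` callable (which stands in for whatever lookup turns an HF_TOKEN into a username) are assumptions made for this example.

```python
import os
from typing import Callable, Mapping, Optional

DEFAULT_MODEL_ID = "Praneshrajan15/DataForge-0.5B-SFT"


def resolve_model_id(
    cli_model_id: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    whoami: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Pick a model ID using the documented precedence (hypothetical helper)."""
    # 1. An explicit --model-id always wins.
    if cli_model_id:
        return cli_model_id
    # 2. Then the DATAFORGE_EVAL_MODEL environment variable.
    env_model = env.get("DATAFORGE_EVAL_MODEL")
    if env_model:
        return env_model
    # 3. Then the authenticated user's own DataForge-0.5B-SFT repository.
    #    `whoami` is an assumed hook mapping a token to a username.
    token = env.get("HF_TOKEN")
    if token and whoami is not None:
        username = whoami(token)
        if username:
            return f"{username}/DataForge-0.5B-SFT"
    # 4. Finally the published default checkpoint.
    return DEFAULT_MODEL_ID
```

Passing `env` and `whoami` as parameters keeps the sketch testable without network access or real credentials.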
Discover agents and datasets
dataforge-evals list-agents
dataforge-evals list-datasets
Custom CSV-pair evaluation
Bring your own dirty and clean CSV files:
dataforge-evals run --agent mock --dataset my-data \
  --dirty-csv path/to/dirty.csv \
  --clean-csv path/to/clean.csv \
  --trials 3
The dirty and clean CSVs must have the same number of rows and columns. Column names are taken from the clean file.
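To make the shape requirement concrete, here is a minimal sketch of how ground-truth corrections could be derived from such a pair using only the standard library. It is not the harness's loader: the function name `corrections_from_csv_pair` and the 0-based data-row indexing are assumptions made for this example.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def corrections_from_csv_pair(dirty: TextIO, clean: TextIO) -> Dict[Tuple[int, str], str]:
    """Return {(row, column): clean_value} for every cell where dirty != clean."""
    dirty_rows = list(csv.reader(dirty))
    clean_rows = list(csv.reader(clean))
    if len(dirty_rows) != len(clean_rows):
        raise ValueError("dirty and clean CSVs must have the same number of rows")
    if not clean_rows:
        return {}

    columns = clean_rows[0]  # column names are taken from the clean file
    corrections: Dict[Tuple[int, str], str] = {}
    for line, (dirty_row, clean_row) in enumerate(zip(dirty_rows, clean_rows)):
        if len(dirty_row) != len(columns) or len(clean_row) != len(columns):
            raise ValueError(f"column count mismatch on CSV line {line + 1}")
        if line == 0:
            continue  # header line: only its width is checked
        for name, dirty_value, clean_value in zip(columns, dirty_row, clean_row):
            if dirty_value != clean_value:
                # Data rows are numbered from 0 (an assumption of this sketch).
                corrections[(line - 1, name)] = clean_value
    return corrections


example = corrections_from_csv_pair(
    io.StringIO("name,score\nAle,4.5\nStout,40\n"),
    io.StringIO("Name,Score\nAle,4.5\nStout,4.0\n"),
)
print(example)  # {(1, 'Score'): '4.0'}
```

Values are compared as raw strings, so `4.0` and `4` count as different cells.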
Agent protocol
Any agent can plug in by implementing:
from dataforge_evals import AgentTask, Fix


class MyAgent:
    name = "my-agent"

    def run(self, task: AgentTask) -> list[Fix]:
        # Return candidate fixes only; the grader computes the score.
        return [Fix(row=0, column="Score", new_value="4.5", reason="example")]
Agents never report their own score. They return candidate fixes only. The grader is the only source of truth.
Normal agents receive a label-hidden AgentTask; only the built-in mock
oracle used by tests is marked to receive full ground truth.
What agents receive
- task.name: dataset identifier
- task.dirty_df: pandas DataFrame with data-quality issues (all values as strings)
- task.canonical_columns: ordered column names from the clean reference
- task.metadata: provenance and descriptive metadata
What agents return
Either a list[Fix] or an AgentRunResult with usage accounting:
from dataforge_evals import AgentRunResult, Fix, Usage

# Inside your agent's run() method:
return AgentRunResult(
    fixes=[Fix(row=0, column="Score", new_value="4.5")],
    usage=Usage(calls=1, prompt_tokens=500, completion_tokens=100, quota_units=0.001),
    steps=1,
    model="my-model-v1",
)
What is graded
A Fix is correct only when (row, column, new_value) exactly matches a ground-truth dirty-to-clean cell correction. Duplicate predictions for the same cell use last-write-wins normalization. A wrong value on the right cell counts as both a false positive and a false negative.
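The grading rule above can be sketched as follows. This is a minimal illustration, not the harness's actual grader: the function name `grade` and the plain-tuple representation of fixes are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, str]  # (row, column)


def grade(predicted: List[Tuple[int, str, str]], truth: Dict[Cell, str]) -> Dict[str, float]:
    """Score proposed (row, column, new_value) fixes by exact match."""
    # Last-write-wins: a later prediction for a cell replaces an earlier one.
    normalized: Dict[Cell, str] = {}
    for row, column, new_value in predicted:
        normalized[(row, column)] = new_value

    # A true positive needs the right cell AND the exact clean string.
    tp = sum(1 for cell, value in normalized.items() if truth.get(cell) == value)
    fp = len(normalized) - tp  # untouched-in-truth cells, or right cell with wrong value
    fn = len(truth) - tp       # corrections that were missed or given a wrong value

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


truth = {(0, "Score"): "4.5", (1, "City"): "Boston"}
predicted = [(0, "Score", "4.0"), (0, "Score", "4.5"), (1, "City", "Bostn")]
result = grade(predicted, truth)
print(result["tp"], result["fp"], result["fn"], result["f1"])  # 1 1 1 0.5
```

In the example, the second prediction for `(0, "Score")` overrides the first and is correct, while the misspelled `"Bostn"` lands on the right cell with the wrong value and so adds one false positive and one false negative.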
Quota accounting
Each report uses provider-normalized free-tier quota units rather than dollars. Built-in adapters record raw calls, prompt tokens, completion tokens, and quota units.
Provider-specific normalization (as of 2026-05-01):
| Provider | Free-tier basis | 1 quota unit = |
|---|---|---|
| Groq | 14,400 RPD | 1 request |
| Gemini | 1,500 RPD | 1 request |
| Cerebras | 1,000 RPD | 1 request |
| OpenRouter | Nominal 1,000 RPD | 1 request |
| Ollama | unlimited (local) | always 0 |
On HTTP 429, the adapter waits with exponential backoff and logs "waiting N seconds for quota reset" to stderr. It does not fall back to another provider because fallback would contaminate the comparison.
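A retry loop of this kind might look like the sketch below. It is not the adapters' implementation: the function name `call_with_backoff`, the `send` callable returning a `(status, body)` pair, and the retry limits are all assumptions chosen for illustration.

```python
import sys
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry `send` on HTTP 429 with exponential backoff; never switch providers."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            # Any non-429 response is handed back to the caller unchanged.
            return body
        if attempt == max_retries:
            break  # no retries left, so do not sleep again
        delay = base_delay_s * (2 ** attempt)  # 1s, 2s, 4s, ...
        print(f"waiting {delay:g} seconds for quota reset", file=sys.stderr)
        sleep(delay)
    raise RuntimeError("rate limited: retries exhausted")
```

Raising after the final attempt, instead of trying a different provider, keeps every trial attributable to a single model.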
Reproducibility
Each report records:
- dataforge-evals commit hash
- dataforge source commit hash when canonical datasets are loaded through DataForge
- exact seeds
- provider model identifiers
- UTC run date
- dependency versions (pandas, pydantic, httpx, etc.)
- an explicit nondeterminism note
Deterministic and mock agents reproduce exactly from the recorded seeds. Hosted LLM providers may still change outputs because providers can update model weights, routing, safety systems, or tokenization without notice.
Reproducibility limitations
- Provider model identifiers (e.g., llama-3.3-70b-versatile) may point to different weights on different dates.
- Token counts and quota units depend on provider-side tokenization, which can change.
- Network latency, rate limiting, and provider availability affect runtime measurements.
- Temperature 0 does not guarantee determinism across all providers.
Not a leaderboard by default
Only compare reports when dataset versions, seeds, provider model identifiers, run date, and prompt/adapter code are identical. Otherwise the report is an evaluation artifact, not a leaderboard row.
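One way to enforce this rule before placing two reports side by side is a field-by-field check. The sketch below is hypothetical: the key names in `COMPARABILITY_KEYS` are invented for illustration and are not the harness's actual report schema.

```python
from typing import Any, List, Mapping

# Hypothetical field names; the real report schema may differ.
COMPARABILITY_KEYS = (
    "dataset_version",
    "seeds",
    "model_id",
    "run_date",
    "adapter_commit",
)


def comparability_mismatches(a: Mapping[str, Any], b: Mapping[str, Any]) -> List[str]:
    """List the fields that differ; an empty list means the reports are comparable."""
    return [key for key in COMPARABILITY_KEYS if a.get(key) != b.get(key)]
```

A non-empty result means the two reports should be read as separate evaluation artifacts, not as rows of one leaderboard.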
When dataforge-evals is the wrong tool
Do not use dataforge-evals if:
- Your agent operates on streaming data — the harness is batch-oriented and expects a complete dirty DataFrame.
- You need end-to-end pipeline evaluation — this tool evaluates cell-level repair accuracy, not detection, diagnosis, or pipeline orchestration.
- Your ground truth is fuzzy or approximate — the grader uses exact string match. If multiple correct values exist for a cell, you need a custom grader.
- You need sub-second latency benchmarking — the harness measures wall-clock time but is not designed as a latency benchmarking tool.
- Your data is > 100K rows — the harness loads the full DataFrame into memory and passes it to agents. For large-scale evaluation, sample first.
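For the large-data case, sampling has to pick the same rows from the dirty and clean tables so the pair stays aligned. A minimal sketch, assuming rows are already loaded as lists; the function name `sample_pair` is an assumption made for this example, not part of the harness:

```python
import random
from typing import List, Sequence, Tuple

Row = Sequence[str]


def sample_pair(
    dirty_rows: Sequence[Row],
    clean_rows: Sequence[Row],
    k: int,
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Draw the same k row positions from both tables, reproducibly."""
    if len(dirty_rows) != len(clean_rows):
        raise ValueError("dirty and clean tables must have the same number of rows")
    if k >= len(dirty_rows):
        return list(dirty_rows), list(clean_rows)
    # A seeded generator makes the sample repeatable; sorting keeps source order.
    keep = sorted(random.Random(seed).sample(range(len(dirty_rows)), k))
    return [dirty_rows[i] for i in keep], [clean_rows[i] for i in keep]
```

Row indices in any proposed fixes then refer to positions in the sampled tables, not the original files, so the sampled pair should be written out and evaluated as its own dataset.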
Development
make setup # pip install -e ".[dev]"
make lint # ruff check
make format # ruff format --check
make type # mypy --strict
make test # pytest
make test-cov # pytest with coverage
make smoke # end-to-end smoke test with mock agent
Environment Variables
Provider keys belong in a root .env file (gitignored) loaded with python-dotenv:
- GROQ_API_KEY
- GEMINI_API_KEY
- CEREBRAS_API_KEY
- OPENROUTER_API_KEY
License
Apache-2.0.