Agent-agnostic evaluation harness for data-quality repair agents.

Maintainers

pranesh15

Project description

dataforge-evals

dataforge-evals is an agent-agnostic evaluation harness for data-quality repair agents.

It gives any agent the same task, accepts only proposed cell fixes, and lets the grader compute exact precision, recall, F1, steps, failures, and free-tier quota usage. The harness can load DataForge's canonical Hospital, Flights, and Beers benchmark tasks when dataforge_07 is installed, while the import namespace remains dataforge for the 0.1 line. The PyPI package is not published yet; use the source install instructions below until release ownership is configured.

pip install -e ".[dev]"
dataforge-evals run --agent mock --dataset synthetic --trials 3

Install

From source (development)

python -m venv .venv
# Linux/macOS:
source .venv/bin/activate
# Windows PowerShell:
.\.venv\Scripts\Activate.ps1

pip install -e ".[dev]"

With canonical DataForge datasets

pip install -e "../data_quality_env"
dataforge-evals run --agent mock --dataset hospital --trials 3

Run a provider

# Windows cmd shown; on Linux/macOS use: export GROQ_API_KEY=...
set GROQ_API_KEY=...
dataforge-evals run --agent groq-llama-70b --dataset hospital --trials 3 --output reports/groq-hospital.md

Bounded Groq smoke test

Use a single synthetic trial to verify Groq wiring without turning the smoke check into a benchmark:

dataforge-evals run --agent groq-llama-70b --dataset synthetic --trials 1 --seed 0 --timeout-s 20 --output reports/groq-synthetic-smoke.md --output-json reports/groq-synthetic-smoke.json

For this smoke path, trials_completed=1 and Failures=none prove the integration completed successfully. F1 is a quality signal for the model's proposed repairs, not the API health check. The JSON report includes the normalized proposed fixes for debugging; Markdown stays summary-only.

Built-in adapters

Agent ID	Provider	Required Setup
`mock`	local deterministic oracle for tests	none
`groq-llama-70b`	Groq	`GROQ_API_KEY`
`gemini-flash`	Gemini	`GEMINI_API_KEY`
`cerebras-llama`	Cerebras	`CEREBRAS_API_KEY`
`openrouter`	OpenRouter	`OPENROUTER_API_KEY`
`local-ollama`	local Ollama OpenAI-compatible endpoint	Ollama server on `localhost:11434`
`hf-local`	Hugging Face Transformers	optional `HF_TOKEN`; install `.[hf]`

Evaluating the historical DataForge SFT checkpoint

Use hf-local for base-vs-SFT checks with the same exact-match grader used by hosted providers:

pip install -e ".[hf]"
dataforge-evals run --agent hf-local --dataset synthetic --trials 1 \
  --model-id Praneshrajan15/DataForge-0.5B-SFT \
  --output reports/dataforge-sft-smoke.md \
  --output-json reports/dataforge-sft-smoke.json

If --model-id is omitted, the adapter uses DATAFORGE_EVAL_MODEL, then the authenticated HF_TOKEN user's DataForge-0.5B-SFT, then Praneshrajan15/DataForge-0.5B-SFT.
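The fallback order above can be sketched as a small pure function. This is only an illustration, not the adapter's actual code: `resolve_model_id` and its parameters are hypothetical names, and the caller is assumed to have already looked up the username that belongs to `HF_TOKEN` (how the adapter does that lookup is not described here).

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Values taken from the text above.
MODEL_NAME = "DataForge-0.5B-SFT"
DEFAULT_REPO = "Praneshrajan15/DataForge-0.5B-SFT"


def resolve_model_id(
    cli_model_id: Optional[str],
    env: Mapping[str, str],
    hf_username: Optional[str],
) -> str:
    """Hypothetical helper mirroring the documented fallback order."""
    # 1. An explicit --model-id always wins.
    if cli_model_id:
        return cli_model_id
    # 2. Then the DATAFORGE_EVAL_MODEL environment variable.
    env_model = env.get("DATAFORGE_EVAL_MODEL")
    if env_model:
        return env_model
    # 3. Then the authenticated user's own copy of the checkpoint.
    if hf_username:
        return f"{hf_username}/{MODEL_NAME}"
    # 4. Finally the published checkpoint.
    return DEFAULT_REPO


if __name__ == "__main__":
    # No CLI value and no known username: the result depends only on the environment.
    print(resolve_model_id(None, os.environ, None))
```

Passing the environment in as a mapping keeps the function easy to test without touching real environment variables.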

Discover agents and datasets

dataforge-evals list-agents
dataforge-evals list-datasets

Custom CSV-pair evaluation

Bring your own dirty and clean CSV files:

dataforge-evals run --agent mock --dataset my-data \
    --dirty-csv path/to/dirty.csv \
    --clean-csv path/to/clean.csv \
    --trials 3

The dirty and clean CSVs must have the same number of rows and columns. Column names are taken from the clean file.
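To check a CSV pair before running an evaluation, a shape check along these lines may help. It is a minimal sketch using only the standard library, not the harness's loader: `load_csv_pair` and `ground_truth` are hypothetical helpers, and zero-based row indices over data rows are an assumption based on the `Fix(row=0, ...)` example in this README.

```python
import csv
import io


def load_csv_pair(dirty_text, clean_text):
    """Parse both CSVs and enforce the same number of rows and columns."""
    dirty = list(csv.reader(io.StringIO(dirty_text)))
    clean = list(csv.reader(io.StringIO(clean_text)))
    if not clean:
        raise ValueError("clean CSV is empty")
    if len(dirty) != len(clean):
        raise ValueError(
            f"row count mismatch (including header): dirty={len(dirty)}, clean={len(clean)}"
        )
    # Column names come from the clean file; the dirty header is ignored.
    columns = clean[0]
    for label, table in (("dirty", dirty), ("clean", clean)):
        for line_no, row in enumerate(table, start=1):
            if len(row) != len(columns):
                raise ValueError(
                    f"{label} CSV line {line_no}: expected {len(columns)} columns, got {len(row)}"
                )
    return columns, dirty[1:], clean[1:]


def ground_truth(columns, dirty_rows, clean_rows):
    """Map (row, column) -> clean value for every cell that differs."""
    return {
        (row_idx, col): clean_val
        for row_idx, (d_row, c_row) in enumerate(zip(dirty_rows, clean_rows))
        for col, dirty_val, clean_val in zip(columns, d_row, c_row)
        if dirty_val != clean_val
    }


if __name__ == "__main__":
    dirty_csv = "name,score\nIPA,4.5\nStout,4..0\n"
    clean_csv = "Name,Score\nIPA,4.5\nStout,4.0\n"
    cols, dirty_rows, clean_rows = load_csv_pair(dirty_csv, clean_csv)
    print(ground_truth(cols, dirty_rows, clean_rows))
```

All values are compared as strings, which matches the exact-match grading described in this README.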

Agent protocol

Any agent can plug in by implementing:

from dataforge_evals import AgentTask, Fix

class MyAgent:
    name = "my-agent"

    def run(self, task: AgentTask) -> list[Fix]:
        return [Fix(row=0, column="Score", new_value="4.5", reason="example")]

Agents never report their own score. They return candidate fixes only. The grader is the only source of truth. Normal agents receive a label-hidden AgentTask; only the built-in mock oracle used by tests is marked to receive full ground truth.

What agents receive

task.name: dataset identifier
task.dirty_df: pandas DataFrame with data-quality issues (all values as strings)
task.canonical_columns: ordered column names from the clean reference
task.metadata: provenance and descriptive metadata

What agents return

Either a list[Fix] or an AgentRunResult with usage accounting:

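from dataforge_evals import AgentRunResult, AgentTask, Fix, Usage

class MyAgent:
    name = "my-agent"

    def run(self, task: AgentTask) -> AgentRunResult:
        # Return fixes together with usage accounting for the report.
        return AgentRunResult(
            fixes=[Fix(row=0, column="Score", new_value="4.5")],
            usage=Usage(calls=1, prompt_tokens=500, completion_tokens=100, quota_units=0.001),
            steps=1,
            model="my-model-v1",
        )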

What is graded

A Fix is correct only when (row, column, new_value) exactly matches a ground-truth dirty-to-clean cell correction. Duplicate predictions for the same cell use last-write-wins normalization. A wrong value on the right cell counts as both a false positive and a false negative.
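The grading rule can be illustrated with a short sketch. This is not the harness's grader: `grade` is a hypothetical function, fixes are plain `(row, column, new_value)` tuples rather than `Fix` objects, and returning 0.0 when a denominator is zero is an assumed convention.

```python
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, str]


def grade(predictions: Iterable[Tuple[int, str, str]], truth: Dict[Cell, str]) -> dict:
    """Exact-match grading of proposed cell fixes against ground truth."""
    # Last-write-wins: a later prediction for the same cell replaces earlier ones.
    normalized: Dict[Cell, str] = {}
    for row, column, new_value in predictions:
        normalized[(row, column)] = new_value

    # A prediction is a true positive only if the cell needs fixing
    # and the proposed value equals the clean value exactly.
    tp = sum(1 for cell, value in normalized.items() if truth.get(cell) == value)
    fp = len(normalized) - tp  # wrong value, or a cell that needed no fix
    fn = len(truth) - tp       # corrections that were missed or filled incorrectly

    precision = tp / (tp + fp) if normalized else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = {(0, "Score"): "4.5", (1, "City"): "Boston"}
    predictions = [
        (0, "Score", "4.0"),   # superseded by the next prediction for the same cell
        (0, "Score", "4.5"),   # correct after last-write-wins
        (1, "City", "Bostn"),  # right cell, wrong value: one FP and one FN
        (2, "Name", "x"),      # cell that needed no fix: FP
    ]
    print(grade(predictions, truth))
```

In this example the wrong value for `(1, "City")` raises both `fp` and `fn`, which is why a wrong value on the right cell is penalized twice.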

Quota accounting

Each report uses provider-normalized free-tier quota units rather than dollars. Built-in adapters record raw calls, prompt tokens, completion tokens, and quota units.

Provider-specific normalization (as of 2026-05-01):

Provider	Free-tier basis	1 quota unit =
Groq	14,400 RPD	1 request
Gemini	1,500 RPD	1 request
Cerebras	1,000 RPD	1 request
OpenRouter	Nominal 1,000 RPD	1 request
Ollama	unlimited (local)	always 0

On HTTP 429, the adapter waits with exponential backoff and logs "waiting N seconds for quota reset" to stderr. It does not fall back to another provider, because a fallback would contaminate the comparison.
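The 429 handling described above could look roughly like the following. It is a sketch under assumptions, not the adapters' implementation: `call_with_backoff`, the retry limit, and the 1-second base delay are all hypothetical, and `send` stands in for whatever function performs the HTTP request and returns a status code with a body.

```python
import sys
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_retries: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry `send` on HTTP 429, doubling the wait after each attempt."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        delay = base_delay_s * (2 ** attempt)
        print(f"waiting {delay:g} seconds for quota reset", file=sys.stderr)
        sleep(delay)
    # No fallback to another provider: give up and let the trial record a failure.
    raise RuntimeError("still rate limited after retries")


if __name__ == "__main__":
    # Simulated provider: two 429 responses, then success.
    responses = iter([(429, ""), (429, ""), (200, "ok")])
    waits = []
    print(call_with_backoff(lambda: next(responses), sleep=waits.append), waits)
```

Injecting `sleep` keeps the example fast to run and makes the wait schedule easy to inspect.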

Reproducibility

Each report records:

dataforge-evals commit hash
dataforge source commit hash when canonical datasets are loaded through DataForge
exact seeds
provider model identifiers
UTC run date
dependency versions (pandas, pydantic, httpx, etc.)
an explicit nondeterminism note

Deterministic and mock agents reproduce exactly from the recorded seeds. Hosted LLM providers may still change outputs because providers can update model weights, routing, safety systems, or tokenization without notice.
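As an illustration of the kind of metadata listed above, the sketch below collects seeds, a model identifier, the UTC run date, and dependency versions using only the standard library. The function names and record keys are hypothetical, not the report's actual schema, and commit hashes are left out because obtaining them depends on the checkout.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def dependency_versions(packages):
    """Return installed versions, or None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_run_record(seeds, model_id, packages=("pandas", "pydantic", "httpx")):
    """Assemble a JSON-serializable record (hypothetical keys)."""
    return {
        "seeds": list(seeds),
        "model_id": model_id,
        "run_date_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "dependencies": dependency_versions(packages),
        "nondeterminism_note": "Hosted providers may change outputs without notice.",
    }


if __name__ == "__main__":
    print(json.dumps(build_run_record([0, 1, 2], "mock"), indent=2))
```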

Reproducibility limitations

Provider model identifiers (e.g., llama-3.3-70b-versatile) may point to different weights on different dates.
Token counts and quota units depend on provider-side tokenization, which can change.
Network latency, rate limiting, and provider availability affect runtime measurements.
Temperature 0 does not guarantee determinism across all providers.

Not a leaderboard by default

Only compare reports when dataset versions, seeds, provider model identifiers, run date, and prompt/adapter code are identical. Otherwise the report is an evaluation artifact, not a leaderboard row.
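A comparability check in this spirit might be sketched as follows. The field names are hypothetical placeholders for the items listed above; the real JSON report may use different keys.

```python
# Hypothetical keys standing in for: dataset version, seeds, provider model
# identifier, run date, and prompt/adapter code revision.
COMPARABLE_FIELDS = ("dataset_version", "seeds", "model_id", "run_date", "adapter_commit")


def comparability_mismatches(report_a, report_b, fields=COMPARABLE_FIELDS):
    """Return the fields on which two reports differ; an empty list means comparable."""
    return [name for name in fields if report_a.get(name) != report_b.get(name)]


if __name__ == "__main__":
    a = {"dataset_version": "v1", "seeds": [0, 1, 2], "model_id": "m",
         "run_date": "2026-05-01", "adapter_commit": "abc"}
    b = dict(a, run_date="2026-05-02")
    print(comparability_mismatches(a, b))
```

Any non-empty result suggests the two reports should be read as separate evaluation artifacts rather than as rows of one table.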

When dataforge-evals is the wrong tool

Do not use dataforge-evals if:

Your agent operates on streaming data. The harness is batch-oriented and expects a complete dirty DataFrame.
You need end-to-end pipeline evaluation. This tool evaluates cell-level repair accuracy, not detection, diagnosis, or pipeline orchestration.
Your ground truth is fuzzy or approximate. The grader uses exact string match; if multiple correct values exist for a cell, you need a custom grader.
You need sub-second latency benchmarking. The harness measures wall-clock time but is not designed as a latency benchmarking tool.
Your data is > 100K rows. The harness loads the full DataFrame into memory and passes it to agents; for large-scale evaluation, sample first.

Development

make setup     # pip install -e ".[dev]"
make lint      # ruff check
make format    # ruff format --check
make type      # mypy --strict
make test      # pytest
make test-cov  # pytest with coverage
make smoke     # end-to-end smoke test with mock agent

Environment Variables

Provider keys belong in a root .env file (gitignored) loaded with python-dotenv:

GROQ_API_KEY
GEMINI_API_KEY
CEREBRAS_API_KEY
OPENROUTER_API_KEY

License

Apache-2.0.

Release history

This version

0.1.0

Jun 12, 2026

Download files

Source Distribution

dataforge_07_evals-0.1.0.tar.gz (39.3 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

dataforge_07_evals-0.1.0-py3-none-any.whl (42.2 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file dataforge_07_evals-0.1.0.tar.gz.

File metadata

Download URL: dataforge_07_evals-0.1.0.tar.gz
Upload date: Jun 12, 2026
Size: 39.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataforge_07_evals-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`323b6ad55c5c98c8d4b0ce0a22aaa35961246bdbe059815fe57d30cce04decce`
MD5	`78f6015fc9a1513730413aa6be472236`
BLAKE2b-256	`4109e7170520f9344916f5644a2491091a027a2cf7f1b16f2c6b767a77e19243`

Provenance

The following attestation bundles were made for dataforge_07_evals-0.1.0.tar.gz:

Publisher: publish-dataforge-evals.yml on Aegis15/dataforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07_evals-0.1.0.tar.gz
- Subject digest: 323b6ad55c5c98c8d4b0ce0a22aaa35961246bdbe059815fe57d30cce04decce
- Sigstore transparency entry: 1804407839
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge-evals.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch

File details

Details for the file dataforge_07_evals-0.1.0-py3-none-any.whl.

File metadata

Download URL: dataforge_07_evals-0.1.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 42.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataforge_07_evals-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5e367297c97a522c4a0747dcd5eefae7d6690cdaf76b412efc6f7b4d0a8c9e61`
MD5	`e34776b1c58428cbfd076ac210a5a7f8`
BLAKE2b-256	`d83f089bd452babb2ae9607b57037eca771a7b6fb73511b627f5dbbec78cfe7e`

Provenance

The following attestation bundles were made for dataforge_07_evals-0.1.0-py3-none-any.whl:

Publisher: publish-dataforge-evals.yml on Aegis15/dataforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07_evals-0.1.0-py3-none-any.whl
- Subject digest: 5e367297c97a522c4a0747dcd5eefae7d6690cdaf76b412efc6f7b4d0a8c9e61
- Sigstore transparency entry: 1804408118
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge-evals.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch

dataforge-07-evals 0.1.0
