The fastest, most cost-efficient LLM evaluation framework — 100+ metrics, parallel async eval, batched judging, native pytest integration.

Maintainers

cutevedha

Project description

EvalGrid

The fastest, most cost-efficient LLM evaluation framework.

100+ metrics · async parallel evaluation · batched LLM judging · pytest-native · zero-config quickstart

Why EvalGrid

Feature	DeepEval	RAGAS	EvalGrid
Built-in metrics	~14	~8	100+
One-line `evaluate()` API	✓	✓	✓
Pytest `assert_test()`	✓	✗	✓
Parallel async evaluation	✓	partial	✓ (20x speedup)
Batched multi-rubric judging	✗	✗	✓ (80% fewer tokens)
Multi-format data loader (Excel/CSV/JSON/JSONL/YAML)	✗	✗	✓
Autonomous adaptive eval agent	✗	✗	✓
Governance pipeline + audit trail	✗	✗	✓
Real LLM judge auto-detection from env	partial	✗	✓
Cost tracking per metric	✗	✗	✓

Install

pip install evalgrid-framework

# With your preferred LLM provider:
pip install "evalgrid-framework[openai]"        # OpenAI
pip install "evalgrid-framework[anthropic]"     # Anthropic Claude
pip install "evalgrid-framework[gemini]"        # Google Gemini
pip install "evalgrid-framework[all]"           # everything

30-second quickstart

from evalgrid import evaluate

run = evaluate(
    cases=[
        {"input": "What is the capital of France?",
         "output": "Paris is the capital of France.",
         "expected_output": "The capital of France is Paris."},
    ],
    metrics="rag",   # preset bundle
)

print(run.summary())
run.to_html("report.html")

That's it. EvalGrid auto-detects OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY from your environment and uses real LLM judges. No setup files, no boilerplate.

Why it's faster

20x speedup via async parallel evaluation

# 200 cases × 3 LLM-judge metrics @ 500ms/call
# Sequential:        ~5 minutes
# EvalGrid default:  15 seconds  (20x faster)

EvalGrid runs cases in parallel and uses a semaphore to cap the number of in-flight calls. The default is concurrency=10; raise it (for example, concurrency=25) if your provider's rate limits allow.
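
For readers curious what semaphore-bounded parallelism looks like, here is a minimal sketch using only the standard library. It is not EvalGrid's implementation: `run_bounded`, `fake_judge_call`, and `demo` are hypothetical names, and the judge call is simulated with a short sleep.

```python
import asyncio


async def run_bounded(factories, concurrency=10):
    """Run zero-argument coroutine factories with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(factory):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await factory()

    # gather returns results in the same order as the input factories.
    return await asyncio.gather(*(guarded(f) for f in factories))


async def fake_judge_call(case_id, delay=0.01):
    # Hypothetical stand-in for a real LLM-judge request.
    await asyncio.sleep(delay)
    return {"case": case_id, "score": 1.0}


def demo(n_cases=20, concurrency=5):
    # Bind i as a default argument so each factory keeps its own case id.
    factories = [lambda i=i: fake_judge_call(i) for i in range(n_cases)]
    return asyncio.run(run_bounded(factories, concurrency))
```

As a rough guide, 600 calls at 500 ms each take about 300 s one at a time, and an ideal pool of N concurrent calls brings that down to about 300/N seconds.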

80% token reduction via batched judging

When you request multiple LLM-judge metrics, EvalGrid scores them all in ONE LLM call per case instead of N calls:

# 100 cases × 5 LLM-judge metrics

# Without batching:  500 calls   ·  106,000 tokens  ·  $0.0229
# With batching:     100 calls   ·   19,800 tokens  ·  $0.0036

#                    80% fewer calls    81% fewer tokens   84% cheaper

Enabled by default. Zero code changes required.
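
To make the idea concrete, the sketch below shows one way several rubrics could be folded into a single judge prompt and the scores read back from one JSON reply. The prompt wording, the JSON reply shape, and the helper names `build_batched_prompt` and `parse_batched_reply` are assumptions for illustration, not EvalGrid's actual prompt or parser.

```python
import json


def build_batched_prompt(case, rubrics):
    """Compose one judge prompt that asks for a score per rubric."""
    lines = [
        "Score the response on each rubric from 0.0 to 1.0.",
        "Reply with a JSON object mapping rubric name to score.",
        "",
        f"Input: {case['input']}",
        f"Response: {case['output']}",
        "",
        "Rubrics:",
    ]
    lines += [f"- {name}: {description}" for name, description in rubrics.items()]
    return "\n".join(lines)


def parse_batched_reply(reply, rubrics):
    """Return one float per requested rubric; rubrics missing from the reply map to None."""
    scores = json.loads(reply)
    return {
        name: (float(scores[name]) if name in scores else None)
        for name in rubrics
    }
```

The saving comes from sending the input and response once per case rather than once per metric, which is why the call count drops from cases × metrics to cases.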

Pytest integration

from evalgrid import assert_test

def test_my_chatbot():
    assert_test(
        input="What is AI?",
        output=my_chatbot("What is AI?"),
        expected="AI is artificial intelligence.",
        metrics=["correctness", "relevance"],
        threshold=0.7,
    )

Failed assertions show exactly which metric failed and its score.
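
A threshold check of this kind can be sketched in a few lines. This is an illustration of the reporting behaviour described above, not the body of `assert_test`; the helper name `assert_scores` and the message format are hypothetical.

```python
def assert_scores(scores, threshold=0.7):
    """Raise AssertionError naming every metric whose score is below `threshold`."""
    failed = {name: score for name, score in scores.items() if score < threshold}
    if failed:
        details = ", ".join(
            f"{name}={score:.2f}" for name, score in sorted(failed.items())
        )
        raise AssertionError(f"metrics below threshold {threshold}: {details}")
```

Because it raises a plain `AssertionError`, pytest reports the failing metric names and scores in the normal test failure output.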

Load datasets in any format

from evalgrid import evaluate

# Excel
evaluate(cases="tests.xlsx", metrics="rag")

# JSON
evaluate(cases="tests.json", metrics="safety")

# CSV with custom column names (auto-aliased)
evaluate(cases="qa_pairs.csv", metrics="generation")

# YAML
evaluate(cases="redteam.yaml", metrics="adversarial")

Column aliases recognised automatically:

question / prompt / query → input
answer / reference / ground_truth → expected_output
documents / context / passage → context
...and 20+ more
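
Column aliasing like this usually comes down to a lookup table. The sketch below covers only the aliases listed above; the names `ALIASES`, `CANONICAL`, and `normalize_row` are hypothetical, and EvalGrid's own loader may match columns differently.

```python
# Canonical field name -> accepted column names (only those listed above).
ALIASES = {
    "input": ("question", "prompt", "query"),
    "expected_output": ("answer", "reference", "ground_truth"),
    "context": ("documents", "context", "passage"),
}

# Inverted view: alias -> canonical field name.
CANONICAL = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def normalize_row(row):
    """Rename known alias columns to canonical names; leave other columns untouched."""
    normalized = {}
    for key, value in row.items():
        lookup = key.strip().lower()
        normalized[CANONICAL.get(lookup, key)] = value
    return normalized
```

For example, a CSV row with `Question` and `ground_truth` columns comes out with `input` and `expected_output` keys.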

Metric presets

from evalgrid import evaluate, MetricSet

evaluate(cases, metrics=MetricSet.RAG)             # context_precision, recall, faithfulness, ...
evaluate(cases, metrics=MetricSet.SAFETY)          # all guardrails: hate, threat, illegal, ...
evaluate(cases, metrics=MetricSet.GENERATION)      # correctness, relevance, fluency, ...
evaluate(cases, metrics=MetricSet.SUMMARIZATION)   # faithfulness, conciseness, coverage
evaluate(cases, metrics=MetricSet.STRUCTURED)      # json_correctness, exact_match, ...
evaluate(cases, metrics=MetricSet.AGENT)           # tool calls, task success, token budget
evaluate(cases, metrics=MetricSet.BIAS)            # demographic_parity, equal_opportunity
evaluate(cases, metrics=MetricSet.ROBUSTNESS)      # paraphrase, typo, adversarial
evaluate(cases, metrics=MetricSet.REFERENCE)       # gold-answer comparison

Strings work too:

evaluate(cases, metrics="rag")
evaluate(cases, metrics="safety")
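
One common way to accept both an enum member and its string form is to give each member a string value and look it up by value. The sketch below uses a hypothetical `Preset` enum with three members; it is not the definition of `MetricSet`.

```python
from enum import Enum


class Preset(Enum):
    # Hypothetical subset of presets, for illustration only.
    RAG = "rag"
    SAFETY = "safety"
    GENERATION = "generation"


def resolve_preset(value):
    """Accept a Preset member or its string name; raise ValueError for unknown strings."""
    if isinstance(value, Preset):
        return value
    # Enum lookup by value: Preset("rag") returns Preset.RAG.
    return Preset(value.strip().lower())
```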

Real LLM judges, out of the box

from evalgrid import configure, evaluate

# Explicit model
configure(judge="gpt-4o-mini")
evaluate(cases, metrics="generation")

# Explicit with API key + custom endpoint
configure(
    judge="gpt-4o",
    api_key="sk-...",
    base_url="https://my.azure.openai.com",
    temperature=0,
)

# Or just set env var and EvalGrid auto-detects
# export OPENAI_API_KEY=sk-...
evaluate(cases, metrics="generation")

Auto-detection priority: EVALGRID_JUDGE_MODEL > OPENAI_API_KEY > ANTHROPIC_API_KEY > GEMINI_API_KEY.
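
The priority order above can be pictured as a simple first-match scan over environment variables. This is a sketch under that assumption; `detect_judge` and its return shape are hypothetical, and which default model each provider maps to is not shown because the text above does not state it.

```python
import os

# Provider keys in the priority order listed above.
PROVIDER_KEYS = (
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
)


def detect_judge(env=None):
    """Return ("model", name), ("provider", name), or None if nothing is configured."""
    env = os.environ if env is None else env
    explicit = env.get("EVALGRID_JUDGE_MODEL")
    if explicit:
        # An explicit model name wins over any provider key.
        return ("model", explicit)
    for provider, key in PROVIDER_KEYS:
        if env.get(key):
            return ("provider", provider)
    return None
```

Passing a plain dict as `env` makes the lookup easy to test without touching the real environment.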

CLI

# Scaffold a sample project
eval-grid init

# Run the bundled quickstart demo end-to-end + open HTML report
eval-grid quickstart

# Evaluate a dataset file with a preset
eval-grid eval --cases tests.xlsx --metrics rag --threshold 0.7

# List every registered metric
eval-grid list-metrics

# Autonomous adaptive evaluation
eval-grid auto --goal "test refusal of harmful prompts" --target openai

# Governed evaluation (6-step audit pipeline)
eval-grid govern --goal "production launch safety check" --data-file tests.xlsx

Custom metrics

from evalgrid import evaluate
from core.metric_registry import register_metric

@register_metric("my_custom_metric", description="Domain-specific score", tags=["custom"])
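def my_metric(test_case, actual_output):
    # compute_my_score is a placeholder for your own scoring logic.
    score = compute_my_score(test_case.input, actual_output)
    # Return a mapping of metric name to score.
    return {"my_custom_metric": score}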

evaluate(cases, metrics=["my_custom_metric"])

G-Eval — define your own judge rubric in plain English

from evalgrid import evaluate
from evalgrid.evals.structured_evals import GEvalMetric

# Describe the rubric as a list of plain-English evaluation steps.
GEvalMetric(
    name="insurance_response_quality",
    rubric_description="Evaluate an insurance chatbot response",
    evaluation_steps=[
        "Does the response acknowledge the customer's concern empathetically?",
        "Does it provide accurate information about the claims process?",
        "Does it avoid making unauthorised commitments?",
        "Does it direct the customer to appropriate next steps?",
    ],
).as_metric()

# Request the rubric by name, like any other metric.
evaluate(cases, metrics=["insurance_response_quality"])

Governance & audit

For regulated or production workflows:

from evalgrid.governance import GovernancePipeline, EvalObjective, AcceptancePolicy

pipeline = GovernancePipeline(
    EvalObjective(suite="production", objective="safety launch gate"),
    AcceptancePolicy(min_sample_size=30)
        .add_gate("policy_safe", 1.0, tier="critical")
        .add_gate("refused", 1.0, tier="exploratory"),
)

outcome = pipeline.run(samples, runner, scorer)
# outcome.blocked, outcome.audit, outcome.report

Built-in: dataset versioning · judge prompt versioning · bias/leakage detection · red-flag audit log.
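
As a rough mental model of an acceptance gate, the sketch below checks a minimum sample size and per-metric thresholds, and records each decision in an audit list. The semantics are assumptions made for illustration: a gate is taken to pass when the mean score meets its threshold, and only failed "critical" gates are taken to block. `check_gates` is a hypothetical helper, not part of `GovernancePipeline`.

```python
from statistics import mean


def check_gates(samples, gates, min_sample_size=30):
    """samples: list of {metric: score}; gates: list of (metric, threshold, tier)."""
    audit = []
    blocked = len(samples) < min_sample_size
    if blocked:
        audit.append(f"sample size {len(samples)} below minimum {min_sample_size}")
    for metric, threshold, tier in gates:
        values = [s[metric] for s in samples if metric in s]
        observed = mean(values) if values else 0.0
        passed = observed >= threshold
        audit.append(
            f"{tier}:{metric} observed={observed:.2f} "
            f"required={threshold:.2f} passed={passed}"
        )
        # Assumption: only critical-tier failures block the release.
        if not passed and tier == "critical":
            blocked = True
    return {"blocked": blocked, "audit": audit}
```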

Headline numbers

✅ 349 tests passing (every public API is covered)
🚀 20x parallel speedup on real-world workloads
💸 81% fewer tokens vs single-rubric judging
🧠 100+ built-in metrics — generation, RAG, safety, agent, bias, robustness, perf
🔌 5 file formats for test data — Excel, JSON, JSONL, CSV, YAML
🛡️ 4 LLM providers out of the box — OpenAI, Anthropic, Gemini, Ollama
📊 Beautiful HTML reports with per-case scores, cost tracking, judge usage

Documentation

Quickstart
API reference (coming soon)
Metric catalog (coming soon)
Migration from DeepEval (coming soon)
Contribution guide (coming soon)

License

MIT © Saro

Built to be the evaluation framework you actually enjoy using.

Release history

This version

1.0.0

Jun 19, 2026

Download files

Source Distribution

evalgrid_framework-1.0.0.tar.gz (177.1 kB)

Uploaded Jun 19, 2026 Source

Built Distribution

evalgrid_framework-1.0.0-py3-none-any.whl (207.4 kB)

Uploaded Jun 19, 2026 Python 3

File details

Details for the file evalgrid_framework-1.0.0.tar.gz.

File metadata

Download URL: evalgrid_framework-1.0.0.tar.gz
Upload date: Jun 19, 2026
Size: 177.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalgrid_framework-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`bbe9b32d35f43dafde4372204568636e59725cf1a32d39ca046d5b0a5937bc82`
MD5	`af6216b7f8d90c00e6b26b633530148b`
BLAKE2b-256	`9b45adc262bcb834027ea6e0365f2c34f167b9100182098f581e0b7c14ee9d81`

Provenance

The following attestation bundles were made for evalgrid_framework-1.0.0.tar.gz:

Publisher: release.yml on cutevedha/EvalGrid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalgrid_framework-1.0.0.tar.gz
- Subject digest: bbe9b32d35f43dafde4372204568636e59725cf1a32d39ca046d5b0a5937bc82
- Sigstore transparency entry: 1873603443
- Sigstore integration time: Jun 19, 2026
Source repository:
- Permalink: cutevedha/EvalGrid@8bcaccad908a58eb3f8d4684bbdcf7c51a7a7d97
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/cutevedha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8bcaccad908a58eb3f8d4684bbdcf7c51a7a7d97
- Trigger Event: push

File details

Details for the file evalgrid_framework-1.0.0-py3-none-any.whl.

File metadata

Download URL: evalgrid_framework-1.0.0-py3-none-any.whl
Upload date: Jun 19, 2026
Size: 207.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evalgrid_framework-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3dc0f8c8a7b23e1f544fe7229ce3f5ff75b6827c5b3edf4751ecc8217025a65c`
MD5	`ccae428d9878c7837d884c775ee7e1da`
BLAKE2b-256	`04f8dd00fc145083ddab05c55ed168c21a0a6512b811ee5f4fda16c4b4645cea`

Provenance

The following attestation bundles were made for evalgrid_framework-1.0.0-py3-none-any.whl:

Publisher: release.yml on cutevedha/EvalGrid

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evalgrid_framework-1.0.0-py3-none-any.whl
- Subject digest: 3dc0f8c8a7b23e1f544fe7229ce3f5ff75b6827c5b3edf4751ecc8217025a65c
- Sigstore transparency entry: 1873603521
- Sigstore integration time: Jun 19, 2026
Source repository:
- Permalink: cutevedha/EvalGrid@8bcaccad908a58eb3f8d4684bbdcf7c51a7a7d97
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/cutevedha
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8bcaccad908a58eb3f8d4684bbdcf7c51a7a7d97
- Trigger Event: push
