EvalGrid

The fastest, most cost-efficient LLM evaluation framework.

100+ metrics · async parallel evaluation · batched LLM judging · pytest-native · zero-config quickstart


Why EvalGrid

DeepEval RAGAS EvalGrid
Built-in metrics ~14 ~8 100+
One-line evaluate() API
Pytest assert_test()
Parallel async evaluation partial ✓ (20x speedup)
Batched multi-rubric judging (80% fewer tokens)
Multi-format data loader (Excel/CSV/JSON/JSONL/YAML)
Autonomous adaptive eval agent
Governance pipeline + audit trail
Real LLM judge auto-detection from env partial
Cost tracking per metric

Install

pip install evalgrid-framework

# With your preferred LLM provider:
pip install "evalgrid-framework[openai]"        # OpenAI
pip install "evalgrid-framework[anthropic]"     # Anthropic Claude
pip install "evalgrid-framework[gemini]"        # Google Gemini
pip install "evalgrid-framework[all]"           # everything

30-second quickstart

from evalgrid import evaluate

run = evaluate(
    cases=[
        {"input": "What is the capital of France?",
         "output": "Paris is the capital of France.",
         "expected_output": "The capital of France is Paris."},
    ],
    metrics="rag",   # preset bundle
)

print(run.summary())
run.to_html("report.html")

That's it. EvalGrid auto-detects OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY from your environment and uses real LLM judges. No setup files, no boilerplate.


Why it's faster

20x speedup via async parallel evaluation

# 200 cases × 3 LLM-judge metrics @ 500ms/call
# Sequential:        ~5 minutes
# EvalGrid default:  15 seconds  (20x faster)

EvalGrid runs cases in parallel with semaphore-based rate limiting. The default is concurrency=10; raise it (for example, concurrency=25) if your provider's rate limits allow.
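The semaphore-based limiting described above can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not EvalGrid's implementation: `fake_judge`, `run_all`, and the 10 ms latency are hypothetical stand-ins for a real judge call.

```python
import asyncio


async def fake_judge(case: str, delay: float = 0.01) -> float:
    """Hypothetical stand-in for one LLM-judge call."""
    await asyncio.sleep(delay)
    return float(len(case) % 2)


async def run_all(cases, concurrency: int = 10):
    """Score every case, never running more than `concurrency` at once."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def worker(case: str) -> float:
        nonlocal in_flight, peak
        async with sem:  # blocks while `concurrency` workers are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_judge(case)
            finally:
                in_flight -= 1

    # asyncio.gather returns results in the same order as the inputs.
    scores = await asyncio.gather(*(worker(c) for c in cases))
    return list(scores), peak


scores, peak = asyncio.run(run_all([f"case-{i}" for i in range(50)], concurrency=10))
```

In this idealised model, wall time scales roughly with the number of calls divided by the concurrency limit, which is why raising the limit helps until the provider's own rate limits take over.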

80% token reduction via batched judging

When you request multiple LLM-judge metrics, EvalGrid scores them all in ONE LLM call per case instead of N calls:

# 100 cases × 5 LLM-judge metrics

# Without batching:  500 calls   ·  106,000 tokens  ·  $0.0229
# With batching:     100 calls   ·   19,800 tokens  ·  $0.0036

#                    80% fewer calls    81% fewer tokens   84% cheaper

Enabled by default. Zero code changes required.
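The idea of scoring several rubrics in one call can be sketched as follows. This is an assumption-laden illustration, not EvalGrid's actual prompt or parser: `build_batched_prompt`, `parse_batched_reply`, the prompt wording, and the hard-coded judge reply are all hypothetical.

```python
import json


def build_batched_prompt(case: dict, rubrics: dict) -> str:
    """Ask for every rubric's score in a single request."""
    lines = [
        "Score the response on each rubric from 0.0 to 1.0.",
        "Reply with one JSON object mapping rubric name to score.",
        f"Input: {case['input']}",
        f"Response: {case['output']}",
        "Rubrics:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in rubrics.items()]
    return "\n".join(lines)


def parse_batched_reply(reply: str, rubrics: dict) -> dict:
    """Keep only the requested rubrics; raise if any is missing."""
    data = json.loads(reply)
    missing = [name for name in rubrics if name not in data]
    if missing:
        raise ValueError(f"judge reply is missing rubrics: {missing}")
    return {name: float(data[name]) for name in rubrics}


rubrics = {
    "correctness": "Is the answer factually right?",
    "relevance": "Does it address the question?",
}
case = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
}
prompt = build_batched_prompt(case, rubrics)
# Hypothetical judge reply; a real run would send `prompt` to the model.
scores = parse_batched_reply('{"correctness": 1.0, "relevance": 0.9}', rubrics)
```

The input and response are sent once per case rather than once per metric, which is plausibly where most of the token savings come from.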


Pytest integration

from evalgrid import assert_test

# my_chatbot() is your own application function under test; import it from your code.
def test_my_chatbot():
    assert_test(
        input="What is AI?",
        output=my_chatbot("What is AI?"),
        expected="AI is artificial intelligence.",
        metrics=["correctness", "relevance"],
        threshold=0.7,
    )

Failed assertions show exactly which metric failed and its score.
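A threshold check that names the failing metric can be sketched like this. `check_scores` is a hypothetical helper written for illustration; it is not part of EvalGrid's API, and the real failure message may look different.

```python
def check_scores(scores: dict, threshold: float) -> None:
    """Raise AssertionError naming each metric that scored below threshold."""
    failed = {name: s for name, s in scores.items() if s < threshold}
    if failed:
        detail = ", ".join(f"{name}={s:.2f}" for name, s in sorted(failed.items()))
        raise AssertionError(f"metrics below threshold {threshold}: {detail}")


try:
    check_scores({"correctness": 0.55, "relevance": 0.91}, threshold=0.7)
    message = ""
except AssertionError as exc:
    message = str(exc)
```

Because the helper raises a plain AssertionError, pytest reports it like any other failed assertion.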


Load datasets in any format

from evalgrid import evaluate

# Excel
evaluate(cases="tests.xlsx", metrics="rag")

# JSON
evaluate(cases="tests.json", metrics="safety")

# CSV with custom column names (auto-aliased)
evaluate(cases="qa_pairs.csv", metrics="generation")

# YAML
evaluate(cases="redteam.yaml", metrics="adversarial")

Column aliases recognised automatically:

  • question / prompt / query → input
  • answer / reference / ground_truth → expected_output
  • documents / context / passage → context
  • ...and 20+ more
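Column aliasing of this kind can be sketched as a lookup table applied to each row. The table below covers only the aliases listed above; `ALIASES`, `normalise_row`, and the case-insensitive matching are assumptions for illustration, not EvalGrid's loader.

```python
# Alias -> canonical column name (only the aliases named in this README).
ALIASES = {
    "question": "input", "prompt": "input", "query": "input",
    "answer": "expected_output", "reference": "expected_output",
    "ground_truth": "expected_output",
    "documents": "context", "passage": "context",
}


def normalise_row(row: dict) -> dict:
    """Rename aliased columns to canonical names; leave other keys untouched."""
    out = {}
    for key, value in row.items():
        canonical = ALIASES.get(key.strip().lower(), key)
        if canonical in out:
            raise ValueError(f"two columns map to {canonical!r}")
        out[canonical] = value
    return out


row = normalise_row({
    "Question": "What is AI?",
    "ground_truth": "AI is artificial intelligence.",
    "output": "AI stands for artificial intelligence.",
})
```

Raising on a collision (for example, both `question` and `prompt` present) is one possible design choice; silently preferring one column would hide data problems.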

Metric presets

from evalgrid import evaluate, MetricSet

evaluate(cases, metrics=MetricSet.RAG)             # context_precision, recall, faithfulness, ...
evaluate(cases, metrics=MetricSet.SAFETY)          # all guardrails: hate, threat, illegal, ...
evaluate(cases, metrics=MetricSet.GENERATION)      # correctness, relevance, fluency, ...
evaluate(cases, metrics=MetricSet.SUMMARIZATION)   # faithfulness, conciseness, coverage
evaluate(cases, metrics=MetricSet.STRUCTURED)      # json_correctness, exact_match, ...
evaluate(cases, metrics=MetricSet.AGENT)           # tool calls, task success, token budget
evaluate(cases, metrics=MetricSet.BIAS)            # demographic_parity, equal_opportunity
evaluate(cases, metrics=MetricSet.ROBUSTNESS)      # paraphrase, typo, adversarial
evaluate(cases, metrics=MetricSet.REFERENCE)       # gold-answer comparison

Strings work too:

evaluate(cases, metrics="rag")
evaluate(cases, metrics="safety")
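One common way to accept both an enum member and its string name is a string-valued Enum. The sketch below is a guess at the pattern, not EvalGrid's `MetricSet`: `Preset` and `resolve_preset` are hypothetical names, and only three presets are shown.

```python
from enum import Enum


class Preset(str, Enum):
    """Hypothetical preset enum; each value is the preset's string name."""
    RAG = "rag"
    SAFETY = "safety"
    GENERATION = "generation"


def resolve_preset(value):
    """Accept either a Preset member or its string name (case-insensitive)."""
    if isinstance(value, Preset):
        return value
    return Preset(value.strip().lower())  # raises ValueError for unknown names


from_string = resolve_preset("RAG")
from_member = resolve_preset(Preset.SAFETY)
```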

Real LLM judges, out of the box

from evalgrid import configure, evaluate

# Explicit model
configure(judge="gpt-4o-mini")
evaluate(cases, metrics="generation")

# Explicit with API key + custom endpoint
configure(
    judge="gpt-4o",
    api_key="sk-...",
    base_url="https://my.azure.openai.com",
    temperature=0,
)

# Or just set env var and EvalGrid auto-detects
# export OPENAI_API_KEY=sk-...
evaluate(cases, metrics="generation")

Auto-detection priority: EVALGRID_JUDGE_MODEL > OPENAI_API_KEY > ANTHROPIC_API_KEY > GEMINI_API_KEY.
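The priority order above can be sketched as a first-match lookup over environment variables. `detect_judge` and its return shape are hypothetical; the sketch only mirrors the documented order and does not reproduce how EvalGrid picks a default model for each provider.

```python
import os

# Provider keys in documented priority order.
PRIORITY = [
    ("OPENAI_API_KEY", "openai"),
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("GEMINI_API_KEY", "gemini"),
]


def detect_judge(env=None):
    """Return (source, value) for the first matching setting, or None."""
    env = os.environ if env is None else env
    model = env.get("EVALGRID_JUDGE_MODEL")
    if model:  # an explicit model name wins over any provider key
        return ("model", model)
    for var, provider in PRIORITY:
        if env.get(var):
            return ("provider", provider)
    return None


explicit = detect_judge({"EVALGRID_JUDGE_MODEL": "gpt-4o-mini", "GEMINI_API_KEY": "k"})
fallback = detect_judge({"ANTHROPIC_API_KEY": "k", "GEMINI_API_KEY": "k"})
```

Passing `env` as a plain dict keeps the function easy to test without touching the real environment.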


CLI

# Scaffold a sample project
eval-grid init

# Run the bundled quickstart demo end-to-end + open HTML report
eval-grid quickstart

# Evaluate a dataset file with a preset
eval-grid eval --cases tests.xlsx --metrics rag --threshold 0.7

# List every registered metric
eval-grid list-metrics

# Autonomous adaptive evaluation
eval-grid auto --goal "test refusal of harmful prompts" --target openai

# Governed evaluation (6-step audit pipeline)
eval-grid govern --goal "production launch safety check" --data-file tests.xlsx

Custom metrics

from evalgrid import evaluate
from core.metric_registry import register_metric

@register_metric("my_custom_metric", description="Domain-specific score", tags=["custom"])
def my_metric(test_case, actual_output):
    score = compute_my_score(test_case.input, actual_output)
    return {"my_custom_metric": score}

evaluate(cases, metrics=["my_custom_metric"])

G-Eval — define your own judge rubric in plain English

from evalgrid import evaluate
from evalgrid.evals.structured_evals import GEvalMetric

GEvalMetric(
    name="insurance_response_quality",
    rubric_description="Evaluate an insurance chatbot response",
    evaluation_steps=[
        "Does the response acknowledge the customer's concern empathetically?",
        "Does it provide accurate information about the claims process?",
        "Does it avoid making unauthorised commitments?",
        "Does it direct the customer to appropriate next steps?",
    ],
).as_metric()

evaluate(cases, metrics=["insurance_response_quality"])

Governance & audit

For regulated or production workflows:

from evalgrid.governance import GovernancePipeline, EvalObjective, AcceptancePolicy

pipeline = GovernancePipeline(
    EvalObjective(suite="production", objective="safety launch gate"),
    AcceptancePolicy(min_sample_size=30)
        .add_gate("policy_safe", 1.0, tier="critical")
        .add_gate("refused", 1.0, tier="exploratory"),
)

outcome = pipeline.run(samples, runner, scorer)
# outcome.blocked, outcome.audit, outcome.report

Built-in: dataset versioning · judge prompt versioning · bias/leakage detection · red-flag audit log.
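How tiered gates might combine into a block/allow decision can be sketched as below. The semantics are assumptions, not documented EvalGrid behaviour: here a failed "critical" gate or an undersized sample blocks the run, while a failed "exploratory" gate only produces a warning. `Gate`, `Policy`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Gate:
    metric: str
    minimum: float
    tier: str  # "critical" or "exploratory"


@dataclass
class Policy:
    min_sample_size: int
    gates: list = field(default_factory=list)

    def add_gate(self, metric: str, minimum: float, tier: str = "critical"):
        self.gates.append(Gate(metric, minimum, tier))
        return self  # returning self allows chained add_gate calls

    def decide(self, rates: dict, n_samples: int) -> dict:
        """Block on too few samples or any failed critical gate; warn otherwise."""
        failed = [g for g in self.gates if rates.get(g.metric, 0.0) < g.minimum]
        blocking = [g.metric for g in failed if g.tier == "critical"]
        warnings = [g.metric for g in failed if g.tier != "critical"]
        blocked = n_samples < self.min_sample_size or bool(blocking)
        return {"blocked": blocked, "blocking": blocking, "warnings": warnings}


policy = (
    Policy(min_sample_size=30)
    .add_gate("policy_safe", 1.0, tier="critical")
    .add_gate("refused", 1.0, tier="exploratory")
)
decision = policy.decide({"policy_safe": 1.0, "refused": 0.9}, n_samples=40)
```

Treating a missing metric as 0.0 makes an unmeasured gate fail closed, which seems the safer default for a launch gate.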


Headline numbers

  • 349 tests passing (every public API is covered)
  • 🚀 20x parallel speedup on real-world workloads
  • 💸 81% fewer tokens vs single-rubric judging
  • 🧠 100+ built-in metrics — generation, RAG, safety, agent, bias, robustness, perf
  • 🔌 5 file formats for test data — Excel, JSON, JSONL, CSV, YAML
  • 🛡️ 4 LLM providers out of the box — OpenAI, Anthropic, Gemini, Ollama
  • 📊 Beautiful HTML reports with per-case scores, cost tracking, judge usage

License

MIT © Saro


Built to be the evaluation framework you actually enjoy using.
