EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

The Problem

Most teams deploying LLMs end up building their evaluation pipelines from scratch. RAGAS and DeepEval are libraries: they compute metrics, but they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
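
The pass/fail summary above can be read as a simple threshold gate. The sketch below is illustrative only and is not EvalForge's implementation: `MetricCheck`, `summarize`, and the `higher_is_better` flag are hypothetical names, and it assumes toxicity passes when the score is at or below its threshold while the other metrics pass at or above theirs.

```python
from dataclasses import dataclass


# Hypothetical helper type for illustration; not part of the EvalForge API.
@dataclass(frozen=True)
class MetricCheck:
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True  # assumption: toxicity is the lower-is-better case

    @property
    def passed(self) -> bool:
        # Quality metrics must meet or exceed the threshold;
        # lower-is-better metrics must stay at or below it.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def summarize(checks):
    """Build an overall verdict line from individual metric checks."""
    passing = sum(c.passed for c in checks)
    verdict = "PASS" if passing == len(checks) else "FAIL"
    return f"Overall: {verdict} ({passing}/{len(checks)} metrics passing)"


# Scores and thresholds copied from the sample output above.
checks = [
    MetricCheck("Faithfulness", 0.91, 0.85),
    MetricCheck("Answer Relevancy", 0.87, 0.80),
    MetricCheck("Context Precision", 0.78, 0.75),
    MetricCheck("Toxicity", 0.02, 0.05, higher_is_better=False),
]
print(summarize(checks))  # Overall: PASS (4/4 metrics passing)
```

Modeling the direction per metric keeps a single comparison rule from silently passing a high toxicity score.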

Features

Use-case-driven metric selection — describe your app, get optimal metrics
6 use case types — RAG, summarization, classification, generation, chat, code
16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
Synthetic test data generation — adversarial, edge cases, domain-specific
Drift detection — alerts when quality degrades over time
Human-in-the-loop — route uncertain evaluations to reviewers
Scheduled pipelines — daily/weekly automated evaluation runs
Benchmark registry — compare against published benchmarks
One-command deploy — Step Functions + Lambda infrastructure
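
To make the drift detection feature concrete, here is one common way such a check can work: compare the mean of recent scores against a baseline mean and flag a drop beyond a tolerance. This is a minimal sketch under assumed parameters, not EvalForge's algorithm; `detect_drift`, the window sizes, and `max_drop` are all hypothetical.

```python
from statistics import mean


def detect_drift(scores, baseline_window=5, recent_window=3, max_drop=0.05):
    """Return True when the recent mean falls more than max_drop below the baseline mean.

    scores is a chronological list of one metric's values, oldest first.
    """
    if len(scores) < baseline_window + recent_window:
        return False  # not enough history to compare two disjoint windows
    baseline = mean(scores[:baseline_window])
    recent = mean(scores[-recent_window:])
    return (baseline - recent) > max_drop


# Made-up faithfulness history: stable at first, then degrading.
history = [0.91, 0.90, 0.92, 0.91, 0.90, 0.89, 0.84, 0.82, 0.80]
print(detect_drift(history))  # True
```

A fixed baseline window is the simplest choice; a rolling baseline or a statistical test would be more robust to noisy metrics.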

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")
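
Because the results object exposes `all_passing`, it is straightforward to gate a CI job on an evaluation run. The sketch below relies only on that attribute from the example above; `exit_code_for` is a hypothetical helper, and a stub object stands in for a real `pipeline.run()` result so the snippet runs without EvalForge installed.

```python
from types import SimpleNamespace


def exit_code_for(results) -> int:
    """Map an evaluation result to a process exit code (0 = success)."""
    return 0 if results.all_passing else 1


# Stub standing in for the object returned by pipeline.run().
failing = SimpleNamespace(all_passing=False)
print(exit_code_for(failing))  # 1

# In a CI script you would finish with something like:
#   import sys
#   sys.exit(exit_code_for(pipeline.run()))
```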

Supported Use Cases & Auto-Selected Metrics

Use Case	Auto-Selected Metrics
rag	faithfulness, answer_relevancy, context_precision, context_recall, toxicity
summarization	rouge_l, bleu, coherence, conciseness, fluency
classification	accuracy, precision, recall, f1_score
generation	fluency, coherence, toxicity, bias_detection
chat	coherence, toxicity, injection_resistance, fluency
code	accuracy, coherence
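
The table above is effectively a lookup from use case to metric names. The sketch below transcribes it into a plain dictionary for reference; `AUTO_METRICS` and `metrics_for` are hypothetical names and do not reflect how EvalForge stores this mapping internally.

```python
# Transcribed from the table above; names are illustrative, not EvalForge internals.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case: str) -> list:
    """Return the auto-selected metric names for a use case."""
    key = use_case.strip().lower()
    if key not in AUTO_METRICS:
        known = ", ".join(sorted(AUTO_METRICS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}")
    return list(AUTO_METRICS[key])


distinct = {m for names in AUTO_METRICS.values() for m in names}
print(metrics_for("rag"))
print(len(distinct))  # 16
```

The table names 16 distinct metrics across the six use cases, which is consistent with the "16+ built-in metrics" figure in the feature list.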

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI

Email: gaurav@substrai.dev
GitHub: @substrai

Release history

0.5.0, May 13, 2026
0.4.0, May 13, 2026
0.3.0 (this version), May 13, 2026
0.2.0, May 13, 2026
0.1.0, May 13, 2026
