EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

License: MIT · Python 3.9+

The Problem

Teams deploying LLMs typically build their evaluation pipelines from scratch. RAGAS and DeepEval are libraries: they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

# Run the evaluation
evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
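
The pass/fail summary above can be reproduced with a small threshold gate. The following is a minimal sketch, not EvalForge's actual implementation: the `MetricResult` class and `overall_verdict` function are hypothetical names, and it assumes that toxicity is a lower-is-better metric (which the sample output implies, since 0.02 passes against a threshold of 0.05) and that comparisons are inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """One metric score with its threshold (hypothetical structure)."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True  # assumption: most metrics are higher-is-better

    @property
    def passed(self) -> bool:
        # Inclusive comparison is an assumption; the README does not specify it.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def overall_verdict(results):
    """Summarise a list of MetricResult objects in the style of the sample output."""
    passing = sum(r.passed for r in results)
    status = "PASS" if passing == len(results) else "FAIL"
    return f"Overall: {status} ({passing}/{len(results)} metrics passing)"


# Scores and thresholds copied from the sample output above.
SAMPLE_RESULTS = [
    MetricResult("Faithfulness", 0.91, 0.85),
    MetricResult("Answer Relevancy", 0.87, 0.80),
    MetricResult("Context Precision", 0.78, 0.75),
    MetricResult("Toxicity", 0.02, 0.05, higher_is_better=False),
]

print(overall_verdict(SAMPLE_RESULTS))  # Overall: PASS (4/4 metrics passing)
```

Keeping the direction of each metric explicit avoids the common mistake of treating a low toxicity score as a failure.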

Features

  • Use-case-driven metric selection — describe your app, get optimal metrics
  • 6 use case types — RAG, summarization, classification, generation, chat, code
  • 16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
  • Synthetic test data generation — adversarial, edge cases, domain-specific
  • Drift detection — alerts when quality degrades over time
  • Human-in-the-loop — route uncertain evaluations to reviewers
  • Scheduled pipelines — daily/weekly automated evaluation runs
  • Benchmark registry — compare against published benchmarks
  • One-command deploy — Step Functions + Lambda infrastructure
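
To make two of the features above more concrete, here is a minimal sketch of what drift detection and human-in-the-loop routing could look like. It is an illustration under stated assumptions, not EvalForge's algorithm: the function names, window sizes, `max_drop`, and `margin` are all hypothetical, and the drift rule used here (mean of recent runs versus mean of baseline runs) is just one simple choice.

```python
from statistics import mean


def detect_drift(scores, baseline_runs=5, recent_runs=3, max_drop=0.05):
    """Return True when recent quality has dropped noticeably below the baseline.

    `scores` is a chronological list of one metric's score per evaluation run.
    The first `baseline_runs` entries form the baseline; the last `recent_runs`
    entries form the recent window. All defaults are illustrative assumptions.
    """
    if len(scores) < baseline_runs + recent_runs:
        return False  # not enough history to compare two separate windows
    baseline = mean(scores[:baseline_runs])
    recent = mean(scores[-recent_runs:])
    return (baseline - recent) > max_drop


def needs_human_review(score, threshold, margin=0.05):
    """Flag an evaluation as uncertain when its score sits close to the threshold."""
    return abs(score - threshold) <= margin


history = [0.90, 0.91, 0.90, 0.92, 0.91, 0.84, 0.83, 0.82]
if detect_drift(history):
    print("Drift alert: recent scores are below the baseline")

if needs_human_review(score=0.78, threshold=0.75):
    print("Borderline result: route to a reviewer")
```

A windowed mean is easy to reason about but slow to react; a production system might prefer a statistical test or an exponentially weighted average.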

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag
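
The same three steps can be scripted, for example from a CI job. This is a hedged sketch that only uses the commands shown above; the helper names `quick_start_steps` and `run_steps` are hypothetical, and whether `evalforge` exits with a non-zero status on a failed evaluation is an assumption not confirmed by this README.

```python
import subprocess
from pathlib import Path


def quick_start_steps(project="my-eval", use_case="rag"):
    """Return (argv, working_directory) pairs mirroring the Quick Start commands."""
    return [
        # Scaffold the project from the current directory.
        (["evalforge", "init", project, "--use-case", use_case], Path(".")),
        # Run the evaluation inside the project (equivalent to `cd my-eval`).
        (["evalforge", "run"], Path(project)),
        # List the metrics available for this use case.
        (["evalforge", "metrics", "--use-case", use_case], Path(project)),
    ]


def run_steps(steps, runner=subprocess.run):
    """Execute each step in order, stopping at the first failing command.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit status. That this reflects a failed evaluation is an assumption.
    """
    for argv, cwd in steps:
        runner(argv, cwd=cwd, check=True)


# Usage (requires the evalforge CLI to be installed):
# run_steps(quick_start_steps("my-eval", "rag"))
```

The `runner` parameter exists so the sequence can be dry-run or tested without invoking the CLI.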

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")

Supported Use Cases & Auto-Selected Metrics

| Use Case | Auto-Selected Metrics |
|---|---|
| rag | faithfulness, answer_relevancy, context_precision, context_recall, toxicity |
| summarization | rouge_l, bleu, coherence, conciseness, fluency |
| classification | accuracy, precision, recall, f1_score |
| generation | fluency, coherence, toxicity, bias_detection |
| chat | coherence, toxicity, injection_resistance, fluency |
| code | accuracy, coherence |
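
The mapping above is effectively a lookup table from use case to metric names. The sketch below restates it as a plain Python dictionary to show how auto-selection could work; `AUTO_METRICS`, `metrics_for`, and `all_metrics` are hypothetical names for illustration and are not part of the EvalForge SDK as far as this README shows.

```python
# Use case -> auto-selected metrics, copied from the table above.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case):
    """Return the metric names for a use case, or raise ValueError if unknown."""
    try:
        return list(AUTO_METRICS[use_case])  # copy so callers cannot mutate the table
    except KeyError:
        known = ", ".join(sorted(AUTO_METRICS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}") from None


def all_metrics():
    """Return the sorted set of distinct metric names across all use cases."""
    return sorted({name for names in AUTO_METRICS.values() for name in names})


print(metrics_for("chat"))
print(len(all_metrics()))  # 16 distinct metrics appear in the table
```

Counting the distinct names in the table gives 16, which matches the "16+ built-in metrics" figure in the feature list.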

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI
