EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

License: MIT | Python 3.9+

The Problem

Every team deploying LLMs builds evaluation pipelines from scratch. RAGAS and DeepEval are libraries — they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

# Run evaluation
evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
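The sample output above implies a simple gating rule: each metric score is compared with its threshold, and the run passes only if every metric passes. The following is a minimal sketch of that rule, not EvalForge's actual implementation. It assumes, judging only from the sample numbers, that toxicity is an upper bound (lower is better) while the other metrics are lower bounds. `MetricCheck` and `overall` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    """One metric score compared against its threshold (hypothetical helper)."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True  # assumption: toxicity-style metrics set this to False

    @property
    def passing(self) -> bool:
        # Quality metrics must reach the threshold; "bad" metrics must stay at or under it.
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def overall(checks):
    """Summarize a list of checks in the same style as the sample output."""
    passed = sum(c.passing for c in checks)
    status = "PASS" if passed == len(checks) else "FAIL"
    return f"Overall: {status} ({passed}/{len(checks)} metrics passing)"


# Scores and thresholds copied from the sample run above.
checks = [
    MetricCheck("Faithfulness", 0.91, 0.85),
    MetricCheck("Answer Relevancy", 0.87, 0.80),
    MetricCheck("Context Precision", 0.78, 0.75),
    MetricCheck("Toxicity", 0.02, 0.05, higher_is_better=False),
]
print(overall(checks))  # Overall: PASS (4/4 metrics passing)
```

Modeling the direction of each metric explicitly avoids the common mistake of treating a low toxicity score as a failure.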

Features

  • Use-case-driven metric selection — describe your app, get optimal metrics
  • 6 use case types — RAG, summarization, classification, generation, chat, code
  • 16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
  • Synthetic test data generation — adversarial, edge cases, domain-specific
  • Drift detection — alerts when quality degrades over time
  • Human-in-the-loop — route uncertain evaluations to reviewers
  • Scheduled pipelines — daily/weekly automated evaluation runs
  • Benchmark registry — compare against published benchmarks
  • One-command deploy — Step Functions + Lambda infrastructure
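Two of the features above, drift detection and human-in-the-loop routing, can be illustrated with very small rules. The sketch below is one plausible way to express them and is not taken from EvalForge itself: drift is flagged when the mean of recent runs falls more than a tolerance below a baseline mean, and a result is routed to a reviewer when its score sits close to the threshold. The function names, window sizes, and tolerances are all assumptions chosen for the example.

```python
from statistics import mean


def detect_drift(scores, baseline_runs=5, recent_runs=3, max_drop=0.05):
    """Return True when the recent mean is more than max_drop below the baseline mean.

    scores is a chronological list of one metric's values, oldest first.
    """
    if len(scores) < baseline_runs + recent_runs:
        return False  # not enough history to compare two separate windows
    baseline = mean(scores[:baseline_runs])
    recent = mean(scores[-recent_runs:])
    return (baseline - recent) > max_drop


def needs_review(score, threshold, margin=0.03):
    """Return True when a score is too close to its threshold to trust automatically."""
    return abs(score - threshold) <= margin


# Hypothetical faithfulness history across nine scheduled runs.
history = [0.91, 0.90, 0.92, 0.91, 0.90, 0.89, 0.84, 0.82, 0.80]
print(detect_drift(history))     # True: recent runs dropped well below the baseline
print(needs_review(0.76, 0.75))  # True: borderline score, send to a reviewer
print(needs_review(0.91, 0.85))  # False: comfortably above the threshold
```

A production system would likely use a statistical test or a rolling window rather than fixed slices, but the shape of the decision is the same.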

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")
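A common use of the SDK is to gate a CI job on the evaluation result. The sketch below uses only the calls shown above (`EvalPipeline.for_use_case`, `run`, `summary`, `all_passing`); the helper names `exit_code_for` and `main` are hypothetical, and the import is done inside `main` so the module can be loaded even where the package is not installed.

```python
def exit_code_for(results) -> int:
    """Map an evaluation result to a process exit code (0 means success)."""
    return 0 if results.all_passing else 1


def main() -> int:
    # Imported lazily so this module loads without substrai-evalforge installed.
    from evalforge import EvalPipeline

    pipeline = EvalPipeline.for_use_case("rag")
    results = pipeline.run()
    print(results.summary())
    return exit_code_for(results)
```

In a CI entry point, calling `sys.exit(main())` would make the job fail whenever any metric falls outside its threshold.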

Supported Use Cases & Auto-Selected Metrics

Use Case        Auto-Selected Metrics
rag             faithfulness, answer_relevancy, context_precision, context_recall, toxicity
summarization   rouge_l, bleu, coherence, conciseness, fluency
classification  accuracy, precision, recall, f1_score
generation      fluency, coherence, toxicity, bias_detection
chat            coherence, toxicity, injection_resistance, fluency
code            accuracy, coherence
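The mapping above can be expressed as a plain lookup table, which is handy for validating a config file before a run. This is an illustrative sketch built from the table only; `AUTO_METRICS` and `metrics_for` are hypothetical names and do not describe how EvalForge stores its defaults.

```python
# Use case -> auto-selected metrics, transcribed from the table above.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case: str) -> list[str]:
    """Return the default metrics for a use case, or raise on an unknown type."""
    try:
        return list(AUTO_METRICS[use_case])  # copy so callers cannot mutate the table
    except KeyError:
        known = ", ".join(sorted(AUTO_METRICS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


print(metrics_for("chat"))
# ['coherence', 'toxicity', 'injection_resistance', 'fluency']
```

Taken together, the six rows name 16 distinct metrics.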

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI
