EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

License: MIT | Python 3.9+

The Problem

Most teams deploying LLMs end up building their evaluation pipelines from scratch. RAGAS and DeepEval are libraries: they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects for rag: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
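
The pass/fail summary above can be read as a per-metric threshold check. The following is a minimal sketch of that logic, not EvalForge's actual implementation: `MetricCheck` and `overall_verdict` are hypothetical names, and it assumes that toxicity passes when the score stays at or below its threshold while the other metrics pass when they meet or exceed theirs. The scores and thresholds are the ones from the sample output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCheck:
    """One metric result paired with its threshold (hypothetical helper)."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True

    @property
    def passing(self) -> bool:
        # Quality metrics must meet or exceed the threshold; risk metrics
        # such as toxicity must stay at or below it (an assumption here).
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def overall_verdict(checks):
    """Summarize a list of checks in the style of the sample output."""
    passed = sum(c.passing for c in checks)
    status = "PASS" if passed == len(checks) else "FAIL"
    return f"Overall: {status} ({passed}/{len(checks)} metrics passing)"


# Values copied from the sample `evalforge run` output.
checks = [
    MetricCheck("Faithfulness", 0.91, 0.85),
    MetricCheck("Answer Relevancy", 0.87, 0.80),
    MetricCheck("Context Precision", 0.78, 0.75),
    MetricCheck("Toxicity", 0.02, 0.05, higher_is_better=False),
]
print(overall_verdict(checks))  # Overall: PASS (4/4 metrics passing)
```

Keeping the comparison direction on each metric avoids treating a low toxicity score as a failure.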

Features

  • Use-case-driven metric selection — describe your app, get optimal metrics
  • 6 use case types — RAG, summarization, classification, generation, chat, code
  • 16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
  • Synthetic test data generation — adversarial, edge cases, domain-specific
  • Drift detection — alerts when quality degrades over time
  • Human-in-the-loop — route uncertain evaluations to reviewers
  • Scheduled pipelines — daily/weekly automated evaluation runs
  • Benchmark registry — compare against published benchmarks
  • One-command deploy — Step Functions + Lambda infrastructure
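
To make the drift detection feature concrete, here is one simple way such a check could work: compare the mean of the most recent runs against the mean of an earlier baseline window and flag a drop larger than a tolerance. This is an illustrative sketch only. The function name `detect_drift`, its parameters, and the score history are all assumptions and do not describe EvalForge's internal method.

```python
from statistics import mean


def detect_drift(scores, baseline_runs=5, recent_runs=3, max_drop=0.05):
    """Return (drifted, drop) for a chronological list of metric scores.

    `drop` is the baseline mean minus the recent mean, so a positive
    value means quality went down. All parameters are hypothetical.
    """
    if len(scores) < baseline_runs + recent_runs:
        # Not enough history to compare two separate windows.
        return False, 0.0
    baseline = mean(scores[:baseline_runs])
    recent = mean(scores[-recent_runs:])
    drop = baseline - recent
    return drop > max_drop, drop


# Made-up faithfulness scores from nine consecutive scheduled runs.
history = [0.91, 0.90, 0.92, 0.91, 0.90, 0.89, 0.84, 0.82, 0.80]
drifted, drop = detect_drift(history)
if drifted:
    print(f"Drift alert: faithfulness dropped by {drop:.3f}")
```

A windowed mean smooths out single noisy runs. A production check would likely also account for variance or use a statistical test.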

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")

Supported Use Cases & Auto-Selected Metrics

Use Case        Auto-Selected Metrics
rag             faithfulness, answer_relevancy, context_precision, context_recall, toxicity
summarization   rouge_l, bleu, coherence, conciseness, fluency
classification  accuracy, precision, recall, f1_score
generation      fluency, coherence, toxicity, bias_detection
chat            coherence, toxicity, injection_resistance, fluency
code            accuracy, coherence
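
The table above amounts to a lookup from use case type to a default metric list. The sketch below mirrors it as a plain dictionary, which can be handy for checking a config before a run. `AUTO_METRICS` and `metrics_for` are hypothetical names and not part of the EvalForge API; the metric names are copied from the table.

```python
# Mirror of the table above; not imported from EvalForge.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case: str) -> list:
    """Return the default metric names for a use case type."""
    key = use_case.strip().lower()
    if key not in AUTO_METRICS:
        known = ", ".join(sorted(AUTO_METRICS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}")
    return list(AUTO_METRICS[key])  # copy so callers cannot mutate the table


print(metrics_for("rag"))
# Use cases whose defaults include a toxicity check:
print(sorted(uc for uc, names in AUTO_METRICS.items() if "toxicity" in names))
```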

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI
