EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

License: MIT · Python 3.9+

The Problem

Every team deploying LLMs builds evaluation pipelines from scratch. RAGAS and DeepEval are libraries — they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
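The pass/fail summary above can be read as a per-metric threshold check. The following is a minimal sketch of that gating logic, not EvalForge's actual implementation: the `MetricCheck` class and `summarize` function are hypothetical names, and treating toxicity as "lower is better" is an assumption inferred from the sample output (0.02 passes against a threshold of 0.05).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCheck:
    """One metric score compared against its threshold (hypothetical helper)."""
    name: str
    score: float
    threshold: float
    higher_is_better: bool = True  # assumption: toxicity-style metrics invert this

    @property
    def passed(self) -> bool:
        if self.higher_is_better:
            return self.score >= self.threshold
        return self.score <= self.threshold


def summarize(checks):
    """Build an overall verdict line in the same shape as the sample output."""
    passing = sum(1 for c in checks if c.passed)
    verdict = "PASS" if passing == len(checks) else "FAIL"
    return f"Overall: {verdict} ({passing}/{len(checks)} metrics passing)"


# Scores and thresholds copied from the sample run above.
checks = [
    MetricCheck("Faithfulness", 0.91, 0.85),
    MetricCheck("Answer Relevancy", 0.87, 0.80),
    MetricCheck("Context Precision", 0.78, 0.75),
    MetricCheck("Toxicity", 0.02, 0.05, higher_is_better=False),
]
print(summarize(checks))
```

With the sample values, every check passes, so the verdict line matches the one shown in the run output.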

Features

  • Use-case-driven metric selection — describe your app, get optimal metrics
  • 6 use case types — RAG, summarization, classification, generation, chat, code
  • 16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
  • Synthetic test data generation — adversarial, edge cases, domain-specific
  • Drift detection — alerts when quality degrades over time
  • Human-in-the-loop — route uncertain evaluations to reviewers
  • Scheduled pipelines — daily/weekly automated evaluation runs
  • Benchmark registry — compare against published benchmarks
  • One-command deploy — Step Functions + Lambda infrastructure
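To make the drift detection feature concrete, here is one simple way such a check could work: compare the average score of the most recent runs against a baseline window and alert when the drop exceeds a tolerance. This is an illustrative sketch only. The README does not say which method EvalForge uses, and `detect_drift` and all of its parameters are hypothetical.

```python
from statistics import mean


def detect_drift(scores, baseline_runs=5, recent_runs=3, max_drop=0.05):
    """Return True when recent scores fall well below the baseline average.

    scores: chronological metric scores, oldest first (hypothetical input).
    baseline_runs: number of earliest runs that define "normal" quality.
    recent_runs: number of latest runs to compare against the baseline.
    max_drop: largest tolerated decrease in the mean score.
    """
    if len(scores) < baseline_runs + recent_runs:
        return False  # not enough history to compare two separate windows
    baseline = mean(scores[:baseline_runs])
    recent = mean(scores[-recent_runs:])
    return (baseline - recent) > max_drop


# Made-up faithfulness scores from nine consecutive scheduled runs.
history = [0.91, 0.90, 0.92, 0.91, 0.90, 0.89, 0.84, 0.82, 0.80]
print(detect_drift(history))
```

A fixed-window mean comparison is easy to reason about but reacts slowly to gradual decline; a production system might prefer a statistical test or a rolling baseline.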

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")
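One common use of the `all_passing` flag is to gate a CI job on the evaluation result. The sketch below assumes only what the snippet above shows, namely that the result object exposes a boolean `all_passing` attribute. The `exit_code_for` helper is hypothetical, and a `SimpleNamespace` stands in for a real result so the example runs without EvalForge installed.

```python
from types import SimpleNamespace


def exit_code_for(results) -> int:
    """Map an evaluation result to a process exit code (0 means success)."""
    return 0 if results.all_passing else 1


# Stand-in for the object returned by pipeline.run(); only `all_passing` is assumed.
stand_in = SimpleNamespace(all_passing=False)
print(exit_code_for(stand_in))
```

In a real script, the value could be passed to `sys.exit(...)` so that a failing evaluation fails the build.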

Supported Use Cases & Auto-Selected Metrics

| Use Case | Auto-Selected Metrics |
|---|---|
| rag | faithfulness, answer_relevancy, context_precision, context_recall, toxicity |
| summarization | rouge_l, bleu, coherence, conciseness, fluency |
| classification | accuracy, precision, recall, f1_score |
| generation | fluency, coherence, toxicity, bias_detection |
| chat | coherence, toxicity, injection_resistance, fluency |
| code | accuracy, coherence |
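The mapping above can be restated as a plain lookup table, which is handy for checking which metrics to expect before a run. This is a sketch that copies the table into a dictionary; it is not EvalForge's internal registry, and `AUTO_METRICS`, `metrics_for`, and `use_cases_with` are hypothetical names.

```python
# Use case -> auto-selected metrics, copied from the table above.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case):
    """Return the metrics listed for a use case, or raise on an unknown one."""
    key = use_case.strip().lower()
    if key not in AUTO_METRICS:
        known = ", ".join(sorted(AUTO_METRICS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}")
    return list(AUTO_METRICS[key])  # copy so callers cannot mutate the table


def use_cases_with(metric):
    """Return the use cases whose metric list includes the given metric."""
    return sorted(uc for uc, names in AUTO_METRICS.items() if metric in names)


print(metrics_for("rag"))
print(use_cases_with("toxicity"))
```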

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI
