EvalForge

Automated LLM evaluation pipeline generator.

Built by SubstrAI — Open-source GenAI frameworks for serverless infrastructure.

License: MIT | Python 3.9+

The Problem

Every team deploying LLMs builds evaluation pipelines from scratch. RAGAS and DeepEval are libraries — they don't generate infrastructure, schedule runs, detect drift, or route to human reviewers.

The Solution

Describe your use case → EvalForge generates the complete evaluation pipeline:

# evalforge.yaml
use_case:
  type: rag
  description: "Customer support chatbot"
  model:
    provider: bedrock
    model_id: anthropic.claude-3-haiku-20240307-v1:0

evaluation:
  metrics: auto  # auto-selects: faithfulness, answer_relevancy, context_precision, context_recall, toxicity

# Run the evaluation
evalforge run
# Faithfulness:      0.91 ✓ (threshold: 0.85)
# Answer Relevancy:  0.87 ✓ (threshold: 0.80)
# Context Precision: 0.78 ✓ (threshold: 0.75)
# Toxicity:          0.02 ✓ (threshold: 0.05)
# Overall: PASS (4/4 metrics passing)
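
The sample output mixes two kinds of thresholds: faithfulness, relevancy, and precision pass when the score is at or above the threshold, while toxicity passes when the score is at or below it. A minimal sketch of that check follows. It is not EvalForge's implementation; the `LOWER_IS_BETTER` set and the `check` helper are hypothetical names used only for illustration, and the scores are copied from the output above.

```python
# Hypothetical helper, not part of EvalForge: reproduces the PASS/FAIL logic
# implied by the sample output.
LOWER_IS_BETTER = {"toxicity"}  # assumption: only toxicity uses a maximum threshold


def check(scores, thresholds):
    """Return {metric: passed} for every metric that has a threshold."""
    verdicts = {}
    for name, threshold in thresholds.items():
        score = scores[name]
        if name in LOWER_IS_BETTER:
            verdicts[name] = score <= threshold  # lower is better
        else:
            verdicts[name] = score >= threshold  # higher is better
    return verdicts


scores = {"faithfulness": 0.91, "answer_relevancy": 0.87,
          "context_precision": 0.78, "toxicity": 0.02}
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80,
              "context_precision": 0.75, "toxicity": 0.05}

verdicts = check(scores, thresholds)
passed = sum(verdicts.values())
print(f"Overall: {'PASS' if passed == len(verdicts) else 'FAIL'} "
      f"({passed}/{len(verdicts)} metrics passing)")
```

Keeping the direction of each threshold explicit avoids a toxicity score of 0.02 being read as a failure against a 0.05 threshold.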

Features

  • Use-case-driven metric selection — describe your app, get optimal metrics
  • 6 use case types — RAG, summarization, classification, generation, chat, code
  • 16+ built-in metrics — faithfulness, ROUGE, BLEU, toxicity, injection resistance, F1
  • Synthetic test data generation — adversarial, edge cases, domain-specific
  • Drift detection — alerts when quality degrades over time
  • Human-in-the-loop — route uncertain evaluations to reviewers
  • Scheduled pipelines — daily/weekly automated evaluation runs
  • Benchmark registry — compare against published benchmarks
  • One-command deploy — Step Functions + Lambda infrastructure
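
To make the drift detection and human-in-the-loop features concrete, here is a rough sketch of how such checks could work in principle. It does not show EvalForge's internals: `detect_drift`, `needs_review`, and the `window`, `tolerance`, and `margin` parameters are all hypothetical, and the score histories are made-up illustration data.

```python
from statistics import mean


def detect_drift(history, window=3, tolerance=0.05):
    """Flag drift when the mean of the last `window` scores falls more than
    `tolerance` below the mean of all earlier scores."""
    if len(history) <= window:
        return False  # not enough data to form a baseline
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return (baseline - recent) > tolerance


def needs_review(score, threshold, margin=0.03):
    """Route a result to a human when it sits within `margin` of its threshold."""
    return abs(score - threshold) <= margin


stable = [0.91, 0.90, 0.92, 0.91, 0.90, 0.91]      # illustrative scores
degrading = [0.91, 0.90, 0.92, 0.84, 0.82, 0.80]   # illustrative scores

print(detect_drift(stable))       # False
print(detect_drift(degrading))    # True
print(needs_review(0.76, 0.75))   # True: too close to the threshold to trust
```

A fixed-window mean comparison is the simplest possible drift signal; a production pipeline would more likely use a statistical test or a longer baseline.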

Installation

pip install substrai-evalforge

Quick Start

# Scaffold project
evalforge init my-eval --use-case rag

# Run evaluation
cd my-eval
evalforge run

# List available metrics
evalforge metrics --use-case rag

Python SDK

from evalforge import EvalPipeline

# Quick start for any use case
pipeline = EvalPipeline.for_use_case("rag")
results = pipeline.run()
print(results.summary())
print(f"All passing: {results.all_passing}")
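
In a CI job, the `all_passing` flag can decide the exit code. The sketch below relies only on the two members the snippet above uses (`summary()` and `all_passing`); `exit_code_for` is a hypothetical helper, and a stand-in object replaces a real pipeline run so the example executes without EvalForge installed.

```python
from types import SimpleNamespace


def exit_code_for(results):
    """Map an evaluation result to a process exit code (0 = pass, 1 = fail)."""
    print(results.summary())
    return 0 if results.all_passing else 1


# Stand-in for the object returned by pipeline.run(); only the two members
# used in the SDK snippet are modelled here.
fake_results = SimpleNamespace(
    all_passing=False,
    summary=lambda: "Overall: FAIL (3/4 metrics passing)",
)

code = exit_code_for(fake_results)
print(code)  # 1
```

In a real script, the stand-in would be replaced by `pipeline.run()` and the return value passed to `sys.exit`.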

Supported Use Cases & Auto-Selected Metrics

| Use Case | Auto-Selected Metrics |
|---|---|
| rag | faithfulness, answer_relevancy, context_precision, context_recall, toxicity |
| summarization | rouge_l, bleu, coherence, conciseness, fluency |
| classification | accuracy, precision, recall, f1_score |
| generation | fluency, coherence, toxicity, bias_detection |
| chat | coherence, toxicity, injection_resistance, fluency |
| code | accuracy, coherence |
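
The mapping above can be expressed as a plain lookup table, which is handy for validating a config before a run. This is an illustrative sketch built from the table, not EvalForge's own data structure; `AUTO_METRICS` and `metrics_for` are hypothetical names.

```python
# Hypothetical lookup table transcribed from the use-case table above.
AUTO_METRICS = {
    "rag": ["faithfulness", "answer_relevancy", "context_precision",
            "context_recall", "toxicity"],
    "summarization": ["rouge_l", "bleu", "coherence", "conciseness", "fluency"],
    "classification": ["accuracy", "precision", "recall", "f1_score"],
    "generation": ["fluency", "coherence", "toxicity", "bias_detection"],
    "chat": ["coherence", "toxicity", "injection_resistance", "fluency"],
    "code": ["accuracy", "coherence"],
}


def metrics_for(use_case):
    """Return the auto-selected metrics for a use case, or raise ValueError."""
    try:
        return AUTO_METRICS[use_case]
    except KeyError:
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of {sorted(AUTO_METRICS)}"
        ) from None


# Distinct metric names across all six use cases.
all_metrics = sorted({m for metrics in AUTO_METRICS.values() for m in metrics})

print(metrics_for("chat"))
print(len(all_metrics))  # 16
```

The 16 distinct names match the "16+ built-in metrics" figure in the feature list.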

License

MIT — see LICENSE

Author

Gaurav Kumar Sinha — Founder, SubstrAI

Download files

Source Distribution

substrai_evalforge-0.2.0.tar.gz (29.3 kB)

Uploaded Source

Built Distribution

substrai_evalforge-0.2.0-py3-none-any.whl (31.3 kB)

Uploaded Python 3

File details

Details for the file substrai_evalforge-0.2.0.tar.gz.

File metadata

  • Download URL: substrai_evalforge-0.2.0.tar.gz
  • Upload date:
  • Size: 29.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for substrai_evalforge-0.2.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | f86ab90b27f242dcf25b702c3d26a89efaa6849f4360c1ae55a00dcd80065732 |
| MD5 | d0a98795abffad2f4c4a497096841b1a |
| BLAKE2b-256 | 145e3313b612a98583564dc9966d5d5e5a2b846f21994b41de2ad72ed793edba |
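
A downloaded archive can be checked against the published SHA256 digest with the standard library alone. The sketch below is a generic verification routine, not something shipped with EvalForge; `sha256_of` and `verify` are hypothetical helper names, and the expected digest is the SHA256 value listed above for the source distribution.

```python
import hashlib

# SHA256 digest published above for substrai_evalforge-0.2.0.tar.gz
EXPECTED = "f86ab90b27f242dcf25b702c3d26a89efaa6849f4360c1ae55a00dcd80065732"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """Return True when the file's SHA256 digest matches `expected`."""
    return sha256_of(path) == expected
```

Calling `verify("substrai_evalforge-0.2.0.tar.gz")` on the downloaded file should return `True` if the archive is intact.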

File details

Details for the file substrai_evalforge-0.2.0-py3-none-any.whl.

File hashes

Hashes for substrai_evalforge-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5636b42b4c7447a33ebb832210d648ff5f01d4cae4678bd13ef47cf58a7fde5 |
| MD5 | 4ea73742d9e66465c922f055c29ebec6 |
| BLAKE2b-256 | 902f3eb47c62ef80b16d28592e9d6fd26675bf8968bab35eda6347390ad9deaa |
