Next-generation evaluation framework for LLM applications with research-grade validation and production-ready performance

EvalX: Next-Generation LLM Evaluation Framework

EvalX is a comprehensive evaluation framework for Large Language Model applications that combines traditional metrics, LLM-as-judge evaluations, and intelligent agentic orchestration with research-grade validation.

🚀 Key Features

  • 🤖 Agentic Orchestration: Natural language instructions → automatic evaluation planning
  • 📊 Comprehensive Metrics: Traditional + LLM-as-judge + hybrid approaches
  • 🔬 Research-Grade Validation: Statistical analysis, confidence intervals, meta-evaluation
  • 🎨 Multimodal Support: Vision-language, code, audio evaluation
  • ⚡ Production Ready: Async processing, caching, CLI interface
  • 🎯 Adaptive Selection: AI-powered optimal metric selection

🏗️ Unique Innovations

Meta-Evaluation System

EvalX includes a meta-evaluation system that assesses the quality of the evaluation metrics themselves:

  • Reliability assessment through test-retest analysis
  • Validity measurement against ground truth
  • Bias detection across demographic groups
  • Interpretability scoring

Adaptive Metric Selection

Automatically selects optimal metrics based on:

  • Task type and domain
  • Quality requirements (research vs. production)
  • Computational constraints
  • Fairness requirements
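
To make the four inputs concrete, here is a minimal hand-written sketch of how such a selector could map requirements to metric names. It is a plain rule table, not EvalX's selection algorithm. The `Requirements` class, the `select_metrics` function, and the `bias_check` metric name are hypothetical; the other metric names are borrowed from the examples elsewhere in this README.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    task_type: str        # task type, e.g. "qa" or "summarization"
    research_grade: bool  # quality bar: research (True) vs. production (False)
    max_llm_calls: int    # computational budget; 0 disables LLM-as-judge metrics
    check_fairness: bool  # whether a fairness check is required


def select_metrics(req):
    """Return a list of metric names for the given requirements (illustrative rules)."""
    metrics = []
    # Task type decides the cheap reference-based metric
    if req.task_type == "summarization":
        metrics.append("rouge")
    elif req.task_type == "qa":
        metrics.append("exact_match")
    else:
        metrics.append("semantic_similarity")
    # LLM judges are only added when the budget allows them
    if req.max_llm_calls > 0:
        metrics.append("accuracy")
        if req.research_grade:
            metrics.append("helpfulness")
    # Hypothetical fairness metric name
    if req.check_fairness:
        metrics.append("bias_check")
    return metrics


print(select_metrics(Requirements("qa", True, 100, False)))
```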

📦 Installation

pip install evalx

For development:

pip install "evalx[dev]"

For research features:

pip install "evalx[research]"

For production deployment:

pip install "evalx[production]"

🎯 Quick Start

Natural Language Evaluation

import asyncio

import evalx

# Create evaluation suite from natural language instruction
suite = evalx.EvaluationSuite.from_instruction(
    "Evaluate my chatbot responses for helpfulness and accuracy"
)

# Your data
data = [
    {
        "input": "What's the capital of France?",
        "output": "The capital of France is Paris.",
        "reference": "Paris is the capital city of France."
    }
]

# Run evaluation. evaluate_async is awaited, so in a regular script it must run
# inside a coroutine; a bare top-level await only works in a notebook or async REPL.
async def main():
    results = await suite.evaluate_async(data)
    print(results.summary())

asyncio.run(main())

Fine-Grained Control

from evalx import MetricSuite

# Create custom metric combination
suite = MetricSuite()
suite.add_traditional_metric("bleu_score")
suite.add_traditional_metric("semantic_similarity", threshold=0.8)
suite.add_llm_judge("accuracy", model="gpt-4")

# `data` is the list of records defined in the previous example
results = suite.evaluate(data)

Research-Grade Analysis

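import asyncio

from evalx import ResearchSuite

# Comprehensive statistical analysis
suite = ResearchSuite(
    metrics=["accuracy", "helpfulness", "bleu"],
    confidence_level=0.95,
    bootstrap_samples=1000
)

# `data` is the list of records from the Natural Language Evaluation example.
# The await must run inside a coroutine when called from a regular script.
async def main():
    results = await suite.evaluate_research_grade(data)
    print(f"Mean ± Std: {results.mean:.3f} ± {results.std:.3f}")
    print(f"95% CI: [{results.confidence_interval[0]:.3f}, {results.confidence_interval[1]:.3f}]")

asyncio.run(main())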
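
For readers unfamiliar with the `confidence_level` and `bootstrap_samples` parameters, the sketch below shows one common way a bootstrap confidence interval for a mean score is computed: the percentile bootstrap. Whether EvalX uses this exact variant is an assumption. The `bootstrap_ci` function and the sample scores are hypothetical and use only the standard library.

```python
import random
from statistics import mean


def bootstrap_ci(scores, confidence_level=0.95, bootstrap_samples=1000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(bootstrap_samples)
    )
    alpha = 1.0 - confidence_level
    # Indices of the lower and upper percentiles, clamped to the valid range
    lo_idx = max(0, round(alpha / 2 * bootstrap_samples))
    hi_idx = min(bootstrap_samples - 1, round((1 - alpha / 2) * bootstrap_samples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-item scores (illustration only)
scores = [0.82, 0.41, 0.67, 0.90, 0.55, 0.73, 0.61, 0.78]
low, high = bootstrap_ci(scores)
print(f"Mean: {mean(scores):.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

More bootstrap samples make the interval endpoints more stable at the cost of extra computation; with very few items the interval is wide regardless.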

🎨 Multimodal Evaluation

from evalx.metrics.multimodal import MultimodalInput, ImageCaptionQualityMetric

# Image captioning evaluation
input_data = MultimodalInput(
    input_text="Describe this image",
    output_text="A beautiful sunset over the ocean",
    image="path/to/image.jpg"
)

metric = ImageCaptionQualityMetric()
result = metric.evaluate(input_data)

🔬 Meta-Evaluation

from evalx.meta_evaluation import MetaEvaluator

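# Evaluate your metrics' quality.
# my_metric, test_data, and human_ratings are placeholders for your own
# metric object, evaluation records, and human-assigned scores.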
meta_evaluator = MetaEvaluator()
quality_report = meta_evaluator.evaluate_metric_quality(
    metric=my_metric,
    evaluation_data=test_data,
    ground_truth=human_ratings
)

print(f"Metric Quality: {quality_report.overall_quality:.3f}")
print(f"Reliability: {quality_report.reliability:.3f}")
print(f"Validity: {quality_report.validity:.3f}")
print(f"Bias Score: {quality_report.bias:.3f}")

🖥️ Command Line Interface

# Evaluate using natural language
evalx evaluate "Check my chatbot for helpfulness" --data data.json

# Research-grade evaluation
evalx research --data data.json --metrics accuracy helpfulness --confidence 0.95

# List available metrics
evalx metrics --list

📊 Supported Metrics

Traditional Metrics

  • BLEU: N-gram overlap with smoothing
  • ROUGE: Recall-oriented evaluation (ROUGE-1, ROUGE-2, ROUGE-L)
  • METEOR: Semantic matching with synonyms and stemming
  • BERTScore: Contextual embedding similarity
  • Semantic Similarity: Sentence transformer-based
  • Exact Match: String matching with normalization
  • Levenshtein: Edit distance at word or character level
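
As an illustration of the last two entries, here is a minimal standard-library sketch of exact match with normalization and of Levenshtein distance at both character and word level. The normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are an assumption and may differ from what EvalX applies; the function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(output, reference):
    """True when the two strings are identical after normalization."""
    return normalize(output) == normalize(reference)


def levenshtein(a, b):
    """Edit distance between two sequences.

    Pass strings for character-level distance or token lists for word-level.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(exact_match("Paris.", "  paris "))
print(levenshtein("kitten", "sitting"))
print(levenshtein("the capital is paris".split(),
                  "the capital city is paris".split()))
```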

LLM-as-Judge Metrics

  • Accuracy: Factual correctness assessment
  • Helpfulness: Response utility evaluation
  • Coherence: Logical consistency measurement
  • Groundedness: Source attribution verification
  • Relevance: Query-response alignment
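
The general LLM-as-judge pattern behind these metrics can be sketched as two steps: build a rubric prompt for a criterion, then parse a numeric score from the model's reply. The prompt wording, the 1 to 5 scale, and both function names below are hypothetical, not EvalX's actual prompts or API. The call to the model itself is deliberately left out.

```python
import re


def build_judge_prompt(criterion, question, answer):
    """Build a rubric prompt asking a judge model to rate one answer."""
    return (
        f"Rate the following answer for {criterion} on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a line of the form 'Score: <1-5>' followed by a short justification."
    )


def parse_judge_score(reply):
    """Extract the integer score from a judge reply, or None if it is missing or out of range."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


prompt = build_judge_prompt(
    "accuracy",
    "What's the capital of France?",
    "The capital of France is Paris.",
)
print(prompt)
# A hand-written reply standing in for real model output
print(parse_judge_score("Score: 5\nThe answer is factually correct."))
```

Returning `None` for an unparseable reply lets the caller retry or discard the item instead of silently recording a wrong score.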

Multimodal Metrics

  • Image-Text Alignment: CLIP-based similarity
  • Image Caption Quality: Comprehensive captioning assessment
  • Code Correctness: Syntax, execution, security analysis
  • Audio Quality: Signal processing metrics

🏆 Why EvalX?

Feature                     EvalX          DeepEval  LangChain  Ragas
Meta-evaluation             Unique
Statistical rigor           Best           Basic     Basic      Good
Multimodal support          Comprehensive  Limited   Limited    Limited
Adaptive selection          Unique
Natural language interface  Full
Production ready            Complete       Good      Basic      Good

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built for the AI evaluation community
  • Inspired by advances in LLM evaluation research
  • Designed for both researchers and practitioners

📞 Support


EvalX: Making AI evaluation comprehensive, reliable, and accessible.

Download files

Source Distribution

evalx-0.1.0.tar.gz (61.1 kB view details)

Uploaded Source

Built Distribution

evalx-0.1.0-py3-none-any.whl (53.9 kB view details)

Uploaded Python 3

File details

Details for the file evalx-0.1.0.tar.gz.

File metadata

  • Download URL: evalx-0.1.0.tar.gz
  • Upload date:
  • Size: 61.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for evalx-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ddff980e78c842ab2756a3f5d742b8cf45ef4a17b1d236723bfa1c8b5acc51ea
MD5 0cad71d71d39f1565c15eeb3e93fd697
BLAKE2b-256 87826d71ac0bfe49cea8c065e79f7ab9de8cfa673d9dc18af5361ab0d1db3764

File details

Details for the file evalx-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: evalx-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 53.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for evalx-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 975877945db63205b77be84cf05ba08f49a70c7e5baa424dd30f3b5986bc4a0c
MD5 49244520443c9b4f152742c12a1bdb83
BLAKE2b-256 4a0e4f77ab427973123e724054deeb7fed97d55efd71a9f6a27ae69394c1565a
