Next-generation evaluation framework for LLM applications with research-grade validation and production-ready performance
EvalX: Next-Generation LLM Evaluation Framework
EvalX is a comprehensive evaluation framework for Large Language Model applications that combines traditional metrics, LLM-as-judge evaluations, and intelligent agentic orchestration with research-grade validation.
🚀 Key Features
- 🤖 Agentic Orchestration: Natural language instructions → automatic evaluation planning
- 📊 Comprehensive Metrics: Traditional + LLM-as-judge + hybrid approaches
- 🔬 Research-Grade Validation: Statistical analysis, confidence intervals, meta-evaluation
- 🎨 Multimodal Support: Vision-language, code, audio evaluation
- ⚡ Production Ready: Async processing, caching, CLI interface
- 🎯 Adaptive Selection: AI-powered optimal metric selection
🏗️ Unique Innovations
Meta-Evaluation System
EvalX includes a meta-evaluation system, which we believe is the first of its kind, that assesses the quality of the evaluation metrics themselves:
- Reliability assessment through test-retest analysis
- Validity measurement against ground truth
- Bias detection across demographic groups
- Interpretability scoring
Adaptive Metric Selection
Automatically selects optimal metrics based on:
- Task type and domain
- Quality requirements (research vs. production)
- Computational constraints
- Fairness requirements
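The criteria above can be pictured as a small rule-based selector. The sketch below is only an illustration of the idea: `SelectionRequest`, `TASK_METRICS`, `select_metrics`, and the `bias_detection` metric name are hypothetical and are not part of the documented EvalX API, which is described as AI-powered rather than rule-based.

```python
from dataclasses import dataclass


@dataclass
class SelectionRequest:
    task_type: str         # e.g. "qa", "summarization", "translation"
    research_grade: bool   # True favours rigour, False favours speed
    allow_llm_calls: bool  # computational/cost constraint
    check_fairness: bool   # whether a fairness check is required


# Hypothetical lookup table; names loosely follow the "Supported Metrics" list.
TASK_METRICS = {
    "qa": ["exact_match", "semantic_similarity"],
    "summarization": ["rouge", "bertscore"],
    "translation": ["bleu", "meteor"],
}


def select_metrics(request: SelectionRequest) -> list[str]:
    """Pick metrics from the task type, then extend them based on the constraints."""
    metrics = list(TASK_METRICS.get(request.task_type, ["semantic_similarity"]))
    if request.allow_llm_calls:
        metrics.append("accuracy")  # LLM-as-judge metric
        if request.research_grade:
            metrics.append("coherence")
    if request.check_fairness:
        metrics.append("bias_detection")  # hypothetical metric name
    return metrics


print(select_metrics(SelectionRequest("qa", True, True, False)))
```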
📦 Installation
pip install evalx
For development:
pip install "evalx[dev]"
For research features:
pip install "evalx[research]"
For production deployment:
pip install "evalx[production]"
🎯 Quick Start
Natural Language Evaluation
import asyncio
import evalx
# Create an evaluation suite from a natural language instruction
suite = evalx.EvaluationSuite.from_instruction(
    "Evaluate my chatbot responses for helpfulness and accuracy"
)
# Your data
data = [
    {
        "input": "What's the capital of France?",
        "output": "The capital of France is Paris.",
        "reference": "Paris is the capital city of France."
    }
]
# Run the evaluation (outside an async context, drive the coroutine with asyncio.run)
results = asyncio.run(suite.evaluate_async(data))
print(results.summary())
Fine-Grained Control
from evalx import MetricSuite
# Create custom metric combination
suite = MetricSuite()
suite.add_traditional_metric("bleu_score")
suite.add_traditional_metric("semantic_similarity", threshold=0.8)
suite.add_llm_judge("accuracy", model="gpt-4")
results = suite.evaluate(data)
Research-Grade Analysis
import asyncio
from evalx import ResearchSuite
# Comprehensive statistical analysis
suite = ResearchSuite(
    metrics=["accuracy", "helpfulness", "bleu"],
    confidence_level=0.95,
    bootstrap_samples=1000
)
# Outside an async context, drive the coroutine with asyncio.run
results = asyncio.run(suite.evaluate_research_grade(data))
print(f"Mean ± Std: {results.mean:.3f} ± {results.std:.3f}")
print(f"95% CI: [{results.confidence_interval[0]:.3f}, {results.confidence_interval[1]:.3f}]")
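For readers unfamiliar with what `confidence_level` and `bootstrap_samples` typically control, the following is a minimal sketch of a percentile bootstrap confidence interval for a mean score. It is written against the standard library only; `bootstrap_ci` and the sample scores are assumptions for illustration, and EvalX may compute its intervals differently.

```python
import random
from statistics import mean


def bootstrap_ci(scores, confidence_level=0.95, bootstrap_samples=1000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(bootstrap_samples)
    )
    alpha = 1.0 - confidence_level
    # Take the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lower_idx = max(0, round((alpha / 2) * bootstrap_samples))
    upper_idx = min(bootstrap_samples - 1, round((1 - alpha / 2) * bootstrap_samples) - 1)
    return means[lower_idx], means[upper_idx]


# Illustrative per-example scores, not EvalX output.
scores = [0.70, 0.80, 0.90, 0.60, 0.75, 0.85, 0.65, 0.80]
low, high = bootstrap_ci(scores)
print(f"mean = {mean(scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

More bootstrap samples give a smoother estimate of the interval at the cost of more computation.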
🎨 Multimodal Evaluation
from evalx.metrics.multimodal import MultimodalInput, ImageCaptionQualityMetric
# Image captioning evaluation
input_data = MultimodalInput(
    input_text="Describe this image",
    output_text="A beautiful sunset over the ocean",
    image="path/to/image.jpg"
)
metric = ImageCaptionQualityMetric()
result = metric.evaluate(input_data)
🔬 Meta-Evaluation
from evalx.meta_evaluation import MetaEvaluator
# Evaluate your metrics' quality
# my_metric, test_data, and human_ratings are placeholders for your own metric,
# evaluation set, and human ground-truth ratings
meta_evaluator = MetaEvaluator()
quality_report = meta_evaluator.evaluate_metric_quality(
    metric=my_metric,
    evaluation_data=test_data,
    ground_truth=human_ratings
)
print(f"Metric Quality: {quality_report.overall_quality:.3f}")
print(f"Reliability: {quality_report.reliability:.3f}")
print(f"Validity: {quality_report.validity:.3f}")
print(f"Bias Score: {quality_report.bias:.3f}")
🖥️ Command Line Interface
# Evaluate using natural language
evalx evaluate "Check my chatbot for helpfulness" --data data.json
# Research-grade evaluation
evalx research --data data.json --metrics accuracy helpfulness --confidence 0.95
# List available metrics
evalx metrics --list
📊 Supported Metrics
Traditional Metrics
- BLEU: N-gram overlap with smoothing
- ROUGE: Recall-oriented evaluation (ROUGE-1, ROUGE-2, ROUGE-L)
- METEOR: Semantic matching with synonyms and stemming
- BERTScore: Contextual embedding similarity
- Semantic Similarity: Sentence transformer-based
- Exact Match: String matching with normalization
- Levenshtein: Edit distance at the word or character level
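As a rough illustration of the last two entries, the sketch below shows exact match with simple normalization and a Levenshtein distance that works at either character or word level depending on whether it receives strings or token lists. The function names and normalization rules are assumptions for this example, not EvalX's implementation.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(output, reference):
    """True when the two strings are identical after normalization."""
    return normalize(output) == normalize(reference)


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(exact_match("Paris.", "  paris"))                           # True
print(levenshtein("kitten", "sitting"))                           # 3 (character level)
print(levenshtein("the cat sat".split(), "the dog sat".split()))  # 1 (word level)
```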
LLM-as-Judge Metrics
- Accuracy: Factual correctness assessment
- Helpfulness: Response utility evaluation
- Coherence: Logical consistency measurement
- Groundedness: Source attribution verification
- Relevance: Query-response alignment
Multimodal Metrics
- Image-Text Alignment: CLIP-based similarity
- Image Caption Quality: Comprehensive captioning assessment
- Code Correctness: Syntax, execution, security analysis
- Audio Quality: Signal processing metrics
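To give a sense of what the syntax stage of a code-correctness check might involve, here is a small sketch that uses Python's built-in `ast` module. The function `check_python_syntax` is hypothetical and covers only syntax; execution and security analysis are out of scope for this example, and EvalX's own metric may work differently.

```python
import ast


def check_python_syntax(source):
    """Return (ok, message) describing whether `source` parses as Python."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


print(check_python_syntax("def add(a, b):\n    return a + b\n"))
print(check_python_syntax("def add(a, b:\n    return a + b\n"))
```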
🏆 Why EvalX?
| Feature | EvalX | DeepEval | LangChain | Ragas |
|---|---|---|---|---|
| Meta-evaluation | ✅ Unique | ❌ | ❌ | ❌ |
| Statistical rigor | ✅ Best | Basic | Basic | Good |
| Multimodal support | ✅ Comprehensive | Limited | Limited | Limited |
| Adaptive selection | ✅ Unique | ❌ | ❌ | ❌ |
| Natural language interface | ✅ Full | ❌ | ❌ | ❌ |
| Production ready | ✅ Complete | Good | Basic | Good |
📚 Documentation
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built for the AI evaluation community
- Inspired by advances in LLM evaluation research
- Designed for both researchers and practitioners
📞 Support
EvalX: Making AI evaluation comprehensive, reliable, and accessible.
File details
Details for the file evalx-0.1.0.tar.gz.
File metadata
- Download URL: evalx-0.1.0.tar.gz
- Upload date:
- Size: 61.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ddff980e78c842ab2756a3f5d742b8cf45ef4a17b1d236723bfa1c8b5acc51ea |
| MD5 | 0cad71d71d39f1565c15eeb3e93fd697 |
| BLAKE2b-256 | 87826d71ac0bfe49cea8c065e79f7ab9de8cfa673d9dc18af5361ab0d1db3764 |
File details
Details for the file evalx-0.1.0-py3-none-any.whl.
File metadata
- Download URL: evalx-0.1.0-py3-none-any.whl
- Upload date:
- Size: 53.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 975877945db63205b77be84cf05ba08f49a70c7e5baa424dd30f3b5986bc4a0c |
| MD5 | 49244520443c9b4f152742c12a1bdb83 |
| BLAKE2b-256 | 4a0e4f77ab427973123e724054deeb7fed97d55efd71a9f6a27ae69394c1565a |