Next-generation evaluation framework for LLM applications with research-grade validation and production-ready performance

EvalX: Next-Generation LLM Evaluation Framework

EvalX is a comprehensive evaluation framework for Large Language Model applications that combines traditional metrics, LLM-as-judge evaluations, and intelligent agentic orchestration with research-grade validation.

🚀 Key Features

🤖 Agentic Orchestration: Natural language instructions → automatic evaluation planning
📊 Comprehensive Metrics: Traditional + LLM-as-judge + hybrid approaches
🔬 Research-Grade Validation: Statistical analysis, confidence intervals, meta-evaluation
🎨 Multimodal Support: Vision-language, code, audio evaluation
⚡ Production Ready: Async processing, caching, CLI interface
🎯 Adaptive Selection: AI-powered optimal metric selection

🏗️ Unique Innovations

Meta-Evaluation System

EvalX includes a meta-evaluation system, described by the project as the industry's first, that assesses the quality of the evaluation metrics themselves:

Reliability assessment through test-retest analysis
Validity measurement against ground truth
Bias detection across demographic groups
Interpretability scoring

Adaptive Metric Selection

Automatically selects optimal metrics based on:

Task type and domain
Quality requirements (research vs. production)
Computational constraints
Fairness requirements
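
The README does not say how selection is implemented. As a rough mental model only, the sketch below filters a small hypothetical catalogue with plain rules over the four criteria listed above. The `Requirements` class, the `CATALOGUE` entries, and `select_metrics` are all assumptions for illustration, and a rule-based filter stands in for whatever AI-powered selection EvalX actually uses.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    task: str              # e.g. "qa", "summarization", "translation"
    research_grade: bool   # research vs. production quality bar
    max_cost: str          # compute budget: "low" or "high"
    needs_fairness: bool   # whether fairness checks are required


# Hypothetical catalogue: metric name -> (tasks it suits, relative cost).
CATALOGUE = {
    "exact_match": ({"qa"}, "low"),
    "rouge": ({"summarization"}, "low"),
    "bleu": ({"translation", "summarization"}, "low"),
    "bertscore": ({"qa", "summarization", "translation"}, "high"),
    "llm_judge_accuracy": ({"qa"}, "high"),
}


def select_metrics(req):
    """Return metric names that fit the task and the compute budget."""
    chosen = []
    for name, (tasks, cost) in CATALOGUE.items():
        if req.task not in tasks:
            continue
        if cost == "high" and req.max_cost == "low":
            continue
        chosen.append(name)
    # Extra analysis steps triggered by quality and fairness requirements.
    if req.research_grade:
        chosen.append("bootstrap_confidence_interval")
    if req.needs_fairness:
        chosen.append("group_bias_check")
    return chosen


print(select_metrics(Requirements("qa", False, "low", False)))
print(select_metrics(Requirements("summarization", True, "high", True)))
```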

📦 Installation

pip install evalx

For development:

pip install "evalx[dev]"

For research features:

pip install "evalx[research]"

For production deployment:

pip install "evalx[production]"

🎯 Quick Start

Natural Language Evaluation

import asyncio

import evalx

# Create evaluation suite from natural language instruction
suite = evalx.EvaluationSuite.from_instruction(
    "Evaluate my chatbot responses for helpfulness and accuracy"
)

# Your data
data = [
    {
        "input": "What's the capital of France?",
        "output": "The capital of France is Paris.",
        "reference": "Paris is the capital city of France."
    }
]


async def main():
    # Run evaluation; `await` is only valid inside a coroutine
    results = await suite.evaluate_async(data)
    print(results.summary())


asyncio.run(main())

Fine-Grained Control

from evalx import MetricSuite

# Create custom metric combination
suite = MetricSuite()
suite.add_traditional_metric("bleu_score")
suite.add_traditional_metric("semantic_similarity", threshold=0.8)
suite.add_llm_judge("accuracy", model="gpt-4")

# `data` is the list of records defined in the previous example
results = suite.evaluate(data)

Research-Grade Analysis

import asyncio

from evalx import ResearchSuite

# Comprehensive statistical analysis
suite = ResearchSuite(
    metrics=["accuracy", "helpfulness", "bleu"],
    confidence_level=0.95,
    bootstrap_samples=1000
)


async def main():
    # `data` is the list of records defined in the first example
    results = await suite.evaluate_research_grade(data)
    print(f"Mean ± Std: {results.mean:.3f} ± {results.std:.3f}")
    print(f"95% CI: [{results.confidence_interval[0]:.3f}, {results.confidence_interval[1]:.3f}]")


asyncio.run(main())
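
For readers unfamiliar with what `confidence_level` and `bootstrap_samples` typically control, here is a minimal sketch of a percentile bootstrap confidence interval for a mean score. It assumes the standard percentile method; the README does not state which bootstrap variant EvalX uses, and `bootstrap_ci` and the sample scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, confidence_level=0.95, bootstrap_samples=1000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(bootstrap_samples)
    )
    alpha = 1.0 - confidence_level
    lo_idx = int((alpha / 2) * bootstrap_samples)
    hi_idx = min(bootstrap_samples - 1, int((1 - alpha / 2) * bootstrap_samples))
    return means[lo_idx], means[hi_idx]


# Illustrative per-example scores from some metric.
scores = [0.72, 0.81, 0.65, 0.90, 0.77, 0.84, 0.69, 0.88]
low, high = bootstrap_ci(scores)
print(f"Mean: {mean(scores):.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

More resamples give a smoother estimate of the interval endpoints at the cost of extra computation.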

🎨 Multimodal Evaluation

from evalx.metrics.multimodal import MultimodalInput, ImageCaptionQualityMetric

# Image captioning evaluation
input_data = MultimodalInput(
    input_text="Describe this image",
    output_text="A beautiful sunset over the ocean",
    image="path/to/image.jpg"
)

metric = ImageCaptionQualityMetric()
result = metric.evaluate(input_data)

🔬 Meta-Evaluation

from evalx.meta_evaluation import MetaEvaluator

# Evaluate your metrics' quality.
# `my_metric`, `test_data`, and `human_ratings` are placeholders for your own
# metric object, evaluation records, and human-provided ground-truth ratings.
meta_evaluator = MetaEvaluator()
quality_report = meta_evaluator.evaluate_metric_quality(
    metric=my_metric,
    evaluation_data=test_data,
    ground_truth=human_ratings
)

print(f"Metric Quality: {quality_report.overall_quality:.3f}")
print(f"Reliability: {quality_report.reliability:.3f}")
print(f"Validity: {quality_report.validity:.3f}")
print(f"Bias Score: {quality_report.bias:.3f}")

🖥️ Command Line Interface

# Evaluate using natural language
evalx evaluate "Check my chatbot for helpfulness" --data data.json

# Research-grade evaluation
evalx research --data data.json --metrics accuracy helpfulness --confidence 0.95

# List available metrics
evalx metrics --list
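
The README does not specify the schema of `data.json`. The sketch below assumes the CLI accepts the same list of `input` / `output` / `reference` records used in the Quick Start, and simply writes them out with the standard library; treat the file layout as an assumption until confirmed against the documentation.

```python
import json
from pathlib import Path

# Assumed layout: the same records shown in the Quick Start example.
records = [
    {
        "input": "What's the capital of France?",
        "output": "The capital of France is Paris.",
        "reference": "Paris is the capital city of France.",
    }
]

# Write the records as a JSON array to data.json in the current directory.
Path("data.json").write_text(json.dumps(records, indent=2), encoding="utf-8")
```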

📊 Supported Metrics

Traditional Metrics

BLEU: N-gram overlap with smoothing
ROUGE: Recall-oriented evaluation (ROUGE-1, ROUGE-2, ROUGE-L)
METEOR: Semantic matching with synonyms and stemming
BERTScore: Contextual embedding similarity
Semantic Similarity: Sentence transformer-based
Exact Match: String matching with normalization
Levenshtein: Edit distance at word or character level
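
To illustrate the last two entries, here is a self-contained sketch of exact match with normalization and of Levenshtein distance at both character and word level. The normalization rules (lowercasing, dropping ASCII punctuation, collapsing whitespace) are an assumption; EvalX's own normalization may differ.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(output, reference):
    """True when the two strings are identical after normalization."""
    return normalize(output) == normalize(reference)


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


print(exact_match("Paris.", "  paris"))   # True
print(levenshtein("kitten", "sitting"))   # 3 (character level)
print(levenshtein("the cat sat".split(), "the dog sat down".split()))  # 2 (word level)
```

Passing strings compares characters, while passing lists of tokens compares words, so one function covers both levels.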

LLM-as-Judge Metrics

Accuracy: Factual correctness assessment
Helpfulness: Response utility evaluation
Coherence: Logical consistency measurement
Groundedness: Source attribution verification
Relevance: Query-response alignment
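
As a general illustration of the LLM-as-judge pattern, the sketch below builds a rating prompt, sends it to a model, and parses a numeric score from the reply. The prompt wording, the 1 to 5 scale, and the `judge` function are assumptions rather than EvalX internals, and `fake_model` is a stand-in so the example runs without any API key.

```python
import re
from typing import Callable

PROMPT_TEMPLATE = (
    "Rate the response for {criterion} on a scale of 1 to 5.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Reply in the form 'Score: <number>'."
)


def judge(criterion: str, question: str, response: str,
          call_model: Callable[[str], str]) -> float:
    """Ask a model to rate a response and return a score in [0, 1]."""
    prompt = PROMPT_TEMPLATE.format(
        criterion=criterion, question=question, response=response
    )
    reply = call_model(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return (int(match.group(1)) - 1) / 4  # map 1..5 onto 0.0..1.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always answers with a fixed score."""
    return "Score: 4"


score = judge(
    "helpfulness",
    "What's the capital of France?",
    "The capital of France is Paris.",
    fake_model,
)
print(score)  # 0.75
```

Passing the model call in as a function keeps the parsing logic testable and lets a real client be swapped in later.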

Multimodal Metrics

Image-Text Alignment: CLIP-based similarity
Image Caption Quality: Comprehensive captioning assessment
Code Correctness: Syntax, execution, security analysis
Audio Quality: Signal processing metrics
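
For the Code Correctness entry, a syntax check and a very basic security scan can be sketched with Python's standard `ast` module. This is an illustration under the assumption that the code under evaluation is Python; `check_syntax` and `find_risky_calls` are hypothetical helpers, and EvalX's actual analysis is not documented here.

```python
import ast


def check_syntax(source):
    """Return (True, None) if `source` parses, else (False, error message)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, None


def find_risky_calls(source, risky=("eval", "exec")):
    """List direct calls to built-ins that are commonly flagged as unsafe."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in risky
        ):
            found.append(node.func.id)
    return found


print(check_syntax("x = 1 + 2"))
print(check_syntax("def broken(:"))
print(find_risky_calls("result = eval('1 + 1')"))
```

Execution checks, such as running generated code against test cases, need sandboxing and are out of scope for this sketch.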

🏆 Why EvalX?

Feature	EvalX	DeepEval	LangChain	Ragas
Meta-evaluation	✅ Unique	❌	❌	❌
Statistical rigor	✅ Best	Basic	Basic	Good
Multimodal support	✅ Comprehensive	Limited	Limited	Limited
Adaptive selection	✅ Unique	❌	❌	❌
Natural language interface	✅ Full	❌	❌	❌
Production ready	✅ Complete	Good	Basic	Good

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built for the AI evaluation community
Inspired by advances in LLM evaluation research
Designed for both researchers and practitioners

📞 Support

EvalX: Making AI evaluation comprehensive, reliable, and accessible.

This version

0.1.0

Jul 6, 2025

Download files

Source Distribution

evalx-0.1.0.tar.gz (61.1 kB)

Uploaded Jul 6, 2025 Source

Built Distribution

evalx-0.1.0-py3-none-any.whl (53.9 kB)

Uploaded Jul 6, 2025 Python 3

File details

Details for the file evalx-0.1.0.tar.gz.

File metadata

Download URL: evalx-0.1.0.tar.gz
Upload date: Jul 6, 2025
Size: 61.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for evalx-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ddff980e78c842ab2756a3f5d742b8cf45ef4a17b1d236723bfa1c8b5acc51ea`
MD5	`0cad71d71d39f1565c15eeb3e93fd697`
BLAKE2b-256	`87826d71ac0bfe49cea8c065e79f7ab9de8cfa673d9dc18af5361ab0d1db3764`

File details

Details for the file evalx-0.1.0-py3-none-any.whl.

File metadata

Download URL: evalx-0.1.0-py3-none-any.whl
Upload date: Jul 6, 2025
Size: 53.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.7

File hashes

Hashes for evalx-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`975877945db63205b77be84cf05ba08f49a70c7e5baa424dd30f3b5986bc4a0c`
MD5	`49244520443c9b4f152742c12a1bdb83`
BLAKE2b-256	`4a0e4f77ab427973123e724054deeb7fed97d55efd71a9f6a27ae69394c1565a`
