Monitoring, Evaluation, Reporting, Inspection, Testing framework for AI systems
MERIT: Monitoring, Evaluation, Reporting, Inspection, Testing
A framework for evaluating, monitoring, and testing AI systems, particularly those powered by Large Language Models (LLMs). MERIT provides tools for performance monitoring, evaluation metrics, RAG system testing, and reporting.
Features
Monitoring & Observability
- Real-time LLM monitoring with customizable metrics
- Performance tracking (latency, throughput, error rates)
- Cost monitoring and estimation
- Usage analytics and token volume tracking
- Multi-backend storage (SQLite, MongoDB, file-based)
- Live dashboard with interactive metrics
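Cost estimation along these lines typically multiplies token counts by per-1K-token prices. A minimal sketch of the arithmetic (the price table and function name here are placeholder assumptions for illustration, not MERIT's built-in API):

```python
# Placeholder prices per 1K tokens (assumptions for illustration,
# not MERIT's pricing data -- real prices change over time).
PRICE_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a request's cost in USD from its token counts."""
    price = PRICE_PER_1K[model]
    return (input_tokens / 1000) * price["input"] \
        + (output_tokens / 1000) * price["output"]
```

Under these placeholder prices, a call with 1,000 input and 1,000 output tokens would come to about $0.002.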
Evaluation & Testing
- RAG system evaluation with comprehensive metrics
- LLM performance testing with custom test sets
- Automated evaluation using LLM-based evaluators
- Test set generation for systematic testing
- Multi-model evaluation support
Metrics & Analytics
- Correctness, Faithfulness, Relevance for RAG systems
- Coherence and Fluency metrics
- Context Precision evaluation
- Custom metric development framework
- Performance benchmarking
Integration & APIs
- Simple 3-line integration for existing applications
- REST API for remote monitoring
- CLI tools for configuration and execution
- Multiple AI provider support (OpenAI, Google, custom)
Installation
Basic Installation

```bash
pip install merit-ai
```

Full Installation with All Dependencies

```bash
pip install "merit-ai[all]"
```

Development Installation

```bash
git clone https://github.com/your-username/merit.git
cd merit
pip install -e ".[dev]"
```
Quick Start
1. Simple Integration (3 Lines!)
```python
from merit.monitoring.service import MonitoringService

# Initialize monitoring
monitor = MonitoringService()

# Log an interaction
monitor.log_simple_interaction({
    'user_message': 'Hello, how are you?',
    'llm_response': 'I am doing well, thank you!',
    'latency': 0.5,
    'model': 'gpt-3.5-turbo'
})
```
2. RAG System Evaluation
```python
from merit.evaluation.evaluators.rag import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate a RAG response
results = evaluator.evaluate(
    query="What is machine learning?",
    response="Machine learning is a subset of AI...",
    context=["Document 1 content...", "Document 2 content..."]
)

print(f"Relevance: {results['relevance']}")
print(f"Faithfulness: {results['faithfulness']}")
```
3. CLI Usage
```bash
# Start evaluation with a config file
merit start --config my_config.py

# Monitor your application
merit monitor --config monitoring_config.py
```
Examples
Basic Chat Application Integration
```python
from datetime import datetime

from merit.monitoring.service import MonitoringService


class ChatApp:
    def __init__(self):
        # Initialize MERIT monitoring
        self.monitor = MonitoringService()

    def process_message(self, user_message: str) -> str:
        start_time = datetime.now()

        # Your existing chat logic here
        response = self.llm_client.chat(user_message)

        end_time = datetime.now()

        # Log the interaction with MERIT
        self.monitor.log_simple_interaction({
            'user_message': user_message,
            'llm_response': response,
            'latency': (end_time - start_time).total_seconds(),
            'model': 'gpt-3.5-turbo',
            'timestamp': end_time.isoformat()
        })
        return response
```
Advanced RAG System with MERIT
```python
from merit.evaluation.evaluators.rag import RAGEvaluator
from merit.monitoring.service import MonitoringService


class RAGSystem:
    def __init__(self):
        self.evaluator = RAGEvaluator()
        self.monitor = MonitoringService()

    def query(self, user_question: str):
        # Retrieve relevant documents
        documents = self.retriever.search(user_question)

        # Generate a response
        response = self.llm.generate(user_question, documents)

        # Evaluate with MERIT
        evaluation = self.evaluator.evaluate(
            query=user_question,
            response=response,
            context=[doc.content for doc in documents]
        )

        # Monitor performance
        self.monitor.log_simple_interaction({
            'query': user_question,
            'response': response,
            'evaluation_scores': evaluation,
            'num_documents': len(documents)
        })
        return response, evaluation
```
Project Structure
```
merit/
├── api/                  # API clients (OpenAI, Google, etc.)
├── core/                 # Core models and utilities
├── evaluation/           # Evaluation framework
│   ├── evaluators/       # LLM and RAG evaluators
│   └── templates/        # Evaluation templates
├── knowledge/            # Knowledge base management
├── metrics/              # Metrics framework
│   ├── rag.py            # RAG-specific metrics
│   ├── llm_measured.py   # LLM-based metrics
│   └── monitoring.py     # Monitoring metrics
├── monitoring/           # Monitoring service
│   └── collectors/       # Data collectors
├── storage/              # Storage backends
├── templates/            # Dashboard and report templates
└── testset_generation/   # Test set generation tools
```
Available Metrics
RAG Metrics
- Correctness: Accuracy of generated responses
- Faithfulness: Adherence to source documents
- Relevance: Response relevance to query
- Coherence: Logical flow and consistency
- Fluency: Natural language quality
- Context Precision: Quality of retrieved context
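MERIT scores these metrics with LLM-based evaluators, but the intuition behind a metric like Faithfulness can be sketched with a simple token-overlap proxy (illustrative only; this is not MERIT's implementation):

```python
def overlap_faithfulness(response: str, context: list[str]) -> float:
    """Toy faithfulness proxy: the fraction of response tokens that
    also appear somewhere in the retrieved context.

    Illustrative only -- MERIT's real metrics use LLM-based judges,
    not token overlap.
    """
    response_tokens = set(response.lower().split())
    context_tokens = set(" ".join(context).lower().split())
    if not response_tokens:
        return 0.0
    supported = response_tokens & context_tokens
    return len(supported) / len(response_tokens)
```

A response whose every token is grounded in the context scores 1.0; fabricated content drags the score toward 0.0. An LLM-based judge captures paraphrase and entailment that this crude overlap misses.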
Monitoring Metrics
- Latency: Response time tracking
- Throughput: Requests per second
- Error Rate: Failure percentage
- Cost: Token usage and cost estimation
- Usage: Model and feature usage patterns
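The monitoring metrics above can all be derived from the interaction records that `log_simple_interaction` receives. A framework-agnostic sketch of that aggregation (field names like `latency` and `error` are assumptions; MERIT's `MonitoringService` computes these internally):

```python
import statistics


def summarize_interactions(interactions: list[dict],
                           window_seconds: float) -> dict:
    """Aggregate latency, throughput, and error-rate figures from a
    batch of logged interactions (illustrative sketch)."""
    latencies = sorted(i["latency"] for i in interactions)
    errors = sum(1 for i in interactions if i.get("error"))
    return {
        "latency_p50": statistics.median(latencies),
        "latency_max": latencies[-1],
        "throughput_rps": len(interactions) / window_seconds,
        "error_rate": errors / len(interactions),
    }
```

Running this over a sliding time window gives the figures a live dashboard would plot.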
Configuration
Basic Configuration File
```python
# merit_config.py
from merit.config.models import MeritMainConfig

config = MeritMainConfig(
    evaluation={
        "evaluator": "rag",
        "metrics": ["relevance", "faithfulness", "correctness"]
    },
    monitoring={
        "storage_type": "sqlite",
        "collection_interval": 60,
        "retention_days": 30
    }
)
```
Advanced Configuration
```python
# advanced_config.py
from merit.config.models import MeritMainConfig

config = MeritMainConfig(
    evaluation={
        "evaluator": "rag",
        "metrics": ["relevance", "faithfulness", "correctness"],
        "test_set": {
            "path": "test_questions.json",
            "size": 100
        }
    },
    monitoring={
        "storage_type": "mongodb",
        "storage_config": {
            "uri": "mongodb://localhost:27017",
            "database": "merit_metrics"
        },
        "metrics": ["latency", "cost", "error_rate"],
        "collection_interval": 30,
        "retention_days": 90
    },
    knowledge_base={
        "type": "vector_store",
        "path": "./knowledge_base"
    }
)
```
Use Cases
1. Production LLM Monitoring
Monitor your deployed LLM applications in real-time with performance metrics, cost tracking, and error monitoring.
2. RAG System Development
Evaluate and improve your RAG systems with comprehensive metrics and automated testing.
3. Model Comparison
Compare different models and configurations using standardized evaluation metrics.
4. Quality Assurance
Implement automated testing for LLM applications with custom test sets and evaluation criteria.
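For use case 3, the comparison step reduces to averaging per-example scores by model. A framework-agnostic sketch of that aggregation (in practice, each `scores` dict would come from a MERIT evaluator run):

```python
from collections import defaultdict
from statistics import fmean


def compare_models(results: list[dict]) -> dict:
    """Average each metric per model from rows shaped like
    {"model": "gpt-3.5-turbo", "scores": {"relevance": 0.9, ...}}.
    Illustrative sketch, not a MERIT API.
    """
    by_model = defaultdict(lambda: defaultdict(list))
    for row in results:
        for metric, value in row["scores"].items():
            by_model[row["model"]][metric].append(value)
    return {
        model: {metric: fmean(vals) for metric, vals in metrics.items()}
        for model, metrics in by_model.items()
    }
```

Feeding it one row per (model, test question) pair yields a compact leaderboard-style summary across the standardized metrics.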
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Development Setup
```bash
git clone https://github.com/your-username/merit.git
cd merit
pip install -e ".[dev]"
pytest tests/
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with modern Python practices and Pydantic for type safety
- Inspired by the need for comprehensive AI system evaluation
- Designed for simplicity and ease of integration
Support
- Issues: GitHub Issues
- Documentation: Full Documentation
- Discussions: GitHub Discussions
MERIT: Making AI systems more reliable, one evaluation at a time.
File details
Details for the file merit_ai-0.1.16.tar.gz.
File metadata
- Download URL: merit_ai-0.1.16.tar.gz
- Upload date:
- Size: 152.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `5d62ed059ab7e9d9fa000ef3acca1a4e38a019f044fce9bcaff0d5ba555dd487` |
| MD5 | `8d7420fc2775c8b3d0dbd4b366ce7b90` |
| BLAKE2b-256 | `dab50d485e0b0f83abf931ff096021ff2ebc9589aed10ea79acc1b67208fd258` |
File details
Details for the file merit_ai-0.1.16-py3-none-any.whl.
File metadata
- Download URL: merit_ai-0.1.16-py3-none-any.whl
- Upload date:
- Size: 176.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bedd592a30220ff97e1085bd70d6c31b300dd228db3bcb495add54b8d704eae5` |
| MD5 | `67dda8952193a6be54ad5852a54d1de3` |
| BLAKE2b-256 | `652c8f3456d287ed6e8b3275b405e4eac20b80a804aeaf03e9d129490c3a5496` |