

MERIT: Monitoring, Evaluation, Reporting, Inspection, Testing

Python 3.8+ · License: MIT

A comprehensive framework for evaluating, monitoring, and testing AI systems, particularly those powered by Large Language Models (LLMs). MERIT provides tools for performance monitoring, evaluation metrics, RAG system testing, and reporting.

🚀 Features

📊 Monitoring & Observability

  • Real-time LLM monitoring with customizable metrics
  • Performance tracking (latency, throughput, error rates)
  • Cost monitoring and estimation
  • Usage analytics and token volume tracking
  • Multi-backend storage (SQLite, MongoDB, file-based)
  • Live dashboard with interactive metrics
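Cost estimation is typically done by multiplying token counts by per-token prices. A minimal sketch, assuming an illustrative price table (the numbers below are examples, not current provider pricing, and the function is not part of MERIT's API):

```python
# Rough cost estimator: multiply token counts by per-1K-token prices.
# The price table is illustrative only; check your provider's pricing page.
PRICES_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return an estimated USD cost for one request."""
    p = PRICES_PER_1K[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

# Example: 1,000 input tokens and 2,000 output tokens
print(round(estimate_cost("gpt-3.5-turbo", 1000, 2000), 4))  # 0.0035
```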

🧪 Evaluation & Testing

  • RAG system evaluation with comprehensive metrics
  • LLM performance testing with custom test sets
  • Automated evaluation using LLM-based evaluators
  • Test set generation for systematic testing
  • Multi-model evaluation support

📈 Metrics & Analytics

  • Correctness, Faithfulness, Relevance for RAG systems
  • Coherence and Fluency metrics
  • Context Precision evaluation
  • Custom metric development framework
  • Performance benchmarking
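Conceptually, a custom metric boils down to a callable that scores an interaction. The sketch below is a framework-agnostic exact-match metric; the registration interface MERIT's custom metric framework expects is not shown here, so treat this as illustrative only:

```python
# A metric as a plain callable scoring (response, reference) into [0, 1].
# How such a callable plugs into MERIT's custom metric framework is not
# shown here; this is a standalone illustration.
def exact_match(response: str, reference: str) -> float:
    """Return 1.0 if the whitespace/case-normalized strings match, else 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(response) == norm(reference) else 0.0

print(exact_match("Paris", "  paris "))  # 1.0
print(exact_match("Paris", "London"))    # 0.0
```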

🔧 Integration & APIs

  • Simple 3-line integration for existing applications
  • REST API for remote monitoring
  • CLI tools for configuration and execution
  • Multiple AI provider support (OpenAI, Google, custom)

📦 Installation

Basic Installation

pip install merit-ai

Full Installation with All Dependencies

pip install merit-ai[all]

Development Installation

git clone https://github.com/your-username/merit.git
cd merit
pip install -e .[dev]

🚀 Quick Start

1. Simple Integration (3 Lines!)

from merit.monitoring.service import MonitoringService

# Initialize monitoring
monitor = MonitoringService()

# Log an interaction
monitor.log_simple_interaction({
    'user_message': 'Hello, how are you?',
    'llm_response': 'I am doing well, thank you!',
    'latency': 0.5,
    'model': 'gpt-3.5-turbo'
})

2. RAG System Evaluation

from merit.evaluation.evaluators.rag import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate RAG response
results = evaluator.evaluate(
    query="What is machine learning?",
    response="Machine learning is a subset of AI...",
    context=["Document 1 content...", "Document 2 content..."]
)

print(f"Relevance: {results['relevance']}")
print(f"Faithfulness: {results['faithfulness']}")

3. CLI Usage

# Start evaluation with config file
merit start --config my_config.py

# Monitor your application
merit monitor --config monitoring_config.py

📚 Examples

Basic Chat Application Integration

from merit.monitoring.service import MonitoringService
from datetime import datetime

class ChatApp:
    def __init__(self):
        # Initialize MERIT monitoring
        self.monitor = MonitoringService()
    
    def process_message(self, user_message: str) -> str:
        start_time = datetime.now()
        
        # Your existing chat logic here
        response = self.llm_client.chat(user_message)
        
        end_time = datetime.now()
        
        # Log interaction with MERIT
        self.monitor.log_simple_interaction({
            'user_message': user_message,
            'llm_response': response,
            'latency': (end_time - start_time).total_seconds(),
            'model': 'gpt-3.5-turbo',
            'timestamp': end_time.isoformat()
        })
        
        return response

Advanced RAG System with MERIT

from merit.evaluation.evaluators.rag import RAGEvaluator
from merit.monitoring.service import MonitoringService

class RAGSystem:
    def __init__(self):
        self.evaluator = RAGEvaluator()
        self.monitor = MonitoringService()
    
    def query(self, user_question: str):
        # Retrieve relevant documents
        documents = self.retriever.search(user_question)
        
        # Generate response
        response = self.llm.generate(user_question, documents)
        
        # Evaluate with MERIT
        evaluation = self.evaluator.evaluate(
            query=user_question,
            response=response,
            context=[doc.content for doc in documents]
        )
        
        # Monitor performance
        self.monitor.log_simple_interaction({
            'query': user_question,
            'response': response,
            'evaluation_scores': evaluation,
            'num_documents': len(documents)
        })
        
        return response, evaluation

๐Ÿ—๏ธ Project Structure

merit/
├── api/                    # API clients (OpenAI, Google, etc.)
├── core/                   # Core models and utilities
├── evaluation/             # Evaluation framework
│   ├── evaluators/         # LLM and RAG evaluators
│   └── templates/          # Evaluation templates
├── knowledge/              # Knowledge base management
├── metrics/                # Metrics framework
│   ├── rag.py              # RAG-specific metrics
│   ├── llm_measured.py     # LLM-based metrics
│   └── monitoring.py       # Monitoring metrics
├── monitoring/             # Monitoring service
│   └── collectors/         # Data collectors
├── storage/                # Storage backends
├── templates/              # Dashboard and report templates
└── testset_generation/     # Test set generation tools

📊 Available Metrics

RAG Metrics

  • Correctness: Accuracy of generated responses
  • Faithfulness: Adherence to source documents
  • Relevance: Response relevance to query
  • Coherence: Logical flow and consistency
  • Fluency: Natural language quality
  • Context Precision: Quality of retrieved context
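For intuition about what these metrics measure, the sketch below approximates two of them with simple token overlap. MERIT's evaluators use LLM-based scoring, so these lexical versions are deliberately simplified stand-ins, not the library's implementation:

```python
# Simplified, lexical stand-ins for two RAG metrics.
def _tokens(text: str) -> set:
    return set(text.lower().split())

def relevance(query: str, response: str) -> float:
    """Fraction of query tokens that reappear in the response."""
    q = _tokens(query)
    return len(q & _tokens(response)) / len(q) if q else 0.0

def faithfulness(response: str, context: list) -> float:
    """Fraction of response tokens found somewhere in the retrieved context."""
    r = _tokens(response)
    c = set().union(*(_tokens(doc) for doc in context)) if context else set()
    return len(r & c) / len(r) if r else 0.0

print(relevance("what is machine learning", "machine learning is a field"))  # 0.75
```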

Monitoring Metrics

  • Latency: Response time tracking
  • Throughput: Requests per second
  • Error Rate: Failure percentage
  • Cost: Token usage and cost estimation
  • Usage: Model and feature usage patterns
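These aggregates can be derived from the same records passed to `log_simple_interaction`. The sketch below is plain Python over a list of such records, not MERIT's internal implementation (the `error` key is an assumption for the example):

```python
# Aggregate latency, throughput, and error rate from logged interactions.
def summarize(interactions: list, window_seconds: float) -> dict:
    latencies = [i["latency"] for i in interactions if "latency" in i]
    errors = sum(1 for i in interactions if i.get("error"))
    n = len(interactions)
    return {
        "avg_latency": sum(latencies) / len(latencies) if latencies else 0.0,
        "throughput_rps": n / window_seconds,
        "error_rate": errors / n if n else 0.0,
    }

logs = [
    {"latency": 0.5},
    {"latency": 1.5},
    {"latency": 1.0, "error": True},
]
print(summarize(logs, window_seconds=60))
```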

🔧 Configuration

Basic Configuration File

# merit_config.py
from merit.config.models import MeritMainConfig

config = MeritMainConfig(
    evaluation={
        "evaluator": "rag",
        "metrics": ["relevance", "faithfulness", "correctness"]
    },
    monitoring={
        "storage_type": "sqlite",
        "collection_interval": 60,
        "retention_days": 30
    }
)

Advanced Configuration

# advanced_config.py
config = MeritMainConfig(
    evaluation={
        "evaluator": "rag",
        "metrics": ["relevance", "faithfulness", "correctness"],
        "test_set": {
            "path": "test_questions.json",
            "size": 100
        }
    },
    monitoring={
        "storage_type": "mongodb",
        "storage_config": {
            "uri": "mongodb://localhost:27017",
            "database": "merit_metrics"
        },
        "metrics": ["latency", "cost", "error_rate"],
        "collection_interval": 30,
        "retention_days": 90
    },
    knowledge_base={
        "type": "vector_store",
        "path": "./knowledge_base"
    }
)

🎯 Use Cases

1. Production LLM Monitoring

Monitor your deployed LLM applications in real-time with performance metrics, cost tracking, and error monitoring.

2. RAG System Development

Evaluate and improve your RAG systems with comprehensive metrics and automated testing.

3. Model Comparison

Compare different models and configurations using standardized evaluation metrics.

4. Quality Assurance

Implement automated testing for LLM applications with custom test sets and evaluation criteria.
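One common QA pattern is a CI quality gate that fails when evaluation scores drop below agreed thresholds. In the sketch below, the `scores` dict stands in for the result a real `RAGEvaluator.evaluate` run would return; the threshold values are example assumptions:

```python
# CI-style quality gate: report metrics that fall below their threshold.
THRESHOLDS = {"relevance": 0.7, "faithfulness": 0.8}

def check_quality(scores: dict) -> list:
    """Return the names of metrics scoring below their threshold."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

# `scores` stands in for a real evaluation result.
scores = {"relevance": 0.9, "faithfulness": 0.75}
print(check_quality(scores))  # ['faithfulness']
```

In a test suite, `assert not check_quality(scores)` turns a score regression into a failing build.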

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/your-username/merit.git
cd merit
pip install -e .[dev]
pytest tests/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built with modern Python practices and Pydantic for type safety
  • Inspired by the need for comprehensive AI system evaluation
  • Designed for simplicity and ease of integration

📞 Support


MERIT: Making AI systems more reliable, one evaluation at a time. 🚀


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

merit_ai-0.1.16.tar.gz (152.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

merit_ai-0.1.16-py3-none-any.whl (176.8 kB)

Uploaded Python 3

File details

Details for the file merit_ai-0.1.16.tar.gz.

File metadata

  • Download URL: merit_ai-0.1.16.tar.gz
  • Upload date:
  • Size: 152.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for merit_ai-0.1.16.tar.gz:

  • SHA256: 5d62ed059ab7e9d9fa000ef3acca1a4e38a019f044fce9bcaff0d5ba555dd487
  • MD5: 8d7420fc2775c8b3d0dbd4b366ce7b90
  • BLAKE2b-256: dab50d485e0b0f83abf931ff096021ff2ebc9589aed10ea79acc1b67208fd258

See more details on using hashes here.
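To check a downloaded distribution against the digests above, you can compute its SHA256 with the standard library (the filename is the source distribution listed above; uncomment the assertion once the file is present locally):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA256 hex digest of a file, streaming in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "5d62ed059ab7e9d9fa000ef3acca1a4e38a019f044fce9bcaff0d5ba555dd487"
# assert sha256_of("merit_ai-0.1.16.tar.gz") == expected
```

pip can also enforce hashes at install time: add `merit-ai==0.1.16 --hash=sha256:<digest>` to a requirements file and install with `pip install --require-hashes -r requirements.txt`.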

File details

Details for the file merit_ai-0.1.16-py3-none-any.whl.

File metadata

  • Download URL: merit_ai-0.1.16-py3-none-any.whl
  • Upload date:
  • Size: 176.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for merit_ai-0.1.16-py3-none-any.whl:

  • SHA256: bedd592a30220ff97e1085bd70d6c31b300dd228db3bcb495add54b8d704eae5
  • MD5: 67dda8952193a6be54ad5852a54d1de3
  • BLAKE2b-256: 652c8f3456d287ed6e8b3275b405e4eac20b80a804aeaf03e9d129490c3a5496

See more details on using hashes here.
