Merit

AI Testing Framework

License: MIT | Python 3.12+

Merit is a Python testing framework for AI projects. It follows pytest syntax and culture while introducing components essential for testing AI software: metrics, typed datasets, semantic predicates (LLM-as-a-Judge), and OTEL traces.


Installation

uv add appmerit
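
Merit is published on PyPI as appmerit, so if you are not using uv, a plain pip install should work as well:

pip install appmerit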

Merit 101

Follow pytest habits...

  • Create 'merit_*.py' files
  • Write 'def merit_*' functions
  • Use 'merit.resource' instead of 'pytest.fixture'
  • Add 'assert' expressions within the functions
  • Run 'uv run merit test'

...while leveraging Merit APIs (see the minimal sketch after this list).

  • Use 'with metrics()' context to turn failed assertions into quality metrics
  • Use 'has_facts()' and other semantic predicates for asserting natural language
  • Access OTEL span data and assert it with 'follows_policy()' predicate
  • Parse datasets into clearly typed and validated data objects
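
Putting the two lists together, a minimal merit_*.py file could look like the sketch below. Treat it as an illustration rather than an excerpt from the docs: the exact merit.resource usage and the has_facts() signature are assumptions, modeled on pytest fixtures and on the fuller example in the next section.

# merit_greeting.py -- hypothetical file name; assumes resources are injected
# by parameter name (like pytest fixtures) and has_facts(response, reference).
import merit
from merit.predicates import has_facts

@merit.resource
def chatbot():
    # Stand-in for your real model client; returns a canned reply here.
    def ask(prompt: str) -> str:
        return "We are open 9 AM - 6 PM, Monday-Saturday. Closed Sundays."
    return ask

async def merit_chatbot_states_opening_hours(chatbot):
    response = chatbot("When are you open?")
    # Semantic predicate: assert the reply states the expected facts.
    assert await has_facts(response, "Store hours: 9 AM - 6 PM, Monday-Saturday.")

Run it with 'uv run merit test', exactly as in the pytest-style workflow above.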

Example

import merit
from merit import Case, Metric, metrics
from merit.predicates import has_unsupported_facts

from pydantic import BaseModel

# System under test (SUT): the chatbot being evaluated.
@merit.sut
def store_chatbot(prompt: str) -> str:
    return call_llm(prompt)  # call_llm stands in for your model client

# Metric resource: assertions wrapped in `with metrics([accuracy])` report into it.
@merit.metric
def accuracy():
    metric = Metric()
    yield metric

    assert metric.mean > 0.8
    yield metric.mean

class Refs(BaseModel):
    kb: str
    expected_tool: str | None = None

cases = [
    Case(sut_input_values={"prompt": "When are you open?"}, references=Refs(kb="Store hours: 9 AM - 6 PM, Monday-Saturday. Closed Sundays.")),
    Case(sut_input_values={"prompt": "Return policy?"}, references=Refs(kb="30-day returns with receipt.")),
    Case(sut_input_values={"prompt": "How much for the Nike Air Max?"}, references=Refs(kb="Nike Air Max: $129.99", expected_tool="offer_product")),
]

@merit.iter_cases(cases)
@merit.repeat(3)
async def merit_chatbot_no_hallucinations(
    case: Case[Refs], 
    store_chatbot, 
    accuracy: Metric, 
    trace_context):
    """AI agent relies on knowledge base and tool calls for transactional questions"""
    response = store_chatbot(**case.sut_input_values)
    
    # Verify the answer doesn't contain any unsupported facts
    with metrics([accuracy]):
        assert not await has_unsupported_facts(response, case.references.kb)
    
    # Verify tool was called when expected
    if expected_tool := case.references.expected_tool:
        spans = trace_context.get_sut_spans(store_chatbot)
        tool_called = spans[1].attributes.get("llm.request.functions.0.name")

        assert tool_called == expected_tool

Run it:

uv run merit test

Output:

Merit Test Runner
=================

Collected 1 test

merit_example.py::merit_chatbot_no_hallucinations ✓

==================== 1 passed in 0.08s ====================

Documentation

Full documentation is available at docs.appmerit.com and covers Getting Started, Usage, Concepts, and the API Reference.

Contributing

We welcome contributions! To get started:

  1. Fork the repository
  2. Clone your fork: git clone https://github.com/YOUR_USERNAME/merit.git
  3. Create a branch: git checkout -b your-feature-name
  4. Install dependencies: uv sync
  5. Make your changes
  6. Run tests: uv run merit test
  7. Run lints: uv run ruff check .
  8. Submit a pull request

For more details, see CONTRIBUTING.md.

Development Setup:

# Clone the repository
git clone https://github.com/appMerit/merit.git
cd merit

# Install dependencies
uv sync

# Run tests
uv run merit test

# Run lints
uv run ruff check .
uv run mypy .

License

This project is licensed under the MIT License - see the LICENSE file for details.


