
Minimal evaluation framework for LLM testing with local and cloud providers


microeval

A lightweight evaluation framework for LLM testing. Supports local models (Ollama) and cloud providers (OpenAI, AWS Bedrock, Groq). Run evaluations via CLI or web UI, compare models and prompts, and track results.

Quick Start

1. Install

Clone the repository:

git clone https://github.com/boscoh/microeval
cd microeval

Install uv if not already installed:

curl -LsSf https://astral.sh/uv/install.sh | sh

Install dependencies:

uv sync

2. Configure AI Service

Ollama (Local Models)

Make sure Ollama is installed, then pull a model and start the server:

ollama pull llama3.2
ollama serve

Models: llama3.2, llama3.1, qwen2.5, mistral, mixtral, etc.

OpenAI

Set OPENAI_API_KEY:

echo "OPENAI_API_KEY=your-api-key-here" > .env

AWS Bedrock

For local development, the easiest approach is to set up an AWS profile via ~/.aws/config and then specify it with AWS_PROFILE:

aws configure --profile your-profile-name
# or manually edit ~/.aws/config

echo "AWS_PROFILE=your-profile-name" >> .env

Alternatively, you can set credentials directly:

echo "AWS_ACCESS_KEY_ID=your-access-key" >> .env
echo "AWS_SECRET_ACCESS_KEY=your-secret-key" >> .env
echo "AWS_DEFAULT_REGION=us-east-1" >> .env

Note: AWS profiles are recommended for local dev since key rotation is handled automatically.

Groq

Set GROQ_API_KEY:

echo "GROQ_API_KEY=your-api-key-here" > .env

3. Run Your First Evaluation

Try the sample evaluation:

uv run microeval ui sample-evals

Open http://localhost:8000 to see the web UI.


Tutorial: Building Your First Evaluation

This tutorial walks you through creating an evaluation from scratch. We'll build a text summarization evaluator.

Step 1: Create Your Evaluation Directory

Each evaluation project lives in its own directory with four subdirectories:

mkdir -p my-evals/{prompts,queries,runs,results}

This creates:

my-evals/
├── prompts/    # System prompts (instructions for the LLM)
├── queries/    # Test cases (input/output pairs)
├── runs/       # Run configurations (which model, prompt, query to use)
└── results/    # Generated results (created automatically)

Step 2: Write a System Prompt

Create a prompt file that tells the LLM how to behave:

cat > my-evals/prompts/summarizer.txt << 'EOF'
You are a helpful assistant that summarizes text concisely.

## Instructions
- Summarize the given text in 2-3 sentences
- Capture the key points and main ideas
- Use clear, simple language

## Output Format
Return only the summary, no preamble or explanation.
EOF

Prompts are .txt files in the prompts/ directory. The filename (without extension) becomes the prompt_ref.

Step 3: Create a Query (Test Case)

Create a query file with input/output pairs for testing:

cat > my-evals/queries/pangram.yaml << 'EOF'
---
input: >-
  The quick brown fox jumps over the lazy dog. This sentence is famous
  because it contains every letter of the English alphabet at least once.
  It has been used for centuries to test typewriters, fonts, and keyboards.
  The phrase was first used in the late 1800s and remains popular today
  for testing purposes.
output: >-
  The sentence "The quick brown fox jumps over the lazy dog" is a pangram
  containing every letter of the alphabet. It has been used since the late
  1800s to test typewriters, fonts, and keyboards.
EOF

Query structure:

  • input - The text sent to the LLM (user message)
  • output - The expected/ideal response (used by evaluators like equivalence)

The filename (without extension) becomes the query_ref.

Step 4: Create a Run Configuration

Create a run configuration that ties everything together:

cat > my-evals/runs/summarize-gpt4o.yaml << 'EOF'
---
query_ref: pangram
prompt_ref: summarizer
service: openai
model: gpt-4o
repeat: 3
temperature: 0.5
evaluators:
- word_count
- coherence
- equivalence
EOF

Run configuration fields:

Field         Description
query_ref     Name of the query file (without .yaml)
prompt_ref    Name of the prompt file (without .txt)
service       LLM provider: openai, bedrock, ollama, or groq
model         Model name (e.g., gpt-4o, llama3.2)
repeat        Number of times to run the evaluation
temperature   Sampling temperature (0.0 = deterministic)
evaluators    List of evaluators to run

Step 5: Run the Evaluation

Option A: Web UI

uv run microeval ui my-evals

Navigate to http://localhost:8000, go to the Runs tab, and click the run button.

Option B: CLI

uv run microeval run my-evals

This runs all configurations in my-evals/runs/.

Step 6: View Results

Results are saved to my-evals/results/ as YAML files:

---
texts:
- "The sentence 'The quick brown fox...' is notable for..."
- "The phrase 'The quick brown fox...' contains every letter..."
- "The quick brown fox jumps over the lazy dog is a famous..."
evaluations:
- name: word_count
  values: [1.0, 1.0, 1.0]
  average: 1.0
  standard_deviation: 0.0
- name: coherence
  values: [0.95, 0.92, 0.98]
  average: 0.95
  standard_deviation: 0.03
- name: equivalence
  values: [0.88, 0.91, 0.85]
  average: 0.88
  standard_deviation: 0.03

Result structure:

  • texts - All generated responses from each run
  • evaluations - Scores from each evaluator with statistics

In the Web UI, use the Graph tab to visualize and compare results across different runs.
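Because results are plain YAML, you can also inspect them programmatically. Below is a minimal sketch (not part of microeval) that loads a result file and prints each evaluator's statistics; the result filename is an assumption, and PyYAML is required:

# Minimal sketch: load a result file and print evaluator statistics.
# The filename below is assumed; adjust to whatever microeval writes
# into my-evals/results/.
import yaml

with open("my-evals/results/summarize-gpt4o.yaml") as f:
    result = yaml.safe_load(f)

for ev in result["evaluations"]:
    print(f"{ev['name']}: avg={ev['average']:.2f} sd={ev['standard_deviation']:.2f}")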


Evaluators

Evaluators score responses on a 0.0-1.0 scale:

Evaluator     Description                       How it Works
coherence     Logical flow and clarity          LLM scores structure and consistency
equivalence   Semantic similarity to expected   LLM compares meaning with query output
word_count    Response length validation        Algorithmic check (no LLM call)

Word Count Configuration

Add these optional fields to your run config:

min_words: 50    # Minimum word count
max_words: 200   # Maximum word count
target_words: 100  # Target word count (scores based on distance)
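As an illustration of how distance-based scoring might behave, here is a hypothetical sketch; the actual formula lives in microeval/evaluator.py and may differ:

# Hypothetical sketch only, not microeval's implementation: score 1.0 at the
# target word count and fall off linearly as the response drifts away from it.
def word_count_score(text: str, target_words: int, max_words: int) -> float:
    n_words = len(text.split())
    distance = abs(n_words - target_words)
    span = max(max_words - target_words, 1)  # guard against division by zero
    return max(0.0, 1.0 - distance / span)

print(word_count_score("a short five word reply", target_words=5, max_words=10))  # -> 1.0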

Creating Custom Evaluators

  1. Create a class in microeval/evaluator.py:
from typing import Any, Dict  # if not already imported in evaluator.py

class MyCustomEvaluator:
    def __init__(self, run_config: RunConfig):
        self.run_config = run_config

    async def evaluate(self, response_text: str) -> Dict[str, Any]:
        # Your evaluation logic here
        score = 1.0  # Calculate your score (0.0 to 1.0)
        return {
            "score": score,                # value reported in results
            "text": "Evaluation details",  # human-readable explanation
            "elapsed_ms": 0,               # evaluation time, if measured
            "token_count": 0,              # tokens used, if an LLM is called
        }
  2. Register in EvaluationRunner.__init__:
self.evaluators = {
    "coherence": CoherenceEvaluator(chat_client, run_config),
    "equivalence": EquivalenceEvaluator(chat_client, run_config),
    "word_count": WordCountEvaluator(run_config),
    "mycustom": MyCustomEvaluator(run_config),  # Add this
}
  3. Update the static method EvaluationRunner.evaluators() to include your evaluator name.

  4. Use in your run config:

evaluators:
- coherence
- mycustom
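For a concrete (if contrived) example, here is a sketch of a custom evaluator following the interface above that checks whether the response contains a required keyword. The class name and keyword are illustrative only, not part of microeval:

from typing import Any, Dict

class KeywordEvaluator:
    """Illustrative only: scores 1.0 if a required keyword appears in the response."""

    def __init__(self, run_config: RunConfig):
        self.run_config = run_config
        self.keyword = "pangram"  # hard-coded here for illustration

    async def evaluate(self, response_text: str) -> Dict[str, Any]:
        found = self.keyword in response_text.lower()
        return {
            "score": 1.0 if found else 0.0,
            "text": f"keyword '{self.keyword}' {'found' if found else 'missing'}",
            "elapsed_ms": 0,
            "token_count": 0,
        }

It would then be registered (e.g. as "keyword") in EvaluationRunner.__init__ and listed under evaluators in the run config.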

Comparing Models and Prompts

A key use case is comparing different models or prompts on the same test cases.

Compare Multiple Models

Create multiple run configs with the same query and prompt but different models:

my-evals/runs/
├── summarize-gpt4o.yaml      # service: openai, model: gpt-4o
├── summarize-claude.yaml     # service: bedrock, model: anthropic.claude-3-sonnet
├── summarize-llama.yaml      # service: ollama, model: llama3.2
└── summarize-groq.yaml       # service: groq, model: llama-3.3-70b-versatile
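
For example, summarize-llama.yaml could keep the query and prompt from the tutorial and change only the service and model fields:

---
query_ref: pangram
prompt_ref: summarizer
service: ollama
model: llama3.2
repeat: 3
temperature: 0.5
evaluators:
- word_count
- coherence
- equivalence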

Run all:

uv run microeval run my-evals

Compare results in the Graph view.

Compare Multiple Prompts

Create different prompts and run configs:

my-evals/prompts/
├── summarizer-basic.txt      # Simple instructions
├── summarizer-detailed.txt   # Detailed step-by-step
└── summarizer-expert.txt     # Expert persona

my-evals/runs/
├── test-basic.yaml           # prompt_ref: summarizer-basic
├── test-detailed.yaml        # prompt_ref: summarizer-detailed
└── test-expert.yaml          # prompt_ref: summarizer-expert
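
Each run config keeps the query, service, and model fixed and changes only prompt_ref. A sketch of what test-basic.yaml might contain:

---
query_ref: pangram
prompt_ref: summarizer-basic
service: openai
model: gpt-4o
repeat: 3
temperature: 0.5
evaluators:
- coherence
- equivalence
- word_count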

Web UI Guide

Start the UI:

uv run microeval ui my-evals

Tabs

Tab       Purpose
Runs      Create, edit, and execute run configurations
Queries   Define and edit test cases (input/output pairs)
Prompts   Write and manage system prompts
Graph     Visualize evaluation results and compare runs

Workflow

  1. Prompts → Write your system prompt
  2. Queries → Define your test case
  3. Runs → Configure which model, prompt, and query to use
  4. Runs → Click the run button to execute
  5. Graph → View and compare results

CLI Commands

uv run microeval ui [EVALS_DIR]       # Start web UI (default: evals-consultant)
uv run microeval run EVALS_DIR        # Run all evaluations in directory
uv run microeval chat SERVICE         # Interactive chat (openai, bedrock, ollama, groq)
uv run microeval demo                 # Create sample-evals (if missing) and launch the UI

Demo

Create sample evaluations and launch the UI:

uv run microeval demo
# Creates sample-evals directory and opens the web UI at http://localhost:8000

The demo includes evaluations for all supported services (OpenAI, Bedrock, Ollama, Groq) using the same prompt and test case for easy comparison.

Interactive Chat

Test LLM providers directly:

uv run microeval chat openai
uv run microeval chat ollama
uv run microeval chat bedrock
uv run microeval chat groq

Project Structure

.
├── README.md
├── pyproject.toml
├── .env                             # API keys (create from .env.example)
├── microeval/                        # Main package
│   ├── cli.py                       # CLI entry point (ui, run, chat)
│   ├── server.py                    # Web server and API
│   ├── runner.py                    # Evaluation runner
│   ├── evaluator.py                 # Evaluation logic
│   ├── chat_client.py               # LLM provider clients
│   ├── chat.py                      # Interactive chat
│   ├── schemas.py                   # Pydantic models
│   ├── config.json                  # Model configuration
│   ├── index.html                   # Web UI
│   ├── graph.py                     # Metrics visualization
│   └── yaml_utils.py                # YAML helpers
├── sample-evals/                    # Example evaluation project
│   ├── prompts/
│   ├── queries/
│   ├── runs/
│   └── results/
└── uv.lock

Services and Models

Default models configured in microeval/config.json:

Service   Default Model
openai    gpt-4o
bedrock   amazon.nova-pro-v1:0
ollama    llama3.2
groq      llama-3.3-70b-versatile

Tips and Best Practices

Prompt Engineering

  • Start with simple prompts and iterate
  • Use clear section headers (## Instructions, ## Output Format)
  • Specify output format explicitly
  • Test with temperature: 0.0 first for deterministic results

Evaluation Design

  • Use repeat: 3 or higher to account for model variability
  • Include equivalence when you have a known-good answer
  • Use coherence for open-ended responses
  • Create multiple query files to test different scenarios

Comparing Results

  • Keep one variable constant when comparing (e.g., same prompt, different models)
  • Use the Graph tab to visualize trends
  • Check standard deviation to understand consistency
