
robotframework-chat

A Robot Framework-based test harness for systematically testing Large Language Models (LLMs) using LLMs as both the system under test and as automated graders. Test results are archived to SQL and visualized in Apache Superset dashboards.


Quick Start

Prerequisites

  • Python 3.11+ and astral-uv for dependency management
  • Docker for containerized code execution, LLM testing, and the Superset stack
  • Ollama (optional) for local LLM testing

Installation (Linux / macOS)

make install                # Install all dependencies
pre-commit install          # Install pre-commit hooks
ollama pull phi4:14b         # Pull default LLM model (optional)

Installation (Windows)

The tasks.py script provides a cross-platform alternative to the Makefile. It requires only Python and uv — no make, bash, or Unix tools needed.

uv run python tasks.py install      # Install all dependencies
uv run pre-commit install           # Install pre-commit hooks
ollama pull phi4:14b                # Pull default LLM model (optional)
uv run python tasks.py help         # List all available targets

Note: Docker-based tests require Docker Desktop for Windows with the WSL 2 backend enabled.
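
For the curious, a cross-platform runner like tasks.py boils down to mapping target names onto subprocess calls, which behave the same on every OS. A minimal sketch of the pattern (illustrative only; the target table here is hypothetical and the real tasks.py defines many more targets):

# tasks_sketch.py: illustrative dispatcher, not the project's actual tasks.py
import subprocess
import sys

TARGETS = {
    "install": ["uv", "sync"],                  # hypothetical command mapping
    "robot": ["uv", "run", "robot", "tests/"],  # hypothetical command mapping
}

def main() -> int:
    target = sys.argv[1] if len(sys.argv) > 1 else "help"
    if target not in TARGETS:
        print("Available targets:", ", ".join(sorted(TARGETS)))
        return 0
    # subprocess.run works identically on Windows, Linux, and macOS,
    # which is what removes the need for make or bash.
    return subprocess.run(TARGETS[target]).returncode

if __name__ == "__main__":
    sys.exit(main())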

Running Tests

# Linux / macOS
make robot                  # Run all Robot Framework test suites
make robot-math             # Run math tests
make robot-docker           # Run Docker tests
make robot-safety           # Run safety tests

# All platforms (including Windows)
uv run python tasks.py robot        # Run all suites
uv run python tasks.py robot-math   # Run math tests
uv run python tasks.py robot-dryrun # Validate tests (dry run)
uv run python tasks.py check        # Lint + typecheck + coverage

Superset Dashboard

# Linux / macOS
cp .env.example .env        # Configure environment
make docker-up              # Start PostgreSQL + Redis + Superset
make bootstrap              # First-time Superset initialization

# Windows — tasks.py copies .env automatically if missing
uv run python tasks.py docker-up

Open http://localhost:8088 to view the dashboard.


Ollama Configuration

Pulling Models

The default model is phi4:14b (set via DEFAULT_MODEL in .env). Pull additional models depending on how many you want to test against:

Starter (3 models):

ollama pull phi4:14b
ollama pull llama3.2:latest
ollama pull gemma2:latest

Standard (4–5 models):

ollama pull phi4:14b
ollama pull llama3.2:latest
ollama pull gemma2:latest
ollama pull mistral:latest
ollama pull qwen3.5:27b

Full fleet — pull all models from config/test_suites.yaml:

make cron-sync-models        # Pull any model on the master list that is missing locally

Loading Multiple Models Simultaneously

By default Ollama keeps up to 3 models loaded in memory (3 × number of GPUs, or 3 for CPU inference). To load more models concurrently, configure these Ollama server environment variables:

Variable                  Default          Description
OLLAMA_MAX_LOADED_MODELS  3 × GPUs (or 3)  Max models resident in memory at once
OLLAMA_NUM_PARALLEL       1                Parallel requests per loaded model
OLLAMA_MAX_QUEUE          512              Max queued requests before rejecting

Memory note: each loaded model consumes VRAM/RAM proportional to its size. A 7B Q4 model uses ~4 GB; a 27B model uses ~16 GB. Setting OLLAMA_NUM_PARALLEL > 1 multiplies context memory per model.
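
For example, OLLAMA_MAX_LOADED_MODELS=5 with five 7B Q4 models needs roughly 5 × 4 GB = 20 GB, while swapping one of them for a 27B model raises the total to about (4 × 4 GB) + 16 GB = 32 GB, before any OLLAMA_NUM_PARALLEL context overhead.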

Linux (systemd):

sudo systemctl edit ollama.service

Add under [Service]:

[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=5"
Environment="OLLAMA_NUM_PARALLEL=2"

Then restart:

sudo systemctl restart ollama

macOS:

launchctl setenv OLLAMA_MAX_LOADED_MODELS 5
launchctl setenv OLLAMA_NUM_PARALLEL 2

Restart the Ollama application after setting these.

Windows:

Set OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL as system environment variables, then restart Ollama.
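
For example, from an elevated prompt (the /M flag writes machine-level variables; omit it for user-level ones):

setx OLLAMA_MAX_LOADED_MODELS 5 /M
setx OLLAMA_NUM_PARALLEL 2 /M

Note that setx only affects processes started after the change, hence the restart.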

VRAM Sizing Guide

Models Loaded  Recommended VRAM  Example Hardware
3 (default)    24 GB             RTX 4090, M2 Pro
4              32 GB             2× RTX 4080, M2 Max
5+             48+ GB            2× RTX 4090, M3 Ultra

Actual requirements depend on model sizes and quantization levels.

Auto-Discovery and Multi-Model Testing

The test harness auto-discovers available models at startup and skips tests for any model that is not installed, so a missing model never causes a test failure. (A sketch of the discovery call follows the commands below.)

make discover-local-models   # List models available on all configured nodes
make run-local-models        # Run all test suites against every discovered model

# Windows
uv run python scripts/run_local_models.py --discover-models
uv run python scripts/run_local_models.py
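
Discovery itself amounts to asking each node's Ollama API what it has installed. A minimal sketch of the idea using Ollama's /api/tags endpoint (illustrative; this is not the project's actual scripts/run_local_models.py):

# discover_sketch.py: list installed models per node (illustrative)
import requests

def discover(nodes: list[str]) -> dict[str, list[str]]:
    """Return {node: [model names]} for every reachable Ollama node."""
    found: dict[str, list[str]] = {}
    for node in nodes:
        try:
            # Ollama's /api/tags endpoint lists locally installed models.
            resp = requests.get(f"http://{node}:11434/api/tags", timeout=5)
            resp.raise_for_status()
            found[node] = [m["name"] for m in resp.json().get("models", [])]
        except requests.RequestException:
            found[node] = []  # unreachable node: skip rather than fail
    return found

if __name__ == "__main__":
    print(discover(["localhost"]))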

Use ITERATIONS for continuous testing:

make run-local-models ITERATIONS=-1   # Run forever
make run-local-models ITERATIONS=0    # Stop on first error

Multi-Node Setup (Optional)

To distribute tests across multiple machines running Ollama, set OLLAMA_NODES_LIST in .env:

OLLAMA_NODES_LIST=localhost,gpu-server-1,gpu-server-2

Or edit the nodes list in config/test_suites.yaml directly. Check node status with:

make discover-local-nodes

Project Environment Variables

Variable           Default                 Description
LLM_PROVIDER       ollama                  Provider backend (ollama or openai)
OLLAMA_ENDPOINT    http://localhost:11434  Ollama API endpoint
DEFAULT_MODEL      phi4:14b                Model used for standard test runs
OLLAMA_TIMEOUT     5400                    Request timeout in seconds (90 min)
OLLAMA_NODES_LIST  localhost               Comma-separated Ollama hostnames
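
Putting these together, a minimal .env for a single local node simply restates the defaults:

LLM_PROVIDER=ollama
OLLAMA_ENDPOINT=http://localhost:11434
DEFAULT_MODEL=phi4:14b
OLLAMA_TIMEOUT=5400
OLLAMA_NODES_LIST=localhost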

Example Test

*** Test Cases ***
LLM Can Do Basic Math
    ${answer}=    Ask LLM    What is 2 + 2?
    ${score}    ${reason}=    Grade Answer    What is 2 + 2?    4    ${answer}
    Should Be Equal As Integers    ${score}    1
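
Conceptually, Grade Answer hands the question, expected answer, and model answer to a judge LLM and demands a constrained verdict instead of prose (the real keyword also returns a reason string, as in the example above). A minimal sketch of the idea against Ollama's /api/generate endpoint (illustrative; not the project's actual keyword implementation):

# grade_sketch.py: constrained LLM-as-grader (illustrative)
import requests

def grade_answer(question: str, expected: str, actual: str) -> int:
    """Ask a judge model for a 1 (correct) or 0 (incorrect) verdict."""
    prompt = (
        f"Question: {question}\nExpected: {expected}\nActual: {actual}\n"
        "Reply with exactly one character: 1 if the actual answer is "
        "correct, 0 if it is not."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi4:14b", "prompt": prompt, "stream": False},
        timeout=5400,
    )
    resp.raise_for_status()
    # Constrained grading: accept only a leading "1"; anything else scores 0.
    return 1 if resp.json()["response"].strip().startswith("1") else 0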

Core Philosophy

  • LLMs are software — test them like software
  • Determinism before intelligence — structured, machine-verifiable evaluation first
  • Constrained grading — scores, categories, pass/fail; no prose from the evaluation layer
  • Modular by design — composable pieces; new providers and graders plug in without rewriting core
  • Robot Framework as the orchestration layer — readable, keyword-driven tests
  • Every test run is archived — listeners always active, results flow to SQL (see the listener sketch after this list)
  • CI-native, regression-focused — if it can't run unattended, it's not done
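
The archival guarantee rests on Robot Framework's listener interface, which fires for every test regardless of outcome. A minimal sketch of a result-archiving listener (illustrative; it writes to SQLite for brevity, whereas the project's stack uses PostgreSQL):

# listener_sketch.py: archive every test result to SQL (illustrative)
import sqlite3

class ResultArchiver:
    """Robot Framework listener that records each test's outcome."""

    ROBOT_LISTENER_API_VERSION = 3

    def __init__(self, db_path: str = "results.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results"
            " (name TEXT, status TEXT, elapsed_ms INTEGER)"
        )

    def end_test(self, data, result):
        # Called after every test, pass or fail, so nothing escapes the archive.
        self.conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)",
            (result.name, result.status, result.elapsedtime),
        )
        self.conn.commit()

    def close(self):
        self.conn.close()

Attach it with robot --listener listener_sketch.ResultArchiver tests/.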

See ai/AGENTS.md for the full philosophy.


Documentation

Document                        Description
docs/TEST_DATABASE.md           Database schema and usage
docs/GITLAB_CI_SETUP.md         CI/CD setup guide
docs/GRAFANA_SUPERSET_SETUP.md  Superset visualization stack setup (Grafana deferred to v2+)
docs/SUPERSET_EXPORT_GUIDE.md   Superset dashboard export, import, and backup
Ollama Configuration            Multi-model loading, VRAM sizing, and multi-node setup

Contributing

  1. Read ai/DEV.md for the development workflow and TDD discipline
  2. Follow the code style guidelines in ai/AGENTS.md
  3. Add tests for new features (see ai/CLAUDE.md for grading tiers)
  4. Run pre-commit run --all-files before committing
