
Robot Framework-based test harness for systematically testing LLMs

Project description

robotframework-chat

A Robot Framework-based test harness for systematically testing Large Language Models (LLMs) using LLMs as both the system under test and as automated graders. Test results are archived to SQL and visualized in Apache Superset dashboards.


Quick Start

Prerequisites

  • Python 3.11+ and astral-uv for dependency management
  • Docker for containerized code execution, LLM testing, and the Superset stack
  • Ollama (optional) for local LLM testing

Installation (Linux / macOS)

make install                # Install all dependencies
pre-commit install          # Install pre-commit hooks
ollama pull llama3          # Pull default LLM model (optional)

Installation (Windows)

The tasks.py script provides a cross-platform alternative to the Makefile. It requires only Python and uv; no make, bash, or other Unix tools are needed.

uv run python tasks.py install      # Install all dependencies
uv run pre-commit install           # Install pre-commit hooks
ollama pull llama3                  # Pull default LLM model (optional)
uv run python tasks.py help         # List all available targets

Note: Docker-based tests require Docker Desktop for Windows with the WSL 2 backend enabled.
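The cross-platform pattern behind tasks.py can be sketched as a small dispatcher that maps target names to subprocess commands. The targets and commands below are illustrative stand-ins, not the project's actual tasks.py:

```python
import subprocess
import sys

# Hypothetical targets; the real tasks.py defines its own set.
TASKS = {
    "install": [["uv", "sync"]],
    "robot": [["uv", "run", "robot", "tests/"]],
}

def run(target: str) -> int:
    """Run every command for a target, stopping on the first failure."""
    if target == "help" or target not in TASKS:
        print("Available targets:", ", ".join(sorted(TASKS)))
        return 0
    for cmd in TASKS[target]:
        code = subprocess.call(cmd)  # behaves the same on Windows and Unix
        if code != 0:
            return code
    return 0

if __name__ == "__main__":
    sys.exit(run(sys.argv[1] if len(sys.argv) > 1 else "help"))
```

Because the dispatcher only uses the standard library and lets subprocess handle platform differences, the same file works unmodified on Windows, Linux, and macOS.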

Running Tests

# Linux / macOS
make robot                  # Run all Robot Framework test suites
make robot-math             # Run math tests
make robot-docker           # Run Docker tests
make robot-safety           # Run safety tests

# All platforms (including Windows)
uv run python tasks.py robot        # Run all suites
uv run python tasks.py robot-math   # Run math tests
uv run python tasks.py robot-dryrun # Validate tests (dry run)
uv run python tasks.py check        # Lint + typecheck + coverage

Superset Dashboard

# Linux / macOS
cp .env.example .env        # Configure environment
make docker-up              # Start PostgreSQL + Redis + Superset
make bootstrap              # First-time Superset initialization

# Windows — tasks.py copies .env automatically if missing
uv run python tasks.py docker-up

Open http://localhost:8088 to view the dashboard.


Example Test

*** Test Cases ***
LLM Can Do Basic Math
    ${answer}=    Ask LLM    What is 2 + 2?
    ${score}    ${reason}=    Grade Answer    What is 2 + 2?    4    ${answer}
    Should Be Equal As Integers    ${score}    1
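The contract behind the Grade Answer keyword (a numeric score plus a short reason, never free prose) can be illustrated with a plain-Python stub. The function name and the exact-match rule below are hypothetical; the real harness delegates grading to an LLM but constrains its output to the same machine-verifiable shape:

```python
def grade_answer(question: str, expected: str, answer: str) -> tuple[int, str]:
    """Constrained grader stub: returns (score, reason) with score in {0, 1}.

    Illustrative only -- a substring check standing in for the
    LLM-based grader, which is held to the same output contract.
    """
    if expected.strip().lower() in answer.strip().lower():
        return 1, "expected answer found in response"
    return 0, "expected answer missing from response"

score, reason = grade_answer("What is 2 + 2?", "4", "The answer is 4.")
```

Keeping the evaluation layer's output this constrained is what lets a test assert on it deterministically, as the Should Be Equal As Integers step above does.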

Core Philosophy

  • LLMs are software — test them like software
  • Determinism before intelligence — structured, machine-verifiable evaluation first
  • Constrained grading — scores, categories, pass/fail; no prose from the evaluation layer
  • Modular by design — composable pieces; new providers and graders plug in without rewriting core
  • Robot Framework as the orchestration layer — readable, keyword-driven tests
  • Every test run is archived — listeners always active, results flow to SQL
  • CI-native, regression-focused — if it can't run unattended, it's not done

See ai/AGENTS.md for the full philosophy.
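The "results flow to SQL" idea can be sketched with the standard-library sqlite3 module. The table and column names below are hypothetical; the project's actual schema is documented in docs/TEST_DATABASE.md:

```python
import sqlite3

# Hypothetical schema for archived results; see docs/TEST_DATABASE.md
# for the project's real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE test_results (
           suite  TEXT NOT NULL,
           test   TEXT NOT NULL,
           model  TEXT NOT NULL,
           score  INTEGER NOT NULL,  -- constrained grade: 0 or 1
           reason TEXT NOT NULL
       )"""
)

# A Robot Framework listener could insert one row per finished test.
conn.execute(
    "INSERT INTO test_results VALUES (?, ?, ?, ?, ?)",
    ("math", "LLM Can Do Basic Math", "llama3", 1, "correct arithmetic"),
)
conn.commit()

rows = conn.execute("SELECT test, score FROM test_results").fetchall()
```

Once every run lands in a table like this, dashboards such as the Superset stack above can chart pass rates per model and suite over time.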


Documentation

Document                         Description
docs/TEST_DATABASE.md            Database schema and usage
docs/GITLAB_CI_SETUP.md          CI/CD setup guide
docs/GRAFANA_SUPERSET_SETUP.md   Superset visualization stack setup (Grafana deferred to v2+)
docs/SUPERSET_EXPORT_GUIDE.md    Superset dashboard export, import, and backup

Contributing

  1. Read ai/DEV.md for the development workflow and TDD discipline
  2. Follow the code style guidelines in ai/AGENTS.md
  3. Add tests for new features (see ai/CLAUDE.md for grading tiers)
  4. Run pre-commit run --all-files before committing



Download files

Download the file for your platform.

Source Distribution

robotframework_chat-1.0.0.tar.gz (703.5 kB)

Built Distribution

robotframework_chat-1.0.0-py3-none-any.whl (57.9 kB)

File details

Details for the file robotframework_chat-1.0.0.tar.gz.

File metadata

  • Download URL: robotframework_chat-1.0.0.tar.gz
  • Size: 703.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for robotframework_chat-1.0.0.tar.gz

Algorithm     Hash digest
SHA256        bdee97e8912a0fb76a9bed376ecbfd325448d4eaf93a24011335cb87e076a0ec
MD5           c6ef506fdc5de69372c5b1c8e4ab60a9
BLAKE2b-256   bebd7e7151affd31f3236b1ca811144f363ae220d70b1251f9efc58fadefebdf

Provenance

The following attestation bundles were made for robotframework_chat-1.0.0.tar.gz:

Publisher: pypi-publish.yml on tkarcheski/robotframework-chat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file robotframework_chat-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for robotframework_chat-1.0.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        be5f697faa9e656126ea7162318dc0a0db104623ff11c9841a4cee246acd5833
MD5           6d4f62c9ad55ae3124779d525b5992b2
BLAKE2b-256   3647c363a902330ab8bd0ece9de586991459f44fec81ec1ba5ffdf26dc942809

Provenance

The following attestation bundles were made for robotframework_chat-1.0.0-py3-none-any.whl:

Publisher: pypi-publish.yml on tkarcheski/robotframework-chat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
