Interactive LLM agentic evaluation TUI for local and cloud models

Maintainers

scottblydotcom

Hermia

Structured behavioral eval for local LLMs. The model binary is not the unit of analysis — the inference stack is.

You selected a model by benchmark score. That benchmark ran on somebody else's hardware, their driver stack, their runtime version. Not yours.

A ROCm update can flip a security test from PASS to FAIL. Hermia catches it — because it runs on your stack, not a cloud proxy.

What It Does

Hermia runs structured behavioral evaluation against local Ollama models and scores results for correctness across security, reasoning, and tool-use dimensions. Results map directly to established AI security frameworks so findings have documented provenance — not just "it seemed fine."

Live system metrics (CPU, RAM, GPU, VRAM, tokens/sec) run alongside every eval. Cold-load benchmarking measures actual model load time from a clean VRAM state, not cached inference. Because "how fast is it really" is a different question than "how fast is it after it's already warm."
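As a rough illustration of the cold-versus-warm distinction: Ollama's /api/generate response reports timing fields in nanoseconds (load_duration, eval_count, eval_duration). The sketch below derives load time and tokens/sec from a response shaped that way. Whether Hermia computes its metrics from these fields is an assumption, and the function name and sample values are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def summarize_timing(response: dict) -> dict:
    """Derive model load time and generation throughput from an Ollama-style response."""
    load_seconds = response.get("load_duration", 0) / NS_PER_SECOND
    eval_ns = response.get("eval_duration", 0)
    eval_count = response.get("eval_count", 0)
    # Guard against a zero duration so an empty response does not divide by zero.
    tokens_per_second = eval_count / (eval_ns / NS_PER_SECOND) if eval_ns else 0.0
    return {"load_seconds": load_seconds, "tokens_per_second": tokens_per_second}


# Made-up sample values, not measurements: same generation speed, very different load cost.
cold = {"load_duration": 4_200_000_000, "eval_count": 120, "eval_duration": 3_000_000_000}
warm = {"load_duration": 15_000_000, "eval_count": 120, "eval_duration": 3_000_000_000}

for label, sample in (("cold", cold), ("warm", warm)):
    print(label, summarize_timing(sample))
```

Both samples report the same tokens/sec; only the load time separates them, which is why a warm-only benchmark hides the cold-start cost.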

v0.2 scope: structural eval with deterministic orchestration (fixed sampling — temperature=0, seed=42 — and fixed message construction) against Ollama-compatible local endpoints, with multi-turn corpus cases for context-carry and boundary-persistence testing. Reproducibility of the model's output still depends on the backend; that is what Hermia measures. LLM-as-judge intent scoring lands in v0.3.
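To make "fixed sampling" concrete, here is a minimal sketch of a request body for Ollama's /api/generate endpoint with the two sampling values named above pinned in the options object. The helper name is hypothetical and this is not Hermia's own message construction, only one way such a request could look.

```python
import json
from typing import Optional


def build_generate_request(model: str, prompt: str, system: Optional[str] = None) -> dict:
    """Build an /api/generate body with sampling pinned to temperature=0, seed=42."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": 0, "seed": 42},
    }
    if system is not None:
        payload["system"] = system
    return payload


# Serializing with sorted keys keeps the bytes on the wire identical between runs.
body = json.dumps(
    build_generate_request("llama3.2", "Reply with the single word: ready"),
    sort_keys=True,
)
print(body)
```

With the request held constant like this, any difference in output between two hosts comes from the backend rather than from the client.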

Fleet mode (--fleet FILE) runs headless multi-host eval from a YAML config — same test suite, multiple Ollama endpoints evaluated concurrently (default: up to 4 hosts in parallel). Compare CUDA vs. Metal on the same model. See where your inference stack diverges. Entries that share the same host are evaluated sequentially so a single GPU node is never asked to hold two models simultaneously (VRAM-safe). Control parallelism with --max-concurrency N. Per-test timeout is configurable via --test-timeout SECONDS or per-host test_timeout: in the fleet YAML. See the fleet-YAML format for the file schema.
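The scheduling rule described above (hosts in parallel, entries on the same host in sequence) can be sketched with a thread pool. This is an illustrative sketch, not Hermia's scheduler: run_fleet, the (host, model) entry shape, and the evaluate callback are all hypothetical names.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_fleet(entries, evaluate, max_concurrency=4):
    """Run evaluate(host, model) for every entry.

    entries: iterable of (host, model) pairs.
    Different hosts run in parallel, up to max_concurrency at a time.
    Entries that share a host run one after another on a single worker.
    """
    by_host = defaultdict(list)
    for host, model in entries:
        by_host[host].append(model)

    def run_host(host):
        # Sequential loop: a host is never asked to evaluate two models at once.
        return [(host, model, evaluate(host, model)) for model in by_host[host]]

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        per_host = pool.map(run_host, list(by_host))
        return [row for rows in per_host for row in rows]


if __name__ == "__main__":
    demo = [("gpu-a", "llama3.2"), ("gpu-b", "llama3.2"), ("gpu-a", "other-model")]
    for row in run_fleet(demo, lambda host, model: "evaluated"):
        print(row)
```

Grouping by host before submitting work means the concurrency limit applies to hosts, not to individual entries, which is what keeps a single GPU node from loading two models at the same time.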

Why Hermia Exists

Garak is built by NVIDIA — you know, the company currently valued at roughly the GDP of a medium-sized country. It has hundreds of probes, years of community contributions, serious research backing, and a team of people whose full-time job is this. You should use it.

Hermia is built in a consultancy lab. Different scale. Genuinely different problem.

Garak asks: is this model vulnerable to known attack patterns?

Hermia asks: does this model behave correctly on your inference stack — and what is your hardware actually doing while it runs?

Will it refuse a forbidden action — consistently, not just when it feels like it?
Does it maintain a security boundary when a structured workflow nudges toward crossing it?
Will it leak a system prompt credential if the user asks cleverly enough?
Does it correctly route a request that looks safe but isn't?

These aren't hypothetical. They're the questions a security practitioner asks before deploying a model in an environment where it has real tools and real permissions.

Garak scans for vulnerabilities. Hermia evaluates behavioral correctness against structured pass/fail criteria mapped to frameworks you can actually cite in a risk assessment. They do different things. Run both.

The practitioner origin is a feature, not a bug — this was built by a security consultant who runs models across a distributed inference fleet, cares about hardware costs, and needs evals that work without sending data to a cloud API. If that sounds like you, Hermia was built for your context.

Framework Coverage

Framework	What Hermia Maps To
OWASP LLM Top 10 (2025)	LLM01 prompt injection (direct + indirect), LLM02 sensitive information disclosure, LLM06 excessive agency / scope escalation, LLM07 system-prompt leakage
MITRE ATLAS 6.0.0 (2026.05)	AML.T0051.000/.001 direct + indirect prompt injection, AML.T0056 extract system prompt, AML.T0057 LLM data leakage, AML.T0068 prompt obfuscation, AML.T0099 tool data poisoning
CSA MAESTRO	L1 foundation model robustness, L3 agent framework routing and lane evasion
NIST AI RMF	Measure function: MEASURE 2.5 validity & reliability, MEASURE 2.7 security & resilience

Eval Dimensions

Dimension	What It Tests
`security`	Injection resistance, credential protection, scope escalation refusal, system prompt extraction resistance, structured field injection
`tool-use`	Valid tool invocation, correct tool selection, dependency-aware multi-step chaining
`reasoning`	Multi-step decomposition, error recovery and fallback planning, partial failure handling
`constraint`	Exact schema compliance, numeric correctness, adversarial input robustness
`routing`	Request classification, lane routing evasion detection
`memory`	Cross-turn context retention
`multi-turn`	Deterministic multi-turn conversations — context carry across turns, safety-boundary persistence under social engineering
`domain`	Home automation agent, structured data extraction

Requirements

Python 3.11+
Ollama running locally (ollama serve)
At least one model pulled: ollama pull llama3.2 or any compatible model

In the default local setup, no cloud API keys are required and no data leaves your machine. (Point a fleet host at a remote or cloud endpoint via the openai-compat transport and prompts are sent to that endpoint — and a key may be required. See Run against a remote host.)

Hardware Support

Platform	GPU	Status
Linux	AMD ROCm (gfx900 / RX series)	✅ Tested
Linux	NVIDIA CUDA (sm_89 / RTX series)	✅ Tested*
macOS	Apple Silicon (M1 / M2 / M3 / M4)	✅ Tested
Linux	Intel iGPU	⚠️ Best-effort
Linux / macOS	CPU-only (no discrete GPU)	✅ Supported
Windows	Any	❌ Not yet

*NVIDIA metrics tested on Linux eval client. Windows Ollama servers are supported as fleet targets (point a fleet YAML entry's host: at the Windows box); running Hermia itself on Windows is not yet supported.

Install

Recommended (via pipx):

pipx install hermia

Or via Homebrew (macOS):

brew install scottblydotcom/tap/hermia

Or with pip:

pip install hermia

Or from source:

git clone https://github.com/scottblydotcom/hermia
cd hermia
pip install -e .

Or via Docker (headless fleet mode):

mkdir -p results && chmod 777 results  # container writes as uid 1000, not your host user
docker run --rm --network host \
  -v "$PWD/fleets:/workspace/fleets:ro" \
  -v "$PWD/results:/workspace/results" \
  ghcr.io/scottblydotcom/hermia:latest \
  --fleet fleets/local.yaml

See Docker usage for macOS / Windows networking (host.docker.internal) and volume-mount details.

Quickstart

# Start Ollama if it isn't running (use a separate terminal; it stays in the foreground)
ollama serve

# Launch Hermia
hermia

Hermia opens a TUI. Select a model from the list, choose which eval dimensions to run, and press Run. Results appear live alongside system metrics. Each run writes results/eval_TIMESTAMP.jsonl and results/eval_TIMESTAMP.csv.
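Because each run leaves a JSONL file behind, results can be post-processed with a few lines of Python. The sketch below computes a pass rate per dimension. The field names "dimension" and "result" are assumptions made for illustration; check a real results file (or docs/usage.md) for the actual schema before relying on them.

```python
import json
from collections import Counter
from pathlib import Path


def pass_rates(jsonl_path):
    """Return {dimension: fraction of records marked PASS} for one results file.

    Assumes one JSON object per line with hypothetical keys
    "dimension" and "result" ("PASS" or "FAIL").
    """
    totals, passes = Counter(), Counter()
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            dimension = record["dimension"]
            totals[dimension] += 1
            if record["result"] == "PASS":
                passes[dimension] += 1
    return {dimension: passes[dimension] / totals[dimension] for dimension in totals}
```

Calling pass_rates on a file under results/ returns a plain dict, so two runs can be compared dimension by dimension.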

New here? docs/getting-started.md is the 5-minute zero-to-first-eval path.

See docs/usage.md for the full reference: result interpretation, --repeat N consistency scoring, fleet mode, regression detection, and Postgres export.

Roadmap

v0.2 — Fleet + TUI (a.k.a. Endpoint Bus; shipping): Headless fleet mode for multi-host eval from a YAML config; full-featured TUI for launch/configure/run/inspect; backend stack tagging by GPU arch, runtime version, and execution path (GPU vs spill). Configurable per-test timeout for thinking-mode models.

v0.3 — Eval Bus (target ~2026-08): Hermia becomes the platform other tools build into. Probe adapters for Garak, PyRIT, and HarmBench pull their results into Hermia's hardware-correlated, framework-mapped view alongside Hermia's own test cases. LLM-as-judge scoring; a Sink interface — a pluggable output destination (Prometheus, webhook, S3) that results can be written to.

See docs/roadmap.md for the full plan.

Project Status

v0.2.0 — stable and tested. The core eval suite, fleet mode, TUI, audit report, and findings analysis pipeline are all shipping. Cross-stack reproducibility evidence (Metal × CUDA × ROCm) is being captured as an ongoing dataset, published on a rolling basis across the v0.2.x series rather than as a single launch snapshot. The security pipeline (gitleaks, trivy, bandit, pip-audit, ruff, mypy) is more rigorous than a research tool strictly needs to be. That was intentional.

Available on PyPI: pipx install hermia

Name

Hermia = Hermes (Greek messenger god, trickster, patron of travelers — thief of Apollo's cattle) + Pythia (the Oracle of Delphi, who spoke for Apollo).

The tool steals answers from the Oracle and tells you which one to trust.

Documentation

Getting Started — 5-minute zero-to-first-eval guide
Usage Reference — full walkthrough: install, run, interpret results, fleet mode, regression detection, Postgres export
Roadmap — v0.2 fleet + TUI, v0.3 eval bus, full backlog
GUARDS Framework — six-dimension standard for LLM system-prompt guardrail construction (Goal/Unit/Actions/Response/Detect/Stop)

Security

Hermia only reads from Ollama: /api/tags, /api/generate, /api/ps, and /api/version. It never calls the model-upload or /api/create endpoints, so it does not itself exercise the code paths behind the model-upload CVEs (CVE-2026-7482, CVE-2026-5757). Your Ollama server can still be vulnerable, so keep it patched and restricted per the checklist below.

Protect your Ollama instance:

Run Ollama bound to 127.0.0.1 (the default) — never expose port 11434 publicly
Keep Ollama upgraded; 0.17.1+ patches CVE-2026-7482 (CVSS 9.1, heap memory disclosure via crafted GGUF upload, nicknamed "Bleeding Llama")
CVE-2026-5757 (same attack class, no upstream patch as of May 2026) — restrict /api/create access at the network or firewall layer
Fleet deployments: use fleet-YAML auth.bearer.key_env blocks (see usage.md) or a Tailscale overlay to prevent unauthenticated access to remote Ollama endpoints

Hermia surfaces known Ollama version vulnerabilities at run time in the preflight log as SEC ⚠ warnings.
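A version check of this kind can be sketched as a numeric comparison against the patched release named in the checklist. Ollama's /api/version endpoint returns a JSON object with a "version" string; how Hermia parses or compares it is not documented here, so the helper names below are hypothetical.

```python
import re

# Minimum release the checklist above cites as patching CVE-2026-7482.
PATCHED = (0, 17, 1)


def parse_version(text):
    """Turn a version string such as '0.17.1' or 'v0.20.0-rc1' into a tuple of ints."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_text, patched=PATCHED):
    """True when the reported version is older than the patched release."""
    return parse_version(version_text) < patched


for reported in ("0.9.9", "0.17.0", "0.17.1", "0.20.0-rc1"):
    status = "SEC warning" if needs_upgrade(reported) else "ok"
    print(reported, status)
```

Comparing tuples of integers rather than raw strings matters here: as text, "0.9.9" sorts after "0.17.1", which would wrongly mark an old server as patched.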

Contributing

Contributions welcome. Please read AGENTS.md before opening a PR — it covers the behavioral rules, module boundary table, and review gate sequence this project enforces.

See CONTRIBUTING.md for full details on how to get involved.

License

MIT — see LICENSE.

Release history

Version	Released
0.2.0 (this version)	Jul 6, 2026
0.1.3	May 28, 2026
0.1.2	May 28, 2026
0.1.1	May 19, 2026

Download files

Source Distribution

hermia-0.2.0.tar.gz (118.0 kB)

Uploaded Jul 6, 2026 Source

Built Distribution

hermia-0.2.0-py3-none-any.whl (146.4 kB)

Uploaded Jul 6, 2026 Python 3

File details

Details for the file hermia-0.2.0.tar.gz.

File metadata

Download URL: hermia-0.2.0.tar.gz
Upload date: Jul 6, 2026
Size: 118.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for hermia-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d3f52c3d42627381228fedc0c9844adcdf8a77f2a9cb39c74e145ab6dcf590c6`
MD5	`d483d7e42d603cd7f301ee7ef439ee21`
BLAKE2b-256	`907eb22b70658d94df4ff9148ffa1f4ab4cbb76c173acd54bee481cf75470614`

Provenance

The following attestation bundles were made for hermia-0.2.0.tar.gz:

Publisher: publish.yml on scottblydotcom/hermia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hermia-0.2.0.tar.gz
- Subject digest: d3f52c3d42627381228fedc0c9844adcdf8a77f2a9cb39c74e145ab6dcf590c6
- Sigstore transparency entry: 2084236562
- Sigstore integration time: Jul 6, 2026
Source repository:
- Permalink: scottblydotcom/hermia@13513d1618c11880e4cf4425ac70a01781975e7f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/scottblydotcom
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@13513d1618c11880e4cf4425ac70a01781975e7f
- Trigger Event: push

File details

Details for the file hermia-0.2.0-py3-none-any.whl.

File metadata

Download URL: hermia-0.2.0-py3-none-any.whl
Upload date: Jul 6, 2026
Size: 146.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for hermia-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3621b9e02bed473182836c84293ae81466c8baefac1df8446bb1f2da2f2af5dd`
MD5	`bf58c9921a2336cb3a94cf59adce6eb7`
BLAKE2b-256	`bf179eacde07d4b4a9fe2e060e41dc1d2a3f18b7d2b44df240588622854e4954`

Provenance

The following attestation bundles were made for hermia-0.2.0-py3-none-any.whl:

Publisher: publish.yml on scottblydotcom/hermia

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hermia-0.2.0-py3-none-any.whl
- Subject digest: 3621b9e02bed473182836c84293ae81466c8baefac1df8446bb1f2da2f2af5dd
- Sigstore transparency entry: 2084236585
- Sigstore integration time: Jul 6, 2026
Source repository:
- Permalink: scottblydotcom/hermia@13513d1618c11880e4cf4425ac70a01781975e7f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/scottblydotcom
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@13513d1618c11880e4cf4425ac70a01781975e7f
- Trigger Event: push
