Evaluate LLMs against behavioral specifications (AGENTS.md, CLAUDE.md, custom rules)

llm-behavioral-eval

Spec-agnostic LLM evaluation engine. Measures how well any LLM follows behavioral specifications (AGENTS.md, CLAUDE.md, .cursorrules, etc.).

Quick Start

pip install llm-behavioral-eval

# Evaluate any spec directory
behavioral-eval --spec ./my-project --suite core_principles --count 20 --real-llm

# Full evaluation with LLM judge
behavioral-eval --spec ./dann-specs/project --suite all --real-llm --judge-provider deepseek

# Heuristic mode (no API cost for judge)
behavioral-eval --spec ./dann-specs/project --suite all --real-llm --no-judge

# A/B comparison between two models
behavioral-eval --spec ./dann-specs/project --arena llama-home ollama-home --count 30 --real-llm
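When the evaluation runs from a script or CI job, the same invocations can be assembled programmatically. The sketch below uses only the flags shown in the commands above; the helper names `build_command` and `run` are hypothetical and not part of the package, and it assumes `behavioral-eval` is on the PATH after installation.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(spec_dir: str, suite: str = "all", count: Optional[int] = None,
                  judge_provider: Optional[str] = None,
                  use_judge: bool = True) -> List[str]:
    """Assemble a behavioral-eval argument list from the Quick Start flags.

    Hypothetical helper: it only concatenates flags that appear in the
    Quick Start commands and does not validate them against the CLI.
    """
    cmd = ["behavioral-eval", "--spec", spec_dir, "--suite", suite]
    if count is not None:
        cmd += ["--count", str(count)]
    cmd.append("--real-llm")
    if not use_judge:
        # Heuristic mode: skip the LLM judge entirely.
        cmd.append("--no-judge")
    elif judge_provider is not None:
        cmd += ["--judge-provider", judge_provider]
    return cmd


def run(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs the package installed)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works offline.
    print(shlex.join(build_command("./my-project", suite="core_principles", count=20)))
```

Passing an argument list to `subprocess.run` avoids shell quoting problems when the spec path contains spaces.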

Features

5 test suites: core_principles, rubric_dimensions, roles, variants, concrete
LLM Judge: an external LLM scores each response per dimension (1-5) and justifies each score
Concrete verification: executable coding tasks with real assertion testing
Consistency: --repetitions N measures model stability
A/B Arena: compare two models head-to-head with statistical significance
Heatmaps: per-dimension score breakdowns
Spec-agnostic: evaluates any behavioral specification directory
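The description does not say which statistics back the Consistency and A/B Arena features. As an illustration only, the sketch below assumes stability is summarized as the standard deviation of repeated scores, and that head-to-head significance comes from a paired permutation test on per-prompt scores. Both function names are hypothetical, and the package may compute these differently.

```python
import random
import statistics
from typing import Sequence


def consistency(scores: Sequence[float]) -> float:
    """Population standard deviation of repeated scores for one prompt.

    0.0 means every repetition received the same score.
    """
    return statistics.pstdev(scores)


def paired_permutation_p(a: Sequence[float], b: Sequence[float],
                         rounds: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired permutation test on per-prompt score differences.

    a[i] and b[i] are the scores of model A and model B on the same prompt.
    Under the null hypothesis the sign of each difference is arbitrary, so
    signs are flipped at random and extreme totals are counted.
    """
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same prompts")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)
```

A small p-value suggests the score gap between the two models is unlikely to come from prompt-level noise alone. With only a handful of prompts the test has little power, which is one reason to raise `--count`.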

Release history

1.0.1: Jun 27, 2026
1.0.0: Jun 27, 2026 (this version)

Download files

Source Distribution

llm_behavioral_eval-1.0.0.tar.gz (19.6 kB)

Uploaded Jun 27, 2026 Source

Built Distribution

llm_behavioral_eval-1.0.0-py3-none-any.whl (25.9 kB)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file llm_behavioral_eval-1.0.0.tar.gz.

File metadata

Download URL: llm_behavioral_eval-1.0.0.tar.gz
Upload date: Jun 27, 2026
Size: 19.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for llm_behavioral_eval-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e0901829f873f354062c0993db942d217cce533054c5c526bef32626e050547a`
MD5	`71902d9eedb5f7807b0e42d3bf308ab6`
BLAKE2b-256	`c5df8c72c61720b3feea7def87ec28c7e5d86020a61be35e92b3e0f11aa7d0de`

File details

Details for the file llm_behavioral_eval-1.0.0-py3-none-any.whl.

File metadata

Download URL: llm_behavioral_eval-1.0.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 25.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for llm_behavioral_eval-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7502a04d8beedcc7b1eec023b7660e2b72ff408de947cdb5e67d8cb70ce56363`
MD5	`dbae64bb7e035f89b2f000592845b951`
BLAKE2b-256	`3cfaf8f2d7b03b34f7a1641e6c36733918e9dd2ce0d721785d75fed049820ce6`
