Evaluate LLMs against behavioral specifications (AGENTS.md, CLAUDE.md, custom rules)

llm-behavioral-eval

Spec-agnostic LLM evaluation engine. Measures how well any LLM follows behavioral specifications (AGENTS.md, CLAUDE.md, .cursorrules, etc.).

Quick Start

pip install llm-behavioral-eval

# Evaluate any spec directory
behavioral-eval --spec ./my-project --suite core_principles --count 20 --real-llm

# Full evaluation with LLM judge
behavioral-eval --spec ./dann-specs/project --suite all --real-llm --judge-provider deepseek

# Heuristic mode (no API cost for judge)
behavioral-eval --spec ./dann-specs/project --suite all --real-llm --no-judge

# A/B comparison between two models
behavioral-eval --spec ./dann-specs/project --arena llama-home ollama-home --count 30 --real-llm
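To run the commands above from a script (for example, one suite at a time), a small wrapper can assemble the argument list. This is a minimal sketch, not part of the package: `build_command` and `SUITES` are hypothetical names, only flags shown in the Quick Start are used, and combining `--count` with `--no-judge` in one call is an assumption that the CLI accepts these flags together.

```python
import shlex
import subprocess
from typing import List, Optional

# Suite names as listed in this README.
SUITES = ["core_principles", "rubric_dimensions", "roles", "variants", "concrete"]


def build_command(
    spec_dir: str,
    suite: str,
    count: Optional[int] = None,
    judge_provider: Optional[str] = None,
    no_judge: bool = False,
) -> List[str]:
    """Assemble a behavioral-eval argument list (hypothetical helper).

    Uses only flags that appear in the Quick Start examples.
    """
    cmd = ["behavioral-eval", "--spec", spec_dir, "--suite", suite]
    if count is not None:
        cmd += ["--count", str(count)]
    cmd.append("--real-llm")
    if no_judge:
        # Heuristic mode: no judge API cost.
        cmd.append("--no-judge")
    elif judge_provider:
        cmd += ["--judge-provider", judge_provider]
    return cmd


if __name__ == "__main__":
    for suite in SUITES:
        cmd = build_command("./my-project", suite, count=20, no_judge=True)
        print(shlex.join(cmd))
        # Uncomment to actually invoke the CLI (requires the package installed):
        # subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when the spec path contains spaces.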

Features

  • 5 test suites: core_principles, rubric_dimensions, roles, variants, concrete
  • LLM Judge: external LLM scores responses per-dimension (1-5) with justifications
  • Concrete verification: executable coding tasks with real assertion testing
  • Consistency: --repetitions N measures model stability
  • A/B Arena: compare two models head-to-head with statistical significance
  • Heatmaps: per-dimension score breakdowns
  • Spec-agnostic: evaluates any behavioral specification directory
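The consistency and A/B arena features in the list above rest on two simple ideas: the spread of scores across repeated runs, and a paired test on per-test scores from two models. The sketch below illustrates both under stated assumptions. It is not the package's implementation: the README does not say which stability measure or significance test is used, so the population standard deviation and the exact two-sided sign test here are stand-ins, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def consistency(scores: Sequence[float]) -> float:
    """Population standard deviation of repeated scores for one test.

    0.0 means the model scored identically on every repetition.
    """
    return statistics.pstdev(scores)


def sign_test_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Exact two-sided sign test on paired per-test scores.

    Ties are dropped. Under the null hypothesis each remaining pair is
    equally likely to favor either model.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired (same length)")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # all ties: no evidence either way
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the weaker side, doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Illustrative judge scores on the 1-5 scale, not real results.
    model_a = [5, 4, 5, 4, 5, 5, 4, 5]
    model_b = [3, 3, 4, 3, 4, 4, 3, 4]
    print("stability (stdev):", consistency([4, 4, 5]))
    print("p-value:", sign_test_p_value(model_a, model_b))
```

A sign test ignores the size of each score difference, which suits a coarse 1-5 judge scale; a paired permutation or Wilcoxon test would use the magnitudes as well.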

Download files

Source Distribution

llm_behavioral_eval-1.0.0.tar.gz (19.6 kB view details)

Uploaded Source

Built Distribution

llm_behavioral_eval-1.0.0-py3-none-any.whl (25.9 kB view details)

Uploaded Python 3

File details

Details for the file llm_behavioral_eval-1.0.0.tar.gz.

File metadata

  • Download URL: llm_behavioral_eval-1.0.0.tar.gz
  • Size: 19.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for llm_behavioral_eval-1.0.0.tar.gz
Algorithm Hash digest
SHA256 e0901829f873f354062c0993db942d217cce533054c5c526bef32626e050547a
MD5 71902d9eedb5f7807b0e42d3bf308ab6
BLAKE2b-256 c5df8c72c61720b3feea7def87ec28c7e5d86020a61be35e92b3e0f11aa7d0de

File details

Details for the file llm_behavioral_eval-1.0.0-py3-none-any.whl.

File hashes

Hashes for llm_behavioral_eval-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7502a04d8beedcc7b1eec023b7660e2b72ff408de947cdb5e67d8cb70ce56363
MD5 dbae64bb7e035f89b2f000592845b951
BLAKE2b-256 3cfaf8f2d7b03b34f7a1641e6c36733918e9dd2ce0d721785d75fed049820ce6
